SQL Server 2016 High Availability Unleashed
Paul Bertucci
4 Failover Clustering
5 SQL Server Clustering
6 SQL Server AlwaysOn and Availability Groups
7 SQL Server Database Snapshots
8 SQL Server Data Replication
9 SQL Server Log Shipping
10 High Availability Options in the Cloud
11 High Availability and Big Data Options
12 Hardware and OS Options for High Availability
13 Disaster Recovery and Business Continuity
14 Bringing HA Together
15 Upgrading Your Current Deployment to HA
16 High Availability and Security
17 Future Direction of High Availability
Index
Table of Contents
Introduction
4 Failover Clustering
Variations of Failover Clustering
How Clustering Works
Understanding WSFC
Extending WSFC with NLB
How WSFC Sets the Stage for SQL Server Clustering and AlwaysOn
Installing Failover Clustering
A SQL Clustering Configuration
An AlwaysOn Availability Group Configuration
Configuring SQL Server Database Disks
Summary
5 SQL Server Clustering
Installing SQL Server Clustering Within WSFC
Potential Problems to Watch Out for with SQL Server Failover Clustering
Multisite SQL Server Failover Clustering
Scenario 1: Application Service Provider with SQL Server Clustering
Summary
6 SQL Server AlwaysOn and Availability Groups
AlwaysOn and Availability Groups Use Cases
Windows Server Failover Clustering
AlwaysOn Failover Clustering Instances
AlwaysOn and Availability Groups
Combining Failover with Scale-out Options
Building a Multinode AlwaysOn Configuration
Verifying SQL Server Instances
Setting Up Failover Clustering
Preparing the Database
Enabling AlwaysOn HA
Backing Up the Database
Creating the Availability Group
Selecting the Databases for the Availability Group
Identifying the Primary and Secondary Replicas
Synchronizing the Data
Setting Up the Listener
Connecting Using the Listener
Failing Over to a Secondary
Dashboard and Monitoring
Scenario 3: Investment Portfolio Management with AlwaysOn and Availability Groups
Summary
7 SQL Server Database Snapshots
What Are Database Snapshots?
Copy-on-Write Technology
When to Use Database Snapshots
Reverting to a Snapshot for Recovery Purposes
Safeguarding a Database Prior to Making Mass Changes
Providing a Testing (or Quality Assurance) Starting Point (Baseline)
Providing a Point-in-Time Reporting Database
Providing a Highly Available and Offloaded Reporting Database from a Database Mirror
Setup and Breakdown of a Database Snapshot
Creating a Database Snapshot
Breaking Down a Database Snapshot
Reverting to a Database Snapshot for Recovery
Reverting a Source Database from a Database Snapshot
Using Database Snapshots with Testing and QA
Security for Database Snapshots
Snapshot Sparse File Size Management
Number of Database Snapshots per Source Database
Adding Database Mirroring for High Availability
What Is Database Mirroring?
When to Use Database Mirroring
Roles of the Database Mirroring Configuration
Playing Roles and Switching Roles
Database Mirroring Operating Modes
Setting Up and Configuring Database Mirroring
Getting Ready to Mirror a Database
Creating the Endpoints
Granting Permissions
Creating the Database on the Mirror Server
Identifying the Other Endpoints for Database Mirroring
Monitoring a Mirrored Database Environment
Removing Mirroring
Testing Failover from the Principal to the Mirror
Client Setup and Configuration for Database Mirroring
Setting Up DB Snapshots Against a Database Mirror
Reciprocal Principal/Mirror Reporting Configuration
Scenario 3: Investment Portfolio Management with DB Snapshots and DB Mirroring
Summary
8 SQL Server Data Replication
Data Replication for High Availability
Snapshot Replication
Transactional Replication
Merge Replication
What Is Data Replication?
The Publisher, Distributor, and Subscriber Metaphor
Publications and Articles
Filtering Articles
Replication Scenarios
Central Publisher
Central Publisher with a Remote Distributor
Subscriptions
Pull Subscriptions
Push Subscriptions
The Distribution Database
Replication Agents
The Snapshot Agent
The Log Reader Agent
The Distribution Agent
The Miscellaneous Agents
User Requirements Driving the Replication Design
Setting Up Replication
Enabling a Distributor
Publishing
Creating a Publication
Creating a Subscription
Switching Over to a Warm Standby (Subscriber)
Scenarios That Dictate Switching to the Warm Standby
Switching Over to a Warm Standby (the Subscriber)
Turning the Subscriber into a Publisher (if Needed)
Monitoring Replication
SQL Statements
SQL Server Management Studio
The Windows Performance Monitor and Replication
Backup and Recovery in a Replication Configuration
Scenario 2: Worldwide Sales and Marketing with Data Replication
Summary
9 SQL Server Log Shipping
Poor Man’s High Availability
Data Latency and Log Shipping
Design and Administration Implications of Log Shipping
Setting Up Log Shipping
Before Creating Log Shipping
Using the Database Log Shipping Task
When the Source Server Fails
Scenario 4: Call Before Digging with Log Shipping
Summary
10 High Availability Options in the Cloud
A High Availability Cloud Nightmare
HA Hybrid Approaches to Leveraging the Cloud
Extending Your Replication Topology to the Cloud
Extending Log Shipping to the Cloud for Additional HA
Creating a Stretch Database to the Cloud for Higher HA
Using AlwaysOn and Availability Groups to the Cloud
Using AlwaysOn and Availability Groups in the Cloud
Using Azure SQL Database for HA in the Cloud
Using Active Geo Replication
HA When Using Azure Big Data Options in the Cloud
Summary
11 High Availability and Big Data Options
Big Data Options for Azure
HDInsight
Machine Learning Web Service
Stream Analytics
Cognitive Services
Data Lake Analytics
Data Lake Store
Data Factory
Power BI Embedded
Microsoft Azure Data Lake Services
HDInsight Features
Using NoSQL Capabilities
Real-Time Processing
Spark for Interactive Analysis
R for Predictive Analysis and Machine Learning
Azure Data Lake Analytics
Azure Data Lake Store
High Availability of Azure Big Data
Data Redundancy
High Availability Services
How to Create a Highly Available HDInsight Cluster
Accessing Your Big Data
The Seven-Step Big Data Journey from Inception to Enterprise Scale
Other Things to Consider for Your Big Data Solution
Azure Big Data Use Cases
Use Case 1: Iterative Exploration
Use Case 2: Data Warehouse on Demand
Use Case 3: ETL Automation
Use Case 4: BI Integration
Use Case 5: Predictive Analysis
Summary
12 Hardware and OS Options for High Availability
Server HA Considerations
Failover Clustering
Networking Configuration
Clustered Virtual Machine Replication
Virtualization Wars
Backup Considerations
Integrated Hypervisor Replication
VM Snapshots
Disaster Recovery as a Service (DRaaS)
Summary
13 Disaster Recovery and Business Continuity
How to Approach Disaster Recovery
Disaster Recovery Patterns
Recovery Objectives
A Data-centric Approach to Disaster Recovery
Microsoft Options for Disaster Recovery
Data Replication
Log Shipping
Database Mirroring and Snapshots
Change Data Capture
AlwaysOn and Availability Groups
Azure and Active Geo Replication
The Overall Disaster Recovery Process
The Focus of Disaster Recovery
Planning and Executing Disaster Recovery
Have You Detached a Database Recently?
Third-Party Disaster Recovery Alternatives
Disaster Recovery as a Service (DRaaS)
Summary
14 Bringing HA Together
Foundation First
Assembling Your HA Assessment Team
Setting the HA Assessment Project Schedule/Timeline
Doing a Phase 0 High Availability Assessment
Step 1: Conducting the HA Assessment
Step 2: Gauging HA Primary Variables
High Availability Tasks Integrated into Your Development Life Cycle
Selecting an HA Solution
Determining Whether an HA Solution Is Cost-Effective
Summary
15 Upgrading Your Current Deployment to HA
Quantifying Your Current Deployment
Scenario 1 Original Environment List
Deciding What HA Solution You Will Upgrade To
Scenario 1 Target HA Environment List
Planning Your Upgrade
Doing Your Upgrade
Testing Your HA Configuration
Monitoring Your HA Health
Summary
16 High Availability and Security
The Security Big Picture
Using Object Permissions and Roles
Object Protection Using Schema-Bound Views
Ensuring Proper Security for HA Options
SQL Clustering Security Considerations
Log Shipping Security Considerations
Data Replication Security Considerations
Database Snapshots Security Considerations
AlwaysOn Availability Group Security Considerations
SQL Server Auditing
General Thoughts on Database Backup/Restore, Isolating SQL Roles, and Disaster Recovery Security Considerations
Summary
17 Future Direction of High Availability
High Availability as a Service (HAaaS)
100% Virtualization of Your Platforms
Being 100% in the Cloud
Advanced Geo Replication
Disaster Recovery as a Service?
Summary
Conclusion
Index
About the Author
Paul Bertucci is the founder of Data by Design (www.dataXdesign.com), a database
consulting firm with offices in the United States and Paris, France. He has more than
30 years of experience with database design, data modeling, data architecture, data
replication, performance and tuning, distributed data systems, big data/Hadoop, data
integration, high availability, disaster recovery/business continuity, master data
management/data quality, and system architectures for numerous Fortune 500
companies, including Intel, Coca-Cola, Symantec, Autodesk, Apple, Toshiba,
Lockheed, Wells Fargo, Merrill-Lynch, Safeway, Texaco, Charles Schwab, Wealth
Front, Pacific Gas and Electric, Dayton Hudson, Abbott Labs, Cisco Systems,
Sybase, and Honda, to name a few. He has written numerous articles, company and
international data standards, and high-profile courses such as “Performance and
Tuning” and “Physical Database Design” for Sybase and “Entity Relationship
Modeling” courses for Chen & Associates (Dr. Peter P. Chen). Other Sams books
that he has authored include the highly popular Microsoft SQL Server Unleashed
series (SQL Server 2000, 2005, 2008 R2, 2012, and 2014), ADO.NET in 24 Hours,
and Microsoft SQL Server High Availability.
He has deployed numerous traditional database systems with MS SQL Server,
Sybase, DB2, and Oracle database engines, big data databases with Hadoop, and
non-SQL databases (value pair) such as Oracle’s NoSQL and Cassandra NoSQL. He
has designed/architected several commercially available tools in the database, data
modeling, performance and tuning, data integrity, data integration, and
multidimensional planning spaces.
Paul is also an experienced leader of global enterprise architecture teams for multi-
billion-dollar companies and has led global teams in data warehousing/BI, big data,
master data management, identity management, enterprise application integration,
and collaboration systems. He has held positions such as chief data architect for
Symantec, chief architect and director of Shared Services for Autodesk, CTO for
Diginome, and CTO for both LISI and PointCare. Paul speaks regularly at many
conferences and gatherings worldwide, such as SQL Saturdays, Ignite, TechEd,
MDM Summit, Oracle World, Informatica World, SRII, MIT Chief Data Officer
symposium, and many others.
Paul received his formal education in computer science and electrical engineering
from UC Berkeley (Go, Bears!). He lives in the beautiful Pacific Northwest (Oregon)
with the three children who still live at home (Donny, Juliana, and Nina), near the other two, “working” adult children, Marissa and Paul Jr., who live in Portland.
Paul can be reached at [email protected].
Contributing Author
Raju Shreewastava is a leading expert in data warehousing, business intelligence,
and big data for numerous companies around the globe. He is based out of Silicon
Valley, supporting several high-profile big data implementations. He previously led
the data warehouse/business intelligence and big data teams while working for Paul
at Autodesk. His big data and Azure contributions of content and examples represent
the bulk of Chapter 11, “High Availability and Big Data Options.” He has more than
20 years of experience with database design, data integration, and deployments. Raju
can be reached at [email protected].
Dedication
Successes are hard to achieve without hard work, fortitude, support, inspiration,
and guidance. The daily examples of how to succeed I owe to my parents, Donald
and Jane Bertucci, and my inspiration comes to me easily from wanting to be
the best father I can be for my children and helping them become successful in their
own lives. But I find even greater inspiration and amazing support from my loving
life partner, Michelle, to whom I dedicate this book. Infinity!
Acknowledgments
All my writing efforts require a huge sacrifice of time to properly research,
demonstrate, and describe leading-edge subject matter. The brunt of the burden
usually falls on those many people who are near and very dear to me. With this in
mind, I desperately need to thank my family for allowing me to encroach on many
months of what should have been my family’s “quality time.”
However, with sacrifice also comes reward, in this case in the form of technical
excellence and solid business relationships. Many individuals were involved in this
effort, both directly and indirectly. Thanks to my technology leaders network, Yves
Moison, Jose Solera, Anthony Vanlandingham, Jack McElreath, Paul Broenen, Jeff
Brzycki, Walter Kuketz, Steve Luk, Bert Haberland, Peter P. Chen, Gary Dunn,
Martin Sommer, Raju Shreewastava, Mark Ginnebaugh, Christy Foulger, Suzanne
Finley, and G. “Morgan” Watkins.
Thanks also for the technology environment, setup, and testing help from Ryan
McCarty and for the big data and Azure content and examples that form the bulk of
Chapter 11, “High Availability and Big Data Options,” from Raju Shreewastava.
Thanks, guys!
Many good suggestions and comments came from the technical and copy editors at
Pearson, yielding an outstanding effort.
We Want to Hear from You!
As the reader of this book, you are our most important critic and commentator. We
value your opinion and want to know what we’re doing right, what we could do
better, what areas you’d like to see us publish in, and any other words of wisdom
you’re willing to pass our way.
We welcome your comments. You can email or write to let us know what you did or
didn’t like about this book—as well as what we can do to make our books better.
Please note that we cannot help you with technical problems related to the topic of
this book.
When you write, please be sure to include this book’s title and author as well as your
name and email address. We will carefully review your comments and share them
with the author and editors who worked on the book.
Email: [email protected]
Mail: Sams Publishing
ATTN: Reader Feedback
800 East 96th Street
Indianapolis
IN 46240 USA
Reader Services
Register your copy of SQL Server 2016 High Availability Unleashed at
www.informit.com for convenient access to downloads, updates, and corrections as
they become available. To start the registration process, go to
www.informit.com/register and log in or create an account*. Enter the product ISBN
9780672337765 and click Submit. When the process is complete, you will find any
available bonus content under Registered Products.
*Be sure to check the box that you would like to hear from us to receive exclusive
discounts on future editions of this product.
Introduction
“Always on, always ready is not just a business goal but a competitive
requirement for any company that wants to compete in the cloud space. Highly
available technologies—deployed in the right architectures—allow for nonstop
delivery of value to your customers.”
—Jeff Brzycki, Chief Information Officer, Autodesk, March 2017
Five 9s
Downtime (system unavailability) directly translates to loss of profit, productivity,
your ability to deliver to your customers, and customer goodwill—plain and simple.
If your current or planned applications are vulnerable to downtime problems—or if
you are unsure of the potential downtime issues—then this book is for you. Is your
business at or nearing a requirement to be “highly available” or “continually
available” in order to protect the previously mentioned profit, productivity, and
customer goodwill? Again, this book is for you.
Helping you understand the high availability (HA) solutions available to you and
choosing the high availability approach that maximizes benefit and minimizes cost is
our primary goal. This book provides a roadmap to design and implement these high
availability solutions. The good news is that software and hardware vendors in
general, and Microsoft specifically, have come a long way in supporting high
availability needs and will move even further toward achieving 99.999% availability
(herein referred to as “five 9s”) in the near future. A 24×7 application that aspires to
achieve five 9s would tolerate only a yearly total of 5.26 minutes of downtime.
Knowing how to design for such high availability is crucial.
This book even touches on some alternatives for “always available” systems (100%
availability). These capabilities, coupled with a formal methodology for designing
high availability solutions, will allow you to design, install, and maintain systems to
maximize availability while minimizing development and platform costs.
The success or failure of your company may well be influenced, if not driven, by
your ability to understand the essential elements that comprise a high availability
environment, the business requirements driving the proper high availability
approach, and the cost considerations affecting the ROI (return on investment) of a
high availability solution. It is likely that a company’s most critical applications
demand some type of high availability solution. For example, if a global online
ordering system went down and remained down for any length of time, millions of
dollars would be lost, along with the public’s goodwill toward that company. The
stakes are high indeed!
This book outlines how you can “design in” high availability for new applications
and “upgrade” current applications to improve availability. In all cases, a crucial
consideration will be the business drivers influencing a proposed application’s
uptime requirements, factoring in the dollar cost, productivity cost, and the goodwill
cost of not having that system available to the end users for any period of time.
This book highlights current Microsoft capabilities and options that allow you to
achieve high availability systems. These include, among others, Microsoft Cluster
Services, Microsoft SQL Server 2016 SQL Clustering, SQL Data Replication, Log
Shipping, Database Mirroring/Snapshots, AlwaysOn Availability Groups, and built-
in architectures on Azure for Big Data and Azure SQL.
Most importantly, this book presents a set of business scenarios that reflect actual
companies’ high availability requirements. These business scenarios guide you
through the design process, show you how to determine the high availability
approach best suited for a particular business scenario, and help specify a roadmap to
implement the business scenario with a specific technical solution.
This book may feel more like a cookbook or a Google Maps route suggestion than a
typical technical manual—and that is the intention. It is one thing to describe
technical syntax, but it is much more important to actually explain why you choose a
particular approach to meet a particular business or application requirement. This
book focuses on the latter. The business scenarios introduced and implemented in this
book come from live customer implementations. It does not reveal the names of
these customers for obvious nondisclosure reasons. However, these business
scenarios should allow you to correlate your own business requirements to these high
availability situations. This book also includes examples using the infamous
AdventureWorks database provided by Microsoft. Utilizing the AdventureWorks
database will allow you to replicate some of the solutions quickly and easily in your
own sandbox.
Several tools, scripts, documents, and references to help you jump-start your next
high availability implementation are available at the book’s website at
www.informit.com/title/9780672337765.
IN THIS CHAPTER
Overview of High Availability
Calculating Availability
Availability Variables
General Design Approach for Achieving High Availability
Development Methodology with High Availability Built In
High Availability Business Scenarios (Applications)
Microsoft Technologies That Yield High Availability
Knowing clearly what essential elements comprise a high availability environment
and completely understanding the business requirements that are driving you to think
about high availability solutions may well determine the success or failure of your
company. More often than not, a company’s most critical application demands some
type of high availability solution. Having high availability is often termed having
“resilience”—the ability to recover or fail over quickly. In today’s competitive
marketplace, if a global online ordering system goes down (that is, is unavailable for
any reason) and remains down for any length of time, millions of dollars may be lost,
along with the public’s goodwill toward that company. Profit margins are thin
enough, without having to add your system’s downtime into the equation of whether
your company makes a profit. The impact of unplanned or even planned downtime
may be much greater than you realize.
Calculating Availability
Calculating what the availability of a system has been (or needs to be) is actually
quite simple. You simply subtract the “time unavailable” from the “mean time
between unavailability” and then divide this by the same “mean time between
unavailability.” This is the formula:
Availability percentage = ((MBU − TU) / MBU) × 100
where:
MBU is mean time between unavailability
TU is time unavailable (planned/unplanned downtime)
It’s important to use a common time factor as the basis of this calculation (for
example, minutes). The “time unavailable” is the actual downtime if you’re calculating what has already happened, and it is the estimated downtime if you’re doing this for the future.
In addition, it is here that you add in all unplanned and planned downtime. The
“mean time between unavailability” is the time since the last unavailability occurred.
Note
For a system that needs to be up 24 hours per day, 7 days a week, 365
days per year, you would measure against 100% of the minutes in the
year. For a system that is only supposed to be available 18 hours per
day, 7 days a week, you would measure against 75% of the minutes in
the year. In other words, you measure your availability against the
planned hours of operation, not the total number of minutes in a year
(unless your planned hours of operation are 24×7×365).
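To make the formula concrete, here is a minimal sketch of the calculation in T-SQL. The numbers are hypothetical: a 24×7×365 system (525,600 planned minutes of operation for the year) that suffered a combined 400 minutes of planned and unplanned downtime.

DECLARE @MBU decimal(18,2) = 525600.0;  -- planned minutes of operation for the year (24x7x365)
DECLARE @TU  decimal(18,2) = 400.0;     -- minutes unavailable (planned + unplanned downtime)
-- Availability percentage = ((MBU - TU) / MBU) x 100
SELECT CAST(((@MBU - @TU) / @MBU) * 100.0 AS decimal(8,4)) AS AvailabilityPercentage;
-- Returns 99.9239 for these sample numbers

For an 18-hours-per-day system, you would simply start with 75% of the year's minutes as the planned time of operation, as described in the note above.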
Availability Variables
The following are the primary variables that help you determine what high
availability path you should be going down:
Uptime requirement—This is the goal (from 0% to 100%) of what you require
from your application for its planned hours of operation. This is above 95% for a
typical highly available application.
Time to recover—This is a general indication (from long to short) of the
amount of time required to recover an application and put it back online. This
could be stated in minutes, hours, or just in terms of long, medium, or short
amount of time to recover. The more precise the better, though. For example, a
typical time to recover for an OLTP (online transaction processing) application
might be 5 minutes. This is fairly short but can be achieved with various
techniques.
Tolerance of recovery time—You need to describe what the impact might be
(from high to low tolerance) of extended recovery times needed to resynchronize
data, restore transactions, and so on. This is mostly tied to the time-to-recover
variable but can vary widely, depending on who the end users of the system are.
For example, internal company users of a self-service HR application may have
a high tolerance for downtime (because the application doesn’t affect their
primary work). However, the same end users might have a very low tolerance
for downtime of the conference room scheduling/meeting system.
Data resiliency—You need to describe how much data you are willing to lose
and whether it needs to be kept intact (that is, have complete data integrity, even
in failure). This is often described in terms of low to high data resiliency. Both
hardware and software solutions are in play for this variable—mirrored disk,
RAID levels, database backup/recovery options, and so on.
Application resiliency—You need an application-oriented description of the
behavior you are seeking (from low to high application resiliency). In other
words, should your applications (programs) be able to be restarted, switched to
other machines without the end user having to reconnect, and so on? Very often
the term application clustering is used to describe applications that have been
written and designed to fail over to another machine without the end users
realizing they have been switched. The .NET default of using “optimistic
concurrency” combined with SQL clustering often yields this type of end-user
experience very easily.
Degree of distributed access/synchronization—For systems that are
geographically distributed or partitioned (as are many global applications), it is
critical to understand how distributed and tightly coupled they must be at all
times (indicated from low to high degree of distributed access and
synchronization required). A low specification of this variable indicates that the
application and data are very loosely coupled and can stand on their own for
periods of time. Then they can be resynchronized at a later date.
Scheduled maintenance frequency—This is an indication of the anticipated (or
current) rate of scheduled maintenance required for the box, OS, network,
application software, and other components in the system stack. This may vary
greatly, from often to never. Some applications may undergo upgrades, point
releases, or patches very frequently (for example, SAP and Oracle applications).
Performance/scalability—This is a firm requirement of the overall system
performance and scalability needed for the application (from low- to high-
performance need). This variable will drive many of the high availability
solutions that you end up with because high-performance systems often sacrifice
many of the other variables mentioned here (such as data resilience).
Cost of downtime ($ lost/hour)—You need to estimate or calculate the dollar
(or euro, yen, and so forth) cost for every minute of downtime (from low to high
cost). You will usually find that the cost is not a single number, like an average
cost per minute. In reality, short downtimes have lower costs, and the costs
(losses) grow exponentially for longer downtimes. In addition, I usually try to
measure the “goodwill” cost (or loss) for B2C type of applications. So, this
variable might have a subvariable for you to specify.
Cost to build and maintain the high availability solution ($)—This last
variable may not be known initially. However, as you near the design and
implementation of a high availability system, the costs come barreling in rapidly
and often trump certain decisions (such as throwing out that RAID 10 idea due
to the excessive cost of a large number of mirrored disks). This variable is also
used in the cost justification of the high availability solution, so it must be
specified or estimated as early as possible.
As you can see in Figure 1.8, you can think of each of these variables as an oil gauge
or a temperature gauge. In your own early depiction of your high availability
requirements, simply place an arrow along the gauge of each variable to estimate the
approximate “temperature,” or level, of a particular variable. As you can see, I have
specified all the variables of a system that will fall directly into being highly
available. This one is fairly tame, as highly available systems go, because there is a
high tolerance for recovery time, and application resilience is moderately low. Later
in this chapter, I describe four business scenarios, each of them including a full
specification of these primary variables. In addition, starting in Chapter 3, an ROI
calculation is included to provide the full cost justification of a particular HA
solution.
FIGURE 1.8 Primary variables for understanding your availability needs.
Note
This book discusses something called “application clustering” in
concept and not in any technical detail, since it is programming oriented
and really would require a complete programming book to give it the
proper treatment.
Summary
This chapter discusses several primary variables of availability that should help
capture your high availability requirements cleanly and precisely. These variables
include a basic uptime requirement, time to recovery, tolerance of recovery, data
resiliency, application resiliency, performance and scalability, and the costs of
downtime (loss). You can couple this information with your hardware/software
configurations, several Microsoft-based technology offerings, and your allowable
upgrade budget to fairly easily determine exactly which high availability solution
will best support your availability needs. A general one-two punch approach of
establishing a proper high availability foundation in your environment should be
done as soon as possible to at least get your “at risk” applications out of that tenuous
state. Once this is complete, you can serve the knock-out punch that fully matches
the proper high availability solution to all your critical applications—and get it right
the first time.
The following sections delve into the critical needs surrounding disaster recovery
and business continuity. Many of the SQL Server options covered in this book also
lend themselves to disaster recovery configurations with minimal time and data loss
(see Chapter 13, “Disaster Recovery and Business Continuity”). But first, let’s move
on to a more complete discussion of the Microsoft high availability capabilities in
Chapter 2, “Microsoft High Availability Options.”
CHAPTER 2. Microsoft High Availability
Options
IN THIS CHAPTER
Getting Started with High Availability
Microsoft Options for Building an HA Solution
Understanding your high availability requirements is only the first step in
implementing a successful high availability solution. Knowing what available
technical options exist is equally important. Then, by following a few basic design
guidelines, you can match your requirements to a suitable high availability technical
solution. This chapter introduces you to the fundamental HA options, such as RAID
disk arrays, redundant network connectors (NICs), and Windows Server Failover
Clustering (WSFC), as well as other more high-level options, such as AlwaysOn
availability groups, SQL clustering, SQL Server Data Replication, and some
Microsoft Azure and SQL Azure options that can help you create a solid high
availability foundation.
Note
In the past 10 or 15 years, I’ve put in place countless SLAs, and I’ve
never lost a job by doing this. However, I know of people who didn’t
bother to put these types of agreements in place and did lose their jobs.
Implementing SLAs provides a good insurance policy.
Note
Building your systems/servers with RAID 1, RAID 5, or RAID 1+0 (as appropriate) is critical to achieving a highly available system along with a high-
performing system. RAID 5 is better suited for read-only applications
that need fault tolerance and high availability, while RAID 1 and RAID
1+0 are better suited for OLTP or moderately high-volatility
applications. RAID 0 by itself can help boost performance for any data
allocations that don’t need the fault tolerance of the other RAID
configurations but need to be high performing.
Note
Prior to Windows Server 2008 R2, clustering was done with Microsoft
Cluster Services (MSCS). If you are running an older OS version, refer
to SQL Server 2008 R2 Unleashed to see how to set up SQL clustering
on this older operating system.
Clusters use an algorithm to detect a failure, and they use failover policies to
determine how to handle the work from a failed server. These policies also specify
how a server is to be restored to the cluster when it becomes available again.
Although clustering doesn’t guarantee continuous operation, it does provide
availability sufficient for most mission-critical applications and is a building block in
numerous high-availability solutions. WSFC can monitor applications and resources
to automatically recognize and recover from many failure conditions. This capability
provides great flexibility in managing the workload within a cluster, and it improves
the overall availability of the system. Technologies that are cluster aware—such as
SQL Server, Microsoft Message Queuing (MSMQ), Distributed Transaction
Coordinator (DTC), and file shares—have already been programmed to work within
(that is, under the control of) WSFC.
WSFC still has some hardware and software compatibility to worry about, but it now
includes the Cluster Validation Wizard, which helps you see whether your
configuration will work. You can also still refer to Microsoft’s support site for server
clustering (see http://support.microsoft.com/kb/309395). In addition, SQL Server
Failover Cluster Instances (FCI) is not supported where the cluster nodes are also
domain controllers.
Let’s look a little more closely at a two-node active/passive cluster configuration. At
regular intervals, known as time slices, the failover cluster nodes look to see if they
are still alive. If the active node is determined to be failed (not functioning), a
failover is initiated, and another node in the cluster takes over for the failed node.
Each physical server (node) uses separate network adapters for its own network
connection. (Therefore, there is always at least one network communication
capability working for the cluster at all times, as shown in Figure 2.12.)
SQL Clustering
If you want a SQL Server instance to be clustered for high availability, you are
essentially asking that this SQL Server instance (and the database) be completely
resilient to a server failure and completely available to the application without the
end user ever even noticing that there was a failure. Microsoft provides this
capability through the SQL clustering option in SQL Server 2016. SQL clustering
builds on top of WSFC for its underlying detection of a failed server and for its availability of the databases on the shared disk (which is controlled by WSFC). SQL
Server is a cluster-aware/cluster-enabled technology. You can create a “virtual” SQL
Server that is known to the application (the constant in the equation) and two
physical SQL Servers that share one set of databases. Only one SQL Server is active
at a time, and it just goes along and does its work. If that server fails (and with it the
physical SQL Server instance), the passive server (and the physical SQL Server
instance on that server) takes over instantaneously. This is possible because cluster
services also controls the shared disk where the databases are. The end user (and
application) pretty much never know which physical SQL Server instance they are
on or whether one failed. Figure 2.13 illustrates a typical SQL clustering
configuration that is built on top of WSFC.
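For example, an application always connects using the virtual SQL Server name, but you can ask the instance itself whether it is clustered and which physical node is currently hosting it. The following short T-SQL check is a sketch (the virtual name your application uses is whatever you defined during setup):

-- Run against the virtual SQL Server name the application connects to
SELECT SERVERPROPERTY('IsClustered')                 AS IsClustered,          -- 1 for a failover cluster instance
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentPhysicalNode,  -- node currently hosting the instance
       @@SERVERNAME                                  AS VirtualServerName;    -- the constant name the application sees

Run before and after a failover, the physical node name changes while the virtual server name stays the same, which is exactly the transparency this configuration is designed to provide.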
Log Shipping
Another, more direct method of creating a completely redundant database image is
Log Shipping. Microsoft certifies Log Shipping as a method of creating an “almost
hot” spare. Some folks even use log shipping as an alternative to data replication (it
has been referred to as “the poor man’s data replication”). Keep in mind that log
shipping does three primary things:
Makes an exact image copy of a database on one server from a database dump
Creates a copy of that database on one or more other servers from that dump
Continuously applies transaction log dumps from the original database to the
copy
In other words, log shipping effectively replicates the data of one server to one or
more other servers via transaction log dumps. Figure 2.17 shows a source/destination
SQL Server pair that has been configured for log shipping.
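The three steps amount to a plain backup/restore sequence. The following minimal T-SQL sketch uses assumed names and paths (the AdventureWorks database and a \\standby\logship file share); a real configuration would normally be driven by the SQL Server Agent jobs that the log shipping wizard creates for you.

-- On the primary: full database dump to initialize the standby
BACKUP DATABASE AdventureWorks TO DISK = N'\\standby\logship\AdventureWorks_full.bak';

-- On the standby: restore it, leaving the database able to accept further log restores
RESTORE DATABASE AdventureWorks
  FROM DISK = N'\\standby\logship\AdventureWorks_full.bak'
  WITH STANDBY = N'\\standby\logship\AdventureWorks_undo.dat';

-- Repeated on a schedule: dump the log on the primary, then apply it on the standby
BACKUP LOG AdventureWorks TO DISK = N'\\standby\logship\AdventureWorks_log_001.trn';
RESTORE LOG AdventureWorks
  FROM DISK = N'\\standby\logship\AdventureWorks_log_001.trn'
  WITH STANDBY = N'\\standby\logship\AdventureWorks_undo.dat';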
Database Snapshots
In SQL Server 2016, Database Snapshots is still supported and can be combined with
Database Mirroring (which will be deprecated in the next release of SQL Server) to
offload reporting or other read-only access to a secondary location, thus enhancing
both performance and availability of the primary database. In addition, as shown in
Figure 2.18, database snapshots can aid high availability by utilizing features that
restore a database back to a point in time rapidly if things like mass updates are
applied to the primary database but the subsequent results are not acceptable. This
can have a huge impact on recovery time objectives and can directly affect HA.
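As a quick illustration, creating a snapshot and later reverting to it are each a single statement. The file path below is an assumption, and the logical data file name (NAME) must match that of your source database:

-- Create a snapshot of the source database before a mass update
CREATE DATABASE AdventureWorks_Snapshot
  ON (NAME = AdventureWorks_Data,
      FILENAME = N'D:\Snapshots\AdventureWorks_Snapshot.ss')
  AS SNAPSHOT OF AdventureWorks;

-- If the results of the mass update are not acceptable, revert the source to the snapshot
RESTORE DATABASE AdventureWorks
  FROM DATABASE_SNAPSHOT = 'AdventureWorks_Snapshot';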
Figure 2.21 Azure SQL Database primary and secondary configuration for HA.
A more advanced configuration with much less data loss potential is shown in Figure
2.21. This configuration creates a direct secondary replica of your SQL database in
another geo region (via a replication channel and geo-replication). You can fail over
to this secondary replica in case the primary SQL database (in one region) is lost.
This is done without having to worry about managing availability groups and other
HA nuances (in the case where you have rolled your own high availability solution).
There are, however, a number of downsides to this approach, such as manual client
string changes, some data loss, and other critical high availability restrictions. (We
discuss the pros and cons of using this configuration in Chapter 10, “High
Availability Options in the Cloud.”)
Application Clustering
As described earlier, high availability of an application is across all of its technology
tiers (web, application, database, infrastructure). Many techniques and options exist
at each tier. One major approach has long been to create scalability and availability
at the application server tier. Large applications such as SAP's ERP offering have
implemented application clustering to guarantee that the end user is always able to
be served by the application, regardless of whether data is needed from the data tier.
If one application server fails or becomes oversaturated from too much traffic,
another application server is used to pick up the slack and service the end user
without missing a beat. This is termed application clustering. Often even the state of
the logical work transaction can be recovered and handed off to the receiving
clustered application server. Figure 2.22 shows a typical multitiered application with
application clustering and application load balancing which ensures that the end user
(regardless of what client tier device or endpoint the user accesses the application
from) is serviced at all times.
IN THIS CHAPTER
A Four-Step Process for Moving Toward High Availability
Step 1: Launching a Phase 0 HA Assessment
Step 2: Gauging HA Primary Variables
Step 3: Determining the Optimal HA Solution
Step 4: Justifying the Cost of a Selected High Availability Solution
Chapters 1, “Understanding High Availability,” and 2, “Microsoft High Availability
Options,” describe most of the essential elements that need to be defined in order to
properly assess an application’s likelihood of being built using a high availability
configuration of some kind. This chapter describes a rigorous process you can step
through to determine exactly what HA solution is right for you. It begins with a
Phase 0 high availability assessment. Formally conducting a Phase 0 HA assessment
ensures that you consider the primary questions that need to be answered before you
go off and try to throw some type of HA solution at your application. The four-step process described in this chapter then helps you determine which solution is best for your situation.
Note
This chapter describes a fairly simple and straightforward method of
calculating the ROI when deploying a specific high availability solution.
Your calculations will vary because ROI is extremely unique for any
company. However, in general, ROI can be calculated by adding up the
incremental costs of the new HA solution and comparing them against
the complete cost of downtime for a period of time (I suggest using a 1-
year time period). This ROI calculation includes the following:
Maintenance cost (for a 1-year period):
+ system admin personnel cost (additional time for training of these
personnel)
+ software licensing cost (of additional HA components)
Hardware cost:
+ hardware cost (of additional HW in the new HA solution)
Deployment/assessment cost:
+ deployment cost (develop, test, QA, production implementation of the
solution)
+ HA assessment cost (be bold and go ahead and throw the cost of the
assessment into this to be a complete ROI calculation)
Downtime cost (for a 1-year period):
If you kept track of last year’s downtime record, use that number;
otherwise, produce an estimate of planned and unplanned downtime for
this calculation.
+ Planned downtime hours × cost of hourly downtime to the company
(revenue loss/productivity loss/goodwill loss [optional])
+ Unplanned downtime hours × cost of hourly downtime to the
company (revenue loss/productivity loss/goodwill loss [optional])
If the HA costs (above) are more than the downtime costs for 1 year,
then extend it out another year, and then another until you can determine
how long it will take to get the ROI.
In reality, most companies will have achieved the ROI within the first 6 to 9 months.
Note
If this is a new application, skip Task 1!
Note
If this is a new application, Task 3 is to create an estimate of the future
month, quarter, 6-month, and 1-year intervals.
Note
If this is a new application, Task 4 is to create an estimate of the future
month, quarter, 6-month, and 1-year intervals.
Note
If this is a new application, Task 5 is to create an estimate of the future
month, quarter, 6-month, and 1-year intervals.
Task 6—Calculate the loss of downtime. This involves the following points:
Revenue loss (per hour of unavailability)—For example, in an online order
entry system, look at any peak order entry hour and calculate the total order
amounts for that peak hour. This will be your revenue loss per hour value.
Productivity dollar loss (per hour of unavailability)—For example, in an
internal financial data warehouse that is used for executive decision support,
calculate the length of time that this data mart/warehouse was not available
within the past month or two and multiply this by the number of
executives/managers who were supposed to be querying it during that period.
This is the “productivity effect.” Multiply this by the average salary of these
execs/managers to get a rough estimate of productivity dollar loss. This does
not consider the bad business decisions they might have made without having
their data mart/warehouse available and the dollar loss of those bad business
decisions. Calculating a productivity dollar loss might be a bit aggressive for
this assessment, but there needs to be something to measure against and to
help justify the return on investment. For applications that are not productivity
applications, this value will not be calculated.
Goodwill dollar loss (in terms of customers lost per hour of
unavailability)—It’s extremely important to include this component.
Goodwill loss can be measured by taking the average number of customers for
a period of time (such as last month’s online order customer average) and
comparing it with a period of processing following a system failure (where
there was a significant amount of downtime). Chances are that there was a
drop-off of the same amount that can be rationalized as goodwill loss (that is,
the online customers didn’t come back to you; they went to the competition).
You must then take that percentage drop-off (for example, 2%) and multiply it
by the peak order amount averages for the defined period. This period loss
number is like a repeating loss overhead value that should be included in the
ROI calculation for every month.
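To make these loss components tangible, the following T-SQL sketch plugs purely hypothetical numbers into the three calculations just described:

-- Hypothetical inputs for one hour of unavailability
DECLARE @PeakHourOrders     money        = 50000.00;    -- revenue loss: peak-hour order total
DECLARE @AffectedUsers      int          = 40;          -- productivity loss: execs/managers unable to work
DECLARE @AvgHourlySalary    money        = 75.00;
DECLARE @GoodwillDropOffPct decimal(5,4) = 0.02;        -- 2% of customers who never came back
DECLARE @PeriodOrderAverage money        = 2000000.00;  -- peak order average for the defined period

SELECT @PeakHourOrders                           AS RevenueLossPerHour,
       @AffectedUsers * @AvgHourlySalary         AS ProductivityLossPerHour,
       @GoodwillDropOffPct * @PeriodOrderAverage AS GoodwillLossForPeriod;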
Note
If this is a new application, Task 6 is to create an estimate of the losses.
The loss of downtime might be difficult to calculate but will help in any
justification process for purchase of HA-enabling products and in the
measurement of ROI.
Once you have completed these tasks, you are ready to move on to step 2: gauging
the HA primary variables.
Step 2: Gauging HA Primary Variables
It is now time to properly place your arrows (as relative value indications) on the
primary variable gauges (refer to Figure 1.8). You should place the assessment arrow
on each of the 10 variables as accurately as possible and as rigorously as possible.
Each variable continuum should be evenly divided into a scale of some kind, and an
exact value should be determined or calculated to help place the arrow. For example,
the cost of the downtime (per hour) variable could be a scale from $0/hr at the
bottom (left) to $500,000/hr at the top (right) for Company X. The $500,000/hr top
scale value would represent what might have been the peak order amounts ever taken
in by the online order entry system for Company X and thus would represent the
known dollar amount being lost for this period. Remember that everything is relative
to the other systems in your company and to the perceived value of each of these
variables. In other words, some companies won’t place much value on the end-user
tolerance of downtime variable if the application is for internal employees. So, adjust
accordingly.
For each of the primary variable gauges, you need to follow these steps:
1. Assign relative values to each primary variable (based on your company’s
characteristics).
2. Place an arrow on the perceived (or estimated) point in each gauge that best
reflects the system being assessed.
As another example, let’s look at the first HA primary variable, total uptime
requirement percentage. If you are assessing an ATM system, the placement of the
assessment arrow would be at a percentage of 98% or higher. Remember that five 9s
means a 99.999% uptime percentage. Also remember that this uptime requirement is
for the “planned” time of operation, not the total time in a day (or a year)—except, of
course, if the system is a 24×7×365 system. Very often the service level agreement
that is defined for this application will spell out the uptime percentage requirement.
Figure 3.1 shows the placement of the assessment arrow at 99.999% for the ATM
application example—at the top edge of extreme availability (at least from an uptime
percentage point of view).
Figure 3.1 HA primary variables gauge—ATM uptime percentage example.
Note
An important factor that may come into play is your timeline for getting
an application to become highly available. If the timeline is very short,
your solution may exclude costs as a barrier and may not even consider
hardware solutions that take months to order and install. In this case, it
would be appropriate to expand the primary variables gauge to include this
question (and any others) as well. This particular question could be
“What is the timeline for making your application highly available?”
However, this book assumes that you have a reasonable amount of time
to properly assess your application.
It is also assumed that if you have written (or are planning to write) an application
that will be cluster aware, you can leverage WSFC. This would be considered to be
an implementation of application clustering (that is, an application that is cluster
aware). As mentioned earlier, SQL Server is a cluster-aware program. However, I
don’t consider it to be application clustering in the strictest sense; it is, rather,
database clustering.
Note
A bit later in this chapter, you will work though a complete ROI
calculation so that you can fully understand where these values come
from.
Figure 3.8 Decision tree for the ASP, question 10 and HA solution.
10. What is the estimated cost of a possible high availability solution? What is the
budget?
Response: C: $100k <= cost < $250k—This is a moderate amount of cost for
potentially a huge amount of benefit. These estimates are involved:
Five new multi-core servers with 64GB RAM at $30k per server
Five Microsoft Windows Server 2012 licenses
Five shared SCSI disk systems with RAID 10 (50 drives)
Five days of additional training costs for personnel
Five SQL Server Enterprise Edition licenses
Note
A full decision-tree explosion (complete HA decision tree) that has all
questions and paths defined is available in Microsoft Excel form on the
book’s companion website at www.informit.com/title/9780672337765.
In addition, a blank Nassi-Shneiderman chart and an HA primary
variables gauge are also available at this site in a single PowerPoint
document.
ROI Calculation
As stated earlier, ROI can be calculated by adding up the incremental costs (or
estimates) of the new HA solution and comparing them against the complete cost of
downtime for a period of time (such as 1 year). This section uses the ASP business
from Scenario 1 as the basis for a ROI calculation. Recall that for this scenario, the
costs are estimated to be between $100k and $250k and include the following:
Five new multi-core servers with 64GB RAM at $30k per server
Five Microsoft Windows Server 2012 licenses
Five shared SCSI disk systems with RAID 10 (50 drives)
Five days of additional training costs for personnel
Five SQL Server Enterprise Edition licenses
These are the incremental costs:
Maintenance cost (for a 1-year period):
$20k (estimate)—System admin personnel cost (additional time for training
of these personnel)
$35k (estimate)—Software licensing cost (of additional HA components)
Hardware cost:
$100k hardware cost—The cost of additional HW in the new HA solution
Deployment/assessment cost:
$20k deployment cost—The costs for development, testing, QA, and
production implementation of the solution
$10k HA assessment cost—Be bold and go ahead and throw the cost of the
assessment into this estimate to get a complete ROI calculation
Downtime cost (for a 1-year period):
If you kept track of last year’s downtime record, use that number; otherwise,
produce an estimate of planned and unplanned downtime for this calculation.
For this scenario, the estimated cost of downtime is $15k/hour.
Planned downtime cost (revenue loss cost) = Planned downtime hours × cost
of hourly downtime to the company:
a. 0.25% × 8,760 hours in a year = 21.9 hours of planned downtime
b. 21.9 hours × $15k/hr = $328,500/year cost of planned downtime.
Unplanned downtime cost (revenue loss cost) = Unplanned downtime hours ×
cost of hourly downtime to the company:
a. 0.25% × 8,760 hours in a year = 21.9 hours of unplanned downtime
b. 21.9 hours × $15k/hr = $328,500/year cost of unplanned downtime.
ROI totals:
Total of the incremental costs = $185,000 (for the year)
Total of downtime cost = $657,000 (for the year)
The incremental cost is about 28% of the downtime cost for 1 year. In other words, the investment in the HA solution will pay for itself in 0.28 year, or about 3.4 months!
In reality, most companies will have achieved the ROI within 6 to 9 months.
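The arithmetic above can be reproduced in a few lines of T-SQL using the scenario's own numbers; only the variable names are introduced here:

-- Incremental HA solution costs for Scenario 1 (1-year period)
DECLARE @Maintenance money = 20000 + 35000;   -- admin training + software licensing
DECLARE @Hardware    money = 100000;
DECLARE @Deployment  money = 20000 + 10000;   -- deployment + HA assessment
DECLARE @HACost      money = @Maintenance + @Hardware + @Deployment;   -- $185,000

-- Downtime cost: 0.25% planned + 0.25% unplanned of 8,760 hours, at $15k/hour
DECLARE @DowntimeHours decimal(8,2) = (0.0025 * 8760) * 2;              -- 43.8 hours
DECLARE @DowntimeCost  money        = @DowntimeHours * 15000;           -- $657,000

SELECT @HACost       AS IncrementalHACost,
       @DowntimeCost AS YearlyDowntimeCost,
       CAST(@HACost / @DowntimeCost AS decimal(5,2))      AS PaybackInYears,   -- 0.28
       CAST(@HACost / @DowntimeCost * 12 AS decimal(5,1)) AS PaybackInMonths;  -- about 3.4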
Summary
This chapter introduces a fairly formal approach to assessing and choosing a high
availability solution for your applications. In reality, most folks who are attempting
to do this Phase 0 HA assessment are really retrofitting their existing application for
high availability. That’s okay because this Phase 0 assessment directly supports the
retrofitting process. The key to success is doing as complete a job as you can on the
assessment and using some of your best folks to do it. They will interpret the
technology and the business needs with the most accuracy. You have a lot riding on
the proper assessment—potentially your company’s existence. If you cannot free up
your best folks to do this Phase 0 assessment, then hire some professionals who do
this every day to do it for you. You will recoup the relatively small cost of this short
effort very quickly.
FIGURE 3.17 Development methodology with high availability built in.
It is no small task to understand an application’s HA requirement, time to recovery,
tolerance of recovery, data resiliency, application resiliency, performance/scalability,
and costs of downtime (loss). Then, you must couple this information with your
hardware/software configurations, several Microsoft-based technology offerings, and
your allowable upgrade budget. The cost of not doing this upgrade will have a much
greater impact, and if you are going to move to a high availability solution, getting
the right one in place to start with will save tons of time and money in and of itself—
and potentially your job.
Chapters 4 through 10 describe the Microsoft solutions that can be used to create a
high availability solution (or component) and show exactly how to implement them.
Those chapters provide a cookbook-type approach that will take you through the
complete setup of something such as WSFC, SQL Clustering, Log Shipping, Data
Replication, Availability Groups, or even a Stretch Database configuration using
Microsoft Azure.
So, hold on to your hat, here we go.
Part III. Implementing High Availability
4 Failover Clustering
5 SQL Server Clustering
6 SQL Server AlwaysOn and Availability Groups
7 SQL Server Database Snapshots
8 SQL Server Data Replication
9 SQL Server Log Shipping
10 High Availability Options in the Cloud
11 High Availability and Big Data Options
12 Hardware and OS Options for High Availability
13 Disaster Recovery and Business Continuity
14 Bringing HA Together
15 Upgrading Your Current Deployment to HA
16 High Availability and Security
17 Future Direction of High Availability
CHAPTER 4. Failover Clustering
IN THIS CHAPTER
Variations of Failover Clustering
How Clustering Works
A SQL Clustering Configuration
An AlwaysOn Availability Group Configuration
In today’s fast-paced business environments, enterprise computing requires that
the entire set of technologies used to develop, deploy, and manage mission-critical
business applications be highly reliable, scalable, and resilient. The technology scope
includes the network, the entire hardware or cloud technology stack, the operating
systems on the servers, the applications you deploy, the database management
systems, and everything in between.
An enterprise must now be able to provide a complete solution with regard to the
following:
Scalability—As organizations grow, so does the need for more computing
power. The systems in place must enable an organization to leverage existing
hardware and quickly and easily add computing power as needs demand.
Availability—As organizations rely more on information, it is critical that the
information be available at all times and under all circumstances. Downtime is
not acceptable. Moving to five 9s reliability (that is, 99.999% uptime) is a must,
not a dream.
Interoperability—As organizations grow and evolve, so do their information
systems. It is impractical to think that an organization will not have many
heterogeneous sources of information. It is becoming increasingly important for
applications to get to all the information, regardless of its location.
Reliability—An organization is only as good as its data and information. It is
critical that the systems providing that information be bulletproof.
It is assumed that you are already using or acquiring a certain level of foundational
capabilities with regard to network, hardware, and operating system resilience.
However, there are some further foundational operating system components that can
be enabled to raise your high availability bar even higher. Central to many HA
features on Windows servers is Windows Server Failover Clustering (WSFC). This
has been around for quite some time and, as you will see in the next few chapters, it
is used at the foundation level to build out other more advanced and integrated
features for SQL Server and high availability. The tried-and-true HA solution using
WSFC is SQL clustering, which creates redundant SQL Server instances (for server
resilience) and shares storage between the servers. This shared storage is usually
mirrored storage of one kind or another (for storage resilience). You can use SQL
clustering for local server instance resilience and then include that with a larger HA
topology using AlwaysOn availability groups across several nodes.
For SQL clustering, a failover cluster instance (FCI) is created; it is essentially an
instance of SQL Server that is installed across WSFC nodes and, possibly, across
multiple subnets. On the network, an FCI appears to be an instance of SQL Server
running on a single computer; however, the FCI provides failover from one WSFC
node to another if the current (active) node becomes unavailable. You can achieve
many of your enterprise’s high availability demands easily and inexpensively by
using WSFC, network load balancing (NLB), SQL Server failover clustering, and
AlwaysOn availability groups (or combinations of them).
Note
You can install SQL Server on as many servers as you want; the number
is limited only by the operating system license and SQL Server edition
you have purchased. However, you should not overload WSFC with
more than 10 or so SQL Servers to manage if you can help it.
Understanding WSFC
A server failover cluster is a group of two or more physically separate servers
running WSFC and working collectively as a single system. The server failover
cluster, in turn, provides high availability, scalability, and manageability for
resources and applications. In other words, a group of servers is physically connected
via communication hardware (network), shares storage (via SCSI or Fibre Channel
connectors), and uses WSFC software to tie them all together into managed
resources.
Server failover clusters can preserve client access to applications and resources
during failures and planned outages. They provide server instance–level failover. If
one of the servers in a cluster is unavailable due to failure or maintenance, resources
and applications move (fail over) to another available cluster node.
Clusters use an algorithm to detect a failure, and they use failover policies to
determine how to handle the work from a failed server. These policies also specify
how a server is to be restored to the cluster when it becomes available again.
Although clustering doesn’t guarantee continuous operation, it does provide
availability sufficient for most mission-critical applications and is a building block of
numerous high availability solutions. WSFC can monitor applications and resources
and automatically recognize and recover from many failure conditions. This
capability provides great flexibility in managing the workload within a cluster, and it
improves the overall availability of the system. Technologies that are cluster aware
—such as SQL Server, Microsoft Message Queuing (MSMQ), Distributed
Transaction Coordinator (DTC), and file shares—have already been programmed to
work within WSFC.
WSFC still has some hardware and software compatibility to worry about, but the
built-in Cluster Validation Wizard allows you to see whether your configuration will
work. In addition, SQL Server FCIs are not supported where the cluster nodes are
also domain controllers.
Let’s look a little more closely at a two-node active/passive cluster configuration. At
regular intervals, known as time slices, the failover cluster nodes look to see if they
are still alive. If the active node is determined to be failed (not functioning), a
failover is initiated, and another node in the cluster takes over for the failed node.
Each physical server (node) uses separate network adapters for its own network
connection. (Therefore, there is always at least one network communication
capability working for the cluster at all times, as shown in Figure 4.2.)
Note
In general (and as part of a high availability disk configuration), the
quorum drive should be isolated to a drive all by itself and should be
mirrored to guarantee that it is available to the cluster at all times.
Without it, the cluster doesn’t come up at all, and you cannot access
your SQL databases.
The WSFC architecture requires a single quorum resource in the cluster that is used
as the tie-breaker to avoid split-brain scenarios. A split-brain scenario happens when
all the network communication links between two or more cluster nodes fail. In such
cases, the cluster may be split into two or more partitions that cannot communicate
with each other. WSFC guarantees that even in these cases, a resource is brought
online on only one node. If the different partitions of the cluster each brought a given
resource online, this would violate what a cluster guarantees and potentially cause
data corruption. When the cluster is partitioned, the quorum resource is used as an
arbiter. The partition that owns the quorum resource is allowed to continue. The
other partitions of the cluster are said to have “lost quorum,” and WSFC and any
resources hosted on nodes that are not part of the partition that has quorum are
terminated.
The quorum resource is a storage-class resource and, in addition to being the arbiter
in a split-brain scenario, is used to store the definitive version of the cluster
configuration. To ensure that the cluster always has an up-to-date copy of the latest
configuration information, you should deploy the quorum resource on a highly
available disk configuration (using mirroring, triple-mirroring, or RAID 10, at the
very least).
The notion of quorum as a single shared disk resource means that the storage
subsystem has to interact with the cluster infrastructure to provide the illusion of a
single storage device with very strict semantics. Although the quorum disk itself can
be made highly available via RAID or mirroring, the controller port may be a single
point of failure. In addition, if an application inadvertently corrupts the quorum disk
or an operator takes down the quorum disk, the cluster becomes unavailable.
This situation can be resolved by using a majority node set option as a single quorum
resource from a WSFC perspective. In this set, the cluster log and configuration
information are stored on multiple disks across the cluster. A new majority node set
resource ensures that the cluster configuration data stored on the majority node set is
kept consistent across the different disks.
The disks that make up the majority node set could, in principle, be local disks
physically attached to the nodes themselves or disks on a shared storage fabric (that
is, a collection of centralized shared storage area network [SAN] devices connected
over a switched-fabric or Fibre Channel–arbitrated loop SAN). In the majority node
set implementation that is provided as part of WSFC in Windows Server 2008 and
later, every node in the cluster uses a directory on its own local system disk to store
the quorum data, as shown in Figure 4.3.
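Once a SQL Server instance participating in the cluster has been enabled for AlwaysOn, you can also confirm the quorum model and the state of each voting member directly from T-SQL. The following is a minimal sketch using the sys.dm_hadr_cluster and sys.dm_hadr_cluster_members DMVs (they return rows only when the instance is enabled for AlwaysOn):

-- Quorum type and current quorum state for the WSFC cluster
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each cluster member, its state, and how many quorum votes it holds
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;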
How WSFC Sets the Stage for SQL Server Clustering and
AlwaysOn
Good setup practices include documenting all the needed Internet Protocol (IP)
addresses, network names, domain definitions, and SQL Server references to set up a
two-node SQL Server failover clustering configuration (configured in an
active/passive mode) or an AlwaysOn availability groups configuration before you
physically set up your clustering configuration.
You first identify the servers (nodes), such as PROD-DB01 (the first node) and PROD-
DB02 (the second node), and the cluster group name, DXD_Cluster.
The cluster controls the following resources:
Physical disks (Cluster Disk 1 is for the quorum disk, Cluster Disk 2 is for
the shared disks, and so on)
The cluster IP address (for example, 20.0.0.242)
The cluster name (network name) (for example, DXD_Cluster)
The DTC (optional)
The domain name (for example, DXD.local)
You need the following for SQL clustering documentation:
The SQL Server virtual IP address (for example, 192.168.1.211)
The SQL Server virtual name (network name) (for example, VSQL16DXD)
SQL Server instance (for example, SQL16DXD_DB01)
SQL Server agents
SQL Server SSIS services (if needed)
The SQL Server full-text search service instances (if needed)
You need the following for AlwaysOn availability groups documentation:
The availability group listener IP address (for example, 20.0.0.243)
The availability group listener name (for example, DXD_LISTENER)
The availability group name (for example, DXD_AG)
SQL Server instances (for example, SQL16DXD_DB01, SQL16DXD_DB02,
SQL16DXD_DR01)
SQL Server agents
SQL Server SSIS services (if needed)
The SQL Server full-text search service instances (if needed)
After you have successfully installed, configured, and tested your failover cluster (in
WSFC), you can add the SQL Server components as resources that will be managed
by WSFC. You will learn about the installation of SQL clustering and AlwaysOn
availability groups in Chapter 5, “SQL Server Clustering,” and Chapter 6, “SQL
Server AlwaysOn and Availability Groups.”
FIGURE 4.5 Using Windows 2012 R2 Server Manager to add failover clustering
to a local server.
2. In the next dialog box in the Add Roles and Features Wizard, select either a
role-based or feature-based installation. You don’t need remote installation.
When you’re done making selections (typically feature-based), click Next.
3. Choose the correct target (destination) server and then click Next.
4. In the dialog that appears, scroll down until you find the Failover Clustering
option and check it, as shown in Figure 4.6.
FIGURE 4.6 Using Windows 2012 R2 Server Manager to select Failover
Clustering.
Figure 4.7 shows the final installation confirmation dialog box that appears before
the feature is enabled. Once it is installed, you need to do the same thing on the other
nodes (servers) that will be part of the cluster. When you have completed this, you
can fire up the Failover Cluster Manager and start configuring clustering with these
two nodes.
FIGURE 4.7 Failover Clustering feature installation confirmation dialog.
With the Failover Clustering feature installed on your server, you are ready to use
the Validate a Configuration Wizard (see Figure 4.8) to specify all the nodes that
will be a part of the two-node cluster and check whether it is viable to use.
FIGURE 4.8 Failover Cluster Manager Validate a Configuration Wizard.
The cluster must pass all validations. When the validation is complete, you can
create the cluster, name it, and bring all nodes and resources online to the cluster.
Follow these steps:
1. On the first dialog in the wizard, specify the servers to use in the cluster. To do
so, click the Browse button to the right of the Enter Name box and specify the
two server nodes you want: PROD-DB01 and PROD-DB02 (see Figure 4.9). Then
click OK.
FIGURE 4.9 Failover Cluster Manager Select Servers or a Cluster dialog.
2. In the dialog that appears, specify to run all validation tests to guarantee that you
don’t miss any issues in this critical validation process. Figure 4.10 shows the
successful completion of the failover clustering validation for the two-node
configuration. Everything is labeled “Validated.”
FIGURE 4.10 Failover Cluster Manager failover cluster validation summary.
3. As shown in Figure 4.10, check the Create the Cluster Now Using the Validated
Nodes box and click Finish. The Create Cluster Wizard (yes, another wizard!)
now appears. This wizard will gather the access point for administering the
cluster, confirm the components of the cluster, and create the cluster with the
name you specify.
4. As shown in Figure 4.11, enter DXD_CLUSTER as the cluster name. A default IP
address is assigned to this cluster, but you can change it later to the IP address
that you prefer to use. Remember that this wizard knows about the two nodes
you just validated, and it automatically includes them in this cluster. As you can
see in Figure 4.12, both nodes are included in the cluster.
FIGURE 4.11 Creating the cluster name in the Create Cluster Wizard.
5. Ask that all eligible disk storage be added by checking the Add All Eligible
Storage to the Cluster box.
FIGURE 4.12 Creating a cluster with validated nodes and the eligible disk option
checked.
6. In the summary dialog that appears, review what will be done, including what
eligible storage will be included. Go ahead and click Finish. Figure 4.13 shows a
completed cluster configuration for the two-node cluster, including the cluster
network, cluster storage, and the two nodes (PROD-DB01 and PROD-DB02).
FIGURE 4.13 The new DXD_CLUSTER cluster.
Figure 4.14 shows you the node view of the cluster. As you can see, both nodes are
up and running. Figure 4.14 also shows the disks that are in this cluster. One disk is
used as the quorum drive (for cluster decisions), and the other is the main available
storage drive that will be used for databases and such. Don’t forget to specify (via
the properties of the clustered disks) both nodes as possible owners for the disks.
This is essential because the disks are shared between these two nodes.
You now have a fully functional failover cluster with a shared disk ready to be used
for things like SQL Server clustering. A good setup practice is to document all the
needed IP addresses, network names, domain definitions, and SQL Server references
required to set up a two-node SQL Server failover cluster.
FIGURE 4.14 Failover Cluster Manager view of both nodes and the disks in the
cluster.
Summary
Failover clustering is the cornerstone of both SQL Server clustering and the
AlwaysOn availability groups high availability configurations. As described in this
chapter, there can be shared resources such as storage, nodes, even SQL Servers.
These are cluster-aware resources that inherit the overall quality of being able to be
managed within a cluster as if they were one working unit. If any of the core
resources (such as a node) ever fails, the cluster is able to fail over to the other node,
thus achieving high availability at the server (node) level. In Chapter 5, you will see
how SQL Server failover clustering allows you to get SQL Server instance-level
failover with shared storage. Then, in Chapter 6, you will see how AlwaysOn
availability groups can give you SQL Server instance and database high availability
through redundancy of both the SQL Server and the databases.
CHAPTER 5. SQL Server Clustering
IN THIS CHAPTER
Installing SQL Server Clustering Within WSFC
Potential Problems to Watch Out for with SQL Server Failover Clustering
Multisite SQL Server Failover Clustering
Scenario 1: Application Service Provider with SQL Server Clustering
As described in Chapter 4, “Failover Clustering,” WSFC is capable of detecting
hardware or software failures and automatically shifting control of a failed server
(node) to a healthy node. SQL Server clustering implements SQL Server instance-
level resilience built on top of this core foundational clustering feature.
As mentioned previously, SQL Server is a fully cluster-aware application. The
failover cluster shares a common set of cluster resources, such as clustered (shared)
disk drives, networks, and, yes, SQL Server itself.
SQL Server allows you to fail over and fail back to or from another node in a cluster.
In an active/passive configuration, an instance of SQL Server actively services
database requests from one of the nodes in a SQL cluster (active node). Another
node is idle until, for whatever reason, a failover occurs. With a failover situation,
the secondary node (the passive node) takes over all SQL resources (databases)
without the end user ever knowing that a failover has occurred. The end user might
experience some type of brief transactional interruption because SQL clustering
cannot take over in-flight transactions. However, the end user is still just connected
to a single (virtual) SQL Server and truly doesn’t know which node is fulfilling
requests. This type of application transparency is a highly desirable feature that has
made SQL clustering fairly popular over the past 15 years.
In an active/active configuration, multiple instances of SQL Server run
simultaneously on different nodes with different databases, allowing organizations with more constrained hardware
requirements (that is, no designated secondary systems) to enable failover to or from
any node without having to set aside (idle) hardware. There can also be multisite
SQL clustering across data centers (sites), further enhancing the high availability
options that SQL Server clustering can fulfill.
Note
SQL Server failover clustering is available with SQL Server 2016
Standard Edition, Enterprise Edition, and Developer Edition. However,
Standard Edition supports only a two-node cluster. If you want to
configure a cluster with more than two nodes, you need to upgrade to
SQL Server 2016 Enterprise Edition, which supports as many nodes as
the operating system allows.
If this check fails, you must resolve the warnings before you continue. After product
key and licensing terms dialogs are completed, the install setup files are loaded.
You are then prompted to proceed to the SQL Server feature installation portion of
the setup in the Feature Selection dialog, shown in Figure 5.4. Next, a set of feature
rules validation checks must be passed for things like cluster supported for this
edition and product update language compatibility.
FIGURE 5.4 The Feature Selection dialog for a SQL Server failover cluster
installation.
Next, you need to specify the SQL Server network name (the name of the new
SQL Server failover cluster, which is essentially the virtual SQL Server name). You
also need to specify a named instance for the physical SQL Server itself
(SQL16DXD_DB01 in this example) on the PROD-DB01 node (as shown in Figure 5.5).
FIGURE 5.5 Specifying the SQL Server network name (VSQL16DXD) and instance
name (SQL16DXD_DB01).
When an application attempts to connect to an instance of SQL Server 2016 that is
running on a failover cluster, the application must specify both the virtual server
name and the instance name (if an instance name was used), such as
VSQL16DXD\SQL16DXD_DB01 (virtual server name\SQL Server instance name other
than the default) or VSQL16DXD (just the virtual SQL Server name, without the
default SQL Server instance name). The virtual server name must be unique on the
network.
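After connecting through the virtual server name, you can verify from T-SQL that the instance is clustered and which physical node is currently hosting it. This is a minimal sketch using documented SERVERPROPERTY values:

SELECT SERVERPROPERTY('ServerName')                   AS VirtualServerAndInstance,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS')  AS CurrentHostNode,
       SERVERPROPERTY('IsClustered')                  AS IsClustered;  -- 1 for an FCI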
Next comes the cluster resource group specification for your SQL cluster, followed
by the selection of the disks to be clustered. This is where the SQL Server resources
are placed within WSFC. For this example, you can use the SQL Server resource
group name (SQL16DXD_DB01) and click Next (see Figure 5.6). After you assign the
resource group, you need to identify which clustered disks are to be used on the
Cluster Disk Selection dialog, also shown in Figure 5.6. It contains a Cluster Disk 2
disk option (which was the shared drive volume) and a Cluster Disk 1 disk option
(which was the quorum drive location). You simply select the available drive(s)
where you want to put your SQL database files (the Cluster Disk 2 disk drive option
in this example). As you can also see, the only “qualified” disk is the Cluster Disk 2
drive. If the quorum resource is in the cluster group you have selected, a warning
message is issued, informing you of this fact. A general rule of thumb is to isolate
the quorum resource to a separate cluster group.
FIGURE 5.6 Cluster resource group specification and cluster disk selection.
The next thing you need to do for this new virtual server specification is to identify
an IP address and which network it should use. As you can see in the Cluster
Network Configuration dialog, shown in Figure 5.7, you simply type in the IP
address (in this example, 20.0.0.222) that is to be the IP address for this virtual SQL
Server for the available networks known to this cluster configuration (in this
example, for the Cluster Network 1 network). If the IP address being specified is
already in use, an error occurs.
Note
Keep in mind that you are using a separate IP address for the virtual
SQL Server failover cluster that is completely different from the cluster
IP addresses themselves. In an unclustered installation of SQL Server,
the server can be referenced using the machine’s IP address. In a
clustered configuration, you do not use the IP addresses of the physical
servers; instead, you use this separately assigned IP address for the
“virtual” SQL Server.
You then need to specify the server configuration service accounts for the SQL
Server Agent, Database Engine, and so on. These should be the same for both nodes you
will be including in the SQL Server cluster configuration (as you can see in Figure
5.8, SQL Server Agent account name and SQL Server Database Engine account
name).
FIGURE 5.7 Specifying the virtual SQL Server IP address and which network to
use.
FIGURE 5.8 Specifying the SQL Server service accounts and passwords for the
SQL Server failover cluster.
You then see the Database Engine Configuration dialog with all the standard things
you usually must specify: authentication mode, data directories, TempDB, and
filestream options. At this point, you have worked your way down to the feature
configuration rules check to determine whether everything specified to this point is
correct. The next dialog shows a summary of what is about to be done in this
installation and the location of the configuration file (and path) that can be used later
if you are doing command-line installations of new nodes in the cluster (see Figure
5.9).
FIGURE 5.9 Ready to install the SQL Server failover cluster.
Figure 5.10 shows the completed SQL Server installation for this node.
Before moving on to the next part of the SQL Server failover cluster node
installation, you can take a quick peek at what the Failover Cluster Manager has set
up so far by opening the Roles node in the failover cluster. Figure 5.11 shows the
SQL Server instance up and running within the failover cluster, the server name
(VSQL16DXD), and the other clustered resources, including the SQL Server instance
and the SQL Server Agent for that instance. All are online and being managed by the
failover cluster now. However, there is no second node yet. If this SQL Server
instance failed right now, it would not have anything to fail over to.
FIGURE 5.10 The SQL Server 2016 failover cluster preparation and install is
complete for this node.
FIGURE 5.11 The Failover Cluster Manager, showing the newly created SQL
Server instance and other clustered resources.
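Alongside the Failover Cluster Manager view, you can confirm the same information from inside the SQL Server instance. A quick sketch using the cluster-related DMVs (these return rows only for a clustered instance):

-- Nodes in the WSFC that can host this FCI, and which node currently owns it
SELECT NodeName, status_description, is_current_owner
FROM sys.dm_os_cluster_nodes;

-- Shared cluster disks visible to this FCI
SELECT DriveName
FROM sys.dm_io_cluster_shared_drives;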
Now you must take care of the second node that is to be in the SQL Server cluster
(PROD-DB02 in this example). You can add as many nodes to a SQL Server cluster
configuration as needed, but here you’ll stick to two for the active/passive
configuration. For the second node, you must select Add Node to a SQL Server
Failover Cluster, as shown in Figure 5.12.
FIGURE 5.12 The Add Node to a SQL Server Failover Cluster option in SQL
Server Installation Center.
The Add a Failover Cluster Node Wizard first does a brief global rules check and
looks for any critical Microsoft Windows updates and SQL Server product updates
the installation may need. It is always a good idea to install the most up-to-date code
possible. In Figure 5.13, you can see that a critical update for SQL Server has been
identified.
FIGURE 5.13 Microsoft Windows and SQL Server Product Updates dialog.
All product updates and setup files are added for the installation. The next step in the
wizard is the Add Node Rules dialog, which runs some preliminary rules checking,
captures the product key and license terms, deals with the cluster node configuration,
handles the setup of the service accounts, and then adds the new node to the cluster.
These are essentially the same steps you performed for the original node, but from the second
node's perspective (as you can see in Figure 5.14). This includes verification of all
the failover cluster services on that second node, DTC verification, cluster remote
access (for PROD-DB02) and DNS settings (for both PROD-DB01 and PROD-DB02).
FIGURE 5.14 Add Node Rules dialog for the second node.
Next comes the cluster node configuration, where you identify the SQL Server
instance name, verify the name of the node (PROD-DB02 in this example), and verify
the cluster network name being used for this cluster node configuration. Figure 5.15
shows this node configuration and that the SQL Server instance (SQL16DXD_DB01) is
also already configured with the PROD-DB01 node.
FIGURE 5.15 Cluster Node Configuration dialog for the second node (PROD-
DB02).
The cluster network configuration is then identified (verified) next. The current node
being added is already associated with the cluster, so you don’t have to modify or
add an IP address here (unless you are doing multisite configurations). Next comes
the specification of the service accounts that are needed by this node for SQL Server.
Both the network and the service accounts specifications are shown in Figure 5.16.
Now you have completed the second node’s configuration, and you can simply
review what is about to be done by the setup process and complete the installation.
When it is complete, you may have to restart the server. When all node installations
are complete, you should have a fully operational SQL Server clustering
configuration.
FIGURE 5.16 Cluster network configuration and the SQL Server service
accounts.
You should do a quick check to make sure you can get to the database via the virtual
SQL Server network cluster name and the old reliable AdventureWorks database that
is attached to this SQL Server cluster configuration. You can also test that the nodes
fail over properly and allow you to get to your data, regardless of which node is
active. You can do this test with a brute-force approach. First, connect to the virtual
SQL Server cluster name (VSQL16DXD\SQL16DXD_DB01 in this example), do a quick
SELECT against the Person table, shut down the PROD-DB01 server, and then try to
execute the same SELECT against the Person table. When you do this, you get an
error saying that the SQL Server cluster name is not connected. After about 4 or 5
seconds, you can issue the same SELECT statement again, and you get success—the
result rows from the Person table. This test sequence is shown in Figure 5.17. The
left side of the figure shows the connection to the virtual SQL Server cluster name,
and just to the right is the failed connection after you have shut down the PROD-DB01
node. Then at the bottom right is the successful SELECT against the Person table,
with the expected result rows.
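A rough sketch of that test sequence follows; the table name assumes the Person.Person table found in the newer AdventureWorks sample databases (older versions use Person.Contact):

-- Connected to the virtual name: VSQL16DXD\SQL16DXD_DB01
SELECT TOP (5) BusinessEntityID, FirstName, LastName
FROM Person.Person;   -- succeeds on the active node

-- Shut down PROD-DB01, wait a few seconds for failover, then rerun:
SELECT TOP (5) BusinessEntityID, FirstName, LastName
FROM Person.Person;   -- now serviced by PROD-DB02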
Note
Alternatively, you could also do the same SQL Server clustering
configuration setup by using the SQL Server setup Advanced option.
Either setup process works.
Summary
Building out your company’s infrastructure with clustering technology at the heart is
a huge step toward achieving five 9s reliability. If you do this, every application,
system component, and database you deploy on this architecture will have that added
element of resilience. WSFC and SQL failover clustering are high availability
approaches at the instance level. In many cases, the application or system component
changes needed to take advantage of these clustering technologies are completely
transparent. Utilizing a combination of NLB and WSFC allows you to not only fail
over applications but also scale for increasing network capacity. Many organizations
around the globe have used this two-node active/passive SQL Server clustering
approach over the past 10 or 15 years.
As you will see with the AlwaysOn features in the next chapter, expanding this
resilience to the database tier adds even more high availability and scalability to your
implementations if a much higher HA requirement is needed.
CHAPTER 6. SQL Server AlwaysOn and
Availability Groups
IN THIS CHAPTER
AlwaysOn and Availability Groups Use Cases
Building a Multinode AlwaysOn Configuration
Dashboard and Monitoring
Scenario 3: Investment Portfolio Management with AlwaysOn and Availability
Groups
Microsoft continues to push the high availability and performance bar higher and
higher. Extensive HA options such as AlwaysOn and availability groups and
AlwaysOn failover cluster instances (FCIs), coupled with a variety of other
Windows Server family enhancements, provide almost everyone with a chance at
achieving the mythical five 9s (that is, 99.999% uptime). Microsoft is investing in
this HA approach for most of the next-generation SQL Server HA options. However,
you might have noticed that some of the concepts and technical approaches in
AlwaysOn availability groups are a bit reminiscent of SQL Server clustering and
database mirroring—and that’s because they are! Both of these other features paved
the way for what we now know as AlwaysOn availability groups. It is also important
to remember that this and other HA options build on top of Windows Server
Failover Clustering (WSFC).
Modes
As with database mirroring, two primary replication modes are used to move data via
the transaction log from the primary replica to the secondary replicas: synchronous
mode and asynchronous mode.
Synchronous mode means that the data writes of any database change must be done
in not only the primary replica but also the secondary replica, as a part of one logical
committed transaction.
Figure 6.1 shows a rounded box around both the primary and secondary database
that is using the synchronous mode of replication. This can be costly in the sense of
doubling the writes, so the connection between the primary and secondary should be
fast and nearby (within the same subnet). However, for this same reason, the primary
and secondary replicas are in a transactionally consistent state at all times, which
makes failover nearly instantaneous. Synchronous mode is used for automatic
failover between the primary replica and the secondary replica. You can have up to
three nodes in synchronous mode (essentially two secondaries and one primary at
once). Figure 6.1 shows that Node 1 and Node 2 are configured to use automatic
failover mode (synchronous). And, as previously mentioned, because of this
transactional consistency, it is also possible to do database backups against the
secondary replica with 100% accuracy and integrity.
Asynchronous mode does not have the commit transaction requirement that
synchronous mode has, and it is actually pretty lightweight (from performance and
overhead points of view). Even in asynchronous mode, transactions can make it to
the secondary replicas pretty quickly (in a few seconds) in most cases. Network
traffic and the number of transactions determine the speed. Asynchronous mode can
also be used just about anywhere you need it within a stable network (across the
country or even to another continent, if you have decent network speeds).
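Once an availability group exists, you can see which commit mode and failover mode each replica is using. Here is a minimal sketch against the availability group catalog views (DXD_AG is the example group name used in this chapter):

SELECT ag.name AS availability_group,
       ar.replica_server_name,
       ar.availability_mode_desc,   -- SYNCHRONOUS_COMMIT or ASYNCHRONOUS_COMMIT
       ar.failover_mode_desc        -- AUTOMATIC or MANUAL
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
     ON ar.group_id = ag.group_id;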
The AlwaysOn availability groups feature also takes advantage of transaction record
compression, which compresses all the transaction log records used in database
mirroring and AlwaysOn configurations to increase the speed of transmission to the
mirror or replicas. This both increases the number of log transactions you can shoot
across to the secondary and also makes the size of the transmissions much smaller,
so the log records get to their destinations that much faster.
In addition, as with database mirroring, if data page errors are detected during data
replication of the transaction, the data pages on the secondary replica are repaired as
part of the transaction writes to the replica, raising overall database stability even
higher than if you had not been replicating. This is a nice feature.
Read-Only Replicas
As you can see in Figure 6.1, you can create more secondary replicas (up to eight).
However, they must be asynchronous replicas. You can easily add these replicas to
the availability group to distribute workload and significantly offload read activity
from the primary replica. Figures 6.1 and 6.2 show two additional secondary replicas
used to handle all the read-only data accesses that would normally be hitting the
primary database. These read-only replicas have near-real-time data and can be
pretty much anywhere you want (from a stable network point of view).
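A secondary can be opened up for read-only connections through the wizard or with T-SQL. The following is a sketch only, assuming the DXD_AG group from this chapter and a replica named DR-DB01\SQL16DXD_DR01 (the exact replica server name depends on your node and instance names):

ALTER AVAILABILITY GROUP [DXD_AG]
MODIFY REPLICA ON N'DR-DB01\SQL16DXD_DR01'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));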
Endpoints
Availability groups also leverage the endpoint concept for all communication (and
visibility) from one node to another node in an availability group configuration.
These endpoints are the exposed points used by the availability group
communication between nodes. This is also the case with database mirroring.
Availability group endpoints are created as a part of each availability group node
configuration (for each replica).
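The endpoint for an availability group replica looks much like the database mirroring endpoints shown in Chapter 7. The New Availability Group Wizard normally creates it for you, but a hand-built sketch (the endpoint name and port 5022 are assumed examples) looks like this:

CREATE ENDPOINT [Hadr_endpoint]
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
    FOR DATA_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE,
                        ENCRYPTION = REQUIRED ALGORITHM AES);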
FIGURE 6.5 The installed Failover Clustering feature shown from the Server
Manager.
You likely need to also run a validation of the cluster configuration.
A number of extensive tests are performed on each node in the cluster that you are
configuring. These tests take a bit of time, so go get tea or coffee, and then make
sure that you look through the summary report for any true errors. You’ll likely see a
few warnings that refer to items that were not essential to the configuration (typically
some TCP/IP or network-related things). When this is done, you are ready to get into
the AlwaysOn business.
Figure 6.6 shows how you create the cluster group access point (named
DXD_Cluster) in Failover Cluster Manager. The IP address for this access point is
20.0.0.242.
FIGURE 6.6 Using Failover Cluster Manager to create an access point for
administering the cluster and cluster name.
This cluster should contain three nodes, PROD-DB01, PROD-DB02, and DR-DB01, as
shown in the Failover Cluster Manager in Figure 6.7.
FIGURE 6.7 Failover Cluster Manager with the DXD_Cluster cluster and three
nodes.
It’s now time to start the AlwaysOn configuration on the SQL Server side of the
equation.
Enabling AlwaysOn HA
For each of the SQL Server instances that you want to include in the AlwaysOn
configuration, you need to enable their instances for AlwaysOn; it is turned off by
default. From each node, bring up the SQL Server 2016 Configuration Manager and
select the SQL Server Services node in the Services pane. Right-click the SQL
Server instance for this node (with instance name SQL16DXD_DB01 in this example)
and choose Properties. Figure 6.8 shows the properties of this SQL Server instance.
Click the AlwaysOn High Availability tab and check Enable AlwaysOn Availability
Groups. (Notice that the cluster name appears in this dialog box because this server
was identified already in the cluster configuration step.) Click OK (or Apply), and
you see a note about having to restart the service for this option to be used. After you
have closed the Properties dialog, go ahead and right-click the SQL Server instance
service again, but this time choose the Restart option to enable the AlwaysOn HA
feature. For each of the other nodes (SQL16DXD_DB02 and SQL16DXD_DR01 in this
example), do the same for the SQL Server configuration and the SQL Server
instance that is to be included in the AlwaysOn configuration.
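After the restart, you can verify from each instance that the feature is actually active with a quick check:

SELECT SERVERPROPERTY('IsHadrEnabled') AS IsHadrEnabled;  -- 1 = AlwaysOn enabled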
FIGURE 6.9 Doing a full database backup for the primary database
(AdventureWorks).
At this point, I usually like to copy this full database backup to each secondary
instance and restore it to guarantee that I have exactly the same thing at all nodes
before the synchronization begins. When doing these restores, be sure to specify the
Restore with No Recovery option. When you finish this, you can move on to creating
the availability group.
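A sketch of that backup-and-restore sequence in T-SQL follows; the share path is an assumption, the backup runs on the primary, and the restore runs on each secondary:

-- On the primary (SQL16DXD_DB01)
BACKUP DATABASE [AdventureWorks]
TO DISK = N'\\PROD-DB01\SQLBackups\AdventureWorks_AGSeed.bak'
WITH FORMAT;

-- On each secondary (SQL16DXD_DB02, SQL16DXD_DR01)
RESTORE DATABASE [AdventureWorks]
FROM DISK = N'\\PROD-DB01\SQLBackups\AdventureWorks_AGSeed.bak'
WITH NORECOVERY;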
Creating the Availability Group
From Node 1 (SQL16DXD_DB01 node in this example), expand the AlwaysOn High
Availability node for this SQL Server instance (in SSMS). As you can see in Figure
6.10, you can right-click the Availability Group node and choose to create a new
availability group (via the wizard). This is where all the action will be in creating the
entire availability group.
FIGURE 6.10 Invoking the New Availability Group Wizard from SSMS.
In the New Availability Group Wizard, you can specify the availability group name,
select the database to replicate, specify the replicas, select the data synchronization,
and then do validations. Initially, there is a splash page for the wizard on which you
just click Next. This brings you to the Specify Availability Group Name dialog.
Figure 6.11 shows this dialog, with the availability group name DXD-AG being
specified. Click Next.
FIGURE 6.11 Specifying the availability group name.
Note
If you choose the Full option, it is important that the shared network
location be fully accessible by the service accounts from all nodes in the
availability group (the service account being used by the SQL Server
services on each node).
As you can see in Figure 6.17, the validation dialog appears, showing you the
success, failure, warnings, or steps skipped in this creation process, based on the
options you specified. In Figure 6.17 you can also see the summary dialog of all
work to be performed, and you get one last chance to verify the work about to be
done before it is executed. Just click Next to finish this process.
FIGURE 6.17 Validation and summary steps in creating the availability group.
As shown in Figure 6.18, you now see the results of the availability group creation
steps. You can see that the secondaries were joined to the availability group
(indicated by the arrow in Figure 6.18), and you can see what the availability group
node in SSMS will contain for the newly formed availability group. The availability
group is functional now. You can see the primary and secondary replicas, the
databases within the availability group, and an indication at the database node level
about whether the database is synchronized.
FIGURE 6.18 New Availability Group results dialog along with the SSMS Object
Explorer results of the availability group and the joined replicas.
When the status of the databases changes to “synchronized,” you are in business.
However, to complete the abstraction of instance names away from the applications,
you should create the availability group listener to complete this configuration.
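The listener can be added through the wizard or afterward with T-SQL. Here is a minimal sketch, using the DXD_LISTENER name and the 20.0.0.243 address from the documentation list in Chapter 4 (the subnet mask and port are assumptions):

ALTER AVAILABILITY GROUP [DXD_AG]
ADD LISTENER N'DXD_LISTENER'
    (WITH IP ((N'20.0.0.243', N'255.255.255.0')), PORT = 1433);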
Summary
With SQL Server 2016, it’s all about the AlwaysOn features. The adoption of the
AlwaysOn and availability groups capabilities has been astonishing. Older, more
complex HA solutions are being cast aside left and right in favor of this clean, highly
scalable method of achieving five 9s and high performance. It is truly the next
generation of HA and scale-out for existing and new database tiers of any kind.
Microsoft publicly advises all customers that have implemented log shipping,
database mirroring, and even SQL clustering to move to AlwaysOn availability
groups at some point. As with Scenario 3, these extreme availability requirements fit
very nicely with what availability groups bring to the table: short failover times and
very limited data loss. More importantly, transactional
performance could be maintained by offloading reporting and backups to secondary
replicas.
CHAPTER 7. SQL Server Database Snapshots
IN THIS CHAPTER
What Are Database Snapshots?
Copy-on-Write Technology
When to Use Database Snapshots
Setup and Breakdown of a Database Snapshot
Reverting to a Database Snapshot for Recovery
What Is Database Mirroring?
Setting Up and Configuring Database Mirroring
Testing Failover from the Principal to the Mirror
Setting Up DB Snapshots Against a Database Mirror
Scenario 3: Investment Portfolio Management with DB Snapshots and DB
Mirroring
Database snapshots have been a feature of competing database products (including
Oracle and DB2) for years. Database snapshots are great for fulfilling point-in-time
reporting requirements; they directly increase your reporting consistency,
availability, and overall performance. They are also great for reverting a database
back to a point in time (supporting your recovery time objectives, recovery point
objectives, and overall availability) and for potentially reducing the processing
impact of querying against primary transactional databases when used with database
mirroring. All these factors contribute to the high availability picture in some way.
Database snapshots are fairly easy to implement and administer.
However, keep in mind that database snapshots are point-in-time reflections of an
entire database and are not bound to the underlying database objects from which they
pull their data. A snapshot provides a full, read-only copy of the database at a
specific point in time. Because of this point-in-time aspect, data latency must be well
understood for all users of this feature: Snapshot data is only as current as when the
snapshot was made.
As mentioned previously, database snapshots can be used in conjunction with
database mirroring to provide a highly available transactional system and a reporting
platform that is created from the database mirror and offloads the reporting away
from the primary transactional database, without any data loss impact whatsoever.
This is a very powerful reporting and availability configuration. Database mirroring
is available in the Standard Edition of SQL Server, whereas database snapshots
require the Enterprise Edition of SQL Server.
Note
Database mirroring has been earmarked for deprecation for some time.
However, it still just keeps being included with each SQL Server
release. AlwaysOn availability groups are the recommended new
method of creating redundant databases (secondaries) for high
availability and offloading of reporting loads to separate SQL instances
(see Chapter 6, “SQL Server AlwaysOn and Availability Groups”).
However, as you will see in this chapter, database mirroring is
extremely effective at creating a single mirrored database and is fairly
easy to set up administratively. So, it's okay to use this feature, but just
keep in mind that it will likely go away in the next SQL Server version
(SQL Server 2018?) or certainly in the one following that (SQL Server
2020?). Or will it?
FIGURE 7.1 Basic database snapshot concept: a source database and its database
snapshot, all on a single SQL Server instance.
This point-in-time view of a database's data never changes, even though the data
(data pages) in the primary database (the source of the database snapshot) may
change. It is truly a snapshot at a point in time. A snapshot always simply points to
the data pages that were present at the time it was created. If a data page is updated
in the source database, a copy of the original source data page is written to a new
page chain, termed the sparse file, via copy-on-write technology.
Figure 7.2 shows the sparse file that is created, alongside the source database itself.
FIGURE 7.2 Source database data pages and the sparse file data pages comprising
the database snapshot.
A database snapshot really uses the primary database's data pages up until the point
that one of these data pages is updated (that is, changed in any way). As mentioned
previously, if a data page is updated in the source database, the original copy of the
data page (which is referenced by the database snapshot) is written to the sparse file
page chain as part of an update operation, using the copy-on-write technology. It is
this new data page in the sparse file that still provides the correct point-in-time data
to the database snapshot that it serves. Figure 7.3 illustrates that as more data
changes (updates) occur in the source database, the sparse file gets larger and larger
with the old original data pages.
Eventually, a sparse file could contain the entire original database if all data pages in
the primary database were changed. As you can see in Figure 7.3, which data pages
the database snapshot uses from the original (source) database and which data pages
it uses from the sparse file are managed by references in the system catalog for the
database snapshot. This setup is incredibly efficient and represents a major
breakthrough in providing data to others. Because SQL Server is using the copy-on-
write technology, a certain amount of overhead is used during write operations. This
is one of the critical factors you must sort through if you plan on using database
snapshots. Nothing is free. The overhead includes the copying of the original data
page, the writing of this copied data page to the sparse file, and the subsequent
metadata updating to the system catalog that manages the database snapshot data
page list. Because of this sharing of data pages, it should also be clear why database
snapshots must be within the same instance of a SQL Server: Both the source
database and snapshot start out as the same data pages and then diverge as source
data pages are updated. In addition, when a database snapshot is created, SQL Server
rolls back any uncommitted transactions for that database snapshot; only the
committed transactions are part of a newly created database snapshot. And, as you
might expect of something that shares data pages, database snapshots become
unavailable if the source database becomes unavailable (for example, if it is
damaged or goes offline).
FIGURE 7.3 Data pages being copied to the sparse file for a database snapshot as
pages are being updated in the source database.
Note
You might plan to do a new snapshot after about 30% of the source
database has changed to keep overhead and file sizes in the sparse file at
a minimum. The problem that most frequently occurs with database
snapshots is related to sparse file sizes and available space. Remember
that the sparse file has the potential of being as big as the source
database itself (if all data pages in the source database eventually get
updated). Plan ahead for this situation!
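You can keep an eye on how large a snapshot's sparse file has actually grown with a quick query. This is a sketch using sys.dm_io_virtual_file_stats; the file_id of 1 assumes a single data file:

SELECT DB_NAME(database_id)            AS snapshot_name,
       size_on_disk_bytes / 1048576.0  AS sparse_file_mb
FROM sys.dm_io_virtual_file_stats(DB_ID('SNAP_AdventureWorks_6AM'), 1);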
There are, of course, alternatives to database snapshots, such as data replication, log
shipping, and even materialized views, but none are as easy to manage and use as
database snapshots.
The following terms are commonly associated with database snapshots:
Source database—This is the database on which the database snapshot is based.
A database is a collection of data pages. It is the fundamental data storage
mechanism that SQL Server uses.
Snapshot databases—There can be one or more database snapshots defined
against any one source database. All snapshots must reside in the same SQL
Server instance.
Database snapshot sparse file—This new data page allocation contains the
original source database data pages when updates occur to the source database
data pages. One sparse file is associated with each database data file. If you have
a source database allocated with one or more separate data files, you have
corresponding sparse files for each of them.
Reverting to a database snapshot—If you restore a source database based on a
particular database snapshot that was done at a point in time, you are reverting.
You are actually doing a database RESTORE operation with a FROM
DATABASE_SNAPSHOT statement.
Copy-on-write technology—As part of an update transaction in the source
database, a copy of the source database data page is written to a sparse file so
that the database snapshot can be served correctly (that is, still see the data page
as of the snapshot point in time).
As Figure 7.4 illustrates, any data query using the database snapshot looks at both
the source database data pages and the sparse file data pages at the same time. These
data pages always reflect the unchanged data pages at the point in time the snapshot
was created.
FIGURE 7.4 A query using the database snapshot touches both source database
data pages and sparse file data pages to satisfy a query.
Copy-on-Write Technology
The copy-on-write technology that Microsoft first introduced with SQL Server 2005
is at the core of both database mirroring and database snapshot capabilities. This
section walks through a typical transactional user's update of data in a source
database.
As you can see in Figure 7.5, an update transaction is initiated against the
AdventureWorks database (labeled A). As the data is being updated in the source
database's data page and the change is written to the transaction log (labeled B), the
copy-on-write technology also copies the original source database data page in its
unchanged state to the sparse data file (also labeled B) and updates the metadata
page references in the system catalog (also labeled B) with this movement.
The original source data page is still available to the database snapshot. This adds
extra overhead to any transaction that updates, inserts, or deletes data from the
source database. After the copy-on-write technology finishes its write on the sparse
file, the original update transaction is properly committed, and acknowledgment is
sent back to the transactional user (labeled C).
FIGURE 7.5 Using the copy-on-write technology with database snapshots.
Note
Database snapshots cannot be used for any of SQL Server's internal
databases—tempdb, master, msdb, or model. Also, database snapshots
are supported only in the Enterprise Edition of SQL Server 2016.
FIGURE 7.6 Basic database snapshot configuration: a source database and one or
more database snapshots at different time intervals.
To revert to a particular snapshot interval, you simply use the RESTORE DATABASE
command with the FROM DATABASE_SNAPSHOT statement. This is a complete
database restore; you cannot limit it to just a single database object. In addition, you
must drop all other database snapshots before you can use one of them to restore a
database.
As you can see in Figure 7.6, a very specific SQL statement referencing a snapshot
could be used if you knew exactly what you wanted to restore at the table and row
levels. You could simply use SQL statements (such as an UPDATE SQL statement
[labeled A] or an INSERT SQL statement) from one of the snapshots to selectively
apply only the fixes you are sure need to be recovered (reverted). In other words, you
don't restore the whole database from the snapshot; you use only some of the
snapshot’s data with SQL statements and bring the messed-up data row values back
in line with the original values in the snapshot. This is at the row and column level
and usually requires quite a bit of detailed analysis before it can be applied to a
production database.
It is also possible to use a snapshot to recover a table that someone accidentally
dropped. There is a little data loss since the last snapshot, but it involves a simple
INSERT INTO statement from the latest snapshot before the table drop. Be careful
here, but consider the value as well.
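A sketch of that kind of recovery follows, assuming a hypothetical table named dbo.SomeTable that was dropped and has been re-created with its original structure:

INSERT INTO AdventureWorks.dbo.SomeTable
SELECT *
FROM SNAP_AdventureWorks_6AM.dbo.SomeTable;  -- latest snapshot taken before the drop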
FIGURE 7.7 Creating a before database snapshot prior to scheduled mass updates
to a database.
If you are not satisfied with the entire update operation, you can use RESTORE
DATABASE from the snapshot and revert it to this point. Or, if you are happy with
some updates but not others, you can use the SQL UPDATE statement to selectively
update (restore) particular values back to their original values using the snapshot.
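A sketch of such a selective repair, assuming a hypothetical dbo.Rates table whose Rate column was damaged by the mass update:

UPDATE tgt
SET    tgt.Rate = snap.Rate
FROM   AdventureWorks.dbo.Rates AS tgt
JOIN   SNAP_AdventureWorks_6AM.dbo.Rates AS snap
       ON snap.RateID = tgt.RateID
WHERE  tgt.Rate <> snap.Rate;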
FIGURE 7.8 Establishing a baseline testing database snapshot before running tests
and then reverting when finished.
You then just run your test scripts or do any manual testing—as much as you want—
and then revert back to this starting point rapidly. Then you run more tests again.
You need to worry about only the data portion of the database for the snapshot:
CREATE DATABASE SNAP_AdventureWorks_6AM
ON
( NAME = AdventureWorks_Data,
  FILENAME = 'C:\Server\MSSQL13.SQL2016DXD01\MSSQL\DATA\SNAP_AW_data_6AM.snap' )
AS SNAPSHOT OF AdventureWorks
go
Creating the database snapshot is really that easy. Now let's walk through a simple
example that shows how to create a series of four database snapshots against the
AdventureWorks source database that represent snapshots six hours apart (as shown
in Figure 7.6). Here is the next snapshot to be run at 12:00 p.m.:
CREATE DATABASE SNAP_AdventureWorks_12PM
ON
( NAME = AdventureWorks_Data,
  FILENAME = 'C:\Server\MSSQL13.SQL2016DXD01\MSSQL\DATA\SNAP_AW_data_12PM.snap' )
AS SNAPSHOT OF AdventureWorks
go
These snapshots are made at equal time intervals and can be used for reporting or
reverting.
Note
This book uses a simple naming convention for the database names for
snapshots and for the snapshot files themselves. The database snapshot
name is the word SNAP, followed by the source database name,
followed by a qualifying description of what this snapshot represents, all
separated with underscores. For example, a database snapshot that
represents a 6:00 a.m. snapshot of the AdventureWorks database would
have this name:
SNAP_AdventureWorks_6AM
The snapshot file-naming convention is similar. The name starts with
the word SNAP, followed by the database name that the snapshot is for
(AdventureWorks, in this example), followed by the data portion
indication (for example, data or data1), a short identification of what
this snapshot represents (for example, 6AM), and then the filename
extension .snap to distinguish it from .mdf and .ldf files. For example,
the snapshot filename for the preceding database snapshot would look
like this:
SNAP_AdventureWorks_data_6AM.snap
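For example, querying the credit card row through the 6:00 a.m. snapshot (a minimal sketch, assuming the Sales.CreditCard table of the sample database) looks like this:

SELECT CreditCardID, CardType, CardNumber, ExpMonth, ExpYear, ModifiedDate
FROM SNAP_AdventureWorks_6AM.Sales.CreditCard
WHERE CreditCardID = 1;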
This statement delivers the correct, point-in-time result rows from the database
snapshot:
CreditCardID CardType     CardNumber     ExpMonth ExpYear ModifiedDate
------------ ------------ -------------- -------- ------- -----------------------
1            SuperiorCard 33332664695310 11       2006    2013-12-03 00:00:39.560
You see how this looks by opening SQL Server Management Studio. Figure 7.11
shows the database snapshot database SNAP_AdventureWorks_6AM along with the
source database AdventureWorks. It also shows the results of the system queries on
these database object properties.
FIGURE 7.11 SSMS snapshot DB branch, system query results, and snapshot
isolation state (ON).
You are now in the database snapshot business!
If you'd like, you can also drop (delete) a database snapshot from SQL Server
Management Studio by right-clicking the database snapshot entry and choosing the
Delete option. However, it's best to do everything with scripts so that you can
accurately reproduce the same action over and over.
This query shows the existing source database and the newly created database
snapshot, as follows:
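A sketch of such a query against sys.databases (the exact statement is assumed; the column list matches the output shown):

SELECT name, database_id, source_database_id, create_date, snapshot_isolation_state_desc
FROM sys.databases
WHERE name LIKE '%AdventureWorks%';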
name                     database_id source_database_id create_date             snapshot_isolation_state_desc
------------------------ ----------- ------------------ ----------------------- ------------------------------
AdventureWorks           5           NULL               2016-02-17 23:37:02.763 OFF
SNAP_AdventureWorks_6AM  9           5                  2016-12-05 06:00:36.597 ON
SNAP_AdventureWorks_12PM 10          5                  2016-12-05 12:00:36.227 ON
In this example, there are two snapshots against the AdventureWorks database. The
one you don't want to use when reverting must be dropped first. Then you can
proceed to restore the source database with the remaining snapshot that you want.
These are the steps:
1. Drop the unwanted snapshot(s):
Use [master]
go
DROP DATABASE SNAP_AdventureWorks_12PM
go
2. Issue the RESTORE DATABASE command with the remaining snapshot:
USE [master]
go
RESTORE DATABASE AdventureWorks FROM DATABASE_SNAPSHOT =
'SNAP_AdventureWorks_6AM'
go
When this process is complete, the source database and snapshot are essentially the
same point-in-time database. But remember that the source database quickly
diverges as updates begin to flow in again.
Note
Database mirroring cannot be implemented on a database that is also
configured for filestream storage.
Note
Database mirroring cannot be used for any of SQL Server's internal
databases—tempdb, master, msdb, or model. Database mirroring is
fully supported in SQL Server Standard Edition, Developer Edition, and
Enterprise Edition, but it is not supported in SQL Server Express
Edition. However, even machines running SQL Server Express Edition
can be used as witness servers.
After this T-SQL runs, you should run the following SELECT statements to verify that
the endpoint has been correctly created:
select name,type_desc,port,ip_address from sys.tcp_endpoints;
select name,role_desc,state_desc from
sys.database_mirroring_endpoints;
If you also look at the database properties for the AdventureWorks database on the
principal server (SQL2016DXD01, in this example), you see the server network
address for the principal server automatically appear now when you look at the
Database Properties Mirroring page (see Figure 7.15).
For the witness server (notice that the role is now witness), you run the following:
-- create endpoint for witness server --
CREATE ENDPOINT [EndPoint4DBMirroring51450]
STATE=STARTED
AS TCP (LISTENER_PORT = 51450, LISTENER_IP = ALL)
FOR DATA_MIRRORING (ROLE = WITNESS, AUTHENTICATION = WINDOWS NEGOTIATE,
    ENCRYPTION = REQUIRED ALGORITHM RC4)
Granting Permissions
It is possible to have an AUTHORIZATION [login] statement in the CREATE
ENDPOINT command that establishes the permissions for a login account to the
endpoint being defined. However, separating this out into a GRANT greatly stresses
the point of allowing this connection permission. From each SQL query connection,
you run a GRANT to allow a specific login account to connect on the ENDPOINT for
database mirroring. If you don't have a specific login account to use, default it to [NT
AUTHORITY\SYSTEM].
From the principal server instance (SQL2016DXD01), you run the following GRANT
(substituting [NT AUTHORITY\SYSTEM] with your specific login account to be used
by database mirroring):
GRANT CONNECT ON ENDPOINT::EndPoint4DBMirroring51430 TO [NT AUTHORITY\SYSTEM];
Then, from the mirror server instance (SQL2016DXD02), you run the following
GRANT:
GRANT CONNECT ON ENDPOINT::EndPoint4DBMirroring51440 TO [NT AUTHORITY\SYSTEM];
Then, from the witness server instance (SQL2016DXD03), you run the following
GRANT:
GRANT CONNECT ON ENDPOINT::EndPoint4DBMirroring51450 TO [NT AUTHORITY\SYSTEM];
Creating the Database on the Mirror Server
When the endpoints are configured and roles are established, you can create the
database on the mirror server and get it to the point of being able to mirror. You must
first make a backup copy of the principal database (AdventureWorks, in this
example). This backup will be used to create the database on the mirror server. You
can use SSMS tasks or SQL scripts to do this. The SQL scripts
(DBBackupAW2016.sql), which are easily repeatable, are used here.
On the principal server, you make a complete backup as follows:
BACKUP DATABASE [AdventureWorks]
   TO DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD01\MSSQL\Backup\AdventureWorks4Mirror.bak'
   WITH FORMAT
GO
Next, you copy this backup file to a place where the mirror server can reach it on the
network. When that is complete, you can issue the following database RESTORE
command to create the AdventureWorks database on the mirror server (using the
WITH NORECOVERY option):
-- use this restore database (with NoRecovery option)
-- to create the mirrored version of this DB --
RESTORE FILELISTONLY
   FROM DISK = 'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD01\MSSQL\Backup\AdventureWorks4Mirror.bak'
go
RESTORE DATABASE AdventureWorks
   FROM DISK = 'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD01\MSSQL\Backup\AdventureWorks4Mirror.bak'
   WITH NORECOVERY,
   MOVE 'AdventureWorks_Data' TO 'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD02\MSSQL\Data\AdventureWorks_Data.mdf',
   MOVE 'AdventureWorks_Log' TO 'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD02\MSSQL\Data\AdventureWorks_Log.ldf'
GO
Because you don't necessarily have the same directory structure on the mirror server,
you use the MOVE option as part of this restore to place the database files in the
location you desire.
The restore process should yield something that looks like the following result set:
Processed 24216 pages for database 'AdventureWorks', file
'AdventureWorks_Data' on file 1.
Processed 3 pages for database 'AdventureWorks', file
'AdventureWorks_Log' on file 1.
RESTORE DATABASE successfully processed 24219 pages in 5.677
seconds (33.328 MB/sec).
You must now apply at least one transaction log dump to the mirror database. This
brings the mirror database to a point of synchronization with the principal and leaves
the mirror database in the restoring state. At this database recovery point, you can
run through the Database Mirroring Wizard and start mirroring for high availability.
From the principal server, you dump (that is, back up) a transaction log as follows:
BACKUP LOG [AdventureWorks] TO
   DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD01\MSSQL\Backup\AdventureWorks4MirrorLog.bak'
   WITH FORMAT
Go
This backup should yield output similar to the following:
Processed 4 pages for database 'AdventureWorks', file
'AdventureWorks_Log' on file 1.
BACKUP LOG successfully processed 4 pages in 0.063 seconds
(0.496 MB/sec).
Then you move this backup to a place where it can be reached by the mirror server.
When that is done, you restore the log to the mirror database. From the mirror server,
you restore the transaction log as follows:
RESTORE LOG [AdventureWorks]
   FROM DISK = N'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD02\MSSQL\Backup\AdventureWorks4MirrorLog.bak'
   WITH FILE = 1, NORECOVERY
GO
Note
In the WITH FILE = statement, the file number must match the value in
the backup log results (see the on file 1 reference in the previous
code).
The restore log process should yield something that looks like the following result
set:
Processed 0 pages for database 'AdventureWorks', file
'AdventureWorks_Data' on file 1.
Processed 4 pages for database 'AdventureWorks', file
'AdventureWorks_Log' on file 1.
RESTORE LOG successfully processed 4 pages in 0.007 seconds
(4.464 MB/sec).
Note
You might need to update the FILE = x entry in the RESTORE LOG
command to correspond to the on file value given during the log
backup.
You are now ready to mirror the database in high availability mode.
Now, you are ready for the final step: From the principal server, you identify the
mirror and witness. After you complete this step, the database mirroring topology
tries to synchronize itself and begin database mirroring. The following statements
identify the mirror server endpoint and witness server endpoint to the principal
server's database:
-- From the Principal Server Database: identify the mirror
server endpoint --
ALTER DATABASE AdventureWorks
SET PARTNER = 'TCP://DXD001:51440'
GO
-- From the Principal Server Database: identify the witness
server endpoint --
ALTER DATABASE AdventureWorks
SET WITNESS = 'TCP://DXD001:51450'
GO
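If the mirror server's copy of the database has not already been pointed back at the principal, a corresponding ALTER DATABASE must be run from the mirror server instance before the statements above will succeed. A minimal sketch, using this chapter's principal endpoint port, follows:
-- From the Mirror Server Database: identify the principal server endpoint --
ALTER DATABASE AdventureWorks
SET PARTNER = 'TCP://DXD001:51430'
GO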
You do not have to alter any database from the witness server.
When this process completes successfully, you are mirroring! In fact, with this
configuration, you are in automatic failover mode.
If you have issues or just want to start over, you can drop an endpoint or alter an
endpoint quite easily. To drop an existing endpoint, you use the DROP ENDPOINT
command. In this example, the following command would drop the endpoint you just
created:
-- To DROP an existing endpoint --
DROP ENDPOINT EndPoint4DBMirroring51430;
However, because you use the port in the endpoint name, it might be best to just drop
and create a new endpoint to fit the naming convention. Either way, you can easily
manipulate these endpoints to fit your networking needs.
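If you only need to stop or restart an endpoint rather than drop it, ALTER ENDPOINT works as well. A quick sketch:
-- To stop and later restart an existing endpoint --
ALTER ENDPOINT EndPoint4DBMirroring51430 STATE = STOPPED;
ALTER ENDPOINT EndPoint4DBMirroring51430 STATE = STARTED;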
As you can see in Figure 7.16, the databases are fully synchronizing for mirroring,
and you are now in full-safety high availability mode.
Removing Mirroring
Very likely, you will have to remove all traces of database mirroring from each
server instance of a database mirroring configuration at some point. Doing so is
actually pretty easy. Basically, you have to disable mirroring of the principal, drop
the mirror server's database, and remove all endpoints from each server instance.
You can simply start from the Database Properties page and the Mirroring option
and do the whole thing. Alternatively, you can do this through SQL scripts. Let's first
use the Mirroring options. Looking at the options in Figure 7.21, you simply choose
to remove mirroring (from the principal server instance). This is just a bit too easy to
do—almost dangerous!
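If you prefer scripts, the principal-side portion of the teardown is a short sketch (using the endpoint name created earlier in this chapter):
-- From the principal server instance: turn off mirroring and drop its endpoint --
ALTER DATABASE AdventureWorks SET PARTNER OFF
go
DROP ENDPOINT EndPoint4DBMirroring51430
go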
From the mirror server instance (not the principal!), you run the DROP DATABASE and
DROP ENDPOINT SQL commands, as follows:
DROP DATABASE AdventureWorks
go
DROP ENDPOINT EndPoint4DBMirroring51440
go
From the witness server instance, you remove the endpoint as follows:
DROP ENDPOINT EndPoint4DBMirroring51450
go
To verify that you have removed these endpoints from each server instance, you
simply run the following SELECT statements:
select name,type_desc,port,ip_address from sys.tcp_endpoints
select name,role_desc,state_desc from
sys.database_mirroring_endpoints
The output from these commands consists of informational messages only; no user
action is required. As you can see from these messages, you are now in a state of no database mirroring. You have
to completely build up database mirroring again if you want to mirror the database
again.
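To test failover from the principal to the mirror, you issue a manual failover from the principal server's database. A sketch (the database must be in a synchronized, full-safety state for this to succeed):
-- From the Principal Server Database: manually fail over to the mirror --
ALTER DATABASE AdventureWorks SET PARTNER FAILOVER
GO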
This command has the same effect as using SSMS or even shutting down the
principal SQL Server instance service.
Note
You cannot bring the principal offline as you can do in an unmirrored
configuration.
As you can see in Figure 7.24, this would be the live configuration of the
principal server (DXD001\SQL2016DXD01), the mirror server
(DXD001\SQL2016DXD02), and the reporting database snapshot
(SNAP_AdventureWorks_REPORTING), as shown from SQL Server Management
Studio.
FIGURE 7.24 SQL Server Management Studio, showing database mirroring with
a database snapshot for reporting configuration.
If the principal fails over to the mirror, you drop the database snapshot that is
currently created off that database and create a new one on the old principal
(now the mirror), as shown in the following steps.
2. Drop the reporting database snapshot on the new principal server (the principal
is now DXD001\SQL2016DXD02):
Use [master]
go
DROP DATABASE SNAP_AdventureWorks_REPORTING
go
3. Create the new reporting database snapshot on the new mirrored database server
(the mirror is now DXD001\SQL2016DXD01):
Use [master]
go
CREATE DATABASE SNAP_AdventureWorks_REPORTING
ON ( NAME = AdventureWorks_Data, FILENAME= 'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD01\MSSQL\DATA\SNAP_AdventureWorks_data_REPORTING.snap')
AS SNAPSHOT OF AdventureWorks
Go
That's it. You now have your reporting users completely isolated from your principal
server (and the transactional users) again. Life can return to normal very quickly.
Summary
This chapter covers two fairly complex and complementary solutions that can
potentially be leveraged for high availability needs. Both are fairly easy to
implement and manage. Both are also time tested and have provided many years of
success to many companies around the globe. A database snapshot can be thought of
as an enabling capability with many purposes. It can be great for fulfilling point-in-
time reporting requirements easily, reverting a database to a point in time
(recoverability and availability), insulating a database from issues that may arise
during mass updates, and potentially reducing the processing impact of querying
against the primary transactional databases (via database mirroring and database
snapshots). You must remember that database snapshots are point in time and read-
only. The only way to update a snapshot is to drop it and re-create it. Data latency of
this point-in-time snapshot capability must always be made very clear to any of its
users.
A database snapshot is a snapshot of an entire database, not a subset. This clearly
makes database snapshots very different from alternative data access capabilities, such as
data replication and materialized views. This feature is made possible via
copy-on-write technology. It is
certainly an exciting extension to SQL Server but is not to be used as a substitute for
good old database backups and restores. Database snapshots are one capability that I
recommend you consider using as soon as possible.
Database mirroring provides a way for users to get to a minimum level of high
availability for their databases and applications without having to use complex
hardware and software configurations (as are needed with Cluster Services, SQL
Server clustering, and higher OS and SQL editions that support AlwaysOn
configurations). Even though database mirroring has been a great addition to SQL
Server, it will be deprecated in the not-too-distant future, so use some caution here.
As mentioned earlier in this chapter, the core technology components that comprise
database mirroring have been utilized in the AlwaysOn availability groups
capability. In fact, it is at the core of availability groups (which is explained in
Chapter 6). Both of these technologies will play nicely into some organizations’
needs for high availability and, as your confidence with these technologies increases,
they will provide a basis for graduating to more robust solutions as your needs
change.
CHAPTER 8. SQL Server Data Replication
IN THIS CHAPTER
Using Data Replication for High Availability
The Publisher, Distributor, and Subscriber Metaphor
Replication Scenarios
Subscriptions
The Distribution Database
Replication Agents
User Requirements Driving the Replication Design
Setting Up Replication
Switching Over to a Warm Standby (Subscriber)
Monitoring Replication
Scenario 2: Worldwide Sales and Marketing with Data Replication
Yes, you can use data replication as a high availability solution! It depends on your
HA requirements, of course. Originally, the Microsoft SQL Server implementation
of data replication was created to distribute data to another location for location-
specific use. Replication can also be used to "offload" processing from a very busy
server, such as an online transaction processing (OLTP) application server to a
second server for use for things like reporting or local referencing. In this way, you
can use replication to isolate reporting or reference-only data processing away from
the primary OLTP server without having to sacrifice performance of that OLTP
server. Data replication also is well suited to support naturally distributed data that
has very distinct users (such as a geographically oriented order entry system). As
data replication has become more stable and reliable, it has been used to create
“warm,” almost “hot,” standby SQL Servers. If failures ever occur with the primary
server in certain replication topologies, the secondary (replica) server can still be
used for work. When the failed server is brought back up, the replication
of data that changed will catch up, and all the data will be resynchronized.
Snapshot Replication
Snapshot replication involves making an image of all the tables in a publication at a
single moment in time and then moving that entire image to the subscribers. Little
overhead on the server is incurred because snapshot replication does not track data
modifications, as the other forms of replication do. It is possible, however, for
snapshot replication to require large amounts of network bandwidth, especially if the
articles being replicated are large. Snapshot replication is the easiest form of
replication to set up, and it is used primarily with smaller tables for which
subscribers do not have to perform updates. An example of this might be a phone list
that is to be replicated to many subscribers. This phone list is not considered to be
critical data, and the frequency with which it is refreshed is more than enough to
satisfy all its users.
Transactional Replication
Transactional replication is the process of capturing transactions from the transaction
log of the published database and applying them to the subscription databases. With
SQL Server transactional replication, you can publish all or part of a table, views, or
one or more stored procedures as an article. All data updates are then stored in a
distribution database and sent to, and subsequently applied at, any number of
subscribing servers. Obtaining these updates from the publishing database’s
transaction log is extremely efficient. No direct reading of tables is required except
during the initialization process, and only a minimal amount of traffic is generated
over the network. This has made transactional replication the most often used
method.
As data changes are made, they are propagated to the other sites in near real time;
you determine the frequency of this propagation. Because changes are usually made
only at the publishing server, data conflicts are avoided for the most part. For
example, subscribers of the published data usually receive these updates in a few
seconds, depending on the speed and availability of the network.
Merge Replication
Merge replication involves getting the publisher and all subscribers initialized and
then allowing data to be changed at all sites involved in the merge replication, both at the
publisher and at all subscribers. All these changes to the data are subsequently
merged at certain intervals so that, again, all copies of the database have identical
data. Occasionally, data conflicts have to be resolved. The publisher does not always
win in a conflict resolution. Instead, the winner is determined by whatever criteria
you establish.
With transactional replication in the instantaneous replication mode, data changes on
the primary server (publisher) are replicated to one or more secondary servers
(subscribers) extremely quickly. This type of replication can essentially create a
“warm standby” SQL Server that is as fresh as the last transaction log entries that
made it through the distribution server mechanism to the subscriber. In many cases,
it can actually be considered a hot standby because of increasingly faster network
speeds between locations. And, along the way, there are numerous side benefits,
such as achieving higher degrees of scalability and mitigating failure risk. Figure 8.1
shows a typical SQL Server data replication configuration that can serve as a basis
for high availability and that also, at the same time, fulfills a reporting server
requirement.
Filtering Articles
You can create articles on SQL Server in several different ways. The basic way to
create an article is to publish all the columns and rows that are contained in a table.
Although this is the easiest way to create articles, your business needs might require
that you publish only certain columns or rows from a table. This is referred to as
filtering, and it can be done both vertically and horizontally. Vertical filtering filters
only specific columns, whereas horizontal filtering filters only specific rows. In
addition, SQL Server 2016 provides the added functionality of join filters and
dynamic filters. (We discuss filtering here because, depending on what type of high
availability requirements you have, you may need to employ one or more of these
techniques within data replication.)
As you can see in Figure 8.6, you might only need to replicate a customer’s customer
ID, TerritoryID, and the associated customer account numbers to various
subscribing servers around your company (vertical filtering). Or, as shown in Figure
8.7, you might need to publish only the Customers table data that is in a specific
region, in which case you would need to geographically partition the data (horizontal
filtering).
It is also possible to combine both horizontal and vertical filtering, as shown in
Figure 8.8. This allows you to pare out unneeded columns and rows that aren’t
required for replication. For example, you might only need the “west"
(TerritoryID=1) region data and need to publish only the CustomerID,
TerritoryID, and AccountNumber data.
FIGURE 8.6 Vertical filtering is the process of creating a subset of columns from
a table to be replicated to subscribers.
FIGURE 8.7 Horizontal filtering is the process of creating a subset of rows from a
table to be replicated to subscribers.
FIGURE 8.8 Combining horizontal and vertical filtering allows you to pare down
the information in an article to only the important information.
As mentioned earlier, it is now possible to use join filters. Join filters enable you to
extend a filter created on one table to a related table. For example, if
you are publishing the Customers table data based on the region (west), you can
extend filtering to the Orders and Order Details tables for the west region customers'
orders only, as shown in Figure 8.9. This way, you will only be replicating orders for
customers in the west to a location that only needs to see that specific data. This can
be very efficient if it is done well.
You also can publish stored procedure executions as articles, along with their
parameters. This can be either a standard procedure execution article or a serializable
procedure execution article. The difference is that the latter is executed as a
serializable transaction, and the other is not. A serializable transaction is a
transaction that is being executed with the serializable isolation level, which places a
range lock on the affected data set, preventing other users from updating or inserting
rows into the data set until the transaction is complete.
FIGURE 8.9 Horizontal and join publication.
Publishing stored procedure executions as articles gets you a major reduction in the
volume of SQL statements replicated across your network. For instance, if you
wanted to update the Customers table for every customer between customerID 1
and customerID 5000, the Customers table updates would be replicated as a large
multistep transaction involving 5,000 separate update statements. This would
significantly bog down your network. However, with stored procedure execution
articles, only the execution of the stored procedure is replicated to the subscription
server, and the stored procedure is executed on that subscription server. Figure 8.10
illustrates the difference in execution described earlier. Some subtleties when using
this type of data replication processing can’t be overlooked, such as making sure the
published stored procedure behaves the same on the subscribing server side. Just to
be safe, you should have abbreviated testing scripts that can be run on the subscriber,
whose results can then be verified against the results of the same scripts run on the publisher.
FIGURE 8.10 Stored procedure execution comparison.
Now, it is essential to learn about the different types of replication scenarios that can
be built and the reasons any one of them would be desired over the others. It is worth
noting that Microsoft SQL Server 2016 supports replication to and from many
different heterogeneous data sources. For example, OLE DB or ODBC data sources
(including Microsoft Exchange, Microsoft Access, Oracle, and DB2) can subscribe
to SQL Server publications, as well as publish data.
Replication Scenarios
In general, depending on your business requirements, you can implement one of
several different data replication scenarios, including the following:
Central publisher
Central publisher with a remote distributor
Publishing subscriber
Central subscriber
Multiple publishers or multiple subscribers
Merge replication
Peer-to-peer replication
Updating subscribers
For high availability, the two central publisher topologies are the most appropriate.
These two are, by far, the best to use for a near-real-time and simple-to-set-up
hot/warm spare HA solution.
Note
To learn more about other uses of data replication, refer to Sams
Publishing's SQL Server Unleashed, which expands on this subject for
all processing use case scenarios.
Central Publisher
The central publisher replication model, as shown in Figure 8.11, is Microsoft’s
default scenario. In this scenario, one SQL Server performs the functions of both
publisher and distributor. The publisher/distributor can have any number of
subscribers, which can come in many different varieties, such as most SQL Server
versions, MySQL, and Oracle.
FIGURE 8.11 The central publisher scenario is a simple and frequently used
scenario.
The central publisher scenario can be used in the following situations:
To create a copy of a database for ad hoc queries and report generation (classic
use)
To publish master lists to remote locations, such as master customer lists or
master price lists
To maintain a remote copy of an OLTP database that can be used by the remote
sites during communication outages
To maintain a spare copy of an OLTP database that can be used as a hot spare in
case of server failure
However, it’s important to consider the following for this scenario:
If your OLTP server’s activity is substantial and affects greater than 10% of your
total data per day, then this central publisher scenario is not for you. Other
replication configuration scenarios will better fit your needs or, if you're trying
to achieve HA, another HA option may serve you better.
If your OLTP server is maxed out on CPU, memory, and disk utilization, you
should consider using another data replication scenario. Again, the central
publisher scenario is not for you. There would be no bandwidth on this server to
support the replication overhead.
Pull Subscriptions
As shown in Figure 8.13, a pull subscription is set up and managed by the
subscription server. The biggest advantage here is that pull subscriptions allow the
system administrators of the subscription servers to choose what publications they
will receive and when they will be received. With pull subscriptions, publishing and
subscribing are separate acts and are not necessarily performed by the same user. In
general, pull subscriptions are best when the publication does not require high
security or when subscribing is done intermittently, as the subscriber’s data needs to
be periodically brought up to date.
Push Subscriptions
A push subscription is created and managed by the publication server. In effect, the
publication server is pushing the publication to the subscription server. The
advantage of using push subscriptions is that all the administration takes place in a
central location. In addition, publishing and subscribing happen at the same time,
and many subscribers can be set up at once. A push subscription is recommended
when dealing with heterogeneous subscribers because of the lack of pull capability
on the subscription server side. You may want to use the push subscription approach
for a high availability configuration that will be used in a failover scenario.
Replication Agents
SQL Server utilizes replication agents to do different tasks during the replication
process. These agents wake up at regular intervals and fulfill
specific jobs. Let’s look at the main ones.
Caution
Make sure you have enough disk space on the drive that contains the
temporary working directory (the snapshot folder). The snapshot data
files may be huge, and that is a primary reason for the high rate of
snapshot failure. The amount of disk space also directly affects high
availability. Filling up a disk will translate to some additional unplanned
downtime. You’ve been warned!
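There is no built-in alert for this, but you can keep an eye on free space from T-SQL. A minimal sketch using sys.dm_os_volume_stats follows (it reports only on volumes that hold database files, so it is most useful when your snapshot folder shares a volume with them):
-- Report free space on every volume that holds a database file --
SELECT DISTINCT vs.volume_mount_point,
       vs.available_bytes / 1048576 AS free_mb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;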
Note
Truncating and fast bulk-copying into a table are non-logged processes.
In tables marked for publication, you cannot perform non-logged
operations unless you temporarily turn off replication on that table.
Then you need to re-sync the table on the subscriber before you
reenable replication.
Note
If you have triggers on your tables and want them to be replicated along
with your table, you might want to revisit them and add a line of code
that reads NOT FOR REPLICATION so that the trigger code isn’t executed
redundantly on the subscriber side. So, for a trigger (an insert, update, or
delete trigger) on the subscriber, you would use the NOT FOR
REPLICATION statement for the whole trigger (placed before the AS
statement of the trigger). If you want to be selective on a part of the
trigger code (for example, FOR INSERT, FOR UPDATE, FOR DELETE),
you put NOT FOR REPLICATION immediately following the statements
you don’t want to execute and put nothing on the ones you do want to
execute.
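A minimal sketch of a subscriber-side trigger marked this way (the table and trigger names are illustrative only):
-- Subscriber-side trigger that is skipped when the replication agent applies changes --
CREATE TRIGGER trg_Customer_Audit
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
NOT FOR REPLICATION
AS
BEGIN
    SET NOCOUNT ON;
    -- normal trigger logic here (auditing, derived column maintenance, and so on)
    PRINT 'Trigger fired by a local (non-replication) change.';
END
GO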
Setting Up Replication
In general, SQL Server 2016 data replication is exceptionally easy to set up via SQL
Server Management Studio. Be sure to generate SQL scripts for every phase of your
replication configuration. In a production environment, you most likely will rely
heavily on scripts and will not have the luxury of having much time to set up and
break down production replication configurations via manual configuration steps.
You have to define any data replication configuration in the following order:
1. Create or enable a distributor to enable publishing.
2. Enable/configure publishing (with a distributor designated for a publisher).
3. Create a publication and define articles within the publication.
4. Define subscribers and subscribe to a publication.
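These phases map directly to the replication system stored procedures that the wizards call under the covers. The following is a minimal sketch only (the server, publication, and database names come from this chapter's example; the distributor password and the choice of a single article are illustrative assumptions):
-- 1. On the distributor (SQL2016DXD02): enable distribution --
exec sp_adddistributor @distributor = N'DXD001\SQL2016DXD02', @password = N'<StrongPasswordHere>';
exec sp_adddistributiondb @database = N'distribution';
-- 2. On the publisher (SQL2016DXD01): point to the remote distributor and enable the database --
exec sp_adddistributor @distributor = N'DXD001\SQL2016DXD02', @password = N'<StrongPasswordHere>';
exec sp_replicationdboption @dbname = N'AdventureWorks', @optname = N'publish', @value = N'true';
-- 3. In the AdventureWorks database on the publisher: create the publication and an article --
exec sp_addpublication @publication = N'AW2AW4HA', @repl_freq = N'continuous', @status = N'active';
exec sp_addarticle @publication = N'AW2AW4HA', @article = N'Person',
     @source_owner = N'Person', @source_object = N'Person';
-- 4. In the AdventureWorks database on the publisher: create the push subscription --
exec sp_addsubscription @publication = N'AW2AW4HA', @subscriber = N'DXD001\SQL2016DXD03',
     @destination_db = N'AdventureWorks', @subscription_type = N'push';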
Next you will set up a transactional replication configuration (as shown in Figure
8.15) with three servers and publish the AdventureWorks database to a secondary
server (subscriber) to fulfill the high availability hot/warm spare use case.
In this example you will use the SQL2016DXD01 instance as the publisher, the
SQL2016DXD02 server as the remote distributor, and SQL2016DXD03 as the
subscriber. As you know, you start the whole configuration process by enabling the
distribution server first. One thing you will notice with the replication configuration
capabilities in SQL Server Management Studio is the extensive use of wizards.
FIGURE 8.15 A central publisher with remote distributor configuration for HA.
Enabling a Distributor
The first thing you need to do is to designate a distribution server to be used by the
publisher. As discussed earlier, you can either configure the local server as the
distribution server or choose a remote server as the distributor. For this HA
configuration, you want a remote distribution server that is separate from the
publisher. You'll also have to be a member of the SYSADMIN server role to use these
configuration wizards. From the intended distribution server (SQL2016DXD02 in
this example), you right-click on the replication node to invoke the distribution
configuration wizard (see Figure 8.16).
FIGURE 8.16 Configuring a SQL Server instance as a distributor for a publisher.
Figure 8.16 shows the first option selected, designating the SQL2016DXD02 server
to act as its own distributor, which will result in the distribution database being
created on this server.
Next, as shown in the upper left of Figure 8.17, you are asked to specify a snapshot
folder. Give it the proper network pathname. Remember that tons of data will be
moving through this snapshot folder, so it should be on a drive that can support the
snapshot without filling up the drive. Next comes the distribution database name and
location on the distribution server. It's best to just accept the defaults here, as shown
in the upper right of Figure 8.17. Next, you specify what publisher this distribution
server will distribute for. As shown in the lower left of Figure 8.17, add (and check)
the SQL2016DXD01 server to the publishers list. Finally, the wizard finishes the
distribution processing (as shown in the lower right of Figure 8.17) by creating the
distribution database and setting up the distribution agents needed, along with the
access to the publication server for any database that is to be published later.
FIGURE 8.17 Specify the snapshot folder, distribution database, publishers, and
finish processing.
Now you can get to the business of creating a publisher and a publication for the HA
configuration.
Publishing
Because you have created a remote distributor, you only need to “configure” a
publisher to use the remote distributor and then create the publications that are to be
published.
As you can see in Figure 8.18, in SQL Server Management Studio, you can navigate
to the Replication node in the Object Explorer on the publication server and right-
click the Local Publication node to choose to create new publications, launch
replication monitor, generate scripts, or configure distribution (if you need to do this
locally). In this example, you want the DXD001\SQL2016DXD01 SQL Server.
Choose New Publication to invoke the New Publication Wizard.
FIGURE 8.18 New Publication Wizard and specifying the distributor.
The next dialog in this wizard prompts you for the distributor. In this case, select the
second option to specify the SQL2016DXD02 server as the distributor (thus a remote
distributor). Because you already enabled this server as a distributor and identified
the SQL2016DXD01 server as a publisher, the option appears by default. Click Next
to create this remote distributor. Now you are ready to create a publication.
Creating a Publication
You are now prompted to select the database for which you are going to set up a
publication (as shown in the upper left of Figure 8.19). For this example, you'll be
publishing the AdventureWorks database.
You are now asked to specify the type of replication method for this publication (as
you can see in the upper right of Figure 8.19). This will be either a snapshot
publication, transactional publication, peer-to-peer publication, or merge publication.
Select a transactional publication in this case.
FIGURE 8.19 Choosing the publication database, publication type, and articles to
publish.
In the Articles dialog, you are prompted to identify articles in your publication (see
the bottom left of Figure 8.19). You must include at least one article in your
publication. For this example, select all the objects to publish: Tables, Views,
Indexed Views, User Defined Functions, and Stored Procedures. Remember that you
are trying to create an exact image of the publisher to use as a warm standby, and it
must have all objects included. If your table has triggers, you may elect to leave this
item unchecked and then run a script on the subscription side with the trigger code
that contains the NOT FOR REPLICATION option. You will not be doing any table
filtering for this HA publication, so just click Next in that dialog.
For transactional replication, you must determine how the snapshot portion of the
replication will occur. The snapshot agent will create a snapshot immediately and
keep that snapshot available to initialize subscriptions. This will include a snapshot
of the schema and the data. You can choose to have the snapshot agent run
immediately as opposed to setting a scheduled time for it to begin its processing, as
shown in the upper left of Figure 8.20. You also need to provide access credentials to
the snapshot agent for it to connect to the publisher for all the publication creation
activity and the log reader agent processing (which will feed the transactions to the
subscriber via the distributor), as shown in the upper right of Figure 8.20. In the next
dialog box, indicate what you want done at the end of the wizard processing.
FIGURE 8.20 Scheduling the snapshot agent, specifying agent security, and
completing the wizard.
The last dialog box in Figure 8.20 shows a summary of the tasks that will be
processed and the place where you will name the publication. Name it AW2AW4HA (for
AdventureWorks to AdventureWorks for high availability).
Figure 8.21 shows the creation processing steps for this new publication. All action
statuses should read Success. Now that you have installed and configured the
remote distributor, enabled publishing, and created a publication, you need to create
a subscription.
FIGURE 8.21 SQL Server Management Studio with the new snapshot and log reader agents.
Creating a Subscription
Remember that you can create two types of subscriptions: push and pull. Pull
subscriptions allow remote sites to subscribe to publications and initiate their
processing from the subscriber side. Push subscription processes are performed and
administered from the distributor side. Because you are creating this subscriber to be
a failover server, you should choose the push subscription approach: you
don't want additional agents on that subscriber, and you want to administer all
processing from one place (the distributor).
Figure 8.22 shows the creation of a new subscription on the subscription server. In
this example, it is the SQL2016DXD03 server. You simply right-click the Local
Subscriptions replication node option and select New Subscriptions.
FIGURE 8.22 Creating a new subscription on the subscriber.
In particular, you will be creating a push subscription (pushed from the distributor),
so you need to identify the publisher that you'll be subscribing to. As shown on the
right in Figure 8.22, you select <Find SQL Server Publisher>, which allows you to
connect to the desired SQL Server instance (SQL2016DXD01 in this case) and
establish access to the publications on that publisher. Once you're connected, the
publications available on that publisher are listed for you to choose from (as shown
in the upper left of Figure 8.23).
FIGURE 8.23 Choosing the publication to subscribe to.
Once you've selected the publication you are interested in subscribing to, you need to
decide whether this will be a push or pull subscription. As mentioned earlier, you
want to make this a push subscription that will run all agents on the distribution
server, so select the first option, as shown in the upper right of Figure 8.23.
Next, you need to identify the database location for the subscriber (SQL2016DXD03
in this example) and specify the name of the database that will be the target of the
subscription. In this case, give the database the same name as the original database
because you will potentially use this database for failover for your applications. As
shown in Figure 8.23, choose to create a new database with the same name
(AdventureWorks), which will receive the publication from the publisher.
You've now identified the database to hold the publication. Next, you specify the
access credentials (accounts) that the distribution agents use for the subscriber.
Figure 8.24 shows the selected subscriber database (AdventureWorks) and the
distribution agent security specification. Once access is specified, the
synchronization schedule must be established for the transactions to be pushed to the
subscriber. You want the latency to be as short as possible to guarantee that data is
being pushed to the subscriber as quickly as possible. As shown in the lower left of
Figure 8.24, choose the Run Continuously mode for synchronization. This means
that as transactions arrive at the distributor, they will be continuously pushed to the
subscriber as fast as possible. This should limit unapplied transactions at the
subscriber significantly.
FIGURE 8.24 Setting distribution agent access and the synchronization schedule.
As you can see in Figure 8.25, you have to specify how the subscription will be
initialized and when that should take place. You have an option to create the schema
and data at the subscriber (and also to do it immediately) or to skip this initialization
altogether because you have already created the schema and loaded the data
manually. Choose to have the initialization create the schema and initialize the data
immediately by checking the Initialize box and selecting Immediately. This is all you
have to do at this point. The next dialog shows the summary of what is about to be
initiated. Once you kick this off, the Creating Subscription(s) dialog shows you the
progress of the subscription creation. By now you have probably noticed that you are
always asking for the wizard to generate the script for what it is executing, so you
can do this later without having to wade through the wizard. This is a solid
management procedure. This book includes all the scripts you need for setting up the
replication topology you've just walked through.
Tip
Make sure you have kept your SQL logins/users synchronized and up to
date in both the publisher and the subscriber SQL Server instances.
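One quick way to spot drift is to list the logins on each instance and compare the results (a sketch):
-- Run on both the publisher and the subscriber and compare the output --
SELECT name, type_desc, create_date
FROM sys.server_principals
WHERE type IN ('S', 'U', 'G')   -- SQL logins, Windows logins, Windows groups
ORDER BY name;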
Monitoring Replication
When replication is up and running, you need to monitor the replication and see how
things are running. You can do this in several ways, including using SQL statements,
SQL Server Management Studio's Replication Monitor, and Windows Performance
Monitor (PerfMon counters).
Basically, you are interested in the agent’s successes and failures, the speed at which
replication is done, and the synchronization state of tables involved in replication
(that is, all data rows present on both the publisher and the subscriber). Other things
to watch for are the sizes of the distribution database, the growth of the subscriber
databases, and the available space on the distribution server’s snapshot working
directory.
SQL Statements
You need to validate that the data is in both the publisher and the subscriber. You
can use the publication validation stored procedure (sp_publication_validation)
to do this fairly quickly. This will give you actual row count validation. The
following command checks the row counts of the publication and subscribers:
exec sp_publication_validation @publication = N'AW2AW4HA'
go
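Beyond row counts, you can also check how far behind a subscriber is by asking the distribution database for undelivered commands. A sketch follows (run against the distribution database; the parameter values reflect this chapter's topology):
-- How many commands are still waiting to be delivered to the subscriber?
exec distribution..sp_replmonitorsubscriptionpendingcmds
     @publisher = N'DXD001\SQL2016DXD01',
     @publisher_db = N'AdventureWorks',
     @publication = N'AW2AW4HA',
     @subscriber = N'DXD001\SQL2016DXD03',
     @subscriber_db = N'AdventureWorks',
     @subscription_type = 0;   -- 0 = push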
Summary
Data replication is a powerful feature of SQL Server that can be used in many
business situations. Companies can use replication for anything from roll-up
reporting to relieving the main server from ad hoc queries and reporting. However,
applying it as a high availability solution can be very effective if your requirements
match well to its capability. Determining the right replication option and
configuration to use is somewhat difficult, but actually setting it up is pretty easy.
Microsoft has come a long way in this regard. As with Scenario 2, if your
requirements do not call for extreme availability, you may use data replication for high
availability. It is more than production-worthy, and the flexibility it offers and the
overall performance are just short of incredible, incredible, incredible (replication
humor for you).
CHAPTER 9. SQL Server Log Shipping
IN THIS CHAPTER
Poor Man’s High Availability
Setting Up Log Shipping
Scenario 4: Call Before Digging with Log Shipping
A direct method of creating a completely redundant database image for higher
availability is to use log shipping. Microsoft certifies log shipping as a method of
creating an “almost” hot spare. Some folks use log shipping as an alternative to data
replication. Log shipping used to be referred to as “poor man’s replication” when
replication was an add-on product. Now that log shipping and replication are both
included in the box, neither one is really any more expensive to implement than the
other. However, they differ in terms of how they replicate. Log shipping uses the
transaction log entries, whereas replication uses SQL statements. This is hugely
different. The log shipping method has three components:
Making a full backup of a database (database dump) on a “source” server (which
you want to be the origin of all transactions to other servers)
Creating a copy of that database on one or more other servers from that dump
(called destinations)
Continuously applying transaction log dumps from that “source” database to the
“destination” databases
This is the dump, copy, restore sequence. In other words, log shipping effectively
replicates the data of one server (the source) to one or more other servers (the
destinations) via transaction log dumps. Destination servers are read-only.
Note
You can actually set up log shipping to work entirely within a single
SQL Server instance if you wish. This may be useful if you’re doing
extensive testing or in other situations that can benefit from a separate
copy of the “source” database or if you want to isolate reporting to a
separate SQL Server instance. Log shipping is typically done from one
SQL Server instance to another, regardless of the location of the
destination server (as in another data center). You are only limited by
the communication stability and consistency between SQL instances.
The amount of data latency that exists between the source and destination database
images is the main determining factor in understanding the state of your
recoverability and failover capabilities. You need to set up these data latency (delay)
values as part of the log shipping configuration.
These are the primary factors in using log shipping as the method of creating and
maintaining a redundant database image:
Data latency is an issue. This is the time between the transaction log dumps on
the source database and when these dumps get applied to the destination
databases.
Sources and destinations must be the same SQL Server version.
Data is read-only on the destination SQL Server until the log shipping pairing is
broken (as it must be, to guarantee that the transaction logs can be applied to the
destination SQL Server).
The data latency restrictions might quickly disqualify log shipping as a foolproof
high availability solution, though. However, log shipping might be adequate for
certain HA situations. If a failure ever occurs on the primary SQL Server, a
destination SQL Server that was created and maintained via log shipping can be
swapped into use at a moment’s notice. It would contain exactly what was on the
source SQL Server (right down to every user ID, table, index, and file allocation
map, except for any changes to the source database that occurred after the last log
dump was applied). This directly achieves a level of high availability. It is still not
quite completely transparent, though, because the SQL Server instance names are
different, and the end user may be required to log in again to the new SQL Server
instance. But unavailability is usually minimal.
Design and Administration Implications of Log Shipping
From a design and administration point of view, you need to consider some
important aspects associated with log shipping:
User IDs and their associated permissions are copied as part of log shipping.
They are the same at all servers, which might or might not be what you want.
Log shipping has no filtering. You cannot vertically or horizontally limit the data
that will be log shipped.
Log shipping has no ability to do data transformation. No summarizations,
format changes, or things like this are possible as part of the log shipping
mechanism.
Data latency is a factor. The amount of latency is dependent upon the frequency
of transaction log dumps being performed at the source and when they can be
applied to the destination copies.
Sources and destinations must be the same SQL Server version.
All tables, views, stored procedures, functions, and so on are copied.
Indexes cannot be tuned in the copies to support any read-only reporting
requirements.
Data is read-only (until log shipping is turned off).
If these restrictions are not going to cause you any trouble and your high availability
requirements dictate a log shipping solution, then you can proceed with confidence
in leveraging this Microsoft capability.
Note
Log shipping in MS SQL Server 2016 is extremely stable, but it will
eventually be deprecated (that is, dropped from SQL Server in future
releases). Many organizations are using availability groups instead, but
log shipping may be all that you really need, depending on your basic
needs, and doesn’t require things like failover clustering services.
Note
When you configure log shipping, a series of recurring SQL Server
Agent jobs are created on the SQL Server instances being used in your
configuration:
A job for database backup (if you have specified one on the source server)
A job for transaction log backups (on the source server)
A job for log shipping alerts (on the monitor server)
Two jobs on the destination server for copying and loading (restoring) the
transaction log
Remember that you should make sure that each SQL Server instance in your log
shipping configuration has its corresponding SQL Server Agent running, since tasks
will be created on each SQL Server instance and won’t get executed unless SQL
Server Agent is functioning and has permissions to access the resources they need. The
login that you use to start the SQL Server and SQL Server Agent services must have
administrative access to the log shipping plan jobs, the source server, and the
destination server. The user who sets up log shipping must be a member of the
SYSADMIN server role, which gives the user permission to modify the database to do
log shipping.
Next, you need to create a network share on the primary server where the transaction
log backups will be stored. You do this so that the transaction log backups can be
accessed by the log shipping jobs (tasks). This is especially important if you use a
directory that is different from the default backup location. Here is how it looks:
\\SourceServerXX\NetworkSharename
In this chapter you will create log shipping for the AdventureWorks database that is
shipped with SQL Server 2016 (and is available at www.msdn.com). If you don’t
already have this database downloaded, please get it now and install it on the source
SQL Server instance. Figure 9.2 shows the log shipping configuration you will set
up. The SQL2016DXD01 server instance will be the source, the SQL2016DXD02
server instance will be the destination, and the SQL2016DXD03 server instance will
be the monitor. Be sure to set the AdventureWorks database recovery model to be
Full so that the Ship Transaction Logs task is available for this database.
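Setting the recovery model is a one-line script (a sketch):
-- Log shipping requires the full (or bulk-logged) recovery model on the source database
ALTER DATABASE AdventureWorks SET RECOVERY FULL;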
FIGURE 9.3 Starting the Ship Transaction Logs task for a source database.
As you can see in Figure 9.4, you have to enable this database as the primary
database in the log shipping configuration by clicking the check box at the top. As
you define all the other properties to this log shipping configuration, they will be
visible from this database properties Transaction Log Shipping page. Once you
check the Enable check box, the Backup Settings option becomes available so you
can specify all that you need.
FIGURE 9.4 Database Properties page for log shipping settings.
Now, click on the Backup Settings button and, as you can see in Figure 9.5, you can
specify all backup and file server settings. You need to specify the full network path
to the backup folder that will be used to store the transaction log backups to be
shipped. It is important to grant read and write permissions on this folder to allow the
service account to use this (on the source server). It is also important to grant read
permissions to the account that will be executing the copy jobs, which copy/move the
transaction logs to the destination servers. Specify the network path
\\DXD001\FileServerBACKUP.
FIGURE 9.5 Transaction Log Backup Settings page.
Accept LSBackup_AdventureWorks (the default name) as the backup job name for
the SQL Server Agent job that does the transaction log backups at some frequency.
For this example, set up the backups to run every 5 minutes (the default is 15
minutes) because you need more current data (more frequent pushes of transactions
to the destination server). If you click the Schedule button to the right of the Job
Name field, you can set the schedule frequency (as shown in Figure 9.6).
FIGURE 9.6 New Job Schedule page.
Once you have specified the transaction log settings and schedule, you can continue
to specify the destination servers (secondary server instances and databases) along
with the monitor server instance that will be keeping track of the log shipping
timing.
To set up the monitor server instance, click the Use a Monitor Server Instance check
box at the bottom of the Database Properties page and click the Settings button to
connect to the SQL Server instance that will serve as the monitor for log shipping.
This also generates a local SQL Server Agent job for alerting when issues such as
timing problems arise. As shown in Figure 9.7, you want to connect to the
SQL2016DXD03 server instance as your monitor server.
Now you can add the destination server instance and database you are targeting for
log shipping. On the Database Properties page, click the Add button for adding
secondary server instances and databases (in the middle portion of this page). As you
can see in Figure 9.8, you need to connect to the destination (secondary) server
instance with the appropriate login credentials. You can then finish specifying how
you want to initiate log shipping on that server (SQL2016DXD02 in this example),
including creating the destination database if it doesn’t exist yet or restoring a full
backup to that destination database if it exists already.
FIGURE 9.7 Database Properties page with the monitor server instance
configured.
FIGURE 9.8 Transaction log backup schedule frequency settings.
For the destination (secondary) server, you need to specify the copy files schedule to
hold the transaction logs copied from the source server, as shown in Figure 9.9. In
addition, you specify the restore transaction log schedule, which applies the
transaction logs to the destination server (secondary) database, as also shown in
Figure 9.9. For this example, set all schedule frequencies to be at 5-minute intervals
to keep the destination data more current.
FIGURE 9.9 Destination (secondary) server and database initialization settings.
When the destination (secondary) server settings are all configured, the backup,
copy, and restore process begins. Figure 9.10 shows the final configuration all set up
and the process of restoring the backup to the destination (secondary) database
beginning. By clicking the Script Configuration button, you can specify to generate
the entire log shipping configuration in script form. (It is included in this chapter’s
SQL script downloads.)
Figure 9.11 shows the SQL Server Agent jobs executing away and the destination
server database in Restoring mode. This means the transaction logs are being applied
to this secondary database, and the database can be used for failover if needed.
FIGURE 9.10 Transaction log shipping database properties completely configured
and beginning processing.
FIGURE 9.11 SQL Server Agent job history of copies and restores at 5-minute
intervals.
In Restoring mode, no access to the destination database is allowed. You can use the
Standby mode to allow the destination database to be used for query processing.
However, the transaction log restore jobs may get held up (that is, queued up) until
any existing connections are terminated. (You’ll learn more about this in a bit.) This
isn’t necessarily the end of the world. You will still be able to apply those transaction
logs when the read-only activity is finished. Remember that the data in this
destination database is only as current as the last transaction log restore. Figure 9.12
shows the same overall log shipping configuration but with the destination database
in Standby mode. You can also see in Figure 9.12 the results of a read-only query
against the destination database Person table.
FIGURE 9.12 Log shipping in Standby mode and querying destination database
data.
Note
Log shipping and disk space may affect high availability. The directory
location you specify for your transaction log dumps should be large
enough to accommodate a very long period of dump activity. You may
want to consider using the Remove Files Older Than option to delete
backup files from the source server’s transaction log directory after a
specified amount of time has elapsed. But remember that disk space is
not endless. It will fill up eventually unless you specify something for
this option.
The database load state identifies how a destination database (the target of the log
shipping process) is to be managed during data loads:
No Recovery mode indicates that this destination database is not available for
use. The destination database will be in No Recovery mode as a result of either
the RESTORE LOG or RESTORE WITH NORECOVERY operations. When and if the
source server fails, the mode for this destination database will be changed so that
it can be used.
Standby mode indicates that this destination database is available for use but in
read-only mode. The destination database will be placed in Standby mode as a
result of either the RESTORE LOG or RESTORE DATABASE WITH STANDBY
operation.
This example allows for read-only access to this destination database if you specify
the Standby mode option. You should also specify the Terminate Users in Database
option since any restore operation (of transaction logs) will fail if any users are
connected to the destination database. This might seem a bit abrupt, but it is critical
to keeping the database intact and ensuring an exact image of the source database.
The users can reestablish the connection to this destination database after the restore
process is complete (which is usually very quickly). As you specify the frequency of
these restores, you need to consider the usage of this destination database as a
secondary data access point (such as for reporting).
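For reference, in Standby mode the restore job issues something equivalent to a RESTORE LOG ... WITH STANDBY statement. A minimal sketch (the log backup and undo file paths are placeholders):
-- Apply a shipped log while keeping the destination database readable (Standby mode)
RESTORE LOG [AdventureWorks]
FROM DISK = N'\\DXD001\FileServerBACKUP\AdventureWorks_tlog.trn'
WITH STANDBY = N'C:\Program Files\Microsoft SQL Server\MSSQL13.SQL2016DXD02\MSSQL\Backup\AdventureWorks_undo.tuf'
GO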
Note
If you are going to use the destination database for reporting, you might
want to decrease the frequency of these backups and copies/loads since
the reporting requirements may well tolerate something less frequent.
However, if the destination server is to be a hot spare for failover, you
might even want to increase the frequency to a few minutes apart for
both the backups and the copies/loads.
A quick look at the monitor server instance and the job history of the LSAlert SQL
Server Agent task shows a clean and healthy log shipping monitoring sequence. You
would want to use this SQL Server Agent task to send out alerts (for example,
emails, SMS messages) when log shipping is failing. Figure 9.13 shows the last
successful task executions from the monitor server instance. You can also execute a
few simple SELECT statements on the monitor server instance (against the MSDB
database tables for log shipping). You can usually look at the
msdb..log_shippping_monitor_history_detail table to get a good idea of
what is going on:
SELECT * FROM msdb..log_shipping_monitor_history_detail
WHERE [database_name] = 'AdventureWorks'
FIGURE 9.13 The monitor server instance Log Shipping Alerts task.
Each of the SQL Server instances in your topology has a series of log shipping tables
in the MSDB database:
log_shipping_primary_databases
log_shipping_secondary_databases
log_shipping_monitor_alert
log_shipping_monitor_error_detail
log_shipping_monitor_history_detail
log_shipping_monitor_primary
log_shipping_monitor_secondary
log_shipping_plan_databases
log_shipping_plan_history
log_shipping_plans
log_shipping_primaries
log_shipping_secondary
log_shipping_secondaries
Each appropriate table will be used, depending on the role the server plays in the
topology (source, destination, or monitor).
Note
You will not find entries in all tables in all SQL Servers in your
topology. Only the tables that are needed by each server’s functions will
have rows. So don’t be alarmed.
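For example, a quick way to confirm that backups and restores are current is to query
the table that matches each server's role (a sketch; only the server playing that role
will return rows):
-- On the source (primary) server: last transaction log backup per primary database
SELECT primary_database, last_backup_file, last_backup_date
FROM msdb.dbo.log_shipping_primary_databases;
-- On the destination (secondary) server: last log file restored per secondary database
SELECT secondary_database, last_restored_file, last_restored_date
FROM msdb.dbo.log_shipping_secondary_databases;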
FIGURE 9.14 Call before digging high availability “live solution” with log
shipping.
The total incremental cost to upgrade to this SQL clustering with log shipping high
availability solution was approximately $108,000—just slightly over the earlier
estimates.
Now, let’s work through the complete ROI calculation with these incremental costs,
along with the cost of downtime:
1. Maintenance cost (for a 1-year period):
$12k (estimate)—Yearly system admin personnel cost (additional time for
training of these personnel)
$16k (estimate)—Recurring software licensing cost (of additional HA
components; 2 OS + 2 SQL Server licenses)
2. Hardware cost:
$80k hardware cost—The cost of additional hardware in the new HA
solution
3. Deployment/assessment cost:
$20k deployment cost—The cost of development, testing, QA, and
production implementation of the solution
$10k HA assessment cost
4. Downtime cost (for a 1-year period):
If you kept track of last year’s downtime record, use that number; otherwise,
produce an estimate of planned and unplanned downtime for this calculation.
For this scenario, the estimated cost of downtime/hour is $2k/hour.
Planned downtime cost (revenue loss cost) = Planned downtime hours × Cost
of hourly downtime to the company (should be $0).
Unplanned downtime cost (revenue loss cost) = Unplanned downtime hours ×
Cost of hourly downtime to the company:
a. 0.5% (estimate of unplanned downtime percentage in 1 year) × 8,760 hours
in a year = 43.8 hours of unplanned downtime
b. 43.8 hours × $2k/hr (hourly cost of downtime) = $87,600/year cost of
unplanned downtime
ROI totals:
Total costs to get on this HA solution = $128,000 (for the year—slightly higher
than the immediate incremental costs stated above)
Total of downtime cost = $87,600 (for the year)
The incremental cost is about 123% of the downtime cost for 1 year. In other
words, the investment of the HA solution will pay for itself in 1 year and 3
months! This is well within the ROI payback the company was looking for, and
it will provide a solid HA solution for years to come.
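As a quick arithmetic check on that payback claim (reading "incremental cost" as the
$108,000 figure cited above; using the $128,000 total-cost figure instead stretches the
payback to roughly 1 year and 6 months):
$108,000 ÷ $87,600 ≈ 1.23, or about 123% of one year's downtime cost
1.23 × 12 months ≈ 14.8 months ≈ 1 year and 3 months to break even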
Summary
In contrast to data replication and SQL clustering, log shipping is fairly easy to
configure. It also doesn’t have many hardware or operating system restrictions. Log
shipping is a good option because it not only provides high availability but also
protects your data against hardware failures. In other words, if one of the disks on the
source (primary) server stops responding, you can still restore the saved transaction
logs on the destination (secondary) server and promote it to the primary role, with
little or no loss of work. In addition, log shipping does not require that the servers be
in close proximity. And, as added benefits, log shipping supports sending transaction
logs to more than one secondary server and enables you to offload some of the query
processing and reporting needs to these secondary servers.
As indicated earlier, log shipping is not as transparent as failover clustering or
availability groups because the end user will not be able to connect to the database
for a period of time, and users must update the connection information to the new
server when it becomes available. Remember that, from a data synchronization point
of view, you are only able to recover the database up to the last valid transaction log
backup, which means your users may have to redo some of the work that was already
performed on the primary server. It is possible to combine log shipping with
replication and/or failover clustering to overcome some of these disadvantages. Your
particular HA requirements may be very well supported with a log shipping model.
CHAPTER 10. High Availability Options in the
Cloud
IN THIS CHAPTER
A High Availability Cloud Nightmare
HA Hybrid Approaches to Leveraging the Cloud
Most organizations have started to do parts of their workload on any number of
cloud platforms. Some organizations are already completely cloud based.
Application options on software as a service (SaaS) cloud-based platforms (such as
Salesforce, Box, NetSuite, and others) are rapidly growing in popularity. There are
also many cloud computing options to choose from these days, such as Microsoft
Azure, Amazon, Rackspace, IBM, Oracle, and so on. An organization must weigh
many factors in deciding whether to use cloud computing. Cost is usually not what
drives a company to use or not use cloud computing; rather, things like performance
and very often security are the deciding factors. Equally important are the legacy
systems involved; some of them simply are not good candidates for 100% cloud-
based deployment. However, if your organization is already using or is about to use
cloud computing, the SQL Server family of products and Windows Server editions
have positioned you well to either take baby steps to the cloud or go all in as fast as
you want. Some of the Microsoft options available to you are cloud hosting
(infrastructure as a service [IaaS]), Azure SQL Database (which is really a database
platform as a service [PaaS] offering), and several hybrid options that combine your
existing on-premises deployment with Azure options to get you started on cloud
computing capabilities that will quickly enhance your high availability position.
This chapter introduces high availability options that you can leverage in two ways:
as a way to extend your current on-premises deployment (a hybrid approach to HA
in the cloud) and as a 100% cloud-based approach to achieving HA for your
applications (or just your database tiers). This chapter also describes a little about the
big data HA story, which is further detailed in Chapter 11, “High Availability and
Big Data Options.”
Note
Most of the big players—like Microsoft, Amazon, and Rackspace—
provide numerous solutions and options in cloud-based computing. This
chapter focuses on the options that best serve a mostly Microsoft stack.
However, to be fair, although we describe a Microsoft Azure option,
there might very well be a similar option on Amazon or another service.
The following sections look at a few natural cloud extensions for legacy on-premises
HA (or partial HA) solutions you might already have in place:
Extending data replication topologies to the cloud
Creating a Stretch Database on Microsoft Azure from your on-premises database
Creating an AlwaysOn availability group on the cloud
Adding a new destination node in the cloud for your existing log shipping
configuration
Figure 10.2 shows each of these options with portions of the corresponding
topologies on MS Azure (in the cloud). Microsoft has made it very clear that it wants
to protect all the existing investment you have already put into place and, at the same
time, provide you with expansion options to meet your future needs. The following
sections describe a number of ways to get into the cloud fairly easily to fulfill your
high availability needs.
FIGURE 10.2 Extending your current technology to the cloud to enhance HA.
FIGURE 10.7 Typical OLTP database breakdown of current versus historical data
for a large database.
The Stretch Database capability takes infrequently accessed data that meets certain
criteria and pushes this data to the cloud via the linked server mechanism (see Figure
10.8). The net effect is a significantly smaller primary database, increased
performance for current data, and significantly decreased times to back up and
restore your local database, which, in turn, directly affects your RTO and RPO
numbers. This can have a huge impact on your overall HA experience. Put simply,
backing up and restoring a 400GB database is very different from backing up and
restoring a 3GB database. Using the Stretch Database capability allows you the
luxury of smaller database backups/restores without the loss of any of that 397GB of
data and without the loss of access to that 397GB of data.
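As a rough sketch of the mechanics (the Azure server name, database scoped credential,
and table name below are purely illustrative, and the optional filter predicate is
omitted), enabling Stretch Database involves an instance-level setting, a database-level
setting that points at Azure, and a per-table setting:
-- Allow this instance to use Stretch Database
EXEC sp_configure 'remote data archive', 1;
RECONFIGURE;
GO
-- Point the database at an Azure SQL server
-- (assumes a database scoped credential named AzureStretchCred already exists)
ALTER DATABASE AdventureWorks
SET REMOTE_DATA_ARCHIVE = ON
(SERVER = N'mystretchserver.database.windows.net',
CREDENTIAL = [AzureStretchCred]);
GO
-- Stretch a single table; OUTBOUND begins migrating eligible rows to Azure
ALTER TABLE dbo.SalesOrderHistory
SET (REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND));
GO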
FIGURE 10.8 How Stretch Database reduces the local database size and migrates
this data to the cloud.
Summary
Whether you are trying to expand into the cloud in a hybrid (partial) approach or go
100% into the cloud, you must consider how high availability will be provided for
any configuration you use. This chapter has shown how to extend your current SQL
Server deployment into the cloud by using hybrid approaches for each of the major
types of HA configurations. Select the one that you have experience with first and
then move to others that have more resilience as your business requirements warrant.
This chapter has demonstrated how easy it is to extend your environments into the
cloud without much disruption to your current implementations. Once you make
that first big step, the HA improvements (and advantages) are huge. If you are
choosing to go 100% into the cloud with Azure IaaS options, you can readily build
out everything you currently have on-premises. As you grow, you simply
dynamically expand that footprint. One step further is to utilize Azure SQL Database
for a PaaS solution with high availability. This offering is taking over the PaaS
market like wildfire.
CHAPTER 11. High Availability and Big Data
Options
IN THIS CHAPTER
Big Data Options for Azure
HDInsight Features
High Availability of Azure Big Data
How to Create a Highly Available HDInsight Cluster
Accessing Your Big Data
The Seven-Step Big Data Journey from Inception to Enterprise Scale
Other Things to Consider
Azure Big Data Use Cases
With the exponential growth of data generated by individuals and corporations, big
data applications have garnered a lot of attention. It all started with the publication of
a monumental white paper by Google on using GFS (Google File System) for
storage and MapReduce for processing back in 2003. A year later, Doug Cutting and
Michael Cafarella created Apache Hadoop. Since then, Hadoop and several other
open source Apache projects have created an entire ecosystem to address diverse
scenarios and create amazing applications. These big data ecosystems can be
deployed in the cloud or on-premises. In the cloud, the two most prominent offerings
are AWS (Amazon Web Services) from Amazon and Azure from Microsoft. The
Azure offering has come a long way and now provides a rich set of enterprise-caliber
big data implementation options. Over time, Microsoft has developed a
full stack for this big data ecosystem, ranging from Azure Storage to Hadoop cluster
implementations and, more recently, advanced analytics, including machine learning.
This chapter introduces various Microsoft big data offerings and a few third-party
offerings. It also describes the high availability features that are part of the big data
solutions. Finally, this chapter provides some real-life use cases for highly available
big data solutions in the cloud. Because Microsoft has decided to embrace the
Hadoop architecture in its Azure deployment for big data, all Azure big data
deployments naturally inherit the resilience and failover features at their lowest
levels. (More on this HA architecture later.)
HDInsight
HDInsight is the Azure managed cloud service that provides the capability to
deploy Apache Hadoop, Spark, R, HBase, and several other big data components.
HDInsight is discussed in detail later in this chapter.
Stream Analytics
Stream Analytics is a fully managed, cost-effective, real-time event processing
engine that can handle a wide range of streaming sources, such as sensors, websites,
social media, applications, and infrastructure systems.
Cognitive Services
Cognitive Services is a rich set of smart APIs that enable natural and contextual
interactions for language, speech, search, vision, knowledge, and so on.
Data Factory
Azure Data Factory allows you to compose and orchestrate data services for data
movement and transformation. It supports a wide range of data stores as sources,
including Azure data stores, SQL and NoSQL databases, flat files, and several other
data containers, and it includes an array of transformation activities that range
from simple high-level APIs like Hive and Pig to advanced analytics using machine
learning.
Power BI Embedded
Power BI Embedded brings the capability of interactive reports in Power BI to the
Azure platform. Power BI Desktop users, OEM vendors, and developers can create
custom data visualizations in their own applications.
HDInsight Features
As you can also see in Figure 11.5, Azure HDInsight is the primary big data product.
It is based on Apache Hadoop and powered by the Microsoft cloud, which means it
follows the Hadoop architecture and can process petabytes of data. As you probably
already know, big data can be structured, unstructured, or semi-structured.
HDInsight allows you to develop in your favorite languages. HDInsight supports
programming extensions for many languages and frameworks, such as C#, .NET,
and Java (Java SE/Java EE).
With the use of HDInsight, there is no need to worry about the purchase and
maintenance of hardware. There is also no time-consuming installation or setup.
For data analysis you can use Excel or your favorite business intelligence (BI) tools,
including the following:
Tableau
Qlik
Power BI
SAP
You can also customize a cluster to run other Hadoop projects, such as the
following:
Pig
Hive
HBase
Solr
MLlib
Real-Time Processing
HDInsight is equipped with Apache Storm, which is an open source real-time event-
processing system. It enables users to analyze the real-time data from the Internet of
Things (IoT), social networks, and sensors.
Data Redundancy
For data redundancy, Hadoop enables fault tolerance capabilities by storing
redundant copies of data. This is very similar to RAID storage but is actually built
into the architecture and implemented in software. The default replication factor for
any data block stored in HDFS is three. This can be adjusted via the dfs.replication
setting in the hdfs-site.xml configuration file if needed.
Data in Microsoft Azure storage is always replicated, and you have the option of
selecting the replicated data copy to be within the same data center (region) or to
different data centers (regions).
During the creation of your storage account, four replication options are available
(see Figure 11.9):
FIGURE 11.9 Azure storage replication options.
Read-access geo-redundant storage (RA-GRS)—This is the default storage
account option, which maximizes availability and is commonly used for high
availability. It provides read-only access to your data at a secondary location,
along with the replication across two regions provided by GRS.
Zone-redundant storage (ZRS)—ZRS provides three copies of data replicated
across data centers within one or two regions. This, unto itself, provides an
additional layer of fault tolerance in the event that the primary data center is
unavailable. ZRS has some limitations, as it is available for blob storage only.
Locally redundant storage (LRS)—This is the simplest and lowest-cost option
for replication, involving making three copies of data within the data center,
spread across three different storage nodes. Rack-level awareness is achieved by
storing the copies over different fault domains (FDs) and upgrade domains (UDs)
to provide fault tolerance in case a failure impacts a single rack.
Geo-redundant storage (GRS)—GRS replicates the three copies made with
LRS to another region hundreds of miles away from the primary region. So, if
the primary region becomes unavailable, your data is still available in another
region.
FIGURE 11.10 The architecture of head nodes and data nodes of a Hadoop
cluster.
You can also run some simple Hadoop commands, such as to get a list of HDFS
files, invoke Hive, and check available hive tables, as shown here:
demo_user@hn0-azureh:~$ hadoop fs -ls /
Found 12 items
drwxr-xr-x   - root   supergroup  0 2016-11-25 04:16 /HdiSamples
drwxr-xr-x   - hdfs   supergroup  0 2016-11-25 04:01 /ams
drwxr-xr-x   - hdfs   supergroup  0 2016-11-25 04:01 /amshbase
drwxrwxrwx   - yarn   hadoop      0 2016-11-25 04:01 /app-logs
drwxr-xr-x   - yarn   hadoop      0 2016-11-25 04:01 /atshistory
drwxr-xr-x   - root   supergroup  0 2016-11-25 04:01 /example
drwxr-xr-x   - hdfs   supergroup  0 2016-11-25 04:01 /hdp
drwxr-xr-x   - hdfs   supergroup  0 2016-11-25 04:01 /hive
drwxr-xr-x   - mapred supergroup  0 2016-11-25 04:01 /mapred
drwxrwxrwx   - mapred hadoop      0 2016-11-25 04:01 /mr-history
drwxrwxrwx   - hdfs   supergroup  0 2016-11-25 04:01 /tmp
drwxr-xr-x   - hdfs   supergroup  0 2016-11-25 04:01 /user
demo_user@hn0-azureh:~$
You can invoke Hive in the newly created cluster as shown here:
demo_user@hn0-azureh:~$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in
file:/etc/hive/2.4.4.0-10/0/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 1.272 seconds, Fetched: 1 row(s)
hive> show tables;
hivesampletable
Time taken: 0.133 seconds, Fetched: 1 row(s)
If this is an experimental cluster, you’ll likely want it to go away after you are done
playing with it so that you don’t incur any charges. As you can see in Figures 11.28
and 11.29, it is easy to delete a cluster and a storage account.
Summary
This chapter introduces some of the main products, approaches, and use cases for big
data surrounding the Microsoft Azure offerings. It is important to look at the high
availability capabilities as big data solutions become more like traditional tier 1
(business-required) applications. This chapter talks about how high availability is
“designed into” the processing architecture of Hadoop platforms. Big data is here to
stay and is rapidly becoming a part of any company’s data foundation, and for this
reason you should not treat it any differently from your other traditional data
platforms. Big data is an integrated part of your entire data platform. Later chapters
show how this all comes together into a complete data picture with high availability
across all components. In some companies, big data analysis is determining the
companies’ future existence. Big data systems are critical systems for a company,
and they should be highly available to some degree. Microsoft has answered the call
for supporting big data on-premises and in the cloud with Azure, and building your
big data solutions has never been easier. However, do not forgo high availability. Big
data won’t be useful to anyone if the big data containers are down.
CHAPTER 12. Hardware and OS Options for
High Availability
IN THIS CHAPTER
Server HA Considerations
Backup Considerations
As you put together high availability solutions that meet your business needs, you
will have many different hardware and operating system options to consider. As first
described in Chapter 1, “Understanding High Availability,” your high availability is
only as good as the weakest link in your full system stack. It is essential that you
look at every layer in your system and understand what options you need to consider
in order to achieve your desired HA result. However, you may be dealing with
limited hardware resources, various and sometimes restrictive storage options,
certain operating system editions and their limitations, hybrid systems that span both
on-premises servers and cloud-based servers (infrastructure as a service [IaaS]),
100% virtualized servers (on-premises), or 100% cloud-based compute power
variations (for example, on Azure or Amazon). All these must be considered,
regardless of where your footprint is. You must also be aware of what is available for
things like server backup images, database-level backups, varying virtualization
options, and live migrations (from one server to another in case of failure or
increases in workloads). People used to think that a database would run more slowly on
a virtual machine (VM). If you are actually experiencing such a thing, though, it is
usually a result of a poorly configured VM, not the database engine.
Organizations around the world are pushing to get much of their stacks in the cloud,
one way or another. Welcome to the new world of infinite computing. The world has
crossed the threshold of accepting the cloud as a production-worthy solution,
especially if HA and disaster recovery options are present. If your organization has
decided that having a backup site available on AWS or Azure is acceptable, then
why wouldn’t you just start there to begin with? This is a good question to ask
management. However, if you just aren’t ready to go 100% cloud-based yet, you
need to get your on-premises and hybrid acts together.
Generally speaking, it is best to architect for the “shared nothing” approach for data
and servers. In other words, you should always have secondary resources available
for failover at both the compute and storage levels. Many organizations are already
virtualized, both on-premises and in the cloud. Figure 12.1 illustrates the multiple
VMs within a Windows 2012 hypervisor server architecture.
Server HA Considerations
Still at the heart of many on-premises and virtualized systems is failover clustering
for both physical servers and VMs.
Failover Clustering
In Windows Server 2012 and later, you can create a cluster with up to 64 nodes and
8,000 VMs. Failover clustering requires some type of shared storage so the data can
be accessed by all nodes and will work with most SANs (storage area networks),
using a supported protocol. Failover clustering includes a built-in validation tool
(Cluster Validation) which can verify that the storage, network, and other cluster
components will work correctly. Figure 12.2 shows SAN storage shared across
multiple virtual machine nodes.
SAN storage can be categorized into three main types:
SAN using a host bus adapter (HBA)—This is the most traditional type of
SAN. Supported types include Fibre Channel and Serial Attached SCSI (SAS).
Fibre Channel tends to be more expensive but offers faster performance than
SAS.
SAN using Ethernet—In recent years, network bandwidth has become
significantly faster, matching speeds that were previously possible only with
HBA-based storage fabric. This has enabled Ethernet-based solutions to be
offered at much lower costs, although they still require dedicated NICs and
networks. The two protocols supported by failover clustering are iSCSI and
Fibre Channel over Ethernet (FCoE).
SMB3 file server—Server Message Block (SMB) protocol is a Microsoft-
centric, application-layer network protocol used for file sharing on a file server.
A traditional file share is a location for storing data that’s accessible by multiple
servers. With the introduction of Windows Server 2012, it has become possible
to store the virtual hard disk for a VM on this file share, which allows it to
function as a very affordable shared storage type that allows all cluster nodes to
access it at once. This has proven to be very reliable and helps simplify failover
cluster configurations.
Networking Configuration
Optimizing a cluster’s networks is critical for high availability because networks are
used for administration, VM access, health checking, live migration, and, often,
storage in the case of an Ethernet-based solution or Hyper-V over SMB. The cluster
nodes can be on the same subnet or different subnets, and the cluster will
automatically configure the networks when the cluster is created or a new network is
added.
You must use Hyper-V Manager to create identical virtual networks and switches on
every cluster node so that your VMs can connect to other services. These virtual
networks must be named the same on every node in the cluster so that the VM will
always be able to connect to the same network, using its name, regardless of which
host the VM is running on.
Every cluster requires at least two networks for redundancy. If one network becomes
unavailable, the traffic is rerouted through the redundant network. The best practice
is to have a dedicated network of at least 1Gbps for each network traffic type:
Live migration network traffic—In a live migration—which means moving a
running VM from one host to another—the memory of the VM is copied
between the hosts through a network connection. This data movement causes a
large spike in network traffic as several gigabytes of data are sent through the
network as fast as possible. A dedicated network is strongly recommended so the
live migration doesn’t interfere with other network traffic.
Host management network traffic—Certain types of administration tasks
require large amounts of data to be sent through a network, such as performing a
backup with third-party products such as Veeam Backup & Replication,
deploying a VM on a host from a library, or replicating a VM. Ideally, this type
of traffic should have a dedicated network.
Storage using Ethernet network traffic—If you are using iSCSI, FCoE, or an
SMB3 file server for storage, you must have a dedicated network connection for
this storage type. It is important to ensure that your network has enough
bandwidth to support the needs of all VMs on that host. The performance of all
the VMs on a host will slow down if they cannot access data fast enough.
When a cluster is created, it assigns a different value to each of the networks based
on the order in which it discovers different network adapters. NICs that have
access to a default gateway are designated for use with client and application traffic
because the cluster assumes that this network has an external connection. This value
is known as the network priority. Windows Server failover clustering does not
require identical hardware for each host (node), as long as the entire solution does
not fail any of the cluster validation tests. Because some hosts may be more powerful
than others or have different access speeds to the storage, you may want certain VMs
to run on specific hosts.
Virtualization Wars
At one time, VMs were unstable, slow, and hard to manage. With big providers
such as VMware and Microsoft, the server virtualization world is now massively
successful, completely reliable, and extremely dynamic to manage.
Live migration, as mentioned earlier in this chapter, entails moving active VMs
between physical hosts with no service interruption or downtime. A VM live
migration allows administrators to perform maintenance and resolve a problem on a
host without affecting users. It is also possible to optimize network throughput by
running VMs on the same hypervisor, automating Distributed Resource Scheduler
(DRS), and doing automatic load balancing of disks—moving the disks of a VM
from one location to another while the VM continues to run on the same physical
host. This all adds up to higher availability.
As you might already know, Microsoft introduced the ability to move VMs across
Hyper-V hosts with Windows Server 2008 R2. This required VMs to reside on
shared storage as part of a cluster. With Windows Server 2012 and Server 2012 R2,
Microsoft continued to gain ground on VMware, introducing additional migration
capabilities that put Microsoft more or less on par with VMware. Now, since
Windows Server 2012 R2, Hyper-V can store VMs on SMB file shares and allows
live migrations of running VMs stored on a central SMB share between nonclustered and
clustered servers, so users can benefit from live migration capabilities without
investing in a fully clustered infrastructure (see Figure 12.3).
FIGURE 12.3 Doing live migrations with zero downtime on Windows 2012 R2.
Windows Server 2012 R2’s live migration capability also leverages compression,
which reduces the time needed to perform live migration by 50% or more. Live
migration in Windows Server 2012 R2 utilizes improvements in the SMB3 protocol
as well. If you are using network interfaces that support remote direct memory
access (RDMA), the flow of live migration traffic is faster and has less impact on the
CPUs of the nodes involved. Storage live migration was introduced to the Hyper-V
feature set with Windows Server 2012. Windows Server 2008 R2 allowed users to
move a running VM using traditional live migration, but it required a shutdown of
the VM to move its storage.
Backup Considerations
Contemporary storage backup technology uses techniques such as changed block
tracking to back up virtual machines. VMs are well suited to storing backup data on
disk as they require access to the initial backup plus all data changes to perform
restores. However, disk-based backup isn’t necessarily scalable and doesn’t always
offer easy portability when you need to take data offsite for full disaster recovery.
Options such as creating synthetic backups based on the original backup plus all
subsequent incremental block changes can often meet a recovery need. But
remember that when failure happens, it is not just data that needs to be restored but
the full working environment. Further, disaster recovery is not possible without a
backup in the first place.
Figure 12.4 illustrates the possible backup and recovery scope across on-premises,
infrastructure as a service (IaaS), platform as a service (PaaS), and any number of
software as a service (SaaS) applications that are a part of a business.
VM Snapshots
Many VM tools take incremental snapshots of a VM at given frequencies. This
usually involves a short pause in the VM that lasts long enough to copy its data, its
memory, and other relevant elements very quickly. These VM snapshots can then be
used to re-create the VM anywhere that is specified in a relatively short amount of
time. The RPO depends on how often snapshots are taken. The RTO depends on how
quickly the entire VM becomes available in an alternate location. This approach
started to show up in the VM world from companies such as Veeam, VMware,
Microsoft Hyper-V, and others.
Other, more traditional backup suppliers have adapted their products to compete
directly with VM snapshot capabilities. Backup Exec (from Symantec) mostly
matches the capability and performance of VM snapshotting. Other vendors, such as
Dell, claim that their solutions avoid the pausing of VMs altogether and have zero
effect on the VMs. One advantage these traditional providers have is that they
support both the new VM world and all of your legacy world. And, of course, many
IaaS providers, such as Azure, Amazon, and Rackspace, provide VM replication,
enabling users to put their own failover in place without too much hassle (for an
additional cost, of course).
CHAPTER 13. Disaster Recovery and Business
Continuity
IN THIS CHAPTER
How to Approach Disaster Recovery
Microsoft Options for Disaster Recovery
The Overall Disaster Recovery Process
Have You Detached a Database Recently?
Third-Party Disaster Recovery Alternatives
What? You think disasters never happen? Your SQL Servers and applications have
been running fine for months on end? What could possibly happen to your data
center in Kansas? If you think it can’t happen to you, you are dreaming. Disasters
happen in all sorts of sizes, shapes, and forms. Whether a disaster is human-caused
(terrorism, hacking, viruses, fires, human errors, and so on), natural (weather,
earthquakes, fires, and so on), or just a plain failure of some kind (server failure), it
can be catastrophic to your company’s very existence.
Some estimate that companies spend up to 25% of their budget on disaster recovery
plans in order to avoid bigger losses. Of companies that have had a major loss of
computerized records, 43% never reopen, 51% close within 2 years, and only 6%
survive in the long term (see
www.datacenterknowledge.com/archives/2013/12/03/study-cost-data-center-
downtime-rising). Which way would you go on this subject? I’m sure you are really
thinking about getting serious about devising some type of disaster recovery (DR)
plan that supports your company’s business continuity (BC) requirements. It must be
able to protect the primary (typically revenue-generating) applications that your
business relies on. Many applications are secondary when it comes to DR and BC.
Once you have identified what systems need to be protected, you can go about
planning and testing a true disaster plan, using all the best disaster recovery
capabilities you have at your disposal.
Microsoft doesn’t have something it calls “disaster recovery for SQL Server,” but it
does have many of the pieces of the puzzle that can be leveraged in your specialized
plans for your own disaster recovery effort. Microsoft’s newest solutions for disaster
recovery include the AlwaysOn features and some Azure options (to use the cloud as
a viable recovery site and as an architecture itself). In addition, Microsoft continues
to release enhancements to existing features that are highly leveraged for various
approaches to disaster recovery in many SQL Server environments. In particular, log
shipping is still being used to create redundant systems for DR purposes; a few types
of data replication topologies are available, such as peer-to-peer replication; and
change data capture (CDC) is a poor-man’s approach to DR. Database mirroring
(even though it will be deprecated someday) is another viable feature that can be
used to support both active/active and active/passive disaster recovery needs. With
Windows 2012 and newer, multisite clustering allows you to fail over to a
completely different data center location. As mentioned earlier, the AlwaysOn
availability group feature can be used to provide a multisite DR option for both
onsite and cloud-based options to DR. Finally, the cloud-based solutions continue to
expand, with Azure, Amazon, and others allowing for built-in DR options to
geographically remote sites and as extensions to your current on-premises solutions.
These offerings, and other more traditional offerings, round out the arsenal from
Microsoft on giving you the comfortable feeling of attaining business continuity.
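As one small illustration of the "poor man's" CDC approach mentioned above, change
data capture is enabled with two system stored procedures; forwarding the captured
changes to a recovery site is then left to your own jobs or ETL processes (the database,
schema, and table names here are only examples):
USE AdventureWorks;
GO
-- Enable change data capture for the database
EXEC sys.sp_cdc_enable_db;
GO
-- Capture changes for one table into a change table that your
-- DR jobs can read and forward to the recovery site
EXEC sys.sp_cdc_enable_table
@source_schema = N'Person',
@source_name = N'Person',
@role_name = NULL; -- no gating role required to read the changes
GO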
Note
Remember that log shipping and database mirroring are both on the way
out in future Microsoft releases, so don’t plan too much new usage of
these features.
Recovery Objectives
You need to understand two main recovery objectives: the point in time to which
data must be restored to be able to successfully resume processing (called the
recovery point objective [RPO]) and the acceptable amount of downtime that is
tolerable (called the recovery time objective [RTO]). The RPO is often thought of as
the time between the last backup and the point when the outage occurred. It indicates
the amount of data that will be lost. The RTO is determined based on the acceptable
downtime in case of a disruption of operations. It indicates the latest point in time at
which the business operations must resume after disaster (that is, how much time can
elapse).
The RPO and RTO form the basis on which a data protection strategy is developed.
They help provide a picture of the total time that a business may lose due to a
disaster. The two metrics together are very important requirements when designing a
solution. Let’s put these terms in the form of algorithms:
RTO = Difference between the time of the disaster and the time the system is
operational = Time operational (up) – Time disaster occurred (down)
RPO = Time since the last usable backup of complete transactions (representing data
that must be re-acquired or re-entered) = Time disaster occurred (down) – Time of
last usable data backup
Therefore:
Total lost business time = RTO + RPO = Time operational (up) – Time of the last
usable data backup
Data Replication
A solid and stable Microsoft option that can be leveraged for disaster recovery is
data replication. Not all variations of data replication fit this bill, though. However,
the central publisher replication model using either continuous or very frequently
scheduled distribution is very good for creating a hot spare of a SQL Server database
across almost any geographic distance, as shown in Figure 13.8. The primary site is
the only one actively processing transactions (updates, inserts, deletes) in this
configuration, with all transactions being replicated to the subscriber, usually in a
continuous replication mode.
Log Shipping
As you can see in Figure 13.11, log shipping is readily usable for the active/passive
DR pattern. However, log shipping is only as good as the last successful transaction
log shipment. The frequency of these log ships is critical in the RTO and RPO
aspects of DR. Log shipping is really not a real-time solution. Even if you are using
continuous log shipping mode, there is a lag of some duration due to the file
movement and log application on the destination.
Remember that Microsoft is deprecating log shipping, and it is perhaps not a good
idea to start planning a future DR implementation that will go away.
FIGURE 13.11 Log shipping configuration for active/passive DR pattern.
Note
Many organizations have gone to the concept of having hot alternate
sites available via stretch clustering or log shipping techniques. Costs
can be high for some of these advanced and highly redundant solutions.
Using sqldiag.exe
One good way to get a complete environmental picture is to run the sqldiag.exe
program provided with SQL Server 2016 on your production box (which you would
have to re-create on an alternate site if a disaster occurred). It is located in the Binn
directory, where all SQL Server executables reside (C:\Program Files\Microsoft SQL
Server\130\Tools\Binn). This program shows how the server is configured, all
hardware and software components (and their versions), memory sizes, CPU types,
operating system version and build information, paging file information,
environment variables, and so on. If you run this program on your production server
periodically, it provides good environment documentation to supplement your
disaster recovery plan. This utility is also used to capture and diagnose SQL Server–
wide issues and has a prompt that you must respond to when re-creating issues on
which you want to collect diagnosis information. Figure 13.17 shows the expected
execution command and system information dialog window.
FIGURE 13.17 sqldiag.exe execution.
Note
For the purposes of this chapter, when prompted for the SQLDIAG
collection, you can just terminate that portion by pressing Ctrl+C.
To run this utility, you open a command prompt and change directory to the SQL
Server Binn directory. Then, at the command prompt, you run sqldiag.exe:
C:\Program Files\Microsoft SQL Server\130\Tools\Binn>
sqldiag.exe
The results are written into several text files within the SQLDIAG subdirectory.
Each file contains different types of data about the physical machine (server) that
SQL Server is running on and information about each SQL Server instance. The
machine (server) information is stored in a file named XYZ_MSINFO32.TXT, where
XYZ is the machine name. It contains a verbose snapshot of everything that relates to
SQL Server (in one way or another) and all the hardware configuration, drivers, and
so on. It is the tightly coupled metadata and configuration information directly
related to the SQL Server instance. The following is an example of part of what it
contains:
System Information report written at: 12/08/16 21:18:01
System Name: DXD001
[System Summary]
Item Value
OS Name Microsoft Windows Vista Premium
Version 6.1.7601 Service Pack 1 Build 7601
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name DATAXDESIGN-PC
System Manufacturer TOSHIBA
System Model Satellite P745
System Type x64-based PC
Processor Intel(R) Core(TM) i3-2350M CPU @ 2.30GHz, 2300 Mhz, 2
Core(s), 4 Logical Processor(s)
BIOS Version/Date TOSHIBA 2.20, 10/30/2015
SMBIOS Version 2.6
Windows Directory C:\windows
System Directory C:\windows\system32
Boot Device \Device\HarddiskVolume1
Locale United States
Hardware Abstraction Layer Version = "6.1.7601.17514"
User Name DXD001\DATAXDESIGN
Time Zone Pacific Daylight Time
Installed Physical Memory (RAM) Not Available
Total Physical Memory 11.91 GB
Available Physical Memory 8.83 GB
Total Virtual Memory 19.8 GB
Available Virtual Memory 8.54 GB
Page File Space 5.91 GB
Page File C:\pagefile.sys
A separate file is generated for each SQL Server instance you have installed on a
server. These files are named XYZ_ABC_sp_sqldiag_Shutdown.OUT, where XYZ is
the machine name and ABC is the SQL Server instance name. This file contains most
of the internal SQL Server information regarding how it is configured, including a
snapshot of the SQL Server log as this server is operating on this machine. The
following example shows this critical information from the
DXD001_SQL2016DXD01_sp_sqldiag_Shutdown.OUT file:
2016-12-08 20:53:27.810 Server Microsoft SQL Server 2016 -
13.0.700.242 (X64)
Dec 8 2016 20:23:12
Copyright (c) Microsoft Corporation
Developer Edition (64-bit) on Windows 8.1 Pro <X64> (Build 9600:
Hypervisor)
2016-12-08 20:53:27.840 Server (c) Microsoft Corporation.
2016-12-08 20:53:27.840 Server All rights reserved.
2016-12-08 20:53:27.840 Server process ID is 4204.
2016-12-08 20:53:27.840 Server System Manufacturer:
'TOSHIBA', System Model: 'Satellite P745'.
2016-12-08 20:53:27.840 Server Authentication mode is
MIXED.
2016-12-08 20:53:27.840 Server Logging SQL Server messages
in file 'C:\Program
Files\Microsoft SQL
Server\MSSQL13.SQL2016DXD01\MSSQL\Log\ERRORLOG'.
2016-12-08 20:53:27.840 Server The service account is
'DXD001\Paul'. This is an
informational message; no user action is required.
2016-12-08 20:53:27.850 Server Registry startup
parameters:
-d C:\Program Files\Microsoft SQL
Server\MSSQL13.SQL2016DXD01\MSSQL\DATA\master.mdf
-e C:\Program Files\Microsoft SQL
Server\MSSQL13.SQL2016DXD01\MSSQL\Log\ERRORLOG
-l C:\Program Files\Microsoft SQL
Server\MSSQL13.SQL2016DXD01\MSSQL\DATA\mastlog.ldf
2016-12-08 20:53:27.850 Server Command Line Startup
Parameters:
-s "SQL2016DXD01"
2016-12-08 20:53:28.770 Server SQL Server detected 1
sockets with 2 cores per socket and 4 logical processors per
socket, 4 total logical processors; using 4 logical processors
based on SQL Server licensing. This is an informational message;
no user action is required.
2016-12-08 20:53:28.770 Server SQL Server is starting at
normal priority base (=7). This is an informational message
only. No user action is required.
2016-12-08 20:53:28.770 Server Detected 6051 MB of RAM.
This is an informational message; no user action is required.
2016-12-08 20:53:28.790 Server Using conventional memory
in the memory manager.
2016-12-08 20:53:29.980 Server This instance of SQL Server
last reported using a process ID of 6960 at 12/4/2016 12:28:56
AM (local) 12/4/2016 7:28:56 AM (UTC). This is an informational
message only; no user action is required.
From this output, you can ascertain the complete SQL Server instance information as
it was running on the primary site. It is excellent documentation for your SQL Server
implementation. You should run this utility regularly and compare the output with
prior executions’ output to guarantee that you know exactly what you have to have in
place in case of disaster.
CHAPTER 14. Bringing HA Together
IN THIS CHAPTER
Foundation First
Assembling Your HA Assessment Team
Setting the HA Assessment Project Schedule/Timeline
Doing a Phase 0 High Availability Assessment
Selecting an HA Solution
Determining Whether an HA Solution Is Cost-Effective
As you have no doubt surmised by now, evaluating, selecting, designing, and
implementing the right high availability solution should not be left to the weak at
heart or the inexperienced. There is too much at stake for your company to end up
with mistakes in this process. For this reason, I again stress that you should use your
best technologists for any HA assessment you do or get some outside help from
someone who specializes in HA assessments—and get it fast.
The good news is that achieving the mythical five 9s (a sustained 99.999%
availability of your application) can be done, if you follow the steps outlined in
this book. In addition, you have now had a chance to thoroughly dig into the primary
HA solutions from Microsoft (failover clustering, SQL clustering, replication,
database mirroring/snapshots, log shipping, availability groups, virtual machine
snapshots, and backup and replication approaches) and should be getting a feel for
what these tools can do for you. This chapter combines this exposure and capability
information together into a coherent step-by-step methodology for getting your
applications onto the correct high availability solution. But first, a few words about
the hardware and software foundation you put into place.
Foundation First
Whereas in real estate the mantra is “location, location, location,” in HA solutions, it
is “foundation, foundation, foundation.” Laying in the proper hardware and software
components will allow you to build most HA solutions in a solid and resilient way.
As you can see in Figure 14.1, these foundation elements relate directly to different
parts of your system stack. And in most cases, it really doesn’t matter where the
environments are (on-premises, in the cloud, virtualized, or raw iron).
FIGURE 14.1 Foundational elements and their effects on different system stack
components.
Specifically, these are the foundational elements:
Putting hardware/network redundancies into place shores up your network
access and the long-term stability of your servers.
Making sure all network, OS, application, middleware, and database software
upgrades are always kept at the highest release levels possible (including
antivirus software) affects most components in the system stack.
Deploying comprehensive and well-designed disk backups and DB backups
directly impacts your applications, middleware, and databases, as well as the
stability of your operating systems. This might also take the form of virtual
machine snapshots or other options you might have available to you in your
foundation.
Establishing the necessary vendor service level agreements/contracts affects all
components of the system stack (hardware and software), especially if you’re
using IaaS, SaaS, PaaS, and DRaaS.
Comprehensive end-user, administrator, and developer training including
extensive QA testing has a great impact on the stability of your applications,
databases, and the OS itself.
Without making any further specialized HA changes, this basic foundation offers a
huge degree of availability (and stability) in itself—but not necessarily five 9s.
Adding specialized high availability solutions to this foundation allows you to push
toward higher HA goals.
In order to select the “right” high availability solution, you must gather the
specialized high availability detail requirements of your application. Very often,
characteristics related to high availability are not considered or are neglected during
the normal requirements gathering process. As discussed in Chapter 3, “Choosing
High Availability,” gathering these requirements is best done by initiating a full-
blown Phase 0 HA assessment project that runs through all the HA assessment areas
(which are designed to flesh out HA requirements specifically). Then, based on the
software available, the hardware available, and these high availability requirements,
you can match and build the appropriate HA solution on top of your solid
foundation.
If your application is already implemented (or is about to be implemented), then you
will really be doing a high availability “retrofit.” Coming in at such a late stage in
the process may or may not limit the HA options you can select. It of course depends
on what you have built. It is possible, however, to match up HA solutions that meet
HA needs and don’t result in major rewrites of applications.
Note
As a bonus to our readers, a sample Phase 0 HA assessment template (a
Word document named HA0AssessmentSample.doc) is available on the
book’s companion website for download at
www.informit.com/title/9780672337765.
Note
We provide a sample template of the primary variables gauge and other
HA representations for download. Look for the PowerPoint document
named HA0AssessmentSample.ppt on the Sams Publishing website at
www.informit.com/title/9780672337765.
FIGURE 14.3 Traditional development life cycle with high availability tasks built
in.
As you can see in this traditional “waterfall” methodology, every phase of the life
cycle has a new task or two that specifically calls out high availability issues, needs,
or characteristics (see the bold italic text):
Phase 0: Assessment (scope)
Estimate the high availability primary variables (gauges)
Using the HA primary variables gauge to do estimations is extremely valuable at
this early stage in the life cycle.
Phase 1: Requirements
Detailed high availability primary variables
Detailed service level agreements/requirements
Detailed disaster recovery requirements
Fully detailing the HA primary variables, defining the SLAs, and putting
together the early disaster recovery requirements will position you to make well-
founded design decisions and HA solution decisions in later phases.
Phase 2: Design
Choose and design the matching high availability solution for the
application
In Phase 2 you select the HA solution that best meets your high availability
requirements.
Phase 3: Coding and Testing
Fully integrate the high availability solution with the application
Each step in coding and testing should include an understanding of the high
availability solution that has been chosen. Unit testing may also be required on
certain high availability options.
Phase 4: System Testing and Acceptance
Full high availability testing/validation/acceptance
Full-scale system and acceptance testing of the high availability capabilities
must be completed without any issues whatsoever. During this phase, a
determination of whether the high availability option truly meets the availability
levels must be strictly measured. If it doesn’t, you may have to iterate back to
earlier phases and modify your HA solution design.
Phase 5: Implementation
Production high availability build/monitoring begins
Finally, you will be ready to move your application and your thoroughly tested
high availability solution into production mode confidently. From this point,
your system will be live, and monitoring of the high availability application
begins.
Selecting an HA Solution
The HA selection process consists of evaluating your HA assessment findings
(requirements) using the hybrid decision-tree evaluation technique (with the Nassi-
Shneiderman charts) presented in Chapter 3. Recall that this decision tree technique
evaluates the assessment findings against the following questions:
1. What percentage of time must the application remain up during its scheduled
time of operation? (The goal!)
2. How much tolerance does the end user have when the system is not available
(planned or unplanned unavailability)?
3. What is the per-hour cost of downtime for this application?
4. How long does it take to get the application back online following a failure (of
any kind)? (Worst case!)
5. How much of the application is distributed and will require some type of
synchronization with other nodes before all nodes are considered to be 100%
available?
6. How much data inconsistency can be tolerated in favor of having the application
available?
7. How often is scheduled maintenance required for this application (and
environment)?
8. How important are high performance and scalability?
9. How important is it for the application to keep its current connection alive with
the end user?
10. What is the estimated cost of a possible high availability solution? What is the
budget?
By systematically moving through the decision tree and answering the case
constructs for each question, you can work through a definitive path to a particular
HA solution. This process is not foolproof, but it is very good at helping you hone in
on an HA solution that matches the requirements being evaluated. Figure 14.4 shows
an example of the Scenario 1 (application service provider [ASP]) results from using
this process. Remember that the questions are cumulative. Each new question carries
along the responses of the preceding questions. The responses, taken together,
determine the HA solution that best fits.
FIGURE 14.4 Scenario 1: ASP, Nassi-Shneiderman HA questions results with the
resulting HA selection.
As you can see, this analysis for Scenario 1, featuring an ASP, yielded a high
availability selection of hardware redundancy, shared disk RAID arrays, failover
clustering, SQL clustering, and AlwaysOn availability groups. Having these options
together clearly met all of the ASP’s requirements of uptime, tolerance, performance,
distributing workload, and costs. The ASP’s service level agreement with its
customers also allows for brief amounts of downtime to deal with OS upgrades or
fixes, hardware upgrades, and application upgrades. The ASP’s budget was enough
for a large amount of hardware redundancy.
Figure 14.5 shows the production implementation of the ASP’s HA solutions. It is a
two-node SQL cluster (in an active/passive configuration) along with an availability
group primary and three secondary replicas. The first secondary is the synchronous
failover node, and the other two are asynchronous read-only secondaries used for
possible reporting offloading and even disaster recovery. (Servers D and E are on a
separate network subnet and located in another data center.) This implementation is
proving to be a very scalable, high-performance, risk mitigating, and cost-effective
architecture for the ASP.
FIGURE 14.5 ASP high availability “live solution” with SQL clustering and
AlwaysOn availability groups.
Summary
Pushing through a formal HA assessment for your application, making an HA
selection, and planning its implementation put you just shy of the actual production
implementation of the HA solution. To implement the selected HA solution, you can
follow the detailed steps in the appropriate HA options chapters that correspond to
your particular selection results. You will be building up a test environment first,
then a formal QA environment, and finally a production deployment. You will find
that knowing how to implement any one of these HA options beforehand takes the
risk and guessing out of the whole process. If you have completely thrashed through
your HA requirements for your applications to an excruciating level of detail,
proceeding all the way to your production implementation will hopefully be mostly
anticlimactic. And, to top that off, you will also know how much money it will take
to achieve this HA solution and what the payback will be in terms of ROI if
downtime should occur (and how quickly you will achieve this ROI). You can safely
say you have considered all the essential factors in determining a highly available
solution and that you are ready to get that HA solution into place.
CHAPTER 15. Upgrading Your Current
Deployment to HA
IN THIS CHAPTER
Quantifying Your Current Deployment
Deciding What HA Solution You Will Upgrade To
Planning Your Upgrade
Doing Your Upgrade
Testing Your HA Configuration
Monitoring Your HA Health
Hopefully you are upgrading your current SQL Server deployment to high
availability as a sane and measured course of business and not because you have just
had a major disaster or an extended amount of unavailability. Either way, though,
you must actually have a lot of information and analysis available to get from where
you are now to where you need to be.
It all begins with understanding why you need HA and a full assessment of exactly
what type of HA you should have in place that meets your company's needs. From
your HA assessment, you will be able to determine what HA configuration you
should have in place (your target HA deployment). You also need to understand
exactly what your current deployment is composed of so that you can create a gap
analysis that shows the details of what you currently have and what you need to
put into place for HA. As a part of this gap analysis, you should also factor in what
disaster recovery solution you may need and add it to your planning exercise. As
mentioned in earlier chapters, it is much easier to include a DR solution now than it
used to be; you will have no regrets if you make DR part of your plans. Once you
have a full GAP analysis done, you can list the hardware, software, and cloud
components that you'll need to acquire to become fully operational in your planned
HA target. Due to the rapidly dropping prices of all these components, you will
likely be pleased with the price tag associated with this type of upgrade.
Other planning will be needed for operational tasks and education for the DevOps team, storage and capacity planning, data migration or a SQL Server license upgrade (for example, from Standard to Enterprise edition if you are using the AlwaysOn features), and a target date for the upgrade. It is a very good idea to plan on applying HA
to two environments: your staging or system test environment and, of course, your
production environment. You can fully test your HA capabilities in the
staging/system test environment and then switch to your production environment
when you know everything works. (You do not need to have HA in your dev or test
environments.)
Note
Some organizations don't think they need their staging/system test
environments to be HA. It is ultimately up to you. I don't like testing
HA in my production environments until I'm confident that I'm ready to
deploy it there.
Finally, you will be entering into a whole new world of HA performance monitoring
that will keep you informed about the constant health of your HA deployment. This
health monitoring is likely very different from what you have done in the past, in the
non-HA world. This chapter uses Scenario 1, featuring the application service
provider (ASP), as an example of planning and deploying a new HA solution. You will see that this is a whole lot easier to do than you might think, and you'll probably kick yourself for not moving to an HA solution sooner.
Note
For clarity, this chapter looks only at the database server(s), not the web
or application servers. However, this chapter does show the file server
that is the container of all database backups.
FIGURE 15.1 Scenario 1's original DB server configuration.
Next, you will have to fully test your application and make sure it functions properly
under failover and disaster recovery conditions.
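In SQL Server Management Studio this is a right-click operation on the availability group; in T-SQL it is a one-line statement. The following is a minimal sketch, assuming an availability group named PROD_AG (a placeholder name), run while connected to the secondary replica you intend to promote:
-- Planned manual failover (no data loss); run on the target secondary replica.
ALTER AVAILABILITY GROUP [PROD_AG] FAILOVER;
-- Forced failover (possible data loss); reserve this for true disaster situations.
-- ALTER AVAILABILITY GROUP [PROD_AG] FORCE_FAILOVER_ALLOW_DATA_LOSS;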
FIGURE 15.4 Monitoring SQL Server high availability with PerfMon counters.
The following are PerfMon counters you typically need to set up for SQL Server and
AlwaysOn availability groups (where PROD_DB01 is the name of the SQL Server
instance):
Counter Group                                Counter
Memory                                       Page Faults/sec
Memory                                       Available KBytes
MSSQL$PROD_DB01:Availability Replica         Bytes Received from Replica/sec
MSSQL$PROD_DB01:Availability Replica         Bytes Sent to Replica/sec
MSSQL$PROD_DB01:Availability Replica         Flow Control Time (ms/sec)
MSSQL$PROD_DB01:Availability Replica         Flow Control/sec
MSSQL$PROD_DB01:Availability Replica         Resent Messages/sec
MSSQL$PROD_DB01:Availability Replica         Sends to Replica/sec
MSSQL$PROD_DB01:Database Replica             Mirrored Write Transactions/sec
MSSQL$PROD_DB01:Buffer Manager               Buffer Cache Hit Ratio
MSSQL$PROD_DB01:Databases                    Transactions/sec
MSSQL$PROD_DB01:Databases(tempdb)            Transactions/sec
MSSQL$PROD_DB01:General Statistics           User Connections
MSSQL$PROD_DB01:Locks                        Lock Wait Time (ms)
MSSQL$PROD_DB01:Locks                        Lock Waits/sec
MSSQL$PROD_DB01:Memory Manager               Total Server Memory (KB)
MSSQL$PROD_DB01:Plan Cache                   Cache Hit Ratio
PhysicalDisk                                 Avg. Disk Queue Length
PhysicalDisk                                 Disk Reads/sec
PhysicalDisk                                 Disk Writes/sec
Process                                      % Processor Time
Process(sqlservr)                            % Processor Time
System                                       Processor Queue Length
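PerfMon is not your only window into AlwaysOn health. As a supplemental sketch (assuming the same instance), you can also poll the AlwaysOn dynamic management views for replica synchronization state:
-- Quick health check of every availability replica (run on the primary).
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc,
       ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag ON ars.group_id = ag.group_id;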
Tip
A little word to the wise: Do not turn on your HA configuration until
you have set up full monitoring. We think you understand why.
Summary
Upgrading to a viable HA configuration can be fairly easy, depending on the HA
options you are choosing. With SQL Server 2016, the process has certainly become
streamlined quite a bit. Upgrading often involves replacing or upgrading your
hardware and software stack to a much higher level to support more advanced HA
solutions. You should also go through a full HA assessment exercise so that you can
determine exactly which HA solution is the right one for you. You can then plan the
infrastructure upgrades, get the upgrades configured, upgrade your SQL Server
platform to SQL Server 2016, configure the right SQL Server HA configuration for
your needs, and migrate to that well-tested environment. Gone are the days of
month-long migrations to HA. On average, I can migrate a pretty large SQL Server
HA configuration in about 3 days, including fully testing the application, testing all
failover scenarios, and bringing a disaster recovery SQL Server instance online.
CHAPTER 16. High Availability and Security
IN THIS CHAPTER
The Security Big Picture
Ensuring Proper Security for HA Options
SQL Server Auditing
The subjects of security and high availability are rarely considered in the same
breath. However, as I have built numerous (and varying) types of high availability
solutions, I have noticed that one Achilles heel is always present: security. It is
crucial to properly plan, specify, manage, and protect the security-related portions of
high availability solutions.
Time after time, I have seen application failures and the need to recover an
application from a backup related to security breakdowns of some kind. In general,
nearly 23% of application failures or applications becoming inaccessible can be
attributed to security-related factors (see http://www.owasp.org). The following are
some examples of various security-related breakdowns that can directly affect the
availability of an application:
Data in tables are deleted (updated or inserted) by users who shouldn’t have
update/delete/insert privileges on a table, rendering an application unusable
(unavailable).
Database objects in production are accidentally dropped by developers (or
sysadmins), completely bringing down an application.
The wrong Windows accounts are used to start services for log shipping or data
replication, resulting in SQL Server Agent tasks not being able to communicate
with other SQL Servers to fulfill transaction log restores, monitor server status
updates, and process data distribution in data replication.
Hot standby servers are missing local database user IDs, resulting in the
application being inaccessible by a portion of the user population.
Unfortunately, these and other types of security-related breakdowns are often
neglected in high availability planning, but they often contribute to large amounts of
unavailability.
Much can be done in the early stages of planning and designing a high availability
solution to prevent such issues from happening altogether. You can take a general
object permissions and roles approach or an object protection approach, using
constraints or schema-bound views. Even more thorough testing of your applications
or better end-user training on their applications can reduce data manipulation errors
on the database that the application uses. One or more of these methods can be used
to directly increase your applications’ availability.
This MyDBadmin user (which can be a Microsoft SQL Server login or an existing
Microsoft Windows user account) can now create and drop tables in the current
database. Any user in the dbcreator or sysadmin server roles can also create and
drop tables. Your company’s group responsible for object maintenance in production
will only be given the MyDBadmin user ID to use, not sa. To get a quick
verification of what grants exist for a user, you can run the sp_helprotect system
stored procedure, as shown in this example:
EXEC sp_helprotect NULL, 'MyDBadmin'
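For reference, the setup assumed here might look something like the following sketch (hypothetical statements; the actual MyDBadmin login is created earlier in the chapter):
-- Hypothetical sketch: a login/user used only for object maintenance in production.
CREATE LOGIN MyDBadmin WITH PASSWORD = 'ChangeMe#2016!';
CREATE USER MyDBadmin FOR LOGIN MyDBadmin;
GRANT CREATE TABLE TO MyDBadmin;          -- statement-level permission
GRANT ALTER ON SCHEMA::dbo TO MyDBadmin;  -- required to create/drop objects in dbo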
Granting permissions to a database role rather than to individual user IDs has the same net effect but is much easier to manage. You can look at the protections for all statement-
level permissions in the current database by using the sp_helprotect system stored
procedure:
EXEC sp_helprotect NULL, NULL, NULL, 's'
As you may know, the ability to grant and revoke permissions via the GRANT and
REVOKE commands depends on which statement permissions are being granted and
the object involved. The members of the sysadmin role can grant any permission in
any database. Object owners can grant permissions for the objects they own.
Members of the db_owner or db_securityadmin roles can grant any permission on
any statement or object in their database.
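As a concrete illustration of the mechanics (using a hypothetical role name), granting a statement permission to a database role and later revoking it looks like this:
CREATE ROLE ObjectMaintRole;                  -- hypothetical maintenance role
GRANT CREATE TABLE TO ObjectMaintRole;        -- statement-level permission
REVOKE CREATE TABLE FROM ObjectMaintRole;     -- take it back when no longer needed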
Statements that require permissions are those that add objects in the database or
perform administrative activities with the database. Each statement that requires
permissions has a certain set of roles that automatically have permissions to execute
the statement. Consider these examples:
The CREATE TABLE permission defaults to members of the sysadmin,
db_owner, and db_ddladmin roles.
The permissions to execute the SELECT statement for a table default to the
sysadmin and db_owner roles, as well as the owner of the object.
There are some Transact-SQL statements for which permissions cannot be granted.
For example, to execute the SHUTDOWN statement, the user must be added as a
member of the serveradmin or sysadmin role, whereas dbcreator can execute
ALTER DATABASE, CREATE DATABASE, and RESTORE operations.
Taking a thorough, well-managed approach to permissions and access (user IDs and
the roles they have) will go a long way toward keeping your systems intact and
highly available.
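The next example protects a simple customer table. For reference, assume a table along these lines (a hypothetical re-creation; the chapter's actual MyCustomer table is created earlier):
CREATE TABLE [dbo].[MyCustomer]
(
    [CustomerID]   INT           NOT NULL PRIMARY KEY,
    [CustomerName] NVARCHAR(100) NOT NULL,
    [Phone]        NVARCHAR(30)  NULL
);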
This table has no outright protection to prevent it from being dropped by any user ID
that has database creator or object owner rights (such as sa). By creating a schema-
bound view on this table that will reference at least the primary key column of the
table, you can completely block a direct drop of this table. In fact, dropping this table
will require that the schema-bound view be dropped first (making this a formal two-
step process, which will drastically reduce failures of this nature in the future). You
might think this is a pain (if you are the DBA), but this type of approach will pay for
its built-in overhead time and time again.
Creating a schema-bound view requires you to use the WITH SCHEMABINDING clause in the view definition. The following is an example of how you would do this for the just-created MyCustomer table:
CREATE VIEW [dbo].[NODROP_MyCustomer]
WITH SCHEMABINDING
AS
SELECT [CustomerID] FROM [dbo].[MyCustomer]
Don’t worry, you will not be creating any grants on this view because its sole
purpose is to protect the table.
If you now try to drop the table:
DROP TABLE [dbo].[MyCustomer]
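The statement is rejected because the schema-bound view references the table. The error returned looks similar to the following (exact message text varies slightly by version):
Msg 3729, Level 16, State 1, Line 1
Cannot DROP TABLE 'dbo.MyCustomer' because it is being referenced by object 'NODROP_MyCustomer'.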
Here you have effectively and painlessly added an extra level of protection to your
production system, which will directly translate into higher availability.
To look at all objects that depend on a particular table, you can use the sp_depends
system stored procedure:
EXEC sp_depends N'MyCustomer'
As you can see, it shows the view you just created and any other dependent objects
that may exist:
Name Type
dbo.NODROP_MyCustomer view
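Note that sp_depends is deprecated in recent releases of SQL Server; the same check can be sketched with the newer dependency function instead:
SELECT referencing_schema_name, referencing_entity_name
FROM sys.dm_sql_referencing_entities(N'dbo.MyCustomer', N'OBJECT');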
Keep in mind that this initial method does not prohibit other types of changes to a table’s schema. (Those can still be made by using an ALTER TABLE statement.) The next section describes how to take this approach a bit further to also guard against schema changes that would cause an application to become unavailable.
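To also guard the column definitions, the schema-bound view must reference every column you want to protect. The following is a minimal sketch that assumes a CustomersTest table with the columns shown (a hypothetical column list; the actual table is created earlier in the chapter):
CREATE VIEW [dbo].[NODROP_CustomersTest]
WITH SCHEMABINDING
AS
SELECT [CustomerID], [CompanyName], [Phone], [Fax]
FROM [dbo].[CustomersTest]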
Then when we try to change the datatype and nullability of an existing column, this
operation fails (as it should):
ALTER TABLE [dbo].[CustomersTest] ALTER COLUMN [Fax]
NVARCHAR(30)
NOT NULL
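The errors returned look similar to the following (exact wording and the view name shown depend on your version and on what you named the schema-bound view):
Msg 5074, Level 16, State 1, Line 1
The object 'NODROP_CustomersTest' is dependent on column 'Fax'.
Msg 4922, Level 16, State 9, Line 1
ALTER TABLE ALTER COLUMN Fax failed because one or more objects access this column.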
This is a fairly safe method of protecting your applications from inadvertent table
alterations that can render your application useless (and effectively unavailable). All
these schema-bound methods are designed to minimize the human errors that can
and will take place in a production environment.
The log shipping monitor server is usually (and is recommended to be) a separate
SQL Server instance. The log_shipping_monitor_probe login is used to monitor
log shipping. Windows authentication can be used instead. If you use the log_shipping_monitor_probe login for other database maintenance plans, you must use the same password on every server that has this login defined. Behind the scenes, the log_shipping_monitor_probe login is used by the source and destination servers to update two log shipping tables in the msdb database, hence the need for cross-server consistency.
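If you want to see what the monitor is recording, those msdb tables can be queried directly. A rough sketch (table names as they exist in SQL Server 2005 and later):
-- Run on the monitor server (or wherever the monitor tables are maintained).
SELECT primary_server, primary_database, last_backup_file, last_backup_date
FROM msdb.dbo.log_shipping_monitor_primary;

SELECT secondary_server, secondary_database, last_restored_file, last_restored_date
FROM msdb.dbo.log_shipping_monitor_secondary;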
Very often, the network share becomes unavailable or disconnected. This results in a
copy error to the destination transaction log backup directory (share). It’s always a
good idea to verify that these shares are intact or to establish a procedure to monitor
them and re-create them if they are ever disconnected. After you have reestablished
this share, log shipping will be able to function fully again.
Make sure your logins/user IDs are defined in the destination server. Normally, if
you intend the destination to act as a failover database, you must regularly
synchronize the SQL Server logins and user IDs anyway. Double-check that each
login has the proper role that was present in the source database. Failing to sync the logins causes many headaches during a primary role change.
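A quick way to spot trouble on the destination is to look for database users whose security IDs no longer map to a server login (orphaned users). A sketch of that check:
-- Run in the destination (secondary) database after a role change.
SELECT dp.name AS orphaned_user
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
       ON dp.sid = sp.sid
WHERE dp.type IN ('S', 'U', 'G')   -- SQL, Windows, and Windows group users
  AND dp.name NOT IN ('dbo', 'guest', 'INFORMATION_SCHEMA', 'sys')
  AND sp.sid IS NULL;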
Last but not least, if you are log shipping a database from a SQL Server in one
domain to a SQL Server in another domain, you have to establish a two-way trust
between the domains. You can do this with the Active Directory Domains and Trusts tool, found under Administrative Tools. The downside of using a two-way trust is that it opens a fairly broad trust relationship that applies not just to SQL Server but to any other Windows-based applications. Most log shipping is done within a single domain to
maintain the tightest control possible.
Note
The transport security for AlwaysOn availability groups is the same as
for database mirroring. It requires CREATE ENDPOINT permission or
membership in the sysadmin fixed server role. It also requires CONTROL
ON ENDPOINT permission.
Tip
It is a good idea to use encryption for connections between server
instances that host AlwaysOn availability group replicas.
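If you script your endpoints, encryption is specified directly in the CREATE ENDPOINT statement. A minimal sketch (the endpoint name and port match common defaults, but treat them as placeholders):
CREATE ENDPOINT [Hadr_endpoint]
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (ROLE = ALL, ENCRYPTION = REQUIRED ALGORITHM AES);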
FIGURE 16.4 Log File Viewer showing the audit events of a Server Audit object.
It’s up to your security and audit team to decide how to use these audits. It is
recommended that you create your audit specifications with scripts so that you can
easily manage them and not have to re-create them via SSMS dialogs.
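A scripted server audit might look like the following sketch (the audit names, action groups, and file path are placeholders to adapt to your own policy):
USE master;
CREATE SERVER AUDIT [HA_Server_Audit]
    TO FILE (FILEPATH = N'D:\AuditLogs\');
CREATE SERVER AUDIT SPECIFICATION [HA_Server_Audit_Spec]
    FOR SERVER AUDIT [HA_Server_Audit]
    ADD (FAILED_LOGIN_GROUP),
    ADD (SERVER_ROLE_MEMBER_CHANGE_GROUP),
    ADD (AUDIT_CHANGE_GROUP)
    WITH (STATE = ON);
ALTER SERVER AUDIT [HA_Server_Audit] WITH (STATE = ON);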
Summary
This entire chapter is devoted to security considerations and how security affects
high availability. Lessons are often “hard lessons” when it comes to systems that are
highly available, and you can avoid much anguish if you give enough attention and
planning to the security ramifications up front. This chapter outlines the key security
points that can become mismanaged or broken for each HA option presented in this
book and a few general security techniques that can be applied to all your production
implementations.
Many of the security techniques described in this chapter are commonsense methods
such as preventing tables from being dropped in production by using a schema-
bound view approach. Others are more standards and infrastructure oriented, such as
using domain accounts for clustering and common SQL accounts for data replication
or starting SQL Server Agent services. Together, they all add up to stability and
minimizing downtime. Getting these types of security practices in place in your
environment will allow you to achieve or exceed your high availability goals much
more easily.
Remember that security risks appear due to architectural problems or holes in applications (such as SQL injection). The main aim of security for software is to fail safely and carefully and to limit the damage. I don’t want to read about your failure in the newspaper or on Twitter.
CHAPTER 17. Future Direction of High
Availability
IN THIS CHAPTER
High Availability as a Service (HAaaS)
100% Virtualization of Your Platforms
Being 100% in the Cloud
Advanced Geo Replication
Disaster Recovery as a Service?
The next time I update this book, I will only be talking about 100% high availability
solutions (not five 9s). For many, this is already a reality. But for still way too many,
this is not possible yet due to backward compatibility, budget restrictions, security
concerns, and a host of other factors.
The advancements that Microsoft and the rest of the industry have made in the past 5
years in terms of high availability options are nothing short of staggering. Having
been a part of the Silicon Valley high-technology industry for all of my 30+-year
career, I've always been at the forefront of these types of advancements. I spent
many years architecting global solutions for multi-billion-dollar corporations that
had some of the most severe high availability requirements that have likely ever
existed. But today, mere mortals can easily reach stratospheric high availability
levels without even breaking a sweat. Am I out of a job? Actually, it would be
incredible if high availability architects weren't needed anymore. Likely this is
exactly what the future has in store. (I have other skills, I'll be fine, not to worry!)
The industry is truly headed toward being out-of-the-box highly available and at
staggering levels of availability. A more modern way to express this future trend
might be to say that high availability will become a service that you merely enable—
high availability as a service (HAaaS).
Summary
High availability depends on laying a fundamentally sound foundation that you can
count on when failures occur. Then you need to determine how much data loss you
can tolerate, how much downtime is possible, and what the downtime costs you.
That’s the reality right now. However, emerging quickly are services like HAaaS and
DRaaS that will change the way you think about enabling your business to be 100%
available for every application you have. Granted, this will require many years of
migrations for organizations that have much invested in their current infrastructure
and application portfolios. But, they will get there; costs will drive them there.
Conclusion
I often talk about global risk mitigation, which involves spreading out your business
capabilities and data across the globe to mitigate against loss at any one place (for
example, if a region or a data center fails). This could even be a mitigation strategy
for small, local companies. As the pipes get bigger and the applications or services
become globally aware and able to distribute themselves out across resources in
many parts of the world, you greatly reduce your risk of loss and increase your
chances of survival. In addition, you gain 100% availability of your applications and
data with, in theory, zero data loss. All that is left for you to do is keep the dial on
100% and enjoy the warmth of living the HA life.
Index
A
Active Geo Replication, cloud computing, 270–271
Active Geo replication, disaster recovery (DR), 330
active multisite DR pattern, 319
active/active configuration, failover clustering, 82
active/active DR sites pattern, 318–319
active/passive configuration, failover clustering, 81
active/passive DR sites pattern, 316–317
activity logs, clusters, 295
Add a Failover Cluster Node Wizard, 109
Add Node Rules dialog, 110
adding, HA elements, to development methodologies, 76–77
advanced geo replication, 395–396
Agent History Clean Up: distribution, 213
ALTER DATABASE command, 180, 186
ALTER ANY AVAILABILITY GROUP permission, 385
AlwaysOn
availability groups, 21, 39–40, 54, 122
availability group listeners, 124
Azure, 43–45
configuring, 95–96
disaster recovery (DR), 124
endpoints, 125
failure, 369
investment portfolio management scenario, 145–147
modes, 122–123
read-only replicas, 123
security, 384–385
cloud computing, 265–268
dashboard, 143–144
FCI (failover cluster instance), 120–122
multinode AlwaysOn configuration, 125–126
backing up databases, 130
connecting with listeners, 141
creating availability groups, 131–132
enabling AlwaysOn HA, 129–130
failing over to secondary, 141–142
failover clustering, 126–128
identifying replicas, 133–135
listeners, 138–140
preparing database, 129
selecting databases, 132–133
verifying SQL Server instances, 126
synchronizing data, 135–138
use cases, 119–120
WSFC (Windows Server Failover Clustering), 87–88
Amazon Web Services (AWS), 273
Apache Hadoop, 273
Apache Spark, 280
Apache Storm, 280
application clustering, 45–46
application data values, 322
application resiliency, 11, 349
application service providers (ASPs), 18
SQL Server clustering, 114–117
application service providers (ASPs) assessments, 57–64
application types, availability, 10
applications
assessing existing applications, 16–17
isolating, 34
articles, data replication, 200–201
filtering, 201–205
AS SNAPSHOT OF statement, 160
ASPs (application service providers), 18
assessments, 57–64
business scenarios, 18–19
selecting HA solutions, 353–354
SQL Server clustering, 114–117
assessments
ASPs (application service providers) assessments, 57–64
call-before-you dig call center, 71–74
investment portfolio management, 68–71
Phase 0 HA assessment, 49, 345–346
conducting, 346–348
gauging HA primary variables, 348–349
worldwide sales and marketing, 64–68
asynchronous mode, AlwaysOn availability groups, 123
asynchronous operations, database mirroring, 172
auditing, SQL Server auditing, 385–388
automatic failover, database mirroring, 173
availability, 79
calculating, 6
availability continuum, 8–10
availability examples, 24X7X365 application, 6–8
cloud computing, 265–268
creating for multinode Alwayson configuration, 131–132
dashboard and monitoring, 143–144
disaster recovery (DR), 124
endpoints, 125
investment portfolio management scenario, 145–147
Microsoft Azure, 54
modes, 122–123
read-only replicas, 123
use cases, 119–120
availability trees, 4
availability variables, 10–12
AVG (AlwaysOn availability groups), 54
AWS (Amazon Web Services), 273
Azure, 273
availability groups, 54
big data options, 274
cloud computing, 271
big data use cases, 300–301
Cognitive Services, 276
Data Factory, 278
Data Lake Analytics, 277, 281–282
data lake services, 278
Data Lake Store, 277–278, 282–283
disaster recovery (DR), 330
HDInsight. See HDInsight
high availability, data redundancy, 283–285
Machine Learning Studio, 276
Machine Learning Web Service, 276
Power BI Embedded, 278
SQL database, 54
Stream Analytics, 276
Stretch Database, 54
Azure SQL databases, 43–45
cloud computing, 268–270
B
backing up
data replication, 231–233
databases, 130, 388
virtual machines (VM), 308–310
Backup Exec, 310–311
backups, 24–25
baselines, database snapshots, 157–158
BI integration, Azure big data use cases, 301
big data distributors, 277–278
big data options
Azure. See Azure
seven-step big data journey from inception to enterprise scale, 297–299
big data solutions, considerations for, 299–300
big data use cases, Azure, 300–301
block-level striping with distributed parity, RAID 5, 30–31
brand promotion
assessments, 64–68
business scenarios, 19
breaking down, database snapshots, 165
business scenarios
ASPs (application service providers), 18–19
call-before-you dig call center, 20
investment portfolio management, 20
worldwide sales and marketing, brand promotion, 19
C
calculating
availability, 6
ROI (return on investment), 48, 75–76
for ASPs, SQL Server clustering, 116–117
investment portfolio management, AlwaysOn, 146–147
call-before-you dig call center
assessments, 71–74
business scenarios, 20
log shipping, 252–254
CDC (change data capture), 41
disaster recovery (DR), 327–328
CDP (continuous data protection), 310
central publisher, data replication, 206–207
central publisher with remote distributor, data replication, 207–208
change data capture (CDC), 41
clients, database mirroring (setup and configuration), 189
cloud computing, 257
100% cloud computing, 394–395
DDoS (distributed denial-of-service) attack, 258
hybrid approaches, 259
AGR (Active Geo Replication), 270–271
AlwaysOn availability groups, 265–268
Azure big data options, 271
Azure SQL databases, 268–270
extending log shipping to the cloud, 262–263
extending replication topology to the cloud, 260–262
Stretch Database, 264–265
Cloudera, 277
Cluster Node Configuration dialog, 111
cluster resource group, specifying, 104–105
Cluster Shared Volumes (CSV), 305–306
Cluster Validation, 304–305
Cluster Validation Wizard, 83
clustered virtual machine replication, 307
clustering
how it works, 81–82
SQL Server clustering. See SQL Server clustering
clusters
activity logs, 295
deleting, 296–297
Hadoop, 285
Hadoop commands, 295–296
HDInsight, 285
high availability HDInsight clusters, creating, 285–294
hive, 296
logging into, 295
networking configuration, 306–307
WSFC (Windows Server Failover Clustering), 36–37
Cognitive Services, Azure, 276
combining failover clustering with scale-out options, 125
configuration data, 322
configuring
AlwaysOn availability groups, 95–96
data replication, 214–215
creating publications, 217–220
creating subscriptions, 220–226
enabling distributors, 215–216
publishing, 217
turning subscribers into publishers, 227
database mirroring. See database mirroring
creating database on mirror servers, 178–180
creating endpoints, 176–178
granting permissions, 178
identifying endpoints, 179–182
SQL clustering, 94–95
SQL Server database disks, 96–97
connecting with listeners, multinode AlwaysOn configuration, 141
continuous data protection (CDP), 310
CONTROL AVAILABILITY GROUP permission, 385
CONTROL ON ENDPOINT permission, 385
CONTROL SERVER permission, 385
copy-on-write technology, database snapshots, 153, 154–155
cost of downtime, 1–2, 12, 349
cost to build and maintain the high availability solution ($), 12, 349
cost-effectiveness, determining for HA solutions, 354–356
CREATE AVAILABILITY GROUP server permission, 385
CREATE DATABASE command, 160
CREATE ENDPOINT permission, 385
CREATE permissions, 376
CREATE TABLE permissions, 376, 377
created roles, 376–377
CSV (Cluster Shared Volumes), 305–306
current deployment, quantifying, 360
D
dashboard, AlwaysOn, 143–144
data, synchronizing for multinode AlwaysOn configuration, 135–138
Data Factory, Azure, 278
Data Lake Analytics, 277
Azure, 281–282
data lake services, Azure, 278
Data Lake Store, 277–278
Azure, 282–283
data lakes, 282
data latency, log shipping, 238–239
data latency restrictions, 42
log shipping, 239
data redundancy, Azure, 283–285
ZRS (zone-redundant storage), 284
data replication, 21–22, 40–41, 54, 195
articles, 200–201
filtering, 201–205
backing up, 231–233
configuring, 214–215
creating publications, 217–220
creating subscriptions, 220–226
enabling distributors, 215–216
publishing, 217
switching over to warm standby (subscribers), 226–227
turning subscribers into publishers, 227
defined, 198–199
disaster recovery (DR), 323–325
distribution server, 200
merge replication, 196–197
monitoring, 227
publications, 200–201
publisher server, 199–200
scenarios
central publisher, 206–207
central publisher with remote distributor, 207–208
distribution database, 209–210
replication agents, 209–213
subscriptions, 208
subscriptions, pull subscriptions, 208–209
subscriptions, push subscriptions, 209
security, 383–384
snapshot replication, 196
SQL statements, 228
SSMS (SQL Server Management Studio), 228–230
subscription server, 200
transactional replication, 196
triggers, 213
user requirements, 213
Windows Performance Monitor, 230–231
worldwide sales and marketing scenario, 233–235
data resiliency, 11, 349
data warehouse on demand, Azure big data use cases, 300
database audit specification, 387
database disks, configuring, 96–97
database load state, 250
database log shipping tasks, 242–252
database mirroring, 42–43, 150, 168–171
client setup and configuration, 189
configuring
creating database on mirror servers, 178–180
creating endpoints, 176–178
granting permissions, 178
identifying endpoints, 179–182
disaster recovery (DR), 326–327
endpoints, 174
operating modes, 172–173
preparations for, 174–176
removing, 185–187
reporting database, 159–160
roles, 171–172
setting up against database snapshots, 190
testing failover from the principal to the mirror, 187–189
when to use, 171
Database Mirroring Monitor, 182–185
Database Mirroring Wizard, 173
Database Properties Mirroring page, 175, 188
Database Properties page, 244, 247
DATABASE RESTORE command, 166
database roles, 377
database snapshot sparse file, 153
database snapshots, 21, 42–43, 149, 150–154
breaking down, 165
copy-on-write technology, 154–155
creating, 161–165
disaster recovery (DR), 326–327
naming, 161
point-in-time reporting database, 158–159
providing a testing starting point, 157–158
reciprocal principal/mirror reporting configuration, 190–191
reporting database from a database mirror, 159–160
reverting source databases, 166–167
reverting to, 153, 155–156, 166
safeguarding databases prior to making changes, 157
security, 168, 384
setting up against database mirroring, 190
source databases, 168
sparse files, 151–152
size management, 168
for testing and QA (quality assurance), 167
databases
backing up
for multinode AlwaysOn configuration, 130
security, 388–389
creating on mirror servers, 178–180
detaching, 339
preparing for multinode AlwaysOn configuration, 129
selecting for multinode AlwaysOn configuration, 132–133
data-centric approach, disaster recovery (DR), 322–323
DDoS (denial-of-service) attack, 258
decision-tree approach
ASPs (application service providers) assessments, 57–64
call-before-you dig call center, assessments, 71–74
choosing HA solutions, 55–57
investment portfolio management, assessments, 68–71
worldwide sales and marketing, assessments, 64–68
degree of distributed access/synchronization, 11, 349
deleting, clusters, 296–297
denial-of-service (DDoS) attack, 258
deployment, quantifying, 360
design
approach for achieving high availability, 13–14
log shipping, 239–240
destination pairs, 237–238
detaching, databases, 339
development life cycle, high availability tasks, 350–352
development methodology
adding HA elements, 76–77
with high availability built-in, 14–16
spiral/rapid development methodology, 16
disaster recovery as a service (DRaaS), 309, 311, 340, 396–397
disaster recovery (DR), 309, 313–316
AlwaysOn availability groups, 124
data-centric approach, 322–323
detaching, databases, 339
DR (disaster recovery), 340
executing, 338
focus of, 331–335
levels, 315–316
Microsoft, 323
Active Geo replication, 330
AlwaysOn availability groups, 328–329
Azure, 330
CDC (change data capture), 327–328
database mirroring, 326–327
database snapshots, 326–327
log shipping, 325–326
Microsoft Azure, data replication, 323–325
patterns
active multisite DR pattern, 319
active/active DR sites pattern, 318–319
active/passive DR sites pattern, 316–317
choosing, 320–321
planning, 338
security, 388–389
sqldiag.exe, 335–338
third-party disaster recovery, 339
disaster recovery process, 330–331
disaster recovery pyramid, 315
disk methods, 53
disk mirroring, 26–27
disk striping, 28–29
Distributed Resource Scheduler (DRS), 307
distribution agents, data replication, 212–213
Distribution Clean Up: distribution, 213
distribution database, data replication, 209–210
distribution server, data replication, 200
distributors, data replication, 215–216
DR (disaster recovery). See disaster recovery (DR)
DRaaS (disaster recovery as a service), 309, 311, 340, 396–397
DROP DATABASE command, 165, 186
DROP ENDPOINT command, 181, 186
DRS (Distributed Resource Scheduler), 307
DXD AlwaysOn configuration, 126
dynamic management views, AlwaysOn availability groups, 143–144
E
EMC, disaster recovery (DR), 339
endpoints
AlwaysOn availability groups, 125
database mirroring, 173, 174
creating, 176–178
enterprise resource planning (ERP), 323
ERP (enterprise resource planning), 323
ERP system, mirrored disk, 27
ETL (extract, transform, load) automation, Azure big data use cases, 301
ETL (extract, transform, load) processing, 301
exec sp_configure, 333
exec sp_helpdb dbnameXYZ, 333
exec sp_helplinkedsrvlogin, 332
exec sp_helplogins, 332
exec sp_helpserver, 332
exec sp_linkedservers, 332
exec sp_server_info, 332
exec sp_spaceused, 333
executing, disaster recovery (DR), 338
Expired Subscription Clean Up, 213
extending
log shipping to the cloud, 262–263
NLB (network load balancing), WSFC (Windows Server Failover Clustering),
86–87
replication topology to the cloud, 260–262
F
failing over to secondary, multinode AlwaysOn configuration, 141–142
failover cluster instance. See FCI (failover cluster instance)
Failover Cluster Manager, 107–108
failover clustering, 53, 304–306
combining with scale-out options, 125
implementing, 81–82
installing, 89–94
recommendations, 97
setting up, 126–128
variations of, 80–81
WSFC (Windows Server Failover Clustering), 82–86
failure, simulating, 369
fault tolerance recommendations, SQL Server clustering disk, 97
fault tolerant disks, creating, RAID and mirroring, 26–27
FCI (failover cluster instance), 80
AlwaysOn, 120–122
filtering articles, data replication, 201–205
fixed server roles, 377
forced service, database mirroring, 173
foundation components for HA, 24–25
foundational elements, 341–343
four-node configuration, AlwaysOn availability groups, 121
four-step process for moving toward high availability, 47–48
determining optimal HA solutions, 53
gauging HA primary variables, 52–53
justifying cost of HA solutions, 75
launching Phase 0 HA assessment, 49–52
FROM DATABASE_SNAPSHOT statement, 166
full table structure protection, schema-bound views, 379–380
G
gauging HA primary variables, 52–53
geo-redundant storage (GRS), 285
global risk mitigation, 398
goodwill dollar loss, 51–52, 348
Google BigQuery, 280
granting permissions, database mirroring, 178
H
HA (high availability), monitoring, 370–372
HA assessment, conducting, 346–348
HA assessment project, setting schedules/timelines, 344–345
HA assessment teams, assembling, 343–344
HA configurations, testing, 369–370
HA elements, adding, to development methodologies, 76–77
HA primary variables, gauging, 52–53
HA project lead, 343
HA solutions
choosing with decision-tree approach, 55–57
deciding what to upgrade to, 363–364
determining cost-effectiveness, 354–356
determining optimal HA solutions, 53
justifying cost, 75
selecting, 352–354
HAaaS (high availability as a service), 391–392
Hadoop, 273, 274, 279
clusters, 285, 295–296
HDInsight, 279
Hadoop Distributed File System (HDFS), 271
hardware, 24
HBase, 280
HDFS (Hadoop Distributed File System), 271
HDInsight, 271
Azure, 274, 276, 279, 280
Apache Spark, 280
Apache Storm, 280
creating highly available clusters, 285–294
HDInsight clusters, 285
HDP (Hortonworks Data Platform), 274
HDP for Windows, 274
high availability
Azure, data redundancy, 283–285
Azure big data options, cloud computing, 271
Azure SQL databases, cloud computing, 268–270
data replication. See data replication
general design approach, 13–14
log shipping, cloud computing, 262–263
overview, 1–6
Stretch Database, cloud computing, 264–265
high availability as a service (HAaaS), 391–392
high availability tasks, integrating into development life cycle, 350–352
hive, clusters, 296
horizontal filtering, 201–203
Hortonworks, 277
Hortonworks Data Platform (HDP), 274
hybrid approaches
to cloud computing, 259
cloud computing
AGR (Active Geo Replication), 270–271
AlwaysOn and availability groups, 265–268
Azure big data options, 271
Azure SQL databases, 268–270
extending log shipping to the cloud, 262–263
extending replication topology to the cloud, 260–262
Stretch Database, 264–265
hybrid HA selection method, 53–55
Hyper-V Manager, 306
I
IaaS (infrastructure as a service), 3
identifying
endpoints, database mirroring, 179–182
replicas for multinode AlwaysOn configuration, 133–135
implementing, failover clustering, 81–82
infrastructure as a service (IaaS), 3
inherited from the source database, 168
Install a SQL Server Failover Cluster Wizard, 101
installing
failover clustering, 89–94
SQL Server clustering, with WSFC, 100–113
integrated Hypervisor replication, 310
integrating high availability tasks into development life cycle, 350–352
interoperability, 79
investment portfolio management
AlwaysOn availability groups, 145–147
assessments, 68–71
business scenarios, 20
database snapshots and database mirroring, 192–194
isolating
applications, 34
SQL roles, 388–389
iterative exploration, Azure big data use cases, 300
J-K
join filters, 203
justifying cost of HA solutions, 75
L
levels, 315–316
listeners
connecting with, multinode AlwaysOn configuration, 141
setting up, for multinode AlwaysOn configuration, 138–140
live migration, 307–308
locally redundant storage (LRS), 284–285
log reader agents, data replication, 212
log shipping, 22, 41–42, 54, 237–238
logging into clusters, 295
LRS (locally redundant storage), 284–285
M
Machine Learning Studio, Azure, 276
Machine Learning Web Service, Azure, 276
manual failover
database mirroring, 173
testing, 370
merge replication, 196–197
metadata, 322
Microsoft
options for disaster recovery (DR), 323. See also disaster recovery
Microsoft Analytics Platform System, 274
Microsoft Azure. See Azure
Microsoft Cluster Services (MSCS), 36
Microsoft high availability options, 35–36
Microsoft Parallel Data Warehouse (PDW), 274
Microsoft Windows 2012 hypervisor virtual machines, 304
Microsoft Windows and SQL Server Product Updates dialog, 109
mirror database servers, 170
mirror roles, database mirroring, 172
mirror servers, creating databases on, 178–180
mirrored database environments, monitoring, 182–185
mirrored stripes, 32
mirroring
creating fault tolerant disks, 26–27
RAID 1, 29–30
miscues, security, 380
mitigating risk, server instances, 33–35
modes, AlwaysOn availability groups, 122–123
monitoring
AlwaysOn, 143–144
data replication, 227
HA (high availability), 370–372
mirrored database environments, 182–185
MSCS (Microsoft Cluster Services), 36
multinode AlwaysOn configuration. See AlwaysOn, multinode AlwaysOn
configuration
multisite failover clustering, 80
multisite SQL Server failover clustering, 114
N
naming database snapshots, 161
Nassi-Shneiderman chart, 55, 56
network load balancing (NLB), 38
network priority, 306–307
New Availability Group Wizard, 132
New Job Schedule page, 246
New Publication Wizard, 217
NLB (network load balancing), 38
extending, WSFC (Windows Server Failover Clustering), 86–87
No Recovery mode, 250
non-logged operations, 212
NoSQL databases, 279–280
NOT FOR REPLICATION, 214
O
object protection using schema-bound views, 377–380
OLTP, 6
OLTP (online transaction processing), 195
Open Web Application Security Project (OWASP), 375
operating modes, database mirroring, 172–173
other hardware, hybrid HA selection method, 53
OWASP (Open Web Application Security Project), 375
P
PaaS, 395
parity, 30
part-time senior technical lead (STL), 343
patterns, disaster recovery (DR)
active multisite DR pattern, 319
active/active DR sites pattern, 318–319
choosing, 320–321
PDW (Microsoft Parallel Data Warehouse), 274
PerfMon, monitoring, HA (high availability), 370–371
performance, 11, 349
performing, upgrades, 368
permissions
granting for database mirroring, 178
security, 376–377
Transact-SQL statements, 377
Phase 0 HA assessment, 16, 47–48, 345–346
conducting HA assessment, 346–348
gauging HA primary variables, 348–349
launching, 49–52
resources, 49
tasks, 49–52
planned downtime, 4
planning
disaster recovery (DR), 338
upgrades, 367
point-in-time reporting database, snapshot databases, 158–159
poor man’s replication. See log shipping
Power BI Embedded, Azure, 278
predictive analysis, Azure big data use cases, 301
preparing, database snapshots, multinode AlwaysOn configuration, 129
primary key column only, schema-bound views, 378–379
principal database server, 170
principal roles, database mirroring, 172
problems, SQL Server failover clustering, 113
productivity dollar loss (per hour of unavailability), 51, 348
publication validation stored procedures (sp_publication_validation), 228
publications, data replication, 200–201, 217–220
publisher server, data replication, 199–200
publishing, data replication, 217
pull subscriptions, data replication, 208–209
push subscriptions, data replication, 209
Q
QA (quality assurance), 25
database snapshots, 167
quantifying current deployment, 360
quorum drives, 84–86
quorum resources, 85
R
R, HDInsight, 280
RAID (redundant array of independent disks),
creating fault tolerant disks, 26–27
increasing system availability, 27–28
RAID 0, 28–29
RAID 0+1, 32
RAID 0/1, 32
RAID 1, 29–30
RAID 01, 32
RAID 1+0, 32–33
RAID 1/0, 32–33
RAID 5, 6, 30–31
RAID 10, 32–33
RAID levels, 28
RDMA (remote direct memory access), 308
read-access geo-redundant storage (RA-GRS), 284
read-only replicas, AlwaysOn availability groups, 123
real-time processing, Apache Storm, 280
reciprocal DR, 321
reciprocal principal/mirror reporting configuration, 190–191
recovery
data replication, 231–233
reverting to database snapshots, 155–156, 166
recovery objectives, 321–322
recovery point objective (RPO), 321–322
recovery time objective (RTO), 321–322
redundant array of independent disks. See RAID (redundant array of independent
disks)
Reinitialize Subscriptions Having Data Validation Failures, 213
reliability, 79
remote direct memory access (RDMA), 308
removing
clusters, 296–297
database mirroring, 185–187
replicas, identifying, for multinode AlwaysOn configuration, 133–135
replication
advanced geo replication, 395–396
clustered virtual machine replication, 307
integrated Hypervisor replication, 310
replication agents, data replication
distribution agents, 212–213
Replication Agents Checkup, 213
Replication Monitor, 229–230
replication topology, extending to the cloud, 260–262
reporting database, database mirroring, snapshots, 159–160
RESTORE DATABASE command, 156, 157
RESTORE LOG command, 180
restoring, databases, security, 388–389
return on investment (ROI). See ROI (return on investment)
revenue loss (per hour of unavailability), 51, 347
reverting, 150
Revolution Analytics, 277
risk, mitigating, by spreading out server instances, 33–35
ROI (return on investment)
calculating, 48, 75–76
for ASPs, SQL Server clustering, 116–117
investment portfolio management, AlwaysOn, 146–147
cost-effectiveness, 355–356
data replication, worldwide sales and marketing, 234–235
role switching, database mirroring, 173
roles
created roles, 376–377
database mirroring, 171–172
database roles, 377
fixed server roles, 377
security, 376–377
RPO (recovery point objective), 321–322
RTO (recovery time objective), 321–322
S
SaaS (software as a service), 3
SA/DA (system architect/data architect), 49, 343
safeguarding databases prior to making changes, database snapshots, 157
SAN storage, 305
SAN using Ethernet, 305
SAN using host bus adapter (HBA), 305
SBA (senior business analyst), 49, 343
scalability, 11, 79, 349
scale-out options, combining, with failover clustering, 125
scaling out, 86
Scenario 1
original environment list, 361–363
target HA environment list, 365–367
scheduled maintenance frequency, 11, 349
schedules, setting for HA assessment project, 344–345
schema-bound views, object protection, 377–380
secondary, failing over to secondary, multinode AlwaysOn configuration, 141–142
security, 374–375
AlwaysOn availability groups, 384–385
data replication, 383–384
database snapshots, 168, 384
disaster recovery (DR), 388–389
log shipping, 381–383
miscues, 380
object permissions and roles, 376–377
object protection using schema-bound views, 377–380
SQL clustering, 380–381
SQL Server auditing, 385–388
user IDs, 376
vulnerabilities, 375
Windows accounts, 376
security resynchronization readiness, 331
security-related breakdowns, 373
senior business analyst (SBA), 49, 343
senior technical lead (STL), 49, 343
server audit, 387
server audit specification, 387
server failover clusters, 36–37, 82–83
server instance isolation, 33
server instances, spreading out, 33–35
Server Message Block (SMB), 305
@@SERVERNAME, 332
service level agreements (SLAs), 17–18
@@SERVICENAME, 332
seven-step big data journey from inception to enterprise scale, 297–299
shared disk array, 84
shared nothing disk arrays, 84
Ship Transaction Logs task, 242–243
simulating, failure, 369
SIOS, disaster recovery (DR), 339
size management, sparse files, database snapshots, 168
SLAs (service level agreements), 17–18
SMB (Server Message Block), 305
SMB3 file server, 305
snapshot agents, data replication, 210–212
snapshot databases, database snapshots, 153
snapshot replication, 196
software as a service (SaaS), 3
software scaling, 86
software upgrades, 25
source server failure, log shipping, 252
Spark, 280
sparse files, 151–152, 153
size management, 168
spiral/rapid development methodology, 16
split-brain scenario, 84–85
SQL Server clustering, 21, 37–39, 54, 80, 97
configuring, 94–95
FCI (failover cluster instance), 80
security, 380–381
WSFC (Windows Server Failover Clustering), 87–88
SQL database, Microsoft Azure, 54
SQL roles, isolating, 388–389
SQL Server 2000 instance, 34
SQL Server Agent, 95
SQL Server Audit feature, 386
SQL Server auditing, 385–388
SQL Server clustering, 99–100
ASPs (application service providers), 114–117
installing, with WSFC, 100–113
SQL Server database disks, configuring, 96–97
SQL Server failover clustering, 102
potential problems, 113
SQL Server instances, verifying, 126
SQL Server Management Studio. See SSMS
SQL Server network name, 103–104
SQL statements, data replication, 228
sqldiag.exe, 335–338
SQLServer:Replication Agents, 230
SQLServer:Replication Dist, 230
SQLServer:Replication Logreader, 231
SQLServer:Replication Merge, 231
SQLServer:Replication Snapshot, 231
SSMS (SQL Server Management Studio), 228–230
monitoring HA (high availability), 371–372
testing failover from the principal to the mirror, 187–189
Standby mode, log shipping, 250, 251
STL (senior technical lead), 49, 343
store-and-forward data distribution model, 198
stored procedures, publication validation stored procedures
(sp_publication_validation), 228
Stream Analytics, Azure, 276
Stretch Database, 54
creating to the cloud, 264–265
stripe of mirrors, RAID, 32–33
subscribers, switching over to warm standby, 226–227
subscription server, data replication, 200
subscriptions
data replication, 208, 220–226
pull subscriptions, 208–209
push subscriptions, 209
turning subscribers into publishers, 227
Symantec, 339
synchronizing, data, multinode AlwaysOn configuration, 135–138
synchronous mode, AlwaysOn availability groups, 123
synchronous operations, database mirroring, 172
system architect/data architect (SA/DA), 49, 343
system stack, 3
T
teams, HA assessment teams, assembling, 343–344
testing
database snapshots, 167
as a starting point, 157–158
failover from the principal to the mirror, SSMS, 187–189
HA configurations, 369–370
third-party disaster recovery, 339
time slices, 83
time to recover, 11, 348
timelines, setting for HA assessment project, 344–345
tolerance of recovery time, 11, 348
training, 25
Transaction Log Backup Settings page, 245
transaction record compression, AlwaysOn availability groups, 123
transactional replication, 196
Transact-SQL statements, permissions, 377
triggers, data replication, 213
two-node SQL Server failover clustering configuration in active/passive mode, 101
U
United States, DDoS (distributed denial-of-service) attack, 258
unplanned downtime, 4, 5
unplanned outages, 1
upgrades
performing, 368
planning, 367
upgrading, deciding what to upgrade to, 363–364
uptime, 4
uptime requirement, 10, 348
use cases
AlwaysOn and availability groups, 119–120
big data use cases, Azure, 300–301
user IDs, security, 376
user requirements, data replication, 213
V
variables
availability variables, 10–12
gauging HA primary variables, 348–349
HA primary variables, gauging, 52–53
vendor agreements, 25
verifying SQL Server clustering, 126
@@VERSION, 332
vertical filtering, 201–202
View Synchronization Status option, 228–229
virtual machines (VM), 303
backing up, 308–310
live migration, 307–308
Microsoft Windows 2012 hypervisor virtual machines, 304
virtualization, 100% of your platforms, 392–394
VM (virtual machine), 303
VM snapshots, 310–311
vulnerabilities, 375
W
warm standby, switching over to, 226–227
Windows 2012 hypervisor virtual machines, 304
Windows accounts, 376
Windows Performance Monitor, 230–231
Windows Server 2012 R2, live migration, 308
Windows Server Failover Clustering. See WSFC (Windows Server Failover
Clustering)
Windows Server Manager, installing, failover clustering, 89–94
WITH SCHEMABINDING option, 377–378
witness database servers, 171
witness roles, database mirroring, 172
wizards
Add a Failover Cluster Node Wizard, 109
Cluster Validation Wizard, 83
Database Mirroring Wizard, 173
Install a SQL Server Failover Cluster Wizard, 101
New Availability Group Wizard, 132
New Publication Wizard, 217
worldwide sales and marketing
assessments, 64–68
business scenarios, 19
data replication, 233–235
WSFC (Windows Server Failover Clustering), 21, 36–37, 82–86, 120
AlwaysOn, 87–88
extending, with NLB, 86–87
installing, SQL Server clustering, 100–113
SQL clustering, 80, 87–88