Cloud Native Database Principle and Practice
Feifei Li • Xuan Zhou • Peng Cai
Rong Zhang • Gui Huang • XiangWen Liu
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Cloud-native databases are the best choice for cloud computing platforms and have
become a new favorite among users. This book, written by cloud computing and
database experts, provides valuable content that can be used as a helpful reference.
Du Xiaoyong
Professor, Renmin University of China; Director, CCF Task Force on Big Data
Since their inception in the 1960s, databases have been recognized as key infrastructure of the information society. With the development and popularization
of the Internet over the past 20 years, the world has undergone profound changes.
The future of the information society is becoming evident, and the digital
transformation of the economy and society is ready to take off. The combination of
databases and the Internet has brought forth the biggest challenge and opportunity
for database development in the last decade. This has sparked renewed research and
development efforts in the field. Cloud-native databases are the result of combining
databases with cloud computing. They aim to provide database capabilities as
services that are available to everyone, that is, to turn databases into public utilities.
This is the first step in leveraging the power of data and providing a platform for
Cloud-native databases have been a major innovation in the database field over the
past decade, setting the trend for database development. This book explains the core
technologies of cloud-native databases, such as computing-storage separation, log-
as-data, and elastic multitenancy. It is an invaluable resource and deserves careful
reading. This book is written by renowned experts from academia and industry
in the field of databases, and contains their insights into cloud-native databases.
Li Guoliang
Professor and Deputy Director of the Department of Computer Science and
Technology, Tsinghua University; Deputy Director, CCF Technical Committee on
Databases
Cloud-native databases are the most popular databases today, thanks to their
excellent characteristics such as high scalability and availability. This book is a
pioneering work on cloud-native databases, covering key theories and technical
implementations. The authors are senior scholars and excellent professionals. I
highly recommend this book to graduate students and developers who are interested
in database technologies.
Cui Bin
Professor, Peking University; Deputy Director, CCF Technical Committee on
Databases
Data has become the key factor in the digital economy, and databases serve as the
essential software infrastructure for storing and processing data. They play a crucial
role in driving business development. Telecommunications service providers also
highly value the development of database technologies. With the rapid development
of cloud computing and big data, databases have evolved from traditional customized
deployments to on-demand, elastic, and scalable cloud services that are highly
flexible and cost-effective. The book delves into Alibaba Cloud’s technological
exploration and practical experience in the cloudification of databases. I believe this
book will offer valuable insights to readers, assisting them in successfully migrating
to the cloud and expediting their digital transformation.
Chen Guo
Deputy General Manager, China Mobile Information Technology Co., Ltd
Databases have entered a new era of intense competition. As this book suggests,
those who embrace cloud-nativeness are the ones who stand out. This book combines
theory with practical insights, devoting many pages to technology selection.
Drawing from the authors’ extensive experience, it offers readers a comprehensive
overview of cloud-native databases. It is undeniably a valuable read.
Zhang Wensheng
Chairman, PostgreSQL Chinese Community; Author of “A Practical Guide to
PostgreSQL” and “PostgreSQL Guide: In-depth Exploration”
Foreword 1
Cloud-native databases are emerging as a new, vital form of databases. It was projected that 75% of databases would either be deployed directly in or migrated to the cloud by 2022. Alibaba Cloud's database products not only underpin the world's largest high-concurrency and low-latency e-commerce environment, providing seamless end-to-end online data management and services for millions of small- and
medium-sized enterprises, but also offer stable and reliable data storage, processing,
and analysis services for vital sectors, such as the government, manufacturing,
finance, telecommunications, customs, transportation, and education industries. So
far, Alibaba Cloud’s database products have served over 100,000 enterprise users,
enabling them to effortlessly access enterprise-level database services, substantially
reduce costs, enhance operational efficiency, and generate novel business scenarios
and value.
In response to the immense demands for concurrent data throughput and com-
putational capabilities in e-commerce operations, Alibaba Cloud embarked on
independent R&D of databases as early as 2010. We successfully tackled criti-
cal technological challenges such as storage-computing separation, distributed
computing, high availability, compatibility, online–offline integration, and hybrid
transactional/analytical processing (HTAP). We continued to upgrade the kernel
and cluster architecture to meet the requirements of high-concurrency business
scenarios, including the renowned Double 11 shopping festival. Moreover, we
optimized our offerings for domestic chips and operating systems, laying a solid
foundation for leveraging indigenous technology stacks and achieving complete
autonomy and control. Along this journey, Alibaba Cloud’s database solutions
have garnered a range of prestigious accolades such as the award for the World’s
Leading Internet Scientific and Technological Achievements at the World Internet
Conference, the first prize of Zhejiang Science and Technology Progress Award,
and the first prize of the Science and Technology Progress Award of Chinese
Institute of Electronics. Furthermore, Alibaba Cloud has become the first Chinese
database vendor in Gartner’s list of global database leaders. Standing at the
threshold of the cloud computing era, Alibaba Cloud is committed to advancing
Foreword II
Databases are among the most vital foundational systems in computer science. They
serve as the bedrock software that supports the digital economy and occupy a piv-
otal strategic position. In the era of digital economy, effectively managing and lever-
aging diverse data resources is a prerequisite for scientific research and
decision-making. The traditional database market is dominated by major commer-
cial database vendors, forming a well-established database ecosystem. Databases
store and process the core data resources of users, leading to high user stickiness
and significant migration challenges. Due to a high degree of monopoly, Chinese
database systems face fierce competition in the commercial market. The current
national strategy places great emphasis on driving innovation and breakthroughs in fundamental technologies and explicitly calls for efforts to vigorously develop
foundational software and advanced information technology services and expedite
the development progress of database systems tailored for big data applications.
Against this backdrop, database systems should not merely aim to replace existing
products in the market, but to evolve and innovate, adapting to the emerging
demands of cloud computing, big data, AI, and other new market trends. They
should not only be useful but also user-friendly.
The rapid development of technologies like cloud computing has propelled foun-
dational software toward a transformation journey into the cloud. An increasing
number of enterprises are migrating their new applications to the cloud, intensifying
the requirements for data storage and computational analysis capabilities. Cloud-
native databases boast cloud elasticity, flexibility, and high availability, enabling
them to provide robust innovation capabilities, diverse product offerings, economi-
cally efficient deployment methods, and pay-as-you-go payment models. Cloud-
native distributed databases present a significant opportunity for groundbreaking
innovation in the database domain. Cloudification opens up new possibilities for
professionals working with databases.
The cloudification of databases has undergone two stages. The first stage is cloud
hosting, which involves deploying existing database systems on cloud platforms to
provide databases as on-demand services. The second stage is cloud-native imple-
mentation, where the hierarchical structure of databases is completely reconstructed
to leverage the resource pooling feature of the cloud. This decouples computing,
storage, and networking resources to accommodate dynamic business needs by
using elastic resource pools. In the second stage, databases are transformed compre-
hensively, unlocking opportunities for profound innovation. This book has emerged
in response to this trajectory. As the primary author of this book, Dr. Li Feifei has
dedicated over a decade to database research, followed by years of industry immer-
sion focusing on the development of database systems. This book reflects a fusion
of cutting-edge theory and practical engineering expertise. It provides a robust theo-
retical foundation and detailed technical implementation insights, to facilitate a
deep understanding of the key technologies of cloud-native databases, including
storage-computing separation, high availability, storage engines, distributed query
engines, data distribution, and automatic load balancing. I believe this book will be
invaluable to readers seeking to learn the latest database technologies.
Independent innovation in information infrastructure technologies is critical to
informatization initiatives. However, this important task cannot be accomplished
overnight. It necessitates a shared vision within society and close cooperation across
the entire industry chain. Only by making unremitting efforts and seizing every opportunity presented by the evolving landscape can we succeed. I hope this book
will inspire practitioners on their technological exploration journey and contribute
to the development of new database systems.
Foreword III
Database management systems are one of the most vital software systems in the
field of computer science and technology. They serve as indispensable foundational
platforms for information systems, providing essential support throughout the entire
lifecycle of data from collection to classification, organization, processing, storage,
analysis, and application. Without database management systems, informatization
across industries would not be possible. Database management technologies have
come a long way since their commercialization in the 1970s. Relational databases
have established their dominant position in informatization, thanks to their concise
conceptual framework, robust abstraction, powerful expressive capabilities, and
transaction consistency guarantee, making them the “de facto standard” for data
management technologies.
The widespread commercial use of the Internet has greatly accelerated the gen-
eration, circulation, and aggregation of data, raising new requirements and chal-
lenges for data management. The exponential growth in data volume necessitates
better management practices, and the increasing diversity of data types calls for
more flexible and diverse data models. The Internet, as the information infrastruc-
ture, places applications under greater pressure in terms of data throughput, concur-
rent access, and quick response to queries. The Internet is free of geographical
constraints and therefore demands uninterrupted service provisioning from business
systems. This imposes higher requirements on database systems in terms of avail-
ability and scalability, among other aspects. These new characteristics and scenarios
have sparked a revolution in database technologies, giving rise to a multitude of
innovative database technologies and products. The integration of distributed tech-
nologies has become a prominent feature, offering scalability and high availability
to effectively address the demands of large-scale data processing and storage
analysis.
The advent of cloud computing has ushered in a new phase for Internet applica-
tions. Computing and storage capabilities are offered to users as on-demand ser-
vices. Besides infrastructure services, various underlying platform technologies,
including databases and middleware, as well as application software, are also avail-
able as services. Cloud computing has triggered another wave of transformation in
Preface
Background
For over six decades, database systems have been continuously developed to fulfill
their role as one of the fundamental software components. Among them, relational databases have dominated the market due to their strong data abstraction, expressive
capabilities, and the easy-to-use SQL language. Over the past 50 years, the theory
and technology of relational databases have come a long way. Numerous books
have been published to delve into technical aspects such as SQL parsing, optimiza-
tion and execution, transaction processing, log recovery, storage engines, and data
dictionaries. Despite their maturity, database technologies continue to evolve due to
a variety of factors, including the rapid development of the Internet and big data
technology, complex business requirements, diverse data models, exponential data
volumes, and advancements in hardware technologies.
Internet applications have completely reshaped people’s lifestyles at an unprec-
edented pace, making an enormous amount of data available online. These data
need to be stored, analyzed, and consumed, which, in turn, puts databases under
unprecedented pressure. To adapt to the dynamic market, Internet applications
quickly adjust their business forms and models. This leads to the emergence of more
flexible and enriched data models and rapid changes in workload characteristics. To
cope with such changes, databases must support elastic scaling to adapt to evolving
business needs while keeping costs low. Traditional databases, often deployed as
standalone systems with fixed specifications, struggle to meet these demands. This
is where cloud computing comes in. By providing infrastructure as a service, cloud
computing establishes large-scale resource pools and offers a unified virtualized
abstraction interface. A massive operating system is established on diverse hard-
ware by utilizing technologies such as containers, virtualization, orchestration, and
microservices. Leveraging the capabilities of cloud computing, databases have
transformed from fixed-specification instances to on-demand services, allowing
users to access them as needed and scale them in real-time based on specific busi-
ness requirements.
Summary
This book portrays the evolution of database technologies in the era of cloud com-
puting. Through specific examples, it illustrates how cloud-native and distributed
technologies have enriched the essence of databases.
Chapter 1 offers a concise overview of database development. This chapter
explains the structure, key modules, and implementation principles of typical rela-
tional databases. An SQL statement execution process is used as an example to
illustrate these concepts.
Chapter 2 discusses the transformation of databases in the era of cloud comput-
ing, highlighting the evolution from standalone databases to cloud-native distrib-
uted databases. This chapter explores the technical changes brought by cloud
computing and examines the potential trends in database technologies.
Chapter 3 focuses on the architectural design principles of cloud-native data-
bases and the reasons behind these principles. Additionally, this chapter analyzes
the technical features of several prominent cloud-native databases in the market,
such as AWS Aurora, Alibaba Cloud PolarDB, and Microsoft Socrates.
Chapters 4–7, respectively, delve into the implementation principles of important
components of cloud-native databases, including storage engines, shared storage,
database caches, and computing engines. Each chapter follows the same structure,
in which the theoretical foundations and general implementation methods of the
components are explained, and then targeted improvements and optimization meth-
ods specific to cloud-native databases are introduced.
Chapter 8 provides a detailed explanation of distributed database technologies
that support scale-in and scale-out, including their application and implementation
principles. This chapter also highlights how the integration of database technologies
with cloud-native technologies takes the database technologies to new heights.
Chapters 9 and 10 center around the practical applications of cloud-native data-
bases. By using PolarDB as an example, these chapters cover relevant topics, such
as creating database instances in the cloud, optimizing usage and O&M, and harnessing the elasticity, high availability, security, and cost-effectiveness features offered
by cloud databases.
Primary Authors
This book is authored by Dr. Li Feifei from Alibaba Cloud and Professor Zhou
Xuan from East China Normal University (ECNU). Some of the content is contrib-
uted by Professor Cai Peng and Professor Zhang Rong from ECNU and also senior
technical expert Huang Gui from Dr. Li Feifei’s team. Liu Xiangwen, the Vice
President of Alibaba Cloud and General Manager for marketing, Alibaba Cloud
Intelligence, has also made significant contributions. Other technical experts from
Alibaba Cloud’s database team, including Zhang Yingqiang, Wang Jianying, Hu
Qingda, Chen Zongzhi, Wang Yuhui, Wang Bo, Sun Yue, Zhuang Zechao, Ying
Shanshan, Song Zhao, Wang Kang, Cheng Xuntao, Zhang Haiping, Wu Xiaofei, Wu
Xueqiang, Yang Shukun, and others, have provided valuable technical materials,
and we sincerely appreciate their contributions.
Special thanks to Jeff Zhang, Managing Director of DAMO Academy,
Academician Chen Zuoning from the Chinese Academy of Engineering, and
Academician Mei Hong from the Chinese Academy of Sciences for writing the
forewords for this book.
We would also like to express our gratitude to Professor Li Zhanhuai, Professor
Du Xiaoyong, Professor Zhou Aoying, Professor Peng Zhiyong, Professor Li
Guoliang, Professor Cui Bin, General Manager Chen Guo, President Zhou Yanwei,
and Chairman Zhang Wensheng for their testimonials.
Additionally, we extend our appreciation to the technical experts from Alibaba
Cloud’s database team, including Huang Gui, Yang Xinjun, Lou Jianghang, You
Tianyu, Wu Wenqi, Chen Zongzhi, Liang Chen, Zhang Yingqiang, Wang Jianying,
Hu Qingda, Weng Ninglong, Fu Dachao, Fu Cuiyun, Wang Yuhui, Yuan Lixiang,
Sun Jingyuan, Cai Chang, Zhou Jie, Xu Jiawei, Wu Xiaofei, Xie Rongbiao, Wang
Kang, Zheng Song, Ren Zhuo, Wei Zetao, Sun Yuxuan, Zhang Xing, Li Ziqiang, Xu
Dading, Xiong Meihui, Liang Gaozhong, Chen Shiyang, Chen Jiang, Xu Jie, Cai
Xin, Yu Nanlong, Wang Yujie, Chen Shibin, Wu Qianqian, Sun Yue, Zhao Minghuan,
Sun Haiqing, Li Wei, Yang Yuming, and Han Ken for their contributions to the translation of this book. Thanks to Wang Yuan and Xiao Simiao from Alibaba Cloud's database team for their contributions in organizing the translation.
This book would not have been possible without the collective efforts of every-
one involved.
As it was completed within a limited timeframe, this book may not cover everything there is to know about database systems. We therefore encourage readers to kindly share their feedback.
Contents
8 Integration of Cloud-Native and Distributed Architectures  199
  8.1 Basic Principles of Distributed Databases  199
    8.1.1 Architecture of Distributed Databases  200
    8.1.2 Data Partitioning  201
    8.1.3 Distributed Transactions  203
    8.1.4 MPP  207
  8.2 Distributed and Cloud-Native Architectures  208
    8.2.1 Shared Storage Architecture  209
    8.2.2 Shared-Nothing Architecture  210
  8.3 Cloud-Native Distributed Database: PolarDB-X  211
    8.3.1 Architecture Design  211
    8.3.2 Partitioning Schemes  212
    8.3.3 GSIs  213
    8.3.4 Distributed Transactions  213
    8.3.5 HTAP  214
  References  215
9 Practical Application of PolarDB  217
  9.1 Creating Instances on the Cloud  217
    9.1.1 Related Concepts  217
    9.1.2 Prerequisites  218
    9.1.3 Billing Method  218
    9.1.4 Region and Availability Zone  218
    9.1.5 Creation Method  219
    9.1.6 Network Type  219
    9.1.7 Series  219
    9.1.8 Compute Node Specification  219
    9.1.9 Storage Space  219
    9.1.10 Creation  220
  9.2 Database Access  220
    9.2.1 Account Creation  220
    9.2.2 GUI-Based Access  221
    9.2.3 CLI-Based Access  221
  9.3 Basic Operations  225
    9.3.1 Database and Table Creation  225
    9.3.2 Test Data Creation  227
    9.3.3 Account and Permission Management  227
    9.3.4 Data Querying  229
  9.4 Cloud Data Migration  231
    9.4.1 Migrating Data to the Cloud  231
    9.4.2 Exporting Data from the Cloud  235
10 PolarDB O&M  237
  10.1 Overview  237
  10.2 Resource Scaling  238
    10.2.1 System Scaling  238
    10.2.2 Manual Scaling  238
About the Authors
Xuan Zhou is a Professor and Vice Dean of the School of Data Science and
Engineering at East China Normal University (ECNU).
He received his bachelor's degree from Fudan University in 2001 and obtained
his Ph.D. degree from the National University of Singapore in 2005. He worked as
a researcher at the L3S Research Center in Germany and the Commonwealth
Scientific and Industrial Research Organisation (CSIRO) in Australia from 2005 to
2010 and then taught at Renmin University of China before joining ECNU in 2017.
He has devoted himself to the research of database systems and information retrieval
technologies. He has contributed to and directed various national and international
research projects and industrial cooperation projects. He has developed a variety of
data management systems. His research in distributed databases earned him the
second prize of the State Scientific and Technological Progress Award in 2019.
Peng Cai is a Professor and Ph.D. supervisor of the School of Data Science and
Engineering at ECNU.
Prior to joining ECNU in June 2015, he worked at IBM China Research
Laboratory and Baidu (China) Co., Ltd. He has published academic papers at
esteemed international conferences, such as the VLDB, ICDE, Special Interest
Group on Information Retrieval (SIGIR) Conference, and Association for
Computational Linguistics (ACL) Conference. His current research focuses on two
key areas: in-memory transaction processing and adaptive data management sys-
tems based on machine learning. He has been awarded the second prize of the State
Scientific and Technological Progress Award and the first prize of the Science and
Technology Progress Award of the Ministry of Education.
Rong Zhang is a Professor and Ph.D. supervisor of the School of Data Science and
Engineering at ECNU.
She has been dedicated to the research and development of distributed systems
and databases since 2001. She has led or participated in various research projects
funded by the National Natural Science Foundation of China, projects under the 863
Program, and industrial cooperation projects. Her outstanding contributions have
earned her the first prize of Shanghai Technology Progress Award for Technical
Invention and the second prize of the State Scientific and Technological Progress
Award. Her research fields encompass distributed data management, data stream
management, and big data benchmarking.
Gui Huang is a senior technical expert at Alibaba and the chief database architect of
Alibaba Cloud.
Throughout his tenure at Alibaba, he has been deeply engaged in the research
and development of distributed systems and distributed database kernels. He has
also participated in the development of PolarDB, a database independently devel-
oped by Alibaba. He possesses extensive expertise in distributed system design,
distributed consensus protocols, and database kernel implementation. He has pub-
lished multiple scholarly papers at esteemed international conferences, including
SIGMOD, FAST, and VLDB. His achievements have earned him the first prize of
the Science and Technology Progress Award of the Chinese Institute of Electronics.
Xiangwen Liu is Vice President of Alibaba Cloud, General Manager for market-
ing, Alibaba Cloud Intelligence, and Standing Director of CCF.
Having been with Alibaba for over a decade, Ms. Liu has led teams in creating a
three-tier governance system for Alibaba’s technology mid-end strategy and played
a pivotal role in the founding and growth of Alibaba DAMO Academy. As the
General Manager of the Marketing and Public Relations Department at Alibaba
Cloud, Ms. Liu has been instrumental in forging partnerships with universities, gov-
ernments, developers, and innovators, promoting the brand upgrade of Alibaba
Cloud in the digital economy era.
Chapter 1
Database Development Milestones
In the 1960s, database management systems began to thrive as the core software for
data management. Propelled by changes in application requirements and hardware
development, databases have undergone several evolutions and achieved notable
progress in query engines, transaction processing, storage engines, and other
aspects. However, with the advent of the cloud era, new demands and challenges
have been posed for the processing capabilities of database systems. Moreover,
cloud platform-based initiatives have been launched for various database systems,
giving rise to numerous innovative design ideas and implementation technologies.
1.1 Overview of Database Development
Databases play a vital role in the field of computer science. Early computers were
essentially giant calculators focused on algorithms and were mainly used for scien-
tific calculations. Computers did not store data persistently. They batch processed
input data and returned the calculation results but did not save the data results. At
that time, no specialized data management software was available. Programmers
had to define the logical structure of data and design its physical structure within their programs, including the storage structure, access methods, and input/output formats. Therefore, the subroutines that accessed data in a program had to change whenever the storage changed, so data and programs lacked independence from each other. The concept of files had not been introduced, and data could not be reused. Even if two programs used the same data, the data had to be input twice.
In the 1960s, as computers entered commercial systems and began to solve practi-
cal business problems, data went from being a by-product of algorithm processing to
a core product. At this time, database management systems (DBMSs) blossomed into
a specialized technical field. The core task of DBMSs was to manage data, which
included collecting, classifying, organizing, encoding, storing, processing, applying,
and maintaining data. Although this task has not changed much since its inception, the
theoretical models and related technologies for managing and organizing data have
undergone several transformations, driven by the development of computer hardware
and software, the complexity and diversity of business processing, and changes in data
scale. Database development can be divided into four stages: emergence, commercial-
ization, maturation, and era of cloud-native and distributed computing.
1.1.1 Emergence
In 1960, Charles Bachman joined General Electric (GE) and developed the Integrated Data Store (IDS), the first database system, which was a network model database
system. Bachman later joined the Database Task Group (DBTG) of the Conference/
Committee on Data Systems Languages (CODASYL), under which he developed the
language standards for the network model, with IDS as the main source. In 1969, IBM
developed a database system called the Information Management System (IMS) for the
Apollo program, which used the hierarchical model and supported transaction process-
ing. The hierarchical and network models pioneered database technologies and effec-
tively solved the problems of data centralization and sharing but lacked data independence
and abstraction layers. When accessing these two types of databases, users had to be aware of the data storage structure and specify the access methods and paths. This complicated mechanism hindered the popularization of such databases.
1.1.2 Commercialization
In 1970, IBM researcher E.F. Codd proposed the relational model in his ground-
breaking paper “A Relational Model of Data for Large Shared Data Banks,” provid-
ing the theoretical foundation for relational database technologies. The relational
model was based on predicate logic and set theory with a rigorous mathematical
foundation. It provided a high-level data abstraction layer but did not include the
specific data access process, which was implemented by the DBMS instead. At the
time, some believed the relational model was too idealized and only an abstract data
model that was difficult to implement in an efficient system. In 1974, Michael
Stonebraker and Eugene Wong from UC Berkeley decided to study relational data-
bases and developed the Interactive Graphics and Retrieval System (INGRES),
which proved the efficiency and practicality of the relational model. INGRES used
a query language called QUEL. At the same time, IBM realized the potential of the
relational model and developed the relational database known as System R, along
with the Structured Query Language (SQL). By the early 1980s, the relational approach pioneered by INGRES and System R had been commercialized in products such as Oracle and IBM DB2, and SQL was eventually adopted by the American National Standards Institute (ANSI) in 1986 as the standard language for relational databases. SQL only describes what data is desired,
without specifying the process for obtaining the data, freeing users from cumber-
some data operations. This was the key to the success of relational databases.
1.1.3 Maturation
After 10 years of development, the relational model theory had taken a foothold,
and E.F. Codd was awarded the Turing Award in 1981 for his contribution to the
relational model. The maturity of the theoretical model gave rise to a multitude of
commercial database products, such as Oracle, IBM DB2, Microsoft SQL Server,
Informix, and other popular database software systems, all of which appeared during
this period. The development of database technologies was related to programming
languages, software engineering, information system design, and other technolo-
gies and promoted further research in database theory. For example, database
researchers proposed an object-oriented database model (referred to as “object
model” for short) based on object-oriented methods and techniques. To this day,
many research works are still conducted based on existing database achievements
and technologies, aiming to expand the field of traditional DBMSs, mainly rela-
tional DBMSs (RDBMSs), at different levels for different applications, such as
building an object-relational (OR) model and establishing an OR database (ORDB).
The development of commercial databases has also driven the continuous evolution
of open-source database technologies. In the 1990s, open-source database projects
flourished, and the two major open-source database systems nowadays, MySQL and
PostgreSQL, were born. In the past, databases were mainly used for processing online
transaction business and therefore were known as online transaction processing (OLTP)
systems. After nearly 20 years of development, standalone relational database technolo-
gies and systems have become increasingly mature and were commercialized on a large
scale. With the widespread application of relational databases in information systems,
an increasing amount of business data accumulated. The focus and interest of scholars
and technical professionals shifted to the utilization of these data to facilitate business
decision-making. Hence, online analytical processing (OLAP) was introduced to query
and analyze large-scale data. Against this backdrop, IBM researchers Barry Devlin and
Paul Murphy proposed a new term in 1988—data warehouse. With the advent of the
Internet era, systems exclusive to professionals were opened up to everyone, leading to
an exponential increase in the scale of data that businesses processed and an explosive
surge in the scale of processing requests to databases. Consequently, traditional stand-
alone databases were overstretched. Propelled by cloud technologies, emerging tech-
nologies such as distributed databases then made their debuts.
1.1.4 Era of Cloud-Native and Distributed Computing
In the cloud-native era, two different practical schemes are available for expanding
database processing capabilities as the business scale grows. One is vertical scaling, also called "scale-up." In this scheme, the capacities of database components are increased, and better hardware (e.g., minicomputers and high-end storage arrays) is used, as in the well-known "IBM, Oracle, and EMC (IOE)" solutions. Multiple compute nodes in a database system share storage, giving birth to the shared-storage architecture shown in Fig. 1.1a.
Fig. 1.1 Scale-up and scale-out of a database. (a) Scale-up (shared-storage architecture). (b) Scale-out (shared-nothing architecture)
The other scheme is horizontal scaling, also called "scale-out," in which data is partitioned across multiple nodes that share nothing, as shown in Fig. 1.1b; which architecture a system adopts depends on which technology route is chosen. Over time, the big data ecosystem represented by Hadoop and the database ecosystem represented by traditional data warehouses gradually converged in the field of big data. "SQL on Hadoop" has become a vital
research direction in this field. Databases gradually developed big data capabilities
while providing the same user experience as standalone databases. Moreover, SQL
gradually became a universally accepted query and analysis language.
With the continuous development of information technologies, an increasing
amount of data of various types has been generated. Because the relational model requires strictly structured data, traditional relational databases are not well suited to processing highly variable business data or data with specialized structures. In view of this,
databases that use flexible data model definitions (known as schemaless databases)
and databases that use special data models, collectively known as NoSQL systems,
have emerged. NoSQL databases are classified into three categories: key value (KV)
databases (e.g., Redis, HBase, and Cassandra), document databases (e.g., MongoDB),
and graph databases (e.g., Neo4j). Trade-offs in several technical details have been
made for these databases to meet specific requirements for the data scale, flexibility,
concurrency, and performance. In specific scenarios, NoSQL displays better perfor-
mance, scalability, availability, and cost-effectiveness than relational databases.
However, relational databases remain the mainstream databases due to the powerful
expression capability of SQL and their universally mature specifications and complete
and strict atomicity, consistency, isolation, and durability (ACID) semantics.
With the increasing popularity of cloud computing in the 2020s and the launch of
database services of major cloud vendors, traditional database vendors also began to
explore the cloud computing field and launch cloud-based database products. Databases
have entered the cloud era, driving a new round of remarkable transformation.
1.2 Database Technology Development Trends
Recent years have witnessed the emergence of many new technologies and ideas that have brought new vitality to the field of databases. The following sections discuss the development trends of database technologies in six aspects.
operations and maintenance (O&M). For stateful storage resources, key technolo-
gies like distributed file systems, distributed consistency protocols, and multimodal
replicas are used to meet the requirements for storage resource pooling, data secu-
rity, and strong consistency. Scalable communication resources ensure that “suffi-
cient” bandwidth is available between computing and storage resources to meet the
demand for high-throughput, low-latency data transmission.
High availability based on resource decoupling is the basic feature of cloud data-
bases. Overall high availability of computing resources is achieved by using redundant
compute nodes in combination with “probing” and high-availability switching technolo-
gies that are based on the cloud infrastructure. Using multiple replicas and distributed
consistency protocols ensures the consistency between multiple replicas of data and the
high availability of data storage. Given that cloud databases face arbitrary data scales,
they must have rapid backup and recovery capabilities and the ability to restore data to
any point in time according to the backup strategy. To meet high concurrency and big
data processing requirements, cloud databases must support scale-out/scale-in and dis-
tributed processing mechanisms, including but not limited to load balancing, distributed
transaction processing, distributed locking, resource isolation and scheduling in the
multitenancy architecture, mixed CPU loads, and massively parallel processing (MPP).
Cloud databases aim to provide users with simple and easy-to-use database systems
to help them quickly achieve business functionality in the shortest time and at the
lowest costs. With the development of information technologies, big data has
become a reality. Along with this, a core requirement was imposed on databases,
which is to maintain consistent performance and acceptable response time when
dealing with massive data. The demand for integration of big data and databases is
increasingly strong. For users, this means they can directly use SQL to analyze and
process massive data based on cloud databases. To enable cloud databases to pro-
cess big data, a powerful kernel engine must be built by leveraging the elasticity and
distributed parallel processing features of the cloud infrastructure. This will maxi-
mize the efficiency of computing and storage resources, thereby providing massive
data analysis capabilities at an acceptable cost-effectiveness ratio. Further, ecosys-
tem tools that facilitate big data analysis and processing must be available. The
ecosystem tools can be categorized into three types: data transfer and migration
tools, data integration development tools, and data asset management tools. Data
transfer and migration tools ensure smooth data links and free flow of data. From a performance perspective, such tools are evaluated based on their real-time performance and throughput. From a functional perspective, they must serve as a pipeline for various
upstream and downstream data sources. Data integration development tools enable
users to freely process massive data (e.g., integrate, clean, and transform data) and
provide a complete integrated development environment that supports visualized
modeling of the development process and task publishing and scheduling. Data
asset management tools are essential for data fusion applications. “Business data,
data assets, asset application, and application value” reflect the progressive process
of business innovation driven by business data. As an important cloud infrastructure
for business data production, storage, processing, and consumption, cloud databases
play a key role in the data assetization process. Asset management tools based on
cloud databases guarantee that cloud databases can connect from “end to end” and
help customers achieve business value.
1.2.3 Hardware-Software Integration
The development of new hardware has opened more possibilities for database technologies, and fully utilizing hardware performance has become an important means for
improving the efficiency of all database systems. In a cloud-native database, com-
puting and storage are decoupled, and the network is utilized to implement distrib-
uted capabilities. The design of the computing, storage, and network features takes
into account the characteristics of new hardware. The SQL computation layer of the
database needs to perform massive algebraic operations, such as join, aggregation,
filtering, and sorting operations. These operations can be accelerated with heterogeneous computing devices such as GPUs to exploit their massive parallelism. Specific computation-intensive operations, such as compression/decompression and encryption/decryption, can be offloaded to field-programmable gate arrays (FPGAs) to reduce the burden on CPUs. In terms of storage, the emergence of nonvolatile memories (NVMs) has expanded the horizon for databases. With their byte addressability and persistence, NVMs improve I/O performance by orders of magnitude compared with solid-state drives (SSDs). Many database designers are rethinking how to redesign their architectures to use these features, for example, by designing index structures for NVMs and reducing or even eliminating logging. The execution path becomes longer after computing and storage are decoupled. Therefore, many cloud databases use high-performance network technologies
(e.g., remote direct memory access [RDMA] and InfiniBand) together with user-
mode network protocols (e.g., Data Plane Development Kit [DPDK]) and other
technologies, to mitigate the negative impact of network latency. Nowadays, data-
base system theories are mature and it is much harder to achieve breakthroughs. It
is an inevitable trend to reap the benefits of hardware development.
1.2.4 Multimodality
The rigid, predefined schema of traditional relational databases has now become a constraint in the face of rapidly changing businesses. At present, the
fundamental requirement is to manage flexible semistructured and unstructured
data. New databases rise to this challenge. By leveraging the advantages of tradi-
tional databases, such as powerful and rich data operation capabilities and complete
ACID semantics, new databases support data processing for more data models (e.g.,
graph, KV, document, time series, and spatial models) and unstructured data (e.g.,
images and streaming media). Processing numerous data models in one system and normalizing and analyzing heterogeneous data together makes it possible to extract more value from applications.
1.2.5 Intelligent O&M
As the data scale increases, the usage scenarios and frequency of cloud databases are
also increasing. The traditional database administrator (DBA)-based O&M mode can
no longer meet the O&M requirements of the cloud era because human DBAs cannot keep pace with the scale and frequency of operations involved. Intelligent O&M technologies facilitate the safe
and stable operation of cloud databases. Heuristic machine learning may be a poten-
tial solution. For instance, machine learning can be combined with the expertise of
database experts to build an intelligent O&M model based on the data collection capa-
bilities of the cloud infrastructure and the massive operation data of cloud databases.
The model can be used to implement self-awareness, self-repair, self-optimization,
self-O&M, and self-security as cloud services for cloud databases, freeing users from
complex database management and preventing service failures caused by manual
operations to guarantee the stability, security, and efficiency of database services.
of different industries. Visibility means that a database is no longer a “black box” but
something that can provide complete log audit capabilities to ensure that all operations
on the cloud database are recorded and management permissions are controlled by the
user. The security and trust technology of cloud databases covers the authentication,
protection, and auditing of data access.
1.3 Key Components of Relational Databases
A DBMS provides client drivers that comply with standard interface protocols such
as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC).
User programs can establish database connections with DBMS server programs by
using the APIs provided by the client drivers and send SQL requests. After receiving
a connection establishment request from a client, a DBMS determines whether to
establish the connection according to the protocol requirements. For example, the
DBMS checks whether the client address meets the access requirements and per-
forms security and permission verification on the user that uses the client. If the user
passes the verification, the corresponding database connection is established, and
resources are allocated for the connection. Then, a session context is created to
execute subsequent requests. All requests sent through this connection use the set-
tings in the session context until the session is closed.
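As a minimal illustration of this request path, the following sketch uses the standard ODBC C API to establish a connection, send one SQL request over the resulting session, and release the session's resources. The data source name, user, password, and table are placeholders, and error handling is reduced to a single check.
#include <sql.h>
#include <sqlext.h>
#include <cstdio>

int main() {
  SQLHENV env = SQL_NULL_HENV;
  SQLHDBC dbc = SQL_NULL_HDBC;
  SQLHSTMT stmt = SQL_NULL_HSTMT;

  // Allocate an environment handle and declare ODBC 3.x behavior.
  SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
  SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);

  // Ask the driver to establish the connection; the DBMS performs its
  // address, security, and permission checks at this point.
  SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
  SQLRETURN rc = SQLConnect(dbc,
                            (SQLCHAR *)"my_dsn", SQL_NTS,        // placeholder data source
                            (SQLCHAR *)"app_user", SQL_NTS,      // placeholder user
                            (SQLCHAR *)"app_password", SQL_NTS); // placeholder password
  if (!SQL_SUCCEEDED(rc)) {
    std::fprintf(stderr, "connection rejected by the DBMS\n");
    return 1;
  }

  // Send an SQL request over the established session.
  SQLAllocHandle(SQL_HANDLE_STMT, dbc, &stmt);
  SQLExecDirect(stmt, (SQLCHAR *)"SELECT a, b FROM tbl", SQL_NTS);
  while (SQLFetch(stmt) == SQL_SUCCESS) {
    // Column values would be read here with SQLGetData().
  }

  // Close the session and release the resources allocated for it.
  SQLFreeHandle(SQL_HANDLE_STMT, stmt);
  SQLDisconnect(dbc);
  SQLFreeHandle(SQL_HANDLE_DBC, dbc);
  SQLFreeHandle(SQL_HANDLE_ENV, env);
  return 0;
}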
When a DBMS receives the first request sent by the client, the DBMS allocates the corresponding computing resources. This process depends on the implementation of the DBMS. Some databases, such as PostgreSQL, use the process-per-connection model, which creates a child process to handle all requests on the connection. Other databases, such as MySQL, adopt a thread-based model, in which each connection is handled by a dedicated thread or by a thread pool within a single server process.
1.3.2 Query Engine
The SQL engine subsystem is responsible for parsing SQL requests sent by users,
as well as for performing semantic checks and generating a logical plan for the SQL
requests. After an SQL request is rewritten and optimized, a physical plan is gener-
ated and delivered to the plan executor for execution.
SQL statements are generally divided into two categories: Data Manipulation
Language (DML) statements and Data Definition Language (DDL) statements. DML
statements include SELECT, UPDATE, INSERT, and DELETE, whereas DDL state-
ments are used to maintain data dictionaries and include CREATE TABLE, CREATE INDEX, DROP TABLE, and ALTER TABLE. DDL statements usually do not undergo
the query optimization process in the query engine but are directly processed by the
DBMS static logic by calling the storage engine and catalog manager. This section
discusses how DML statements are processed and uses the simple statement “SELECT
a, b FROM tbl WHERE pk >= 1 and pk <= 10 ORDER BY c” to demonstrate the
general execution process of SQL statements, as shown in Fig. 1.3.
1.3.2.1 Query Parsing
First, an SQL request is subject to syntax parsing. In most DBMSs, tools like Lex
and Yacc are used to generate a lexical and grammar parser known as a query parser.
The query parser checks the validity of the SQL statement and converts the SQL
text to an abstract syntax tree (AST) structure. Figure 1.3 shows the syntax tree
obtained after statement parsing. The syntax tree has a relatively simple structure. If
a subquery is nested after the FROM or WHERE clause, the subtree of the subquery
will be attached to the nodes.
Then, the system performs semantic checks on the AST to resolve name and refer-
ence issues. For example, the system checks whether the tables and fields involved in
the operation exist in the data dictionary, whether the user has the required operation
permissions, and whether the referenced object names are normalized. For example,
the name of each table is normalized into the “database”.”schema”.”table name” for-
mat. After the checks are passed, a logical query plan is generated. The logical query
plan is a tree structure composed of a series of logical operators. The logical query
plan cannot be directly executed, and the logical operators are usually tag data struc-
tures used to carry necessary operational information to facilitate subsequent optimi-
zation and generate an actual execution plan.
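The following sketch shows what such a structure might look like for the example statement; it is a deliberately simplified, hypothetical AST rather than the layout used by any particular parser.
#include <string>
#include <vector>

// A simplified AST for
// "SELECT a, b FROM tbl WHERE pk >= 1 AND pk <= 10 ORDER BY c".
struct Condition {                          // one comparison, e.g., pk >= 1
  std::string column, op, literal;
};

struct SelectStmt {
  std::vector<std::string> select_list;     // projected columns
  std::vector<std::string> from_tables;     // referenced tables (or subquery subtrees)
  std::vector<Condition> where_conjuncts;   // WHERE clause split into AND-ed conditions
  std::vector<std::string> order_by;        // sort keys
};

int main() {
  SelectStmt stmt;
  stmt.select_list     = {"a", "b"};
  stmt.from_tables     = {"tbl"};
  stmt.where_conjuncts = {{"pk", ">=", "1"}, {"pk", "<=", "10"}};
  stmt.order_by        = {"c"};
  // The semantic check would now resolve "tbl", "a", "b", "pk", and "c"
  // against the data dictionary before a logical plan is built.
  return 0;
}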
1.3.2.2 Query Rewriting
Query rewriting is a preparatory stage for query optimization that involves perform-
ing equivalent transformation on the original query structure while ensuring that the
semantics of the query statement remain unchanged to simplify and standardize the
query structure. Query rewriting is usually performed on the logical plan rather than
directly on the text. Some common query rewriting tasks are as follows:
1. View expansion: For each referenced view in the FROM clause, the defini-
tion is read from the catalog, the view is replaced with the tables and predi-
cate conditions referenced by the view, and any reference to columns in this
view is replaced with references to columns in the tables referenced by
the view.
2. Constant folding: Expressions that can be calculated during compilation are
directly compiled and rewritten. For example, “a < 2 * 3” is rewritten as
“a < 6.”
3. Predicate logic rewriting: The predicates in the WHERE clause are rewritten.
For example, the always-false expression "a < 10 and a > 100" can be directly converted to "false." In this case, an empty result set is returned. Alternatively, logical equivalence transformation may be performed. For example, "not a < 10" may be converted to "a >= 10." Another logic rewriting method is to introduce new predicates by using predicate transitivity. For example, "t1.a < 10 and t1.a = t2.x" implies the condition "t2.x < 10," which can be used to filter the
data of the t2 table in advance.
4. Subquery expansion: Nested subqueries are difficult to handle at the optimiza-
tion stage. Therefore, they are usually joined during rewriting. For example,
“SELECT * FROM t1 WHERE id IN (SELECT id FROM t2)” can be rewritten
as “SELECT DISTINCT t1.* FROM t1, t2 WHERE t1.id = t2.id.”
Other rewriting rules are available. For example, semantic optimization and
rewriting may be performed based on the constraint conditions defined by the
schema. Nonetheless, these rewriting rules serve the same purpose, which is to bet-
ter optimize query efficiency and reduce unnecessary operations or to normalize the
query for easier processing in subsequent optimization.
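As a small, hypothetical sketch of one such rule, the following code folds constant subexpressions in a binary expression tree, turning the predicate a < 2 * 3 into a < 6. Production rewriters operate on the logical plan and apply many more rules; the node layout and helper functions here are illustrative only.
#include <iostream>
#include <memory>
#include <string>

// Hypothetical expression node: a constant, a column reference, or an
// operator applied to two children.
struct Expr {
  std::string op;                      // "const", "column", "*", "+", "<", ...
  double value = 0;                    // used when op == "const"
  std::string name;                    // used when op == "column"
  std::shared_ptr<Expr> left, right;
};

using ExprPtr = std::shared_ptr<Expr>;

ExprPtr constant(double v) {
  auto e = std::make_shared<Expr>(); e->op = "const"; e->value = v; return e;
}
ExprPtr column(const std::string &n) {
  auto e = std::make_shared<Expr>(); e->op = "column"; e->name = n; return e;
}
ExprPtr node(const std::string &op, ExprPtr l, ExprPtr r) {
  auto e = std::make_shared<Expr>(); e->op = op; e->left = l; e->right = r; return e;
}

// Recursively fold subtrees whose operands are all constants, e.g., 2 * 3 -> 6.
ExprPtr fold(ExprPtr e) {
  if (!e || !e->left || !e->right) return e;     // leaves stay unchanged
  e->left = fold(e->left);
  e->right = fold(e->right);
  if (e->left->op == "const" && e->right->op == "const") {
    if (e->op == "*") return constant(e->left->value * e->right->value);
    if (e->op == "+") return constant(e->left->value + e->right->value);
  }
  return e;
}

int main() {
  // Rewrite the predicate a < 2 * 3 into the equivalent predicate a < 6.
  ExprPtr pred = node("<", column("a"), node("*", constant(2), constant(3)));
  pred = fold(pred);
  std::cout << pred->left->name << " " << pred->op << " "
            << pred->right->value << std::endl;   // prints: a < 6
  return 0;
}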
1.3.2.3 Query Optimization
Query optimization involves converting the previously generated and rewritten logi-
cal plan into an executable physical plan. The conversion process is an attempt to
find the plan with the lowest cost. Finding the “optimal plan” is an NP-complete
problem, and the cost estimation is not accurate. Therefore, the optimizer can only
search for a plan with the lowest possible cost.
The technical details of query optimization are complex and will not be dis-
cussed here. In most cases, query optimizers combine two technologies: rule-
based optimization and cost-based optimization. Take, for example, the
open-source database MySQL, which is completely based on heuristic rules. For
instance, all Filter, Project, and Limit operations are pushed down as far as pos-
sible. In a multitable join, small tables are selected with priority based on cardi-
nality estimation, and the nested-loop join operator is typically selected as the join
operator. The cost-based approach searches for possible plans, calculates costs
based on cost model formulas, and selects the plan with the lowest cost. However,
searching for all possible plans is costly and, therefore, infeasible. Two typical
search methods are available. The first is the dynamic programming method described by Selinger in her paper "Access path selection in a relational database management system" [1]. This method uses a bottom-up approach and focuses on "left-deep tree" query plans (where the input on the right side of a join operator must be a base table) to reduce the search space. The dynamic programming method also avoids cross joins, ensuring that Cartesian products are computed only after all other joins. The second search method is a goal-oriented, top-down approach based on the Cascades framework. In some cases, top-down search can reduce
the number of plans that an optimizer needs to consider, but it increases the mem-
ory consumption of the optimizer.
Calculations based on the cost model are related to the specific operators and the
order in which they are executed. For example, when the join operator is selected,
the size of the result set must be estimated based on the join condition. This involves
selectivity, which is calculated based on column statistics. The accuracy of the esti-
mated cost depends on the accuracy of the statistics available. Modern databases not
only collect the maximum value, minimum value, and number of distinct values (NDV) of columns but also provide more accurate histogram statistics. However, comput-
ing histograms of large datasets can lead to excessive overheads, so sampling is
often used as a compromise.
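The following toy sketch illustrates how such statistics can feed a cost model: it estimates the selectivity of the range predicate pk >= 1 AND pk <= 10 from min/max statistics under a uniformity assumption and compares two hypothetical access paths. The statistics, cost constants, and formulas are illustrative assumptions, not those of any real optimizer.
#include <algorithm>
#include <cstdio>

// Hypothetical per-column statistics kept in the catalog.
struct ColumnStats {
  double min_val;
  double max_val;
  double table_rows;
};

// Assuming a uniform value distribution, estimate the fraction of rows
// satisfying lo <= col <= hi.
double range_selectivity(const ColumnStats &s, double lo, double hi) {
  double covered = std::min(hi, s.max_val) - std::max(lo, s.min_val);
  double width = s.max_val - s.min_val;
  if (covered <= 0 || width <= 0) return 0.0;
  return covered / width;
}

int main() {
  ColumnStats pk{1, 1000000, 1000000};        // pk ranges over 1..1,000,000
  double sel = range_selectivity(pk, 1, 10);  // predicate pk >= 1 AND pk <= 10
  double out_rows = sel * pk.table_rows;

  // Toy cost formulas: a full scan touches every row; an index scan pays a
  // per-row lookup penalty but touches only the qualifying rows.
  double full_scan_cost = pk.table_rows * 1.0;
  double index_scan_cost = out_rows * 4.0;

  std::printf("selectivity=%.8f, estimated output rows=%.1f\n", sel, out_rows);
  std::printf("full scan cost=%.1f, index scan cost=%.1f -> choose %s\n",
              full_scan_cost, index_scan_cost,
              index_scan_cost < full_scan_cost ? "IndexScan" : "TableScan");
  return 0;
}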
1.3.2.4 Query Execution
In most cases, executors use the Volcano model (also known as the Iterator model) as the execution model. This model was first introduced by Goetz Graefe in his paper "Volcano—an extensible and parallel query evaluation system" [2]. The Volcano model is a "pull-based" model with a simple basic idea: each physical operator in the execution plan is implemented as an iterator. Each iterator provides a get_next_row() method, which returns one row of data (a tuple) produced by the operator each time it is called. The program repeatedly calls get_next_row() on the root operator of the physical execution plan, which in turn pulls rows from its children recursively, until all data has been produced.
Taking the physical plan in Fig. 1.4 as an example, the plan execution can retrieve
the entire result set by recursively calling the get_next_row() method of the Sort
operator.
The top-level Sort operator is a blocking operator that needs to pull all data from its child operators before it can sort and return the data.
class SortOperator {
  Operator *child;          // upstream operator that produces the input rows
  SortedRowSet rowset;      // buffered input rows, sorted in open()
  Cursor cursor;            // read position within the sorted row set
  bool initialized = false;

  // Pull and buffer all rows from the child, then sort them.
  void open() {
    while (child->has_next()) {
      Row row = child->get_next_row();
      rowset.add(row);
    }
    sort(rowset);
    cursor = 0;
  }

  // Blocking operator: the first call consumes the entire input via open()
  // before the first sorted row can be returned.
  Row get_next_row() {
    if (!initialized) {
      open();
      initialized = true;
    }
    return rowset[cursor++];
  }
};
class ProjectionOperator {
  Operator *child;       // child operator that produces full rows
  ColumnList columns;    // columns to retain in the output

 public:
  Row get_next_row() {
    // Pull one row from the child and keep only the projected columns.
    Row row = child->get_next_row();
    return select(row, columns);
  }
};
IndexScan, which is implemented by the storage engine, needs to scan the data
of the tables involved according to the index.
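To complete the picture, the following sketch shows a simplified IndexScan iterator and the top-level driver loop that pulls rows from the root of the plan in Fig. 1.4. The IndexCursor type and the send_to_client() sink are hypothetical placeholders rather than real storage engine interfaces.

class IndexScanOperator {
  IndexCursor cursor;   // positioned by the storage engine on the chosen index
 public:
  bool has_next() { return cursor.valid(); }
  Row get_next_row() {
    // Return the row under the cursor and advance to the next index entry.
    Row row = cursor.current_row();
    cursor.advance();
    return row;
  }
};

// Top-level driver: repeatedly pull rows from the root (Sort) operator until
// the plan is drained.
void execute_plan(SortOperator& root) {
  while (root.has_next()) {
    Row row = root.get_next_row();
    send_to_client(row);   // hypothetical sink for result rows
  }
}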
1.3.3.1 Concurrency Control
A database is a multiuser system. This means that the database may receive a large
number of concurrent access requests at the same time. If the concurrent requests
access the same piece of data and one of the operations is a write operation, this situ-
ation is called a “data race.” If no appropriate protection mechanism is configured
to deal with data races, data read and write exceptions will occur. For example,
uncommitted dirty data of another transaction may be read, data written by a trans-
action may be overwritten by another transaction, or inconsistent data may be read
at different points in time within a transaction. The isolation feature discussed above
prevents unexpected data results caused by these exceptions. To achieve isolation,
concurrency control is defined, which is a set of data read and write access protec-
tion protocols.
The cost and execution efficiency vary based on the strictness of data consis-
tency. Stricter data consistency results in higher costs and lower execution effi-
ciency. Nowadays, most databases define multiple isolation levels based on different
levels of anomalies to balance efficiency and consistency. The isolation levels,
namely, read uncommitted, read committed, repeatable read, and serializable, are
ranked in ascending order of strictness of consistency. Users can select an appropri-
ate isolation level according to their business characteristics.
The strictest isolation level is serializable, which requires the results of inter-
leaved concurrent execution of multiple transactions to be the same as the results of
the serial execution of this group of transactions. Each individual transaction in the
group occupies exclusive resources and is unaware of the existence of other transac-
tions. This is the purpose of isolation. The main concurrency control techniques
include the following:
1. Two-phase locking (2PL): For each read/write operation in a transaction, the
data row to be read/written must be locked. A shared lock is added for read
operations so that all read operations can be performed concurrently. An exclu-
sive lock is added for write operations so that a write operation can be performed
only after the previous write operation is completed. The locking phase and
releasing phase must be sequential. In the locking phase, only new locks can be
added. In the releasing phase, only previously added locks can be released, and
no new locks can be added.
2. Multiversion concurrency control (MVCC): With MVCC, transactions do not use a locking mechanism. Instead, multiple versions are saved for modified data. When a transaction is executed, a snapshot point in time is marked. Even if the data is modified by other transactions after this point in time, the historical version of the data as of this point in time can still be read (a minimal visibility-check sketch is shown after this list).
3. Optimistic concurrency control: All transactions can read and write data without
blocking, but all read and write operations are written to the read and write sets,
respectively. Before a transaction is committed, validation is performed to check
whether the read and write sets conflict with other transactions in the interval
between the start and commitment of the transaction. If a conflict exists, one of
the transactions must be rolled back.
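To make the MVCC idea above concrete, the following is a minimal visibility-check sketch. It assumes that each row version records the transactions that created and deleted it and that a snapshot knows which transactions had committed when it was taken; the Snapshot structure and helper names are illustrative and not tied to any particular engine.

#include <cstdint>
#include <set>

// A snapshot taken when the transaction (or statement) starts.
struct Snapshot {
  uint64_t next_txn_id;        // transactions with ids >= this started after the snapshot
  std::set<uint64_t> active;   // transactions that were still running at snapshot time
  bool committed_before(uint64_t txn_id) const {
    return txn_id < next_txn_id && active.count(txn_id) == 0;
  }
};

struct RowVersion {
  uint64_t created_by;   // transaction that created this version
  uint64_t deleted_by;   // transaction that deleted it (0 = still live)
  // ... column values ...
};

// A version is visible if its creator committed before the snapshot was taken
// and it has not been deleted by a transaction that also committed before then.
bool is_visible(const RowVersion& v, const Snapshot& snap) {
  if (!snap.committed_before(v.created_by)) return false;
  if (v.deleted_by != 0 && snap.committed_before(v.deleted_by)) return false;
  return true;
}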
The serializable isolation level can be achieved by strictly using 2PL. However,
the costs of adding a lock for each operation are high, especially for read operations.
Most concurrent transactions do not operate on the same data, but the costs of add-
ing locks still exist. To reduce these costs, many databases use a hybrid mode that
combines MVCC and 2PL. In this mode, locks are added to data manipulated by
write operations, but no locks are added to data manipulated by read operations; instead, a historical version of the data as of a specific point in time is read. The isolation level provided by this method is called snapshot isolation, which is weaker than the serializable level and may exhibit anomalies such as write skew.1 However, this isolation level avoids most other anomalies, provides better performance, and is often a better choice than other concurrency control techniques in most cases.
Although optimistic concurrency control avoids lock waiting, a large number of
rollbacks will occur when a transaction conflict is detected. Therefore, optimistic
concurrency control is suitable for scenarios with few transaction conflicts but
performs poorly when many transaction conflicts exist (e.g., flash sale scenarios that
involve inventory reduction operations when users purchase the same product).
The logging system is a core part of the database storage engine that ensures the
durability of committed transactions and the atomicity of transactions that are
aborted or rolled back. The durability of committed transactions enables the data-
base to recover previously committed transactions after a crash. Many techniques
can be used to ensure the durability and atomicity of transactions. Taking the shadow
paging technique [3] proposed by Jim Gray in System R as an example, a new page
is generated for each modified page, and the new page is persisted when the transac-
tion is committed. In addition, the page pointer in the current page table is atomi-
cally changed to the address of the new page. In a rollback, the new page is simply
discarded and the original shadow page is used. Although this method is simple and
direct, it failed to become a mainstream technique because it does not support page-
level transaction concurrency, has high recycling costs, and is inefficient. Most
mainstream databases currently use the logging mechanism proposed by C. Mohan
in the Algorithms for Recovery and Isolation Exploiting Semantics (ARIES)
paper [4].
Databases were originally designed for traditional disks, whose sequential access performance is much higher than their random read/write performance. User updates, however, generally touch pages in a relatively random manner. If a page is flushed to the disk each time the page is updated and the update is committed, massive amounts of random I/Os are produced. Moreover, the atomicity of concurrent transactions within a page cannot be guaranteed: when multiple transactions update a page at the same time, the page cannot be flushed immediately after one transaction is committed, because other uncommitted transactions may still be updating the page. Therefore, when a page is updated, the page content is only updated in place in the memory cache, and the transaction's operation log is recorded to ensure that the operation log is flushed to the disk before the page content when the transaction is committed. This technology is called write-ahead logging (WAL).

1 A write skew anomaly occurs when two transactions (T1 and T2) concurrently read a data set (e.g., values V1 and V2), concurrently make disjoint updates (e.g., T1 updates V1, and T2 updates V2), and are concurrently committed. This anomaly does not occur in the serial execution of transactions and is not prevented by snapshot isolation. Consider V1 and V2 as Phil's personal bank accounts. The bank allows V1 or V2 to have a negative balance as long as the sum of both accounts is nonnegative (i.e., V1 + V2 ≥ 0). The initial balance of both accounts is USD 100. Phil initiates two transactions: T1 to withdraw USD 200 from V1 and T2 to withdraw USD 200 from V2. Under snapshot isolation, both transactions may commit, leaving V1 + V2 = −200 and violating the constraint. Write skew anomalies can be resolved by using two strategies: one is to materialize the write conflict by adding a dedicated conflict table that both transactions modify; the other is a promotion strategy, in which one transaction updates a data row it only reads (e.g., by replacing its value with an equal value) or uses an equivalent update, for example, a SELECT FOR UPDATE statement, to cause a write conflict.
To ensure the durability and atomicity of transactions, the sequence of flushing
the log, commit point, and data page to the disk must be strictly defined. Some of
the strategies that can be used to do this include force/no-force and steal/no-steal.
• Force/no-force: The log must be written to the disk before the data page. After
the transaction is committed (i.e., the commit marker is recorded), the commit-
ment is considered successful only after all pages updated by the transaction are
forcibly flushed to the disk. This is called the force strategy. If the updated pages
are not required to be flushed immediately to the disk, the pages can be asynchro-
nously flushed later. This is called the no-force strategy. No-force means that some pages updated by committed transactions may not yet be written to the disk. In this case, the redo log must be recorded so that durability can be ensured by replaying the log during recovery.
• Steal/no-steal: The steal strategy allows a data page that contains uncommitted
transactions to be flushed to the disk. By contrast, the no-steal strategy does not
allow this. When the steal strategy is used, data written by uncommitted transactions may exist on the disk, and undo information must be recorded in the log to ensure that such transactions can be rolled back when they are aborted.
The ARIES protocol uses the steal/no-force strategy, which allows uncommitted
transactions to be flushed to the disk before the commit point and does not forcibly
require that the data page be written to the disk after the transaction is committed.
Instead, the time at which dirty pages are flushed to the disk can be autonomously
decided, and the optimal I/O mode is used. Theoretically, this is the most efficient
strategy. However, the redo log and rollback log must be recorded.
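The ordering constraints described above can be summarized in a short sketch of the commit path under WAL with the steal/no-force policy. The LogManager and Transaction types below are simplified stand-ins, not the interfaces of any specific engine.

#include <cstdint>

using Lsn = uint64_t;

// Simplified stand-ins for the engine components involved.
struct LogManager {
  Lsn next_lsn = 1;
  Lsn append_commit_record(uint64_t txn_id) { (void)txn_id; return next_lsn++; }
  void flush_until(Lsn lsn) { (void)lsn; /* fsync the log file up to lsn */ }
};

struct Transaction {
  uint64_t id;
  bool committed = false;
};

// Commit path under WAL with the steal/no-force policy.
void commit_transaction(Transaction& txn, LogManager& log) {
  // 1. Append a commit record after the transaction's redo and undo records.
  Lsn commit_lsn = log.append_commit_record(txn.id);

  // 2. Force the log up to (and including) the commit record. This is the only
  //    synchronous disk I/O on the commit path.
  log.flush_until(commit_lsn);

  // 3. Dirty pages are NOT forced here (no-force); a background writer may flush
  //    them at any time, even pages dirtied by uncommitted transactions (steal),
  //    because the recorded undo information allows them to be rolled back.
  txn.committed = true;
}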
1.3.4 Storage Engine
In general, the data storage operations of a database table are completed at the
storage engine level. The TableScan and IndexScan physical operators involved in
data access call the data access methods provided by the storage engine to read
and write specific data. The database storage engine includes two major modules:
the data organization and index management module and the buffer manage-
ment module.
NSM. This model attempts to combine the advantages of the NSM and DSM to
implement hybrid transactional/analytical processing (HTAP). However, it also
inherits the disadvantages of both models.
Data organization involves more details, for example, how to save storage space
without compromising access efficiency on disks or in memory. For more informa-
tion, see related documents, such as ASV99 [5] and BBK98 [6].
To perform read and write operations on data pages, the pages must be loaded from
the disk to the memory, and the content of the pages must be modified in the mem-
ory. In general, the capacity of the memory is much smaller than that of the disk.
Therefore, the pages loaded into the memory are only part of the data pages. Buffer
management covers how to decide the pages to be loaded into the memory based on
the read and write requests, when to synchronize modified pages to the disk, and
which pages are to be evicted from the memory.
In most databases, the content of pages in the memory is the same as that of
pages in the disk. This is beneficial in two aspects: First, fixed-sized pages are easy
to manage in the memory, and the allocation and recycling algorithms used to avoid
memory fragmentation are simple. Second, the format is consistent, and encoding
and decoding operations (also known as serialization and deserialization opera-
tions) are not required during data read and write, thereby reducing CPU workloads.
Databases use a data structure called a page table (hash table) to manage buffer pool
pages. The page table records the mappings between page numbers and page content,
including the page location in the memory, disk location, and page metadata. The meta-
data records some current characteristics of the page, such as the dirty flag and the reference pin. The dirty flag indicates whether the page has been modified after being read, and the reference pin indicates whether the page is being referenced by ongoing transactions. Pages that are currently pinned cannot be evicted, and dirty pages must be written back to the disk before they can be evicted.
The size of the buffer pool, which is usually fixed, is related to the physical
memory configured for the system. Therefore, when new pages need to be loaded
but the buffer pool is full, some pages must be evicted by using a page replacement
algorithm. Several buffer pool replacement algorithms, such as least recently used
(LRU), least frequently used (LFU), and CLOCK algorithms, may be unsuitable for
complex database access modes (e.g., full table scans in databases). If the LRU
algorithm is used when a large amount of data needs to be scanned, all data pages
will be loaded into the buffer pool. This will cause the original data pages to be
evicted from the buffer pool, resulting in a rapid and significant drop in the hit rate
of the buffer pool in a short time. At present, many databases use simple enhanced
LRU schemes to solve scan-related problems. For example, the buffer pool is
divided into a cold zone and a hot zone. Pages read during a scan first enter the cold zone, and only pages that are accessed again are promoted to the hot zone, thereby avoiding cache pollution. Many studies have also been conducted to find more targeted replacement algorithms, such as LRU-K [7] and 2Q [8].
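A minimal sketch of the midpoint-insertion ("cold zone / hot zone") scheme described above is shown below. It is illustrative only and not the implementation of any particular database: newly read pages enter the cold zone, and only pages that are accessed again are promoted to the hot zone, so a one-off full table scan cannot evict the hot working set.

#include <cstddef>
#include <list>
#include <unordered_map>

class MidpointLru {
  std::list<int> hot_, cold_;    // page ids, most recently used at the front
  std::unordered_map<int, std::list<int>::iterator> where_;
  std::unordered_map<int, bool> in_hot_;
  size_t capacity_;

 public:
  explicit MidpointLru(size_t capacity) : capacity_(capacity) {}

  void access(int page_id) {
    auto it = where_.find(page_id);
    if (it == where_.end()) {                  // miss: insert into the cold zone
      if (hot_.size() + cold_.size() >= capacity_) evict();
      cold_.push_front(page_id);
      where_[page_id] = cold_.begin();
      in_hot_[page_id] = false;
    } else if (!in_hot_[page_id]) {            // second touch: promote to the hot zone
      cold_.erase(it->second);
      hot_.push_front(page_id);
      where_[page_id] = hot_.begin();
      in_hot_[page_id] = true;
    } else {                                   // hot hit: move to the front of the hot zone
      hot_.erase(it->second);
      hot_.push_front(page_id);
      where_[page_id] = hot_.begin();
    }
  }

 private:
  void evict() {                               // evict from the cold zone first
    std::list<int>& victims = cold_.empty() ? hot_ : cold_;
    int victim = victims.back();
    victims.pop_back();
    where_.erase(victim);
    in_hot_.erase(victim);
  }
};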
Cloud platforms create a novel operating system for various components by employ-
ing technologies such as containerization, virtualization, and orchestration.
However, achieving a highly available, high-performance, and intelligent cloud
database system by using the virtualization and elastic resource allocation capabili-
ties offered by cloud platforms is challenging. Over time, cloud database systems
have become clearly distinct from traditional database systems.
In the past, most enterprises built their IT infrastructure through hardware procurement and IDC rental. In this model, professional expertise is required to perform O&M of servers,
cabinets, bandwidth, and switches and handle other matters such as network con-
figuration, software installation, and virtualization. The rollout period for system
adjustment is long and involves a series of procedures, such as procurement, supply
chain processing, shelf placement, deployment, and service provisioning. Enterprises
must plan their IT infrastructure in advance based on their business development
requirements, and redundant resources must be reserved to ensure that the system
capacity can cope with business surges. However, business development does not
always follow the planned path, especially in the Internet era. It may go beyond
expectations and overload the system or may be below expectations, resulting in
massive idle resources.
Cloud computing is the answer to the preceding problems. Cloud computing
provides IT infrastructure as a service (IaaS), enabling enterprises and individual users to use the IT infrastructure on demand, without the need
to build their IT infrastructure. Similarly, enterprise users do not need to purchase
hardware and build IDCs when they need computing resources. They can purchase
resources from cloud computing service providers as needed.
In 2006, Google CEO Eric Schmidt proposed the cloud computing concept at the
Search Engine Strategies conference (SES San Jose 2006). The same year, Amazon
launched Amazon Web Services (AWS) to provide public cloud services. Later,
Internet giants outside China, such as Microsoft, VMware, and Google, and Chinese
companies, such as Alibaba, Tencent, and Huawei, successively launched cloud ser-
vices. Soon enough, cloud computing became the preferred IT service for enterprises.
For enterprise users, IT service construction no longer means holding heavy
assets as they can now purchase computing resources or services from cloud comput-
ing vendors based on their business needs. Prompted by massive requirements from
different users, cloud vendors are able to establish a super-large resource pool and
provide a unified, virtualized abstraction interface based on the resource pool. A
cloud is basically a huge operating system built on diversified hardware by using
technologies such as containers, virtualization, orchestration and scheduling, and
microservices. With cloud technologies, users no longer need to pay attention to
hardware differences, lifecycle management, networking, high availability, load bal-
ancing, security, and other details. With the resource pooling capability, the cloud
boasts a unique advantage of elasticity to meet different computing requirements of
different businesses at different periods of time by using flexible scheduling strategies.
2.1.2 Database as a Service
With the IaaS layer as the cornerstone, cloud computing service providers have
established more layers, such as the platform as a service (PaaS) and software as a
service (SaaS) layers, to provide appropriate platforms for various application sce-
narios in the cloud.
As important foundational software, databases have been cloudified at an early
stage. In addition, databases, along with operating systems, storage, middleware,
and other components, form a standard cloud-based PaaS system. Most major cloud
vendors provide cloud database services, which can be roughly categorized into
cloud hosting, cloud services, and cloud-native models depending on the ser-
vice mode.
2.1.2.1 Cloud Hosting
Cloud hosting is a deployment mode that closely resembles traditional database sys-
tems. Essentially, cloud hosting involves deploying, on cloud hosts, traditional data-
base software that was originally deployed on physical servers or virtual servers in
an IDC. In this deployment mode, the cloud service provider merely serves as an
IDC provider to database users and provides the users with computing and storage
resources that are hosted on cloud hosts. Users are responsible for the availability,
security, and performance of their database systems. The costs of owning a database
system deployed in cloud hosting mode are the same as the costs of building a data-
base system in an IDC. Moreover, the users still need to have their own IT O&M
team and DBAs to ensure normal database operation. In cloud hosting mode, cus-
tomers must resort to their own technical capabilities and DBA team to obtain
enterprise-level database management system capabilities, such as high availability,
remote disaster recovery, backup and recovery, data security, SQL auditing, perfor-
mance tuning, and status monitoring. Therefore, the total cost of ownership (TCO) for a customer who deploys a database system in cloud hosting mode also includes the human resource costs of the DBA team.
2.1.2.2 Cloud Services
The cloud service model is a step further than the cloud hosting model. In this
model, users can directly use the database services provided by the cloud service
provider without concerning themselves with the deployment of database manage-
ment software. In general, cloud service providers offer various traditional database
services, such as MySQL, SQL Server, and PostgreSQL. Users can directly access
the database by using the access link of the cloud database service and the JDBC or
ODBC interface.
A database management system that provides services in the cloud service model
usually incorporates enterprise-level features. When providing cloud database ser-
vices, cloud service providers also provide corresponding enterprise-level features,
including but not limited to high availability, remote disaster recovery, backup and
recovery, data security, SQL auditing, performance tuning, and status monitoring.
In addition, cloud database services typically include online upgrades, scaling, and
other services. These are essentially resource management capabilities provided by
cloud service providers for cloud database services.
Another advantage of the cloud service model over the cloud hosting model is
that users do not need to have their own DBA team. In most cases, the cloud service
provider provides database O&M services. Some even offer expert services, such as
data model design, SQL statement optimization, and performance testing.
2.1.2.3 Cloud-Native Model
The cloud service model reduces the TCO of a database system through integrated
O&M services and supply chain management capabilities, allowing traditional
database users to enjoy the convenience brought by cloud computing to database
systems. However, the architecture of traditional database systems limits the full
play of the advantages of cloud computing. For example, the on-demand resource
usage, rapid elasticity, high performance, and high availability brought by cloud
computing cannot be fully provided in the cloud service model. Hence, cloud-native
databases have emerged to address these issues.
The concept of cloud native was first proposed by Pivotal in 2014. One year later, the
Cloud Native Computing Foundation was established. There is still no clear definition
for “cloud native,” but this term is used to refer to new team culture, new technology
architectures, and new engineering methods in the era of cloud computing. To achieve
cloud nativeness, a flexible engineering team uses highly automated R&D tools by fol-
lowing agile development principles to develop applications that are based on and
deployed in cloud infrastructure to meet rapidly changing customer needs. These appli-
cations adopt an automated, scalable, and highly available architecture. The engineering
team then provides application services through efficient O&M based on cloud comput-
ing platforms and continuously improves the services based on online feedback.
In the database field, cloud-native database services involve database manage-
ment systems built on the cloud infrastructure, highly flexible database software
development and IT operations (DevOps) teams, and supplementary cloud-native
ecological tools. From the user’s perspective, cloud-native database services must
have core capabilities such as compute-storage separation, extreme elasticity, high
availability, high security, and high performance. These database services must also
have intelligent self-awareness capabilities, including self-perception, self-
diagnosis, self-optimization, and self-recovery. Security, monitoring, and smooth
flow of data can be achieved by using cloud-native ecological tools. A database
technology team that follows DevOps conventions can implement rapid iteration
and achieve functional evolution of database services.
2.3.1 Layered Architecture
The most notable feature of the architecture of cloud-native databases is the decom-
position of the originally monolithic database [2, 3]. The resulting layered architec-
ture includes three layers: the computing service layer, the storage service layer, and
the shared storage layer. The computing service layer parses SQL requests and con-
verts them into physical execution plans. The storage service layer performs data
cache management and transaction processing, to ensure that data updates and reads
comply with the ACID semantics of transactions. In terms of implementation, the
storage service layer may not be physically independent and may be partially inte-
grated into the computing service layer and the shared storage layer. The shared
storage layer is responsible for the persistent storage of data and ensures data con-
sistency and reliability by using distributed consistency protocols.
2.3.3 Elastic Scalability
partitioned based on one logical scheme, and the business logic and sharding logic
are not perfectly aligned, resulting in transactions that may cross databases and
shards. For instance, at a high level of isolation, the system throughput is signifi-
cantly compromised when distributed transactions account for over 5% of total
transactions. Perfect sharding strategies are nonexistent. Hence, ensuring high con-
sistency of data in the distributed architecture is a significant challenge that must be
addressed for distributed businesses.
The cloud-native architecture essentially consists of three layers: (1) the underly-
ing layer, which is the shared storage layer for the distributed architecture; (2) the
upper layer, which serves as the shared computing pool for the distributed architec-
ture; and (3) the intermediate layer, which is used for computing and storage decou-
pling. This architecture provides elastic high availability capabilities and facilitates
the centralized deployment of the distributed technology, making the architecture
transparent to applications.
In a distributed system, multiple nodes communicate and coordinate with each other
through message transmission. This process inevitably involves issues such as node
failures, communication exceptions, and network partitioning. A consensus proto-
col can be used to ensure that multiple nodes in a distributed system that may expe-
rience the abovementioned abnormalities can reach a consensus.
In the field of distributed systems, the consistency, availability, and partition tolerance (CAP) theorem [4] states that any network-based data-sharing system can deliver only two out of the following three characteristics: consistency, availability,
and partition tolerance. Consistency means that after an update operation is per-
formed, the latest version of data is visible to all nodes, and all nodes have consis-
tent data. Availability refers to the ability of the system to provide services within a
normal response time when some nodes in the cluster are faulty. Partition tolerance
is the ability of the system to maintain service consistency and availability in the
event of node failure or network partitioning. Given the nature of distributed sys-
tems, network partitioning is bound to occur, thereby necessitating partition toler-
ance. Therefore, trade-offs must be made between consistency and availability. In
actual applications, cloud-native databases typically adopt a multi-replica replication approach based on consensus protocols such as Paxos and Raft to balance system availability and consistency, relaxing the strictest consistency guarantees in exchange for enhanced system availability.
When used online, cloud-native databases provide different high availability
strategies. A high availability strategy is a tailored combination of service prioritiza-
tion strategies and data replication methods selected based on the characteristics of
user businesses. Users can use two service prioritization strategies to balance avail-
ability and consistency:
• Recovery time objective (RTO) first: The database must restore services as soon
as possible to maximize its available time. This strategy is suitable for users who
have high requirements for database uptime.
• Recovery point objective (RPO) first: The database must ensure as much data
reliability as possible to minimize data loss. This strategy is suitable for users
who have high requirements for data consistency.
Multitenancy means that one system can support multiple tenants. A tenant is a
group of users with similar access patterns and permissions and typically consists of several users from the same organization or company. To effectively
implement multitenancy, multitenancy at the database layer must be considered.
The multitenancy model at the database layer significantly affects the implementa-
tion of upper-level services and applications. Multitenancy usually involves resource
sharing. Therefore, corresponding measures must be available to prevent one tenant
from exhausting system resources and affecting the response time of other tenants.
In a multitenancy architecture, a database system is deployed for each tenant, or
multiple tenants share the same database system and are isolated by using
namespaces. However, the O&M and management of this approach are complex. In
cloud-native scenarios, computing and storage nodes in a database can be bound to
different tenants to achieve resource isolation and scheduling for the tenants.
2.3.6 Intelligent O&M
3.1 Design Principles
Before we discuss the database forms and technological trends in the cloud comput-
ing era, let us first delve into the essence of cloud computing and databases.
Cloud computing pools various IT infrastructure resources to integrate the com-
puting, communication, and storage resources that customers require for centralized
management. This enables customers to build large-scale information systems and
infrastructure without the need to build IDCs (Internet data centers), purchase hard-
ware facilities, deploy basic networks, or install operating systems and software,
significantly reducing investment costs at the initial stage. With the resource virtu-
alization and pooling technologies of cloud computing, customers can also elasti-
cally adjust their infrastructure to quickly respond to changes in business traffic. In
addition, cloud service providers can use, maintain, and manage massive resources
in a centralized manner. This greatly improves the technological capabilities and
supply chain management capabilities of cloud service providers and leads to
greater economies of scale, significantly improving the overall resource utilization.
Databases can be analyzed based on database users. Users utilize the computing
and storage capabilities of databases to complete the full-link process starting from
The lack of elasticity in traditional distributed databases and the performance bot-
tleneck of single nodes are the result of the binding of computing and storage of
individual nodes. This necessitates a technical architecture that separates computing
and storage. Currently, cloud-native databases are developing toward the compute-
storage-separated architecture. When implementing this architecture, cloud service
providers typically bind the CPU and memory together while separately deploying
the persistent storage, such as SSDs and HDDs. With the development of NVM
technologies, the CPU and memory can be further isolated in the future, and mem-
ory resources can be pooled to form a three-tier resource pool, helping customers
better achieve on-demand resource usage.
On the basis of the Von Neumann architecture, a database system can be
abstracted into a three-layer architecture that consists of the computing, communi-
cation, and storage layers. Cloud-native databases can ensure that the resources at
each layer can be independently scaled. Computing and communication resources
are stateless infrastructure resources. Therefore, compute nodes and communica-
tion nodes can be quickly started and closed during resource scaling to fully utilize
the elasticity of cloud computing. The storage layer is completely pooled and used
on-demand. In terms of specific processing technologies, the computing layer is
stateless and only processes business logic without persistently storing data and
therefore mainly involves distributed computing technologies, including but not
limited to distributed transaction processing, MPP, and distributed resource sched-
uling. Meanwhile, the storage layer only stores data and does not process business
logic and therefore mainly involves data consistency, security, and multimodal data
storage models in distributed scenarios.
Against the backdrop of cloud computing, new questions about database system architecture have been raised and new challenges have arisen. A cloud-native database system is
designed based on the principle that its core components can fully utilize the resource
pooling feature of cloud computing to deliver more efficient and secure data services.
From the perspective of technical implementation, stateful storage resources and stateless computing resources must be distinguished and managed with different resource scheduling and utilization strategies. On the premise that security, reliability, and correctness are guaranteed, data movement should be minimized to reduce additional computing, storage, and communication overheads. Programming interfaces compatible with tra-
ditional database systems are preferred to achieve smoother learning curves for users
and enable users to complete the entire process of data production, storage, process-
ing, and consumption in a more convenient and efficient manner.
3.2 Architecture Design
As shown in Fig. 3.1, SQL requests sent by the client are forwarded by a proxy
layer to any node in the computing service layer for processing. The proxy layer is
a simple load balancing service. The computing service layer parses the SQL
requests and converts the requests into physical execution plans. The execution of a
physical execution plan involves transaction processing and data access and is per-
formed by the storage service layer. As mentioned in Chap. 1, the storage service
layer is responsible for data cache management and transaction processing and
manages and organizes data in the form of data pages to ensure that the updating
and reading of data pages comply with the ACID (atomicity, consistency, isolation,
and durability) semantics of transactions. In practice, the storage service layer may
not be physically separated and may be partially integrated into the computing ser-
vice layer and the shared storage layer.
The shared storage layer persistently stores data pages and ensures the high
availability of database data. Typically, the shared storage layer is implemented as a
distributed file system that uses multiple replicas and distributed consensus proto-
cols to ensure data consistency and reliability. The compute-storage-separated
architecture allows each layer to be independently and elastically scaled to achieve
theoretically optimal allocation of resources. Thanks to the shared storage design,
all data views visible to compute nodes are complete, and the expansion of comput-
ing capabilities can be achieved in real time without the need for extensive data
migration as required in other databases of the MPP architecture. However, this can
also be problematic. If each node in the storage service layer handles write transac-
tions, data conflicts will inevitably occur. In addition, handling cross-node data con-
flicts will require massive network communications and complex processing
algorithms, resulting in a high processing cost. To simplify implementation, some
cloud-native databases designate one of the nodes as the update node and the others
as read-only nodes. The read-only nodes need to provide access to consistent data
pages based on the transaction isolation semantics. The shared storage layer is not
equivalent to a general distributed file system, such as Google File System (GFS) or
Hadoop Distributed File System (HDFS), but is designed to adapt to the page-based organization of databases. The size of a data block is selected based on the
I/O pattern of the database. More importantly, data playback logic is integrated into
the shared storage layer, which uses distributed capabilities to increase concurrency
and improve page update performance.
Different cloud-native databases may use different layering logic. In most cloud-
native databases, SQL statement parsing, physical plan execution, and transaction
processing are implemented in the computing layer, and transaction-generated log
records and data are stored in the shared storage layer (also known as the storage
layer). In the storage layer, data is stored in multiple replicas to ensure data reliability, and consensus protocols, such as Raft, are used to ensure data consistency.
3.3.1 AWS Aurora
locking, cache management, access interfaces, and undo log management, are still
implemented by the database instances. However, features related to the redo log,
including log processing, fault recovery, and backup and recovery, are pushed down
to the storage layer. Compared with traditional databases, Aurora is advantageous in
three aspects: First, the underlying database storage is a distributed storage service
that facilitates fault handling. Second, the database instances only write redo log
records to the underlying storage layer. This greatly reduces the network pressure
between the database instances and the storage nodes and provides a guarantee for
improving database performance. Lastly, some core features, such as fault recovery
and backup restoration, are pushed down to the storage layer and can be executed
asynchronously in the backend without affecting foreground user tasks.
Traditional databases suffer from severe write amplification issues. For example, in
standalone MySQL, log records are flushed to the disk each time a write operation
is performed, and the back-end thread asynchronously flushes dirty data to the disk.
In addition, data pages also need to be written to the double-write area during the
flushing of dirty pages to avoid page fragmentation. The write amplification issue
may worsen in a production environment in which primary-standby replication is
implemented. As shown in Fig. 3.3, a MySQL instance is separately deployed in
availability zone (AZ) 1 and AZ 2, and synchronous mirror replication is
implemented between the two instances. Amazon Elastic Block Store (EBS) is used
for underlying storage, and each EBS instance has a mirror. Amazon Simple Storage
Service (S3) is also deployed to archive the redo log and binlog to facilitate data
recovery to specific points in time. From the operational perspective, five types of data, namely, the redo log, binlog, data pages, double-write buffer, and FRM files, must be transferred across these steps. Steps 1, 3, and 5 in the figure are executed sequentially because
of the mirror-based synchronous replication mechanism. This mechanism results in
an excessively long response time as it requires four network I/O operations, three
of which are synchronous serial operations. From a storage perspective, data is
stored in four replicas on EBS, and a write success is returned only after data is suc-
cessfully written to all four replicas. In this architecture, the I/O volume and the
serial model will lead to an extremely poor performance.
To reduce network I/Os, only one type of data (redo log) is written in Aurora, and
data pages are never written at any time. After receiving redo log records, a storage
node replays the log records based on data pages of an earlier version to obtain data
pages of a new version. To avoid replaying the redo log from the beginning each
time, the storage node periodically materializes data page versions. As shown in
Fig. 3.4, Aurora consists of a primary instance and multiple standby instances that
are deployed across AZs. Only the redo log and metadata are transmitted between
the primary instance and standby instances or storage nodes. The primary instance
simultaneously sends the redo log to six storage nodes and standby instances. When
four of the six storage nodes respond, the redo log is considered persisted regardless
of the response time of other standby instances. According to the Sysbench test
statistics that are obtained by performing a 30-min stress test in a write-only sce-
nario by using 100 GB of data, Aurora’s throughput is 35 times that of mirror-based
MySQL, and the log volume per transaction is 0.12% less than that of the latter.
Regarding the fault recovery speed, after a traditional database crashes and restarts,
it recovers from the latest checkpoint and reads and replays all redo log records after
the checkpoint to update the data pages corresponding to the committed transac-
tions. In Aurora, the features related to the redo log are pushed down to the storage
layer, and the redo log can be replayed continuously in the backend. If the accessed
data page in any disk read operation is not of the latest version, the storage node is
triggered to replay the log to obtain the latest version of the data page. In this case,
fault recovery operations similar to those in traditional databases are continuously
performed in the backend. When a fault occurs, it can be rapidly rectified.
A key goal in the storage service design of Aurora is to reduce the response time for
front-end user writes. Therefore, operations are moved as far as possible to the
backend for asynchronous execution, and the storage nodes adaptively allocate
resources for different tasks based on the volume of front-end requests. For exam-
ple, when a large number of front-end requests need to be processed, the storage
nodes slow down the reclamation of data pages of earlier versions. In traditional
databases, back-end threads need to continuously advance checkpoints to avoid
excessively long fault recovery time from affecting front-end user request process-
ing capabilities. Thanks to the independent storage service layer in Aurora, check-
point advancement in the backend does not affect database instances. Faster
checkpoint advancement is more favorable for front-end disk I/O read operations
because this reduces the amount of log data that needs to be replayed.
To ensure database availability and correctness, the replication in the storage
layer of Aurora is based on the Quorum protocol. It is assumed that (1) V nodes
exist in the replication topology, (2) each node has one vote, and (3) success is
returned for a read or write operation only when Vr or Vw votes are obtained. To
ensure consistency, two conditions must be met. First, Vr + Vw > V, which ensures
that each read operation can read from the node with the latest data. Second,
Vw > V/2, which ensures that each write operation is performed on the latest data
obtained after the last write operation, thereby avoiding write conflicts. For exam-
ple, V = 3. To meet the above two conditions, Vr = 2 and Vw = 2. To ensure high
system availability under various abnormal conditions, database instances in
Aurora are deployed in three different AZs, each with two replicas. Each AZ is
equivalent to an IDC that has independent power systems, networks, and software
deployment and serves as an independent fault tolerant unit. Based on the Quorum
model and the two rules mentioned earlier, it is assumed that V = 6, Vw = 4, and
Vr = 3. In this case, Aurora can ensure intact write services when an AZ is faulty
and can still provide read services without data loss when an AZ and a node in
another AZ are faulty.
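The two quorum conditions can be checked mechanically, as the small sketch below shows for the configurations mentioned in the text (V = 3, Vw = 2, Vr = 2 and V = 6, Vw = 4, Vr = 3). The helper function is purely illustrative and is not part of any Aurora interface.

// Condition 1: every read quorum overlaps every write quorum (reads see the
// latest committed write). Condition 2: any two write quorums overlap (no two
// conflicting writes can both succeed).
constexpr bool quorum_ok(int v, int vw, int vr) {
  return (vr + vw > v) && (2 * vw > v);
}

int main() {
  static_assert(quorum_ok(3, 2, 2), "the V = 3 example in the text");
  static_assert(quorum_ok(6, 4, 3), "Aurora's 6/4/3 configuration");
  // Losing one AZ (2 of 6 votes) leaves 4 votes, so writes (Vw = 4) still succeed.
  // Losing one AZ plus one more node leaves 3 votes, so reads (Vr = 3) still
  // succeed and no committed data is lost, although writes are temporarily blocked.
  return 0;
}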
Provided that an AZ-level failure (which may be caused by fire, flood, or network
failures) and a node-level failure (e.g., disk failure, power failure, or machine dam-
age) do not occur at the same time, Aurora can maintain the quorum based on the
Quorum protocol and ensure database availability and correctness. How to keep a
database “permanently available” will essentially depend on the reduction of the
probability of the two types of failures occurring at the same time. The mean time to failure (MTTF) of a database is usually difficult to reduce. Therefore, the mean time to repair (MTTR) can be reduced to lower the probability of the two types of failures
occurring at the same time. Aurora manages storage by partition, with each partition
sized 10 GB and six 10-GB replicas forming a protection group (PG). The storage
layer of Aurora consists of multiple PGs, and each PG comprises Amazon Elastic
Compute Cloud (EC2) servers and local SSDs. Currently, Aurora supports a maxi-
mum of 64 TB of storage space. After partitioning, each partition serves as a failure
unit. On a 10-Gbps network, a 10-GB partition can be restored within 10 s. Database
service availability will be affected only when two or more partitions fail at the
same time within 10 s, which rarely occurs in practice. Simply put, partition man-
agement effectively improves database service availability.
In Aurora, data writes are performed based on the Quorum model. After storage
partitioning, success can be returned when data is written to a majority of partitions,
and the overall write performance remains intact even if a few disks are under heavy
I/O workloads because the data is discretely distributed. Figure 3.5 shows the spe-
cific write process, which includes the following steps: (1) A storage node receives
log records from the primary instance and appends the log records to the memory
queue. (2) The storage node persists the log records locally and then sends an
acknowledgment (ACK) to the primary instance. (3) The storage node classifies the
log records by partition and determines the log records that are lost. (4) The storage
node interacts with other storage nodes to obtain the missing log records from these
storage nodes. (5) The storage node replays the log records to generate new data
pages. (6) The storage node periodically backs up data pages and log records to the
S3 system. (7) The storage node periodically reclaims expired data page versions.
(8) The storage node periodically performs cyclic redundancy check (CRC) on data
pages. Only Steps (1) and (2) are serially synchronous and directly affect the
response time of front-end requests. Other steps are asynchronous.
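The division of labor in steps (1) to (8) can be sketched as follows: only appending the incoming log records, persisting them locally, and acknowledging the primary lie on the synchronous path, while everything else is handed to background tasks. All names below are illustrative stand-ins, not Aurora internals.

#include <cstdint>
#include <functional>
#include <vector>

struct LogBatch {
  uint64_t last_lsn;
  std::vector<uint8_t> bytes;
};

struct StorageNode {
  std::vector<LogBatch> in_memory_queue;
  void persist_locally(const LogBatch& b) { (void)b; /* append to the local log */ }
  void ack_primary(uint64_t lsn) { (void)lsn; /* network ACK to the primary */ }
  void schedule_background(std::function<void()> task) { (void)task; }
};

void on_log_records(StorageNode& node, const LogBatch& batch) {
  node.in_memory_queue.push_back(batch);   // (1) enqueue the incoming records
  node.persist_locally(batch);             // (2) persist locally, then ACK
  node.ack_primary(batch.last_lsn);
  // Steps (3)-(8): gap detection, gossip with peer storage nodes, page
  // materialization, backup to S3, version reclamation, and CRC checks all run
  // asynchronously and do not affect the response time of front-end requests.
  node.schedule_background([] { /* fill gaps, replay log, back up, reclaim, CRC */ });
}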
3.3.1.3 Consistency Principle
Currently, almost all databases on the market use the WAL (Write-Ahead Logging)
model. Any change to a data page must first be recorded in the redo log record corre-
sponding to the modified data page. As a MySQL-based database, Aurora is no excep-
tion. During implementation, each redo log record has a globally unique log sequence
number (LSN). To ensure data consistency between multiple nodes, Aurora uses the
Quorum protocol instead of the 2PC protocol because the latter has low tolerance for
errors. In a production environment, each storage node may have some missing log
records. The storage nodes complete their redo log records based on the Gossip proto-
col. During normal operation, database instances are in a consistent state, and only the
storage node with complete redo log records needs to be accessed during disk read.
However, during a fault recovery process, read operations must be performed based on
the Quorum protocol to rebuild the consistent state of the database. Many transactions
are active on a database instance, and the transactions may be committed in an order
different from the order in which they are started. Therefore, when the database crashes
and restarts due to an exception, the database instance must determine whether to com-
mit or roll back each transaction. To ensure data consistency, several concepts regarding
redo log records at the storage service layer are defined in Aurora:
• Volume complete LSN (VCL): the largest LSN for which the storage service can guarantee that all log records with an LSN less than or equal to it are complete. During fault recovery, all log records with an LSN greater than the VCL must be truncated.
• Consistency point LSN (CPL): For MySQL (InnoDB), each transaction consists
of multiple minitransactions. A minitransaction is the smallest atomic operation
unit. For example, a B-tree split may involve modifications to multiple data
pages, and the corresponding group of log records for these page modifications
is atomic. Redo log records are also replayed in units of minitransactions. A CPL represents the LSN of the last log record of a minitransaction, and one transaction may therefore have multiple CPLs.
• Volume durable LSN (VDL) represents the maximum LSN among all CPLs that
is persisted, where VDL ≤ VCL. To ensure the atomicity of minitransactions, all
log records with an LSN greater than the VDL must be truncated. For example,
if the VCL is 1007 and the CPLs are 900, 1000, and 1100, the VDL is 1000.
Then, all log records with an LSN greater than 1000 must be truncated. The VDL represents the latest LSN at which the database is in a consistent state. Therefore, during fault recovery, the database instance determines the VDL for each PG and truncates all log records with an LSN greater than the VDL, as illustrated in the sketch below.
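The VDL rule can be illustrated with a toy calculation (this is not Aurora code): the VDL is simply the largest CPL that does not exceed the VCL, and log records with larger LSNs are truncated during recovery.

#include <cassert>
#include <cstdint>
#include <vector>

uint64_t compute_vdl(uint64_t vcl, const std::vector<uint64_t>& cpls) {
  uint64_t vdl = 0;
  for (uint64_t cpl : cpls)
    if (cpl <= vcl && cpl > vdl) vdl = cpl;   // largest CPL not exceeding the VCL
  return vdl;
}

int main() {
  // Matches the example in the text: VCL = 1007, CPLs = {900, 1000, 1100} => VDL = 1000.
  assert(compute_vdl(1007, {900, 1000, 1100}) == 1000);
  return 0;
}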
3.3.1.4 Fault Recovery
Most databases perform fault recovery based on the classic ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) protocol and use the WAL mechanism to ensure that all committed transactions are persisted and uncommitted trans-
actions are rolled back in case of a fault. Such systems typically perform periodic
checkpointing and record checkpoint information in log records. If a fault occurs, a
data page may contain committed and uncommitted data. In this case, the system
must first replay the redo log starting from the last checkpoint during fault recovery
to restore the data pages to the status at the time of the fault and then roll back
uncommitted transactions based on undo logs. Fault recovery is time-consuming
and strongly related to the checkpointing frequency. Increasing the checkpointing
frequency can reduce the fault recovery time but directly affects the front-end
request processing of the system. The checkpointing frequency and fault recovery
time must be balanced, which, however, is not necessary in Aurora.
During fault recovery in a traditional database, the database status advances by
replaying the redo log. The entire database is offline during redo log replay. Aurora
uses a similar approach, but the log replay logic is pushed down to storage nodes
and runs in the backend while the database provides services online. Therefore,
when the database restarts due to a fault, the storage service can quickly recover.
Even under a pressure of 100,000 TPS, the storage service can recover within 10 s.
After a database instance crashes and restarts, fault recovery must be performed to
obtain a consistent runtime status. The instance communicates with Vr storage nodes
to ensure that the latest data is read, calculates a new VDL, and truncates log records
with LSNs greater than the VDL. In Aurora, the range of newly allocated LSNs is
limited. To be specific, the difference between the LSN and VDL cannot exceed
10,000,000. This prevents excessive uncommitted transactions on the instance
because the database needs to roll back uncommitted transactions based on undo
logs after replaying the redo log. In Aurora, the database can provide services after
all redo log records are replayed, and transaction rollback based on undo logs can
be performed after the database provides services online.
3.3.2 PolarDB
environments. The compute nodes and storage nodes in the database are intercon-
nected over a high-speed network and transmit data to each other based on the
RDMA protocol. This way, the database performance is no longer bottlenecked by
the I/O performance. The database nodes are fully compatible with MySQL. The
primary node and read-only nodes work in active-active mode, and the failover
mechanism is provided to deliver high availability of databases. The data files and
redo log of the database are stored in a user-space file system, routed by the inter-
face between the file system and the block storage device, and transmitted to remote
chunk servers by using the high-speed network and the RDMA protocol. In addi-
tion, only metadata information related to the redo log needs to be synchronized
among database instances. Data is stored in multiple replicas in the chunk servers to
ensure data reliability, and the ParallelRaft protocol is used to ensure data consistency.
3.3.2.1 Physical Replication
The binlog in MySQL records changes to data at the tuple (row) level, whereas the redo log, which records changes to physical file pages, is maintained in the InnoDB engine layer to ensure the ACID properties of transactions. As a result, the fsync() function needs to be called at least twice during the pro-
cessing of a transaction in MySQL. This directly affects the response time and
throughput performance of the transaction processing system. Although MySQL
employs a Group Commit mechanism to increase the throughput in high-concurrency
Before its application in cloud computing, RDMA had been widely used in the high-performance computing (HPC) field for several years. RDMA typically uses net-
work devices that support high-speed connections, such as switches and network
interface controllers (NICs), to communicate with the NIC driver through a specific
programming interface. With RDMA, data is efficiently transmitted between NICs
and remote applications with low latency by using the zero-copy technique. In addition, data does not need to be copied from kernel space to user space. Hence, the CPU is barely involved in the data transfer, which greatly reduces performance jitter and improves the overall processing capabilities of the system. In
PolarDB, the compute nodes and storage nodes are interconnected over a high-
speed network and transmit data to each other by using the RDMA protocol. This
way, the system performance is no longer bottlenecked by the I/O performance.
block device, a copy of the block device is created, and the write operation modifies
the copy of the block device. This way, data can be recovered to a specific snapshot
point. Snapshotting is a typical postprocessing mechanism based on time and write
load models. In other words, when a snapshot is created, data is not backed up.
Instead, the data backup load is evenly distributed to the time windows in which
actual data writes occur after the snapshot is created, thus achieving fast response to
backup and recovery. PolarDB provides the snapshotting and redo log mechanisms
to implement data recovery to specific points in time, which is more efficient than
the traditional recovery mode in which full data is used together with incremental
binlog data.
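The copy-on-write idea behind such snapshots can be sketched as follows: taking a snapshot copies nothing, and the pre-snapshot version of a block is preserved only when the block is first overwritten afterward. The sketch is illustrative only; PolarDB's actual chunk-level snapshot mechanism differs in detail.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

class SnapshottedVolume {
  std::vector<std::vector<uint8_t>> blocks_;                     // current block contents
  std::unordered_map<uint64_t, std::vector<uint8_t>> snapshot_;  // preserved old versions

 public:
  SnapshottedVolume(size_t nblocks, size_t block_size)
      : blocks_(nblocks, std::vector<uint8_t>(block_size)) {}

  void take_snapshot() { snapshot_.clear(); }   // O(1): no data is copied here

  void write_block(uint64_t id, const std::vector<uint8_t>& data) {
    // Preserve the pre-snapshot version the first time this block is overwritten
    // after the snapshot (emplace inserts only if the key is not yet present).
    snapshot_.emplace(id, blocks_[id]);
    blocks_[id] = data;
  }

  const std::vector<uint8_t>& read_block_at_snapshot(uint64_t id) const {
    auto it = snapshot_.find(id);
    return it != snapshot_.end() ? it->second : blocks_[id];
  }
};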
3.3.3 Microsoft Socrates
3.3.3.1 Computing Layer
Figure 3.8 shows the internal structure of XLOG Service. Log blocks are synchro-
nously written from the primary node to the landing zone (LZ). In the current Socrates
version, XStore, a premium storage service of Azure, is used as the storage medium
for the LZ. To implement data persistence, XStore retains three replicas of all data.
The primary node asynchronously sends log records to the XLOG process, which
then sends the log records to read-only nodes and Page Servers. When the log blocks
are sent to the LZ and XLOG process in parallel, data may reach the read-only nodes
before being persisted in the LZ, resulting in data inconsistency or loss in the event of
a failure. To avoid this situation, XLOG propagates only log records that have been
persisted in the LZ. The XLOG process stores the log records in pending blocks, and
the primary node notifies the XLOG process of the log blocks that have been per-
sisted. Then, the XLOG process moves the persisted log blocks from the pending
blocks to the LogBroker, from which the log blocks are broadcast to read-only nodes
and Page Servers. The XLOG process incorporates a Destaging process, which copies
persisted log blocks to a fixed-size local SSD cache to accelerate access and sends a
copy of the log blocks to XStore for long-term retention. Socrates refers to the long-
term retention of log blocks as Long-Term Archive (LT). In Socrates, LZ and LT
retain all log data to meet the requirement for database persistence. The LZ is an
expensive service that can achieve low latency to facilitate fast commits of transac-
tions. It also retains log records for 30 days to facilitate data recovery to specific points
in time. XStore (LT) uses inexpensive and durable storage devices to store massive
data. This tiered storage structure meets performance and cost requirements.
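The propagation rule described above, namely that log blocks are broadcast only after they are known to be durable in the LZ, can be sketched as follows. The types and method names are illustrative stand-ins, not Socrates internals.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct LogBlock {
  uint64_t id;
  std::vector<uint8_t> bytes;
};

struct XlogService {
  std::unordered_map<uint64_t, LogBlock> pending_blocks;   // not yet durable in the LZ

  void broadcast(const LogBlock& b) { (void)b; /* LogBroker: fan out to secondaries and Page Servers */ }
  void cache_on_local_ssd(const LogBlock& b) { (void)b; /* destaging: fixed-size local cache */ }
  void archive_to_xstore(const LogBlock& b) { (void)b; /* long-term archive (LT) */ }

  // Called when the primary confirms that a block has been persisted in the
  // landing zone (LZ). Only then is the block propagated further, so secondaries
  // and Page Servers never see log records that could still be lost.
  void on_block_durable_in_lz(uint64_t block_id) {
    auto it = pending_blocks.find(block_id);
    if (it == pending_blocks.end()) return;
    const LogBlock& block = it->second;
    broadcast(block);
    cache_on_local_ssd(block);
    archive_to_xstore(block);
    pending_blocks.erase(it);
  }
};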
Page Servers are responsible for three tasks: (1) responding to GetPage requests
from compute nodes, (2) maintaining data of a database partition through log
replay, and (3) recording checkpoints and backing up data to XStore. Each Page
Server stores only a portion of the database data pages and focuses only on log
blocks related to the partition handled by the Page Server. To this end, the primary
node adds sufficient annotation information for each log block to indicate the
partitions to which the log records in the log block must be applied. XLOG uses
this filtering information to distribute the relevant log blocks to their correspond-
ing Page Servers. In Socrates, two methods are available for improving system
availability. The first method is to use a more fine-grained sharding strategy,
which allows each Page Server to correspond to a smaller partition, thereby reduc-
ing the average recovery time for each partition and improving system availabil-
ity. Based on current network and hardware parameters, it is recommended that
the partition size be set to 128 GB for Page Servers in Socrates. The second
method is to add a standby Page Server for each existing Page Server. When a
Page Server fails, its standby Page Server can immediately provide services, thus
improving system availability.
3.3.3.4 XStore Layer
XStore is a highly replicated disk-based storage system that spans multiple zones.
It ensures data durability with minimal data loss. In the Socrates architecture,
XStore plays the same role as disks in traditional databases. Similarly, the mem-
ory and SSD caches (RBPEX) of the compute nodes and Page Servers play the
same role as the main memory in traditional databases. Page Servers periodically
send modified data pages to XStore, and Socrates uses the snapshot feature of
XStore to create backups by simply recording a timestamp. When a user requests
a point-in-time recovery (PITR) operation, Socrates fetches from XStore a complete set of snapshots taken before the requested point in time, as well as the log range required to bring this set of snapshots forward from the snapshot time to the requested time.
Socrates divides the entire database into multiple service layers that have
respective lifecycles and perform asynchronous communication as far as possi-
ble. Unlike other cloud-native databases, Socrates separately implements dura-
bility and availability. In particular, Socrates uses XLOG and XStore to ensure
system durability and uses the computing layer and Page Servers to ensure sys-
tem availability. In Socrates, the computing layer and Page Servers are stateless.
This way, data integrity is not affected even if a compute node or Page Server
fails because the data of any Page Server can be recovered to the latest status by
using the snapshot versions and log records in XStore and XLOG. This layered
storage architecture can implement more flexible and finer-grained control,
achieving a better balance among system availability, costs, performance, and
other aspects.
References
1. Verbitski A, Gupta A, Saha D, et al. Amazon Aurora: design considerations for high throughput cloud-native relational databases. In: Proceedings of the 2017 ACM international conference on management of data (SIGMOD '17); 2017. p. 1041–52. https://doi.org/10.1145/3035918.3056101.
2. Verbitski A, Gupta A, Saha D, et al. Amazon Aurora: on avoiding distributed consensus for I/Os, commits, and membership changes. In: Proceedings of the 2018 ACM international conference on management of data (SIGMOD '18); 2018. p. 8. https://doi.org/10.1145/3183713.3196937.
3. Cao W, Liu ZJ, Wang P, et al. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proc VLDB Endow. 2018;11:1849–62. https://doi.org/10.14778/3229863.3229872.
4. Li FF. Cloud-native database systems at Alibaba: opportunities and challenges. PVLDB. 2019;12(12):2263–72. https://doi.org/10.14778/3352063.3352141.
5. Antonopoulos P, Budovski A, Diaconu C, et al. Socrates: the new SQL server in the cloud. In: Proceedings of the 2019 ACM international conference on management of data (SIGMOD '19); 2019. p. 14. https://doi.org/10.1145/3299869.3314047.
Chapter 4
Storage Engine
A storage engine provides the technical implementation for storing data in files (or
memory). Different storage engines may use different storage mechanisms, index-
ing techniques, and locking methods and provide extensive functionality. In this
chapter, we will introduce the basic concepts and technologies related to storage
engines from three aspects: data organization, concurrency control, and logging and
recovery. Then, we will discuss the characteristics and advantages of X-Engine, the
storage engine of PolarDB.
4.1 Data Organization
4.1.1 B+ Tree
The B+ tree is a variant of the B-tree proposed by Rudolf Bayer and Edward McCreight in their 1970 paper Organization and Maintenance of Large Ordered Indices, and it has since become the most common and frequently used index structure in databases. Built on a mutable storage structure, a B+ tree can rapidly locate the page on which a data row resides based on its key. A B+ tree is an m-ary tree, which reduces the depth of the index structure and avoids most of the random access operations incurred by traditional binary tree structures, effectively reducing the number of disk head seeks and mitigating the impact of external storage access latency on performance. A B+ tree also keeps the key-value pairs within its nodes ordered, so the time complexity of query, insertion, deletion, and update operations is bounded by O(log n). Given these advantages, B+ trees are widely used as a building block for index structures in many database and storage systems, including PolarDB, Aurora, and other cloud-native databases.
4.1.1.1 Principles of B+ Trees
This section describes the structure and characteristics of a B+ tree. Due to limited
space, only a brief introduction to the basic structure and operations of a B+ tree is
provided. For more information, refer to the referenced articles for further reading.
The storage structure of a computer consists of the following layers from top to bot-
tom: registers, high-speed cache, main memory, and auxiliary storage. The main
memory is also called RAM, and the auxiliary storage is also known as external
storage (e.g., a disk used to store files). In this hierarchy, the access speed of each layer is much lower than that of the layer above it, with disk access being the slowest. This is because disk access involves track seeking and sector locating. In track seeking, the actuator arm moves the magnetic head to the track in which the data is located. In sector locating, the disk rotates at a high speed until the target sector, one of the hundreds of sectors in the track, passes under the head.
These are time-consuming mechanical operations. At the disk scheduling level,
various disk scheduling algorithms are employed to reduce the movement of the
actuator arm of the disk, thereby improving efficiency. At the index structure level,
a proper index must be created to improve the disk read efficiency. In most cases,
the performance of an index is evaluated based on the number of disk I/Os. The
performance of a B+ tree index is advantageous in the following aspects:
• Organizing data on the disk by using a B+ tree index is inherently advantageous because the operating system reads and writes data in units of disk blocks. When the size of the leaf nodes of the B+ tree is aligned with the disk block size, the number of I/Os incurred in the interaction between the operating system and the disk can be significantly reduced.
• A B+ tree consists of nonleaf nodes and leaf nodes. Nonleaf nodes, also known as index nodes, are mapped to index pages in the physical structure. Leaf nodes are data nodes and are mapped to data pages in the physical structure. The index nodes store only keys and pointers, not data, so a single index node can hold a large number of branches, and the locations of many physical pages on the disk can be read into memory with just one I/O operation.
• All leaf nodes in a B+ tree contain pointers to their neighboring leaf nodes, which greatly improves the efficiency of range queries.
By default, MySQL 5.5 and later use InnoDB as the storage engine. The InnoDB engine uses a B+ tree as its index structure, and table data is stored in a clustered index organized by the primary key.
The InnoDB engine organizes data in the tablespace structure. Each tablespace contains multiple segments, each segment contains multiple extents, and each extent occupies 1 MB of space and contains 64 pages. The smallest access unit in the InnoDB engine is the page, which can store data or index pointers. In a B+ tree, leaf and nonleaf nodes store data and index pointers, respectively. To search for a record, the engine uses the B+ tree index to determine the page that contains the record, loads that page into memory, and then scans the page in memory to find the row that contains the record.
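The following minimal sketch illustrates how a B+ tree point lookup descends from the root through index nodes to the leaf that holds the record. It is an in-memory toy, not InnoDB code: a real engine works on fixed-size pages, follows page numbers rather than in-memory references, and loads pages through the buffer pool.

# Minimal in-memory B+ tree point lookup (illustrative only).
import bisect

class Node:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []       # sorted keys
        self.children = []   # leaf: values (rows); interior: child nodes
        self.next = None     # leaf-level sibling pointer for range scans

def search(node, key):
    """Descend from the root to the leaf that may contain `key`."""
    while not node.leaf:
        # Route to the child whose key range covers `key`.
        i = bisect.bisect_right(node.keys, key)
        node = node.children[i]
    # Within the leaf "page", do an ordered lookup for the row.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.children[i]
    return None

# Tiny example: one root with two leaves.
leaf1 = Node(leaf=True); leaf1.keys = [1, 3]; leaf1.children = ["row1", "row3"]
leaf2 = Node(leaf=True); leaf2.keys = [5, 9]; leaf2.children = ["row5", "row9"]
leaf1.next = leaf2
root = Node(); root.keys = [5]; root.children = [leaf1, leaf2]
assert search(root, 9) == "row9" and search(root, 4) is None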
The read performance of the InnoDB engine depends on the query performance of the B+ tree. The write performance of the InnoDB engine is guaranteed by the WAL mechanism. In this mechanism, each write operation is recorded in the redo log instead of immediately updating the full index and data content of the B+ tree, and the redo log is written to the disk sequentially. At the same time, the dirty pages that need to be updated on the B+ tree are tracked in memory, and the dirty pages are flushed to the disk when specific conditions are met.
When data is read in InnoDB, the page that contains the desired record can be rapidly located by using the B+ tree index, and the record is fetched after the page is loaded into memory. For write operations, MySQL uses the WAL mechanism: writes are first recorded in the log and only later applied to the data files on disk. Because sequential writes perform far better than random writes, MySQL achieves efficient writes by combining sequential logging with a buffer pool.
Directly reading from or writing to the disk is costly, and frequent random I/Os significantly reduce overall efficiency. To reduce disk accesses, pages are cached in memory based on the principle of locality. The buffer pool mainly serves to:
• Preserve the content of cached disk pages in the memory.
• Cache modifications to disk pages so that the cached version is modified instead
of the data on the disk.
• Return the cached page if the requested page is in the cache.
• Load the requested page from the disk to the memory when the requested page
is not cached and the memory has free space.
• Call a page replacement strategy to select pages to be swapped out when the requested page is not cached and the memory has no free space. The contents of swapped-out pages are written back to the disk if they have been modified.
When the storage engine accesses a page, it first checks whether the content of
the page is cached in the buffer pool. If that is the case, the storage engine directly
returns the requested page. Otherwise, it converts the logical address or page num-
ber of the requested page to a physical address and loads the content of the page
from the disk to the buffer pool.
When the requested page is not in the buffer pool and the memory has no free
space, a page needs to be swapped out from the cache and written back to the disk
before the requested page can be swapped into the memory. The algorithm used to
select the page to be swapped out is called a page replacement algorithm. A good
page replacement algorithm achieves a low page replacement frequency. In other words, pages that will no longer be accessed, or will not be accessed for a long time, are swapped out first. The following four page replacement algorithms are most
commonly used:
• First In First Out (FIFO): This page replacement algorithm first evicts the pages
that have been residing in the memory for the longest time.
• Least Recently Used (LRU): This replacement algorithm first evicts the least
recently used pages from the memory.
• Clock: This replacement algorithm keeps the cached pages and their access bits in a circular buffer. A clock hand sweeps over the pages: it clears access bits that are set and evicts the first page whose access bit is already 0.
• Least Frequently Used (LFU): This replacement algorithm sorts pages based
on their request frequency and evicts the page with the lowest request
frequency.
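The interplay between caching, LRU replacement, and dirty-page write-back can be sketched as follows. This is a minimal illustration, not InnoDB code; the disk object and its read_page/write_page methods are assumed interfaces invented for the example.

# Minimal buffer pool with LRU replacement and dirty-page write-back.
from collections import OrderedDict

class BufferPool:
    def __init__(self, disk, capacity):
        self.disk = disk
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> content, ordered by recency
        self.dirty = set()

    def get_page(self, page_id):
        if page_id in self.pages:                 # cache hit
            self.pages.move_to_end(page_id)       # mark as most recently used
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:      # no free frame: evict the LRU page
            victim, content = self.pages.popitem(last=False)
            if victim in self.dirty:              # flush a dirty victim first
                self.disk.write_page(victim, content)
                self.dirty.discard(victim)
        content = self.disk.read_page(page_id)    # cache miss: load from disk
        self.pages[page_id] = content
        return content

    def modify_page(self, page_id, new_content):
        self.get_page(page_id)                    # make sure the page is resident
        self.pages[page_id] = new_content         # modify the cached copy only
        self.dirty.add(page_id)                   # flushed later, e.g., at a checkpoint

    def flush_all(self):
        for page_id in list(self.dirty):
            self.disk.write_page(page_id, self.pages[page_id])
        self.dirty.clear()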
If a page in the buffer pool is modified, for example, appended with a new tuple,
the page is marked as a dirty page to indicate that its content is inconsistent with that
of the disk data page. The dirty page must be flushed to the disk to ensure data
consistency.
In the InnoDB engine, the checkpointing mechanism controls WAL and the buf-
fer pool to ensure that they work in coordination. Related operation logs can be
discarded from the redo log files only when the data pages in the buffer pool are
written to the disk. After the above process is completed, dirty pages can be evicted
from the cache. The InnoDB engine flushes dirty pages to the disks in the follow-
ing cases:
• When the redo log files are full, the system stops accepting update operations, advances the checkpoint, and flushes to disk all dirty pages corresponding to the log records between the old checkpoint and the new checkpoint position, thereby making space in the redo log.
• When a page fault occurs due to insufficient system memory, some pages need to
be swapped out so that new requested pages can be swapped in. A dirty page
must be written to the disk first before it can be swapped out.
• Dirty pages are flushed to the disk when the database is idle.
• When the database is properly closed, all dirty pages in the memory must be
flushed to the disk.
4.1.3 LSM-Tree
The LSM-tree concept was proposed by Professor Patrick O’Neil in 1996 in his
paper The log-structured merge-tree (LSM-tree). The name “log-structured merge-
tree” originates from the log-structured file system. Similar to the log-structured file
system, an LSM-tree is also implemented based on an immutable storage structure,
which executes sequential write operations by using buffers and the append-only
mode. This avoids most of the random write operations in a mutable storage struc-
ture and mitigates the impact of multiple random I/Os of a write operation on per-
formance. An LSM-tree still keeps the data stored on disk ordered. The immutable storage structure is conducive to sequential writes: buffered data is written to the disk in one batch in append-only mode, and the higher data density of immutable files prevents external fragmentation.
In addition, data locations do not need to be determined in advance for write,
insertion, and update operations because the files are immutable. This greatly
reduces the impact of random I/Os and significantly improves the write performance
and throughput. However, immutable files allow duplicate entries for the same key. As the amount of appended data increases, the number of disk-resident tables also increases, so a read may have to consult multiple files that contain different versions of the same key. This problem is addressed by maintaining the LSM-tree through compactions.
As mentioned above, a B+ tree organizes data on the disk in units of pages and uses nonleaf and leaf nodes to store index entries and data records, respectively, which makes it easy to locate the page that contains the desired data record. In an LSM-tree, data exists in the form of sorted string tables (SSTables). An SSTable usually consists of two components: an index file and a data file. The index file stores keys and their offsets in the data file, and the data file consists of concatenated key-value pairs. Each SSTable consists of multiple pages. To query a data record, instead of directly locating the page that contains the record as in a B+ tree, the system first locates the SSTable and then finds the page that contains the record by using the index file of that SSTable.
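The following sketch illustrates an SSTable lookup using a sparse index of keys and offsets. It is a simplified, in-memory illustration under assumed structures; real SSTables add data blocks, bloom filters, and compression.

# Minimal SSTable reader sketch: a sparse index maps keys to offsets in the
# data file, so a lookup seeks close to the key and scans a small range.
import bisect

class SSTable:
    def __init__(self, sorted_items):
        """sorted_items: list of (key, value) pairs sorted by key."""
        self.data = sorted_items
        # Sparse index: every 4th key and its position in the data file.
        self.index_keys = [k for k, _ in sorted_items[::4]]
        self.index_offsets = list(range(0, len(sorted_items), 4))

    def get(self, key):
        # Find the last indexed key <= key, then scan forward from its offset.
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None
        for k, v in self.data[self.index_offsets[i]:self.index_offsets[i] + 4]:
            if k == key:
                return v
        return None

table = SSTable([("a", 1), ("c", 2), ("f", 3), ("h", 4), ("k", 5), ("m", 6)])
assert table.get("k") == 5 and table.get("b") is None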
4.1.3.1 Structure of an LSM-Tree
Figure 4.3 shows the overall architecture of an LSM-tree, which includes memory-
resident components and disk-resident components. When a write request is exe-
cuted to write data, the operation is first recorded in the commit log on the disk for
fault recovery. Then, the record is written to the mutable memory-resident compo-
nent (MemTable). When the size of the MemTable reaches a specific threshold, the
MemTable becomes an immutable memory-resident component (immutable
MemTable), and the data in the MemTable is flushed to the disk in the backend. For
disk-resident components, the written data is divided into multiple levels. Data
flushed from the immutable MemTable is first stored at Level 0 (L0), and a corre-
sponding SSTable is generated. When the data size of L0 reaches a specific thresh-
old, the SSTables at L0 are compacted into Level 1 (L1); subsequent levels are
compacted in a similar way.
• Memory-resident components: Memory-resident components consist of the MemTable and the immutable MemTable. Data in the MemTable is usually kept in an ordered structure such as a skip list, so that the data is already sorted when it is flushed and the on-disk data remains ordered. The MemTable buffers data records and serves as the primary destination for read and write operations. The immutable MemTable is the component whose data is written to the disk.
• Disk-resident components: Disk-resident components consist of the commit log and SSTables. The MemTable exists only in memory. To prevent data loss caused by a system failure before data is written to the disk, the operation records must be written to the commit log before the data is written to the MemTable, which ensures data persistence. An SSTable consists of data records written to the disk from the immutable MemTable; it is immutable, so it can only be read, compacted, and deleted.
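The write path described above (commit log, then MemTable, then flush to L0) can be sketched as follows. This is a minimal illustration with invented names, not the code of any particular engine; deeper levels and compaction are omitted.

# Minimal LSM-tree write path sketch: every write first goes to the commit log
# for durability, then to the MemTable; a full MemTable becomes immutable and
# is flushed to L0 as a sorted run (SSTable).
class SimpleLSM:
    def __init__(self, memtable_limit=4):
        self.commit_log = []           # stands in for the on-disk commit log
        self.memtable = {}             # mutable memory-resident component
        self.levels = [[]]             # levels[0] = list of L0 sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. log the operation for fault recovery
        self.memtable[key] = value             # 2. apply it to the MemTable
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # The MemTable becomes immutable and is written out as a sorted run.
        run = sorted(self.memtable.items())
        self.levels[0].append(run)
        self.memtable = {}
        self.commit_log.clear()        # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for run in reversed(self.levels[0]):   # then newer L0 runs before older ones
            for k, v in run:
                if k == key:
                    return v
        return None                            # deeper levels omitted in this sketch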
Just as B+ trees based on a mutable storage structure inevitably encounter write amplification, LSM-trees based on an immutable storage structure encounter read amplification. In fact, different compaction strategies for LSM-trees introduce problems of their own. In the distributed systems field, the well-known CAP theorem states that a distributed system can provide only two of consistency, availability, and partition tolerance at a time. In 2016, Manos Athanassoulis et al. proposed a similar observation called the RUM conjecture, which states that a data structure can be optimized to mitigate at most two of the read, write, and space amplification problems at the same time. In short, LSM-trees based on an immutable storage structure face the following three problems:
• Read amplification: The LSM-tree is searched layer by layer during data retrieval,
resulting in additional disk I/O operations. The read amplification problem is
more prominent in range queries.
• Write amplification: Data is continuously rewritten to new files during compac-
tions, resulting in write amplification.
• Space amplification: Duplication is allowed and expired data is not immediately
cleaned up, resulting in space amplification.
Because of their different implementations, the two common compaction strategies, tiered compaction and leveled compaction, result in different amplification problems.
In tiered compaction, the SSTable at a high level is quite large in size, and origi-
nal SSTable files are retained for fault recovery before the compaction is completed.
As a result, the data volume doubles within a short time. Although the old data is
deleted after compaction is completed, this still causes a serious space amplification
problem.
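The core of any compaction strategy is a merge of sorted runs in which only the newest version of each key survives, which is what bounds read and space amplification. The following is a minimal, illustrative sketch; a real compaction would also drop tombstones that are older than the oldest snapshot.

# Minimal compaction sketch: merge sorted runs (newest first) and keep only
# the newest version of each key.
import heapq

def compact(runs):
    """runs: list of sorted (key, value) lists, ordered from newest to oldest."""
    # Tag each record with the age of its run; heapq.merge keeps global key order,
    # and the age breaks ties so the newest version of a key comes out first.
    tagged_runs = [[(key, age, value) for key, value in run]
                   for age, run in enumerate(runs)]
    result, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged_runs):
        if key != last_key:            # first occurrence = newest version
            result.append((key, value))
            last_key = key
    return result

new_run = [("a", "v2"), ("c", "v9")]          # newer run
old_run = [("a", "v1"), ("b", "v0")]          # older run
assert compact([new_run, old_run]) == [("a", "v2"), ("b", "v0"), ("c", "v9")]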
4.2 Concurrency Control
4.2.1 Basic Concepts
When a transaction accesses a data item, no other transaction can modify the data
item. To meet this requirement, a transaction is allowed to access only the data items
it holds locks on. Different data locking methods are available, but this section will
describe only the following two lock modes:
• Shared lock: If a transaction holds a shared lock on a data item, it can read but not write the data item. Multiple transactions can hold shared locks on the same data item at the same time.
• Exclusive lock: If a transaction holds an exclusive lock on a data item, it can both read and write the data item, and no other transaction can hold any lock on that data item at the same time.
Under the two-phase locking (2PL) protocol, each transaction acquires locks in a growing phase and releases them in a shrinking phase; once a transaction has released a lock, it cannot acquire any new lock. The moment at which a transaction acquires its last lock is called its lock point. Transactions can be sorted according to their lock points, and this sorting order is the serializability order of the transactions.
The 2PL protocol cannot avoid deadlocks. For example, transactions T1 and T2 in
Fig. 4.5 are two-phased, but a deadlock still occurs.
When the 2PL protocol is adopted, transactions may also read uncommitted data.
In the example depicted in Fig. 4.6, transaction T4 reads uncommitted data A of
transaction T3. If transaction T3 rolls back, a cascading rollback is triggered.
Cascading rollbacks can be avoided by using strict 2PL and strong 2PL proto-
cols. The strict 2PL protocol requires that exclusive locks held by a transaction be
released only after the transaction is committed, whereas the strong 2PL protocol
requires that no locks be released before the transaction is committed. This way,
uncommitted data will not be read.
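A minimal lock-table sketch for strict/strong 2PL is shown below. It is illustrative only: shared locks are compatible with one another, exclusive locks are not, and locks are released only at commit or abort; waiting and deadlock detection are left out.

# Minimal strict-2PL lock table sketch (illustrative).
class LockTable:
    def __init__(self):
        self.locks = {}   # item -> {"mode": "S" or "X", "holders": set of txn ids}

    def try_lock(self, txn, item, mode):
        """Return True if the lock is granted; False means the caller must wait."""
        entry = self.locks.setdefault(item, {"mode": None, "holders": set()})
        if not entry["holders"] or entry["holders"] == {txn}:
            entry["holders"] = {txn}
            # Keep an exclusive lock if one is requested or already held (upgrade).
            entry["mode"] = "X" if mode == "X" or entry["mode"] == "X" else "S"
            return True
        if mode == "S" and entry["mode"] == "S":
            entry["holders"].add(txn)          # shared locks are compatible
            return True
        return False                            # conflict: wait (or detect a deadlock)

    def release_all(self, txn):
        """Called only at commit or abort, as strict/strong 2PL requires."""
        for item in list(self.locks):
            entry = self.locks[item]
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[item]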
Aside from the foregoing lock-based concurrency control methods, another way to ensure the serializability of transactions is timestamp-based concurrency control, which fixes the serialization order of transactions in advance. Each transaction Ti in the system is assigned a unique number, called its timestamp and denoted by TS(Ti). The system assigns timestamps in ascending order before transaction Ti is executed. Two methods can be used to generate timestamps:
• System clock: The timestamp of a transaction is the clock value when the trans-
action enters the system.
• Logical counter: The counter is incremented by 1 each time a transaction starts,
and the value of the counter is assigned to the transaction as its timestamp.
To use timestamps to ensure the serial scheduling of transactions, each data item
Q must be associated with two timestamps and an additional bit.
WT(Q): The maximum timestamp of all transactions that successfully executed a
Write(Q) operation.
RT(Q): The maximum timestamp of all transactions that successfully executed a
Read(Q) operation.
C(Q): The commit bit of Q, which is set to True only when the most recent transac-
tion that wrote data item Q has been committed. This bit is used to prevent
dirty reads.
The Timestamp Ordering Protocol can ensure that any conflicting read and write
operations are executed in order based on their timestamps. The rules are as follows:
1. Assume that transaction Ti performs a Read(Q) operation:
(a) If TS(Ti) < WT(Q), the read operation cannot be completed, and Ti is
rolled back.
(b) If TS(Ti) ≥ WT(Q), the read operation can be executed.
If C(Q) is true, the request is executed, and RT(Q) is set to the greater of TS(Ti) and RT(Q).
If C(Q) is false, the system waits until the transaction that last wrote Q commits or aborts.
2. Assume that transaction Ti performs a Write(Q) operation:
(a) If TS(Ti) < RT(Q), the value that transaction Ti attempts to write is no longer
needed, and Ti is rolled back.
(b) If TS(Ti) < WT(Q), the value that transaction Ti attempts to write is outdated,
and Ti is rolled back.
(c) If TS(Ti) ≥ RT(Q) and TS(Ti) ≥ WT(Q), the system performs the Write(Q) operation and sets WT(Q) to TS(Ti) and C(Q) to False.
When transaction Ti issues a commit request, C(Q) is set to True, and transactions waiting for data item Q to be committed can proceed.
Under the preceding rules, a rolled-back transaction is restarted and assigned a new timestamp for its read and write operations.
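The read and write rules above can be condensed into a short sketch. The following is an illustrative Python rendering with invented names; "abort" means the transaction must be rolled back and restarted with a new timestamp, and "wait" corresponds to waiting for the commit bit C(Q).

# Sketch of the Timestamp Ordering Protocol rules (illustrative only).
class Item:
    """A data item Q with its value, RT(Q), WT(Q), and commit bit C(Q)."""
    def __init__(self, value):
        self.value, self.rt, self.wt, self.committed = value, 0, 0, True

def read(ts, item):
    """Rule 1: returns ('ok', value), ('wait', None), or ('abort', None)."""
    if ts < item.wt:                   # 1(a): the item was overwritten by a younger writer
        return "abort", None
    if not item.committed:             # 1(b): wait for the writer of this version
        return "wait", None
    item.rt = max(item.rt, ts)         # 1(b): record the read
    return "ok", item.value

def write(ts, item, value):
    """Rule 2 (basic protocol, without the Thomas write rule)."""
    if ts < item.rt or ts < item.wt:   # 2(a), 2(b): the write arrives too late
        return "abort"
    item.value, item.wt, item.committed = value, ts, False   # 2(c)
    return "ok"

def commit(item):
    """Set the commit bit C(Q) so that waiting readers can proceed."""
    item.committed = True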
Now consider another case, assuming that TS(Ti) < TS(Tj). Suppose transaction Ti successfully performs a Read(Q) operation and transaction Tj then completes a Write(Q) operation. When transaction Ti subsequently attempts to execute its own Write(Q) operation, TS(Ti) < WT(Q), so the rules of the Timestamp Ordering Protocol require the write to be rejected and transaction Ti to be rolled back as an outdated write. However, the rollback is unnecessary in this case: a newer value has already been written, so Ti's write could simply be ignored. Therefore, the Timestamp Ordering Protocol can be modified so that a write operation is skipped if a later write has already been performed. This modification is called the Thomas write rule.
Assume that transaction T issues a Write(Q) request. The basic principles of the
Thomas write rule are as follows:
• When TS(T) < RT(Q), the Write(Q) operation will be rejected and transaction T
will be rolled back.
• When TS(T) < WT(Q), the data item Q that transaction T wants to write is out-
dated, and the Write(Q) operation does not need to be executed.
If neither of the above situations exists, the Write(Q) operation is executed, and WT(Q) is set to TS(T).
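Reusing the Item fields from the previous sketch, the Thomas write rule changes only the write-side check: an outdated write is skipped rather than causing a rollback.

# The write rule modified according to the Thomas write rule (illustrative).
def write_thomas(ts, item, value):
    if ts < item.rt:                  # a younger transaction already read Q: reject
        return "abort"
    if ts < item.wt:                  # a younger write already happened
        return "skip"                 # ignore this outdated write instead of aborting
    item.value, item.wt, item.committed = value, ts, False
    return "ok"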
In the preceding lock-based and timestamp-based concurrency control methods, when a conflict is detected, the transaction waits or rolls back even if the schedule is conflict serializable. These methods can therefore be categorized as pessimistic concurrency control methods. In read-mostly workloads, conflicts between transactions are rare, so the locking and waiting overheads of pessimistic concurrency control are largely wasted. To keep system overheads low in such scenarios, a validation mechanism can be used instead. Unlike the lock-based and timestamp-based methods, the validation mechanism is optimistic in executing transactions, so it is also called optimistic concurrency control.
When a transaction is executed by using a validation mechanism, the transaction
will be executed in three phases:
• Read phase: The transaction reads the required database elements from the database and saves them in its local variables; all modifications are made to these local copies.
• Validation phase: The transaction is validated. If the transaction passes the vali-
dation, the third phase will be executed. Otherwise, the transaction will be
rolled back.
• Write phase: The transaction writes modified elements to the database. Read-
only transactions can ignore this phase.
Each transaction will be executed based on the order of the preceding three
phases. To facilitate validation, the following three timestamps will be used:
• Start(Ti): The time when transaction Ti starts to execute. At this time, the read phase has just begun and validation has not yet taken place.
• Validation(Ti): The time when transaction Ti completes its read phase and starts validation. At this time, the write phase of transaction Ti has not yet been performed.
• Finish(Ti): The time when the write phase of transaction Ti is completed.
Assume that transactions Ti and Tj access the same data and that Tj is validated before Ti, that is, TS(Tj) < TS(Ti). Transaction Ti is considered to have passed validation with respect to Tj when any of the following conditions is met:
• Finish(Tj) < Start(Ti): Transaction Tj completes its write phase before transaction Ti starts. The two transactions do not overlap, so Ti can be validated without further checks.
• Start(Ti) < Finish(Tj) < Validation(Ti): Transaction Tj completes its write phase after Ti starts but before Ti enters its validation phase. In this case, the data set written by Tj must not intersect with the data set read by Ti.
• Validation(Ti) < Finish(Tj): Transaction Tj completes its write phase after Ti enters its validation phase. In this case, the data set written by Tj must not intersect with either the data set read by Ti or the data set written by Ti.
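The three validation conditions can be expressed compactly as follows. This is an illustrative sketch under the assumption that each finished transaction records its start, validation, and finish times together with its read and write sets; it is not the code of any particular system.

# Sketch of optimistic validation (illustrative only).
class TxnInfo:
    def __init__(self, start, validation, finish, read_set, write_set):
        self.start, self.validation, self.finish = start, validation, finish
        self.read_set, self.write_set = set(read_set), set(write_set)

def validate(ti, earlier_txns):
    """Check ti against every transaction tj validated before it."""
    for tj in earlier_txns:
        if tj.finish < ti.start:
            continue                               # condition 1: no overlap at all
        if tj.finish < ti.validation:
            if tj.write_set & ti.read_set:         # condition 2: write-read conflict
                return False
        else:
            # condition 3: tj's writes overlap ti's write phase
            if tj.write_set & (ti.read_set | ti.write_set):
                return False
    return True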
4.2.4 MVCC
4.2.4.1 Multiversion 2PL
Multiversion 2PL aims to combine the advantages of MVCC with lock-based con-
currency control. This protocol distinguishes between read-only transactions and
update transactions.
Update transactions are executed according to the 2PL protocol. In other words,
an update transaction holds all locks until the transaction ends. Therefore, they can
be serialized based on the order in which they are committed. Each version of a data
item has a timestamp. The timestamp is not a real clock-based timestamp but a
counter (called a TS-Counter) that increases when a transaction is committed.
Before a read-only transaction is executed, the database system reads the current
value of the TS-Counter and uses the value as the timestamp of the transaction.
Read-only transactions adhere to the Multiversion Timestamp Ordering Protocol
when performing read operations. Therefore, when read-only transaction T1 issues
a Read(Q) request, the return value is the content of the maximum timestamp ver-
sion that is less than TS(T1).
When an update transaction reads a data item, it first obtains a shared lock on the
data item and then reads the latest version of the data item. When an update transac-
tion wants to write a data item, it must obtain an exclusive lock on the data item and
then create a new version for the data item. The write operation is performed on the
new version, and the timestamp of the new version is initially set to infinity (∞).
After update transaction T1 completes its task, it is committed in the following
way: T1 first sets the timestamp of each version it created to the value of the
TS-Counter plus 1. Then, T1 increments the TS-Counter by 1. Only one update
transaction can be committed at a time.
This way, only read-only transactions started after T1 increases the TS-Counter
see the values updated by T1. Read-only transactions started before T1 increases the
TS-Counter see the values before T1 makes the updates. In either case, read-only
transactions do not need to wait for locks. The Multiversion 2PL protocol also ensures that schedules are recoverable and cascadeless.
Version deletion is similar to the approach used in the Multiversion Timestamp
Ordering Protocol. Assuming a data item has two versions, Q1 and Q2, and the time-
stamps of both versions are less than or equal to the timestamp of the oldest read-only
transaction in the system, the older version will no longer be used and can be deleted.
• If transaction Ti executes the Write(Q) operation and TS(Ti) < R-TS(Qk), the system rolls back transaction Ti. If TS(Ti) = W-TS(Qk), the system overwrites the content of Qk. Otherwise, the system creates a new version of Q.
According to these rules, a transaction reads the latest version that precedes it. If
a transaction attempts to write to a version that another transaction has already read,
the write operation cannot succeed.
Versions that are no longer needed are deleted based on the following rule:
Assume that a data item has two versions, Qi and Qj, and the W-TS values of both
versions are less than the timestamp of the oldest transaction in the system. In this
case, the older version will no longer be used and can be deleted.
The multiversion timestamp ordering mechanism ensures that read requests
never fail and do not have to wait. However, this mechanism also has some draw-
backs. First, reading a data item requires updating the R-TS field, which results in
two potential disk accesses (instead of one). Second, conflicts between transac-
tions are resolved through rollbacks instead of waiting, which significantly
increases overheads. The Multiversion 2PL Protocol can effectively alleviate
these issues.
This section analyzes the implementation of MVCC in InnoDB, which is the default
storage engine of MySQL. The implementation of MVCC relies on two hidden
fields (DATA_TRX_ID and DATA_ROLL_PTR) added to each table, the snapshot
(also known as a read view) created by transactions during querying and the data
version chain (i.e., the undo logs) of the database.
The InnoDB engine adds three hidden fields to each table to implement data multi-
versioning and clustered indexing. Among these fields, DATA_TRX_ID and
DATA_ROLL_PTR are used for data multiversioning. Figure 4.7 shows the table
structure in InnoDB.
DATA_TRX_ID occupies 6 bytes and records the ID of the transaction that last inserted or updated the record. Deletion is treated as an update in the database, with a deletion flag set at a special location in the row.
DATA_ROLL_PTR occupies 7 bytes and is the rollback pointer that points to the undo log record written in the rollback segment. When a row is updated, the undo log records the content of the row before the update, and the InnoDB engine uses this pointer to find the previous versions of the data. All old versions of a row are organized as a linked list in the undo log.
DB_ROW_ID occupies 6 bytes and is a row ID that increments monotonically as new rows are inserted. InnoDB uses a clustered index, which stores table data in the order of the clustered index key. When a table has no primary key or unique non-null index, the InnoDB engine automatically generates this hidden column and uses it as the clustered index of the table. DB_ROW_ID is not related to MVCC.
The following examples explain the specific operations involving MVCC:
• SELECT operation: InnoDB checks each row based on the following conditions:
(1) InnoDB looks up only data rows whose versions are earlier than or equal to
the current transaction version (i.e., the transaction ID of the row is less than or
equal to the current transaction ID). This ensures that the rows read by the trans-
action either already existed before the transaction started or were inserted or
modified by the transaction itself. (2) The deletion version of the row is either
undefined or greater than the current transaction ID. This ensures that the rows
read by the transaction were not deleted before the transaction started. Only
records that meet the above two conditions can be returned as the query results.
• INSERT operation: InnoDB saves the current transaction ID as the row version
number for each newly inserted row.
• DELETE operation: InnoDB saves the current transaction ID as the row deletion
flag for each deleted row.
• UPDATE operation: InnoDB inserts a new row for the updated row, saves the
current transaction ID as the row version number of the new row, and saves the
current transaction ID as the row deletion flag of the original row.
An undo log is used to record data before the data is modified. Before a row is
modified, its data is first copied to an undo log. When a transaction needs to read a
row and the row is invisible, the transaction can use the rollback pointer to find the
visible version of the row along the version chain in the undo log. When a transac-
tion rolls back, data can be restored by using records in the undo log.
On the one hand, undo logs can be used to construct records during snapshot
reads in MVCC. In MVCC, different transaction versions can have their indepen-
dent snapshot data versions by reading the historical versions of data in an undo log.
On the other hand, undo logs ensure the atomicity and consistency of transactions
during rollback. When a transaction is rolled back, the data can be restored by using
the data in the undo log.
A read view is a snapshot that records the ID array and related information of cur-
rently active transactions in the system. It is used for visibility judgment, that is, to
check whether the current transaction is eligible to access a row. A read view has
multiple variables, including the following:
• trx_ids: This variable stores the list of active transactions, namely, the IDs of the other uncommitted active transactions at the moment the read view was created. For example, if transaction B and transaction C have not been committed or rolled back when transaction A creates a read view, trx_ids records the transaction IDs of B and C. If the transaction ID recorded in a row version appears in trx_ids, that version was created by a transaction that was still active when the read view was created and is therefore invisible; otherwise, it may be visible.
• low_limit_id: The maximum transaction ID +1. The value of this variable is
obtained from the max_trx_id variable of the transaction system. If the transac-
tion ID contained in a record is greater than the value of low_limit_id of the read
view, the record is invisible in the current transaction.
• up_limit_id: The minimum transaction ID in trx_ids. If trx_ids is empty, up_
limit_id is equal to low_limit_id. Although the field name is up_limit_id, the last
active transaction ID in trx_ids is the smallest one because the active transaction
IDs in trx_ids are sorted in descending order. Records with a transaction ID less
than the value of up_limit_id are visible to this view.
• creator_trx_id: The ID of the transaction that created the current read view.
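Putting the four variables together, the visibility test can be sketched as follows. This is an illustrative rendering of the rules described above, not InnoDB source code; trx_id stands for the DATA_TRX_ID stored in the row version being examined.

# Sketch of the read-view visibility check (illustrative only).
class ReadView:
    def __init__(self, creator_trx_id, active_trx_ids, max_trx_id):
        self.creator_trx_id = creator_trx_id
        self.trx_ids = set(active_trx_ids)     # active (uncommitted) txns at creation
        self.low_limit_id = max_trx_id + 1     # first "future" transaction ID
        self.up_limit_id = min(self.trx_ids) if self.trx_ids else self.low_limit_id

    def is_visible(self, trx_id):
        if trx_id == self.creator_trx_id:      # our own changes are visible
            return True
        if trx_id < self.up_limit_id:          # committed before the view was created
            return True
        if trx_id >= self.low_limit_id:        # started after the view was created
            return False
        return trx_id not in self.trx_ids      # concurrent: visible only if committed

view = ReadView(creator_trx_id=50, active_trx_ids=[42, 47], max_trx_id=51)
assert view.is_visible(40) and not view.is_visible(47) and not view.is_visible(60)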
In InnoDB, MVCC takes effect only at the read committed and repeatable read isolation levels. The difference between the two implementations lies in when and how often read views are generated.
The repeatable read isolation level avoids dirty reads and nonrepeatable reads
but has phantom read problems. For this level, MVCC is implemented as follows:
In the current transaction, a read view is generated only for the first ordinary
SELECT query; all subsequent SELECT queries reuse this read view. The trans-
action always uses this read view for snapshot queries until the transaction ends.
This avoids nonrepeatable reads but cannot prevent the phantom read problem,
which can be solved by using gap locks and record locks of the next-key lock
algorithm.
For the read committed isolation level, MVCC is implemented as follows: A new
snapshot is generated for each ordinary SELECT query. Each time a SELECT state-
ment starts, all active transactions in the current system are recopied to the list to
generate a read view. This achieves higher concurrency and avoids dirty reads but
cannot prevent nonrepeatable reads and phantom read problems.
4.3 Logging and Recovery
4.3.1 Basic Concepts
Transaction persistence and fault recovery are essential features of a database man-
agement system. The design and implementation of fault recovery often affect the
architecture and performance of the entire system. In the past decade, ARIES-style
WAL has become the standard for implementing logging and recovery, especially in
disk-based database systems.
As the database state changes, the objects modified, the data written, and the course of each write operation are recorded. This recording process is known as logging. Logging makes the atomicity and persistence of transactions achievable. When the system fails, the modifications of uncommitted transactions can be rolled back by using the old values recorded in the logs, which preserves atomicity, and the effects of committed transactions can be reapplied by replaying the logs. Persistence means that once the logs are written to the disk, the results of transaction execution can be recovered by using the logs on the disk.
Unlike the swapping in and out of dirty pages, writing logs to a disk is a sequential process. Log sequence numbers (LSNs) indicate the relative order of log records generated by different transactions. Sequentially writing logs is faster than randomly writing dirty pages to disk. This is why most transactions are committed after the logs are flushed to disk rather than after the dirty pages are flushed to disk.
In a database storage engine, generated logs are typically written to a log buffer.
The logs of multiple transaction threads are written to the same log buffer and then
flushed to external storage, such as a disk, by I/O threads. The progress of log flush-
ing must be checked before transactions are committed. Only transactions that meet
the WAL conditions can be committed. In most cases, multiple log files exist in the
system. Two mechanisms are available for managing the log files. One is to not
reuse log files, as in PostgreSQL. With this mechanism, log files continuously
increase in quantity and size. The other mechanism is to reuse log files, as in
MySQL. With this mechanism, two or more log files are used alternately. The mech-
anism that does not reuse log files can tolerate long transactions but requires addi-
tional mechanisms for clearing the continuously increasing log files. The mechanism
that reuses log files mandates that the size of the log files be configured based on the
length of the longest transaction. Otherwise, the database system stops providing
services once the log files are used up because it cannot commit transactions.
4.3.2 Logical Logs
Logical logs are quite common, but not all databases support logical logs. A logical
log records database modification operations, which are often a simple variant of
user input. Such operations are not parsed in detail and are only related to the logical
views provided by the database. They are irrelevant to the underlying data organiza-
tion structure of the database.
Taking a relational database as an example, the storage engine may read and
write data in traditional page storage managed by using B+ trees or in compacted
storage managed by using LSM-trees. However, for users, data is always organized
in the form of tables rather than pages or key-value pairs. The logical logs of a rela-
tional database record user operations on tables. These operations may be SQL
statements or simple variants of SQL statements and do not involve the actual form
of data storage.
In general, logical logs are not crucial because many database systems use physi-
cal logs as the primary basis for fault recovery. Besides, not using logical logs does
not affect normal database operation. So, what is the significance of logical logs?
For starters, logical logs are independent of the physical storage and thus more por-
table than physical logs. When data is migrated between database systems that use
different physical log formats, logical logs in a universal format are of vital impor-
tance. Provided that two systems use a uniform format for logical logs, data can be
migrated from one system to another by parsing and replaying the logical logs.
Support for logical logs can greatly simplify the workflow in scenarios such as data
migration, log replication, and coordination of multiple storage engines.
Logical logs come with additional overheads because a transaction must wait for its logical logs, in addition to its physical logs, to persist before it can be committed. Physical logs can be generated and flushed to disk continuously during transaction execution, whereas logical logs are often generated and flushed to disk all at once at transaction commit. Using logical logs can therefore reduce system throughput. Moreover, replaying logical logs is much slower than replaying physical logs, and the parsing cost of the former is higher than that of the latter.
4.3.3 Physical Logs
Physical logs are the foundation of fault recovery and are therefore a must-have
feature for any mature database system. These logs record write operations on data,
and the description of such write operations is often directly related to the way data
is organized on physical storage. By parsing physical logs, the system can learn the actual modifications made to the physical storage, but not the logical operations that produced them. When physical logs are used for recovery, all physical logs must be parsed and replayed to obtain the final state of the database.
Physical logs include the redo log and undo logs. Undo logs record only old values
of database elements. This means that an undo log can only use the old values to over-
write the current values of the database elements to undo the modifications made by a
transaction to the database state. An undo log is commonly used to undo uncommitted
changes that a transaction made before the system crashed. During undo logging, a
transaction cannot be committed before all modifications are written to a disk. As a
result, the transaction must wait for all I/O operations to be completed before it is
committed. Redo logging does not have this problem. The redo log records the new
values of database elements. During recovery based on the redo log, uncommitted
transactions are ignored and the modifications made by committed transactions are
reapplied. Provided that the redo log is persisted to the disk before a transaction is
committed, the modifications of the transaction can be recovered by using the redo log
after the system crashes. Sequentially writing logs naturally incurs lower I/O opera-
tion costs than random writes and reduces the waiting time before transaction commits.
The redo log and undo logs are not mutually exclusive and can be used in com-
bination in some databases. Later, we will discuss how the redo log and undo logs
are used in combination in MySQL.
4.3.4 Recovery Principles
A database may encounter the following types of faults during operation: transaction
errors, process errors, system failures, and media damage. The first two are self-
explanatory. System failures refer to failures of the operating system or hardware, and
media damage refers to irreversible damage to the storage media. These faults must be
properly handled to ensure the correctness of the entire system. As such, the database
system must support two major features: transaction persistence and atomicity.
Transaction persistence ensures that the updates made by a committed transac-
tion still exist after failure recovery. Transaction atomicity means that all modifica-
tions made by an uncommitted transaction are invisible. The sequential access
performance of traditional disks is much better than the random access performance.
Therefore, a log-based fault recovery mechanism is used. In this mechanism, write
operations on the database are sequentially written to log records, and the database
is recovered to a correct state by using the log records after a fault. To ensure that
the latest database state can be obtained from logs during recovery, the logs must be
flushed to a disk before the data content. This action is known as the WAL principle.
A fault recovery process usually includes three phases: analysis, redo, and undo.
The analysis phase includes the following tasks: (1) The scope of redo and undo oper-
ations in subsequent redo and undo phases is confirmed based on the checkpoint and
log information. (2) The dirty page set information recorded in checkpoints is cor-
rected based on logs. (3) The position of the smallest LSN in the checkpoint is deter-
mined and used as the start position of the redo phase. (4) The set of active transactions
(uncommitted transactions) recorded in checkpoints is corrected, where the transac-
tions will be rolled back in the undo phase. In the redo phase, all log records are
redone one by one based on the start position determined in the analysis phase. Note
that the modifications made by uncommitted transactions are also reapplied in this
phase. In the undo phase, uncommitted transactions are rolled back based on undo
logs to revoke the modifications made by these transactions.
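The three recovery phases can be illustrated with a greatly simplified sketch in which a log record is a tuple and pages are a dictionary. This is not ARIES itself (checkpoints, LSNs, and compensation log records are omitted); it only shows the order of the analysis, redo, and undo passes.

# Simplified recovery sketch (illustrative). Log records:
#   ("update", txn, page, old_value, new_value)
#   ("commit", txn)
def recover(log, pages):
    # Analysis: find which transactions committed (the rest will be undone).
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo: repeat history, including the updates of uncommitted transactions.
    for rec in log:
        if rec[0] == "update":
            _, txn, page, _old, new = rec
            pages[page] = new

    # Undo: scan the log backward and restore old values for uncommitted transactions.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, txn, page, old, _new = rec
            pages[page] = old
    return pages

log = [("update", "T1", "A", 10, 20), ("commit", "T1"),
       ("update", "T2", "B", 5, 7)]                      # T2 never committed
assert recover(log, {"A": 10, "B": 5}) == {"A": 20, "B": 5}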
4.3.5 Binlog of MySQL
The binary log (binlog) in MySQL is a type of logical log that records changes made
to data in the MySQL service layer. Binary logging can be enabled at the startup of
the MySQL service by specifying related parameters.
The binlog contains all statements that update data and statements that have the
potential to update data. It also includes the duration of each statement used to
update data. In addition, the binlog contains service layer status information required
for correctly reexecuting statements, error codes, and metadata information required
for maintaining the binlog.
The binlog serves two important purposes. The first one is replication. The bin-
log is usually sent to replica servers during leader-follower replication. Many details
of the format and handling methods of the binlog are designed for this purpose. The
leader sends the update events contained in the binlog to the followers. The follow-
ers store these update events in the relay log, which has the same format as the
binlog. The followers then execute these update events to redo the data modifica-
tions made on the leader. The second purpose is point-in-time data recovery. After backup files are restored, the events recorded in the binlog after the backup was completed are reexecuted, which brings the database up to date from the point of the backup.
As a logical log, the binlog must be consistent with the physical logs. This can be
ensured by using the two-phase commit (2PC) protocol in MySQL. Regular trans-
actions are treated as internal eXtended Architecture (XA) transactions in MySQL
and each is assigned an XID. A transaction is committed in two phases. In the
first phase, the InnoDB engine writes the redo log to a disk, and the transaction
enters the Prepare state. In the second phase, the binlog is written to the disk, and
the transaction enters the Commit state. The binlog of each transaction records an
XID event at the end to indicate whether the transaction is committed. During fault
recovery, content after the last XID event in the binlog must be cleared.
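The interaction between the redo log and the binlog can be sketched conceptually as follows. This is an illustrative model, not MySQL source code; group commit and many details are omitted, and the function names are invented for the example.

# Conceptual sketch of the internal 2PC between the redo log and the binlog.
def flush(log):
    """Stand-in for fsync; a real system forces the log to durable storage."""
    pass

def commit_transaction(xid, redo_records, binlog_events, redo_log, binlog):
    # Phase 1 (prepare): the storage engine hardens its redo log first.
    redo_log.append(("prepare", xid, redo_records))
    flush(redo_log)

    # Phase 2 (commit): the server writes the binlog, ending with an XID event.
    binlog.extend(binlog_events)
    binlog.append(("xid", xid))
    flush(binlog)

    # Finally the engine marks the transaction as committed in the redo log.
    redo_log.append(("commit", xid))

def crash_recovery(redo_log, binlog):
    """Prepared transactions whose XID appears in the binlog are committed;
    the rest are rolled back, keeping the two logs consistent."""
    binlog_xids = {ev[1] for ev in binlog if ev[0] == "xid"}
    decisions = {}
    for rec in redo_log:
        if rec[0] == "prepare":
            decisions[rec[1]] = "commit" if rec[1] in binlog_xids else "rollback"
        elif rec[0] == "commit":
            decisions[rec[1]] = "commit"
    return decisions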
As the default storage engine for MySQL, the InnoDB engine has two essential
logs: undo logs and the redo log. Undo logs are used to ensure atomicity and isola-
tion of transactions, and the redo log is used to ensure persistence of transactions.
Undo logs are essential for transaction isolation. Each time a data record is modi-
fied, an undo log record is generated and subsequently recorded in the system
tablespace by default. However, MySQL 5.6 and later support the use of an inde-
pendent undo log tablespace. Undo logs store old versions of data. When an old
transaction needs to read data, it needs to traverse the version chain in the undo log
to find the records visible to it. This can be time-consuming if the version chain
is long.
The common data modification operations include INSERT, DELETE, and
UPDATE operations. Data inserted by an INSERT operation is visible only to the
current transaction; other transactions cannot find the newly inserted data by using
an index before the current transaction is committed. The generated undo log can
then be deleted after the transaction is committed. For UPDATE and DELETE oper-
ations, multiple versions of data need to be maintained. The undo log records of
these operations in the InnoDB engine are of the Update_Undo type and cannot be
directly deleted.
Writing of redo log files can be triggered by any of the following conditions:
insufficient space of the redo log buffer, transaction commits, back-end threads,
checkpointing, instance shutdown, and binlog switching. The redo log in InnoDB is
written in circular overwrite mode and does not have infinite space. Although a large
redo log space is theoretically available, checkpointing in a timely manner is still
essential for rapid recovery from crashes. The master thread of InnoDB performs
redo log checkpointing roughly every 10 s.
In addition to the regular redo log, InnoDB provides a file log type that allows
you to create specific files and assign specific names to the files to indicate specific
operations. Currently, two operations are supported: undo log tablespace truncate
operation and user tablespace truncate operation. The file logs can ensure the atomi-
city of these operations.
4.4 LSM-Tree Storage Engine
4.4.1 PolarDB X-Engine
1. Hot data tier: The hot data tier resides in memory and consists of the active MemTable and immutable MemTables, together with the row cache and the block cache, two caches that respectively cache disk records and multiversioned data indexes based on the LRU rule.
2. Cold data tier: The cold data tier is a multilevel structure stored on disks. Records
in immutable MemTables flushed from the memory are inserted into the first
level (L0) in the form of data blocks, also known as extents. When L0 is full,
some of the data blocks are moved out and compacted with the data blocks in L1
through an asynchronous compaction operation. Similarly, data blocks in L1 are
eventually compacted into those in L2.
3. Heterogeneous FPGA accelerator: X-Engine can offload the compaction opera-
tions in the cold data tier from the CPU to a dedicated heterogeneous FPGA
accelerator [2] to improve the efficiency of compaction operations, reduce inter-
ference with other computing tasks handled by the CPU, and achieve stable sys-
tem performance and higher average throughput.
X-Engine employs a series of innovative technologies to reduce storage costs
and ensure system performance. Table 4.1 describes the main technological innova-
tions and achievements of X-Engine. X-Engine is mainly optimized to achieve
higher transaction processing performance, reduce data storage costs, improve
query performance, and reduce overheads of backend asynchronous tasks. To
achieve these optimization goals, X-Engine is designed and developed through in-
depth software and hardware collaboration and combines the technical characteris-
tics of modern multicore CPUs, DRAM memory, and heterogeneous FPGA
processors. The specific design, applicable scope, and experimental results of the
technologies will be described in detail in the following sections.
CREATE, READ, UPDATE, and DELETE (CRUD) are the fundamental capabili-
ties required for transaction processing. Record modification operations, such as
CREATE, UPDATE, and DELETE, are performed along a write path, whereas
record query operations, such as READ, are performed along a read path.
1. Write path: As shown in Fig. 4.8, to ensure that stored data survives a DRAM power failure, X-Engine first records every modification to database records in the log on persistent storage media (such as SSDs) and then applies it to the active MemTable in memory.
X-Engine adopts a two-phase mechanism to ensure that the modifications made
by a transaction to records conform to the ACID properties and are visible to and
can be queried by other transactions after the transaction is committed. In the
two-phase mechanism, transactions are completed in two phases: the read/write
phase and the commit phase. After the active MemTable is full, it is converted
into an immutable MemTable, which is then flushed to disk for persistence.
Multiversion active MemTable data structure: MVCC results in many ver-
sions of hotspot records in high-concurrency transaction processing scenarios.
Querying these versions incurs additional overheads. To solve this problem,
X-Engine is designed with a multiversion active MemTable data structure, as
shown in Fig. 4.9. In this structure, the upper layer (the blue part in the figure)
consists of a skiplist in which all records are sorted by primary key values. For a
hot record with multiple versions (such as the record with key = 300 in the fig-
ure), X-Engine adds a dedicated single linked list (the green part in the figure) to
store all its versions, which are sorted by version number. Due to the temporal locality of data access, the latest version (version 99) is most likely to be accessed by queries and is therefore stored at the top, thereby reducing the linked-list scan overheads when querying hotspot records (a sketch of this structure is given after this list).
2. Read path: As shown in Fig. 4.10, a query operation in X-Engine queries data in
the following sequence: active MemTable/immutable MemTable, row cache,
block cache, and disk. As mentioned above, the multiversion skiplist structure
used in the MemTable can reduce the overhead of hotspot record queries. The
row cache and block cache can cache hot data records or record blocks in the
disk. The block cache stores the metadata of user tables, which includes the
bloom filters that can reduce disk accesses, as well as corresponding index blocks.
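As referenced in the write-path description above, the multiversion MemTable can be sketched as an ordered key index in which each key points to a newest-first list of versions. This is an illustrative sketch, not X-Engine code; a real implementation uses a lock-free skiplist.

# Sketch of a multiversion MemTable in the spirit of Fig. 4.9 (illustrative).
import bisect

class MultiVersionMemTable:
    def __init__(self):
        self.keys = []        # sorted keys (stands in for the skiplist; used for range scans)
        self.versions = {}    # key -> [(version, value), ...], newest first

    def put(self, key, version, value):
        if key not in self.versions:
            bisect.insort(self.keys, key)
            self.versions[key] = []
        self.versions[key].insert(0, (version, value))   # newest version at the head

    def get(self, key, read_version):
        """Return the newest value whose version is visible to read_version."""
        for version, value in self.versions.get(key, []):
            if version <= read_version:
                return value
        return None

mt = MultiVersionMemTable()
mt.put(300, 98, "old"); mt.put(300, 99, "new")
assert mt.get(300, 99) == "new" and mt.get(300, 98) == "old"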
and calculation operations are completed in the read/write phase, and then the
required modifications are temporarily stored in the transaction buffer.
In the commit phase, multiple writer threads write the content of the transaction buffer to lock-free task queues. The tasks in these queues are then consumed by a multistage pipeline that processes each write in the following stages: log buffering, log flushing, MemTable writing, and commit.
This two-phase design decouples the front-end and back-end threads. After completing the read/write phase of a transaction, a front-end thread can immediately proceed to the next transaction, while the back-end writer threads complete the log and MemTable writes. The front end and the back end exchange data through the transaction buffer, so different transactions are executed in parallel on different data. This also improves the instruction cache hit rate of each thread, ultimately improving the system throughput. In the commit phase, each task queue is handled by one back-end thread, and the number of task queues is limited by hardware conditions, such as the available I/O bandwidth in the system.
In the four-stage transaction pipeline, the granularity of parallelism for each
stage is optimized based on the characteristics of the stage. Log buffering (which
collects relevant logs for all write contents in a task queue) at the first stage and log
flushing at the second stage are sequentially executed by a single thread because
data dependencies exist between these two stages. MemTable writing at the third
stage is completed concurrently by multiple threads. At the fourth stage, the transac-
tion is committed, and related resources (such as the acquired locks and memory
space) are released to make all modifications visible. This stage is executed in paral-
lel by multiple threads. All writer threads obtain the required tasks from any stage
in active pull mode. This design allows X-Engine to allocate more threads to handle
memory accesses with high bandwidth and low latency while using fewer threads to
handle disk writes with relatively low bandwidth and high latency, thereby improv-
ing the utilization of hardware resources.
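As a rough illustration of the decoupling described above, the following sketch runs the four commit stages over a batch of buffered transactions: log buffering and log flushing execute serially, while MemTable writing and the final commit are fanned out to a thread pool. The stage functions, task fields, and the log file path are assumptions made for the example only.

from concurrent.futures import ThreadPoolExecutor

def buffer_logs(tasks):
    # Stage 1 (serial): collect the log content of all tasks in the batch.
    return b"".join(t["log"] for t in tasks)

def flush_logs(log_batch, path="/tmp/demo_wal.log"):
    # Stage 2 (serial): persist the batched logs with one sequential write.
    with open(path, "ab") as f:
        f.write(log_batch)

def write_memtable(task, memtable):
    # Stage 3 (parallel): apply the buffered modification to the MemTable.
    memtable[task["key"]] = task["value"]

def commit(task):
    # Stage 4 (parallel): release resources and make the change visible.
    task["committed"] = True

def run_commit_pipeline(tasks, memtable):
    flush_logs(buffer_logs(tasks))              # stages 1-2: one thread
    with ThreadPoolExecutor() as pool:          # stages 3-4: many threads
        list(pool.map(lambda t: write_memtable(t, memtable), tasks))
        list(pool.map(commit, tasks))

memtable = {}
batch = [{"key": i, "value": f"v{i}", "log": f"put {i}\n".encode()}
         for i in range(4)]
run_commit_pipeline(batch, memtable)
print(memtable, all(t["committed"] for t in batch))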
4.4.3.1 Background
The back-end compaction threads in X-Engine merge memory data and disk data.
When the amount of data at each level reaches the specified threshold, the data is
merged with the data of the next level. This operation is called compaction. Timely
compactions are essential for LSM-trees. Under continuous high-intensity write
pressure, an LSM-tree deforms as the data at L0 accumulates. This severely affects read operations because these operations need to scan all layers of data and return a combined result due to the existence of multiple data versions. Compaction of data
at L0 and data of multiple versions helps maintain a healthy read path length, which
is crucial for storage space release and system performance.
Figure 4.12 shows the compaction execution time under different value lengths.
When the value length is less than or equal to 64 bytes, the CPU time accounts for
over 50%. This is because the rapid improvement in the read/write performance of storage devices in recent years has made CPUs the performance bottleneck of traditionally I/O-intensive operations.
4.4.3.3 Compaction Scheduler
tion data, only about 0.03% of compaction tasks will be reexecuted by the CPU
mainly because the samples have excessively long KV lengths.
• Distribution thread distributes the compaction tasks to CUs for execution. The
FPGA accelerator has multiple CUs. Therefore, corresponding distribution algo-
rithms must be designed. Currently, a simple round-robin distribution strategy is adopted (see the sketch after this list). Because all compaction tasks are similar in size, experimental results show that the CUs achieve balanced utilization.
• Drive thread transfers data to the FPGA accelerator and instructs the CUs to start working. When the CUs complete the tasks, the drive thread transfers the results back to memory and puts the finished compaction tasks into the result queue.
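The sketch below illustrates the round-robin distribution and the result queue mentioned in the list above. The compute units are mocked by a plain function, and all names are illustrative.

from itertools import cycle
from queue import Queue

def distribute(tasks, num_cus):
    # Round-robin assignment: task i goes to CU (i mod num_cus).
    return list(zip(cycle(range(num_cus)), tasks))

def drive(assignments, result_queue):
    # The drive thread would copy inputs to the FPGA, wait for the CU to
    # finish, copy the merged output back, and enqueue the completed task.
    for cu_id, task in assignments:
        result_queue.put({"cu": cu_id, "task": task, "status": "done"})

results = Queue()
drive(distribute([f"compaction-{i}" for i in range(6)], num_cus=4), results)
while not results.empty():
    print(results.get())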
4.4.3.4 CUs
Figure 4.15 shows the logical implementation of CUs on an FPGA. Multiple CUs
can be deployed on an FPGA accelerator, which are scheduled by the driver. A CU
consists of a decoder, a key-value ring buffer, a key-value transferrer, a key buffer, a
merger, an encoder, and a controller.
The key-value ring buffer consists of 32 8-KB slots. Each slot allocates 6 KB
for storing key-value pairs and the remaining 2 KB for storing metadata of the
key-value pairs (such as the key-value pair length). Each key-value ring buffer has
three states, FLAG_EMPTY, FLAG_HALF_FULL, and FLAG_FULL, which,
respectively, indicate that the key-value ring buffer is empty, half full, and full.
Whether to carry forward the pipeline or to pause the decoding and wait for down-
stream consumption is determined based on the number of cached key-value
pairs. The key-value transferrer and key-value buffer are responsible for key-value
pair transmission. Value comparison is not required in merge-sorting; only the
keys need to be cached. The merger is responsible for merging the keys in the key
buffer. In Fig. 4.15, Way 1 has the smallest key. The controller instructs the key-
value transferrer to transfer the corresponding key-value pair from the key-value
ring buffer to the key-value output buffer (which has a structure similar to that of the
key-value ring buffer) and moves the read pointer of the key-value ring buffer to
the next key-value pair. The controller then instructs the merger to perform the
next round of compaction. The encoder performs prefix encoding on the key-
value pairs output by the merger and writes the encoded key-value pairs in the
format of data blocks to the FPGA memory.
To control the processing speed at each stage, a controller module is introduced.
The controller module maintains the read and write pointers of the key-value ring
buffers, detects the difference in the processing speed between the upstream and
downstream of the pipeline based on the states of the key-value ring buffers, and
maintains efficient operation of the pipeline by pausing or restarting corresponding
modules.
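A software analogue of the merge step is sketched below: the merger repeatedly selects the input way whose current key is smallest, the transferrer copies that key-value pair to the output, and the read pointer of that way advances. Ring-buffer flow control and prefix encoding are omitted, and the interfaces are illustrative.

import heapq

def merge_ways(ways):
    """ways: a list of key-sorted sequences of (key, value) pairs."""
    iters = [iter(w) for w in ways]
    heap, output = [], []
    for way_id, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], way_id, first[1]))
    while heap:
        key, way_id, value = heapq.heappop(heap)  # merger: pick the smallest key
        output.append((key, value))               # transferrer: move the KV pair
        nxt = next(iters[way_id], None)           # advance this way's read pointer
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], way_id, nxt[1]))
    return output

print(merge_ways([[(1, "a"), (4, "d")], [(2, "b"), (3, "c")]]))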
X-Engine adopts an optimized tiered storage structure for storing data in the cold
data layer, which significantly reduces the storage costs while ensuring the query
performance. This section describes the design and optimization of X-Engine in
terms of data flush, compactions, and data placement.
4.4.4.1 Flush Optimization
The flush operation in X-Engine converts the data in the immutable MemTable
in the memory into data blocks and stores the data blocks to disk for persistent
storage. Flush operations are crucial for the stability, performance, and space
efficiency of the storage engine. First, the flush operations move data out of the
memory, thereby freeing up memory space for new data or caches. If flush oper-
ations are not performed in a timely manner and new data continues to be writ-
ten to the memory, the memory usage keeps increasing until the system can no
longer accommodate new data, resulting in database unavailability risks.
Figure 4.16 demonstrates the flush operation and multiple compaction task
types in X-Engine.
Second, trade-offs must be made to achieve a balance between flush overheads
and query overheads. To ensure that data on the disk is always sorted by primary
key values and data blocks at the same level do not overlap in terms of primary
key ranges, it is crucial to ensure that a primary key value exists only in one data
block within any data range. This way, a point query needs to read at most one
data block at each level, minimizing the number of data blocks that need to be
read in range queries. However, to sort data by primary key values, each flush task
must merge the immutable MemTable data moved from memory with the data
blocks on disk whose primary key range overlaps with that of the immutable
MemTable data. This process consumes a significant amount of CPU and I/O
resources. In addition, repeated compactions result in I/O write amplification,
exacerbating I/O consumption. This results in high flush overheads, long process-
ing time, and excessive resource consumption, thus affecting the stability of data-
base performance. If lower requirements are imposed on data sorting by primary
key values, the flush overheads will be reduced, but the query overheads will
increase. Therefore, X-Engine is optimized to achieve a balance between flush
overheads and query overheads.
Figure 4.16 shows that after converting the data in the immutable MemTable into
data blocks, the flush operation in X-Engine directly appends the data blocks to L0
on disk without merging the data blocks with other data at L0. This significantly
reduces flush overheads. However, this causes overlapping primary key ranges of
the data blocks at L0. As a result, a record within a primary key range may exist in
multiple data blocks, which increases query overheads. To mitigate the impact on
the query performance, X-Engine controls the total data size of L0 within an
extremely small range (about 1% of the total data size on disk). Primary keys of
common transactional data, such as order numbers, serial numbers, and timestamps,
in OLTP databases usually increase monotonically. If no update operations exist in the workload, primary keys of newly inserted data do not overlap with those of existing
data. In this case, the flush design of X-Engine does not increase query overheads.
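The append-style flush and the L0 size check described above can be summarized with the following sketch. The 1% threshold comes from the text; the block size and data structures are assumptions for illustration.

L0_FRACTION = 0.01  # keep L0 at roughly 1% of the on-disk data

def flush(immutable_memtable, levels, block_size=4):
    records = sorted(immutable_memtable.items())        # sorted by primary key
    blocks = [records[i:i + block_size]
              for i in range(0, len(records), block_size)]
    levels["L0"].extend(blocks)                          # append only, no merge

def l0_compaction_needed(levels):
    total = sum(len(b) for level in levels.values() for b in level)
    l0 = sum(len(b) for b in levels["L0"])
    return total > 0 and l0 > L0_FRACTION * total

levels = {"L0": [], "L1": [], "L2": []}
flush({i: f"v{i}" for i in range(10)}, levels)
print(len(levels["L0"]), l0_compaction_needed(levels))   # 3 blocks; True while L1/L2 are empty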
The location of a target record for a query in the tiered storage structure directly
affects the overhead of the query.
Given the storage separation of hot and cold data and the storage of the latter in
a tiered storage structure, a data record can reside in different memory segments
(such as the active MemTable, immutable MemTable, and cache) and at different
levels on disk. The query overhead varies based on the location of the target data
record. For instance, hot data is placed at L0 and L1 on a disk. This shortens the
query paths for such data, mitigating read amplification during access to L2 and
reducing query latency. This section covers the design and optimization of X-Engine in terms of data placement.
of data blocks at the corresponding level reaches the specified threshold. Such
compaction operations are intended to limit the data volume at each level of the
tiered storage structure within an expected range, thereby reducing read ampli-
fication during queries, write amplification during compactions, and space
amplification caused by inter-level primary key range overlap. A delete-trig-
gered compaction is triggered when the number of deletion flags in the MemTable
reaches the specified threshold. X-Engine processes all write operations in
append-only mode, and delete operations are implemented by inserting deletion
flags for target records in MemTables. When the number of deletion flags
reaches the specified threshold, a considerable amount of logically deleted data
is present; the data must be cleared. In this case, X-Engine triggers a compac-
tion task tailored to clear such records. A fragment compaction is triggered
based on the space fragmentation status at each level. Fragments may be caused
by data block reuse, disk space allocation, or data deletion. An excessive number of fragments results in increased query overheads and reduced space efficiency. Therefore, fragments must be cleared in a timely manner. Manual compactions provide a necessary maintenance means for database administrators, who can execute specific instructions to trigger cor-
responding compactions based on the current status of the storage engine and
database requirements. With these compaction strategies, X-Engine can sched-
ule asynchronous compaction tasks to achieve a balance among data write per-
formance, query performance, and storage overheads. X-Engine also provides
related parameters to achieve specific optimization goals (e.g., maximizing
query performance or minimizing space usage), thus ensuring that the perfor-
mance and storage overheads of the storage engine remain within the
expected ranges.
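The following sketch summarizes how the compaction types above could be triggered. The thresholds and the statistics object are hypothetical; the actual engine tracks such statistics per level and exposes tuning parameters for them.

def pick_compactions(stats, level_size_limits, delete_flag_limit,
                     fragmentation_limit, manual_requests):
    tasks = []
    for level, size in stats["level_sizes"].items():
        if size > level_size_limits[level]:
            tasks.append(("size-triggered", level))          # keep levels bounded
    if stats["delete_flags"] > delete_flag_limit:
        tasks.append(("delete-triggered", "purge deleted records"))
    for level, frag in stats["fragmentation"].items():
        if frag > fragmentation_limit:
            tasks.append(("fragment", level))                # reclaim space
    tasks.extend(("manual", req) for req in manual_requests)  # DBA-issued
    return tasks

stats = {"level_sizes": {"L0": 120, "L1": 900, "L2": 5000},
         "delete_flags": 40000, "fragmentation": {"L1": 0.35, "L2": 0.05}}
print(pick_compactions(stats,
                       level_size_limits={"L0": 100, "L1": 1000, "L2": 10000},
                       delete_flag_limit=10000, fragmentation_limit=0.3,
                       manual_requests=["compact table orders"]))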
X-Engine accurately separates hot data and cold data by analyzing the access char-
acteristics of workloads and implements automatic archiving of cold data by using
a hybrid storage architecture to provide users with an optimal price-performance
ratio. OLTP businesses are sensitive to access latency. Therefore, cloud service pro-
viders typically use local SSDs or enhanced SSDs (ESSDs) as storage media. In
practice, most data generated by flow-type businesses, such as transaction logistics
and instant messaging, is accessed less frequently over time or may never be
accessed again. If such data is stored in the same way as hot data, in high-speed storage media such as Non-Volatile Memory Express (NVMe) devices and SSDs, the overall price-performance ratio is significantly reduced.
For businesses that support separation of cold data and hot data, X-Engine sup-
ports automatic archiving of cold data by analyzing log information. It is the first
storage engine in the industry that supports automatic archiving of row-level data
[3]. The hybrid storage edition of X-Engine supports multiple types of hybrid stor-
age media, as shown in Fig. 4.18. ESSDs or local SSDs are recommended for L0
and L1 to ensure the access performance of hot data. High-efficiency cloud disks or
local HDDs are recommended for L2. Archiving of cold data significantly reduces
the storage costs.
X-Engine employs a unique method for predicting data archiving that differs
from traditional cache replacement policies like LRU. X-Engine uses a longer time
window and takes into account a wider range of features to predict when data should
be archived:
• X-Engine aggregates access frequency over a specific time window, providing
insights into changes in data popularity.
• By analyzing semantic information from SQL logs, X-Engine can accurately
predict the lifecycle of a data record. For instance, in e-commerce transactions,
patterns of order table accesses reflect user shopping behavior. For virtual orders,
such as top-up orders, a record may no longer be accessed after it is created.
However, orders for physical products may involve logistics tracking, delivery,
delivery signature, or even after-sales services such as returns, resulting in a
complex distribution of the data lifecycle. Moreover, rules for shipment and
package receipt may be adjusted during major promotion events, such as Double
11 and Double 12. This may cause changes to the data lifecycle for the same workload, making it hard to distinguish between hot and cold data based on simple
rules. However, for the same business, the lifecycle of a record in the database
can be learned based on the updates and reads of the record. Therefore, fields
accessed by SQL statements may be encoded to accurately depict the lifecycle of
a record.
• Additionally, timestamp-related features, such as insertion time and last update
time, provide insights into the data lifecycle.
X-Engine uses different feature combinations for different businesses by lever-
aging machine learning technologies, ultimately achieving a cold data recall and
precision of over 90% for these businesses. Cold data compactions are triggered during off-peak hours to compact predicted cold data to cold levels on a daily basis, thereby minimizing the impact of cold data migration on normal businesses.
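A highly simplified sketch of the feature families listed above is shown below: windowed access counts, a crude encoding of which fields recent SQL statements touched, and timestamp-derived ages. The field names and the linear scoring rule are illustrative assumptions; the production system trains a classifier on such features rather than using fixed weights.

import time

def extract_features(record_stats, now=None):
    now = now or time.time()
    accesses_7d = sum(record_stats["access_counts_per_day"][-7:])
    accesses_30d = sum(record_stats["access_counts_per_day"][-30:])
    touched_fields = record_stats["recent_sql_fields"]
    lifecycle_done = 1.0 if "delivery_signed_at" in touched_fields else 0.0
    age_days = (now - record_stats["insert_ts"]) / 86400
    idle_days = (now - record_stats["last_update_ts"]) / 86400
    return [accesses_7d, accesses_30d, lifecycle_done, age_days, idle_days]

def cold_score(features, weights=(-0.5, -0.1, 1.0, 0.02, 0.05)):
    # A toy linear score stands in for a trained classifier.
    return sum(w * f for w, f in zip(weights, features))

stats = {"access_counts_per_day": [3, 1, 0, 0, 0, 0, 0],
         "recent_sql_fields": {"order_id", "delivery_signed_at"},
         "insert_ts": time.time() - 90 * 86400,
         "last_update_ts": time.time() - 60 * 86400}
print(cold_score(extract_features(stats)) > 2.0)  # True -> archive candidate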
PolarDB supports dual engines online, with InnoDB handling the hybrid read/write
requirements of online transactions and X-Engine handling requests to read/write
less frequently accessed archived data. Figure 4.19 shows the dual-engine architec-
ture of PolarDB for MySQL.
The first version of PolarDB was designed based on InnoDB. This version imple-
mented physical replication by using InnoDB and supported the one-writer, multi-
reader architecture, which was technologically challenging. Nonetheless, it is more
challenging to integrate X-Engine into PolarDB to support the one-writer, multi-
reader architecture based on dual engines because X-Engine is a complete, indepen-
dent transaction engine with its own redo log, disk data management, cache
management, and MVCC modules. Through remarkable innovative efforts, the
PolarDB team ushered PolarDB into the dual-engine era by introducing the follow-
ing engineering advances:
• The WAL stream of X-Engine is integrated with the redo log stream of InnoDB
without modifying the control logic and the interaction logic of shared storage.
This way, one log stream and one transmission path are sufficient for X-Engine
and InnoDB. In addition, this architecture can be reused when other engines are
introduced.
• The I/O module of X-Engine is interconnected with the user-mode file process-
ing system (FPS) of InnoDB. This allows InnoDB and X-Engine to share the
same distributed block device and implement fast backup based on the underly-
ing distributed storage.
• X-Engine implements physical replication based on WAL and provides the WAL
replay mechanism. This ensures millisecond-level replication latency between read/
write nodes and read-only nodes and supports consistency reads on read-only nodes.
The introduction of X-Engine into PolarDB involves considerable engineering
modifications, such as the modification of X-Engine to support the one-writer, mul-
tireader architecture and the rectification of issues related to DDL operations on
large tables in history databases. In addition to online DDL, X-Engine also supports
parallel DDL to accelerate DDL operations that involve table replication. The dual-
engine architecture of PolarDB implements the one-writer, multi-reader architec-
ture based on two engines with one set of code. This ensures the simplicity of the
product architecture and provides consistent user experience.
4.4.6 Experimental Evaluation
This section compares the space efficiency of X-Engine with the following related
products in the cloud computing market: InnoDB, which is the default storage
engine for MySQL databases, and TokuDB, which is a storage engine product with
high space efficiency that is used by many space-sensitive customers on pub-
lic clouds.
Figure 4.20 shows the disk space usage of InnoDB and X-Engine. Both storage
engines are tested with the default settings and the default table structure of the
Sysbench benchmark. Each table contains ten million records, and the total number
of tables gradually increases from 32 to 736. The test result shows that as the amount
of data increases, the space occupied by X-Engine slowly increases, thereby saving
more space. The maximum space occupied by X-Engine is only 58% of that occu-
pied by InnoDB. For scenarios with longer single-record lengths, X-Engine is more
efficient in terms of storage space usage. For example, after a Taobao image space database is migrated from InnoDB to X-Engine, the required storage space is only 14% of that required in InnoDB.
Data compression is not enabled for InnoDB in most business scenarios. If com-
pression is enabled for InnoDB, the storage space required is 67% of that before
compression. Moreover, the query performance sharply deteriorates, seriously
affecting the business. Taking primary key updates as an example, the throughput is
only 10% of that before compression. Compared with InnoDB, whose performance
seriously deteriorates after compression is enabled, X-Engine is a high-performance
and cost-effective storage engine with an excellent balance between space compres-
sion and performance.
Unlike X-Engine, InnoDB does not have a tiered storage structure and uses a
single storage mode to store all table data in the database. In this mode, data is
stored in the form of pages in a B+ tree structure. Moreover, data is not stored by using
different modes based on the locality characteristic and frequency of data access,
and data in a user table (e.g., seldom accessed cold data) cannot be selectively com-
pressed in depth. In addition, X-Engine performs prefix encoding for data blocks,
which logically reduces the amount of data to be stored and implements data storage
in a compact format. This reduces space fragments and improves the compression
ratio. Therefore, X-Engine has higher space efficiency than InnoDB.
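The prefix encoding mentioned above can be sketched as follows: within a data block, each key stores only the length of the prefix it shares with the previous key plus its remaining suffix, which is effective for keys such as monotonically increasing order numbers. The functions are illustrative.

def prefix_encode(keys):
    encoded, prev = [], b""
    for key in keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))   # (shared prefix length, suffix)
        prev = key
    return encoded

def prefix_decode(encoded):
    keys, prev = [], b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

block = [b"order:20230101:0001", b"order:20230101:0002", b"order:20230102:0001"]
enc = prefix_encode(block)
assert prefix_decode(enc) == block
print(enc)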
Fig. 4.20 Comparison of disk space usage between InnoDB and X-Engine
Figure 4.21 shows the disk space usage of TokuDB and X-Engine. TokuDB used to
provide storage services at low overheads, but its developer, Percona, discontinued
its maintenance. The results revealed that X-Engine has lower storage overheads
than TokuDB. Therefore, Alibaba Cloud recommends that you migrate your data in
TokuDB-based databases to X-Engine-based databases.
TokuDB uses the fractal tree structure, whose leaf nodes and corresponding data blocks are filled with data more densely than those of the B+ tree structure used by the InnoDB engine, so the former can achieve a higher compression ratio than the latter. However, TokuDB
lacks the tiered storage design of X-Engine, and X-Engine also has data blocks that
are filled with records. Hence, combined with other space optimizations, X-Engine
can achieve lower storage overheads than TokuDB.
4.4.6.4 Comparison of Performance
X-Engine can reduce the storage space occupied by cold data without compromis-
ing the hot data query performance, consequently reducing the total storage costs,
specifically the following:
• X-Engine uses a tiered storage structure to store hot and cold data at different
levels. By default, the level where cold data is stored is compressed.
• Techniques such as prefix encoding are implemented for each record to reduce
storage overheads.
• Tiered data access is implemented based on the omnipresent locality characteris-
tic and data access skewness (where the volume of hot data is usually far less
than that of cold data) in actual business scenarios.
Figure 4.22 shows the performance comparison between X-Engine and InnoDB
in terms of processing point queries on skewed data. The test used the Zipf distribu-
tion to control the degree of data access skewness. When the skewness (Zipf factor)
is high, more point queries hit the hot data in the cache rather than the cold data on
the disk, resulting in lower access latency and higher overall query performance. In
this case, compressing cold data has minimal impact on the query performance.
In summary, the tiered storage structure and tiered access mode of X-Engine enable most SQL queries on hot data to ignore cold data. As a result, X-Engine achieves a QPS 2.7 times higher than that obtained when all data is accessed uniformly.
If a large amount of inventory data (especially archived data and historical data) is stored in X-Engine, X-Engine demonstrates slightly lower performance (QPS or TPS) than InnoDB when querying inventory data. Figure 4.23 shows the performance comparison between InnoDB and X-Engine in various scenarios. The comparison reveals that X-Engine and InnoDB have almost the same performance.
In most OLTP workloads, updates and point queries are frequently executed. X-Engine and InnoDB are basically on par with each other in these two aspects.
Given its tiered storage structure, X-Engine needs to scan or access multiple
levels when performing a range scan or checking whether a record is unique.
Therefore, X-Engine has a slightly inferior performance compared with InnoDB in
terms of performing range queries and inserting new records.
In hybrid scenarios, X-Engine and InnoDB exhibit almost the same
performance.
References
1. Huang G, Cheng XT, Wang JY, et al. X-Engine: an optimized storage engine for large-scale
e-commerce transaction processing. In: Proceedings of the 2019 International Conference on
Management of Data (SIGMOD’19). New York: Association for Computing Machinery; 2019.
p. 651–65. https://doi.org/10.1145/3299869.3314041.
2. Zhang T, Wang JY, Cheng XT, et al. FPGA-accelerated compactions for LSM-based key-value store. In: 18th USENIX Conference on File and Storage Technologies (FAST 20); 2020.
3. Yang L, Wu H, Zhang TY, et al. Leaper: a learned prefetcher for cache invalidation in LSM-tree
based storage engines. Proc VLDB Endow. 2020;13(11):1976–89.
Chapter 5
High-Availability Shared Storage System
High availability is one of the factors that must be considered in the design of dis-
tributed systems. This chapter introduces consensus algorithms for distributed sys-
tems and compares the methods used by MySQL and PolarDB to achieve high
availability. This chapter also discusses the implementation of shared storage archi-
tectures like Aurora and PolarFS and presents some of the ongoing optimization
work concerning the file system in PolarDB.
In a distributed system, multiple nodes communicate and coordinate with each other
through message passing, which inevitably involves issues such as node failures,
communication abnormalities, and network partitions. Consensus protocols ensure
that in a distributed system in which these exceptions may occur, multiple nodes
reach an agreement on a specific value.
In the field of distributed systems, the CAP (consistency, availability, and parti-
tion tolerance) theorem states that any network-based data sharing system can sat-
isfy at most two of the following three characteristics: consistency, availability, and
partition tolerance.
Network partitioning inevitably occurs in a distributed system, thereby necessi-
tating the satisfaction of the partition tolerance characteristic. Therefore, trade-offs
must be made between consistency and availability. In practice, an asynchronous multireplica replication approach is often used, which compromises strong consistency in exchange for enhanced system availability.
From the perspective of the client, if all replicas reach a consistent state immedi-
ately after an update operation is completed and subsequent read operations can
immediately read the most recently updated data, strong consistency is
implemented. If the system does not guarantee that subsequent read operations can
immediately read the most recently updated data after an update operation is com-
pleted, weak consistency is implemented. If no new update operations are per-
formed subsequently, the system guarantees that the most recently updated data can
be read after a specific period of time. This means that eventual consistency is
implemented, which is a special case of weak consistency. Compromised strong
consistency does not mean that consistency is not guaranteed. It means less strict
requirements are imposed in terms of consistency, and an “inconsistency window”
is allowed. This way, consistency within a time range acceptable to the user can be
achieved, thereby ensuring eventual consistency. The size of this inconsistency win-
dow depends on the time it takes for multiple replicas to reach a consistent state.
Compared with single-node systems, distributed systems are more unstable and often
experience node or link failures, resulting in one or several nodes being in a failed state
and unable to provide normal services. This requires distributed systems to have a robust
fault tolerance mechanism to continue responding to client requests when such prob-
lems occur, ideally without users noticing any failures. Service-level high availability
does not require all nodes to be available after a failure, but rather that the system can
automatically coordinate the remaining functioning nodes to ensure service continuity.
In the database field, RTO and RPO are often used to measure system high avail-
ability. RTO refers to the time required for the system to restore normal services after
a disaster occurs. For example, if the system needs to restore normal operations within 1 h after a disaster, the RTO is 1 h. If the RTO is zero, the system can recover instantly after a disaster and has strong disaster recovery capabilities. A large RTO means the system may remain in a failed state for a long time or even indefinitely. RPO refers to the
amount of data that the system can tolerate losing when a disaster occurs. If the RPO
is 0, no data is lost. Figure 5.1 illustrates the RPO and RTO.
In a distributed storage system, the replication technology is often used to store
multiple replicas. When a node or link failure occurs, the system can automatically switch services over to the surviving replicas to continue providing services.
5.1.2 Quorum
Write-all, read one (WARO) is a replica control protocol with simple principles.
As the name suggests, it requires all replicas to be updated successfully during an
update; data can be read from any replica during data query. WARO ensures consis-
tency among all replicas but also creates new problems. Although it enhances read
availability, it leads to system load imbalance and significant update latency.
Moreover, an update must be applied to all replicas. As a result, an update fails if an exception occurs on even one node.
Quorum is a consistency protocol proposed by Gifford in 1979. Based on the
pigeonhole principle, high availability and eventual consistency can be guaranteed
through trade-offs between consistency and availability.
According to the Quorum mechanism, in a system with N replicas, an update
operation is considered successful only when it is successfully executed on W rep-
licas, and a read operation is considered successful when at least R replicas are read.
To ensure that the most recently updated data can be read every time, the Quorum
protocol requires W + R > N and W > N/2. To be specific, the set of written replicas
and the set of read replicas must have an intersection, and the written replicas must
account for more than half of the total replicas. When W = N and R = 1, Quorum is
equivalent to WARO. Therefore, WARO can be seen as a special case of Quorum,
and Quorum can balance reads and updates on the basis of WARO. For update operations, Quorum can tolerate exceptions on N–W replicas. For read operations, it can tolerate exceptions on N–R replicas. Update and read operations cannot be performed simultaneously on the same data.
Nonetheless, the Quorum mechanism cannot guarantee strong consistency. After
an update operation is completed, the replicas cannot immediately achieve a consis-
tent state, and subsequent read operations cannot immediately read the most recently
updated commit. The Quorum mechanism can only guarantee that the most recently
updated data is read each time but cannot determine whether the data has been com-
mitted. Therefore, if the latest version of data appears less than W times after R
replicas are read, the system proceeds to read other replicas until the latest version
of data appears W times. At this point, it can be considered that the latest version of
data is successfully committed. If the number of occurrences of this version is still
less than W after the other replicas are read, the second latest version in R is consid-
ered the most recently committed data.
In a distributed storage system, the values of N, W, and R can be adjusted based
on different business requirements. For example, for systems with frequent read
requests, W = N and R = 1. This ensures that the result can be quickly obtained by
reading just one replica. For systems requiring fast writes, R = N and W = 1. This
achieves better write performance at the cost of consistency.
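The read rule described above can be sketched as follows: a write succeeds once W replicas acknowledge it, and a read keeps collecting replica versions until the latest observed version has been seen on W replicas. Replicas are modeled as dictionaries, and the interfaces are assumptions for illustration.

def quorum_write(replicas, key, value, version, W):
    acks = 0
    for r in replicas:
        try:
            r[key] = (version, value)
            acks += 1
        except Exception:
            pass  # replica unavailable
    return acks >= W

def quorum_read(replicas, key, R, W):
    observed = []
    for r in replicas:
        if key in r:
            observed.append(r[key])
        # after R replicas, keep reading until some version appears W times
        if len(observed) >= R:
            latest = max(observed)[0]
            if sum(1 for v, _ in observed if v == latest) >= W:
                return dict(observed)[latest]
    return None  # latest version not confirmed as committed

replicas = [dict() for _ in range(5)]          # N = 5
assert quorum_write(replicas, "k", "v1", 1, W=3)
print(quorum_read(replicas, "k", R=3, W=3))    # -> "v1"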
5.1.3 Paxos
5.1.3.1 Prepare Phase
After receiving the request, an Acceptor updates the minimum proposal number that it has received. If the Acceptor has not replied to a request whose proposal number is greater than or equal to n in this round of the Paxos process, it returns the previously accepted proposal number and proposed value and promises not to accept any proposal whose number is less than n.
5.1.3.2 Accept Phase
When the Proposer receives ACKs of the Prepare request from a majority of
Acceptors, the Proposer chooses the largest proposal number accepted in the ACKs
and uses it as the proposal number for the current round. If no accepted proposal
number is received, the Proposer determines the proposal number. Then, the
Proposer broadcasts the proposal number and proposed value to all Acceptors.
An Acceptor checks the proposal number upon receiving the proposal. If the
promise, made in the Prepare phase, to never accept proposals whose numbers are less than n is not violated, the Acceptor accepts the proposal and returns the pro-
posal number. Otherwise, the Acceptor rejects the proposal and requests the
Proposer to return to Step 1 and reexecute the Paxos process.
After the Acceptor accepts the proposal, it sends the proposal to all Learners.
After confirming that the proposal has been accepted by a majority of Acceptors, the
Learners determine that the proposal is approved. Then, the Paxos round ends. A
Learner can also broadcast the approved proposal to other Learners.
Paxos is used to enable multiple replicas to reach consensus on a specific value.
For example, Paxos can be used in primary node reelection when the primary node
is faulty or in log synchronization among multiple nodes. Although Paxos is theo-
retically feasible, it is difficult to understand and lacks pseudocode-level implemen-
tation. The huge gap between algorithm description and system implementation
results in the final system being built on an unproven protocol. There are few imple-
mentations similar to Paxos in actual systems.
A typical actual scenario is to reach consensus on a series of consecutive values.
A direct approach is to execute the Paxos process for each value. However, each
round of the Paxos process requires two RPCs, which is costly. In addition, two
Proposers may propose incrementally numbered proposals, leading to potential
livelocks. To resolve this issue, the Multi-Paxos algorithm is developed to introduce
a leader role. Only the leader can make a proposal, which eliminates most Prepare
requests and ensures that each node eventually has complete and consistent data.
Taking log replication as an example, the leader can initiate a Prepare request
that contains the entire log rather than just a value in the log. The leader then initi-
ates an Accept request to confirm multiple values, thereby reducing RPCs by half.
In the Prepare phase, the proposal number is used to block old proposals and to discover log entries that have already been determined. One leader election method is as follows: Each node has
an ID, and the node with the greatest ID value is the leader by default. Each node
sends a heartbeat at an interval T. If a node receives no heartbeat information from
any node with a greater ID value within 2T, the node becomes the leader. To ensure
that all nodes have the complete latest log, Multi-Paxos is specifically designed
based on the following aspects:
• The system continuously sends Accept RPCs in the background to ensure that
responses are received from all Acceptors, thereby ensuring that the log of a node
can be synchronized to other nodes.
• Each node marks whether each log entry is approved and marks the first unap-
proved log entry to facilitate tracking of the approved log entries.
• The Proposer needs to inform the Acceptors of the approved log entries to help
the Acceptors update logs.
• When an Acceptor replies to the Proposer, it informs the latter of the index of the first unapproved log entry of the Acceptor. If the index of the first unapproved log entry of the Proposer is larger, the Proposer sends the log entries that the Acceptor is missing to the Acceptor.
5.1.4 Raft
5.1.4.1 Node Roles
Fig. 5.4 Transition between the follower, candidate, and leader roles
5.1.4.2 Leader Election
Raft triggers leader elections by using a heartbeat mechanism. In the initial state,
each node is a follower. Followers communicate with the leader by using a heartbeat
mechanism. If a follower receives no heartbeat messages within a specific period of
time, the follower believes that no leader is available in the system and initiates a
leader election.
The follower that initiates the election increases its current local term num-
ber, switches to the candidate role, votes for itself as the new leader, and sends
a vote request to other followers. Each follower may receive multiple vote
requests but can only cast one vote on a first-come, first-served basis. The log of the candidate that receives the vote must be at least as up-to-date as that of the follower.
The candidates wait for votes from other followers. The vote results for the can-
didates vary depending on the following cases:
• A candidate that receives more than half of the votes wins the election and
becomes the leader. Then, the new leader sends heartbeat messages to other
nodes to maintain its Leader status and prevent new elections from tak-
ing place.
• If a candidate receives a message that contains a larger term number from
another node, the sender node has been elected as the leader. In this case, the
candidate switches to the follower role. If a candidate receives a message that
contains a smaller term number, it rejects this message and maintains the can-
didate role.
• If no candidate receives more than half of the votes, the election times out. In this
case, each candidate starts a new election by increasing its current term number.
To prevent multiple election timeouts, Raft uses a random election timeout algo-
rithm. Each candidate sets a random election timeout when starting an election.
This prevents concurrent timeouts and concurrent initialization of new elections
by multiple candidates, thereby reducing the possibility of votes being divided
up in the new election.
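A minimal sketch of the vote-granting rule and the randomized election timeout is given below. It reduces a node to the state the rules above mention (current term, who it voted for, and its log) and is not a complete Raft implementation.

import random

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None
        self.log = []  # list of (term, command)

    def election_timeout(self, base=0.15, spread=0.15):
        # A randomized timeout reduces the chance of split votes.
        return base + random.random() * spread

    def handle_vote_request(self, term, candidate_id, last_log_term, last_log_index):
        if term < self.current_term:
            return False                      # reject stale candidates
        if term > self.current_term:
            self.current_term, self.voted_for = term, None
        my_last_term = self.log[-1][0] if self.log else 0
        # Candidate's log must be at least as up-to-date as the voter's.
        up_to_date = (last_log_term, last_log_index) >= (my_last_term, len(self.log))
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id     # one vote per term, first come first served
            return True
        return False

follower = Node(1)
print(follower.handle_vote_request(term=2, candidate_id=3,
                                   last_log_term=1, last_log_index=4))  # True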
5.1.4.3 Log Replication
Each server node has a replicated state machine implemented based on the repli-
cated log mechanism. If the state machines of the server nodes are in the same initial
state and the server nodes obtain identical execution commands from the logs,
which are executed in the same order, the final states of the state machines are also
the same.
After the leader is elected, the system provides services externally. The leader
receives requests from clients, each containing a command that acts on a replicated
state machine. The leader encapsulates the requests into log entries, appends the log
entries to the end of the log, and sends these log entries in order to the followers in
parallel. Each log entry contains a state machine command and the current term
number when the leader receives the request, as well as the position index of the log
entry in the log file. When the log entries are safely replicated to a majority of
nodes, the log entries are committed. The leader then returns a success to the clients
and instructs each node to apply the state machine commands in the log entries to
the replicated state machines in the same order as the log entries in the leader. At
this point, the log entries are applied, as shown in Fig. 5.5.
As shown in the figure, log replication in Raft is a Quorum-based process that can tolerate the failure of at most (n–1)/2 replicas. The leader will supplement the logs for out-of-sync replicas in the background.
To ensure that the logs of the followers are consistent with the log of the leader,
the leader must find the index position at which the logs of the followers are
consistent with its log. Then, the leader instructs the followers to delete their log
entries after the index position and sends its log entries after the index position to
the followers. In addition, the leader maintains a nextIndex for each follower,
which indicates the index of the next log entry that the leader will send to the fol-
lower. When the leader begins its term, it initializes the nextIndexes to its latest
log entry index +1. If a follower finds during consistency check that its log content
corresponding to the log entry index is inconsistent with the content of the log
entry that the leader sends to it, it will reject the log entry. After receiving the
response, the leader decrements the nextIndex and retries until the nextIndex is
consistent with the log entry index of the follower. At this point, the log entry
from the leader is successfully appended, and the logs of the leader and follower
become consistent. Therefore, the log replication mechanism of Raft has the fol-
lowing characteristics:
• If two log entries in different logs have the same log index and term number,
they store the same state machine command. This characteristic originates
from the fact that the leader can create at most one log entry at a specified log
index position within a term, and the position of the log entry in the log does
not change.
• If two log entries in different logs have the same log index and term number, all
their previous log entries are also the same. This phenomenon can be ascribed to
consistency checks. When the leader sends a new log entry, it also sends the log
index and term number of the previous log entry to the follower. If the follower
cannot find a log entry with the same log index and term number in its log, it will
reject the new log entry.
To prevent committed logs from being overwritten, Raft requires candidates to
have all committed log entries. When a node is newly elected as the leader, it can
only commit logs of the current term that have been replicated to a majority of
nodes. Logs with an old term number cannot be directly committed by the current
leader even if they have been replicated to a majority of nodes. These logs need to
be indirectly committed through log matching when the leader commits logs with
the current term number.
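The consistency check and the nextIndex back-off can be sketched as follows. Log entries are (term, command) tuples with 1-based indices, as in the Raft paper; conflict handling is simplified to overwriting everything after the agreed prefix.

def append_entries(follower_log, prev_index, prev_term, entries):
    # Reject if the follower has no entry at prev_index with prev_term.
    if prev_index > 0:
        if len(follower_log) < prev_index or follower_log[prev_index - 1][0] != prev_term:
            return False
    # Overwrite everything after prev_index with the leader's entries
    # (a simplification of Raft's conflict handling).
    del follower_log[prev_index:]
    follower_log.extend(entries)
    return True

def replicate(leader_log, follower_log):
    next_index = len(leader_log) + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return follower_log
        next_index -= 1                      # back off and retry

leader = [(1, "a"), (1, "b"), (3, "c")]
stale_follower = [(1, "a"), (2, "x")]
print(replicate(leader, stale_follower))     # -> [(1, 'a'), (1, 'b'), (3, 'c')]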
5.1.5 Parallel Raft
Parallel Raft is a consistency protocol designed and developed for PolarFS [3] to
ensure the reliability and consistency of stored data.
For simplicity and protocol comprehensibility, Raft adopts a highly serialized
design, which does not allow holes in logs of either the leader or followers. Log
entries are acknowledged by the follower, committed by the leader, and applied to
all replicas in a serialized manner. When a large number of concurrent write requests are executed, they are executed and committed in sequence, and the requests at the end of the queue can be committed and have their results returned only after all previous requests are persisted to disk, as shown in Fig. 5.6. This increases the average latency and reduces the throughput.
Parallel Raft removes the serialization constraint and implements performance
optimization for log replication through out-of-order ACKs and out-of-order com-
mits. It also ensures protocol correctness based on the Raft framework and imple-
ments out-of-order application based on actual application scenarios.
Out-of-order ACK: In Raft, after receiving a log entry from the leader, a follower
sends an ACK only after the current log entry and all its previous log entries are
persisted. In Parallel Raft, a follower returns an ACK immediately after receiving
any log entry, thereby reducing the average system latency.
Out-of-order commit: In Raft, the leader commits log entries in series. To be
specific, a log entry is committed only after all its previous log entries are commit-
ted. In Parallel Raft, the leader can commit a log entry as soon as the log entry is
acknowledged by a majority of replicas.
Out-of-order application: In Raft, all log entries are applied in strict order to
ensure the consistency of the data files of all replicas. In Parallel Raft, holes may
occur at different replica log positions due to out-of-order ACKs and out-of-order
commits. Therefore, it is necessary to ensure that a log entry can be safely applied
when preceding log entries are missing, as shown in Fig. 5.7.
To this end, Parallel Raft introduces a new data structure called “look-behind
buffer” to address the issue of missing log entries during application. Each log entry
in Parallel Raft comes with a look-behind buffer, which stores the summary of logi-
cal block addresses (LBAs) modified by the previous N log entries. A follower can
determine whether a log entry conflict exists (i.e., whether the log entry modifies
LBAs that are modified by a missing previous log entry) by using the look-behind
buffer. If no log entry conflict exists, the log entry can be safely applied. Otherwise,
it is added to a pending list and will be applied after the previous log entry that is
missing is applied.
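The conflict check can be sketched as follows: an entry carries a look-behind buffer summarizing the LBAs modified by its previous N entries, and when any of those predecessors is missing, the entry is applied only if its own LBAs do not overlap the summary (a conservative simplification of the rule described above). All structures are illustrative.

N = 2  # how many predecessors each look-behind buffer summarizes

def can_apply(entry, missing_indexes):
    """entry: dict with 'index', 'lbas' (LBAs it modifies), and 'look_behind'
    (LBAs modified by its previous N entries, shipped with the entry)."""
    has_missing_pred = any(i in missing_indexes
                           for i in range(entry["index"] - N, entry["index"]))
    if has_missing_pred and entry["lbas"] & entry["look_behind"]:
        return False   # possible conflict with a missing entry: park it
    return True

entry3 = {"index": 3, "lbas": {42}, "look_behind": {10, 11, 20}}
entry4 = {"index": 4, "lbas": {20}, "look_behind": {10, 11, 20, 42}}
missing = {2}                                # log entry 2 has not arrived yet
print(can_apply(entry3, missing))            # True: LBA 42 is untouched
print(can_apply(entry4, missing))            # False: LBA 20 may conflict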
Through the aforementioned asynchronous ACKs, asynchronous commits,
and asynchronous applications, Parallel Raft can avoid the extra waiting time
caused by sequencing during log entry writing and committing, thus effectively
reducing the average latency in high-concurrency multireplica synchronization
scenarios.
Databases are the cornerstone of business systems, and their availability is of vital
importance. Therefore, online databases are rarely deployed in standalone mode
because in this mode, services may become unavailable for seconds or even min-
utes or hours in severe cases (e.g., when accidents such as instance failure, host
failure, or network failure occur). If the disk is corrupted, data may be completely
lost, which is fatal for upper-level businesses that use the database. Hence, a data-
base cluster that implements high availability through leader-follower replication
is usually deployed in the production environment. The following section takes
MySQL as an example to introduce the general practices for implementing high
availability for databases in the industry and then discusses the high availability
architecture of PolarDB with reference to the advantages and disadvantages
of MySQL.
When the leader instance receives a write request and needs to update data, it writes
the event content of this update to its binlog file. At this time, the binlog dump
thread (created when the leader-follower relationship was established) on the leader
instance notifies the follower instance of the data update and passes the content
written to the binlog to the I/O thread of the follower instance.
5.2.1.2 I/O Thread
The I/O thread on the follower instance connects to the leader instance, requests a
connection point at a specified binlog file position from the leader instance, and then
continuously saves the binlog content sent by the leader instance to the local relay
log. Like the binlog, the relay log records data update events. Multiple relay log files
are generated and named in the host_name-relay-bin.000001 format with incremen-
tal suffixes. The follower instance uses an Index file (host_name-relay-bin.index) to
track the currently used relay log file.
5.2.1.3 SQL Thread
Once the SQL thread detects that the relay log is updated, it reads and parses the
update content and locally reexecutes the events that occurred on the leader
instance to ensure that data is synchronized between the leader and follower
instances. The binlog records the SQL statement executed by the user. Therefore,
parsing the binlog content sent by the leader instance is equivalent to receiving the
user request. Then, the SQL thread starts to reexecute the statement, starting from
SQL parsing.
Asynchronous replication is the most common binlog synchronization mode. In
this mode, after the leader instance writes the binlog, it directly returns a success
without waiting for acknowledgment of receipt of the binlog entry from follower
instances. If the leader instance breaks down, data for which a write success has
been returned to the user may have not been synchronized to follower instances.
When services are switched to a follower instance, such data will be lost.
MySQL can address this issue by using a semisynchronous mode. In this mode,
after the leader instance writes the binlog, it must wait for at least one follower
instance to acknowledge that it has received the binlog entry before returning a
write success to the user. This improves data consistency to some extent, but the
overhead of waiting for follower instance synchronization compromises the write
efficiency.
To efficiently achieve high availability, MySQL implements a MySQL Group
Replication (MGR) cluster based on the Paxos consistency protocol. Quorum-based
binlog replication is achieved by using the Paxos protocol to prevent data loss after
service switchover.
MySQL was designed as a database management system that supports multiple
engines. Different storage engines can be quickly integrated into MySQL in the
form of plug-ins. You can choose appropriate storage engines for different busi-
ness scenarios. For example, MyISAM features high insertion and query speed
but does not support transactions, MEMORY puts all data into memory but does
not support data persistence, and InnoDB provides complete transaction proper-
ties and persistence capabilities and is currently the most widely used storage
engine. Data cannot be shared between multiple storage engines; the data format
may vary based on the storage engine. This hinders replication across databases.
The binlog shields the heterogeneity of storage engines and provides a unified
data format to facilitate data synchronization to the downstream and thus serves
as a cornerstone for data replication. MySQL has been widely used in the Internet
era. In addition to its stability and efficiency, its fast and flexible horizontal scal-
ing capability brought upon by the binlog-based replication technology is consid-
ered the key to its success.
Replication Mode
Before MySQL 5.6, data was replicated by using the binlog file position-based replication protocol. In this method, data is replicated based on binlog file positions,
which are file names and file offsets of the binlog on the leader node. When a fol-
lower node initiates replication, it sends an initial position, pulls logs from the
leader, and applies the logs. This protocol is not flexible and cannot be used to build
complex topologies.
GTID = source_id:transaction_id,
Data Consistency
The binlog will be pulled by the downstream and contains data of committed trans-
actions, with one transaction possibly spanning across multiple storage engines.
Therefore, consistency between the binlog and one or more storage engines must be
guaranteed. MySQL uses the two-phase commit algorithm for distributed data-
bases, with the binlog as the coordinator and the storage engine as the participant.
With this algorithm, a transaction is committed in the following order: Prepare by
the storage engine (persisted) → Commit the binlog (persisted) → Commit by the
storage engine (not persisted).
A transaction commit involves two persistence operations. This way, during
crash recovery, whether the prepared transactions in each storage engine need to be
committed or rolled back can be determined based on whether the binlog has been
completely persisted. Persistence is a time-consuming operation, and transactions
in the binlog are ordered. As a result, the write performance will significantly dete-
riorate when binary logging is enabled in MySQL.
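The commit order and the recovery rule described above can be sketched as follows. The engine and binlog objects and their methods are illustrative stand-ins, not MySQL's internal interfaces.

def commit_transaction(txn, engine, binlog):
    engine.prepare(txn, durable=True)    # step 1: engine prepare, persisted
    binlog.append(txn, durable=True)     # step 2: binlog commit, persisted
    engine.commit(txn, durable=False)    # step 3: engine commit, not persisted

def crash_recovery(engine, binlog):
    # Prepared transactions are decided by whether their binlog entry survived.
    for txn in engine.prepared_transactions():
        if binlog.contains(txn):
            engine.commit(txn, durable=False)
        else:
            engine.rollback(txn)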
Like AWS Aurora, PolarDB adopts a shared storage architecture that supports one
writer and multiple readers. This architecture is advantageous over the traditional
architecture in which the primary and standby nodes maintain their independent
data. First, it reduces the storage costs. One copy of shared data can support one
read-write node and multiple read-only nodes at the same time. Second, it pro-
vides extreme flexibility. To add a read-only node in an independent data storage architecture, data needs to be replicated, which is time-consuming, depends on the total data volume, and may take hours or even days. In the shared storage archi-
tecture, data does not need to be replicated, and a read-only node can be created
within several minutes. Lastly, it significantly reduces the synchronization latency.
Only the memory status needs to be updated during synchronization because the
same disk data is visible to both the read-only nodes and the primary node. Details
will be discussed later. Meanwhile, the following section focuses on some key
technologies PolarDB uses to achieve high availability in the shared storage archi-
tecture that supports one writer and multiple readers. Figure 5.10 shows the shared
storage architecture of PolarDB. In the figure, RW represents a read-write node,
RO represents a read-only node, and PolarStore hosts the distributed file system
PolarFS.
5.2.2.1 Physical Replication
Logical Replication
Physical Logs
In addition to logical logs like the binlog, all database systems have a write-ahead
log (WAL), such as the redo log in MySQL. Such logs were initially designed to
support fault recovery of databases. Before actual data pages in the database are
modified, the modification content is written to the redo log. This way, once the
database fails due to an exception, the database status before the failure can be
restored by replaying the redo log during database restart. Each entry in the redo log
records only the modification to a single disk page. Such logs are called physical
logs. Logical logs may affect data in a large number of different locations during
replay. For example, replaying an INSERT operation may split the B+ tree, modify
an undo page, or even modify some metadata. As the name suggests, physical logs
record direct modifications of physical page information. Such logs can naturally
maintain the consistency of physical data and can be repurposed for synchronization from the primary node to standby nodes in the shared storage architec-
ture, as shown in Fig. 5.11.
In this architecture, the primary and standby nodes see the same data and the
same redo log on the shared storage. The primary node only needs to inform a
standby node of the position at which the current log write ends. Then, the standby
node reads the redo log from the shared storage and updates its memory status. The
physical structure of the primary node can be obtained by replaying the redo log,
which also ensures that the information in the memory structure of the standby node
completely corresponds to the persistent data in the shared storage.
Fig. 5.11 Synchronization from the primary node to a standby node in the shared storage
architecture
The redo log is located at the underlying level of the database system engine and
records the final modification of the data page. Corresponding to the logical rep-
lication mechanism mentioned above, replication is implemented based on a bot-
tom-up approach in the shared storage architecture that uses the physical
replication scheme, as shown in Fig. 5.12. The standby node reads the redo log
from the shared storage, parses and applies the redo log, and then updates the
cached data page, transaction information, index information, and other status
information in the memory.
The latency becomes excessively long if it takes a long time to execute a transac-
tion. Physical replication uses a different approach as it is intended to maintain data
consistency between the primary and standby nodes at the physical page level. The
redo log can be continuously written during transaction execution. Therefore, trans-
action rollback and MVCC can be implemented for standby nodes in the same mode
as that for the primary node. Physical replication can be performed in real time dur-
ing the execution of a transaction. The replication delay of physical replication can be calculated as follows: replication latency = redo log transmission time + redo log replay time.
The transmission time can be very small because the same redo log is accessed,
and the replay time accounts only for the time taken to replay the content of a single
page, which is much smaller compared with the entire binlog. As a result, the repli-
cation latency of physical replication is much shorter than that of logical replication,
even reaching the millisecond level. In addition, the replication latency of physical
replication is irrelevant to transactions. Figure 5.13 shows the latency comparison
between physical replication and logical replication.
Fig. 5.13 Latency comparison between physical replication and logical replication
and downstream. Therefore, in the physical replication scheme, the database still
needs to support the binlog.
This can be easily implemented in the shared storage architecture by writing the
binlog to the shared storage. Figure 5.14 shows physical replication in a nonshared
storage architecture, in which the binlog can be transferred to a standby node by
using a replication link (which is a logical replication link) other than that used to
transfer the redo log. However, these two log links are not synchronized. If a swi-
tchover is performed due to an exception, the binlog and data on the standby node
may be inconsistent.
To solve this problem, Alibaba Cloud proposed the concept of logic redo log,
which integrates the capabilities of the binlog and redo log. This avoids the data
consistency issue that arises due to the synchronization of the redo log and the bin-
log. Figure 5.15 shows the logic redo architecture.
The binlog is stored in a distributed manner but is presented as a complete file to
external interfaces. The runtime binlog system maintains the memory file structure,
parses log files, and provides a centralized interface for the binlog and redo log.
5.2.2.2 Logical Consistency
redo log is replayed on the RO nodes, physical structures on the RO nodes may be
different because of different transaction commitment sequences and data modifica-
tion sequences. Therefore, concurrency control must be implemented to make sure
that all RO nodes read the same physical structure.
This section describes the snapshot and MVCC implementation for logical con-
sistency, as well as how to ensure the physical structure consistency in the B+ tree
structure.
A read view is a snapshot that records the ID array and related information about currently active transactions in the system. It is used for visibility judgment, that is, for checking whether a given row version is visible to the current transaction. A read view involves multiple variables, including the following:
trx_ids: This variable stores the list of active transactions, namely, the IDs of the other uncommitted active transactions at the time the read view was created. For example, if transaction B and transaction C in the database have not been committed or rolled back when transaction A creates a read view, trx_ids will record the transaction IDs of transaction B and transaction C. If a record carries the ID of a transaction in trx_ids, the modification made by that transaction is invisible to transaction A.
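To make the visibility judgment concrete, the following Python sketch shows how a read view can be used to decide whether a row version is visible. It is a simplified illustration rather than InnoDB's actual code; up_limit_id and low_limit_id are the standard InnoDB read-view watermarks (the smallest active transaction ID and the next transaction ID to be assigned), which are only summarized here.

from dataclasses import dataclass
from typing import Set

@dataclass
class ReadView:
    creator_trx_id: int   # transaction that created the read view
    trx_ids: Set[int]     # IDs of transactions that were active at creation time
    up_limit_id: int      # smallest active ID: anything below it was already committed
    low_limit_id: int     # next ID to be assigned: anything at or above it started later

    def is_visible(self, row_trx_id: int) -> bool:
        """Return True if a row version written by row_trx_id is visible to the creator."""
        if row_trx_id == self.creator_trx_id:
            return True                        # a transaction always sees its own changes
        if row_trx_id < self.up_limit_id:
            return True                        # committed before the snapshot was taken
        if row_trx_id >= self.low_limit_id:
            return False                       # started after the snapshot was taken
        return row_trx_id not in self.trx_ids  # active at snapshot time -> invisible

# Transactions B (ID 8) and C (ID 9) were active when transaction A (ID 10)
# created its read view, so their modifications are invisible to A.
view = ReadView(creator_trx_id=10, trx_ids={8, 9}, up_limit_id=8, low_limit_id=11)
assert view.is_visible(5) and not view.is_visible(9)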
PolarDB adopts the redo log-based physical replication scheme to implement the
shared storage architecture that supports one writer and multiple readers. The RW
node and RO nodes share the same data. Therefore, the hidden fields in the record
are exactly the same in the RW node and RO nodes. To guarantee that the correct
data version is read during data access in the one-writer, multireader architecture,
the consistency of the transaction status of the RW node and the RO nodes must be
ensured. The transaction status is synchronized by using the redo log. The start of a
transaction can be identified by the MLOG_UNDO_HDR_REUSE or MLOG_
UNDO_HDR_CREATE record, and the commit of a transaction can be identified
by adding an MLOG_TRX_COMMIT record in PolarDB. This way, the committed
transactions and active transactions can be clearly identified on RO nodes by apply-
ing the redo log records, thereby ensuring a consistent transaction status between
the RW node and RO nodes.
Figure 5.16 shows the transaction status of an RW node and an RO node in the
repeatable read isolation level.
In the figure, the left-side column shows the transaction status of an RW node,
and the right-side column shows the transaction status of an RO node. MVCC-
facilitated consistent nonlocking reads are supported in the repeatable read and read
committed isolation levels.
5.2.2.3 Physical Consistency
The index structure is one of the key factors affecting the performance of database systems in high-concurrency scenarios. In addition to conventional operations, such as query, insert, delete, and update
operations, the B+ tree structure supports structural modification operations (SMOs).
For example, when a tree node does not have sufficient space to accommodate a new record, the node will be split into two nodes, and a reference to the new node will be inserted into the upper-level parent node. This changes the tree structure. Without a proper concurrency control mechanism, operations performed concurrently with an SMO on the B+ tree may see the tree structure in an intermediate state. As a result, records that should exist may not be found, or the access may fail because an invalid memory address is accessed. In cloud-native databases, physi-
cal consistency means that even if multiple threads access or modify the same B+ tree
at the same time, all threads must see a consistent structure of the B+ tree.
This can be achieved simply by using a single coarse-grained index lock, but such a lock seriously compromises concurrency performance. Since the introduction of the B+ tree structure in 1970, many studies on how to optimize the performance of B+ trees in multithreaded scenarios have been published at top conferences in the database and systems fields, such as VLDB, SIGMOD, and EuroSys. In the cloud-native architec-
ture that features separation of computing and storage and supports one writer and
multiple readers, the RW nodes and RO nodes have independent memory and main-
tain different replicas of the B+ tree. However, threads on the RW nodes and RO
nodes may access the same B+ tree at the same time. This poses a problem in terms
of the physical consistency across nodes.
This section first describes the concurrency control mechanism for B+ trees in InnoDB, which ensures the physical consistency of a B+ tree in the traditional single-node architecture, and then describes the method used to ensure the physical consistency of a B+ tree in the one-writer, multireader architecture of PolarDB.
A proper concurrency control mechanism for a B+ tree must meet the following
requirements:
• The read operations are correct. R.1: A key-value pair in an intermediate state
will not be read. In other words, a read operation will not read a key-value pair
that is being modified by another write operation. R.2: An existing key-value pair
must be present. If a key-value pair on a tree node being accessed by a read
operation is moved to another tree node by a write operation (e.g., in a splitting
or merging operation), the read operation may fail to find the key-value pair.
• The write operations are correct. W.1: Two write operations will not modify the
same key-value pair at the same time.
• No deadlocks exist. D.1: Deadlocks, which are a situation in which two or more
threads are permanently blocked and wait for resources occupied by other
threads, will not occur.
PolarDB for MySQL 5.6 and earlier versions adopt a relatively basic concur-
rency mechanism that uses locks of two granularities: S/X locks on indexes and S/X
locks on pages (pages are equivalent to tree nodes in this book). An S/X lock on an
index is used to avoid conflicts in tree structure access and modification operations,
and an S/X lock on a page is used to avoid conflicts in data page access and modifi-
cation operations.
The following lists the notations that will be used in pseudocode in this book:
• SL adds a shared lock.
• SU releases a shared lock.
• XL adds an exclusive lock.
• XU releases an exclusive lock.
• SXL adds a shared exclusive lock.
• SXU releases a shared exclusive lock.
• R.1/R.2/W.1/D.1: correctness requirements that concurrency mechanisms need
to satisfy.
The following section analyzes the processes of read and write operations by
using pseudocode.
In Algorithm 1, the read operation adds an S lock to the entire B+ tree (Step 1),
traverses the tree structure to find the corresponding leaf node (Step 2), adds an S
lock to the page of the leaf node (Step 3), releases the S lock on the index (Step
4), accesses the content of the leaf node (Step 5), and then releases the S lock on
the leaf node (Step 6). The read operation adds an S lock to the index to prevent
the tree structure from being modified by other write operations, thus meeting
R.2. After the read operation reaches the leaf node, it applies for a lock on the
page of the leaf node and then releases the lock on the index. This prevents a key-
value pair from being modified by other write operations, thereby meeting R.1.
The read operation adds an S lock to the B+ tree. This way, other read operations
can access the tree structure in parallel, thereby reducing concurrency conflicts
between read threads.
A write thread may modify the entire tree structure. Therefore, it is necessary to
avoid two write threads from accessing the same B+ tree at the same time. To this
end, Algorithm 2 adopts a more pessimistic solution. Each write operation adds an
X lock to the B+ tree (Step 1) to prevent other read or write operations from access-
ing the B+ tree during the execution of the write operation and from accessing an
incorrect intermediate state. Then, the write operation traverses the tree structure to
find the corresponding leaf node (Step 2) and adds an X lock to the page of the leaf
node (Step 3). Next, the write operation determines whether it will trigger an opera-
tion that modifies the tree structure, such as a splitting or merging operation. If yes,
the write operation modifies the entire tree structure (Step 4) and then releases the
lock on the index (Step 5). Lastly, it modifies the content of the leaf node (Step 6)
and then releases the X lock on the leaf node (Step 7). Although the pessimistic
write operation satisfies W.1 by using an exclusive lock on the index, the exclusive
lock on the B+ tree blocks other read and write operations. This results in poor mul-
tithreading scalability in high-concurrency scenarios. The following discussion will
reveal if there is room for optimization.
Each tree node page can store a large number of key-value pairs. Therefore, a
write operation on a B+ tree does not usually trigger an operation that modifies the
tree structure, such as splitting or merging. Compared with the pessimistic idea of
Algorithm 2, Algorithm 3 adopts an optimistic approach that assumes most write
operations will not modify the tree structure. In Algorithm 3, the whole process of
the write operation is roughly the same as that in Algorithm 1. The write operation
holds an S lock on the tree structure during access, so that other read operations and
optimistic write operations can also access the tree structure at the same time. The
main difference between Algorithm 3 and Algorithm 1 is that in the former, the
write operation holds an X lock on the leaf node. In MySQL 5.6, a B+ tree often
performs an optimistic write operation first and only performs a pessimistic write
operation when the optimistic write operation fails. This reduces conflicts and
blocking between operations. Both a pessimistic write operation and an optimistic
write operation prevent write conflicts by using a lock on the index or page to
meet W.1.
In MySQL 5.6, locks are acquired from top to bottom and from left to right. This prevents the locks acquired by any two threads from forming a cycle, thereby preventing deadlocks and meeting D.1.
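The locking protocols of Algorithms 1–3 can be summarized in the following Python sketch. It is an illustrative approximation rather than InnoDB code: RWLock is a minimal shared/exclusive lock, and tree.traverse_to_leaf, tree.will_trigger_smo, tree.modify_structure, and the leaf object with its lock and get/put methods are hypothetical placeholders for the corresponding B+ tree operations.

import threading

class RWLock:
    """Minimal shared (S) / exclusive (X) lock, used only for illustration."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def s_lock(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def s_unlock(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def x_lock(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def x_unlock(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

index_lock = RWLock()   # one lock protecting the structure of the whole B+ tree

def read(tree, key):                         # Algorithm 1
    index_lock.s_lock()                      # Step 1: S lock on the index (meets R.2)
    leaf = tree.traverse_to_leaf(key)        # Step 2: find the target leaf node
    leaf.lock.s_lock()                       # Step 3: S lock on the leaf page (meets R.1)
    index_lock.s_unlock()                    # Step 4: release the index lock early
    value = leaf.get(key)                    # Step 5: read the record
    leaf.lock.s_unlock()                     # Step 6: release the leaf lock
    return value

def pessimistic_write(tree, key, value):     # Algorithm 2
    index_lock.x_lock()                      # Step 1: X lock blocks all other operations
    leaf = tree.traverse_to_leaf(key)        # Step 2
    leaf.lock.x_lock()                       # Step 3
    if tree.will_trigger_smo(leaf, key, value):
        tree.modify_structure(leaf, key)     # Step 4: split/merge under the index X lock
    index_lock.x_unlock()                    # Step 5
    leaf.put(key, value)                     # Step 6 (meets W.1)
    leaf.lock.x_unlock()                     # Step 7

def optimistic_write(tree, key, value):      # Algorithm 3
    index_lock.s_lock()                      # assume no SMO will be triggered
    leaf = tree.traverse_to_leaf(key)
    leaf.lock.x_lock()                       # X lock only on the leaf page
    ok = not tree.will_trigger_smo(leaf, key, value)
    if ok:
        leaf.put(key, value)
    leaf.lock.x_unlock()
    index_lock.s_unlock()
    if not ok:
        pessimistic_write(tree, key, value)  # fall back, as MySQL 5.6 does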
After PolarDB for MySQL is upgraded from 5.6 to 5.7, the concurrency mecha-
nism of the B+ tree significantly changed in the following aspects: First, SX locks
are introduced, which conflict with X locks but do not conflict with S locks, thereby
reducing blocked read operations. Second, a write operation locks only the modifi-
cation branch to reduce the scope of locking. The read operations and optimistic
write operations in MySQL 5.7 are similar to those in MySQL 5.6. Hence, this sec-
tion describes only the pseudocode for a pessimistic write operation in MySQL 5.7.
In Algorithm 4, a write operation adds an SX lock to the tree structure (Step 1),
adds an X lock to the branches affected during the traversal of the tree structure
(Steps 2–4), adds an X lock to the leaf node (Step 5), releases the locks on nonleaf
nodes and the index (Steps 6–8), and then modifies the leaf node and releases the
lock on the leaf node (Steps 9 and 10). The correctness of the write operations and
the deadlock-free requirement are similar to those in the preceding sections.
Therefore, details will not be described repeatedly here. Compared with that in
PolarDB for MySQL 5.6, a pessimistic write operation in PolarDB for MySQL 5.7
no longer locks the entire tree structure but locks only the modified branches. This
way, read operations that do not conflict with the write operation can be performed
in parallel with the write operation, thereby reducing conflicts between threads.
PolarDB for MySQL 8.0 uses a locking mechanism similar to that of PolarDB for
MySQL 5.7.
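The essential difference introduced in MySQL 5.7 is the compatibility of the SX mode: it blocks other structural modifications (SX or X) while still admitting readers (S). The following Python sketch encodes this compatibility check; it is illustrative only and not the InnoDB rw-lock implementation (in InnoDB, at most one thread can hold the SX lock on an index at a time).

# Lock-mode compatibility for index locks: S = shared, SX = shared exclusive,
# X = exclusive. True means the two modes may be held by different threads
# at the same time.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "SX"): True, ("SX", "S"): True,    # SX does not block readers
    ("S", "X"): False, ("X", "S"): False,
    ("SX", "SX"): False,                     # only one structural modifier at a time
    ("SX", "X"): False, ("X", "SX"): False,
    ("X", "X"): False,
}

def can_grant(requested, held_modes):
    """Return True if `requested` is compatible with every lock mode already held."""
    return all(COMPATIBLE[(requested, held)] for held in held_modes)

# A pessimistic write in MySQL 5.7 holds SX on the index: concurrent reads are
# still granted, while another SMO (SX or X) has to wait.
assert can_grant("S", ["SX"]) is True
assert can_grant("X", ["SX"]) is False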
Unlike the traditional InnoDB engine, which needs to ensure only the physical con-
sistency of the B+ tree of a single node, PolarDB adopts the one-writer, multireader
architecture and must ensure that the concurrent threads on multiple RO nodes can
read consistent B+ trees. In PolarDB, an SMO is synchronized to B+ trees in the
memory of RO nodes by replaying the redo log based on physical replication.
Physical replication synchronizes the redo log in the unit of disk pages. However,
one SMO affects multiple tree nodes, which may breach the atomicity of applying
the redo log entry of the SMO. Consequently, concurrent threads on the RO nodes
may read inconsistent tree structures.
The simplest solution is to forbid user threads from accessing the B+ tree when an RO node discovers that the redo log contains a log entry of an SMO (i.e., when the structure of the B+ tree is changed, such as when page merging or splitting occurs). When a minitransaction that holds an exclusive lock on an index and modifies data across pages is committed on the primary node, the ID of the index is written to the log. When the log is parsed on a standby node, a synchronization point is generated, at which the following tasks are performed in sequence: (1) parsing of the log group is completed, (2) the exclusive lock on the index is acquired, (3) the log group is replayed, and (4) the exclusive lock on the index is released.
Although this method can effectively solve the foregoing problem, too many
synchronization points significantly affect the synchronization speed of the redo
log. This may lead to high synchronization latencies of RO nodes. To address this
issue, PolarDB introduces a versioning mechanism that maintains a global counter
Sync_counter for all RO nodes. This counter is used to coordinate the redo log syn-
chronization mechanism and the concurrent execution of user read requests, thereby
ensuring the consistency of B+ trees.
• During parsing of the redo log, an RO node collects the IDs of all indexes on which an SMO is performed and increments the global Sync_counter.
• X locks on all indexes affected by the SMO are acquired, the latest value of Sync_counter is recorded in the in-memory structure of each affected index, and the X locks on the indexes are then released. A request that accesses a B+ tree needs to hold an S lock on the index, so the X lock ensures that the B+ tree cannot be accessed before the X lock is released.
• When a user request traverses a B+ tree, it checks whether the index's copy of Sync_counter is consistent with the global Sync_counter. If yes, an SMO is being applied to the B+ tree, and the index page being accessed needs to be updated to the latest version by using the redo log. Otherwise, the redo log does not need to be replayed.
By using this optimistic approach, PolarDB greatly reduces the interference to
concurrent requests to the B+ tree during application of the SMO log entries in the
redo log. This significantly improves the performance of read-only nodes while
ensuring the physical consistency across nodes.
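The counter-based coordination can be sketched as follows. This is a simplified, single-threaded Python illustration that follows the description above; Index, parse_smo_batch, and replay_redo_for_page are hypothetical placeholders rather than PolarDB internals.

current_sync_counter = 0       # global Sync_counter maintained on the RO node

class Index:
    def __init__(self, name):
        self.name = name
        self.sync_counter_copy = 0   # copy kept in the in-memory index structure

def parse_smo_batch(affected_indexes):
    """Redo parsing on an RO node: bump the global counter and stamp affected indexes."""
    global current_sync_counter
    current_sync_counter += 1
    for idx in affected_indexes:
        # In PolarDB this update is performed while holding the index X lock,
        # so no user request can traverse the B+ tree at the same time.
        idx.sync_counter_copy = current_sync_counter

def replay_redo_for_page(page):
    """Placeholder for applying the pending redo log entries to one page."""
    pass

def access_page(idx, page):
    """User read request traversing index idx (holding the index S lock)."""
    if idx.sync_counter_copy == current_sync_counter:
        # An SMO affecting this index is pending: bring the accessed page up to
        # date by replaying the redo log before using it.
        replay_redo_for_page(page)
    return page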
5.2.2.4 DDL
In the trade-off among time, space, and flexibility, MySQL adopts a design in which data definitions are stored separately from the data itself, at the expense of flexibility. Under this design, each piece of physical data stored in MySQL does not contain all the information required to interpret itself and can be correctly interpreted and manipulated only in combination with the independently stored data definitions. As a result, DDL operations are often accompanied by modifications of the full table data, making them the most time-consuming operations in MySQL. In scenarios with large data volumes, the execution of a single DDL statement can take days.
DDL operations are often executed concurrently with Data Query Language
(DQL) and Data Manipulation Language (DML). To control concurrency and
ensure the correctness of database operations, MySQL introduced metadata locks
(MDLs). A DDL operation can block DML and DQL operations by acquiring
exclusive MDLs, thus achieving concurrency control. However, in production
environments, blocking DML operations and other operations for a long time
severely affects the business logic. To address this problem, MySQL 5.6 intro-
duced the online DDL feature. Online DDL enables concurrent execution of DDL
and DML operations by introducing the Row_log object. The Row_log object
records the DML operations executed during the execution of a DDL operation,
and the incremental data generated by the DML operations is replayed after the
DDL operation is executed. This simple solution effectively solves the problems that arise from the concurrent execution of DDL and DML operations. With this solution, the basic online DDL logic of MySQL took shape. Online DDL has been part of MySQL ever since and is still used in the latest MySQL version (i.e., MySQL 8.0).
Instant DDL
MySQL has effectively solved the concurrency issue between DDL and other oper-
ations. However, the manipulation of full data during DDL operations remains a
significant burden on the storage engine. To address this issue, MySQL 8.0 intro-
duced the Instant DDL feature. This feature enables MySQL to modify only the data
definition during DDL operations without modifying the actual physical data stored.
This solution, which is completely different from the previous DDL logic, stores
more information in data definitions and physical data to facilitate correct interpre-
tation and manipulation of the physical data. Essentially, the instant DDL feature is a simplified version of the data dictionary multiversioning technique. However,
due to various complex engineering issues, such as compatibility, instant DDL now
supports only adding columns to the end of a table. Sustained efforts are still
required to enable instant DDL to support other operations. PolarDB faces scenarios
involving massive cloud data. Therefore, the instant DDL feature is especially
important to PolarDB. Through considerable efforts, Alibaba Cloud has imple-
mented instant DDL in earlier versions, such as PolarDB for MySQL 5.7, and
enabled instant DDL to support more operations.
Parallel DDL

5.3 Shared Storage Architectures

5.3.1 Aurora

Aurora [4] is a relational database service launched by AWS specifically for the cloud.
It adopts a compute-storage-separated architecture in which compute nodes and stor-
age nodes are separately located in different virtual private clouds (VPCs). As shown
in Fig. 5.18, users access applications through the user VPC, and the RW node and RO
node communicate with each other in the Relational Database Service (RDS)
VPC. The data buffer and persistent storage are located in the storage VPC. This
achieves the physical separation of computing and storage in Aurora. The storage
VPC consists of multiple storage nodes mounted with local SSDs, which form an
Amazon Elastic Compute Cloud (EC2) VM cluster. This storage architecture provides
a unified storage space to support the one-writer, multireader architecture. Read-only
replicas can be quickly added by transferring the redo log over a network.
Aurora is built on products such as EC2, VPC, Amazon S3, Amazon DynamoDB,
and Amazon Simple Workflow Service (SWF) but does not have a dedicated file
system like PolarFS for PolarDB. To understand the Aurora storage system, you
need to understand the following basic products.
5.3.1.1 Amazon S3
Amazon S3 is a global object storage service that acts like a hard drive with an enormous capacity. It can provide storage infrastructure for any application. The basic
entities stored in S3 are called objects, which are stored in buckets. The storage
architecture of S3 consists of only two layers. An object includes data and metadata,
where the metadata is often key-value pairs describing the object. A bucket provides
a way for organizing, storing, and classifying data. The operation UI of S3 is user-
friendly, and users can use simple commands to operate data objects in buckets.
5.3.1.2 Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL key-value database service that provides low-latency reads and writes at virtually any scale.
5.3.1.3 Amazon SWF
Amazon SWF helps developers easily build applications that coordinate work
across distributed components and can be viewed as a fully managed task coordina-
tor in the cloud. With Amazon SWF, developers can control the execution and coor-
dination of tasks without the need to track the progress of the tasks or maintain their
status or concern themselves with the complex underlying implementation.
As shown in Fig. 5.19, the underlying storage system of Aurora is responsible for
persisting the redo log, updating pages, and clearing expired log records. It regu-
larly uploads backup data to Amazon S3. The storage nodes are equipped with local SSDs. Therefore, the persistence of the redo log and page updates does not require cross-network transmission, and only the redo log needs to be transmitted over the
network. The metadata of the storage system, for example, the data that describes
how data is distributed and the running status of the software, is stored in Amazon
DynamoDB. Aurora’s long-term automated management, such as database recov-
ery and data replication, is implemented by using Amazon SWF.
5.3.2 PolarFS
PolarFS is a distributed file system developed for PolarDB that provides low-latency, high-throughput, and highly reliable data access. The shared storage design of PolarFS enables all compute nodes to
share the same underlying data. This way, read-only instances can be quickly added
in PolarDB without data replication.
PolarFS is internally divided into two layers. The underlying layer is responsible
for virtualization management of storage resources and provides a logical storage
space (in the form of a volume) for each database instance. The upper layer is
responsible for file system metadata management in the logical storage space and
controls synchronization and mutual exclusion for concurrent access to metadata.
PolarFS abstracts and encapsulates storage resources into volumes, chunks, and
blocks for efficient organization and management of resources.
A volume provides an independent logical storage space for each database
instance. Its capacity can dynamically change based on database needs and reaches
up to 100 TB. A volume appears as a block device to the upper layer. In addition to
database files, it also stores metadata of the distributed file system.
A volume is internally divided into multiple chunks. A chunk is the smallest
granularity of data distribution. Each chunk is stored on a single NVMe SSD on a
storage node, which is conducive to implementing high reliability and high
availability of data. A typical chunk size is 10 GB, which is much larger than the
chunk size in other systems. The larger chunk size reduces the amount of metadata
that needs to be maintained for chunks. For example, for a 100-TB volume, meta-
data records need to be maintained only for 10,000 chunks. In addition, the storage
layer can cache metadata in memory to effectively avoid additional metadata access
overhead on critical I/O paths.
A chunk is further divided into multiple blocks. The physical space on SSDs is
allocated to corresponding chunks in the unit of blocks as needed. The typical block
size is 64 KB. The information about mappings between chunks and blocks is man-
aged and stored in the storage layer and cached in memory to further accelerate
data access.
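These sizes directly determine how much mapping metadata must be kept. The following back-of-the-envelope Python sketch uses the figures from the text (decimal units, matching the rounded numbers above); the address-splitting helper is purely illustrative and not a PolarFS API.

GB, TB, KB = 10**9, 10**12, 10**3

CHUNK_SIZE = 10 * GB      # typical PolarFS chunk size
BLOCK_SIZE = 64 * KB      # typical PolarFS block size
VOLUME_SIZE = 100 * TB    # maximum volume size

# Only about 10,000 chunk metadata records are needed for a 100-TB volume,
# which easily fits into the storage layer's memory cache.
print(VOLUME_SIZE // CHUNK_SIZE)    # 10000 chunks
print(CHUNK_SIZE // BLOCK_SIZE)     # 156250 blocks per chunk (allocated on demand)

def locate(volume_offset):
    """Map a byte offset within a volume to (chunk index, block index, offset in block)."""
    chunk_idx = volume_offset // CHUNK_SIZE
    block_idx = (volume_offset % CHUNK_SIZE) // BLOCK_SIZE
    return chunk_idx, block_idx, volume_offset % BLOCK_SIZE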
As shown in Fig. 5.20, PolarFS consists of Libpfs, PolarSwitch, chunk servers,
PolarCtrl, and other components. Libpfs is a lightweight user-space file system
library that provides a POSIX-like interface to databases for managing and access-
ing files in volumes. PolarSwitch is a routing component deployed on a compute
node. It maps and forwards I/O requests to specific backend storage nodes. A chunk
server is deployed on a storage node and is responsible for responding to I/O
requests and managing resources of chunks. A chunk server replicates write requests
to other replicas of a chunk. Chunk replicas ensure data synchronization in various
faulty conditions by using the ParallelRaft consistency protocol, thereby preventing
data loss. PolarCtrl is a control component of the system and is used for task man-
agement and metadata management.
Libpfs converts the file operation issued by the database into a block device I/O
request and delivers the I/O request to PolarSwitch. Based on the locally cached chunk
route information, PolarSwitch forwards the I/O request to the chunk server on which the leader chunk is located. After the chunk server on which the leader chunk is
located receives the request by using an RDMA NIC (network interface card), it deter-
mines whether the operation is a read operation or a write operation. If it is a read
operation, the leader chunk directly reads local data and returns the result to
PolarSwitch. If it is a write operation, the leader chunk writes the operation content to
the local WAL and then sends the operation content to the follower chunks, which
subsequently write the operation content to their respective WALs. After receiving responses from a majority of the follower chunks, the leader chunk returns a write success to PolarSwitch and asynchronously applies the log to the data area of the follower chunks.
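A highly simplified sketch of this write path is shown below. It is synchronous Python pseudocode for illustration only: the leader and follower objects are hypothetical placeholders, and ParallelRaft details such as out-of-order acknowledgment are omitted.

def handle_write(leader, followers, op):
    """Leader chunk server handling a write request forwarded by PolarSwitch."""
    leader.wal.append(op)                       # 1. persist the operation to the local WAL
    acks = 1                                    #    the leader's own WAL write counts
    for follower in followers:                  # 2. ship the operation to the replicas
        if follower.append_wal(op):             #    each follower persists it to its WAL
            acks += 1
    majority = (1 + len(followers)) // 2 + 1
    if acks >= majority:                        # 3. a majority of replicas acknowledged
        leader.apply_async(op)                  # 4. apply to the data areas in the background
        return "success"                        # 5. report success back to PolarSwitch
    return "retry"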
5.4 File System Optimization

When an application initiates a system call, the CPU switches from the user mode to the kernel mode and operates on hardware resources in kernel mode. After the operation is completed, the CPU switches back to the user mode and returns the result to the application. When the CPU switches from the user mode to the kernel mode to execute the system call, it saves the application's register state to memory and then loads the process information related to the system call from memory into the registers for execution. This procedure is called context switching.
In the traditional hard disk era, the system uses an interrupt-based I/O model. After
an application initiates an I/O request, the CPU switches from the user mode to the
kernel mode, initiates a data request to the disk, and then switches back to the user
mode to continue processing other work. After the disk data is ready, the disk initi-
ates an interrupt request to the CPU. After receiving the request, the CPU switches
to the kernel mode to read the data and replicates the data from the kernel space to
the user space and then switches back to the user mode. Multiple context switches
and data replication operations occur during the I/O process, which undoubtedly
generates overheads. However, the overheads are negligible compared to the read/
write latency of traditional hard disks. With the successful commercialization and advancement of NVM technologies, the speed of storage devices has improved significantly. For example, an NVMe SSD can complete 500,000 I/O operations per second with a latency of less than 100 μs. Therefore, the system performance is no longer bottlenecked by hardware but by software. To address the mismatch between the traditional I/O stack and high-speed storage devices such as NVMe SSDs, Intel developed the Storage Performance Development Kit (SPDK) for NVMe devices. SPDK moves all necessary drivers to the user space to avoid CPU context switches and data copies. In addi-
tion, interrupts are replaced with polling by the CPU, thereby further lowering the
latencies of I/O requests.
This section takes the distributed shared file system PolarFS as an example to
describe the application of a user-space I/O stack and network stack that are based
on new hardware in a storage system. PolarFS, which is the underlying storage sys-
tem of PolarDB, provides databases with high-performance and high-availability
storage services at a low latency. PolarFS adopts a lightweight user-space I/O stack
and network stack to utilize the potentials of emerging hardware and technologies,
such as NVMe SSDs and RDMA.
To avoid the overheads of message transfers between the kernel space and user
space in a traditional file system, especially the overheads of data replication,
PolarFS provides the database with a lightweight user-space file system library
named Libpfs. Libpfs replaces the standard file system interface and enables all I/O
operations of the file system to run in the user space. To ensure that I/O events are
handled in a timely manner, PolarFS constantly polls and listens to hardware
devices. In addition, to avoid CPU-level context switching, PolarFS binds each
worker thread to a CPU, so that each I/O thread runs on a specified CPU and each
I/O thread handles different I/O requests and is bound to different I/O devices. In
essence, each I/O request is scheduled by the same I/O thread and processed by the
same CPU in its entire lifetime.
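The thread model can be pictured with a few lines of Linux-only Python. This is a conceptual sketch, not PolarFS code (which uses SPDK and RDMA in C); the ring buffer and device objects are hypothetical placeholders.

import os

def io_worker(cpu_id, ring_buffer, device):
    """One I/O thread: pinned to a single CPU and serving one device by polling."""
    os.sched_setaffinity(0, {cpu_id})   # bind the calling thread to cpu_id (Linux only)
    while True:
        request = ring_buffer.poll()    # busy-poll the request ring; no blocking wait
        if request is None:
            continue                    # nothing queued yet, keep polling
        device.submit(request)          # hand the request to the user-space driver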
The I/O execution process in PolarDB is shown in Fig. 5.23. When the routing
component PolarSwitch pulls an I/O request issued by PolarDB from the ring buf-
fer, it immediately sends the request to the buffer zone of the leader node (chunk
server 1) in the storage layer through an RDMA NIC. The buffer zone of chunk
server 1 is registered with the local RDMA NIC. I/O threads in chunk server 1 will
keep pulling requests from the buffer zone. When a new request is found, it is writ-
ten to an NVMe SSD by using SPDK and sent to buffer zones in chunk server 2 and
chunk server 3 over an RDMA network for synchronization.
5.4.2 Near-Storage Computing
The storage scalability unique to OLTP cloud-native relational databases can support a capacity of a hundred terabytes for a single instance. Therefore, efficient OLTP is even more important in scenarios with large amounts of data. However, a cloud-native
database uses a storage-compute separated architecture, and all interactions between
compute and storage nodes are carried out over a network. Therefore, the system
performance can be bottlenecked by the network throughput. In an OLTP cloud-
native relational database based on row storage, a table scan brings unnecessary I/O
reads of rows and columns, and a table access by index primary key generates
unnecessary I/O reads of columns. These additional data reads further exacerbate
the network bandwidth bottleneck.
The only feasible solution to this problem is to reduce network traffic between
compute and storage nodes by pushing down some data-intensive access tasks,
such as table scans, to the storage nodes. This requires the storage nodes to have
higher data processing capabilities to handle the additional table scan tasks. Several
approaches are available. One is to improve the specifications of the storage nodes.
However, this results in extremely high costs, and the current CPU architecture is
unsuitable for scanning tables that store data by row. Another approach is to use
a heterogeneous computing framework in which storage nodes are equipped with
special cost-effective hardware, such as FPGAs and GPUs, to allow them to per-
form table scan tasks. However, as shown in Fig. 5.24, a conventional centralized
heterogeneous computing architecture uses a single standalone FPGA card that
is based on PCIe. As a result, the system performance is bottlenecked by the I/O
and computing bandwidths of a single FPGA. Each storage node contains mul-
tiple SSDs, each of which can achieve a data throughput of several GB/s. During
analytical processing, multiple SSDs simultaneously access the raw data, and the
aggregated data is sent to a single FPGA card for processing. This not only leads
to excessive data traffic on the DRAM/PCIe channel but also results in a data
throughput that far exceeds the I/O bandwidth of a single PCIe card, making the
FPGA card a hotspot during data processing and compromising the overall system
performance.
Therefore, a distributed heterogeneous computing architecture is a better option,
as shown in Fig. 5.25. Multiple storage nodes can be equipped with special hard-
ware so that they can perform table scan tasks. This way, a query request can be
decomposed and sent to the storage nodes for processing. In addition, only the nec-
essary target data is transferred back to the compute nodes. This avoids excessively
large data traffic and prevents a single FPGA card from becoming a hotspot in data
processing.
PolarDB [5] builds an efficient processing architecture with integrated software
and hardware in a cloud-native environment based on the preceding principle. It
takes advantage of emerging near-storage computing SSD devices to push data processing down to the disk on which the data is located, thereby supporting efficient data queries while saving CPU computing resources on the
storage side. This section describes the specific implementation of the efficient pro-
cessing architecture in PolarDB from the software and hardware aspects.
5.4.2.1 FPGA
An FPGA (field-programmable gate array) is a reconfigurable hardware device whose logic circuits can be customized to achieve pipeline parallelism and data parallelism. This way, multiple operators can be processed in parallel in one clock cycle, greatly reducing processing latency.
Therefore, FPGAs can be used as coprocessors of CPUs to free the latter from
data processing tasks.
Computational storage drives (CSDs) are data storage devices that can also perform data processing tasks. The CPU and
CSD of a storage node form a heterogeneous system that can free the CPU from
table scan tasks so that the CPU can handle other requests, thereby improving sys-
tem performance. In PolarDB, a CSD is implemented based on an FPGA, which
implements flash control and computation. A storage node manages its CSD by
using mechanisms such as address mapping, request scheduling, and garbage col-
lection (GC). In addition, the CSD is integrated into the Linux I/O stack so that it
can serve normal I/O requests like traditional storage devices.
Hardware Optimization
FPGA-friendly data block format: A scan task requires a large number of compari-
son operations (e.g., =, ≥, and ≤). A comparator that supports many different
data types is difficult to implement by solely depending on FPGAs. Therefore,
for most data types, the data storage format of the storage engine of PolarDB
must be modified, so that data can be compared directly in memory. This way, a
CSD needs to implement only a comparator that can execute the memcmp()
function without considering the data types in different fields of a table. This
greatly reduces the resource usage of an FPGA.
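A common way to obtain such a format is to encode every field so that plain bytewise comparison matches the logical order, for example, by storing integers big-endian with the sign bit flipped. The sketch below illustrates the idea; it is not PolarDB's actual on-disk encoding.

import struct

def encode_int64(value):
    """Encode a signed 64-bit integer so that memcmp order equals numeric order."""
    # Flipping the sign bit maps the signed range onto the unsigned range, and
    # big-endian byte order makes bytewise comparison match numeric comparison.
    return struct.pack(">Q", (value + (1 << 63)) & ((1 << 64) - 1))

assert encode_int64(-5) < encode_int64(3) < encode_int64(2**40)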
The storage engine of PolarDB is designed based on the LSM-tree structure. In
this structure, data in each data block is stored in ascending order of key val-
ues. Therefore, prefix compression can be implemented for the key values by
leveraging the characteristic that adjacent key values in an ordered data array
may have the same prefix. For example, assuming that the key value of the first
data record in a data block is “abcde” and the key value of the second data
record is “abcdf,” the common prefix of the key values of these two records is
“abcd.” When the key value of the second data record is stored, only the length
of the common prefix 4 and “f” need to be stored. This method of compressing
key values based on their common prefix is called prefix compression. Prefix
compression greatly reduces the required storage space but hinders search efficiency to some extent. Therefore, a record is left uncompressed every k keys. Such a record is called a restart point, and prefix compression is applied to the records following it. This way, during a record search, the last restart point whose key value is less than the search key can be found through binary search, and the lookup then proceeds forward from that restart point.
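The following Python sketch shows prefix compression with restart points for the example above. It is illustrative only; a real storage engine stores the entries in a compact binary layout rather than Python tuples.

K = 16   # interval between restart points (the k in the text)

def compress_keys(keys):
    """Return (entries, restarts): each entry is (shared_prefix_len, suffix)."""
    entries, restarts, prev = [], [], b""
    for i, key in enumerate(keys):
        if i % K == 0:
            restarts.append(i)   # restart point: stored without compression
            shared = 0
        else:
            shared = 0
            while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
                shared += 1
        entries.append((shared, key[shared:]))
        prev = key
    return entries, restarts

def restore_key(entries, restarts, i):
    """Rebuild key i by scanning forward from its preceding restart point."""
    start = max(r for r in restarts if r <= i)
    key = b""
    for j in range(start, i + 1):
        shared, suffix = entries[j]
        key = key[:shared] + suffix
    return key

keys = [b"abcde", b"abcdf", b"abceg"]
entries, restarts = compress_keys(keys)
assert entries[1] == (4, b"f")                      # "abcdf" stores only (4, "f")
assert restore_key(entries, restarts, 2) == b"abceg"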
To further improve hardware utilization, as shown in Fig. 5.26, the compression
type (Type), number of key-value pairs (# of keys), and number of restart points
(# of restarts) are added to the header of a data block, so that a CSD can decom-
press each data block and perform cyclic redundancy checks (CRC) by itself
without the need for the storage engine to pass the size information of each
block. In addition, the Type and # of keys fields facilitate data search when prefix
compression is implemented. These fields also facilitate easy identification of the
header and trailer of each block, thereby simplifying FPGA-based program
implementation.
FPGA implementation: To reduce costs and improve performance, mid-range FPGA
chips are used for flash memory control and scan task execution. In addition, a
parallel pipeline architecture is used to increase the throughput of scan process-
ing. As shown in Fig. 5.27, each FPGA contains two parallel data decompression
engines and four scan engines. Each scan engine contains a memory comparison
(memcmp) module and a result evaluation (RE) module. Let $P = \sum_{i=1}^{m} \prod_{j=1}^{n_i} c_{i,j}$ denote the entire scan task, where $c_{i,j}$ denotes a query condition for querying a field in a table, and $\sum$ and $\prod$, respectively, represent logical OR and logical AND. The memcmp and RE modules are used to recursively evaluate each condition $c_{i,j}$ in the predicate. Specifically, the memcmp module compares data in memory. When the RE module detects that the final result $P$ (0 or 1) can be determined based on the current output (all conditions $c_{i,j}$ that have been evaluated so far) of the memcmp module, the RE module stops the scan of the current row and proceeds to scan the next row. The query conditions that can be implemented by an FPGA in this architecture are =, !=, >, ≥, <, ≤, NULL, and !NULL.
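The early-termination behavior of the RE module can be mimicked in software as follows. The Python sketch evaluates a predicate in the OR-of-ANDs form of P above and stops as soon as the result is determined; it illustrates the evaluation strategy only and is not the FPGA logic.

def evaluate_row(row, predicate):
    """predicate is a list of AND-groups; each condition is (column, test) where
    test is a callable. Returns True as soon as one AND-group is satisfied."""
    for and_group in predicate:
        satisfied = True
        for column, test in and_group:
            if not test(row[column]):   # one memcmp-style comparison on one field
                satisfied = False       # this AND-group fails: skip its remaining conditions
                break
        if satisfied:
            return True                 # final result determined: stop scanning this row
    return False

# P = (a > 10 AND b = 3) OR (a <= 10)
predicate = [
    [("a", lambda v: v > 10), ("b", lambda v: v == 3)],
    [("a", lambda v: v <= 10)],
]
assert evaluate_row({"a": 5, "b": 0}, predicate)
assert not evaluate_row({"a": 20, "b": 4}, predicate)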
6 Database Cache
As a crucial part of a database, the buffer pool retains some data in the memory to
reduce data exchanges between the memory and external storage, thereby improv-
ing data access performance of the database. This chapter outlines the significance
of buffer pools to databases, depicts the challenges of database cache management
in the cloud era and provides corresponding solutions, and finally shares the practi-
cal application of PolarDB in buffer pool management and dives into the implemen-
tation of RDMA-based shared memory.
In most cases, a database system cannot directly operate on data in a disk. Therefore,
frequently used data needs to be stored in the cache to reduce pauses caused by
reading data from the disk and ensure fast data access.
In a database, the buffer pool of the storage system and the redo log buffer of the
logging system are the two main users of the caching mechanism. The buffer pool
is an internal memory area allocated within the database and is used to store pages
read from the disk, so that the pages can be accessed directly in the memory. This
improves system performance because the access speed of the memory is much
higher than that of the disk. The redo log buffer is used to store redo log entries, which are periodically flushed to the redo log files on disk.
6.1.2 Buffer Pool
The buffer pool caches data and indexes. To efficiently utilize the memory, it is
divided into pages, each of which can accommodate multiple rows [1]. When a
data block is cached from a medium, such as a hard drive, into the buffer pool, the
pointers in the data block can be converted from the hard drive address space to
the buffer pool address space. This scheme is known as pointer swizzling. Due to
the limited size of the buffer pool, data pages in the buffer pool need to be periodi-
cally replaced. Basic page replacement algorithms include CLOCK [2] and
LRU. CLOCK and LRU are similar in concept, and both evict pages that have not
been accessed recently and retain recently accessed pages. LRU-K is an improve-
ment on LRU and can prevent the buffer pool from being polluted by sequential
access [3].
After a page is swapped out, whether the page needs to be written back to the
disk must be considered. If the page has been modified, it needs to be written back
to the disk. If the page has not been modified, it is directly discarded.
Pages pointed to by swizzled pointers may exist in the buffer pool. These pages
are called pinned pages and cannot be safely written back to the disk [4]. To write
out these pages, they need to be unswizzled and then evicted. To be specific,
addresses pointing to the buffer pool are changed to addresses pointing to the disk.
6.2 Cache Recovery
Multiple technical solutions are available for implementing CPU and memory sepa-
ration (e.g., shared memory-based separation, NVM (nonvolatile memory)-based
separation, and RDMA-based separation).
The key to CPU and memory separation is to adapt the restart mechanism of the
database to the memory after the separation. In a traditional database, memory data
is lost after a restart, and the memory needs to be reinitialized. After the memory is
separated from the CPU, it has persistence capabilities. How to adapt to this archi-
tecture and optimize the database based on this architecture is the core problem that
needs to be solved after CPU-memory separation in databases. These solutions vary in implementation difficulty, complexity, and benefits.
6.2.2.2 NVM-Based Separation
NVM is a new type of hardware device with an access speed similar to that of ordinary memory. However, unlike ordinary memory, NVM retains
data after a power failure. For example, Optane DC, an NVM product provided by
Intel, has a read latency of approximately two to three times that of ordinary mem-
ory and a write latency that is approximately the same as that of the latter [5].
The shared memory technology relies on the capabilities of the operating system. If the operating system is restarted when the host is restarted, the shared memory is destroyed. However, NVM can provide persistence capabilities
after the host is restarted. Therefore, a higher level of memory separation can be
achieved by using NVM.
6.2.2.3 RDMA-Based Separation
RDMA (remote direct memory access) [6] is a technology that allows direct access to the memory of a remote host without involving the CPU of that host, as shown in Fig. 6.2. RDMA transmits data directly between application memory and the network by using a network adapter, without the need to copy data between the operating system's data cache and application memory. Common RDMA
implementations include Virtual Interface Architecture (VIA), RDMA over
Converged Ethernet (RoCE), InfiniBand, Omni-Path, and iWARP [7].
Shared memory-based separation and NVM-based separation can only be imple-
mented within a single host. If the host is faulty and cannot be started, the system
needs to perform complete initialization on a new host. The popularity of the RDMA
technology makes it possible to achieve memory and CPU separation across hosts
in a cloud database.
6.3 PolarDB Practices
The buffer pool is a crucial module in the InnoDB engine. All data interactions,
including various CRUD operations, generated by user requests are implemented
based on the buffer pool. Upon startup, the InnoDB engine allocates a contiguous
memory area to the buffer pool. For better management, the memory area is typi-
cally divided into multiple buffer pool instances. All instances are equal in size, and
an algorithm is used to ensure that a page is located only on a specific instance. This partitioning improves the concurrency performance of the buffer pool when multiple buffer pool instances are used.
The InnoDB engine initializes the buffer pool instances in parallel based on the setting of the srv_buf_pool_instances parameter at startup and allocates a contiguous memory area to these instances. This contiguous memory area is divided into multiple chunks, with each chunk sized 128 MB by default. Each buffer pool
instance contains locks to ensure the reliability of concurrent access, a buffer chunk
to store physical storage block arrays, various page lists (such as the free list, LRU
list, and flush list), and mutual exclusion (mutex) locks to ensure mutual exclusion
during access to these page lists. The instances are independent of each other, and
each supports concurrent access from multiple threads.
During the initialization of each buffer pool instance, three lists, namely, the free
list, LRU list, and flush list, and a critical page hash table are also initialized. The
specific functionalities of these data structures are described as follows:
Free List
The free list stores unused idle pages. When the InnoDB engine needs a page, it
retrieves the page from the free list. If the free list is empty (i.e., no idle pages exist),
the InnoDB engine reclaims pages by evicting old pages and flushing dirty pages
from the LRU list and flush list. During initialization, the InnoDB engine adds all
pages in the buffer chunks to the free list.
LRU List
All data pages read from data files are cached in the LRU list and managed by using
the LRU strategy. The LRU list is divided into two parts: a young sublist and an old
sublist. The young sublist stores hot data, whereas the old sublist stores data recently
read from a data file. If the LRU list contains fewer than 512 pages, it is not split into a young sublist and an old sublist. When the InnoDB engine attempts to read a
data page, it looks up the data page in the hash table of the buffer pool instance and
performs subsequent handling according to the actual case.
• When the data page is found in the hash table, namely, in the LRU list, it deter-
mines whether the data page is in the old sublist or the young sublist. If the data
page is in the old sublist, it adds the data page to the head of the young sublist
after reading the data page.
• If the data page is found in the hash table and is located in the young sublist, the position of the data page in the young sublist needs to be checked. Only when the data page is located beyond roughly the first one-fourth of the young sublist is it moved to the head of the young sublist; pages already near the head are not moved.
• If the data page is not in the hash table, it needs to be read from a data file and added to the head of the old sublist.
The LRU list manages data pages by using an ingenious LRU-based eviction
strategy to avoid frequent adjustment of the LRU list, thereby improving access
efficiency.
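The young/old management can be condensed into a short Python sketch. The code below is illustrative only: eviction details, the 512-page threshold, and the sizing of the old sublist (controlled in InnoDB by innodb_old_blocks_pct) are simplified away, and the quarter-of-the-list promotion rule follows the description above.

from collections import deque

class SplitLRU:
    """Young/old LRU list: first access puts a page at the head of the old sublist;
    a second access promotes it to the head of the young sublist."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.young = deque()   # hot pages, most recently used on the left
        self.old = deque()     # pages read recently but not yet proven hot

    def access(self, page_id):
        if page_id in self.young:
            # Move to the head only if the page is beyond the first quarter of the
            # young sublist, to avoid constantly reshuffling already-hot pages.
            position = list(self.young).index(page_id)
            if position > len(self.young) // 4:
                self.young.remove(page_id)
                self.young.appendleft(page_id)
        elif page_id in self.old:
            self.old.remove(page_id)          # second access: promote to the young sublist
            self.young.appendleft(page_id)
        else:
            self.old.appendleft(page_id)      # first access: head of the old sublist
            if len(self.young) + len(self.old) > self.capacity:
                self.old.pop()                # evict from the tail of the old sublist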
Flush List
All modified dirty pages that have not been written to the disk are saved in this list.
Note that all data in the flush list is also in the LRU list, but not all data in the LRU
list is in the flush list. Each data page in the flush list contains the LSN of the earliest
modification to the page, which is equal to the value of the oldest_modification field
in the buff_page_t data structure. An LSN is an integer of the unsigned long long
data type that continuously increases. LSNs are ubiquitously used in the logging
system of the InnoDB engine. For example, an LSN is used when a dirty page is
modified; checkpoints are also recorded by using LSNs. Specific locations of log
entries in redo log files can be determined by using LSNs. A page may be modified
multiple times, but only its earliest modification is recorded. Pages in the flush list
are sorted in descending order of their oldest_modification values, and the page with
the smallest oldest_modification value is saved at the end of the list. When pages
need to be reclaimed from the flush list, the reclamation process starts from the end
of the list and reclaimed pages are put back to the free list. The flush list is cleaned
by using a dedicated back-end page_cleaner thread that writes dirty pages to the
disk for data persistence. The corresponding redo log entries for the dirty pages are
also cleaned to advance the checkpoint.
Page Hash Table
All pages in the buffer pool are stored in the page hash table. When a page is read, the
page can be directly located in the LRU list by using the page hash table without the
need to scan the entire LRU list, thereby greatly improving page access efficiency.
If the data page is not in the hash table, it needs to be read from the disk.
When a user initiates a CRUD operation at the client, the InnoDB engine translates
it into a page access. Queries correspond to read requests, while the insert, delete,
and update operations correspond to write requests. Read and write requests need to
be processed by using the buffer pool. The following section discusses the read and
write access processes in the buffer pool.
Read Process
Step 1: Determine the buffer pool instance in which a page is located based on the
space ID and page number. In the InnoDB engine, each table with unique logical
semantics is mapped to an independent tablespace that has a unique space
ID. Starting from MySQL 8.0, all system tables use InnoDB as the default
engine. Therefore, each system table and undo tablespace have respective unique
space IDs.
Step 2: Read the page from the hash table. If the page is found, jump to Step 5. If it
is not found, proceed to Step 3.
Step 3: Read the corresponding page from the disk.
Step 4: Get a free page from the free list and fill it with the data read from the disk.
Step 5: If the page is already in the buffer pool, adjust its position in the LRU list
based on the LRU strategy. If it is a new page, add it to the old sublist of the
LRU list.
Step 6: Return the page to the user thread.
Step 7: Return the result to the client.
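The read steps above can be condensed into a short Python sketch. It is illustrative only; the instance object, page hash, free list, and LRU helpers are hypothetical placeholders rather than InnoDB functions.

def get_page(buffer_pool_instances, space_id, page_no):
    # Step 1: hash (space_id, page_no) onto one buffer pool instance.
    instance = buffer_pool_instances[hash((space_id, page_no)) % len(buffer_pool_instances)]

    # Step 2: look the page up in the instance's page hash table.
    page = instance.page_hash.get((space_id, page_no))
    if page is None:
        data = instance.read_from_disk(space_id, page_no)   # Step 3: read from disk
        page = instance.free_list.pop()                     # Step 4: take a free page
        page.fill(data)
        instance.page_hash[(space_id, page_no)] = page
        instance.lru.add_to_old_head(page)                  # Step 5: new page goes to the old sublist
    else:
        instance.lru.make_young_if_needed(page)             # Step 5: adjust the LRU position

    return page                                             # Step 6: hand the page to the user thread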
Write Process
Step 1: Determine the buffer pool instance in which a page is located based on the
space ID and page number.
Step 2: Read the page from the page hash table. If the page is found, jump to Step
5. If it is not found, proceed to Step 3.
Step 3: Read the corresponding page from the disk.
Step 4: Get a free page from the free list and fill it with the data read from the disk.
Step 5: If the page is already in the buffer pool, adjust its position in the LRU list
based on the LRU strategy. If it is a new page, add it to the old sublist of the
LRU list.
Step 6: Return the page to the user thread.
Step 7: The user thread modifies the page and adjusts the flush list. If the page is already in the flush list, only its newest_modification field needs to be updated. If it is newly dirtied, it is added to the head of the flush list.
Step 8: Return the result to the client.
6.3.1.3 Optimization of PolarDB
PolarDB adopts a one-writer, multireader architecture. The primary node, also
known as the read-write node, is responsible for handling read and write requests,
generating redo log entries, and persisting data pages. The data generated is stored
on the shared storage PolarFS. Replica nodes, also known as read-only nodes, are
only responsible for handling read requests. A read-only node replays the redo log
on the shared storage PolarFS to update the pages in its buffer pool to the latest ver-
sion. This ensures that subsequent read requests can get the latest data in a
timely manner.
Compared with InnoDB, the shared storage architecture supports fast scale-out
to cope with heavier read request load without adding disks, as well as quick addi-
tion and deletion of read-only nodes. HA switchover can also be implemented
between the read-write node and the read-only nodes. This greatly improves instance
availability. Therefore, the shared storage architecture naturally fits the cloud-native
architecture.
In the InnoDB engine architecture, data persistence is achieved by the page
cleaner thread by periodically flushing dirty pages to the disk. This avoids the per-
formance impact caused by synchronously flushing dirty pages by user threads. In
the PolarDB architecture, when the read-write node flushes a data page to the disk,
it must ensure that the newest modification LSN of the page does not exceed the
minimum LSN of the redo log applied to all read-only nodes. Otherwise, when a
read-only node reads the page from the shared storage, data consistency cannot be
guaranteed because the data version of the page has exceeded the data version
obtained by replaying the redo log. To ensure the continuity and consistency of disk
data and prevent users from retrieving data of a later version or data undergoing
SMOs from read-only nodes, the read-write node must consider the LSN of the redo
log applied to read-only nodes when it flushes dirty pages. The system defines the
minimum LSN of the redo log applied to all read-only nodes as the safe LSN. When
the read-write node flushes a page, it must ensure that the newest modification LSN
of the page (newest_modification) is less than the safe LSN. However, in some
cases, the safe LSN may fail to advance normally. As a result, dirty pages on the
read-write node cannot be flushed to the disk in a timely manner, and the oldest
flush LSN (oldest_flush_lsn) cannot advance. To improve the efficiency of physical
replication, a runtime apply mechanism is introduced to read-only nodes. With the
runtime apply mechanism, the redo log is not applied to a page that is not in the
buffer pool. This prevents the redo log application thread on a read-only node from
frequently reading pages from the shared storage. However, read-only nodes need
to cache the parsed redo log entries in the parse buffer. This way, when a read
request is received from a user, the page can be read from the shared storage, and all
redo log entries recording modifications to this page are applied during runtime, so
that the latest version of the page is returned.
The redo log entries cached in the parse buffer of the read-only node can be
cleared only after the oldest_flush_lsn value of the read-write node has advanced. In
other words, the redo log entries can be discarded after the dirty pages
corresponding to the modifications recorded in the redo log entries are flushed to the
disk. With this constraint, if a hotspot page is frequently updated (i.e., newest_modi-
fication is constantly updated) or the read-write node flushes dirty pages at a slow
speed, a large number of parsed redo log entries may accumulate in the parse buffers
of the read-only nodes, slowing down the speed of applying redo log entries and the
advancement of the LSNs of the redo log entries. In addition, dirty page flushing by
the read-write node is further constrained by the safe LSN, which ultimately affects
the write operations of user threads. If the redo log application speed of the read
nodes is too slow, the difference between the redo log application LSN and the new-
est LSN of the read-write node progressively increases, eventually leading to a con-
tinuous increase in the replication latency.
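The flushing constraint can be summarized in a few lines of Python. The sketch below is a simplified illustration of the rule described above; the field names follow the text, and the copy-page and flush-frequency optimizations discussed next are not included.

def compute_safe_lsn(applied_lsns_of_ro_nodes):
    """The safe LSN is the minimum redo-apply progress across all read-only nodes."""
    return min(applied_lsns_of_ro_nodes)

def can_flush(page, safe_lsn):
    """A dirty page may be written to the shared storage only if every read-only node
    has already applied the redo log up to the page's newest modification."""
    return page.newest_modification < safe_lsn

# Example: with RO nodes applied up to LSNs 1800, 2100, and 1950, the safe LSN is 1800.
# A page whose newest modification LSN is 1700 can be flushed; one at 1900 must wait.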
To solve the various problems that arise due to the preceding constraints, the buf-
fer pool of the read-write node in PolarDB has been optimized as follows:
• To enable the read-write node to flush generated dirty pages to the disk in a
timely manner and reduce the redo log entries cached in the parse buffer of a
read-only node, the read-only node synchronizes the LSNs of applied log entries
to the read-write node in real time. If the difference between the write_lsn value of the read-write node and the safe LSN exceeds a specified threshold, the system increases the frequency of flushing dirty pages on the read-write
node and actively advances the oldest_flush_lsn value. In addition, the read-only
nodes can release the redo log entries cached in their parse buffers to reduce the
amount of redo log information that needs to be applied during runtime, thereby
improving the performance of the read-only nodes.
• When a page is frequently updated, the newest modification LSN continuously
increases and is always greater than the safe LSN, which fails to meet the flush-
ing condition. Consequently, the log entries of the read nodes accumulate in the
log cache, leaving no free space to receive new log entries. To solve this problem,
the system introduces the copy page mechanism. When a data page cannot be
written to the disk in a timely manner because it does not meet the flushing con-
ditions, the system generates a copy page for the data page. The information in
the copy page is a snapshot of the data page. The copy page stores fixed data that
will no longer be modified, and the oldest modification LSN of the original data
page is updated to the newest modification LSN of the copy page. The newest
modification LSN of the copy page no longer changes. When this LSN is smaller
than the safe LSN, the data of the copy page can be written to the disk, making
the write node advance oldest_flush_lsn and consequently releasing the log
caches of the read nodes.
• Several data pages, such as pages in the system tablespace and rollback segment
header pages in the rollback segment tablespace, are frequently accessed. To
improve execution efficiency and performance, frequently accessed pages are read
from the memory and will not be swapped out after an instance is started. In other
words, these pages are pinned and retained in the buffer pool, hence called “pinned
pages.” This prevents the log application efficiency of read-only nodes from being
affected by frequent swap-in and swap-out operations; the data pages will not be
swapped out by read-only nodes. When these data pages are needed again, they are
already in memory and do not need to be read from the disk again. This way, when
the read-write node writes out these pages during page flushing, these pages can be
smoothly flushed to the disk without being subject to the constraint that the newest
modification LSN (newest_modification) of a page cannot be greater than the log
application LSN (min_replica_applied_lsn) of read-only nodes. This avoids long
waiting time for user requests, which may be caused if page flushing operations are
triggered to release free pages when the user thread cannot obtain free pages.
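To make the copy page mechanism concrete, the following sketch captures the flushing rule described in the second bullet above: a page (or its frozen copy) may be written to shared storage only when its newest modification LSN does not exceed the safe LSN reported back by the read-only nodes. The type and function names (Page, CopyPage, try_flush) are illustrative assumptions and do not correspond to PolarDB's actual source code.

#include <cstdint>
#include <optional>
#include <vector>

using Lsn = uint64_t;

struct Page {
    uint64_t page_id;
    Lsn oldest_modification;   // LSN of the earliest change not yet flushed
    Lsn newest_modification;   // LSN of the latest change to this page
    std::vector<char> data;    // page image
};

// Snapshot of a hot page whose newest_modification keeps advancing.
struct CopyPage {
    uint64_t page_id;
    Lsn snapshot_lsn;          // newest_modification at copy time; never changes
    std::vector<char> data;    // frozen page image
};

// A page (or its copy) may be flushed only if its newest modification has
// already been applied on every read-only node, i.e., it is <= safe LSN.
bool can_flush(Lsn newest_modification, Lsn safe_lsn) {
    return newest_modification <= safe_lsn;
}

// Called on the read-write node's flushing path for a dirty page.
std::optional<CopyPage> try_flush(Page& page, Lsn safe_lsn) {
    if (can_flush(page.newest_modification, safe_lsn)) {
        return std::nullopt;   // normal case: flush the page itself
    }
    // Hot page: freeze a snapshot whose LSN will no longer move.
    CopyPage copy{page.page_id, page.newest_modification, page.data};
    // Per the description above, the original page's oldest modification LSN
    // is advanced to the newest modification LSN of the copy page.
    page.oldest_modification = copy.snapshot_lsn;
    return copy;               // flushed later, once snapshot_lsn <= safe LSN
}

Once the copy's snapshot LSN falls below the safe LSN, writing the copy out allows the read-write node to advance oldest_flush_lsn and the read-only nodes to release the corresponding log cache.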
6.3.2.3 Optimization of PolarDB
In the current PolarDB architecture, DDL consistency also faces new challenges. In
InnoDB, a DDL operation needs to handle only the status of the target object. To be
specific, before a DDL operation is performed, an MDL on the corresponding table
must be obtained, and the cache must be cleared. However, in the shared storage
architecture, the system also needs to synchronize the MDL to the read-only nodes, so
that requests on the read-only node will not access table data that is being modified by
a DDL operation. After the MDL on a read-only node is acquired, the redo log entries
before the current MDL are applied, and the data dictionary cache of the read-only
node is cleared. During the execution of a DDL operation, the read-write node per-
forms file operations, but the read-only nodes only need to update their respective file
system caches because the read-write node has already completed the file operations
in the shared storage. After the DDL operation is executed, the read-write node records
a redo log entry about MDL release. When a read-only node obtains this log entry
through parsing, it releases the MDL on the table. At this time, the table information
in memory is updated, and the table can provide normal access services.
6.3.3.1 Principles
In the traditional architecture, the buffer pool resides in the local memory of each standalone PolarDB instance, and the CPU is private to each PolarDB instance and directly accesses the buffer pool by using a memory bus.
As shown in Fig. 6.4, in the RDMA-based shared memory pool storage architec-
ture, a compute node contains only a small-sized local buffer pool (LBP) that serves
as the local cache, and a global buffer pool (GBP) in the remote cache cluster serves
as a remote cache of the compute node. The GBP (global buffer pool) contains all
pages of PolarDB. The compute node and GBP communicate with each other over
a high-speed interconnected network by using RDMA to read or write pages. In
addition, RDMA ensures low latency for remote access operations. The GBP and
compute node are separated. Therefore, multiple PolarDB instances (compute
nodes) can simultaneously connect to and share the GBP, thereby forming the one
writer, multireader architecture and multiwriter architecture.
6.3.3.2 Advantages
The shared memory architecture offers considerable advantages for PolarDB. First,
this architecture efficiently implements the separation of the compute and memory
nodes. This enables on-demand scaling for commercial applications of Alibaba
Cloud, achieving almost continuously available instance elasticity (i.e., scale-up
and scale-down). This architecture can also allocate CPU and memory resources
separately based on specific customer requirements to facilitate separate pay-as-
you-go billing for CPU and memory resources. Second, this architecture can be
shared by multiple PolarDB instances, laying a solid foundation for the one-writer,
multireader architecture and multiwriter architecture and improving the computing
capabilities of PolarDB instances. Third, this architecture frees PolarDB from the
constraints of limited memory of a standalone instance and efficiently utilizes the
large memory capacity of remote nodes. It also decouples the dirty page flushing
logic from the compute nodes, which improves the write performance to some
extent and improves the performance of individual instances. Lastly, this architec-
ture enables fast system recovery because the shared memory pool contains all
memory pages and the buffer pool is always hot after a restart.
6.3.3.3 Implementation
Data Structure
The shared memory consists of multiple GBP instances and the RDMA service
module that is responsible for network communication. Each GBP instance consists
of four parts: (1) an LRU list, which stores the metadata of recently used pages, such
as page IDs, LSNs, and memory addresses; (2) a free list, which stores free shared
memory pages; (3) a hash table, which is used to quickly locate memory pages; and
(4) a data area of the shared memory, which is organized and managed in the form
of pages.
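A minimal sketch of the four parts of a GBP instance is shown below. The type names (PageMeta, GbpInstance) and the fixed page size are assumptions made for illustration; they mirror the description above rather than the actual implementation.

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

constexpr size_t kPageSize = 16 * 1024;   // assumed page size

struct PageMeta {
    uint64_t page_id;
    uint64_t lsn;            // version of the cached page
    uint64_t remote_addr;    // address of the frame in the shared data area
};

struct GbpInstance {
    // (1) LRU list: metadata of recently used pages, most recent at the front.
    std::list<PageMeta> lru_list;

    // (2) Free list: indexes of frames in the data area that are not in use.
    std::list<size_t> free_list;

    // (3) Hash table: page ID -> iterator into the LRU list, for O(1) lookup.
    std::unordered_map<uint64_t, std::list<PageMeta>::iterator> page_table;

    // (4) Data area: the shared memory itself, organized as fixed-size pages.
    std::vector<char> data_area;

    explicit GbpInstance(size_t frame_count)
        : data_area(frame_count * kPageSize) {
        for (size_t i = 0; i < frame_count; ++i) free_list.push_back(i);
    }

    // Look up a page; move it to the front of the LRU list if present.
    PageMeta* lookup(uint64_t page_id) {
        auto it = page_table.find(page_id);
        if (it == page_table.end()) return nullptr;
        lru_list.splice(lru_list.begin(), lru_list, it->second);
        return &*it->second;
    }
};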
Network Communication
The RDMA-based shared memory framework can handle two types of requests. The
first type does not involve the CPU of the memory node. In this type of request, the
remote address of the data page is already known, and read and write operations can
be directly performed on the data page. The CPU of the memory node can be bypassed
by using one-side RDMA primitives. The second type is control requests, such as
memory page registration requests and invalidation requests. Such requests are han-
dled as follows: Network requests are received from compute nodes and handled by
using registered RPC (remote procedure call) functions. To ensure low latency, RPCs
of the second type are also handled based on RDMA. The entire network data flow is
handled by using a unified and efficient RDMA communication framework. The fol-
lowing examples are common network I/O operations on the shared memory:
Registration: Before RDMA-based data transmission is performed, memory regis-
tration needs to be performed. A remote key (r_key) and a local key (l_key) are
generated in each memory registration. The local key is used by the local host
channel adapter (HCA) to access the local memory. The remote key is provided
to the remote HCA to allow remote processes to access the local system memory
during RDMA operations. The compute node sends a registration request to a
shared memory node, and the shared memory node allocates a remote memory
page to the compute node. After receiving the registration request, the shared
memory node selects a memory page based on metadata, such as the page ID,
and returns the corresponding RDMA address and key to the compute node.
Reading: After the registration, the compute node already knows the remote address
of the corresponding page in the shared memory and can directly initiate a read
request by using an RDMA-based remote read operation.
Writing: The compute node first writes to a shared memory node by using an
RDMA-based remote write operation based on the known remote address of a
page. Then, the compute node sends the metadata of the relevant page to the
shared memory node. The shared memory node obtains and reads the metadata
of the page. Multiple writable nodes may exist in the cluster. When a data page
is modified by a node, other nodes need to be instructed to invalidate this data
page in their respective memory based on the address and key in invalid_bit.
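The registration and read steps above can be illustrated with the libibverbs API (the sketch assumes the rdma-core development headers are available). Device, protection domain, and queue pair setup, error handling, and the RPC exchange of the remote address and r_key are omitted; this shows only the general shape of a memory registration followed by a one-sided RDMA read, not PolarDB's actual code.

#include <infiniband/verbs.h>
#include <cstdint>

// Register a local buffer so that the local and remote HCAs may access it.
// The returned memory region carries the lkey/rkey used in later operations.
ibv_mr* register_buffer(ibv_pd* pd, void* buf, size_t len) {
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}

// Issue a one-sided RDMA read: fetch `len` bytes from the remote page
// (remote_addr and rkey were returned by the shared memory node during
// registration) into the local buffer, bypassing the remote CPU.
int post_rdma_read(ibv_qp* qp, ibv_mr* local_mr, void* local_buf,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len) {
    ibv_sge sge{};
    sge.addr   = reinterpret_cast<uint64_t>(local_buf);
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    ibv_send_wr wr{};
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr* bad_wr = nullptr;
    return ibv_post_send(qp, &wr, &bad_wr);   // completion is later polled on the CQ
}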
Crash Recovery
6.3.3.4 Performance Tests
This section describes tests run on PolarDB instances with the scalable specification
of polar.mysql.x4.xlarge, which is configured with 8 CPU cores, 32 GB of memory,
and a 24-GB buffer pool. CPU isolation is implemented by using the taskset
command. The baseline PolarDB instance is configured with a 24-GB buffer pool.
The test PolarDB instances are of the GBP edition with LBP sizes, respectively, set to
1 GB, 3 GB, and 5 GB. To control the variables, the total size of the LBP and the GBP
is set to 24 GB for each instance. For example, if the size of the LBP is 1 GB, the size
of the GBP is set to 23 GB. The PolarDB (LBP) process and the cache cluster (GBP)
process communicate with each other by using a 100 Gbps RDMA connection.
The PolarDB instances are tested against the oltp_read_only, oltp_read_write,
and oltp_write_only scenarios of Sysbench [8]. Each scenario is tested with 8, 16,
32, 64, 128, and 256 concurrent threads. The dataset consists of 250 tables that yield
a 17-GB basic dataset, and each table contains 30 million rows of data.
oltp_read_only
As shown in Fig. 6.5, in the oltp_read_only scenario, the throughputs of the test
PolarDB instances are basically consistent with that of the baseline PolarDB
instance, with a difference of less than 2%.
oltp_read_write
As shown in Fig. 6.6, in the oltp_read_write scenario, the throughput of the test
PolarDB instance whose LBP size is 5 GB is about 10% higher than that of the base-
line PolarDB instance. However, the throughput of the test PolarDB instance whose
LBP size is 1 GB is basically the same as that of the baseline PolarDB instance.
The results indicate that the PolarDB cluster of the GBP edition outperforms the
baseline PolarDB instance. In the PolarDB cluster of the GBP edition, the dirty page
flushing feature is decoupled from the compute nodes. By separating the core fea-
ture of the InnoDB engine, write performance is improved to some extent. Data can
be flushed from the GBP to PolarFS by any available CPU.
oltp_write_only
As shown in Fig. 6.7, in the oltp_write_only scenario, the throughput of the test
PolarDB instance whose LBP size is 5 GB is basically the same as that of the base-
line PolarDB instance. However, the throughputs of the test PolarDB instances
whose LBP sizes are 3 GB and 1 GB are lower than that of the baseline PolarDB
instance; the throughput of the test PolarDB instance whose LBP size is 1 GB is
only 78% that of the baseline PolarDB instance.
References
3. O’Neil EJ, O’Neil PE, Weikum G. The LRU-K page replacement algorithm for database disk
buffering. ACM SIGMOD Rec. 1993;22(2):297–306.
4. Garcia-Molina H, Ullman JD, Widom J. Database systems: the complete book (trans: Yang D,
Wu Y, et al.). 2nd ed. Beijing: China Machine Press; 2010.
5. Yang J, Kim J, Hoseinzadeh M, et al. An empirical guide to the behavior and use of scal-
able persistent memory. In: 18th USENIX Conference on File and Storage Technologies
(FAST’20); 2020. p. 169–82.
6. What is RDMA? 2021. https://community.mellanox.com/s/article/. Accessed 17 Feb 2021.
7. Remote direct memory access. 2021. https://en.wikipedia.org/wiki/Remote_direct_memory_access. Accessed 17 Feb 2021.
8. Kopytov A. Sysbench manual. In: MySQL AB; 2012. p. 2–3.
Chapter 7
Computing Engine
The database computing engine, also known as the database query engine, is respon-
sible for processing database queries. Query processing is one of the most crucial
features of a database and consists of query execution and query optimization. This
chapter discusses the database query processing process, introduces the three mod-
els for database query execution and the main methods for database query optimiza-
tion, and explores the practical application of the PolarDB query engine.
This section briefly describes the query processing process of traditional relational
database management systems and the implementation of parallel query processing.
7.1.1.1 Query Operations
Selection Operations
Selection is the process of finding tuples that satisfy a given condition from a rela-
tion. Typical selection algorithms include full table scan and index (or hash) scan.
Sorting Operations
Data sorting plays an important role in database systems. On the one hand, query
results may need to be sorted in a specific way (e.g., by using the ORDER BY key-
word). On the other hand, sorting can be used in other operations to achieve efficient
implementation even if a sorting method is not specified in the query. For example,
loading sorted tuples in batches into a B+ tree index is more efficient than loading
unsorted tuples.
Data that can fit in memory is sorted by using standard sorting techniques, such
as quicksort. Data that cannot be entirely held in memory can be sorted by using
external merge sort algorithms.
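A compact external merge sort sketch follows. It sorts integers under a fixed in-memory budget by writing sorted runs to temporary files and then performing a k-way merge with a min-heap. The file names and the run size are illustrative assumptions; a real database engine sorts variable-length tuples and manages its own I/O buffers.

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Phase 1: read the input in chunks that fit in memory, sort each chunk
// (e.g., with quicksort via std::sort), and write it out as a sorted run.
std::vector<std::string> make_runs(const std::string& input, size_t run_size) {
    std::ifstream in(input);
    std::vector<std::string> runs;
    std::vector<long long> buf;
    auto flush = [&] {
        if (buf.empty()) return;
        std::sort(buf.begin(), buf.end());
        std::string name = "run_" + std::to_string(runs.size()) + ".txt";
        std::ofstream out(name);
        for (long long x : buf) out << x << '\n';
        runs.push_back(name);
        buf.clear();
    };
    long long v;
    while (in >> v) {
        buf.push_back(v);
        if (buf.size() == run_size) flush();
    }
    flush();
    return runs;
}

// Phase 2: k-way merge of the sorted runs using a min-heap that holds one
// value per run, so only k values need to be in memory at any time.
void merge_runs(const std::vector<std::string>& runs, const std::string& output) {
    using Item = std::pair<long long, size_t>;   // (value, index of its run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::ifstream> in(runs.size());
    for (size_t i = 0; i < runs.size(); ++i) {
        in[i].open(runs[i]);
        long long v;
        if (in[i] >> v) heap.emplace(v, i);
    }
    std::ofstream out(output);
    while (!heap.empty()) {
        auto [v, i] = heap.top();
        heap.pop();
        out << v << '\n';
        long long next;
        if (in[i] >> next) heap.emplace(next, i);
    }
}

int main() {
    auto runs = make_runs("input.txt", 1000000);   // assumed memory budget per run
    merge_runs(runs, "sorted.txt");
}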
Join Operations
Join is the process of selecting tuples whose attributes meet given conditions from
the Cartesian product of two relations. The algorithms used to compute the join of
two relations include nested loop join (NLJ), block nested loop (BNL) join, index
NLJ, merge join, and hash join. A join algorithm is selected depending on the physi-
cal storage form of the relations and whether indexes are present.
Other Operations
7.1.1.2 Expression Evaluation
Two expression evaluation methods are available. One method is to perform one
operation at a time in a specific order and materialize the result of each evaluation
into a temporary relation. The evaluation and materialization costs include the cost
of all operations and the cost of writing intermediate results back to the disk, result-
ing in high disk I/O costs. The other method is to perform multiple operations
simultaneously on a pipeline. In this method, the result of one operation is passed to
the next operation without the need to save the result to a temporary relation, thereby
eliminating the cost of reading and writing temporary relations.
7.1.2.1 Intraoperator Parallelism
Parallel Sorting
A classic scenario for parallel sorting is to sort a relation R stored on n nodes N0, N1,
N2, ⋯, Nn − 1. If R has already been range-partitioned and the partitioning attribute is
the attribute based on which R is to be sorted, partitions on each node can be sorted
in parallel. Then, the sorted partitions can be merged to obtain the complete sorted
relation.
If the relation is partitioned by using other methods, the relation can be range-
partitioned again by using the attribute based on which R is to be sorted. This
ensures that all tuples within the ith range are sent to the same node Ni. Then, parti-
tions on each node can be sorted in parallel, and the sorted partitions can be merged
to obtain the complete sorted relation.
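The sketch below imitates this procedure on a single machine, treating threads as nodes: the relation is range-partitioned on the sort attribute, the partitions are sorted in parallel, and the sorted partitions are simply concatenated (no merge step is needed because the ranges are disjoint and ordered). Partition boundaries are taken naively from equal-width ranges, which is an assumption; real systems pick boundaries from samples or statistics to balance the partitions.

#include <algorithm>
#include <thread>
#include <vector>

// Range-partition `rel` on its (single) sort attribute into n partitions,
// sort each partition in its own thread, then concatenate the results.
std::vector<int> parallel_sort(const std::vector<int>& rel, int n,
                               int min_val, int max_val) {
    std::vector<std::vector<int>> parts(n);
    const long long width = (static_cast<long long>(max_val) - min_val) / n + 1;

    // Partitioning: tuples in the i-th range all go to partition i.
    for (int v : rel) {
        int i = static_cast<int>((static_cast<long long>(v) - min_val) / width);
        parts[i].push_back(v);
    }

    // Sort all partitions in parallel, one thread per partition.
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&parts, i] { std::sort(parts[i].begin(), parts[i].end()); });
    for (auto& t : workers) t.join();

    // Concatenate: every value in partition i is smaller than those in partition i+1.
    std::vector<int> result;
    result.reserve(rel.size());
    for (auto& p : parts)
        result.insert(result.end(), p.begin(), p.end());
    return result;
}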
Parallel Join
The join operation in relational algebra checks whether a pair of tuples satisfies a
join condition. If yes, the tuples are output to the join result. The parallel join algo-
rithm allocates the tuples to be examined to different processors, and each processor
computes the local join results. All processors work in parallel to compute the join,
and the system collects their results to produce the final result.
Taking the most commonly used natural join as an example, suppose that two rela-
tions R and S are to be joined. The parallel join algorithm separately partitions R and
S into n partitions, R0, R1, R2, ⋯, Rn − 1 and S0, S1, S2, ⋯, Sn − 1. The system sends parti-
tions Ri and Si to node Ni to compute the join result. All nodes work in parallel to
compute the join results, which are then merged to obtain the final join result.
Parallelization methods are also available for join methods such as hash join and
NLJ (nested-loop join) in centralized databases.
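Continuing the single-machine analogy, the sketch below hash-partitions both input relations on the join key so that matching tuples land in the same partition pair (Ri, Si), and then joins each pair in its own thread with a simple hash join. The tuple layout and the hash function are illustrative assumptions.

#include <functional>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { int key; int payload; };

// Natural join of R and S on `key`, parallelized across n partitions.
std::vector<std::pair<Tuple, Tuple>> parallel_hash_join(
        const std::vector<Tuple>& R, const std::vector<Tuple>& S, int n) {
    std::vector<std::vector<Tuple>> Rp(n), Sp(n);
    // Partitioning: the same hash function sends matching keys of R and S
    // to the same partition index.
    for (const Tuple& t : R) Rp[std::hash<int>{}(t.key) % n].push_back(t);
    for (const Tuple& t : S) Sp[std::hash<int>{}(t.key) % n].push_back(t);

    std::vector<std::vector<std::pair<Tuple, Tuple>>> local(n);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&, i] {
            // Local hash join of partition pair (Ri, Si): build on Ri, probe with Si.
            std::unordered_multimap<int, Tuple> build;
            for (const Tuple& r : Rp[i]) build.emplace(r.key, r);
            for (const Tuple& s : Sp[i]) {
                auto range = build.equal_range(s.key);
                for (auto it = range.first; it != range.second; ++it)
                    local[i].emplace_back(it->second, s);
            }
        });
    }
    for (auto& t : workers) t.join();

    // Merge the local results to obtain the final join result.
    std::vector<std::pair<Tuple, Tuple>> result;
    for (auto& l : local) result.insert(result.end(), l.begin(), l.end());
    return result;
}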
7.1.2.2 Interoperator Parallelism
This section describes the query execution models of databases. The query execu-
tion model of a database determines how the database executes a given query plan.
In this section, the most commonly used model, the Volcano model, is introduced
along with its advantages and disadvantages. Then, the compiled execution model
and the vectorized execution model, which partially compensate for the shortcom-
ings of the Volcano model, are presented.
7.2.1 Volcano Model
The Volcano model [3] (also known as the iterator model) is the most common
execution model. Databases such as MySQL and PostgreSQL use the Volcano model.
In the Volcano model, each operation in relational algebra is abstracted as an
operator, and an operator tree is built for an SQL query. Each operator in the tree,
such as a join and sorting operator, implements a Next() function. The Next() func-
tion of the parent node in the tree calls the Next() functions of its subnodes, and the
subnodes return results to the parent node. Next() functions are recursively called from the root node down to the leaf nodes, and data flows back up from the leaf nodes to the root in the opposite direction. This processing method of the Volcano model is also known as the pull-based execution model [4].
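A minimal sketch of the iterator (Volcano) model is given below: each operator exposes a Next() that returns one tuple at a time, and a parent pulls from its child. The operator set and tuple type are simplified assumptions; real engines also provide Open()/Close() methods and richer tuple representations.

#include <functional>
#include <iostream>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;

// Every operator implements Next(), which returns one tuple per call,
// or nothing when its input is exhausted.
struct Operator {
    virtual std::optional<Tuple> Next() = 0;
    virtual ~Operator() = default;
};

// Leaf operator: scans an in-memory "table" row by row.
struct Scan : Operator {
    const std::vector<Tuple>& table;
    size_t pos = 0;
    explicit Scan(const std::vector<Tuple>& t) : table(t) {}
    std::optional<Tuple> Next() override {
        if (pos == table.size()) return std::nullopt;
        return table[pos++];
    }
};

// Filter operator: pulls tuples from its child until one satisfies the predicate.
struct Filter : Operator {
    Operator& child;
    std::function<bool(const Tuple&)> pred;
    Filter(Operator& c, std::function<bool(const Tuple&)> p) : child(c), pred(std::move(p)) {}
    std::optional<Tuple> Next() override {
        while (auto t = child.Next())       // one virtual call per input tuple
            if (pred(*t)) return t;
        return std::nullopt;
    }
};

int main() {
    std::vector<Tuple> table = {{1, 10}, {2, 20}, {3, 30}};
    Scan scan(table);
    Filter filter(scan, [](const Tuple& t) { return t[1] >= 20; });
    // The root pulls tuples one by one; data flows from the leaf to the root.
    while (auto t = filter.Next()) std::cout << t->at(0) << '\n';
}

Note that every tuple costs at least one virtual Next() call per operator, which is exactly the overhead that the compiled and vectorized models discussed next try to remove.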
In the Volcano model, an operator can be implemented separately without
considering the implementation logic of other operators. Despite this benefit,
the disadvantages of this model are apparent. For instance, only one tuple is
computed each time, resulting in low utilization of the CPU cache. Moreover,
the recursive calling of Next() functions of subnodes by the parent node causes
a large number of virtual function calls and consequently leads to low CPU
utilization.
In the era when the system performance mainly depended on disk I/Os, the query
processing performance of the Volcano model was considered high. However, with
the development of hardware, data storage becomes increasingly faster, and system
performance is no longer bottlenecked by disk I/Os. Therefore, the research focus
has shifted to improving computation efficiency, and fruitful results have been achieved. In particular, two optimization methods have emerged: the compiled execu-
tion model and the vectorized execution model. Compared with the Volcano model,
the preceding models greatly improve the execution performance of database
queries.
Compiled execution, also known as data-centric code generation, was first proposed
by HyPer [5]. The compiled execution model uses the LLVM (low-level virtual
machine) compiler framework to transform queries into compact and efficient code
that can be quickly executed and is compatible with the modern CPU architecture.
As a result, excellent query performance can be achieved with moderate code com-
pilation, greatly improving the query execution efficiency.
The data-centric compilation approach is attractive for all new databases. Backed
by mainstream compilation frameworks, database management systems automati-
cally benefit from future improvements of compilers and processors without the
need to redesign the query engine. For more details about the compiled execution
model, see Ref. [6].
The vectorized execution model is designed based on the Volcano model. In this
model, each operator implements a Next() function. Unlike in the Volcano model,
the Next() function of each operator in the vectorized model returns a batch of data
rather than a single tuple in the iteration process. Returning data in batches greatly
reduces the number of times the Next() function is called, thereby reducing virtual
function calls. Additionally, the vectorized execution model allows the SIMD (sin-
gle instruction, multiple data) mechanism to be used in each operator to simultane-
ously process multiple rows of data. Furthermore, vectorized execution processes
data in blocks, improving the CPU cache hit rate. Sompolski et al. [7] demonstrate
how the vectorized execution model is suitable for complex analytical query pro-
cessing and can improve the OLAP (online analytical processing) query perfor-
mance by up to 50 times.
The module responsible for query optimization in a database is called a query opti-
mizer. This module aims to find the most efficient query execution plan for a given
query. The goal of query optimization is to minimize the total cost of the query.
Query optimization can be achieved in various ways and categorized into logical
query optimization and physical query optimization according to the level of opti-
mization. Logical query optimization refers to the optimization of relational algebra
expressions, which involves rewriting queries based on equivalence rules. Physical
query optimization involves selecting the access path and underlying operation
algorithm, which can be achieved through rule-based heuristic optimization, cost-
based optimization, or a combination of both [1].
Query optimization plays an important role in databases as it relieves users from
the burden of selecting access paths, eliminating the need for high-level database
expertise and programming skills. Achieving higher query execution efficiency is
now a system task.
The query optimizer generates multiple query execution plans equivalent to the
given query and ultimately selects the plan with the lowest cost.
The general steps for query optimization are as follows:
• Rewrite the given query to an equivalent but more efficient query based on query
rewriting rules.
• Generate different query execution plans based on different underlying operation
algorithms.
• Select the query execution plan with the lowest cost.
Because the join operation is commutative, the join result remains unchanged if the positions of the two tables being joined are exchanged. In a block-based NLJ, for example, the table with fewer tuples can therefore be chosen as the outer table, which reduces the number of times the inner table must be scanned. Owing to the associativity of join operations, when multiple tables are to be joined, some tables may be joined in advance to significantly reduce the size of the intermediate result set without changing the final join result.
If all selection conditions are pushed down to the tables to which they relate and selection is performed before the join operation, the size of the intermediate result set can be greatly reduced. This predicate pushdown rule is generally the most effective logical optimization method.
In summary, the purpose of logical query optimization is to use equivalent trans-
formation rules of relational algebra to transform a given query into an equivalent
but more efficient query.
Physical query optimization is implemented based on the cost model. The following
concepts are involved in physical query optimization:
7.3.3.1 Access Path
The table access method, such as sequential scan, index scan, or parallel scan.
7.3.3.2 Join Algorithm
The algorithm used to join two tables, such as NLJ, hash join, and sort-merge join.
7.3.4.1 Materialized Views
Materialized views are views whose results have been precomputed and stored.
They are generally used to improve the performance of complex queries that are
frequently executed and take a long time to execute [2]. Assuming that a material-
ized view A = B ⊳ ⊲ C is available, B ⊳ ⊲ C ⊳ ⊲ D can be rewritten to A ⊳ ⊲ D,
which lowers the execution cost for the database.
7.3.4.2 Plan Caches
A query optimizer requires several steps to generate an ideal query execution plan.
This consumes a considerable amount of computing resources. After a frequently
used query undergoes the query optimization and query execution processes, the
database caches its execution plan. The next time the same query is executed, the
database directly uses the cached execution plan for the query, thereby improving
the execution efficiency. Section 7.4.2 describes the execution plan management
and plan caching of PolarDB in detail.
This section describes the practical application of the PolarDB query engine, mainly
focusing on three aspects: parallel query technology, execution plan management,
and vectorized execution.
Parallel Execution
One of the biggest problems in MySQL is that the query performance continuously
deteriorates as the data volume grows, and the query execution time may increase
from milliseconds to hours. The reason for such long execution time is that in
MySQL, one query can be executed only in one thread even as the data volume
continuously increases. Hence, even if the system resources are sufficient, the mul-
ticore capabilities of modern CPUs cannot be utilized.
As shown in Fig. 7.1, when a query is executed on a node with 64 cores, only one
thread is involved in the execution of the query. The other 63 threads remain in an
idle state.
Against this backdrop, PolarDB for MySQL 8.0 is launched with a powerful
parallel query framework. Once the amount of query data reaches a specified
threshold, the parallel execution framework will be automatically enabled. Then,
the data in the storage layer will be divided into multiple partitions, which are
allocated to different threads. The threads compute the results in parallel. The
results are pipelined to the leader thread for aggregation and then returned to
the user.
The PolarDB query optimizer determines whether to generate a serial execution
plan or a parallel execution plan based on the execution cost of the statement. Take as an example a query that computes the sum of a column of Table T1 under a filter condition: multiple worker threads scan and filter Table T1, each calculates a partial sum, and the partial sums are returned to the leader thread. The leader thread then adds them up and returns the final result to the client.
As shown in Fig. 7.2, the query is executed in two stages (from right to left). In
the first stage, the 64 threads participate in table scanning and computation. In the
second stage, the computation results of the threads are summed up. It can be seen
that parallel computing can fully mobilize the computing capabilities, with each
thread handling less than 2% of the total workload. This significantly reduces the
end-to-end time of the entire query.
In general, parallel execution in PolarDB has the following advantages:
• Zero need for business adaptation, SQL modification, data migration, or data
partitioning changes.
• 100% compatibility with MySQL.
• Significant improvement in query performance, enabling users to realize the per-
formance improvement brought by enhanced computing power.
Architecture Design
Table 7.1 lists several terms related to parallel execution.
Figure 7.3 shows the parallel execution architecture of PolarDB, which includes
four execution modules (from top to bottom):
• Leader generates the parallel execution plan, performs computing pushdown,
and aggregates computation results. In Fig. 7.3, the Gather node in the execution
plan is the leader and receives data from various message queues.
• Message queue is responsible for data communication between the leader and
the workers. Each message queue represents a communication relationship
between a worker and the leader. If N workers are present, N message queues
are needed.
• Worker receives execution plans issued by the leader and returns the execution
results to the leader. The execution tasks of the workers are homogeneous and
vary based on the specific data that they scan. For example, in Fig. 7.3, the five
worker threads are indicated by different colors, but they all contain the same
execution plan.
• Parallel scanning is implemented by the InnoDB engine. Data in a table is divided
into multiple partitions, and each worker scans data in one partition. As shown in
Fig. 7.3, partitions indicated by different colors are in a one-to-one correspon-
dence with the upper-layer workers. When a worker completes scanning a parti-
tion, it requests to be bound to an unscanned partition and then continues to scan
the partition.
Before generating a parallel plan, the optimizer checks several conditions: whether the cost of serial execution is greater than the cost threshold that triggers parallel execution, whether the table supports parallel scanning, and whether the number of rows scanned exceeds the specified threshold. The optimizer also estimates the cost of parallel execution and compares it with the cost of serial execution.
The optimizer generates a parallel execution plan only when it determines that paral-
lel execution is more efficient than serial execution. As mentioned earlier, the parallel
execution framework of PolarDB has only one leader/Gather thread. The purpose of a
parallel execution plan is to push down the computing of as many operators and expres-
sions as possible to worker threads for parallel execution. This reduces the cost of data
transmission and enables parallel execution for more computation workloads.
The PolarDB optimizer determines whether to push down computing based on
the following considerations: (1) whether new execution methods are needed
(e.g., an aggregation function may need to be transformed into a two-stage aggre-
gation function) and (2) whether the expression (including its parameters) is
parallel-safe.
Parallel-safe operations do not conflict with parallel queries. Whether an expres-
sion is parallel-safe needs to be determined based on its specific implementation.
For example, in MySQL, the Rand() function is parallel-safe but Rand(10) is not.
In MySQL, a constant can be used as a random seed, which is initialized once and
then constantly computed. As a result, all worker threads return identical data col-
umns after the function is pushed down. However, a Rand() function without con-
stant argument creates a random number seed based on the current thread. Using the
query in Fig. 7.1 as an example, an execution plan generated by the optimizer for the
query can be represented as follows:
The execution logic of the Gather thread is as follows:
In the syntax, gather_table specifies a temporary table created for receiving and
transmitting data, and the sum() function is formulated as a two-stage function in
consideration of the pushdown of aggregation operations.
The execution logic of a worker thread is as follows:
The worker thread calculates the count value of Table t1. After optimization,
worker threads are only aware of the logical tasks and do not know the specific part
of data they will process. The data processed by each worker thread is determined
during the execution phase.
After tasks are allocated to the Gather and worker threads, the next step is to cre-
ate a temporary table for receiving and transmitting data. The worker threads write
data to the temporary table, thereby shielding the underlying operations. The Gather
and worker threads then scan the temporary table for further processing.
After optimization, the execution phase begins. Fig. 7.5 describes the workflows
of the Gather and worker threads. After the Gather thread is initialized, it sets the
actual partitions for parallel scanning, creates the required message queues, starts
the worker threads, and then waits to be awakened to read data. Data in PolarDB is
physically managed in the form of a B+ tree. During partitioning, the Gather thread
only needs to access some of the nodes in the B+ tree index, traverse the tree by
using a breadth-first search algorithm from top to bottom, and perform partitioning
level by level. Taking the tree in Fig. 7.6 as an example, the three levels of the tree
contain 32 data records. If two partitions are required, the Gather thread accesses
the root node and divides the root node into two data partitions. If eight partitions
are required, the Gather thread continues to explore nodes at the next level (e.g.,
nodes at Level 1 in Fig. 7.6) to determine the partitions. To control the additional
overhead brought by partitioning, PolarDB avoids traversing the leaf nodes.
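The level-by-level partitioning described above can be sketched as a breadth-first descent that stops as soon as the current level offers at least the requested number of subtrees; each returned subtree then becomes one scan partition. The node layout below is an assumption for illustration, and unlike a real B+ tree, no key ranges are tracked here.

#include <cstddef>
#include <utility>
#include <vector>

struct BTreeNode {
    std::vector<BTreeNode*> children;   // empty for leaf nodes
};

// Descend level by level from the root until the current level holds at
// least `wanted` nodes or only the parents of leaves remain; each returned
// node is the root of one scan partition. Leaf nodes are never enumerated.
std::vector<BTreeNode*> partition_for_scan(BTreeNode* root, size_t wanted) {
    std::vector<BTreeNode*> level = {root};
    while (level.size() < wanted) {
        std::vector<BTreeNode*> next;
        for (BTreeNode* n : level)
            for (BTreeNode* c : n->children)
                next.push_back(c);
        if (next.empty()) break;                         // root itself is a leaf
        bool next_level_is_leaves = next.front()->children.empty();
        if (next_level_is_leaves) break;                 // do not descend into leaves
        level = std::move(next);
    }
    return level;   // e.g., 2 subtrees from the root level, or 8 one level below
}

Grouping the returned subtrees into exactly the requested number of partitions, and handing unscanned partitions to idle workers at run time, is omitted for brevity.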
During the optimization phase, PolarDB determines only the DOP (degree of
parallelism) and does not know how physical data is partitioned. On the one hand,
an excessively small number of partitions may result in low concurrency and
extremely uneven workload allocation. On the other hand, an excessively large number of partitions causes frequent switching of worker threads.
Currently, partitioning in PolarDB is implemented based on the following rule:
The number of partitions is 100 times the DOP. After a worker thread completes
processing one partition, it sends a request to the leader thread for permission to
access the next accessible partition. This process repeats until all data is processed.
After the optimization, the worker threads share an execution plan template. A
worker thread creates related environment variables during initialization, clones the
expression and execution plan, and then executes the plan and outputs the result to
a message queue. At this point, the Gather thread is awakened to process the data.
Expressions in PolarDB do not have complete abstract representations. Therefore, a
proper cloning scheme must be provided for each expression to ensure that the
execution plan of each worker thread is complete and not affected by other threads.
Parallel Operations
Parallel Scanning
In a parallel scan, worker threads independently scan data in a data table in parallel.
The intermediate result set generated by a worker thread through scanning is
returned to the leader thread. The leader thread then collects the generated interme-
diate results through a Gather operation and returns the results to the client.
Parallel Join
The join operation is pushed down to the worker threads, and each worker thread performs its part of the join in the regular manner. Each worker thread returns the result set after the join to the leader
thread. The leader thread then collects the result sets through a Gather operation and
returns the results to the client.
Parallel Sorting
The PolarDB optimizer determines whether to push the ORDER BY operation
down to each worker thread for execution based on the query status. Each worker
thread returns the sorted result to the leader thread. The leader gathers, merges, and
sorts the results and then returns the sorted result to the client.
Parallel Grouping
The PolarDB optimizer determines whether to push the GROUP BY operation
down to worker threads for parallel execution based on the query status. If the target
table can be partitioned based on all the attributes of the GROUP BY clause or on
the first several attributes of the GROUP BY clause, the grouping operation can be
completely pushed down to worker threads for execution. In this case, the HAVING,
ORDER BY, and LIMIT operations can also be pushed down to worker threads for
execution to improve query performance. The leader thread aggregates the gener-
ated intermediate results through a Gather operation and returns the aggregated
result to the client.
Parallel Aggregation
In parallel query execution, the aggregation function is pushed down to worker
threads for parallel execution. Parallel aggregation is completed through two stages.
In the first stage, all worker threads that participate in the parallel query execute an
aggregation step. In the second stage, the Gather or Gather Merge operator collects
the results generated by the worker threads and sends the results to the leader thread.
The leader thread then aggregates the results of all worker threads to obtain the
final result.
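The two-stage aggregation can be made concrete with a small sketch: each worker folds its share of the data into a partial state, and the leader merges the partial states into the final result. The state shown handles SUM, COUNT, and AVG; it illustrates the idea rather than PolarDB's operator implementation (in which, as noted earlier, an aggregate function may be rewritten into a two-stage form during plan generation).

#include <iostream>
#include <thread>
#include <vector>

// Partial aggregation state produced by one worker (stage 1).
struct PartialAgg {
    long long sum = 0;
    long long count = 0;
    void add(int v) { sum += v; ++count; }
    // Stage 2: the leader combines partial states.
    void merge(const PartialAgg& other) { sum += other.sum; count += other.count; }
};

int main() {
    std::vector<int> column(1000000, 3);            // toy data
    const int workers = 4;
    std::vector<PartialAgg> partial(workers);

    // Stage 1: each worker aggregates its own slice of the data in parallel.
    std::vector<std::thread> pool;
    const size_t chunk = column.size() / workers;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            size_t begin = w * chunk;
            size_t end = (w == workers - 1) ? column.size() : begin + chunk;
            for (size_t i = begin; i < end; ++i) partial[w].add(column[i]);
        });
    }
    for (auto& t : pool) t.join();

    // Stage 2: the leader merges the partial results (the Gather step).
    PartialAgg total;
    for (const auto& p : partial) total.merge(p);

    std::cout << "SUM=" << total.sum << " COUNT=" << total.count
              << " AVG=" << static_cast<double>(total.sum) / total.count << '\n';
}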
Parallel Counting
The PolarDB optimizer determines whether to push the counting operation down to
worker threads for parallel execution based on the query status. Each worker thread
finds the corresponding data based on its primary key range and executes the Select
count(*) operation. The Select count(*) operation has been optimized at the engine
layer. Therefore, the engine can quickly traverse the data to obtain the result. Each
worker thread returns the intermediate result of the counting operation to the leader
thread, and the leader thread aggregates all data and performs counting. In addition
to supporting clustered indexes, parallel counting supports parallel search of sec-
ondary indexes.
Parallel Semijoin
Semijoin supports five strategies: materialization-lookup, materialization-scan, first
match, weedout, and loose scan. PolarDB supports parallel processing for all these
five strategies. Two parallelization approaches are available for the materialization-
lookup and materialization-scan strategies. One is to push down the semi-join oper-
ation to worker threads for parallel execution. Each worker thread is responsible for
the semi-join of part of the data and materialized table. The other is to push the
parallel materialization operation down to worker threads in advance. The worker
threads share the materialized table during the semi-join operation. The PolarDB
optimizer selects the optimal parallelization approach based on the query status.
Only one parallelization approach is available for the other three strategies, namely,
to push the semi-join operation down to worker threads. Then, each worker thread
returns the result set of the semi-join operation to the leader thread, and the leader
thread aggregates the results and returns the aggregated result to the client.
PolarDB will continue to upgrade the parallel query feature. However, the parallel
query feature is unavailable for the following cases:
To ensure better system stability and monitor the parallel execution status of the
system, PolarDB further provides a rich variety of resource management features
for parallel execution.
DOP Control
The maximum DOP for each query can be specified by using the max_parallel_
workers parameter:
set max_parallel_workers = n
PolarDB determines a proper DOP that is less than n based on factors such as the
thread count, memory resources, and CPU resources.
Memory Constraints
set query_memory_hard_limit = n;
set query_memory_soft_limit = m;
Users can check the current status of parallel execution by querying system tables:
PolarDB for MySQL 8.0 supports the parallel query feature, which can be enabled
or disabled by using system parameters or hints without modifying the SQL state-
ments (except for hints).
As supplementary SQL syntax, hints play a pivotal role in relational databases.
They allow users to specify the way an SQL statement is executed to optimize the
SQL statement. PolarDB also provides particular hint syntax.
PolarDB specifies the maximum number of parallel execution threads for each SQL
statement by using the global parameter max_parallel_degree. The default value of
this parameter is 0. As shown in Fig. 7.7, this parameter can be modified in the con-
sole any time during system operation without the need to restart the database.
Recommended Settings
We recommend that you gradually increase the value of the max_parallel_degree
parameter. For example, you can set this parameter to 2 at first and then check the
CPU load after the setting runs for 1 day. If the CPU load is not high, you can con-
tinue to increase the value. Otherwise, do not increase the value. The value of this
parameter cannot exceed one-fourth of the number of CPU cores. We recommend
that you enable the parallel query feature only when the system has at least eight
CPU cores and do not enable this feature for small-specification instances.
When max_parallel_degree is set to 0, parallel execution is disabled. When
max_parallel_degree is set to 1, parallel execution is enabled, but the DOP is only 1.
To maintain compatibility with MySQL configuration files, PolarDB exposes this parameter in the console as loose_max_parallel_degree; the loose_ prefix ensures that versions that do not recognize the parameter do not throw errors when reading it.
When you enable the parallel query feature, disable the innodb_adaptive_hash_
index parameter because it affects the performance of parallel queries.
In addition to the cluster-level DOP, you can also configure session-level DOPs
by using related session-level environment variables. For example, you can add the
following command to the JDBC (Java Database Connectivity) connection string of
an application to set a separate DOP for the application:
set max_parallel_degree = n
Hints
Hints allow you to control individual statements. For example, when the parallel
query feature is disabled for the system but a frequently used slow SQL statement
needs to be processed in parallel, you can use a hint to enable parallel execution for
this particular SQL statement. Parallel execution can be enabled by using either of the following syntaxes:
Advanced Hints
The parallel query feature provides two advanced hints: PARALLEL and NO_
PARALLEL. The PARALLEL hint can force a query to execute in parallel and
specify the DOP and the name of the table to be scanned in parallel. The NO_
PARALLEL hint can force a query to execute in series or specify the tables that will
not be scanned in parallel. The syntaxes for the PARALLEL and NO_PARALLEL
hints are as follows:
The following two parameters must be specified when the parallel query feature
is enabled:
• force_parallel_mode: Set this parameter to true to force parallel execution even
if a table contains a small number of records.
• max_parallel_degree: Use the default setting.
The following section provides several examples of parallel execution:
Example 1
SELECT /*+PARALLEL(8)*/ * FROM t1,t2;  -- Forcibly enables parallel execution and sets the DOP to 8.
Set max_parallel_degree to 8.
Set force_parallel_mode to false so that parallel execution is disabled when the
number of records in a table is smaller than the specified threshold.
Example 3
Subqueries are forcibly executed in parallel, with a DOP equal to the default
max_parallel_degree setting.
Example 6
Parallel execution is disabled only for Table t1. When the parallel query feature
of the system is enabled, Table t2 may be scanned in parallel.
Example 11
EXPLAIN SELECT /*+ PQ_PUSHDOWN(@qb1) */ * FROM t2 WHERE t2.a = (SELECT /*+ qb_name(qb1)
*/ a FROM t1);
Example 2
# Use the shared access strategy for the parallel execution of subqueries.
Example 3
# Specify the parallel execution strategy without specifying the query blocks.
The PolarDB optimizer may not choose to execute a query in parallel (e.g., when
the table has less than 20,000 rows). If you expect the optimizer to choose a parallel
execution plan without considering the cost, use the following setting:
set force_parallel_mode = on
The cost-based query optimizer attempts to find the optimal plan for execution,
which is generally characterized by short execution time and low resource con-
sumption. On the one hand, database developers invest continuous efforts to find
better execution plans by using more accurate cost models and cardinality estima-
tion (CE). On the other hand, the overheads brought by the optimization process,
especially for OLTP (online transaction processing) queries, also need to be consid-
ered. For example, MySQL always performs full optimization for the same SQL
statement, regardless of whether the same plan is generated. This approach is also
known as hard parsing in the commercial database Oracle. By contrast, Oracle uses a plan cache so that a plan can be reused and repeated optimization can be avoided. However, caching only one plan for a query is inadequate. The per-
formance of the plan may deteriorate due to changes in parameters, data insertion
and deletion, or changes in the database system status.
One feasible solution is to cache multiple execution plans. When multiple poten-
tial plans are available for different parameters, the optimizer selects the most effec-
tive plan for a particular input. This scheme is called adaptive plan caching. In
adaptive plan caching, whether to generate a new plan and which cached plan is the
most suitable are determined based on the selectivity of the query predicate.
Adaptive plan caching can effectively alleviate the problem of plan performance
degradation caused by different parameters.
However, the degradation of plan performance is related not only to the selectiv-
ity of the query predicate. The join order, table access mode, and materialization
strategy also affect the plan performance. In addition, as system parameters change
and data updates occur, better execution plans may be generated in the system,
which, however, are not cached.
Therefore, databases require a more complete plan evolution management solu-
tion, which is usually called SQL plan management (SPM). SPM is mainly imple-
mented in database upgrades, statistics updates, and optimizer parameter adjustments
to prevent significant performance degradation when the database executes the same
query. SPM ensures the performance baseline by maintaining a collection of plan
baselines for queries. However, due to database upgrades and data changes, better
execution plans may be generated. This necessitates timely evolution of the plan
baselines. An execution plan of a query that has been verified and proven to have bet-
ter performance is added to the collection of plan baselines and becomes an alternative
plan for the query the next time it is executed. Therefore, SPM needs to maintain plan
baselines to prevent performance degradation while actively evolving them to ensure
timely discovery of better execution plans without affecting the system performance.
In SPM, execution plans typically have three states: new, accepted, and verified.
A plan in the new state is newly generated and has not yet been verified; a plan in the verified state has been executed and evaluated; and a plan in the accepted state has been verified, proven to be advantageous, and is usually added to the collection of plan baselines.
Users can also manually set the status of a plan to “accepted.”
SPM includes three jobs:
1. SQL plan baseline capture creates SQL plan baselines for parameterized SQL
queries. These baselines are accepted execution plans for the corresponding SQL
statements, which are the current optimal plans or plans forcibly selected by the
DBA (database administrator). A query can have multiple plan baselines because
the optimal plan varies based on the parameter value of the query.
2. SQL plan selection and routing performs the following:
(a) Ensure that most workloads are routed to accepted plans for execution.
(b) Route a small portion of the workloads to unaccepted plans to verify
these plans.
3. SQL plan evolution evaluates the performance of unaccepted plans. If a plan
significantly improves the query performance, the plan evolves into an accepted
plan and is added to the collection of plan baselines.
Figure 7.8 shows the SPM process. When the system receives a query, it gen-
erates an execution plan and determines whether it is necessary to maintain an
execution plan for the query. If not, the system directly executes the plan. If yes,
the system checks whether the plan exists in the collection of plan baselines. If
the plan exists in the collection of plan baselines, the plan is directly executed.
Otherwise, it is added to the plan history database, and the system selects a plan
with the lowest cost from the plan baselines.
PolarDB provides three plan management strategies: plan caching (which
caches one plan for each query), adaptive plan caching (which caches multiple
plans for each query), and SPM (which caches multiple plans for each query and
evolves the plans online or offline). In PolarDB, these three strategies are dynam-
ically combined to form a complete plan management solution. Users can choose
the most appropriate strategy based on their business needs.
Three plan management strategies are described above: plan caching, adaptive plan caching, and SPM, each with its own focus. Fig. 7.9 shows all modules related to plan management. The following section describes how the three plan management strategies are implemented based on these modules.
Plan Storage
The blocks in yellow in Fig. 7.9 represent the SQL and plan storage modules. The
SQL history database is used to detect duplicate SQL statements. When the system
runs in automatic baseline capture mode, only SQL statements that have appeared
at least twice are collected, and the first plan is marked as the plan baseline. A plan
baseline is the baseline plan stored for an SQL statement and is a subset of the plan
history database. Only plans marked as “accepted” can become baseline plans. The
plan history database is used to store information about historical execution plans.
After an SQL statement is sent to PolarDB, the optimizer searches for its baseline
plans, calculates the cost of each plan, and executes the optimal plan. At the same
time, the optimizer performs a regular optimization process to timely detect whether
better execution plans have been generated.
After optimization, PolarDB captures and saves the execution plan to the plan
history database for future reuse. In addition, to facilitate plan reuse, the system
may need to reproduce the plan based on the representation in the cache and esti-
mate the cost of the restored plan.
Plan Management
The blocks in green in Fig. 7.9 represent the plan management strategies, including
plan caching, adaptive plan caching, and SPM.
In the plan caching strategy, only one plan is cached for each query. If a query
misses the cache, a new plan is generated for the query and added to the cache. If a
query hits the cache, the corresponding plan is directly executed.
Adaptive plan caching includes automatic plan selection, selectivity feedback
collection, and selectivity-based plan selection. In automatic plan selection, whether
a matching cache exists is determined based on the predicate selectivity of the cur-
rent query. If a matching plan exists in the cache, the matching plan is directly
executed. Otherwise, a new execution plan needs to be generated. A “match” occurs
when the difference between the predicate selectivity of the new query and the pred-
icate selectivity of an existing plan in the cache is less than a specified threshold
(which is usually 5%).
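The matching rule can be sketched as a lookup that compares the estimated predicate selectivity of the incoming query with the selectivity recorded for each cached plan. The structures and the 5% threshold below follow the description in the text; everything else (names, the linear scan) is an illustrative simplification.

#include <cmath>
#include <string>
#include <vector>

struct CachedPlan {
    std::string plan_id;
    double selectivity;     // predicate selectivity this plan was optimized for
};

struct PlanCacheEntry {
    std::string query_fingerprint;      // parameterized SQL text
    std::vector<CachedPlan> plans;      // multiple plans cached per query
};

// Return the cached plan whose recorded selectivity is within the threshold
// of the current estimate R, or nullptr if the query "misses" the cache and
// a new plan must be generated by regular cost-based optimization.
const CachedPlan* match_plan(const PlanCacheEntry& entry, double R,
                             double threshold = 0.05) {
    const CachedPlan* best = nullptr;
    double best_diff = threshold;
    for (const CachedPlan& p : entry.plans) {
        double diff = std::fabs(p.selectivity - R);
        if (diff < best_diff) {          // |R - R'| < 5% counts as a hit
            best_diff = diff;
            best = &p;
        }
    }
    return best;
}

On a miss, the plan produced by regular cost-based optimization would be appended to entry.plans together with the selectivity it was optimized for.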
SPM includes SQL plan baseline capture, baseline evolution, and cost-based plan
selection. SPM is similar to the previous two strategies. In SPM, a captured plan is
added to the plan history database and is marked as “new” or “unverified” while
waiting to be executed. Baseline evolution can be implemented online or offline.
Cost-based plan selection selects the optimal plan from multiple baseline plans
based on their costs or performance estimates.
In the online evolution scheme, PolarDB uses an a + 2b strategy: a% of queries are routed to a baseline plan as usual, b% of queries try a new, unverified plan, and another b% of queries execute the corresponding baseline plan. PolarDB compares the results of the latter two groups to better estimate the performance
of the plans. More details of online evolution will be discussed below. Offline evolu-
tion is relatively simple and is triggered when specific conditions are met (e.g., trig-
gered periodically or when the number of data updates reaches a specified threshold).
• If the plan caching strategy is selected, the plan cache is hit and the correspond-
ing plan is directly executed.
• If the adaptive plan caching strategy is selected, whether the query hits the plan
cache is determined. In this strategy, the predicate selectivity R of the current
query is estimated based on statistical information, and the corresponding
predicate ranges of all baseline plans are queried. The query is considered to
have hit the plan cache if a plan P is found and the difference between the
predicate selectivity R’ of the plan and R is less than 5%. If the query hits the
plan cache, a plan is selected based on the predicate selectivities R’ and
R. Then, the predicate coverage range of the baseline plan is properly adjusted
(i.e., split). If the query misses the plan cache, a plan needs to be selected
through conventional cost-based query optimization, which is then added to
the plan baselines.
• With the SPM strategy, the optimizer selects an accepted execution plan, esti-
mates its cost, and then determines whether it is necessary to generate a new
execution plan based on the evolution strategy. If a new execution plan needs to
be generated, the cost-based query optimization process starts, and the new exe-
cution plan is added to the plan history database.
In the execution phase, if the plan caching strategy is selected, PolarDB directly
executes the cached plan that is hit. If the adaptive plan caching strategy is selected,
PolarDB collects predicate selectivity data during execution and sends the data to
the feedback module. If the SPM strategy is selected, PolarDB updates the required
feedback based on the evolution strategy.
7.4.2.2 Plan Evolution
Traditional SPM has two issues. First, after cost-based query optimization, a plan
selection module is required, which reevaluates the costs of all accepted plans with-
out considering the feedback during execution. Second, traditional SPM does not
consider the possibility of generating better execution plans under the current
workloads.
As shown in Fig. 7.11, the optimal execution plan for a parameterized query var-
ies based on the arguments passed in. When C1 > 5, the optimal execution plan is a
full table scan. When C1 > 50, the optimal plan is an index range scan. Therefore,
SPM needs to consider not only the optimal plan for a particular query but also the
proportion of optimal plans in the cached plans under the current workloads, to
optimize the overall execution time of the workloads.
In view of this, PolarDB proposes an online evolution algorithm that uses the
SQL plan routing module and the execution feedback mechanism to ensure that
most workloads are routed to accepted plans while a small portion of the workloads
are routed to unaccepted plans through online evolution to verify these plans. From
the perspective of reinforcement learning, an agent that interacts with the database
management system can be designed. The agent takes an action by following the
policy and observes the reward. It then improves the policy based on the reward,
thereby maximizing the cumulative reward.
In the routing design of the existing online SPM strategy, only two routing
options are available: executing an accepted plan and performing regular cost-based
query optimization. This strategy is advantageous as it avoids the costs of blindly
trying unaccepted plans. However, it cannot find the optimal solution for the current
workloads. Hence, considering only these two actions is not enough. Therefore, the
following terms are defined in the online SPM evolution system of PolarDB:
• Action: the options available for plan routing in SPM, including selecting an
accepted plan, selecting an unaccepted plan, and using a regular optimizer.
• State: the passed in parameterized query.
• Reward: the execution time.
• Goal: to find a policy that can minimize the overall execution time.
After an SQL statement enters the system, the query parser parameterizes the
statement and passes it to the SPM router. The router retrieves the Q-value for each possible action from the current policy. With probability ε (where 0 < ε < 1), the router selects an unaccepted plan to verify its performance; with probability 1 − ε, it selects a baseline plan to maintain stability.
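The routing decision and the feedback step can be sketched as a tiny ε-greedy agent. The actions, the Q-values keyed by the parameterized query, and the use of negative latency as the reward follow the description in this section; the update rule (an incremental average with a fixed learning rate) and all names are assumptions for illustration, not the actual PolarDB implementation.

#include <random>
#include <string>
#include <unordered_map>

enum class Action { kAcceptedPlan, kUnacceptedPlan, kRegularOptimizer };

class SpmRouter {
public:
    explicit SpmRouter(double epsilon) : epsilon_(epsilon), gen_(std::random_device{}()) {}

    // State = the parameterized query. With probability epsilon, explore an
    // unaccepted plan; otherwise exploit the action with the highest Q-value
    // (the baseline plan by default).
    Action route(const std::string& query) {
        if (std::uniform_real_distribution<double>(0.0, 1.0)(gen_) < epsilon_)
            return Action::kUnacceptedPlan;
        auto& q = q_values_[query];
        Action best = Action::kAcceptedPlan;
        if (q[Action::kUnacceptedPlan] > q[best]) best = Action::kUnacceptedPlan;
        if (q[Action::kRegularOptimizer] > q[best]) best = Action::kRegularOptimizer;
        return best;
    }

    // Execution feedback: reward = -latency, so minimizing latency maximizes reward.
    void feedback(const std::string& query, Action a, double latency_ms) {
        double reward = -latency_ms;
        double& q = q_values_[query][a];
        q += kLearningRate * (reward - q);   // incremental update toward the reward
    }

private:
    static constexpr double kLearningRate = 0.1;
    double epsilon_;
    std::mt19937 gen_;
    std::unordered_map<std::string, std::unordered_map<Action, double>> q_values_;
};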
The optimizer generates a physical query plan and passes it to the execution
engine. After the plan is executed, the database management system returns the
query result to the client and triggers the evolution logic of SPM. The execution
plan and its latency (other dimensions, such as the CPU and memory overhead and
number of rows scanned, will be supported in the future) are added to the experi-
ence of the agent as the execution feedback and serve as the starting point for
Q-value iteration. During evolution in the context of SPM, the policy is improved by
using the latest experience. When sufficient statistics indicate that an unaccepted
plan is clearly better than the corresponding baseline plan, the baseline is updated.
Each query sent by the client undergoes parameterization and is routed to an
action. Then, the execution results are collected and used as experience data. The
above process is executed multiple times for a query, forming a feedback and cor-
rection loop. In the exploration phase of evolution in the context of SPM, if a plan
has a higher latency than expected, the agent learns how to assign a lower weight to
the plan to reduce its chance of being selected.
The online SPM evolution technology integrates a lightweight reinforcement
learning framework, which improves router decisions and facilitates evolution toward
the optimal plan based on the execution feedback obtained. In the long run, selecting the execution plan that is expected to yield the greatest benefit alleviates the problem of a suboptimal execution plan becoming a baseline plan. Test results for balanced, skewed,
and variable workloads show that compared with traditional plan management frame-
works, the online execution plan evolution technology of PolarDB can correctly
obtain the optimal plan and consequently adapt efficiently to various workloads.
7.4.3.1 Vectorized Execution
Traditional execution engines that use the tuple-at-a-time (TAT) approach cannot
fully utilize the features of modern processors, such as SIMD operations, data
prefetching, and branch prediction. Vectorized execution and compiled execution
are two commonly used acceleration solutions for database execution engines. This
section describes vectorized execution in PolarDB.
Vectorized execution can reuse the pull-based Volcano model, except that the
Next() function of each operator is replaced with a corresponding NextBatch()
function, which returns a batch of data (such as 1024 rows)
each time instead of one row of data. Vectorized execution has the following
advantages:
• The number of virtual function calls, especially those for expression evaluation,
is lower than that in the Volcano model.
• Data is processed by using a batch or chunk as the basic unit, and the data to be
processed is continuously stored, greatly improving the hit rate of the modern
CPU cache.
• Multiple rows of data (usually 1024 rows) are processed at the same time by
operators, fully leveraging the SIMD technology.
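As a rough illustration of the NextBatch()-style interface and these advantages, consider the following sketch. It is not PolarDB code; the 1024-row batch size and the toy operators are assumptions, and real engines operate on columnar memory layouts rather than Python lists.

```python
BATCH_SIZE = 1024

class SeqScan:
    """Leaf operator: returns rows in batches instead of one row at a time."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def next_batch(self):
        if self.pos >= len(self.rows):
            return None
        batch = self.rows[self.pos:self.pos + BATCH_SIZE]
        self.pos += BATCH_SIZE
        return batch

class Filter:
    """Filter operator: evaluates the predicate over a whole batch with a tight loop,
    which compilers can vectorize far more easily than per-row virtual calls."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next_batch(self):
        while True:
            batch = self.child.next_batch()
            if batch is None:
                return None
            selected = [row for row in batch if self.predicate(row)]
            if selected:
                return selected

# Pull-based execution: the root repeatedly calls next_batch() on its child.
scan = SeqScan(list(range(10000)))
plan = Filter(scan, lambda v: v % 7 == 0)
total = 0
while (b := plan.next_batch()) is not None:
    total += len(b)
print(total)
```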
The PolarDB optimizer determines whether to use vectorized execution based on
the estimated cost and characteristics of the operator. However, some operators,
such as sort operators and hash operators, cannot benefit from vectorized execution.
Therefore, PolarDB supports hybrid execution plans. A hybrid execution plan
allows vectorized and nonvectorized operators to coexist.
Figure 7.12 shows the vectorized execution framework implemented in the PolarDB
architecture. The underlying layer of PolarDB is a row-oriented storage layer that
provides a batch read interface to the upper layer. The batch read interface can
return multiple rows of data at the same time. When the execution layer receives the
data, it will convert the data into a columnar layout in memory, which facilitates
access from the upper layer.
A vector represents multiple data records (which are usually contiguous) from
the same column. In the execution state, each vector is bound to a fixed position
in the columnar layout in memory based on the column information and partici-
pates in subsequent calculations without additional data materialization opera-
tions. After the data of all vectors is processed, PolarDB will call the batch read
interface again to fetch data. With vectors as the basic operation unit, the system
needs to support vectorized expressions and operators. Additionally, a vectorized
expression must support processing multiple data records at the same time. To
ensure compatibility and reduce dependency on hardware, PolarDB enhances the
existing expression framework and introduces the for loop to facilitate accelera-
tion, leaving more compilation and optimization work to the compiler. PolarDB
now supports vectorized table scans, vectorized filtering operations, and vector-
ized hash joins. The following two operations in a vectorized hash join are opti-
mized: key extraction and insertion in the build phase and key search in the
probe phase.
Chapter 8
Integration of Cloud-Native
and Distributed Architectures
Figure 8.1 shows the two typical architectures available for distributed databases,
namely, integrated architecture and compute-storage-separated architecture.
8.1.1.1 Integrated Architecture
In the integrated architecture, each node serves as a compute node and a storage
node, and data is distributed across multiple nodes. Each node in the cluster can
provide services externally. If the data accessed by the client is not on the current
node, the current node communicates with other nodes to request the corresponding
data. Distributed databases that adopt the integrated architecture include
Postgres-XL, OceanBase, and CockroachDB.
8.1.1.2 Compute-Storage-Separated Architecture
In the compute-storage-separated architecture, compute nodes and storage nodes are
deployed as separate roles. Storage nodes frequently initiate system calls to read and
write the disk. Therefore, storage nodes
can be developed by using a system programming language, such as C or C++.
These nodes also have high requirements for disk I/O performance during
deployment.
For local queries that do not involve remote data access, the integrated architecture
delivers better performance than the compute-storage-separated architecture. To
ensure that user queries hit local data whenever possible, a lightweight
partition-awareness feature is typically introduced into the load balancer or proxy
node at the forefront of the cluster, which routes each query to the node on which
its data is located as far as possible, thereby reducing unnecessary RPC
overheads.
8.1.2 Data Partitioning
8.1.2.1 Hash Partitioning
In hash partitioning, hash values are calculated based on the partition key, and the
partition in which a row of data is located is calculated by using the mod(hash value,
N) operation, where N represents the total number of partitions. For example,
assuming N = 4, the partitions can be defined as follows:
• HASH(partition_key) mod 4 = 0 → Partition 0
• HASH(partition_key) mod 4 = 1 → Partition 1
• HASH(partition_key) mod 4 = 2 → Partition 2
• HASH(partition_key) mod 4 = 3 → Partition 3
Consistent hashing [1] is usually performed in hash partitioning to calculate hash
values. When a data node needs to be added or removed, the change in the total
number N of data nodes will cause extensive data redistribution if a regular hash
function is used; consistent hashing can ensure minimal data movement. In consis-
tent hashing, data nodes are mapped through hashing onto a large ringed space
called a hash ring. Then, the data to be stored is hashed based on the partition key
and mapped to a position on the hash ring. The data is stored on the first data node
that comes after the mapped-to position in the clockwise direction along the hash
ring. As shown in Fig. 8.3, data mapped to a position indicated by a specific color is
stored on the node with the same color. For example, data mapped to a position in
blue is stored on the node in blue. This design ensures that when a node is removed
or added, only data mapped to a position between the changed node and its next
node in the clockwise direction along the hash ring needs to be migrated.
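A minimal consistent hashing ring can be sketched as follows. This is an illustrative Python example, not the implementation of any particular database; real systems typically add virtual nodes per physical node to smooth the data distribution.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map a key to a position on a 2^32 ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def locate(self, partition_key: str) -> str:
        # Walk clockwise: the first node at or after the key's position owns the data.
        pos = ring_hash(partition_key)
        idx = bisect.bisect_left(self.ring, (pos, ""))
        return self.ring[idx % len(self.ring)][1]

    def add_node(self, node):
        bisect.insort(self.ring, (ring_hash(node), node))

ring = ConsistentHashRing(["node-1", "node-2", "node-3"])
before = {k: ring.locate(k) for k in ("user:1", "user:2", "user:3", "user:4")}
ring.add_node("node-4")
after = {k: ring.locate(k) for k in before}
# Only keys that fall between node-4 and its clockwise successor move; the rest stay put.
moved = [k for k in before if before[k] != after[k]]
print(moved)
```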
8.1.2.2 Range Partitioning
Range partitioning maps data to several partitions based on the range to which the
value of the partition key belongs. For example, if the partition key is of the integer
data type, the partitions can be defined as follows:
• partition_key <= 10,000 → Partition 0
• 10,000 < partition_key <= 20,000 → Partition 1
• 20,000 < partition_key <= 30,000 → Partition 2
• 30,000 < partition_key → Partition 3.
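The range boundaries above translate directly into a lookup that routes a key to its partition. The following is an illustrative sketch only; the boundary values are the ones used in the text.

```python
import bisect

# Upper bounds of partitions 0, 1, and 2; keys above 30,000 fall into partition 3.
RANGE_BOUNDS = [10_000, 20_000, 30_000]

def range_partition(partition_key: int) -> int:
    # bisect_left counts the bounds strictly below the key, which matches the
    # "<= upper bound" convention used in the partition definitions above.
    return bisect.bisect_left(RANGE_BOUNDS, partition_key)

assert range_partition(10_000) == 0
assert range_partition(10_001) == 1
assert range_partition(45_000) == 3
```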
Hash partitioning and range partitioning support dynamic changes in the number
N of nodes to scale out or scale in the storage layer. Hash partitioning can leverage
the consistent hashing function to migrate some data from each partition to a new
partition or migrate data from an existing partition to other existing partitions. In
range partitioning, a large partition can be further divided or two adjacent small
partitions can be merged.
Hash partitioning and range partitioning have respective advantages and disad-
vantages, as summarized in Table 8.1.
Several distributed databases are designed to support a specific partitioning
scheme. For example, YugaByte supports hash partitioning, whereas TiDB supports
range partitioning. Other databases, such as PolarDB-X and OceanBase, support
multiple partitioning schemes and allow users to choose the most appropriate parti-
tioning scheme as needed.
8.1.3 Distributed Transactions
8.1.3.1 XA Protocol
XA is a 2PC protocol that defines two main roles: a resource manager (RM) and a
transaction manager (TM). A resource manager is usually a physical node of the
database, whereas a transaction manager is also known as a transaction coordinator.
The XA protocol also specifies the interaction interfaces between a transaction
manager and a resource manager, such as XA_START, XA_END, XA_PREPARE,
XA_COMMIT, and XA_ROLLBACK.
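To make the division of roles concrete, the following sketch shows a toy transaction manager driving two MySQL-compatible resource managers through the XA interfaces. It is only an illustration: the PyMySQL driver, the connection objects, the account table, and the simplified error handling are assumptions, and a production coordinator must also persist its commit decision to survive crashes.

```python
import pymysql  # assumption: any MySQL-compatible driver that accepts raw SQL works

XID = "'txn-42'"  # a globally unique transaction branch identifier (hypothetical)

def xa_transfer(conn_a, conn_b):
    """A toy transaction manager: two-phase commit across two resource managers."""
    branches = [(conn_a, "UPDATE account SET balance = balance - 7 WHERE user = 'bob'"),
                (conn_b, "UPDATE account SET balance = balance + 7 WHERE user = 'joe'")]
    try:
        # Phase 1: execute each branch and ask every resource manager to prepare.
        for conn, stmt in branches:
            with conn.cursor() as cur:
                cur.execute(f"XA START {XID}")
                cur.execute(stmt)
                cur.execute(f"XA END {XID}")
                cur.execute(f"XA PREPARE {XID}")
        # Phase 2: all resource managers prepared successfully, so the decision is COMMIT.
        for conn, _ in branches:
            with conn.cursor() as cur:
                cur.execute(f"XA COMMIT {XID}")
    except Exception:
        # Any failure before the decision point rolls the prepared branches back
        # (a real transaction manager would track which branches actually prepared).
        for conn, _ in branches:
            with conn.cursor() as cur:
                cur.execute(f"XA ROLLBACK {XID}")
        raise
```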
8.1.3.2 Percolator
In 2010, Google engineers proposed the Percolator transaction model [2] to solve
the atomicity issue in incremental index construction. Percolator is built based on
BigTable, a distributed wide-column storage system based on the key-value model.
Percolator achieves cross-row transaction processing capabilities at the snapshot
isolation level through row-level transactions and a multiversioning mechanism
without changing the internal implementation of BigTable.
Percolator introduces the timestamp oracle (TSO), a global clock service that
allocates monotonically increasing start and commit timestamps for global
transactions; visibility is judged based on these timestamps. To support 2PC,
Percolator adds two columns, Lock and Write, in addition to the original data. The
Lock and Write columns are respectively used to store lock information and map-
pings between transaction commit timestamps and data timestamps. Percolator
ensures transaction consistency and isolation based on such information. Figure 8.5
shows the implementation of the TSO.
The detailed process of the above example is as follows:
• Initial state: At this point, Bob and Joe, respectively, have $10 and $2 in their
accounts. The Write column indicates that the timestamp of the latest data ver-
sion is 5.
• Prewrite and locking: A transaction that requires transferring $7 from Bob’s
account to Joe’s account is initiated. This transaction involves multiple rows of
data. Percolator randomly selects one main record row and adds a main record
lock on the main record. In this case, a main record lock is written to Bob’s
account at Timestamp 7, and the value of the Data column is 3 (10 − 7). Lock
information that includes a reference to the main record lock is written to Joe’s
account at Timestamp 7, and the value of the Data column is 9 (2 + 7).
• Commit of the main record: A row with a timestamp of 8 is written to the Write
column, indicating that the data at Timestamp 7 is the latest data. Then, the lock
record is deleted from the Lock column to release the lock.
• Commit of other records: The operation logic is the same as that of committing
the main record.
A transaction is considered successful after Percolator commits the main record.
Remedies are still available even if other records fail to be committed. However,
such exceptions are handled only for read operations because Percolator imple-
ments a decentralized two-phase commit and does not have transaction managers
like the XA protocol. The method for handling such exceptions is to search for the
main record based on the lock in the abnormal record. If the main record lock exists,
the transaction is not completed. If the main record lock has been cleared, the record
can be committed and becomes visible.
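The commit protocol in the example can be expressed against a toy key-value store as follows. This is only a conceptual sketch of the Percolator flow described above; the in-memory dictionaries stand in for BigTable's Data, Lock, and Write columns, and conflict checks and failure handling are omitted.

```python
class ToyPercolator:
    def __init__(self, tso):
        self.tso = tso                                  # callable returning increasing timestamps
        self.data, self.lock, self.write = {}, {}, {}   # (key, ts) -> value / lock info / data ts

    def transfer(self, writes):
        """writes: list of (key, new_value); the first key acts as the main (primary) record."""
        start_ts = self.tso()
        main_key = writes[0][0]
        # Prewrite: stage the new values and lock every row; secondaries reference the main lock.
        for key, value in writes:
            self.data[(key, start_ts)] = value
            self.lock[key] = {"main": main_key, "ts": start_ts}
        # Commit point: publish the main record in the Write column and release its lock.
        commit_ts = self.tso()
        self.write[(main_key, commit_ts)] = start_ts
        del self.lock[main_key]
        # Commit the other records; if this step is interrupted, readers can still
        # resolve them by checking whether the main record's lock has been cleared.
        for key, _ in writes[1:]:
            self.write[(key, commit_ts)] = start_ts
            del self.lock[key]

ts_counter = iter(range(1, 1000))
db = ToyPercolator(lambda: next(ts_counter))
db.transfer([("bob", 3), ("joe", 9)])
print(db.write)   # both keys now map a commit timestamp to the data timestamp
```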
The Percolator model is favorable to writes but not to reads. In a write transaction,
the commit decision is first persisted in the main record and then asynchronously
persisted in the other participants, which prevents abnormal waits across multiple
participants. For reads, however, committing the main record first and the other
records asynchronously can result in longer lock-holding times on the participants;
if the main record fails to commit in the commit phase, the other participants
become unusable.
8.1.3.3 Omid
Omid [3] is a transaction processing system developed by Yahoo! for Apache HBase
based on the key-value model. Compared with Percolator's locking method, Omid uses
an optimistic approach, and its architecture is relatively simple and elegant. In
recent years, several papers on Omid have been published at ICDE, FAST, and PVLDB.
Omid believes that although Percolator’s lock-based approach simplifies transac-
tion conflict checking, handing over the transaction processing logic to the client
will result in lingering, uncleared locks that can block other transactions in the case
of client failure. In addition, maintaining the additional Lock and Write columns
incurs significant overheads. In the Omid solution, the central node is solely respon-
sible for determining whether to commit a transaction, greatly strengthening the
capabilities of the central node. Validation is performed based on the write set of the
transaction during the transaction commit to check whether the transaction-related
rows have been modified during the transaction execution period and determine
whether a transaction conflict exists.
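The centralized, optimistic validation can be illustrated with a few lines of Python. This is a conceptual sketch only: the real Omid design involves a high-availability commit table, client-side filtering, and HBase integration that are not shown here.

```python
class ToyOmidValidator:
    """Central commit service: validates a transaction's write set at commit time."""

    def __init__(self):
        self.ts = 0
        self.last_commit = {}   # row key -> commit timestamp of the last writer

    def begin(self):
        self.ts += 1
        return self.ts          # start timestamp defines the transaction's snapshot

    def try_commit(self, start_ts, write_set):
        # Conflict if any row in the write set was committed after this transaction began.
        if any(self.last_commit.get(row, 0) > start_ts for row in write_set):
            return None         # abort: write-write conflict detected
        self.ts += 1
        for row in write_set:
            self.last_commit[row] = self.ts
        return self.ts          # commit timestamp

validator = ToyOmidValidator()
t1, t2 = validator.begin(), validator.begin()
print(validator.try_commit(t1, {"row-a"}))   # commits
print(validator.try_commit(t2, {"row-a"}))   # None: row-a changed after t2 started
```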
8.1.3.4 Calvin
The core idea of the Calvin model [4] is that a globally ordered transaction log exists
and multiple partitions in a distributed system process data in their local shards in
strict accordance with the global transaction log, thereby ensuring the consistency of
the processing results of all shards.
The Calvin model requires all transactions to be “one-shot” transactions, in
which the entire transaction logic of a one-shot transaction is executed at a time by
calling a stored procedure. However, common transactions are interactive transac-
tions in which the client executes several statements in succession before finally
executing the COMMIT instruction to commit the transaction. Therefore, the Calvin
model is applicable only to specific fields. The commercial database VoltDB bor-
rows Calvin’s deterministic database concept. VoltDB is an inmemory database that
is designed for high-throughput and low-latency scenarios and widely used in the
IoT and financial fields.
The foregoing typical transaction models have respective advantages and disad-
vantages, as presented in Table 8.2.
8.1.4 MPP
In the early days, relational databases were limited by the I/O capabilities of
computers: computation took up only a small portion of the total processing time of a
query, so the optimization of executors had little impact on the overall perfor-
mance. With the rapid development of hardware and the increasing maturity of dis-
tributed technologies, the acceleration and optimization of executors involving large
amounts of data have become increasingly important.
With the emergence of multiprocessor hardware, executors gradually evolved
toward the symmetric multiprocessing (SMP) architecture for standalone parallel
computing to fully utilize the multicore capability to accelerate computation.
However, an executor of the SMP architecture has poor scalability and can utilize
the resources of only one SMP server during computation. As the amount of data to
be processed increases, the disadvantage of poor scalability becomes more apparent.
In MPP, multiple nodes in a distributed database cluster are interconnected to each
other over a network and collaboratively compute the query results. Figure 8.6 shows
the principle of MPP. Compared to SMP, MPP can utilize the computing power of
multiple nodes to accelerate complex analytical queries and overcome the limitations
of hardware resources (such as CPU and memory) of a single physical node.
When a query is executed in MPP mode, the SQL execution plan is distributed to
multiple nodes. Multiple instances are allocated to each operator to handle a portion
of the data. For example, a join operation needs to partition data by using the join
key as the partition key. Before the join operator is executed, the exchange operator
needs to shuffle data on two sides of the join. After that, the data in each partition
can be joined separately. The implementation of MPP is shown in Fig. 8.7.
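The shuffle-then-join pattern can be sketched as follows. This Python example only illustrates the data movement; it is not how PolarDB-X or any MPP engine is actually implemented, and the number of workers is an arbitrary assumption.

```python
from collections import defaultdict

NUM_WORKERS = 4

def shuffle(rows, key_index):
    """Exchange operator: repartition rows by hash of the join key so that
    matching keys from both sides end up on the same worker."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key_index]) % NUM_WORKERS].append(row)
    return partitions

def mpp_hash_join(left, right):
    left_parts, right_parts = shuffle(left, 0), shuffle(right, 0)
    results = []
    for worker in range(NUM_WORKERS):          # each worker joins only its own partition
        build = defaultdict(list)
        for row in left_parts[worker]:
            build[row[0]].append(row)
        for row in right_parts[worker]:
            for match in build.get(row[0], []):
                results.append(match + row[1:])
    return results

users = [(1, "alice"), (2, "bob")]
orders = [(1, "order-100"), (2, "order-200"), (1, "order-101")]
print(sorted(mpp_hash_join(users, orders)))
```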
Representative databases that use the shared storage architecture include Amazon
Aurora and Alibaba Cloud PolarDB. Aurora for MySQL transforms the write path
of MySQL by replacing the original stand-alone storage based on local disks with a
multi-replica and scalable distributed storage, thereby improving system availabil-
ity and scalability while enhancing performance. This achieves complete compati-
bility with open-source databases by simply transforming the storage module in an
existing database.
For the storage layer, the shared storage architecture usually adopts a multirep-
lica mechanism to enhance high availability. Replica consistency protocols, such as
Quorum and Paxos, have been applied in various cloud-native database systems.
The shared storage architecture provides a centralized data access interface for
upper-layer compute nodes. This way, the compute nodes do not need to be con-
cerned with the actual distribution of data in storage or with the load balancing of
data distribution.
By using the shared storage architecture, cloud service providers can pool disk
resources and allow multiple users to share a distributed storage cluster and use
resources in a pay-as-you-go fashion. Taking Aurora for MySQL as an example, the
storage cost is $0.1 per GB per month, and users do not need to preplan capacity
when creating instances and only need to pay for the actual capacity used.
As shown in Fig. 8.8a, compute nodes are categorized into the RW (primary)
node and RO nodes. This categorization is necessary because although the storage
layer has been transformed into a distributed architecture, the computing layer
(including the transaction management and query processing modules) retains the
standalone structure, and the concurrent transaction processing capability (write
throughput) is limited by the performance of a single node. The shared storage
architecture enables elastic scalability for the computing and storage layers.
However, in this architecture, only read-only nodes can be added to share the read
workloads. The write performance is still bottlenecked by the processing capacity
of a single node. As a result, the shared storage architecture can still encounter a
serious write performance bottleneck. Although the shared storage architecture is
favored by the industry for its vertical scalability, it is not always suitable for
practical engineering implementations because the storage capacity of the entire
cluster is usually limited to dozens or hundreds of terabytes.
8.2.2 Shared-Nothing Architecture
With the rise of NewSQL [5] databases in recent years, the shared-nothing architec-
ture has attracted increasing attention. In the shared-nothing architecture, each node
is an independent process that does not share resources with other nodes and com-
municates and exchanges data with other nodes through network RPCs. This sec-
tion describes distributed databases that use the shared-nothing architecture.
Cloud Spanner, launched by Google, is a typical representative of distributed
cloud databases and features excellent scale-out and high-availability capabilities.
Compared with the shared storage architecture, the shared-nothing architecture is
more advantageous in terms of the scalability of the computing layer. In the shared-
nothing storage architecture, each node is an independent process that does not
share resources. In addition, the computing layer and the storage layer can be hori-
zontally scaled by simply adding more nodes. For a stateless computing layer, new
nodes can be started in seconds by using the container technology.
The shared-nothing storage architecture divides data into shards, enabling hori-
zontal scaling of compute nodes. However, the storage layer of the shared-nothing
architecture is disadvantageous compared with that of the shared storage architec-
ture in two aspects.
The first is high costs. In addition to the migration costs of replicating data, users
need to consider the storage costs. By default, the high-efficiency cloud disks
mounted to cloud hosts in the shared-nothing architecture implement three-replica
high availability. Combining the three-replica high-availability implementation and
the traditional database three-replica and virtualization technologies yields nine
(3 × 3) replicas in the system, which results in a waste of storage space. The design
philosophy of the shared storage architecture is to push down the implementation of
three replicas to the storage layer, which is more economically feasible than that of
the shared-nothing storage architecture.
Second, the storage layer has poor elasticity. The horizontal scaling of the stor-
age layer is more complicated. New nodes need to copy data from the original nodes
and can provide services externally only after the data is synchronized between the
new nodes and original nodes. This process is not only time-consuming but also
occupies the I/O bandwidth of existing nodes. Moreover, the capacity needs to be
planned in advance, resulting in poorer scalability than the shared storage
architecture. In addition, scaling can only be performed node by node, rendering the
pay-as-you-go payment model infeasible.
8.3.1 Architecture Design
The Global Meta Service (GMS) provides distributed metadata, such as metadata of tables, and a TSO. GMS
can adjust data distribution based on the workload to achieve load balancing between
nodes. GMS can also manage compute nodes and data nodes, for example, putting
a node online or pulling a node offline.
8.3.2 Partitioning Schemes
PolarDB-X supports hash partitioning and range partitioning and allows users to
define table groups. Tables in the same table group have the same partition key and
partitioning scheme. This way, joins of tables in the table group can be directly pushed
down to storage nodes, as shown in Fig. 8.10. Taking an online shopping business as
an example, the user table and the orders table can be added to the same table group and parti-
tioned by using user IDs as the hash partition key. When a transaction queries all
orders of a user, all data that needs to be joined in the distributed transaction is located
on the same physical node. Therefore, the query can be pushed down to a storage node
and considered a standalone transaction, thereby achieving higher performance.
8.3.3 GSIs
8.3.4 Distributed Transactions
For a transaction that involves multiple partitions, different partitioned tables may
be located on different RW nodes. In this case, the transaction needs to be imple-
mented as a distributed transaction to ensure the ACID properties. PolarDB-X sup-
ports TSO-based global MVCC transactions.
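The 42/16/6-bit layout summarized in Table 8.3 can be packed into a single 64-bit integer, for example as follows. This is an illustrative encoding, not PolarDB-X source code; the epoch and the field order are assumptions.

```python
import time

PHYSICAL_BITS, LOGICAL_BITS, RESERVED_BITS = 42, 16, 6

def make_tso(physical_ms: int, logical: int) -> int:
    """Pack a 42-bit physical clock (milliseconds) and a 16-bit logical counter
    into one 64-bit timestamp, leaving 6 reserved bits at the bottom."""
    assert physical_ms < (1 << PHYSICAL_BITS) and logical < (1 << LOGICAL_BITS)
    return (physical_ms << (LOGICAL_BITS + RESERVED_BITS)) | (logical << RESERVED_BITS)

def split_tso(ts: int):
    physical = ts >> (LOGICAL_BITS + RESERVED_BITS)
    logical = (ts >> RESERVED_BITS) & ((1 << LOGICAL_BITS) - 1)
    return physical, logical

now_ms = int(time.time() * 1000)
ts = make_tso(now_ms, 7)
assert split_tso(ts) == (now_ms, 7)
```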
8.3.5 HTAP
Table 8.3 Timestamp format
• Physical clock: 42 bits
• Logical clock: 16 bits
• Reserved bits: 6 bits
HTAP not only avoids complicated extract, transform, and load (ETL) operations
but also enables faster analysis of the latest data.
PolarDB-X provides the intelligent routing feature for HTAP. In addition,
PolarDB-X supports the processing of HTAP workloads, with guaranteed low latency in
transactional processing and full utilization of computing resources in analytical
processing, and ensures strong data consistency. The optimizer of PolarDB-X ana-
lyzes the consumption of core resources, such as CPU, memory, I/O, and network
resources, for each query based on the costs and categorizes requests into OLTP
requests and OLAP requests.
PolarDB-X routes OLTP requests to the primary replica for execution, achieving
lower latency than the traditional read-write separated solution.
The compute nodes of PolarDB-X support MPP. The query optimizer automati-
cally identifies a complex analytical SQL query as an OLAP request and executes
the request in MPP mode. In other words, the optimizer generates a distributed plan
that is to be executed across multiple nodes.
To better isolate resources and prevent analytical queries from affecting OLTP
traffic, PolarDB-X allows users to create independent read-only clusters. The com-
pute nodes and storage nodes in a read-only cluster are deployed on physical hosts
that are different from those of the primary cluster. Through intelligent routing,
users can transparently use PolarDB-X to handle OLTP and OLAP loads.
References
1. Karger D, Lehman E, Leighton T, et al. Consistent hashing and random trees: distributed cach-
ing protocols for relieving hot spots on the world wide web. In: Proceedings of the Twenty-
Ninth Annual ACM Symposium on Theory of Computing; 1997. p. 654–63.
2. Peng D, Dabek F. Large-scale incremental processing using distributed transactions and notifi-
cations. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and
Implementation (OSDI 2010); 2010.
3. Bortnikov E, Hillel E, Keidar I, et al. Omid, reloaded: Scalable and highly-available transac-
tion processing. In: 15th USENIX Conference on File and Storage Technologies (FAST 17);
2017. p. 167–80.
4. Thomson A, Diamond T, Weng SC, et al. Calvin: fast distributed transactions for partitioned
database systems. In: Proceedings of the 2012 ACM SIGMOD International Conference on
Management of Data; 2012. p. 1–12.
5. Pavlo A, Aslett M. What's really new with NewSQL? ACM SIGMOD Rec. 2016;45(2):45–55.
Chapter 9
Practical Application of PolarDB
9.1.1 Related Concepts
• Instance: An instance is a virtualized database server. Users can create and man-
age multiple databases within an instance.
• Series: When you create a PolarDB instance, you can select a series (e.g., the
Cluster Edition or Single Node Edition) that suits your business needs.
• Cluster: PolarDB mainly adopts a cluster architecture, which consists of a pri-
mary node and multiple read-only nodes.
• Specification: The resource configuration of each node, such as 2 CPU cores and
8 GB of memory.
• Region: A region is a physical data center. In general, PolarDB instances are
located in the same region as Elastic Compute Service (ECS)1 instances to
achieve optimal access performance.
• Availability zone: An availability zone (or “zone” for short) is a physical area in
a region with independent power and network. There is no substantial difference
between different zones in the same region.
• Database engine: PolarDB has three independent engines that are, respectively,
100% compatible with MySQL, 100% compatible with PostgreSQL, and highly
compatible with the Oracle syntax.
9.1.2 Prerequisites
Register for an Alibaba Cloud account or obtain a RAM account assigned by the
administrator of an Alibaba Cloud account. Then, log on to the PolarDB console
and navigate to the PolarDB purchase page.
9.1.3 Billing Method
• Subscription: In this billing method, you must pay for the compute nodes when
you create a cluster. The storage space is charged by hour based on the actual
data amount, and fees are deducted from the account on an hourly basis.
• Pay-as-you-go: In this billing method, you do not need to make advance pay-
ments. Compute nodes and storage space (based on the actual data amount) are
charged by hour, and fees are deducted from the account on an hourly basis.
9.1.4 Region and Availability Zone
The region and availability zone specify the geographic location where the cluster
is located and cannot be changed after the purchase.
Note: Make sure that the PolarDB instance and the ECS instance to which the
PolarDB instance is to be connected are located in the same region. Otherwise, they
can communicate only via the Internet, which may compromise the performance.
1. ECS is a cloud server service provided by Alibaba Cloud that is usually deployed in coordination with cloud databases to form a typical business access architecture.
9.1.5 Creation Method
Choose one of the following creation methods: (1) create a new PolarDB instance;
(2) if an RDS for MySQL instance exists, upgrade the instance to a PolarDB for
MySQL instance; or (3) create a new cluster by restoring a backup of a deleted
cluster from the recycle bin.
9.1.6 Network Type
9.1.7 Series
• Cluster Edition: The Cluster Edition is the recommended mainstream series that
offers rapid data backup and recovery and global database deployment free of charge.
This edition also provides enterprise-level features, such as quick elastic scaling and
parallel query acceleration, and thus is recommended for production environments.
• Single Node Edition: This edition is the best choice for individual users who
want to test and learn more about PolarDB. It can also be used as an entry-level
product for startup businesses.
• History Database Edition: This edition is considered an archive database and features
a high data compression ratio. Therefore, this edition is suitable for businesses that do
not have high requirements on computing but need to store archive data.
9.1.8 Node Specification
Select the compute node specification based on your business requirements. Each
node uses exclusive resources to achieve stable and reliable performance. Each speci-
fication has corresponding CPU and memory capacities, maximum storage capacity,
maximum number of connections, intranet bandwidth, and maximum IOPS.
9.1.9 Storage Space
Notice: The storage fees are charged by hour based on the actual data amount.
The maximum storage capacity varies based on the selected compute node
specification.
9.1.10 Creation
After you complete the payment, the cluster is created in 10–15 min. The created
cluster is displayed on the Clusters page.
Notice: Make sure that you selected the region where the cluster is deployed.
Otherwise, you cannot view the cluster.
9.2 Database Access
9.2.1 Account Creation
Notice: If you already created a privileged account, you cannot create another
privileged account because each cluster can have only one privileged account. You
do not need to grant permissions on databases to the privileged account because the
privileged account has all permissions on all databases in the cluster. For a standard
account, you must grant permissions on specific databases.
9.2.2 GUI-Based Access
9.2.3 CLI-Based Access
9.2.3.1 Configuring an Allowlist
After you create a PolarDB cluster, you must configure an allowlist and create an
initial account for the cluster before you connect to and use the cluster.
1. An IP allowlist contains the IP addresses that are allowed to access the cluster. The
default IP allowlist contains only the default IP address 127.0.0.1, indicating that
no device can access the cluster. Only IP addresses that have been added to the
IP allowlist can access the cluster.
2. An ECS security group contains the ECS instances that can access the cluster. An ECS
security group is a virtual firewall used to control inbound and outbound traffic
of ECS instances in the security group.
Note: You can configure an IP allowlist and an ECS security group for the same
cluster. The IP addresses in the IP allowlist and ECS instances in the security group
can access the PolarDB cluster.
Configuring an IP Allowlist
Log on to the PolarDB console. In the upper-left corner of the console, select the
region where the cluster that you want to manage is deployed. Find the cluster that
you want to manage and click the cluster ID. In the left-side navigation pane, choose
Settings and Management > Whitelists. On the Whitelists page, you can add an
IP allowlist or modify an existing IP allowlist.
Notice: The ali_dms_group (for DMS), hdm_security_ips (for Database
Autonomy Service [DAS]), and dtspolardb (for Data Transmission Service [DTS])
IP allowlists are automatically generated when you use the relevant services. To
ensure normal use of the services, do not modify or delete these IP allowlists.
Add the IP address of the devices that need to access the PolarDB cluster to the
allowlist. If an ECS instance needs to access the PolarDB cluster, you can view the
IP address of the ECS instance in the configuration information section on the
details page of the ECS instance and add the IP address to the allowlist.
Note: If the ECS instance and the PolarDB cluster are deployed in the same
region, such as the China (Hangzhou) region, add the private IP address of the ECS
instance to the IP allowlist. If the ECS instance and the PolarDB cluster are deployed
in different regions, add the public IP address of the ECS instance to the IP allowlist.
Alternatively, you can migrate the ECS instance to the region where the PolarDB
cluster is deployed and then add the private IP address of the ECS instance to the IP
allowlist.
If you want to connect on-premises servers, computers, or other cloud instances
to the PolarDB cluster, add their IP addresses to the IP allowlist of the cluster.
Configuring an ECS Security Group
Log on to the PolarDB console. In the upper-left corner of the console, select the
region where the cluster that you want to manage is deployed. Find the cluster that
you want to manage and click the cluster ID. In the left-side navigation pane, choose
Settings and Management > Whitelists. Navigate to the Select Security Groups
panel, select one or more security groups, and click OK.
9.2.3.2 Obtaining Endpoints
The endpoints of a PolarDB cluster are classified into two types: cluster endpoint
and primary endpoint.
Figure 9.1 compares cluster endpoints and primary endpoints, and Table 9.2 sum-
marizes the details of both endpoint types.
Cluster endpoints and primary endpoints have public endpoints for the Internet and
private endpoints for internal networks:
1. Use a private endpoint in the following scenario: If your application or client is
deployed on an ECS instance that is deployed in the same region as the PolarDB
cluster and supports the same network type as the cluster, the ECS instance can
connect to the PolarDB cluster by using a private endpoint. You do not need to
apply for a public endpoint. A PolarDB cluster achieves optimal performance
when it is connected by using a private endpoint.
2. Use a public endpoint in the following scenario: If you cannot connect to the
PolarDB cluster over the internal network due to specific reasons (e.g., the ECS
instance and the PolarDB cluster are located in different regions or support dif-
ferent network types or you access the PolarDB cluster from a device that is not
deployed on the Alibaba Cloud), you must apply for a public endpoint. Using a
public endpoint compromises the security of the cluster. Exercise caution when
you use a public endpoint.
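Once the allowlist is configured and an endpoint has been obtained, a client on an ECS instance in the same VPC can connect with any MySQL-compatible driver. The snippet below is a hedged example using Python's PyMySQL driver; the endpoint, account, and database names are placeholders, not real values.

```python
import pymysql

# Placeholder connection details: replace with your cluster's private endpoint,
# the account created for the cluster, and your own database name.
conn = pymysql.connect(
    host="pc-xxxxxxxx.mysql.polardb.rds.aliyuncs.com",  # hypothetical private cluster endpoint
    port=3306,
    user="test_account",
    password="your_password",
    database="test_db",
    connect_timeout=10,
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone())
finally:
    conn.close()
```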
9.3 Basic Operations
9.3.1.1 Creating a Database
9.3.1.2 Creating a Table
This section describes how to create a table on the SQLConsole tab of DMS.
Log on to the PolarDB console. In the upper-left corner of the console, select the
region where the cluster that you want to manage is deployed. Find the cluster that
you want to manage and click the cluster ID. In the left-side navigation pane, choose
Settings and Management > Databases. Find the target database and click SQL
Queries.
Table 9.5 Parameters of PolarDB for PostgreSQL or the PolarDB edition that is compatible
with Oracle
• Database name: The name must start with a letter and end with a letter or a digit; it can
contain lowercase letters, digits, underscores (_), and hyphens (-); it must be 2–64 characters
in length; and it must be unique in your PolarDB cluster.
• Database owner: The owner of the database. The owner has all permissions on the database.
• Supported character set: The character set supported by the database. Default value: UTF8.
You can select another character set from the drop-down list.
• Collate: The rule based on which character strings are sorted.
• Ctype: The type of characters supported by the database.
• Description: A description of the database to facilitate database management. The description
cannot start with http:// or https://; it must start with a letter; it can contain uppercase letters,
lowercase letters, digits, underscores (_), and hyphens (-); and it can be 2–256 characters in
length.
Notice: If the Log On to Database dialog box appears, enter the account and
password of the database. For PolarDB for PostgreSQL and the PolarDB edition
that is compatible with Oracle, you must also specify the database to which you
want to log on. You can create a database on the database management page of the
PolarDB cluster.
On the SQLConsole tab, enter the command to create a table and click Execute.
For example, execute the following command to create a table named big_table:
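The book's original statement is not reproduced here; as a stand-in, a plausible table definition (hypothetical column names and types) could be pasted directly into the SQLConsole tab or run through a driver such as PyMySQL, as sketched below. The endpoint, account, and schema are placeholders.

```python
import pymysql  # assumption: PyMySQL; the SQL string can equally be pasted into the DMS SQLConsole

CREATE_BIG_TABLE = """
CREATE TABLE big_table (
    id BIGINT NOT NULL AUTO_INCREMENT,             -- hypothetical surrogate key
    name VARCHAR(64) NOT NULL,
    amount DECIMAL(10, 2) DEFAULT 0.00,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
) ENGINE = InnoDB;
"""

conn = pymysql.connect(host="pc-xxxxxxxx.mysql.polardb.rds.aliyuncs.com",  # placeholder endpoint
                       user="test_account", password="your_password", database="test_db")
with conn.cursor() as cur:
    cur.execute(CREATE_BIG_TABLE)
conn.commit()
conn.close()
```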
In this example, one million rows of test data are generated in batches for the big_
table table by using the test data building feature.
Log on to the DMS console. In the instance list, click the target PolarDB instance
and double-click the target database to go to the SQLConsole tab. On the
SQLConsole tab, right-click the big_table table and choose Data Plans > Test
Data Generation. In the Test data build dialog box, configure the parameters and
then click Submit, as shown in Fig. 9.2. Then, wait for the approval result.
After the ticket is approved, DMS automatically generates and executes SQL
statements. You can view the execution progress on the Ticket Details page. After
SQL statements are executed, go to the SQLConsole tab and execute the following
command in the database to query the test data generation status:
Account panel, choose the type of account you want to create and specify the pass-
word for the account.
Log on to the PolarDB console. In the upper-left corner of the console, select
the region where the cluster that you want to manage is deployed. Find the clus-
ter that you want to manage and click the cluster ID. In the left-side navigation
pane, choose Settings and Management > Accounts. Find the target account
and click Modify Permissions in the Actions column. In the dialog box that
appears, modify the permissions of authorized and unauthorized databases and
then click OK.
You can log on to the cluster to which the privileged account belongs and run the
following command to change the permissions of an account. Table 9.7 describes
the parameters in the command.
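The exact command and its parameters are listed in Table 9.7; as an illustration only, granting a standard account read-only access to one database with standard MySQL GRANT syntax might look like the sketch below. The account name, database name, and endpoint are placeholders.

```python
import pymysql  # assumption: executed through PyMySQL while logged on with the privileged account

conn = pymysql.connect(host="pc-xxxxxxxx.mysql.polardb.rds.aliyuncs.com",  # placeholder endpoint
                       user="privileged_account", password="your_password")
with conn.cursor() as cur:
    # Grant the existing standard account 'report_user' read-only access to test_db.
    cur.execute("GRANT SELECT ON test_db.* TO 'report_user'@'%'")
conn.close()
```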
9.3.4 Data Querying
A PolarDB cluster consists of one primary node and at least one read-only node.
Users can connect to the PolarDB cluster by using a primary endpoint or a cluster
endpoint to perform CRUD operations. The primary endpoint is always connected
to the primary node, and a cluster endpoint is connected to all its associated nodes.
The following section describes how to configure a cluster endpoint:
Log on to the PolarDB console, go to the basic information page of the target
cluster, find a cluster endpoint, and open the Edit dialog box for the cluster endpoint.
Set the read/write mode of the
endpoint to Read and Write (Automatic Read-write Splitting). Then, select the
nodes that you want to add to the endpoint for handling read requests.
Notice: When the read/write mode is set to read and write, write requests are sent
to the primary node, regardless of whether the primary node is selected.
If necessary, you can disable the primary node from receiving read requests, so
that read requests are sent only to read-only nodes. This reduces the load on the
primary node and ensures the stability of the node.
In traditional databases, users need to configure the connection endpoints of the
primary node and each read-only node in the application and then split the business
logic to achieve read/write splitting (i.e., write requests are sent to the primary node
and read requests are sent to any suitable node). For PolarDB, you only need to con-
nect to a cluster endpoint for write requests to be automatically sent to the primary
node and read requests to be automatically sent to the primary node or read-only
nodes based on the node load (i.e., the number of outstanding requests on each node).
Consistency Level
Open the Edit dialog box for the cluster endpoint and configure the consistency level.
PolarDB uses an asynchronous physical replication mechanism to achieve data
synchronization between the primary nodes and the read-only nodes. After the data
of the primary node is updated, the updates will be applied to the read-only nodes;
the specific latency (usually at the millisecond level) is related to the write pressure.
The data of read-only nodes is delayed. Consequently, the queried data may not be
the most recent data.
To meet the requirements for consistency levels in different scenarios, PolarDB
provides three consistency levels: eventual consistency, session consistency, and
global consistency. Leader-follower replication latency may lead to inconsistent
data queried from different nodes. To reduce the pressure on the primary node, you
can route as many read requests as possible to read-only nodes and choose the even-
tual consistency level.
Session consistency ensures that data updated before the execution of the read
request can be queried in the same session. When a connection pool is used, requests
from the same thread may be sent through different connections. For the database,
these requests belong to different sessions, but they successively depend on each
other in terms of business logic. In this case, session consistency cannot guarantee
the consistency of query results, and global consistency is needed.
A higher consistency level puts greater pressure on the primary node and lowers the
overall cluster performance.
Note: Session consistency is recommended because this level has minimal
impact on performance and can meet the needs of most application scenarios. If you
have high requirements for consistency between different sessions, you can choose
global consistency or use hints (e.g., /*FORCE_MASTER*/ select * from user) to
forcibly send specific queries to the primary node.
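When such a hint is sent through a driver, it simply travels as part of the SQL text. The following is a sketch assuming PyMySQL and a placeholder endpoint; the table name follows the example in the text.

```python
import pymysql  # assumption: PyMySQL connected through the cluster endpoint

conn = pymysql.connect(host="pc-xxxxxxxx.mysql.polardb.rds.aliyuncs.com",  # placeholder endpoint
                       user="test_account", password="your_password", database="test_db")
with conn.cursor() as cur:
    # The hint is part of the SQL text, so the proxy routes this query to the primary node
    # even when the connection normally load-balances reads to read-only nodes.
    cur.execute("/*FORCE_MASTER*/ SELECT * FROM user")
    rows = cur.fetchall()
conn.close()
```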
Transaction Splitting
Open the Edit dialog box for the cluster endpoint and enable transaction splitting.
When read/write splitting is enabled for the cluster endpoint, all requests in the
transactions will be sent to the primary node to ensure the read and write consis-
tency of transactions in a session. This may cause high pressure on the primary
node; the pressure on the read-only nodes remains low. After transaction splitting is
enabled, some read requests in the transactions can be sent to read-only nodes on the
premise that read/write consistency is not compromised, to reduce the pressure on
the primary node. Transaction splitting is supported only for transactions of the read
committed isolation level.
9.3.4.2 Using Hints
9.3.4.3 Other Features
PolarDB launched a parallel query framework. When the amount of query data
reaches the specified threshold, the parallel query framework is automatically
enabled to significantly reduce the query time. For more information about the
parallel query framework, see related sections in this book.
This section describes how to migrate data from a self-managed MySQL database
to PolarDB for MySQL by using DTS. DTS is a real-time data streaming service
that supports RDBMS, NoSQL, and OLAP data sources. DTS seamlessly integrates
data migration, subscription, and synchronization to ensure a stable and secure
transmission infrastructure.
9.4.1.1 Prerequisites
Create a self-managed MySQL database of version 5.1, 5.5, 5.6, 5.7, or 8.0 and a
destination PolarDB for MySQL cluster. If the source MySQL database is an on-
premises database, add the CIDR block of DTS server to the IP allowlist of the
database to ensure that DTS server can access the source MySQL database. Lastly,
create an account and configure binary logging for the self-managed MySQL
database.
Table 9.8 describes the required permissions for database accounts.
DTS uses the read and write resources of the source and destination data-
bases during full data migration. This may increase the loads of the database
servers. In some cases, the database service may become unavailable due to
poor database performance, low specifications, or large data amounts (e.g., a
large number of slow SQL statements exist in the source database, tables with-
out primary keys exist, or deadlocks exist in the destination database). Therefore,
you must evaluate the impact of data migration on the performance of the source
and destination databases before you migrate data. We recommend that you
migrate data during off-peak hours. For example, you can migrate data when the
CPU utilization of the source and destination databases is less than 30%.
The tables to be migrated in the source database must have PRIMARY KEY or
UNIQUE constraints, and the constrained fields must contain unique values.
Otherwise, the destination database may contain duplicate data records.
DTS uses the ROUND(COLUMN, PRECISION) function to retrieve values
from columns of the FLOAT or DOUBLE data type. If you do not specify the preci-
sion level, DTS sets the precision levels for the FLOAT and DOUBLE data types to
38 and 308 digits, respectively. You must check whether the precision settings meet
your business requirements.
If a data migration task fails, DTS initiates automatic recovery. Therefore, before
you switch your workloads to the destination cluster, you must stop or release the
data migration task. Otherwise, the data in the source database overwrites the data
in the destination cluster after automatic recovery.
PolarDB supports schema migration, full data migration, and incremental data
migration. You can employ these migration types to smoothly complete database
migration without interrupting services.
9.4.1.2 Billing
9.4.1.3 Procedure
Log on to the DTS console. Select the region where the destination cluster resides
and go to the Create Migration Task page. Configure the connection information
of the source and destination databases for the migration task, as described in
Table 9.10.
In the lower-right corner of the page, click Set Whitelist and Next.
Note: This step will automatically add the IP address of the DTS server to the
allowlist of the destination PolarDB for MySQL cluster to ensure that the DTS
server can connect to the destination cluster
Select the required migration types and the objects that you want to migrate.
Table 9.11 describes the parameters that must be configured.
In the lower-right corner of the page, click Precheck. You must perform a pre-
check before you start the data migration task. You can start the data migration task
only after the task passes the precheck. If the task fails the precheck, you can click
the icon to the right of each failed check item to view details. You can trouble-
shoot the issues based on the details and then run a precheck again. After the task
passes the precheck, confirm the purchase and start the migration task.
Schema migration and full data migration: We recommend that you do not
manually stop the task. Otherwise, the data migrated to the destination database
may be incomplete. You can wait until the data migration task automati-
cally stops.
Schema migration, full data migration, and incremental data migration: A
migration task that implements these migration types does not automatically stop.
You must manually stop the task at an appropriate time (e.g., during off-peak hours
or before you switch your workloads to the destination cluster).
Wait until the migration task proceeds to the Incremental Data Migration step
and enters the nondelayed state. Then, stop writing data to the source database for a
few minutes. At this time, the status of incremental data migration may display a
delay time.
Table 9.10 Details of the parameters of the source and destination databases
Task settings:
• Task name: The name of the task. DTS automatically generates a task name. We recommend
that you specify a descriptive task name to make identifying the task easy. Duplicate task
names are allowed.
Source database:
• Instance type: The instance type of the source database. In this example, User-Created
Database with Public IP Address is selected for this parameter. Note: If you select other
instance types, you must deploy the network environment for the self-managed database.
• Instance region: The region where the source database resides. If you selected User-Created
Database with Public IP Address for Instance Type, you do not need to configure this
parameter. Note: If an allowlist is configured for the self-managed MySQL database, you must
add the CIDR block of the DTS server to the allowlist. You can click Get IP Address Segment
of DTS to the right of Instance Region to obtain the CIDR block of the DTS server.
• Database type: Select MySQL.
• Hostname or IP address: The endpoint that is used to connect to the self-managed MySQL
database. In this example, the public IP address is used.
• Port: The service port number of the self-managed MySQL database. Default value: 3306.
• Database account: The account of the self-managed MySQL database. For more information
about the permissions that are required for the account, see Table 9.8.
• Database password: The password of the database account. Note: After you configure the
parameters of the source database, click Test Connectivity to the right of the Database
Password parameter to verify that the parameters are valid. If the parameters are valid, the
Passed message is displayed. If the Failed message is displayed, click Check to the right of
Failed and modify the parameters based on the check results.
Destination database:
• Instance type: Select PolarDB.
• Instance region: The region where the destination PolarDB cluster resides.
• PolarDB instance ID: Select the ID of the destination PolarDB cluster.
• Database account: The database account of the destination PolarDB cluster. For information
about the permissions that are required for the account, see Table 9.8.
• Database password: The password of the database account. Note: After you configure the
parameters of the destination database, click Test Connectivity to the right of the Database
Password parameter to verify that the parameters are valid. If the parameters are valid, the
Passed message is displayed. If the Failed message is displayed, click Check to the right of
Failed and modify the parameters based on the check results.
Table 9.12 SQL operations that can be synchronized during incremental data migration
• DML: INSERT, UPDATE, DELETE, and REPLACE
• DDL: ALTER TABLE and ALTER VIEW; CREATE FUNCTION, CREATE INDEX, CREATE
PROCEDURE, CREATE TABLE, and CREATE VIEW; DROP INDEX and DROP TABLE;
RENAME TABLE; TRUNCATE TABLE
In this case, wait until incremental data migration reenters the nondelayed state.
Then, manually stop the migration task.
Switch your workloads to the destination PolarDB cluster. Table 9.12 lists the
SQL operations that can be synchronized during incremental data migration.
Log on to the PolarDB instance in the DMS console. In the left-side instance list of
the DMS console, expand the destination PolarDB instance, and double-click a
database on the instance. You can export tables or query results. For example, in the
SQL console, right-click the target table and select Export to export the schema or
data of the table. You can export multiple tables in the database. To export query
results, execute the query statement in the SQL console and then export the query
result displayed in the execution result section.
For PolarDB for MySQL, you must enable binary logging by enabling the loose_
polar_log_bin parameter on the parameter settings page in the PolarDB console.
Log on to the DTS console. Create a migration task and configure the connection
information of the source and destination databases. For example, you can migrate
data from PolarDB to a self-built database on premises or in ECS. Proceed to the
next step to select the migration types and migration objects. To ensure service con-
tinuity, you must select incremental migration. After the task passes the precheck,
confirm the creation of the migration task.
Chapter 10
PolarDB O&M
The lifecycle of a database can be roughly divided into four stages: planning, devel-
opment, deployment, and O&M. After a database is deployed, it enters the O&M
stage, which includes three tasks: resource scaling, backup and recovery, and moni-
toring and diagnostics. This chapter provides an overview of PolarDB O&M man-
agement and describes the procedures for the resource scaling, backup and recovery,
and monitoring and diagnostics of PolarDB.
10.1 Overview
The lifecycle of a database [1] can be roughly divided into four stages: planning,
development, deployment, and O&M. The database enters the O&M stage after it is
deployed. Database O&M is a popular research field [2–5] that typically covers the
following aspects:
• Environment deployment, including database installation, parameter configura-
tion, and permission assignment.
• Backup and recovery: It is of crucial importance for a database to have a
backup available to prevent data loss caused by data corruption or user
misoperations.
• Monitoring and diagnostics: O&M personnel need to ensure normal operation of
the database and then ensure the performance of the system during operation.
Monitoring includes database running status monitoring and database perfor-
mance monitoring.
10.2 Resource Scaling
10.2.1 System Scaling
PolarDB supports online scaling. Locks do not need to be added to the database
during configuration changes. PolarDB supports scaling in three dimensions: verti-
cal scaling of computing capabilities (i.e., upgrading or downgrading of node speci-
fication), horizontal scaling of computing capabilities (i.e., addition or deletion of
read-only nodes), and horizontal scaling of the storage space. In addition, PolarDB
adopts a serverless architecture. Therefore, you do not need to manually set, expand,
or reduce the capacity of the storage space; the capacity is automatically adjusted
online as the amount of data changes. When the amount of data is large, you can use
the PolarDB storage package to reduce storage costs.
10.2.2 Manual Scaling
Upgrading or downgrading cluster specifications does not have any impact on data
already present in the cluster. During a cluster specification change, PolarDB may be
interrupted for a few seconds, and some operations cannot be performed. It is recom-
mended that you change cluster specifications during off-peak hours. After an interrup-
tion occurs, the application needs to reestablish the connection to the database. When
the specification of a PolarDB cluster is being changed, the latency of read-only requests
relative to read-write requests may be higher than that during normal cluster operation.
Perform the following steps to manually upgrade or downgrade cluster
specifications:
1. Log on to the PolarDB console.
2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Open the Change Configurations dialog box on the cluster list page or basic
information page.
4. Select Upgrade or Downgrade.
5. Select the required node specification and complete the purchase. The specifica-
tions of all nodes in the same cluster must be consistent. The new specification
takes effect after approximately 10 min.
10.2.3.1 Billing
If the billing method of the cluster is subscription (also known as prepayment), the
billing method of an added node is also subscription. If the billing method of the
cluster is pay-as-you-go (also known as post-payment or pay-by-the-hour), the bill-
ing method of an added node is also pay-as-you-go. You are charged node
specification fees for each newly added node. The storage fee varies based on the actual
usage regardless of the number of nodes.
Read-only nodes that use the subscription or pay-as-you-go billing method
can be released at any time. After the release, the balance will be refunded or billing
will stop. Before addition of a read-only node is completed, read/write splitting con-
nections do not forward requests to the read-only node. If you want the connection
to forward requests to the read-only node, you must disconnect and then reestablish
the connection (e.g., restart the application). After a read-only node is added, the
newly created read/write splitting connection will forward requests to the read-only
node. Read-only nodes can be added or removed only when the cluster has no ongo-
ing configuration changes.
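The connection behavior described above can be illustrated with a toy read/write splitting router in which each connection captures the set of read-only nodes that existed when it was opened. This is a conceptual sketch, not PolarDB's proxy implementation.

# Conceptual sketch of why existing read/write splitting connections ignore a
# newly added read-only node: the routing list is fixed when the connection opens.
import itertools

class Cluster:
    def __init__(self, read_only_nodes):
        self.read_only_nodes = list(read_only_nodes)

class SplittingConnection:
    def __init__(self, cluster: Cluster):
        # Snapshot of the read-only nodes available at connection time.
        self._targets = itertools.cycle(list(cluster.read_only_nodes))

    def route_read(self) -> str:
        return next(self._targets)

cluster = Cluster(["ro-1"])
old_conn = SplittingConnection(cluster)

cluster.read_only_nodes.append("ro-2")       # a read-only node is added later
new_conn = SplittingConnection(cluster)      # reconnecting picks up the new node

print({old_conn.route_read() for _ in range(4)})  # {'ro-1'}
print({new_conn.route_read() for _ in range(4)})  # {'ro-1', 'ro-2'}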
10.2.3.2 Procedure
1. Log on to the PolarDB console.
2. In the upper-left corner of the console, select the region where the cluster that you want to manage is located.
3. On the cluster list page, find the cluster that you want to manage and click the cluster ID.
4. In the left-side navigation pane, choose Diagnostics and Optimiza-
tion > Diagnosis.
5. On the page that appears, click the Autonomy Center tab. In the lower-right
corner, click Autonomy Service Settings. On the Autonomous Function
Management page, click the Autonomous Function Settings tab.
6. Enable auto scaling as needed and specify the corresponding trigger conditions, maximum specifications, and maximum number of read-only nodes (an illustrative sketch of such a trigger rule follows this list).
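The trigger conditions configured in step 6 can be thought of as threshold rules evaluated over recent monitoring samples, bounded by the configured maximum specification and maximum number of read-only nodes. The following sketch is a deliberately simplified, assumed model of such a rule; it does not reproduce the actual DAS algorithm.

# Simplified model of an auto scaling trigger rule (not the actual DAS algorithm).
# Assumptions: scale up when average CPU over an observation window exceeds a
# threshold, subject to a maximum specification and a maximum read-only node count.
from dataclasses import dataclass

SPEC_LADDER = ["polar.mysql.x4.medium", "polar.mysql.x4.large", "polar.mysql.x4.xlarge"]

@dataclass
class ScalingState:
    spec: str
    read_only_nodes: int

def decide(cpu_samples, state, cpu_threshold=80.0,
           max_spec="polar.mysql.x4.xlarge", max_read_only_nodes=4):
    """Return the suggested action for one evaluation round."""
    avg_cpu = sum(cpu_samples) / len(cpu_samples)
    if avg_cpu <= cpu_threshold:
        return "no-op"
    if SPEC_LADDER.index(state.spec) < SPEC_LADDER.index(max_spec):
        return "upgrade specification"
    if state.read_only_nodes < max_read_only_nodes:
        return "add read-only node"
    return "at configured maximum"

print(decide([85, 92, 88], ScalingState("polar.mysql.x4.large", 1)))  # upgrade specification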
10.3.1 Backup
A reliable backup feature can effectively prevent data loss. PolarDB supports peri-
odic automatic backup and instant manual backup. When you delete a PolarDB
cluster, you can choose to retain the backup data to avoid data loss caused by
misoperations.
PolarDB allows you to use the backup and recovery features free of charge. However, backup files (including data files and log files) occupy storage space, and PolarDB charges a fee based on the storage capacity they consume and their retention period.
10.3.1.1 Backup Methods
10.3.1.2 Backup Types
Level-1 backup creates redirect-on-write (ROW) snapshots that are directly stored
in the distributed file system of PolarDB. The system does not replicate data when
it creates a snapshot. When a data block is modified, the system writes the modified data to a new data block and retains the original data block referenced by the snapshot. This way, a database can be backed up within a few seconds regardless of the data volume. Level-1 backup enables fast backup and recovery but incurs high storage costs. The backup and recovery features of PolarDB clusters use multithreaded parallel processing to improve efficiency. Currently, recovery (cloning) from a backup set (snapshot) takes approximately 40 min per terabyte. To ensure data
security, the level-1 backup feature is enabled by default. A level-1 backup set is
retained for at least 7 days and at most 14 days.
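The redirect-on-write behavior can be made concrete with a toy block store: creating a snapshot copies no data, and a later write is redirected to a fresh block while the snapshot keeps referencing the original one. This is an illustrative model only, not the implementation of PolarDB's distributed file system.

# Toy redirect-on-write (ROW) model: snapshots copy no data; a write after a
# snapshot goes to a new block, and the snapshot keeps the original block.
class BlockStore:
    def __init__(self):
        self.blocks = {}          # physical blocks: block_id -> bytes
        self.current = {}         # live mapping: logical block number -> block_id
        self.snapshots = []       # each snapshot is a frozen copy of the mapping
        self._next_id = 0

    def write(self, logical_no: int, data: bytes):
        block_id = self._next_id          # redirect: always a fresh physical block
        self._next_id += 1
        self.blocks[block_id] = data      # the original block (if any) is retained
        self.current[logical_no] = block_id

    def snapshot(self) -> int:
        self.snapshots.append(dict(self.current))   # copies only metadata, no data
        return len(self.snapshots) - 1

    def read_snapshot(self, snap_id: int, logical_no: int) -> bytes:
        return self.blocks[self.snapshots[snap_id][logical_no]]

store = BlockStore()
store.write(0, b"v1")
snap = store.snapshot()
store.write(0, b"v2")                     # modification after the snapshot
print(store.read_snapshot(snap, 0))       # b'v1'  (the snapshot is unaffected)
print(store.blocks[store.current[0]])     # b'v2'  (the live data)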
Level-2 backup compresses level-1 backup files and stores the compressed files in
on-premises storage. Recovery by using level-2 backup data is slower than recovery
by using level-1 backup data but incurs lower storage costs. By default, the level-2
backup feature is disabled. A level-2 backup set is retained for at least 30 days and
at most 7300 days. You can also enable the Permanently Retain All Backups
option to permanently save level-2 backup files. After level-2 backup is enabled, an
expired level-1 backup set will be automatically dumped to on-premises storage at
a rate of approximately 150 MB/s and stored as a level-2 backup set. If a level-1
backup set expires before the previous expired level-1 backup set has finished being dumped as a level-2 backup set, this level-1 backup set is deleted and will not be dumped as a level-2 backup set. For example, suppose a PolarDB cluster creates a level-1 backup set at 01:00 every day and retains each backup set for 24 h. If the cluster creates level-1 backup set A at 01:00 on January 1 and level-1 backup set B at 01:00 on January 2, backup set A expires at 01:00 on January 2 and starts to be dumped as a level-2
backup set. Suppose level-1 backup set A stores a large amount of data, and the
dumping task has not been completed by 01:00 on January 3. In this case, level-1
backup set B is directly deleted after it expires at 01:00 on January 3 and will no
longer be dumped as a level-2 backup set.
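The dumping rule in the example above can be expressed as a small scheduling check: a newly expired level-1 backup set is dumped to a level-2 backup set only if no earlier dump is still in progress; otherwise it is deleted. The sketch below replays the January 1 to January 3 example, using the approximately 150 MB/s rate quoted above.

# Simulation of the level-1 -> level-2 dumping rule described above.
# A newly expired level-1 backup set is dumped only if no earlier dump is still running.
DUMP_RATE_MB_S = 150  # approximate dump rate quoted in the text

def dump_duration_hours(size_gb: float) -> float:
    return size_gb * 1024 / DUMP_RATE_MB_S / 3600

def on_expire(expire_hour: float, size_gb: float, dump_busy_until: float):
    """Return (action, new_busy_until) when a level-1 backup set expires."""
    if expire_hour < dump_busy_until:          # an earlier dump is still in progress
        return "deleted, not dumped", dump_busy_until
    return "dumped to level-2", expire_hour + dump_duration_hours(size_gb)

# Backup set A (very large) expires at 01:00 on January 2 (hour 25 from January 1, 00:00).
action_a, busy_until = on_expire(expire_hour=25, size_gb=20000, dump_busy_until=0)
# Backup set B expires at 01:00 on January 3 (hour 49) while A is still being dumped.
action_b, _ = on_expire(expire_hour=49, size_gb=500, dump_busy_until=busy_until)

print(action_a)  # dumped to level-2
print(action_b)  # deleted, not dumped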
Log Backup
The log backup feature allows you to upload redo log entries to OSS in parallel in
real time. You can perform point-in-time recovery (PITR) for a PolarDB cluster based on a full backup set
(snapshot) and redo log entries generated within a specific period of time after the
backup set is created, to ensure data security and prevent data loss caused by
misoperations. A log backup set is retained for at least 7 days and at most 7300 days.
You can enable the Retained Before Cluster Is Deleted option to permanently
store the logs.
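Conceptually, PITR selects the most recent full backup set created before the target time and then replays the redo log entries generated between that backup and the target time. The following sketch illustrates the selection-and-replay logic; it is not the actual recovery pipeline.

# Conceptual PITR sketch: pick the latest full backup set created before the target
# time, then replay redo log entries up to the target time.
def point_in_time_restore(snapshots, redo_log, target_time):
    """snapshots: list of (create_time, snapshot_id); redo_log: list of (time, entry)."""
    base_time, base_id = max((s for s in snapshots if s[0] <= target_time),
                             key=lambda s: s[0])
    to_replay = [entry for t, entry in redo_log if base_time < t <= target_time]
    return base_id, to_replay

snapshots = [(100, "snap-1"), (200, "snap-2")]
redo_log = [(150, "tx-a"), (210, "tx-b"), (260, "tx-c")]
print(point_in_time_restore(snapshots, redo_log, target_time=250))
# ('snap-2', ['tx-b'])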
10.3.1.5 FAQ
Question: Why is the total size of level-1 backup sets smaller than the size of a
single backup set?
Answer: Level-1 backup sets in PolarDB are measured based on two aspects: the
logical size of each backup set and the total physical size of all backup sets. PolarDB
uses snapshot chains to store level-1 backup sets, and only one record is generated
for each data block. Therefore, the total physical size of level-1 backup sets is some-
times smaller than the logical size of a single backup set.
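One way to make the answer concrete (this is an illustrative interpretation, not PolarDB's official accounting) is to treat the physical size of the snapshot chain as only the data blocks that are no longer shared with the live cluster data, while the logical size of a backup set is the full database size at the time of the backup. The short sketch below shows how the former can be smaller than the latter.

# Illustrative interpretation: the snapshot chain physically stores only the data
# blocks that are no longer shared with the live cluster data, while the logical
# size of a backup set is the full database size at backup time.
BLOCK_SIZE_MB = 16

live = {0: "b4", 1: "b1", 2: "b2", 3: "b3"}        # current cluster data
snapshot = {0: "b0", 1: "b1", 2: "b2", 3: "b3"}    # backup set taken earlier

logical_size = len(snapshot) * BLOCK_SIZE_MB
# Only blocks not shared with the live data consume dedicated backup storage.
physical_size = len(set(snapshot.values()) - set(live.values())) * BLOCK_SIZE_MB

print(logical_size)   # 64 MB (logical size of the single backup set)
print(physical_size)  # 16 MB (physical storage actually consumed by the chain)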
10.3.2 Recovery
10.3.2.1 Recovery Methods
10.3.2.2 Procedure
10.4.1.1 Monitoring
The PolarDB console provides a variety of monitoring metrics and updates monitor-
ing data every second, to help you understand the cluster running status in real time
and facilitate rapid fault location based on fine-grained monitoring data.
10.4.1.2 Alerting
The PolarDB console allows you to create and manage threshold-based alerting
rules. The alerting feature helps you detect cluster or node exceptions and handle
the exceptions at the earliest opportunity.
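A threshold-based alerting rule is essentially a predicate evaluated over recent monitoring samples. The sketch below is an assumed simplification (fire after several consecutive violations), not the console's actual rule engine.

# Simplified model of a threshold-based alerting rule (not the console's rule engine).
from dataclasses import dataclass
from typing import List

@dataclass
class AlertRule:
    metric: str
    threshold: float
    consecutive_samples: int   # how many consecutive violations trigger the alert

def should_alert(rule: AlertRule, samples: List[float]) -> bool:
    """Fire when the last `consecutive_samples` values all exceed the threshold."""
    recent = samples[-rule.consecutive_samples:]
    return (len(recent) == rule.consecutive_samples
            and all(v > rule.threshold for v in recent))

rule = AlertRule(metric="cpu_usage_percent", threshold=90.0, consecutive_samples=3)
print(should_alert(rule, [70, 95, 96, 97]))  # True
print(should_alert(rule, [70, 95, 85, 97]))  # False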
10.4.1.3 Procedure
Slow SQL statements can greatly affect the stability of a database. When a data-
base encounters problems such as high load or performance jitters, the database
administrator or developer first checks whether slow SQL statements are being
executed. Database Autonomy Service (DAS) provides the slow SQL analysis feature, which displays slow
SQL trends and statistics and provides SQL tuning suggestions and diagnostic
analysis.
Table 10.3 shows the comparison of slow SQL viewing methods.
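Outside of DAS, a quick manual spot check for slow statements can be run directly against the database. The sketch below assumes a MySQL-compatible PolarDB cluster with the slow query log enabled and written to a table (log_output='TABLE'); the connection parameters are placeholders.

# Manual spot check for slow SQL on a MySQL-compatible PolarDB cluster.
# Assumes the slow query log is enabled with log_output='TABLE' so that entries
# land in mysql.slow_log; connection parameters are placeholders.
import pymysql

conn = pymysql.connect(host="pc-xxxx.mysql.polardb.rds.aliyuncs.com",
                       user="app", password="***", database="mysql")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT start_time, query_time, db, sql_text
            FROM mysql.slow_log
            ORDER BY query_time DESC
            LIMIT 10
        """)
        for start_time, query_time, db, sql_text in cur.fetchall():
            print(start_time, query_time, db, sql_text[:80])
finally:
    conn.close()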
10.4.2.2 Procedure
10.4.2.3 Other Features
Autonomy Center
You can enable DAS from the Autonomy Center tab. After DAS is enabled, DAS
automatically analyzes the root cause when the database becomes abnormal, pro-
vides optimization or rectification suggestions, and automatically performs optimi-
zation or rectification operations (optimization operations can be performed only
when authorization is granted).
Session Management
You can use the session management feature to view the session details and session
statistics of the target instance.
Real-Time Performance
The real-time performance feature allows you to view various information in real
time, such as the QPS, TPS, and network traffic information of the target cluster.
Storage Analysis
The storage analysis feature provides the overview information of the entire cluster
(e.g., the number of days for which the remaining storage capacity will last) and the
storage details of a specific table in a database (e.g., space usage, space fragments,
and space exception diagnostics information).
Lock Analysis
The lock analysis feature allows you to view and analyze the latest deadlocks in the
database in a simple and direct manner.
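For a quick manual look at blocking before (or alongside) the lock analysis page, a MySQL-compatible cluster exposes current lock waits through the sys schema. The sketch below assumes MySQL 8.0 compatibility and placeholder connection parameters; note that it shows live lock waits rather than the historical deadlocks that the console analyzes.

# Manual check of current lock waits on a MySQL-compatible PolarDB cluster.
# Assumes MySQL 8.0 compatibility (sys.innodb_lock_waits view); connection
# parameters are placeholders.
import pymysql

conn = pymysql.connect(host="pc-xxxx.mysql.polardb.rds.aliyuncs.com",
                       user="app", password="***", database="sys")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT waiting_pid, waiting_query, blocking_pid, blocking_query
            FROM sys.innodb_lock_waits
        """)
        for waiting_pid, waiting_query, blocking_pid, blocking_query in cur.fetchall():
            print(f"session {waiting_pid} is blocked by session {blocking_pid}")
            print("  waiting :", waiting_query)
            print("  blocking:", blocking_query)
finally:
    conn.close()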
Performance Insight
The performance insight feature enables you to quickly evaluate the database load
and find the root cause of performance issues to improve database stability.
Diagnostic Report
The diagnostic report feature allows you to specify custom criteria for generating
diagnostic reports and view diagnostic reports.