
Feifei Li · Xuan Zhou

Peng Cai · Rong Zhang


Gui Huang · XiangWen Liu

Cloud Native
Database
Principle and Practice
Cloud Native Database
Feifei Li • Xuan Zhou • Peng Cai
Rong Zhang • Gui Huang • XiangWen Liu

Cloud Native Database


Principle and Practice
Feifei Li
Cloud Intelligence Group
Alibaba Group (China)
Hangzhou, China

Xuan Zhou
School of Data Science & Engineering
East China Normal University
Shanghai, China

Peng Cai
School of Data Science & Engineering
East China Normal University
Shanghai, China

Rong Zhang
School of Data Science & Engineering
East China Normal University
Shanghai, China

Gui Huang
Cloud Intelligence Group
Alibaba Group (China)
Hangzhou, China

XiangWen Liu
Cloud Intelligence Group
Alibaba Group (China)
Hangzhou, China

ISBN 978-981-97-4056-7    ISBN 978-981-97-4057-4 (eBook)


https://doi.org/10.1007/978-981-97-4057-4

© Publishing House of Electronics Industry 2025

Jointly published with Publishing House of Electronics Industry


The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the
print book from: Publishing House of Electronics Industry.
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors
or the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore

If disposing of this product, please recycle the paper.


Cloud platform-based data management services have led to an increased need for
cloud-native databases. This book, using PolarDB as an example, explains the
principles and technologies of cloud-native databases from both theoretical and
practical perspectives in a clear, concise, and easy-to-understand manner. I highly
recommend this book as a great read.
Li Zhanhuai
Professor, Northwestern Polytechnical University; Director, CCF Technical
Committee on Databases

Cloud-native databases are the best choice for cloud computing platforms and have
become a new favorite among users. This book, written by cloud computing and
database experts, provides valuable content that can be used as a helpful reference.
Du Xiaoyong
Professor, Renmin University of China; Director, CCF Task Force on Big Data

Since their inception in the 1960s, databases have been recognized as a key
infrastructure of the information society. With the development and popularization
of the Internet over the past 20 years, the world has undergone profound changes.
The future of the information society is becoming evident, and the digital
transformation of the economy and society is ready to take off. The combination of
databases and the Internet has brought forth the biggest challenge and opportunity
for database development in the last decade. This has sparked renewed research and
development efforts in the field. Cloud-native databases are the result of combining
databases with cloud computing. They aim to provide database capabilities as
services that are available to everyone, that is, to turn databases into public utilities.
This is the first step in leveraging the power of data and providing a platform for
digital transformation. This book summarizes Alibaba Group’s practical experiences
in cloud-native databases. I believe it will help us gain a competitive edge in this
emerging field.
Zhou Aoying
Professor and Vice President, East China Normal University;
Senior Database Scholar

Cloud computing platforms have made it possible for cloud-native databases to
emerge and become widely used. This book explains the basic principles and core
technologies of cloud-native databases from both theoretical and practical
perspectives. The authors have extensive experience in researching database theory
and have achieved significant results. This book has combined their research
expertise with their experience in developing Alibaba Cloud’s cloud-native database,
PolarDB. I highly recommend it as a good read.
Peng Zhiyong
Professor and Associate Dean of the Big Data Institute, Wuhan University;
Deputy Director, CCF Technical Committee on Databases

Cloud-native databases have been a major innovation in the database field over the
past decade, setting the trend for database development. This book explains the core
technologies of cloud-native databases, such as computing-storage separation, log-
as-data, and elastic multitenancy. It is an invaluable resource and deserves careful
reading. This book is written by renowned experts from academia and industry
in the field of databases, and contains their insights into cloud-native databases.
Li Guoliang
Professor and Deputy Director of the Department of Computer Science and
Technology, Tsinghua University; Deputy Director, CCF Technical Committee on
Databases

Cloud-native databases are the most popular databases today, thanks to their
excellent characteristics such as high scalability and availability. This book is a
pioneering work on cloud-native databases, covering key theories and technical
implementations. The authors are senior scholars and excellent professionals. I
highly recommend this book to graduate students and developers who are interested
in database technologies.
Cui Bin
Professor, Peking University; Deputy Director, CCF Technical Committee on
Databases

Data has become the key factor in the digital economy, and databases serve as the
essential software infrastructure for storing and processing data. They play a crucial
role in driving business development. Telecommunications service providers also
highly value the development of database technologies. With the rapid development
of cloud computing and big data, databases have evolved from traditional customized
deployments to on-demand, elastic, and scalable cloud services that are highly
flexible and cost-effective. The book delves into Alibaba Cloud’s technological
exploration and practical experience in the cloudification of databases. I believe this
book will offer valuable insights to readers, assisting them in successfully migrating
to the cloud and expediting their digital transformation.
Chen Guo
Deputy General Manager, China Mobile Information Technology Co., Ltd

As technology advances and digital transformation progresses, data is gaining more
attention as a valuable asset. Databases, as the foundation for storing and processing
data, are also experiencing rapid development. Databases evolve in the following
order: traditional databases, cloud-native databases, and generalized databases.
Cloud-native databases are thriving with a brand-new technical architecture, which
has contributed a lot to the implementation of cloud computing.
As the concept of cloud nativeness is still unclear to many, this book provides
much-needed clarity on this topic. The authors possess profound theoretical
knowledge, a visionary view of the industry, and extensive experience in the best
practices of Alibaba’s database products. Through their collaborative efforts, this
book skillfully presents theoretical concepts and technical implementations. I highly
recommend it to people who are interested in databases. I eagerly await the
publication of this book.
Zhou Yanwei
Founder of Beijing Cloud-Ark Technology Co., Ltd. and Chief Architect of
DTArk; Director of the Database Committee, China Computer Industry
Association; Off-campus Supervisor, Zhejiang University

Databases have entered a new era of intense competition. As this book suggests,
those who embrace cloud-nativeness are the ones who stand out. This book combines
theory with practical insights, devoting many pages to technology selection.
Drawing from the authors’ extensive experience, it offers readers a comprehensive
overview of cloud-native databases. It is undeniably a valuable read.
Zhang Wensheng
Chairman, PostgreSQL Chinese Community; Author of “A Practical Guide to
PostgreSQL” and “PostgreSQL Guide: In-depth Exploration”
Foreword I

Cloud-native databases are emerging as a new, vital form of databases. It was pro-
jected that 75% of databases would either be directly deployed in or migrated to the
cloud by 2022. Alibaba Cloud’s database products not only underpin the world’s
largest high-concurrency and low-latency e-commerce environment, providing
seamless end-to-end online data management and services for millions of small- and
medium-sized enterprises, but also offer stable and reliable data storage, processing,
and analysis services for vital sectors, such as the government, manufacturing,
finance, telecommunications, customs, transportation, and education industries. So
far, Alibaba Cloud’s database products have served over 100,000 enterprise users,
enabling them to effortlessly access enterprise-level database services, substantially
reduce costs, enhance operational efficiency, and generate novel business scenarios
and value.
In response to the immense demands for concurrent data throughput and com-
putational capabilities in e-commerce operations, Alibaba Cloud embarked on
independent R&D of databases as early as 2010. We successfully tackled criti-
cal technological challenges such as storage-computing separation, distributed
computing, high availability, compatibility, online–offline integration, and hybrid
transactional/analytical processing (HTAP). We continued to upgrade the kernel
and cluster architecture to meet the requirements of high-concurrency business
scenarios, including the renowned Double 11 shopping festival. Moreover, we
optimized our offerings for domestic chips and operating systems, laying a solid
foundation for leveraging indigenous technology stacks and achieving complete
autonomy and control. Along this journey, Alibaba Cloud’s database solutions
have garnered a range of prestigious accolades such as the award for the World’s
Leading Internet Scientific and Technological Achievements at the World Internet
Conference, the first prize of Zhejiang Science and Technology Progress Award,
and the first prize of the Science and Technology Progress Award of Chinese
Institute of Electronics. Furthermore, Alibaba Cloud has become the first Chinese
database vendor in Gartner’s list of global database leaders. Standing at the
threshold of the cloud computing era, Alibaba Cloud is committed to advancing
cloud-native database technologies in collaboration with our partners and develop-
ers. Together, we aim to create a sound industrial ecosystem to accelerate digital
transformation across society.

Jeff Zhang
DAMO Academy
Hangzhou, China
Foreword II

Databases are among the most vital foundational systems in computer science. They
serve as the bedrock software that supports the digital economy and occupy a piv-
otal strategic position. In the era of digital economy, effectively managing and lever-
aging diverse data resources is a prerequisite for scientific research and
decision-making. The traditional database market is dominated by major commer-
cial database vendors, forming a well-established database ecosystem. Databases
store and process the core data resources of users, leading to high user stickiness
and significant migration challenges. Due to a high degree of monopoly, Chinese
database systems face fierce competition in the commercial market. The current
national strategy places great emphasis on driving innovation and breakthroughs in
fundamental technologies and explicitly calls for efforts to vigorously develop
foundational software and advanced information technology services and expedite
the development progress of database systems tailored for big data applications.
Against this backdrop, database systems should not merely aim to replace existing
products in the market, but to evolve and innovate, adapting to the emerging
demands of cloud computing, big data, AI, and other new market trends. They
should not only be useful but also user-friendly.
The rapid development of technologies like cloud computing has propelled foun-
dational software toward a transformation journey into the cloud. An increasing
number of enterprises are migrating their new applications to the cloud, intensifying
the requirements for data storage and computational analysis capabilities. Cloud-
native databases boast cloud elasticity, flexibility, and high availability, enabling
them to provide robust innovation capabilities, diverse product offerings, economi-
cally efficient deployment methods, and pay-as-you-go payment models. Cloud-
native distributed databases present a significant opportunity for groundbreaking
innovation in the database domain. Cloudification opens up new possibilities for
professionals working with databases.
The cloudification of databases has undergone two stages. The first stage is cloud
hosting, which involves deploying existing database systems on cloud platforms to
provide databases as on-demand services. The second stage is cloud-native imple-
mentation, where the hierarchical structure of databases is completely reconstructed
to leverage the resource pooling feature of the cloud. This decouples computing,
storage, and networking resources to accommodate dynamic business needs by
using elastic resource pools. In the second stage, databases are transformed compre-
hensively, unlocking opportunities for profound innovation. This book has emerged
in response to this trajectory. As the primary author of this book, Dr. Li Feifei has
dedicated over a decade to database research, followed by years of industry immer-
sion focusing on the development of database systems. This book reflects a fusion
of cutting-edge theory and practical engineering expertise. It provides a robust theo-
retical foundation and detailed technical implementation insights, to facilitate a
deep understanding of the key technologies of cloud-native databases, including
storage-computing separation, high availability, storage engines, distributed query
engines, data distribution, and automatic load balancing. I believe this book will be
invaluable to readers seeking to learn the latest database technologies.
Independent innovation in information infrastructure technologies is critical to
informatization initiatives. However, this important task cannot be accomplished
overnight. It necessitates a shared vision within society and close cooperation across
the entire industry chain. Only by making unremitting efforts and seizing every
opportunity presented by the evolving landscape can we succeed. I hope this book
will inspire practitioners on their technological exploration journey and contribute
to the development of new database systems.

Zuoning Chen
Chinese Academy of Engineering
Beijing, China
Foreword III

Database management systems are among the most vital software systems in the
field of computer science and technology. They serve as indispensable foundational
platforms for information systems, providing essential support throughout the entire
lifecycle of data from collection to classification, organization, processing, storage,
analysis, and application. Without database management systems, informatization
across industries would not be possible. Database management technologies have
come a long way since their commercialization in the 1970s. Relational databases
have established their dominant position in informatization, thanks to their concise
conceptual framework, robust abstraction, powerful expressive capabilities, and
transaction consistency guarantee, making them the “de facto standard” for data
management technologies.
The widespread commercial use of the Internet has greatly accelerated the gen-
eration, circulation, and aggregation of data, raising new requirements and chal-
lenges for data management. The exponential growth in data volume necessitates
better management practices, and the increasing diversity of data types calls for
more flexible and diverse data models. The Internet, as the information infrastruc-
ture, places applications under greater pressure in terms of data throughput, concur-
rent access, and quick response to queries. The Internet is free of geographical
constraints and therefore demands uninterrupted service provisioning from business
systems. This imposes higher requirements on database systems in terms of avail-
ability and scalability, among other aspects. These new characteristics and scenarios
have sparked a revolution in database technologies, giving rise to a multitude of
innovative database technologies and products. The integration of distributed tech-
nologies has become a prominent feature, offering scalability and high availability
to effectively address the demands of large-scale data processing and storage
analysis.
The advent of cloud computing has ushered in a new phase for Internet applica-
tions. Computing and storage capabilities are offered to users as on-demand ser-
vices. Besides infrastructure services, various underlying platform technologies,
including databases and middleware, as well as application software, are also avail-
able as services. Cloud computing has triggered another wave of transformation in
database technologies. Backed by the application of distributed technologies, data-
base services that boast real-time elastic scalability and geographically distributed
availability are provided by leveraging the elastic resource pools of cloud platforms
and decoupling the computation and storage layers of databases. This transforma-
tion is not only a technological evolution but also a renewal of business models,
enabling users to access cost-effective, user-friendly, and highly available database
services that are elastically scalable. This shift has given rise to the demand for
cloud-native databases.
The reconstruction of database systems by using cloud-native technologies
aligns with the technological trends and market demands. This book, primarily
authored by Dr. Li Feifei, systematically explores the combination of cloud-native
technologies and databases. It reviews the development of database management
systems, outlining the major technological features of each important stage. It also
traces the trajectory of database technologies, which culminate in the era of cloud-
native databases. This book analyzes the trends in database technologies, decon-
structs the technical stack of database systems, and explains the implementation of
components such as shared storage systems, storage engines, and query engines
after decoupling. It highlights how new technologies such as cloud-native architec-
tures, distributed systems, high availability, and integration of software and hard-
ware enhance the capabilities of database systems. This book also provides practical
insights into the usage and O&M of cloud databases. With a well-structured and
progressive layout, this book balances theory and practicality, making it a valuable
reference for professionals in the field of database technologies.
The cloudification of databases is becoming a significant trend nowadays, pre-
senting new opportunities for database management technologies and related indus-
tries. At the same time, it also calls for the cultivation of talents in this field. This
book perfectly captures this zeitgeist.

Hong Mei
Peking University
Beijing, China
Preface

Background

For over six decades, database systems have been continuously developed to fulfill
their role as one of the fundamental software components. Among them, relational data-
bases have dominated the market due to their strong data abstraction, expressive
capabilities, and the easy-to-use SQL language. Over the past 50 years, the theory
and technology of relational databases have come a long way. Numerous books
have been published to delve into technical aspects such as SQL parsing, optimiza-
tion and execution, transaction processing, log recovery, storage engines, and data
dictionaries. Despite their maturity, database technologies continue to evolve due to
a variety of factors, including the rapid development of the Internet and big data
technology, complex business requirements, diverse data models, exponential data
volumes, and advancements in hardware technologies.
Internet applications have completely reshaped people’s lifestyles at an unprec-
edented pace, making an enormous amount of data available online. These data
need to be stored, analyzed, and consumed, which, in turn, puts databases under
unprecedented pressure. To adapt to the dynamic market, Internet applications
quickly adjust their business forms and models. This leads to the emergence of more
flexible and enriched data models and rapid changes in workload characteristics. To
cope with such changes, databases must support elastic scaling to adapt to evolving
business needs while keeping costs low. Traditional databases, often deployed as
standalone systems with fixed specifications, struggle to meet these demands. This
is where cloud computing comes in. By providing infrastructure as a service, cloud
computing establishes large-scale resource pools and offers a unified virtualized
abstraction interface. A massive operating system is established on diverse hard-
ware by utilizing technologies such as containers, virtualization, orchestration, and
microservices. Leveraging the capabilities of cloud computing, databases have
transformed from fixed-specification instances to on-demand services, allowing
users to access them as needed and scale them in real-time based on specific busi-
ness requirements.


Cloud-native databases are not simply traditional databases deployed on cloud
computing platforms. They have undergone a comprehensive transformation in
terms of the overall architecture. They fully utilize the resource pooling capabilities
of cloud computing platforms to decouple the previously monolithic databases,
achieving complete separation of computing and storage resources. In addition,
local storage is replaced by distributed cloud storage, and the computing layer
becomes serverless. Cloud-native databases pool resources to support each layer of
services, enabling independent and real-time scaling of resource pools to match
dynamic workloads and maximize resource utilization.

Summary

This book portrays the evolution of database technologies in the era of cloud com-
puting. Through specific examples, it illustrates how cloud-native and distributed
technologies have enriched the essence of databases.
Chapter 1 offers a concise overview of database development. This chapter
explains the structure, key modules, and implementation principles of typical rela-
tional databases. An SQL statement execution process is used as an example to
illustrate these concepts.
Chapter 2 discusses the transformation of databases in the era of cloud comput-
ing, highlighting the evolution from standalone databases to cloud-native distrib-
uted databases. This chapter explores the technical changes brought by cloud
computing and examines the potential trends in database technologies.
Chapter 3 focuses on the architectural design principles of cloud-native data-
bases and the reasons behind these principles. Additionally, this chapter analyzes
the technical features of several prominent cloud-native databases in the market,
such as AWS Aurora, Alibaba Cloud PolarDB, and Microsoft Socrates.
Chapters 4–7, respectively, delve into the implementation principles of important
components of cloud-native databases, including storage engines, shared storage,
database caches, and computing engines. Each chapter follows the same structure,
in which the theoretical foundations and general implementation methods of the
components are explained, and then targeted improvements and optimization meth-
ods specific to cloud-native databases are introduced.
Chapter 8 provides a detailed explanation of distributed database technologies
that support scale-in and scale-out, including their application and implementation
principles. This chapter also highlights how the integration of database technologies
with cloud-native technologies takes the database technologies to new heights.
Chapters 9 and 10 center around the practical applications of cloud-native data-
bases. By using PolarDB as an example, these chapters cover relevant topics, such
as creating database instances in the cloud, optimizing usage and O&M, and har-
nessing the elastic, high availability, security, and cost-effectiveness features offered
by cloud databases.

Primary Authors

This book is authored by Dr. Li Feifei from Alibaba Cloud and Professor Zhou
Xuan from East China Normal University (ECNU). Some of the content is contrib-
uted by Professor Cai Peng and Professor Zhang Rong from ECNU and also senior
technical expert Huang Gui from Dr. Li Feifei’s team. Liu Xiangwen, the Vice
President of Alibaba Cloud and General Manager for marketing, Alibaba Cloud
Intelligence, has also made significant contributions. Other technical experts from
Alibaba Cloud’s database team, including Zhang Yingqiang, Wang Jianying, Hu
Qingda, Chen Zongzhi, Wang Yuhui, Wang Bo, Sun Yue, Zhuang Zechao, Ying
Shanshan, Song Zhao, Wang Kang, Cheng Xuntao, Zhang Haiping, Wu Xiaofei, Wu
Xueqiang, Yang Shukun, and others, have provided valuable technical materials,
and we sincerely appreciate their contributions.
Special thanks to Jeff Zhang, Managing Director of DAMO Academy,
Academician Chen Zuoning from the Chinese Academy of Engineering, and
Academician Mei Hong from the Chinese Academy of Sciences for writing the
forewords for this book.
We would also like to express our gratitude to Professor Li Zhanhuai, Professor
Du Xiaoyong, Professor Zhou Aoying, Professor Peng Zhiyong, Professor Li
Guoliang, Professor Cui Bin, General Manager Chen Guo, President Zhou Yanwei,
and Chairman Zhang Wensheng for their testimonials.
Additionally, we extend our appreciation to the technical experts from Alibaba
Cloud’s database team, including Huang Gui, Yang Xinjun, Lou Jianghang, You
Tianyu, Wu Wenqi, Chen Zongzhi, Liang Chen, Zhang Yingqiang, Wang Jianying,
Hu Qingda, Weng Ninglong, Fu Dachao, Fu Cuiyun, Wang Yuhui, Yuan Lixiang,
Sun Jingyuan, Cai Chang, Zhou Jie, Xu Jiawei, Wu Xiaofei, Xie Rongbiao, Wang
Kang, Zheng Song, Ren Zhuo, Wei Zetao, Sun Yuxuan, Zhang Xing, Li Ziqiang, Xu
Dading, Xiong Meihui, Liang Gaozhong, Chen Shiyang, Chen Jiang, Xu Jie, Cai
Xin, Yu Nanlong, Wang Yujie, Chen Shibin, Wu Qianqian, Sun Yue, Zhao Minghuan,
Sun Haiqing, Li Wei, Yang Yuming, and Han Ken for their contributions to the trans-
lation of this book. Thanks to Wang Yuan and Xiao Simiao from Alibaba Cloud’s
database team for their contributions in organizing the translation.
This book would not have been possible without the collective efforts of every-
one involved.
As it was completed within a limited timeframe, this book may not answer every-
thing there is to know about database systems. We therefore encourage readers to
kindly share their feedback.

Hangzhou, China    Feifei Li
Shanghai, China    Xuan Zhou
Shanghai, China    Peng Cai
Shanghai, China    Rong Zhang
Hangzhou, China    Gui Huang
Hangzhou, China    XiangWen Liu
February 2024
Introduction

This book thoroughly analyzes the technological evolution of databases, which
serve as core software systems, in the era of cloud computing. It explores the pro-
gressive development of traditional database technologies toward cloud-native
architectures from various perspectives, including architectural design, implemen-
tation mechanisms, and system optimization. On the basis of the fusion of theory
and practice, this book examines SQL optimization and execution, transaction pro-
cessing, caching, indexing, and other features employed by widely used database
systems like MySQL and PostgreSQL. It also explores the trade-offs and compro-
mises made to meet practical application requirements, the improvements made to
suit complex scenarios, and the underlying rationale behind these choices.
Furthermore, this book draws on Alibaba Cloud’s database research and develop-
ment experience, highlighting the core technical principles that have enabled mod-
ern databases to evolve into services, such as cloud computing resource pooling and
distributed technologies for high availability, elastic scaling, and on-demand usage.
This book offers comprehensive theoretical knowledge and practical experience
by navigating the latest trends in database development, thereby inspiring readers to
delve deeper into the subject. It can serve as a textbook for undergraduate and grad-
uate students majoring in information-related disciplines at higher educational insti-
tutions, as well as a reference book for professionals engaged in kernel development
and system O&M in the database industry.

Contents

1 Database Development Milestones  1
  1.1 Overview of Database Development  1
    1.1.1 Emergence  2
    1.1.2 Commercialization  2
    1.1.3 Maturation  3
    1.1.4 Cloud-Native and Distributed Era  3
  1.2 Database Technology Development Trends  5
    1.2.1 Cloud-Native and Distributed Architectures  5
    1.2.2 Integration of Big Data and Databases  6
    1.2.3 Hardware-Software Integration  7
    1.2.4 Multimodality  7
    1.2.5 Intelligent O&M  8
    1.2.6 Security and Trust  8
  1.3 Key Components of Relational Databases  9
    1.3.1 Access Management Component  9
    1.3.2 Query Engine  10
    1.3.3 Transaction Processing System  15
    1.3.4 Storage Engine  18
  References  21
2 Database and Cloud Nativeness  23
  2.1 Development of Databases in the Cloud Era  23
    2.1.1 Rise of Cloud Computing  23
    2.1.2 Database as a Service  24
  2.2 Challenges Faced by Databases in the Cloud-Native Era  26
  2.3 Characteristics of Cloud-Native Databases  27
    2.3.1 Layered Architecture  27
    2.3.2 Resource Decoupling and Pooling  27
    2.3.3 Elastic Scalability  27
    2.3.4 High Availability and Data Consistency  28
    2.3.5 Multitenancy and Resource Isolation  29
    2.3.6 Intelligent O&M  29
  References  30
3 Architecture of Cloud-Native Database  31
  3.1 Design Principles  31
    3.1.1 Essence of Cloud-Native Databases  31
    3.1.2 Separation of Computing and Storage  32
  3.2 Architecture Design  33
  3.3 Typical Cloud-Native Databases  35
    3.3.1 AWS Aurora  35
    3.3.2 PolarDB  41
    3.3.3 Microsoft Socrates  45
  References  49
4 Storage Engine  51
  4.1 Data Organization  51
    4.1.1 B+ Tree  52
    4.1.2 B+ Tree in the InnoDB Engine  54
    4.1.3 LSM-Tree  58
  4.2 Concurrency Control  62
    4.2.1 Basic Concepts  62
    4.2.2 Lock-Based Concurrency Control  62
    4.2.3 Timestamp-Based Concurrency Control  64
    4.2.4 MVCC  67
    4.2.5 Implementation of MVCC in InnoDB  69
  4.3 Logging and Recovery  72
    4.3.1 Basic Concepts  72
    4.3.2 Logical Logs  72
    4.3.3 Physical Logs  73
    4.3.4 Recovery Principles  74
    4.3.5 Binlog of MySQL  74
    4.3.6 Physical Logs of InnoDB  75
  4.4 LSM-Tree Storage Engine  77
    4.4.1 PolarDB X-Engine  77
    4.4.2 High-Performance Transaction Processing  79
    4.4.3 Hardware-Facilitated Software Optimization  82
    4.4.4 Cost-Effective Tiered Storage  86
    4.4.5 Dual Storage Engine Technology  92
    4.4.6 Experimental Evaluation  93
  References  97
5 High-Availability Shared Storage System  99
  5.1 Basics of High Availability  99
    5.1.1 Leader and Follower Replicas  100
    5.1.2 Quorum  101
    5.1.3 Paxos  102
    5.1.4 Raft  105
    5.1.5 Parallel Raft  108
  5.2 High Availability of Clusters  110
    5.2.1 High Availability of MySQL Clusters  110
    5.2.2 High Availability of PolarDB  114
  5.3 Shared Storage Architectures  129
    5.3.1 Aurora Storage System  130
    5.3.2 PolarFS  131
  5.4 File System Optimization  134
    5.4.1 User Space I/O Computing  134
    5.4.2 Near-Storage Computing  137
  References  143
6 Database Cache  145
  6.1 Introduction to the Database Cache  145
    6.1.1 Role of the Database Cache  145
    6.1.2 Buffer Pool  146
  6.2 Cache Recovery  146
    6.2.1 Challenges of Caching in the Cloud Environment  146
    6.2.2 Cache Recovery Based on CPU-Memory Separation  147
  6.3 PolarDB Practices  149
    6.3.1 Optimization of the Buffer Pool  149
    6.3.2 Optimization of the Data Dictionary Cache and the File System Cache  154
    6.3.3 RDMA-Based Shared Memory Pool  155
  References  160
7 Computing Engine  163
  7.1 Overview of Query Processing  163
    7.1.1 Overview of Database Query Processing  163
    7.1.2 Overview of Parallel Queries  165
  7.2 Query Execution Models  167
    7.2.1 Volcano Model  168
    7.2.2 Compiled Execution Model  168
    7.2.3 Vectorized Execution Model  169
  7.3 Overview of Query Optimization  169
    7.3.1 Introduction to Query Optimization  169
    7.3.2 Logical Query Optimization  170
    7.3.3 Physical Query Optimization  170
    7.3.4 Other Optimization Methods  171
  7.4 Practical Application of PolarDB Query Engine  171
    7.4.1 Parallel Query Technology in PolarDB  172
    7.4.2 Execution Plan Management in PolarDB  187
    7.4.3 Vectorized Execution in PolarDB  194
  References  197
8 Integration of Cloud-Native and Distributed Architectures  199
  8.1 Basic Principles of Distributed Databases  199
    8.1.1 Architecture of Distributed Databases  200
    8.1.2 Data Partitioning  201
    8.1.3 Distributed Transactions  203
    8.1.4 MPP  207
  8.2 Distributed and Cloud-Native Architectures  208
    8.2.1 Shared Storage Architecture  209
    8.2.2 Shared-Nothing Architecture  210
  8.3 Cloud-Native Distributed Database: PolarDB-X  211
    8.3.1 Architecture Design  211
    8.3.2 Partitioning Schemes  212
    8.3.3 GSIs  213
    8.3.4 Distributed Transactions  213
    8.3.5 HTAP  214
  References  215
9 Practical Application of PolarDB  217
  9.1 Creating Instances on the Cloud  217
    9.1.1 Related Concepts  217
    9.1.2 Prerequisites  218
    9.1.3 Billing Method  218
    9.1.4 Region and Availability Zone  218
    9.1.5 Creation Method  219
    9.1.6 Network Type  219
    9.1.7 Series  219
    9.1.8 Compute Node Specification  219
    9.1.9 Storage Space  219
    9.1.10 Creation  220
  9.2 Database Access  220
    9.2.1 Account Creation  220
    9.2.2 GUI-Based Access  221
    9.2.3 CLI-Based Access  221
  9.3 Basic Operations  225
    9.3.1 Database and Table Creation  225
    9.3.2 Test Data Creation  227
    9.3.3 Account and Permission Management  227
    9.3.4 Data Querying  229
  9.4 Cloud Data Migration  231
    9.4.1 Migrating Data to the Cloud  231
    9.4.2 Exporting Data from the Cloud  235
10 PolarDB O&M  237
  10.1 Overview  237
  10.2 Resource Scaling  238
    10.2.1 System Scaling  238
    10.2.2 Manual Scaling  238
    10.2.3 Manual Addition and Removal of Nodes  238
    10.2.4 Automatic Scaling and Node Addition and Removal  239
  10.3 Backup and Recovery  240
    10.3.1 Backup  240
    10.3.2 Recovery  242
  10.4 Monitoring and Diagnostics  243
    10.4.1 Monitoring and Alerting  243
    10.4.2 Diagnostics and Optimization  244
  References  246
About the Authors

Feifei Li is Senior Vice President of Alibaba Cloud and President of Database
Products Business Unit, Alibaba Cloud.
Recognized as a Distinguished Member of the Association for Computing
Machinery (ACM), he is honored with prestigious awards, including ACM and
Institute of Electrical and Electronics Engineers (IEEE) accolades, the award for the
World’s Leading Internet Scientific and Technological Achievements at the 2019
World Internet Conference, the first prize of Zhejiang Science and Technology
Progress Award, and the first prize of Science and Technology Progress Award of
Chinese Institute of Electronics. He has led the development of Alibaba Cloud’s
enterprise-level cloud-native database system. He also serves as the Deputy Director
of the Expert Committee on Big Data and a standing committee member of the
Database Professional Committee at the China Computer Federation (CCF). He
also holds key roles such as an editorial board member and chair for renowned
international academic journals and conferences, including the International
Conference on Very Large Databases (VLDB) 2021 conference and the IEEE
International Conference on Data Engineering (ICDE) 2021 conference.

Xuan Zhou is a Professor and Vice Dean of the School of Data Science and
Engineering at East China Normal University (ECNU).
He received his bachelor’s degree from Fudan University in 2001 and obtained
his Ph.D. degree from the National University of Singapore in 2005. He worked as
a researcher at the L3S Research Center in Germany and the Commonwealth
Scientific and Industrial Research Organisation (CSIRO) in Australia from 2005 to
2010 and then taught at Renmin University of China before joining ECNU in 2017.
He has devoted himself to the research of database systems and information retrieval
technologies. He has contributed to and directed various national and international
research projects and industrial cooperation projects. He has developed a variety of
data management systems. His research in distributed databases earned him the
second prize of the State Scientific and Technological Progress Award in 2019.


Peng Cai is a Professor and Ph.D. supervisor of the School of Data Science and
Engineering at ECNU.
Prior to joining ECNU in June 2015, he worked at IBM China Research
Laboratory and Baidu (China) Co., Ltd. He has published academic papers at
esteemed international conferences, such as the VLDB, ICDE, Special Interest
Group on Information Retrieval (SIGIR) Conference, and Association for
Computational Linguistics (ACL) Conference. His current research focuses on two
key areas: in-memory transaction processing and adaptive data management sys-
tems based on machine learning. He has been awarded the second prize of the State
Scientific and Technological Progress Award and the first prize of the Science and
Technology Progress Award of the Ministry of Education.

Rong Zhang is a Professor and Ph.D. supervisor of the School of Data Science and
Engineering at ECNU.
She has been dedicated to the research and development of distributed systems
and databases since 2001. She has led or participated in various research projects
funded by the National Natural Science Foundation of China, projects under the 863
Program, and industrial cooperation projects. Her outstanding contributions have
earned her the first prize of Shanghai Technology Progress Award for Technical
Invention and the second prize of the State Scientific and Technological Progress
Award. Her research fields encompass distributed data management, data stream
management, and big data benchmarking.

Gui Huang is a senior technical expert at Alibaba and the chief database architect of
Alibaba Cloud.
Throughout his tenure at Alibaba, he has been deeply engaged in the research
and development of distributed systems and distributed database kernels. He has
also participated in the development of PolarDB, a database independently devel-
oped by Alibaba. He possesses extensive expertise in distributed system design,
distributed consensus protocols, and database kernel implementation. He has pub-
lished multiple scholarly papers at esteemed international conferences, including
SIGMOD, FAST, and VLDB. His achievements have earned him the first prize of
the Science and Technology Progress Award of the Chinese Institute of Electronics.

Xiangwen Liu is Vice President of Alibaba Cloud, General Manager for market-
ing, Alibaba Cloud Intelligence, and Standing Director of CCF.
Having been with Alibaba for over a decade, Ms. Liu has led teams in creating a
three-tier governance system for Alibaba’s technology mid-end strategy and played
a pivotal role in the founding and growth of Alibaba DAMO Academy. As the
General Manager of the Marketing and Public Relations Department at Alibaba
Cloud, Ms. Liu has been instrumental in forging partnerships with universities, gov-
ernments, developers, and innovators, promoting the brand upgrade of Alibaba
Cloud in the digital economy era.
Chapter 1
Database Development Milestones

In the 1960s, database management systems began to thrive as the core software for
data management. Propelled by changes in application requirements and hardware
development, databases have undergone several evolutions and achieved notable
progress in query engines, transaction processing, storage engines, and other
aspects. However, with the advent of the cloud era, new demands and challenges
have been posed for the processing capabilities of database systems. Moreover,
cloud platform-based initiatives have been launched for various database systems,
giving rise to numerous innovative design ideas and implementation technologies.

1.1 Overview of Database Development

Databases play a vital role in the field of computer science. Early computers were
essentially giant calculators focused on algorithms and were mainly used for scien-
tific calculations. Computers did not store data persistently. They batch processed
input data and returned the calculation results but did not save the data results. At
that time, no specialized data management software was available. Programmers
had to define the logical structure of data and design its physical structure in their
programs, including the storage structure, access methods, and input/output formats.
As a result, the subroutines that accessed data in a program changed whenever the
storage structure changed, and data and programs lacked independence. The concept
of files had not been introduced, and data could not be reused. Even if two programs
used the same data, the data had to be input twice.
In the 1960s, as computers entered commercial systems and began to solve practi-
cal business problems, data went from being a by-product of algorithm processing to
a core product. At this time, database management systems (DBMSs) blossomed into
a specialized technical field. The core task of DBMSs was to manage data, which
included collecting, classifying, organizing, encoding, storing, processing, applying,
and maintaining data. Although this task has not changed much since its inception, the
theoretical models and related technologies for managing and organizing data have
undergone several transformations, driven by the development of computer hardware
and software, the complexity and diversity of business processing, and changes in data
scale. Database development can be divided into four stages: emergence, commercial-
ization, maturation, and the era of cloud-native and distributed computing.

1.1.1 Emergence

In 1960, Charles Bachman joined General Electric (GE) and developed the Integrated
Data Store (IDS), the first database system, which was a network model database
system. Bachman later joined the Database Task Group (DBTG) of the Conference/
Committee on Data Systems Languages (CODASYL), under which he developed the
language standards for the network model, with IDS as the main source. In 1969, IBM
developed a database system called the Information Management System (IMS) for the
Apollo program, which used the hierarchical model and supported transaction process-
ing. The hierarchical and network models pioneered database technologies and effec-
tively solved the problems of data centralization and sharing but lacked data independence
and abstraction layers. When accessing these two types of databases, users must be
aware of the data storage structure and specify the access methods and paths. This com-
plicated mechanism hindered the popularization of such databases.

1.1.2 Commercialization

In 1970, IBM researcher E.F. Codd proposed the relational model in his ground-
breaking paper “A Relational Model of Data for Large Shared Data Banks,” provid-
ing the theoretical foundation for relational database technologies. The relational
model was based on predicate logic and set theory with a rigorous mathematical
foundation. It provided a high-level data abstraction layer but did not include the
specific data access process, which was implemented by the DBMS instead. At the
time, some believed the relational model was too idealized and only an abstract data
model that was difficult to implement in an efficient system. In 1974, Michael
Stonebraker and Eugene Wong from UC Berkeley decided to study relational data-
bases and developed the Interactive Graphics and Retrieval System (INGRES),
which proved the efficiency and practicality of the relational model. INGRES used
a query language called QUEL. At the same time, IBM realized the potential of the
relational model and developed the relational database known as System R, along
with the Structured Query Language (SQL). By the late 1970s, the technologies pioneered by INGRES and System R had been commercialized in products such as Oracle and IBM DB2, and SQL was eventually adopted by the American National Standards Institute (ANSI) in 1986 as the standard language for relational databases. SQL only describes what data is desired,
without specifying the process for obtaining the data, freeing users from cumber-
some data operations. This was the key to the success of relational databases.

1.1.3 Maturation

After 10 years of development, the relational model theory had gained a foothold, and E.F. Codd was awarded the Turing Award in 1981 for his contribution to the relational model. The maturity of the theoretical model gave rise to a multitude of commercial database products, such as Oracle, IBM DB2, Microsoft SQL Server, Informix, and other popular database software systems, all of which appeared during
this period. The development of database technologies was related to programming
languages, software engineering, information system design, and other technolo-
gies and promoted further research in database theory. For example, database
researchers proposed an object-oriented database model (referred to as “object
model” for short) based on object-oriented methods and techniques. To this day,
many research works are still conducted based on existing database achievements
and technologies, aiming to expand the field of traditional DBMSs, mainly rela-
tional DBMSs (RDBMSs), at different levels for different applications, such as
building an object-relational (OR) model and establishing an OR database (ORDB).
The development of commercial databases has also driven the continuous evolution
of open-source database technologies. In the 1990s, open-source database projects
flourished, and the two major open-source database systems nowadays, MySQL and
PostgreSQL, were born. In the past, databases were mainly used for processing online
transaction business and therefore were known as online transaction processing (OLTP)
systems. After nearly 20 years of development, standalone relational database technolo-
gies and systems have become increasingly mature and were commercialized on a large
scale. With the widespread application of relational databases in information systems,
an increasing amount of business data accumulated. The focus and interest of scholars
and technical professionals shifted to the utilization of these data to facilitate business
decision-making. Hence, online analytical processing (OLAP) was introduced to query
and analyze large-scale data. Against this backdrop, IBM researchers Barry Devlin and
Paul Murphy proposed a new term in 1988—data warehouse. With the advent of the
Internet era, systems exclusive to professionals were opened up to everyone, leading to
an exponential increase in the scale of data that businesses processed and an explosive
surge in the scale of processing requests to databases. Consequently, traditional stand-
alone databases were overstretched. Propelled by cloud technologies, emerging tech-
nologies such as distributed databases then made their debuts.

1.1.4 Cloud-Native and Distributed Era

In the cloud-native era, two different practical schemes are available for expanding
database processing capabilities as the business processing scale grows. One is vertical scaling, also called "scale-up." In this scheme, the capacities of database components are increased. Additionally, better hardware (e.g., minicomputers and high-end storage systems) is used, as in the well-known "IBM, Oracle, and EMC (IOE)" solutions. Multiple compute nodes in a database system share storage, giving rise to the shared-storage architecture shown in Fig. 1.1a. The other scheme is horizontal scaling, which is also referred to as "scale-out." In this scheme, the capacities of individual compute nodes remain unchanged, but multiple compute nodes are combined into a shared-nothing distributed system. Each node stores a part of the data as data shards and processes part of the requests according to the sharding rules, as shown in Fig. 1.1b.
Both schemes have their respective advantages and drawbacks. Scale-up essen-
tially maintains the standalone mode, in which all compute nodes share the same
status, data, and metadata. A scaled-up database system remains compatible with the
functionalities of a standalone system. Many traditional enterprises require smooth
business operations and expect to expand the system without modifying applications.
Scale-up is an ideal choice for such users. However, in this scheme, scalability is lim-
ited because the number of compute nodes that can be added is limited. In addition,
status synchronization costs increase with the increase in the number of compute
nodes, and the storage capacity is also subject to the capacity of the shared storage.
Consequently, this scheme is inadequate for handling massive data and requests in
Internet businesses. Most Internet enterprises prioritize database scalability. Therefore,
they opt for a horizontal scaling solution that uses inexpensive hardware, even though
it involves horizontal database sharding. As a result, queries and transactions become
more complex and costly, as they span across multiple data shards or nodes. This can
be avoided for most Internet businesses by using efficient sharding rules.

Fig. 1.1 Scale-up and scale-out of a database. (a) Scale-up (b) Scale-out
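To make the idea of a sharding rule concrete, the following sketch (written in C++ purely for illustration; the shard count, key choice, and hash function are assumptions rather than the design of any particular system) routes each row to a shard by hashing its sharding key. A query that carries the sharding key can be routed to a single node, whereas a query without it must be fanned out to all shards.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A hypothetical hash-based sharding rule for a shared-nothing cluster: each
// row is mapped to one of N shards by hashing its sharding key. Range-based
// rules work similarly, replacing the hash with a lookup in a boundary list.
struct HashShardingRule {
  std::size_t shard_count;

  std::size_t ShardFor(const std::string& sharding_key) const {
    // std::hash is used only for illustration; a real system needs a hash
    // function that is stable across versions and machines.
    return std::hash<std::string>{}(sharding_key) % shard_count;
  }
};

int main() {
  HashShardingRule rule{4};  // assume a 4-node shared-nothing cluster
  std::vector<std::string> keys = {"c-1001", "c-1002", "c-1003"};
  for (const auto& id : keys) {
    // A query that carries the sharding key touches only one node; a query
    // without it must be fanned out to every shard.
    std::cout << id << " -> shard " << rule.ShardFor(id) << "\n";
  }
  return 0;
}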
The rapid development of the Internet not only brought scalability requirements
to OLTP but also changed the form of data analysis. In the early twenty-first century,
Google published its famous Three Papers that introduced Google File System
(GFS), BigTable, and MapReduce and incubated the concept of big data. The open-source community actively followed up and built an open-source big data process-
ing technology system based on the Hadoop ecosystem, which became the de facto
industry standard. Since 2006, the debate over whether the Hadoop system or the
traditional data warehousing technology is better for big data processing has been a
hot topic. Nevertheless, achieving “simpler usage patterns, better performance expe-
rience, larger data processing scales, and stronger real-time processing capabilities”
has always been the research focus in academia and industry, regardless of which

technology route is chosen. Over time, the big data ecosystem represented by
Hadoop and the database ecosystem represented by traditional data warehouses
gradually converge in the field of big data. “SQL on Hadoop” has become a vital
research direction in this field. Databases gradually developed big data capabilities
while providing the same user experience as standalone databases. Moreover, SQL
gradually became a universally accepted query and analysis language.
With the continuous development of information technologies, an increasing
amount of data of various types has been generated. Because it requires strictly structured data, the relational model of traditional relational databases is not well suited to processing rapidly changing business data or data with special structures. In view of this,
databases that use flexible data model definitions (known as schemaless databases)
and databases that use special data models, collectively known as NoSQL systems,
have emerged. NoSQL databases are classified into three categories: key value (KV)
databases (e.g., Redis, HBase, and Cassandra), document databases (e.g., MongoDB),
and graph databases (e.g., Neo4j). Trade-offs in several technical details have been
made for these databases to meet specific requirements for the data scale, flexibility,
concurrency, and performance. In specific scenarios, NoSQL displays better perfor-
mance, scalability, availability, and cost-effectiveness than relational databases.
However, relational databases remain the mainstream databases due to the powerful
expression capability of SQL and their universally mature specifications and complete
and strict atomicity, consistency, isolation, and durability (ACID) semantics.
With the increasing popularity of cloud computing in the 2020s and the launch of
database services of major cloud vendors, traditional database vendors also began to
explore the cloud computing field and launch cloud-based database products. Databases
have entered the cloud era, driving a new round of remarkable transformation.

1.2 Database Technology Development Trends

The recent years have witnessed the emergence of many new technologies and ideas
that brought new vitality to the field of databases. The following sections will dis-
cuss the development trends of database technologies in six aspects.

1.2.1 Cloud-Native and Distributed Architectures

Resource decoupling in the cloud database architecture means the separation of computing and storage so that computing and storage resources can be independently scaled to cope with user requirements for on-demand usage and pay-as-you-go billing. This lowers the barrier to entry and provides “extreme elasticity” to meet the rapid development of enterprise business in the Internet age.
Cloud-native databases can achieve minute-level orchestration and upgrades for
stateless computing resources, significantly reducing business downtime caused by
operations and maintenance (O&M). For stateful storage resources, key technolo-
gies like distributed file systems, distributed consistency protocols, and multimodal
replicas are used to meet the requirements for storage resource pooling, data secu-
rity, and strong consistency. Scalable communication resources ensure that “suffi-
cient” bandwidth is available between computing and storage resources to meet the
demand for high-throughput, low-latency data transmission.
High availability based on resource decoupling is the basic feature of cloud data-
bases. Overall high availability of computing resources is achieved by using redundant
compute nodes in combination with “probing” and high-availability switching technolo-
gies that are based on the cloud infrastructure. Using multiple replicas and distributed
consistency protocols ensures the consistency between multiple replicas of data and the
high availability of data storage. Given that cloud databases face arbitrary data scales,
they must have rapid backup and recovery capabilities and the ability to restore data to
any point in time according to the backup strategy. To meet high concurrency and big
data processing requirements, cloud databases must support scale-out/scale-in and dis-
tributed processing mechanisms, including but not limited to load balancing, distributed
transaction processing, distributed locking, resource isolation and scheduling in the
multitenancy architecture, mixed CPU loads, and massively parallel processing (MPP).

1.2.2 Integration of Big Data and Databases

Cloud databases aim to provide users with simple and easy-to-use database systems
to help them quickly achieve business functionality in the shortest time and at the
lowest costs. With the development of information technologies, big data has
become a reality. Along with this, a core requirement was imposed on databases,
which is to maintain consistent performance and acceptable response time when
dealing with massive data. The demand for integration of big data and databases is
increasingly strong. For users, this means they can directly use SQL to analyze and
process massive data based on cloud databases. To enable cloud databases to pro-
cess big data, a powerful kernel engine must be built by leveraging the elasticity and
distributed parallel processing features of the cloud infrastructure. This will maxi-
mize the efficiency of computing and storage resources, thereby providing massive
data analysis capabilities at an acceptable cost-effectiveness ratio. Further, ecosys-
tem tools that facilitate big data analysis and processing must be available. The
ecosystem tools can be categorized into three types: data transfer and migration
tools, data integration development tools, and data asset management tools. Data
transfer and migration tools ensure smooth data links and free flow of data. From a
performance perspective, such tools are evaluated based on their real-time performance and throughput. From a functional perspective, they must serve as a pipeline for various
upstream and downstream data sources. Data integration development tools enable
users to freely process massive data (e.g., integrate, clean, and transform data) and
provide a complete integrated development environment that supports visualized
modeling of the development process and task publishing and scheduling. Data
asset management tools are essential for data fusion applications. “Business data,
data assets, asset application, and application value” reflect the progressive process
of business innovation driven by business data. As an important cloud infrastructure
for business data production, storage, processing, and consumption, cloud databases
play a key role in the data assetization process. Asset management tools based on
cloud databases guarantee that cloud databases can connect from “end to end” and
help customers achieve business value.

1.2.3 Hardware-Software Integration

The development of new hardware opened more possibilities for database technolo-
gies and fully utilizing hardware performance has become an important means for
improving the efficiency of all database systems. In a cloud-native database, com-
puting and storage are decoupled, and the network is utilized to implement distrib-
uted capabilities. The design of the computing, storage, and network features takes
into account the characteristics of new hardware. The SQL computation layer of the
database needs to perform massive algebraic operations, such as join, aggregation,
filtering, and sorting operations. These computation operations are accelerated by
using heterogeneous computing devices such as GPUs to fully utilize the parallel
processing capability of the database. Specific computation-intensive operations,
such as compression/decompression and encryption/decryption, may be pro-
grammed by using the programmable capabilities of field-programmable gate arrays
(FPGAs), to reduce the burden of CPUs. In terms of storage, the emergence of
nonvolatile memories (NVMs) has expanded the horizon for databases. With its byte addressability and persistent storage capabilities, NVM dramatically improves I/O performance compared with solid-state drives (SSDs). Many data-
base designers are beginning to rethink how to redesign the architecture to use these
features, for example, to design index structures for NVMs and reduce or cancel
logs. The execution path becomes longer after computing and storage are decou-
pled. Therefore, many cloud databases use high-performance network technologies
(e.g., remote direct memory access [RDMA] and InfiniBand) together with user-mode networking frameworks (e.g., the Data Plane Development Kit [DPDK]) and other
technologies, to mitigate the negative impact of network latency. Nowadays, data-
base system theories are mature and it is much harder to achieve breakthroughs. It
is an inevitable trend to reap the benefits of hardware development.

1.2.4 Multimodality

As Internet businesses become more diversified, a richer variety of data needs to be processed. Relational databases used to be advantageous in facilitating normalized
data management due to their strict definition of schemas. However, this feature has
now become a constraint in the face of rapidly changing businesses. At present, the
fundamental requirement is to manage flexible semistructured and unstructured
data. New databases rise to this challenge. By leveraging the advantages of tradi-
tional databases, such as powerful and rich data operation capabilities and complete
ACID semantics, new databases support data processing for more data models (e.g.,
graph, KV, document, time series, and spatial models) and unstructured data (e.g.,
images and streaming media). Processing numerous data models in one system and
normalizing and processing heterogeneous data can mine more application value.

1.2.5 Intelligent O&M

As the data scale increases, the usage scenarios and frequency of cloud databases are
also increasing. The traditional database administrator (DBA)-based O&M mode can
no longer meet the O&M requirements of the cloud era because DBAs have limited
physical strengths and capabilities. Intelligent O&M technologies facilitate the safe
and stable operation of cloud databases. Heuristic machine learning may be a poten-
tial solution. For instance, machine learning can be combined with the expertise of
database experts to build an intelligent O&M model based on the data collection capa-
bilities of the cloud infrastructure and the massive operation data of cloud databases.
The model can be used to implement self-awareness, self-­repair, self-optimization,
self-O&M, and self-security as cloud services for cloud databases, freeing users from
complex database management and preventing service failures caused by manual
operations to guarantee the stability, security, and efficiency of database services.

1.2.6 Security and Trust

Database security is a top priority in the cloud environment. Reliability, controllability, and visibility are the core principles of cloud database security and trustworthiness. To
achieve reliability, cloud databases focus on ensuring link security and data storage
security on the basis of secure cloud infrastructure. Under normal circumstances, cloud
databases can implement encrypted storage of important data by using the key manage-
ment feature of the cloud infrastructure and provide encryption algorithms of different
strengths based on the requirements of different industries and regions. Controllability
refers to key access control and data permission management. Generally, cloud data-
bases support encrypted storage of sensitive data by using user-provided keys based on
the key management service of the cloud computing infrastructure. In this case, even the
cloud service provider cannot access the encrypted data without the key. In terms of data
permission management, in addition to the database-and table-level access control sup-
ported by traditional databases, cloud databases in combination with ecosystem tools
can also achieve column-­level and row-level access control (i.e., content-based access
control) and support on-demand configuration to meet the access control requirements
of different industries. Visibility means that a database is no longer a “black box” but
something that can provide complete log audit capabilities to ensure that all operations
on the cloud database are recorded and management permissions are controlled by the
user. The security and trust technology of cloud databases covers the authentication,
protection, and auditing of data access.

1.3 Key Components of Relational Databases

A DBMS is a complex mission-critical software system. DBMSs together with operating systems and middleware are referred to as the three major types of basic
system software. Today’s DBMSs incorporate the achievements of decades of
research from academia and industry, as well as the software development results
from enterprises. As mentioned earlier, relational databases occupy a dominant
position in numerous online information systems in finance, telecommunications,
transportation, energy, and other fields. After decades of development, relational
databases have gradually converged in terms of technical architecture. In addition to
common components, relational databases mainly include four parts, namely, an
access management component, a query engine, a transaction processing system,
and a storage engine. The transaction processing system can be embedded in the
storage engine, to provide the upper-layer query engine with storage capabilities
that have ACID guarantees, as shown in Fig. 1.2. These parts work together to com-
plete the SQL request processing procedure.

1.3.1 Access Management Component

A DBMS provides client drivers that comply with standard interface protocols such
as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC).
User programs can establish database connections with DBMS server programs by
using the APIs provided by the client drivers and send SQL requests. After receiving
a connection establishment request from a client, a DBMS determines whether to
establish the connection according to the protocol requirements. For example, the
DBMS checks whether the client address meets the access requirements and per-
forms security and permission verification on the user that uses the client. If the user
passes the verification, the corresponding database connection is established, and
resources are allocated for the connection. Then, a session context is created to
execute subsequent requests. All requests sent through this connection use the set-
tings in the session context until the session is closed.
When a DBMS receives the first request sent by the client, the DBMS allocates the
corresponding computing resources. This process is related to the implementation of
the DBMS. Several databases, such as PostgreSQL, use the process-per-­connection
model, which creates a child process to handle all requests on the connection. Other databases, such as MySQL, use the thread-per-connection model, which creates an independent thread. In some more complex designs, no separate service thread or process is created for each session. Instead, a thread pool or process pool model is used. In this case, multiple sessions share a set of threads or processes to prevent the system resources from being overloaded by an excessive number of connections. This mechanism is implemented by the process management component.

Fig. 1.2 Architecture of a DBMS
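As a rough illustration of the pooled model described above, the sketch below shows a fixed set of worker threads shared by all sessions. The class and function names are hypothetical; a real server would add session state, scheduling policies, and error handling.

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal sketch of a worker pool: a fixed number of threads serves
// requests from many sessions, so one OS thread per connection is not needed.
class WorkerPool {
 public:
  explicit WorkerPool(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      workers_.emplace_back([this] { Run(); });
  }
  ~WorkerPool() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
  // Called by the network layer when a request arrives on some connection.
  void Submit(std::function<void()> request) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      queue_.push(std::move(request));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> job;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return stopping_ || !queue_.empty(); });
        if (stopping_ && queue_.empty()) return;
        job = std::move(queue_.front());
        queue_.pop();
      }
      job();  // execute the request in the submitting session's context
    }
  }
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> queue_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stopping_ = false;
};

int main() {
  WorkerPool pool(4);  // 4 workers shared by all sessions
  for (int i = 0; i < 8; ++i)
    pool.Submit([i] { std::cout << "executing request " << i << "\n"; });
}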

1.3.2 Query Engine

The SQL engine subsystem is responsible for parsing SQL requests sent by users,
as well as for performing semantic checks and generating a logical plan for the SQL
requests. After an SQL request is rewritten and optimized, a physical plan is gener-
ated and delivered to the plan executor for execution.
SQL statements are generally divided into two categories: Data Manipulation
Language (DML) statements and Data Definition Language (DDL) statements. DML
statements include SELECT, UPDATE, INSERT, and DELETE, whereas DDL state-
ments are used to maintain data dictionaries and include CRATE TABLE, CREATE
INDEX, DROP TABLE, and ALTER Table. DDL statements usually do not undergo
1.3 Key Components of Relational Databases 11

Fig. 1.3 Execution process of an SQL statement

the query optimization process in the query engine but are directly processed by the
DBMS static logic by calling the storage engine and catalog manager. This section
discusses how DML statements are processed and uses the simple statement “SELECT
a, b FROM tbl WHERE pk >= 1 and pk <= 10 ORDER BY c” to demonstrate the
general execution process of SQL statements, as shown in Fig. 1.3.

1.3.2.1 Query Parsing

First, an SQL request is subject to syntax parsing. In most DBMSs, tools like Lex
and Yacc are used to generate a lexical and grammar parser known as a query parser.
The query parser checks the validity of the SQL statement and converts the SQL
text to an abstract syntax tree (AST) structure. Figure 1.3 shows the syntax tree
obtained after statement parsing. The syntax tree has a relatively simple structure. If
a subquery is nested after the FROM or WHERE clause, the subtree of the subquery
will be attached to the nodes.
Then, the system performs semantic checks on the AST to resolve name and refer-
ence issues. For example, the system checks whether the tables and fields involved in
the operation exist in the data dictionary, whether the user has the required operation
permissions, and whether the referenced object names are normalized. For example,
the name of each table is normalized into the “database”.”schema”.”table name” for-
mat. After the checks are passed, a logical query plan is generated. The logical query
plan is a tree structure composed of a series of logical operators. The logical query
plan cannot be directly executed, and the logical operators are usually tag data struc-
tures used to carry necessary operational information to facilitate subsequent optimi-
zation and generate an actual execution plan.
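The sketch below gives one possible in-memory shape of such a logical plan for the example statement. The node layout and names are simplified assumptions rather than the structures of any specific DBMS.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// A hypothetical, greatly simplified logical plan node. Real systems carry
// much richer metadata (resolved column references, types, statistics).
struct LogicalOp {
  std::string kind;    // e.g., "Scan", "Filter", "Sort", "Project"
  std::string detail;  // human-readable payload for the example
  std::vector<std::unique_ptr<LogicalOp>> children;
};

std::unique_ptr<LogicalOp> Node(std::string kind, std::string detail) {
  auto op = std::make_unique<LogicalOp>();
  op->kind = std::move(kind);
  op->detail = std::move(detail);
  return op;
}

void Print(const LogicalOp& op, int depth = 0) {
  std::cout << std::string(depth * 2, ' ') << op.kind << ": " << op.detail << "\n";
  for (const auto& c : op.children) Print(*c, depth + 1);
}

int main() {
  // One possible logical plan for:
  //   SELECT a, b FROM tbl WHERE pk >= 1 AND pk <= 10 ORDER BY c
  // The sort is placed below the final projection here so that column c is
  // still available to it; systems differ in where they place the sort.
  auto scan = Node("Scan", "tbl");
  auto filter = Node("Filter", "pk >= 1 AND pk <= 10");
  filter->children.push_back(std::move(scan));
  auto sort = Node("Sort", "c");
  sort->children.push_back(std::move(filter));
  auto project = Node("Project", "a, b");
  project->children.push_back(std::move(sort));
  Print(*project);
}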

1.3.2.2 Query Rewriting

Query rewriting is a preparatory stage for query optimization that involves perform-
ing equivalent transformation on the original query structure while ensuring that the
semantics of the query statement remain unchanged to simplify and standardize the
query structure. Query rewriting is usually performed on the logical plan rather than
directly on the text. Some common query rewriting tasks are as follows:
1. View expansion: For each referenced view in the FROM clause, the defini-
tion is read from the catalog, the view is replaced with the tables and predi-
cate conditions referenced by the view, and any reference to columns in this
view is replaced with references to columns in the tables referenced by
the view.
2. Constant folding: Expressions that can be calculated during compilation are
directly compiled and rewritten. For example, “a < 2 * 3” is rewritten as
“a < 6.”
3. Predicate logic rewriting: The predicates in the WHERE clause are rewritten. For example, the contradictory predicate "a < 10 and a > 100" can never be true and can be directly converted to "false." In this case, an empty result set is returned. Alternatively, logical equivalent transformation may be performed. For example, "not a < 10" may be converted to "a >= 10." Another logic rewriting method is to introduce new predicates by using predicate transitivity. For example, "t1.a < 10 and t1.a = t2.x" implies the condition "t2.x < 10," which can be used to filter the data of the t2 table in advance.
4. Subquery expansion: Nested subqueries are difficult to handle at the optimiza-
tion stage. Therefore, they are usually joined during rewriting. For example,
“SELECT * FROM t1 WHERE id IN (SELECT id FROM t2)” can be rewritten
as “SELECT DISTINCT t1.* FROM t1, t2 WHERE t1.id = t2.id.”
Other rewriting rules are available. For example, semantic optimization and
rewriting may be performed based on the constraint conditions defined by the
schema. Nonetheless, these rewriting rules serve the same purpose, which is to bet-
ter optimize query efficiency and reduce unnecessary operations or to normalize the
query for easier processing in subsequent optimization.
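As a minimal, self-contained sketch of rule 2 (constant folding), the code below folds the expression "a < 2 * 3" into "a < 6" on a toy expression tree. The tree layout and helper names are hypothetical and cover only the operators needed for this example.

#include <iostream>
#include <memory>
#include <string>

// A toy expression tree used to illustrate constant folding.
struct Expr {
  enum Kind { kConst, kColumn, kMul, kLess } kind;
  long value = 0;       // for kConst
  std::string column;   // for kColumn
  std::unique_ptr<Expr> lhs, rhs;
};

std::unique_ptr<Expr> Const(long v) {
  auto e = std::make_unique<Expr>(); e->kind = Expr::kConst; e->value = v; return e;
}
std::unique_ptr<Expr> Column(std::string c) {
  auto e = std::make_unique<Expr>(); e->kind = Expr::kColumn; e->column = std::move(c); return e;
}
std::unique_ptr<Expr> Op(Expr::Kind k, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
  auto e = std::make_unique<Expr>(); e->kind = k; e->lhs = std::move(l); e->rhs = std::move(r); return e;
}

// Bottom-up folding: if both children of an arithmetic node are constants,
// replace the node with the computed constant.
std::unique_ptr<Expr> Fold(std::unique_ptr<Expr> e) {
  if (e->lhs) e->lhs = Fold(std::move(e->lhs));
  if (e->rhs) e->rhs = Fold(std::move(e->rhs));
  if (e->kind == Expr::kMul && e->lhs->kind == Expr::kConst && e->rhs->kind == Expr::kConst)
    return Const(e->lhs->value * e->rhs->value);
  return e;
}

std::string ToString(const Expr& e) {
  switch (e.kind) {
    case Expr::kConst: return std::to_string(e.value);
    case Expr::kColumn: return e.column;
    case Expr::kMul: return ToString(*e.lhs) + " * " + ToString(*e.rhs);
    case Expr::kLess: return ToString(*e.lhs) + " < " + ToString(*e.rhs);
  }
  return "";
}

int main() {
  // a < 2 * 3
  auto pred = Op(Expr::kLess, Column("a"), Op(Expr::kMul, Const(2), Const(3)));
  pred = Fold(std::move(pred));
  std::cout << ToString(*pred) << "\n";  // prints: a < 6
}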

1.3.2.3 Query Optimization

Query optimization involves converting the previously generated and rewritten logi-
cal plan into an executable physical plan. The conversion process is an attempt to
find the plan with the lowest cost. Finding the “optimal plan” is an NP-complete
problem, and the cost estimation is not accurate. Therefore, the optimizer can only
search for a plan with the lowest possible cost.
The technical details of query optimization are complex and will not be dis-
cussed here. In most cases, query optimizers combine two technologies: rule-
based optimization and cost-based optimization. Take, for example, the
open-source database MySQL, which is completely based on heuristic rules. For
instance, all Filter, Project, and Limit operations are pushed down as far as pos-
sible. In a multitable join, small tables are selected with priority based on cardi-
nality estimation, and the nested-loop join operator is typically selected as the join
operator. The cost-based approach searches for possible plans, calculates costs
based on cost model formulas, and selects the plan with the lowest cost. However,
searching for all possible plans is costly and, therefore, infeasible. Two typical
search methods are available. The first is the dynamic programming method described by Selinger in the paper "Access path selection in a relational database management system." [1] This method uses a bottom-up approach and focuses on "left-deep tree" query plans (where the input on the right side of a join operator must be a base table) to reduce the search space. The dynamic programming method avoids cross joins to ensure that any Cartesian product in the data stream is calculated after all joins. The second search method is a target-oriented top-down approach based on the Cascades framework. In some cases, top-down search can reduce
the number of plans that an optimizer needs to consider, but it increases the mem-
ory consumption of the optimizer.
Calculations based on the cost model are related to the specific operators and the
order in which they are executed. For example, when the join operator is selected,
the size of the result set must be estimated based on the join condition. This involves
selectivity, which is calculated based on column statistics. The accuracy of the esti-
mated cost depends on the accuracy of the statistics available. Modern databases not only collect the maximum value, minimum value, and number of distinct values (NDV) of each column but also provide more accurate histogram statistics. However, comput-
ing histograms of large datasets can lead to excessive overheads, so sampling is
often used as a compromise.
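The following sketch illustrates, under a uniform-distribution assumption, how simple column statistics turn a range predicate into a selectivity estimate and how that estimate feeds a toy cost comparison. All constants and formulas are illustrative rather than those of any real optimizer.

#include <algorithm>
#include <iostream>

// Illustrative column statistics and a textbook-style selectivity estimate
// for a range predicate "col <= v", assuming uniformly distributed values.
// Real optimizers refine this with histograms and NDV counts.
struct ColumnStats {
  double min_value;
  double max_value;
  double row_count;
};

double RangeSelectivity(const ColumnStats& s, double upper_bound) {
  if (s.max_value == s.min_value) return 1.0;
  double sel = (upper_bound - s.min_value) / (s.max_value - s.min_value);
  return std::clamp(sel, 0.0, 1.0);
}

int main() {
  ColumnStats pk{1.0, 1000.0, 1000.0};
  // Predicate: pk <= 10  ->  roughly 1% of the rows are expected to qualify.
  double sel = RangeSelectivity(pk, 10.0);
  double estimated_rows = sel * pk.row_count;

  // A toy cost comparison: a full scan reads every page; an index scan pays
  // a per-row random-I/O penalty. The constants are made up for the example.
  double pages = 100.0, cost_per_page = 1.0, cost_per_random_row = 2.0;
  double full_scan_cost = pages * cost_per_page;
  double index_scan_cost = estimated_rows * cost_per_random_row;

  std::cout << "estimated rows: " << estimated_rows << "\n"
            << "full scan cost: " << full_scan_cost << "\n"
            << "index scan cost: " << index_scan_cost << "\n";
}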

1.3.2.4 Query Execution

In most cases, executors use the Volcano model (also known as the Iterator model) as the execution model. This model was first introduced by Goetz Graefe in his paper
“Volcano, an extensible and parallel query evaluation system.” [2] The Volcano
model is a “pull data” model with a simple basic idea: to implement each physical
operator in the execution plan as an iterator. Each iterator comes with a get_next_
row() method, which returns a row of data (tuple) produced by the operator every
time the method is called. The program recursively calls the get_next_row() method
by using the root operator of the physical execution plan until all data is pulled.
Taking the physical plan in Fig. 1.4 as an example, the plan execution can retrieve
the entire result set by recursively calling the get_next_row() method of the Sort
operator.

// root is the root node of the physical plan tree.
while (root.has_next()) {
  row = root.get_next_row();
  add_to_result_set(row, result_set);
}

The top-level Sort operator is an aggregation operator that needs to pull all data
from the suboperators before it can sort the data.

Fig. 1.4 Query processing in the Volcano model

class SortOperator {
  SortedRowSet rowset;
  Cursor cursor;
  bool initialized = false;

  // Pull all rows from the child operator, then sort them in memory.
  void open() {
    while (child.has_next()) {
      row = child.get_next_row();
      add_to_row_set(row, rowset);
    }
    sort(rowset);
    cursor = 0;
  }

  // Return the sorted rows one at a time on each call.
  Row get_next_row() {
    if (!initialized) {
      open();
      initialized = true;
    }
    return rowset[cursor++];
  }
};
The implementation of the Projection operator is relatively simple. It only needs to pull one row of data from the suboperator and return the required columns.

class ProjectionOperator {
  // Pull one row from the child operator and keep only the required columns.
  Row get_next_row() {
    Row row = child.get_next_row();
    return select(row, columns);
  }
};

IndexScan, which is implemented by the storage engine, needs to scan the data
of the tables involved according to the index.
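To complete the picture, the following self-contained sketch shows one possible shape of an IndexScan iterator for the range predicate in the example query. An ordered in-memory map stands in for the B-tree index, and the interface is simplified to a single get_next_row() call; none of this reflects a real storage engine API.

#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <vector>

// A self-contained sketch of an IndexScan iterator in the Volcano style.
// The "index" is a sorted in-memory map from primary key to row; a real
// storage engine would walk B-tree leaf pages instead.
using Row = std::vector<std::string>;

class IndexScanOperator {
 public:
  IndexScanOperator(const std::map<int, Row>& index, int low, int high)
      : index_(index), low_(low), high_(high) {}

  // Volcano-style interface: return one row per call, or nothing when done.
  std::optional<Row> get_next_row() {
    if (!initialized_) {
      it_ = index_.lower_bound(low_);  // position at the first key >= low
      initialized_ = true;
    }
    if (it_ == index_.end() || it_->first > high_) return std::nullopt;
    return (it_++)->second;
  }

 private:
  const std::map<int, Row>& index_;
  std::map<int, Row>::const_iterator it_;
  int low_, high_;
  bool initialized_ = false;
};

int main() {
  std::map<int, Row> tbl = {{1, {"a1", "b1", "c1"}}, {5, {"a5", "b5", "c5"}},
                            {12, {"a12", "b12", "c12"}}};
  IndexScanOperator scan(tbl, 1, 10);  // WHERE pk >= 1 AND pk <= 10
  while (auto row = scan.get_next_row())
    std::cout << (*row)[0] << ", " << (*row)[1] << "\n";
}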

1.3.3 Transaction Processing System

One of the most important features of a database is guaranteed ACID (atomicity, consistency, isolation, and durability) semantics. The following describes the spe-
cific meaning of ACID:
• Atomicity: All behaviors of a transaction in the database must be “atomic,” that
is, all data manipulated by the transaction must be either completely committed
or completely rolled back.
• Consistency: This serves as a guarantee at the application level. The integrity
constraints of SQL statements are used to ensure consistency in the database
system. Given a consistency definition provided by a set of constraints, a transaction can be committed only if the consistency of the entire database is
maintained when the transaction is completed.
• Isolation: Each transaction in the database occupies exclusive resources, and any
two concurrently executed transactions are unaware of the uncommitted data of
each other.
• Durability: Updates to the database by a successfully committed transaction are
permanent, even in the event of software or hardware failures, unless the updates
are overwritten by another committed transaction.
Guaranteeing ACID in a database system is extremely complex and involves the concurrency control module and the logging and recovery systems.

1.3.3.1 Concurrency Control

A database is a multiuser system. This means that the database may receive a large
number of concurrent access requests at the same time. If the concurrent requests
access the same piece of data and one of the operations is a write operation, this situ-
ation is called a “data race.” If no appropriate protection mechanism is configured
to deal with data races, data read and write exceptions will occur. For example,
uncommitted dirty data of another transaction may be read, data written by a trans-
action may be overwritten by another transaction, or inconsistent data may be read
at different points in time within a transaction. The isolation feature discussed above
prevents unexpected data results caused by these exceptions. To achieve isolation,
concurrency control is defined, which is a set of data read and write access protec-
tion protocols.
The cost and execution efficiency vary based on the strictness of data consis-
tency. Stricter data consistency results in higher costs and lower execution effi-
ciency. Nowadays, most databases define multiple isolation levels based on different
levels of anomalies to balance efficiency and consistency. The isolation levels,
namely, read uncommitted, read committed, repeatable read, and serializable, are
ranked in ascending order of strictness of consistency. Users can select an appropri-
ate isolation level according to their business characteristics.
The strictest isolation level is serializable, which requires the results of inter-
leaved concurrent execution of multiple transactions to be the same as the results of
the serial execution of this group of transactions. Each individual transaction in the
group occupies exclusive resources and is unaware of the existence of other transac-
tions. This is the purpose of isolation. The main concurrency control techniques
include the following:
1. Two-phase locking (2PL): For each read/write operation in a transaction, the
data row to be read/written must be locked. A shared lock is added for read
operations so that all read operations can be performed concurrently. An exclu-
sive lock is added for write operations so that a write operation can be performed
only after the previous write operation is completed. The locking phase and
releasing phase must be sequential. In the locking phase, only new locks can be
added. In the releasing phase, only previously added locks can be released, and
no new locks can be added.
2. Multiversion concurrency control (MVCC): With MVCC, transactions do not
use a locking mechanism. Instead, multiple versions are saved for modified data.
When a transaction is executed, a point in time may be marked. Even if the data
is modified by other transactions after this point in time, the historical version of
the data before this point in time can still be read.
3. Optimistic concurrency control: All transactions can read and write data without
blocking, but all read and write operations are written to the read and write sets,
respectively. Before a transaction is committed, validation is performed to check
whether the read and write sets conflict with other transactions in the interval
between the start and commitment of the transaction. If a conflict exists, one of
the transactions must be rolled back.
The serializable isolation level can be achieved by strictly using 2PL. However,
the costs of adding a lock for each operation are high, especially for read operations.
Most concurrent transactions do not operate on the same data, but the costs of add-
ing locks still exist. To reduce these costs, many databases use a hybrid mode that
combines MVCC and 2PL. In this mode, locks are added to data manipulated by
write operations, but no locks are added to data manipulated by read operations and
a historical version of data at a specific point in time is accessed. The isolation level
provided by this method is called snapshot isolation, which is lower than the serial-
izable level and may have anomalies such as write skew.1 However, this isolation level avoids most other anomalies, provides better performance, and is often a better choice than other concurrency control techniques in most cases.

1 A write skew anomaly occurs when two transactions (T1 and T2) concurrently read a data set (e.g., values V1 and V2), concurrently make disjoint updates (e.g., T1 updates V1, and T2 updates V2), and are concurrently committed. This anomaly does not occur in serial execution of transactions but is possible under snapshot isolation. Consider V1 and V2 as Phil's personal bank accounts. The bank allows V1 or V2 to have a negative balance as long as the sum of both accounts is nonnegative (i.e., V1 + V2 ≥ 0). The initial balance of both accounts is USD 100. Phil initiates two transactions: T1 to withdraw USD 200 from V1 and T2 to withdraw USD 200 from V2. Each transaction sees the initial snapshot, finds the constraint satisfied, and commits, leaving V1 + V2 = −200. Write skew anomalies can be resolved by using two strategies: One is to materialize the conflict by adding a dedicated conflict table that both transactions modify. The other is to use a promotion strategy, where one transaction modifies its read-only data row (by replacing its value with an equal value) to cause a write conflict, or uses an equivalent update, for example, the SELECT FOR UPDATE statement.
Although optimistic concurrency control avoids lock waiting, a large number of
rollbacks will occur when a transaction conflict is detected. Therefore, optimistic
concurrency control is suitable for scenarios with a few transaction conflicts but
performs poorly when many transaction conflicts exist (e.g., flash sale scenarios that
involve inventory reduction operations when users purchase the same product).
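The following sketch illustrates the MVCC visibility rule described above: each committed write installs a new version, and a reader sees the newest version whose commit timestamp is not later than its snapshot. Locking, uncommitted versions, and garbage collection are omitted, and all names are illustrative.

#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// A minimal sketch of MVCC visibility. Each write creates a new version
// stamped with the committing transaction's timestamp; a reader sees the
// newest version whose commit timestamp is <= its snapshot.
struct Version {
  uint64_t commit_ts;
  std::string value;
};

struct VersionChain {
  std::vector<Version> versions;  // kept sorted by commit_ts, oldest first

  void Install(uint64_t commit_ts, std::string value) {
    versions.push_back({commit_ts, std::move(value)});
  }

  std::optional<std::string> ReadAt(uint64_t snapshot_ts) const {
    // Scan from newest to oldest and return the first visible version.
    for (auto it = versions.rbegin(); it != versions.rend(); ++it)
      if (it->commit_ts <= snapshot_ts) return it->value;
    return std::nullopt;  // the row did not exist at this snapshot
  }
};

int main() {
  VersionChain row;
  row.Install(10, "balance=100");
  row.Install(20, "balance=80");

  // A transaction that took its snapshot at ts=15 still sees the old value,
  // even though another transaction committed an update at ts=20.
  std::cout << *row.ReadAt(15) << "\n";  // balance=100
  std::cout << *row.ReadAt(25) << "\n";  // balance=80
}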

1.3.3.2 Logging and Recovery Systems

The logging system is a core part of the database storage engine that ensures the
durability of committed transactions and the atomicity of transactions that are
aborted or rolled back. The durability of committed transactions enables the data-
base to recover previously committed transactions after a crash. Many techniques
can be used to ensure the durability and atomicity of transactions. Taking the shadow
paging technique [3] proposed by Jim Gray in System R as an example, a new page
is generated for each modified page, and the new page is persisted when the transac-
tion is committed. In addition, the page pointer in the current page table is atomi-
cally changed to the address of the new page. In a rollback, the new page is simply
discarded and the original shadow page is used. Although this method is simple and
direct, it failed to become a mainstream technique because it does not support page-level transaction concurrency, has high recycling costs, and is inefficient. Most
mainstream databases currently use the logging mechanism proposed by C. Mohan
in the Algorithms for Recovery and Isolation Exploiting Semantics (ARIES)
paper [4].
Databases were originally designed for traditional disks. The sequential access
performance of traditional disks is higher than the random read/write performance.
User updates generally update some pages in a relatively random manner. If a page is flushed to the disk each time the page is updated and the update is committed,
massive amounts of random I/Os are produced. As a result, the atomicity of concur-
rent transactions within the page cannot be guaranteed. For example, when multiple
transactions update a page at the same time, the page cannot be flushed immediately
after a transaction is committed because other uncommitted transactions may still
be updating the page. Therefore, when a page is updated, the page content is only
updated in place in the memory cache. Then, the transaction operation log will be
recorded to ensure that the operation log is flushed to the disk before the page con-
tent when the transaction is committed. This technology is called write-ahead log-
ging (WAL).
To ensure the durability and atomicity of transactions, the sequence of flushing
the log, commit point, and data page to the disk must be strictly defined. Some of
the strategies that can be used to do this include force/no-force and steal/no-steal.
• Force/no-force: The log must be written to the disk before the data page. After
the transaction is committed (i.e., the commit marker is recorded), the commit-
ment is considered successful only after all pages updated by the transaction are
forcibly flushed to the disk. This is called the force strategy. If the updated pages
are not required to be flushed immediately to the disk, the pages can be asynchro-
nously flushed later. This is called the no-force strategy. No-force means that
some pages containing committed transactions may not be written to the disk. In
this case, the redo log must be recorded to ensure durability through playback
and rollback during recovery.
• Steal/no-steal: The steal strategy allows a data page that contains uncommitted
transactions to be flushed to the disk. By contrast, the no-steal strategy does not
allow this. When the steal strategy is used, uncommitted transactions exist on the
disk, and rollbacks must be recorded in the log to ensure that transactions can be
rolled back when they are aborted.
The ARIES protocol uses the steal/no-force strategy, which allows uncommitted
transactions to be flushed to the disk before the commit point and does not forcibly
require that the data page be written to the disk after the transaction is committed.
Instead, the time at which dirty pages are flushed to the disk can be autonomously
decided, and the optimal I/O mode is used. Theoretically, this is the most efficient
strategy. However, the redo log and rollback log must be recorded.
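The sketch below illustrates the write-ahead rule with a toy log: log records are appended for each update, the log is forced up to the commit record before the commit is acknowledged, and the dirty page itself may be flushed later (steal/no-force). Disk writes are simulated with print statements, and the structures are simplified assumptions rather than the actual ARIES data structures.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct LogRecord {
  uint64_t lsn;
  std::string body;  // e.g., redo and undo images of the change
};

class MiniWal {
 public:
  uint64_t Append(const std::string& body) {
    buffer_.push_back({next_lsn_, body});
    return next_lsn_++;
  }
  // Called at commit time: everything up to the commit record must reach
  // disk before the transaction is acknowledged (the write-ahead rule).
  void ForceUpTo(uint64_t lsn) {
    for (const auto& rec : buffer_)
      if (rec.lsn > flushed_lsn_ && rec.lsn <= lsn)
        std::cout << "flush log lsn=" << rec.lsn << " [" << rec.body << "]\n";
    flushed_lsn_ = std::max(flushed_lsn_, lsn);
  }
  uint64_t flushed_lsn() const { return flushed_lsn_; }

 private:
  std::vector<LogRecord> buffer_;
  uint64_t next_lsn_ = 1;
  uint64_t flushed_lsn_ = 0;
};

int main() {
  MiniWal wal;
  uint64_t lsn1 = wal.Append("T1 update page 7: x 1->2");
  uint64_t commit_lsn = wal.Append("T1 commit");
  wal.ForceUpTo(commit_lsn);  // commit point: the log reaches disk first
  // The dirty page itself can be flushed any time afterwards (no-force);
  // recovery replays the redo record if the flush never happened.
  std::cout << "page 7 may be flushed later; its log record (lsn=" << lsn1
            << ") is already durable up to lsn " << wal.flushed_lsn() << "\n";
}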

1.3.4 Storage Engine

In general, the data storage operations of a database table are completed at the
storage engine level. The TableScan and IndexScan physical operators involved in
data access call the data access methods provided by the storage engine to read
and write specific data. The database storage engine includes two major modules:
the data organization and index management module and the buffer manage-
ment module.

1.3.4.1 Data Organization and Index Management Module

The data organization structure of a database is determined by the target access efficiency. Different types of databases have different focuses. Early databases used
disks as storage media. To increase I/O efficiency, these databases used fixed-length
pages to manage data. The page size was aligned with the disk sector size, such as
8 KB or 16 KB. This method conveniently loads data to the memory buffer for
management.
In most cases, the data of a table contains a large number of rows that are orga-
nized into several pages in random or data write order. A table that uses this organi-
zation structure is called a heap table. Efficiently locating the required page during
a query and reducing the number of physical I/Os are crucial for improving database
query performance. To achieve these, indexes must be created for the table data.
Several indexing methods are available. By default, B-tree indexing is used. The
database separately stores the index structure, which is also organized in pages. In
B-tree indexing, the leaf nodes index the positions in the actual table data pages.
Primary key indexes are a special type of index. When a user specifies a primary key
for a table, the database sorts the data in the table based on the primary key and
aggregates sorted data in physical pages. This type of index is called a clustered
index. Clustered indexes do not require an additional index structure.
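The relationship between a heap table and its indexes can be sketched as follows; std::map stands in for the B-tree, and the page/slot layout is a simplified assumption. A secondary index maps a key to the physical position of a row, whereas a clustered table would store the rows themselves in key order.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A sketch of a heap table plus a secondary index: the index maps a key to
// the physical position (page number, slot) of the row. Page layout details
// such as slot directories and free space are omitted.
struct RowId {
  uint32_t page_no;
  uint16_t slot;
};

int main() {
  // Heap table: rows live wherever they were inserted.
  std::vector<std::vector<std::string>> heap_pages[2] = {
      {{"5", "bob"}, {"1", "amy"}},  // page 0
      {{"9", "joe"}}};               // page 1

  // Secondary B-tree index on the first column: key -> row position.
  std::map<int, RowId> index = {{1, {0, 1}}, {5, {0, 0}}, {9, {1, 0}}};

  // Point lookup "key = 5": descend the index, then fetch one heap page.
  RowId rid = index.at(5);
  const auto& row = heap_pages[rid.page_no][rid.slot];
  std::cout << "key 5 -> page " << rid.page_no << ", slot " << rid.slot
            << ", name=" << row[1] << "\n";
  return 0;
}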
As larger-capacity memories become available at lower prices, pure in-memory
databases gradually emerged. An in-memory database uses memories as the main
storage media, which supports efficient random access. In a traditional database that
uses disks as the storage media, an indirect access method in which logical struc-
tures are mapped to physical addresses is used. In an in-memory database, data
structures with row pointers and indexes are used to obtain shorter and more direct
access paths. The data organization structure on a page varies based on the usage
scenario. The following data organization models are available:
• N-ary storage model (NSM): Data is stored by row, which is more friendly to
transaction processing because transaction data is always written in the form of
complete rows. This model is mostly used in OLTP scenarios.
• Decomposition storage model (DSM): Identical column values in tuples are
physically stored together. This way, only the required columns are read. This
greatly reduces the I/Os when a large amount of data is scanned. In addition,
column storage has better compression effect and is suitable for OLAP scenar-
ios. However, this model is not convenient for transaction processing and requires
row-to-column conversion. Therefore, most analytical processing (AP) data-
bases have low transaction processing efficiency, and some even support only
batch import.
• Flexible storage model (FSM): This model uses a mixed layout of rows and col-
umns. In some structures, such as PAX and RCFile, data is first grouped into
segments or subpages by row and then organized by using the DSM. In some
structures, such as tiles in Peloton, data is first divided into column groups by
column, and then specified columns in a group are organized by using the
NSM. This model attempts to combine the advantages of the NSM and DSM to
implement hybrid transactional/analytical processing (HTAP). However, it also
inherits the disadvantages of both models.
Data organization involves more details, for example, how to save storage space
without compromising access efficiency on disks or in memory. For more informa-
tion, see related documents, such as ASV99 [5] and BBK98 [6].
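The difference between the NSM and DSM layouts can be illustrated with the same three tuples stored both ways; the table and column names are invented for the example. An aggregate over one column touches only that column's array in the DSM layout, while writing a complete tuple touches every array.

#include <iostream>
#include <string>
#include <vector>

struct OrderRow {            // NSM: all attributes of a tuple stored together
  int id;
  std::string customer;
  double amount;
};

struct OrderColumns {        // DSM: each attribute stored in its own array
  std::vector<int> id;
  std::vector<std::string> customer;
  std::vector<double> amount;
};

int main() {
  std::vector<OrderRow> nsm = {{1, "amy", 9.5}, {2, "bob", 3.0}, {3, "joe", 7.2}};

  OrderColumns dsm;
  for (const auto& r : nsm) {        // row-to-column conversion
    dsm.id.push_back(r.id);
    dsm.customer.push_back(r.customer);
    dsm.amount.push_back(r.amount);
  }

  double total = 0;
  for (double a : dsm.amount) total += a;   // scans only the amount column
  std::cout << "sum(amount) = " << total << "\n";
}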

1.3.4.2 Buffer Management Module

To perform read and write operations on data pages, the pages must be loaded from
the disk to the memory, and the content of the pages must be modified in the mem-
ory. In general, the capacity of the memory is much smaller than that of the disk.
Therefore, the pages loaded into the memory are only part of the data pages. Buffer
management covers how to decide the pages to be loaded into the memory based on
the read and write requests, when to synchronize modified pages to the disk, and
which pages are to be evicted from the memory.
In most databases, the content of pages in the memory is the same as that of
pages in the disk. This is beneficial in two aspects: First, fixed-sized pages are easy
to manage in the memory, and the allocation and recycling algorithms used to avoid
memory fragmentation are simple. Second, the format is consistent, and encoding
and decoding operations (also known as serialization and deserialization opera-
tions) are not required during data read and write, thereby reducing CPU workloads.
Databases use a data structure called a page table (hash table) to manage buffer pool
pages. The page table records the mappings between page numbers and page content,
including the page location in the memory, disk location, and page metadata. The meta-
data records some current characteristics of the page, such as the dirty flag and the refer-
ence pin. The dirty flag indicates whether the page has been modified after being read, and the reference pin indicates whether the page is referenced by ongoing transactions. Pages that are still pinned cannot be swapped out, and dirty pages must be written back to the disk before they are evicted.
The size of the buffer pool, which is usually fixed, is related to the physical
memory configured for the system. Therefore, when new pages need to be loaded
but the buffer pool is full, some pages must be evicted by using a page replacement
algorithm. Several buffer pool replacement algorithms, such as least recently used
(LRU), least frequently used (LFU), and CLOCK algorithms, may be unsuitable for
complex database access modes (e.g., full table scans in databases). If the LRU
algorithm is used when a large amount of data needs to be scanned, all data pages
will be loaded into the buffer pool. This will cause the original data pages to be
evicted from the buffer pool, resulting in a rapid and significant drop in the hit rate
of the buffer pool in a short time. At present, many databases use simple enhanced
LRU schemes to solve scan-related problems. For example, the buffer pool is divided into a cold zone and a hot zone. Pages read during a scan first enter the cold zone, and only pages that are hit again are promoted to the hot zone, thereby avoiding cache pollution. Many studies have also been conducted to find more targeted replacement
algorithms, such as LRU-K [7] and 2Q [8].
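The cold/hot zone enhancement described above can be sketched as follows. The zone sizes, the promotion policy, and the absence of pin counts and dirty flags are simplifications for the example rather than the behavior of any particular database.

#include <iostream>
#include <list>
#include <unordered_set>

// A sketch of a two-zone LRU: a page read for the first time enters the cold
// list, and only a repeated hit promotes it to the hot list, so a one-off
// full table scan cannot evict the working set.
class TwoZoneLru {
 public:
  explicit TwoZoneLru(std::size_t capacity) : capacity_(capacity) {}

  // Record an access to page_no, loading it into the pool if necessary.
  void Access(int page_no) {
    if (hot_.count(page_no)) {
      Touch(hot_order_, page_no);               // move to the hot MRU end
    } else if (cold_.count(page_no)) {          // second touch: promote
      cold_.erase(page_no);
      cold_order_.remove(page_no);
      hot_.insert(page_no);
      hot_order_.push_front(page_no);
    } else {                                    // first touch: cold zone
      cold_.insert(page_no);
      cold_order_.push_front(page_no);
      EvictIfNeeded();
    }
  }

  void Dump() const {
    std::cout << "hot pages: " << hot_.size()
              << ", cold pages: " << cold_.size() << "\n";
  }

 private:
  static void Touch(std::list<int>& order, int page_no) {
    order.remove(page_no);
    order.push_front(page_no);
  }
  void EvictIfNeeded() {
    // Evict cold pages first, so a large scan cannot flush the hot zone.
    while (hot_.size() + cold_.size() > capacity_) {
      if (!cold_order_.empty()) {
        cold_.erase(cold_order_.back());
        cold_order_.pop_back();
      } else {
        hot_.erase(hot_order_.back());
        hot_order_.pop_back();
      }
    }
  }
  std::size_t capacity_;
  std::unordered_set<int> hot_, cold_;
  std::list<int> hot_order_, cold_order_;
};

int main() {
  TwoZoneLru pool(4);
  pool.Access(1); pool.Access(1);                  // page 1 is hit twice -> hot
  for (int p = 100; p < 110; ++p) pool.Access(p);  // a "scan" of 10 pages
  pool.Dump();                                     // page 1 survives in the hot zone
}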

References

1. Selinger PG, Astrahan MM, Chamberlin DD, et al. Access path selection in a relational database management system. In: Proceedings of the 1979 ACM SIGMOD international conference on management of data; 1979. p. 23–34. https://doi.org/10.1145/582095.582099.
2. Graefe G. Volcano, an extensible and parallel query evaluation system. IEEE Trans Knowl
Data Eng. 1994;6(1):120–35.
3. Gray J, McJones P, Blasgen M, et al. The recovery manager of the System R database manager. ACM Comput Surv. 1981;13(2):223–42.
4. Mohan C, Haderle DJ, Lindsay BG, et al. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans Database Syst. 1992;17:94–162.
5. Arge L, Samoladas V, Vitter JS. On two-dimensional indexability and optimal range search indexing. In: Proceedings of the ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems; 1999. p. 346–57.
6. Berchtold S, Böhm C, Kriegel H-P. The pyramid-technique: towards breaking the curse of
dimensionality. In: ACM-SIGMOD international conference on management of data; 1998.
p. 142–53.
7. O'Neil EJ, O'Neil PE, Weikum G. The LRU-K page replacement algorithm for database disk buffering. In: ACM SIGMOD international conference on management of data, vol. 22; 1993. p. 297–306.
8. Johnson T, Shasha D. 2Q: a low overhead high performance buffer management replacement
algorithm. In: International conference on very large data bases (VLDB); 1994. p. 297–306.
Chapter 2
Database and Cloud Nativeness

Cloud platforms create a novel operating system for various components by employ-
ing technologies such as containerization, virtualization, and orchestration.
However, achieving a highly available, high-performance, and intelligent cloud
database system by using the virtualization and elastic resource allocation capabili-
ties offered by cloud platforms is challenging. Over time, cloud database systems
have become clearly distinct from traditional database systems.

2.1 Development of Databases in the Cloud Era

2.1.1 Rise of Cloud Computing

In the past, most enterprises built their IT infrastructure through hardware procurement and IDC rental. Professional expertise is required to perform O&M of servers,
cabinets, bandwidth, and switches and handle other matters such as network con-
figuration, software installation, and virtualization. The rollout period for system
adjustment is long and involves a series of procedures, such as procurement, supply
chain processing, shelf placement, deployment, and service provisioning. Enterprises
must plan their IT infrastructure in advance based on their business development
requirements, and redundant resources must be reserved to ensure that the system
capacity can cope with business surges. However, business development does not
always follow the planned path, especially in the Internet era. It may go beyond
expectations and overload the system or may be below expectations, resulting in
massive idle resources.
Cloud computing is the answer to the preceding problems. Cloud computing
provides the IT infrastructure as a service (IaaS) for informatization, enabling enter-
prises and individual users to use the IT infrastructure on demand, without the need
to build their IT infrastructure. Similarly, enterprise users do not need to purchase
hardware and build IDCs when they need computing resources. They can purchase
resources from cloud computing service providers as needed.
In 2006, Google CEO Eric Schmidt proposed the cloud computing concept at the
Search Engine Strategies conference (SES San Jose 2006). In the same year, Amazon
launched Amazon Web Services (AWS) to provide public cloud services. Later,
Internet giants outside China, such as Microsoft, VMware, and Google, and Chinese
companies, such as Alibaba, Tencent, and Huawei, successively launched cloud ser-
vices. Soon enough, cloud computing became the preferred IT service for enterprises.
For enterprise users, IT service construction no longer means holding heavy
assets as they can now purchase computing resources or services from cloud comput-
ing vendors based on their business needs. Prompted by massive requirements from
different users, cloud vendors are able to establish a super-large resource pool and
provide a unified, virtualized abstraction interface based on the resource pool. A
cloud is basically a huge operating system built on diversified hardware by using
technologies such as containers, virtualization, orchestration and scheduling, and
microservices. With cloud technologies, users no longer need to pay attention to
hardware differences, lifecycle management, networking, high availability, load bal-
ancing, security, and other details. With the resource pooling capability, the cloud
boasts a unique advantage of elasticity to meet different computing requirements of
different businesses at different periods of time by using flexible scheduling strategies.

2.1.2 Database as a Service

With the IaaS layer as the cornerstone, cloud computing service providers have
established more layers, such as the platform as a service (PaaS) and software as a
service (SaaS) layers, to provide appropriate platforms for various application sce-
narios in the cloud.
As important foundational software, databases have been cloudified at an early
stage. In addition, databases, along with operating systems, storage, middleware,
and other components, form a standard cloud-based PaaS system. Most major cloud
vendors provide cloud database services, which can be roughly categorized into
cloud hosting, cloud services, and cloud-native models depending on the ser-
vice mode.

2.1.2.1 Cloud Hosting

Cloud hosting is a deployment mode that closely resembles traditional database sys-
tems. Essentially, cloud hosting involves deploying, on cloud hosts, traditional data-
base software that was originally deployed on physical servers or virtual servers in
an IDC. In this deployment mode, the cloud service provider merely serves as an
IDC provider to database users and provides the users with computing and storage
resources that are hosted on cloud hosts. Users are responsible for the availability,
security, and performance of their database systems. The costs of owning a database
system deployed in cloud hosting mode are the same as the costs of building a data-
base system in an IDC. Moreover, the users still need to have their own IT O&M
team and DBAs to ensure normal database operation. In cloud hosting mode, cus-
tomers must resort to their own technical capabilities and DBA team to obtain
enterprise-­level database management system capabilities, such as high availability,
remote disaster recovery, backup and recovery, data security, SQL auditing, perfor-
mance tuning, and status monitoring. Therefore, the total cost of ownership (TCO)
for a customer who deploys a database system in cloud hosting mode covers the
human resource costs of the DBA team.

2.1.2.2 Cloud Services

The cloud service model is a step further than the cloud hosting model. In this
model, users can directly use the database services provided by the cloud service
provider without concerning themselves with the deployment of database manage-
ment software. In general, cloud service providers offer various traditional database
services, such as MySQL, SQL Server, and PostgreSQL. Users can directly access
the database by using the access link of the cloud database service and the JDBC or
ODBC interface.
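As a simple illustration of this access mode, the following sketch connects to a hypothetical MySQL-compatible cloud database service through a standard driver. The endpoint, credentials, and table names are placeholders invented for this example, not values from any real service.

```python
# Minimal sketch: accessing a cloud database service through a standard
# MySQL driver. The host, user, password, and database are hypothetical
# placeholders that a cloud provider's console would supply.
import pymysql

connection = pymysql.connect(
    host="mydb-instance.example-cloud-provider.com",  # service access link
    port=3306,
    user="app_user",
    password="app_password",
    database="orders",
)
try:
    with connection.cursor() as cursor:
        # The application issues ordinary SQL; deployment, high availability,
        # and backups are handled by the cloud service behind this endpoint.
        cursor.execute("SELECT id, status FROM orders WHERE id = %s", (42,))
        print(cursor.fetchone())
finally:
    connection.close()
```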
A database management system that provides services in the cloud service model
usually incorporates enterprise-level features. When providing cloud database ser-
vices, cloud service providers also provide corresponding enterprise-level features,
including but not limited to high availability, remote disaster recovery, backup and
recovery, data security, SQL auditing, performance tuning, and status monitoring.
In addition, cloud database services typically include online upgrades, scaling, and
other services. These are essentially resource management capabilities provided by
cloud service providers for cloud database services.
Another advantage of the cloud service model over the cloud hosting model is
that users do not need to have their own DBA team. In most cases, the cloud service
provider provides database O&M services. Some even offer expert services, such as
data model design, SQL statement optimization, and performance testing.

2.1.2.3 Cloud-Native Model

The cloud service model reduces the TCO of a database system through integrated
O&M services and supply chain management capabilities, allowing traditional
database users to enjoy the convenience brought by cloud computing to database
systems. However, the architecture of traditional database systems limits the full
play of the advantages of cloud computing. For example, the on-demand resource
usage, rapid elasticity, high performance, and high availability brought by cloud
computing cannot be fully provided in the cloud service model. Hence, cloud-native
databases have emerged to address these issues.
The concept of cloud native was first proposed by Pivotal in 2014. One year later, the
Cloud Native Computing Foundation was established. There is still no clear definition
for “cloud native,” but this term is used to refer to new team culture, new technology
architectures, and new engineering methods in the era of cloud computing. To achieve
cloud nativeness, a flexible engineering team uses highly automated R&D tools by fol-
lowing agile development principles to develop applications that are based on and
deployed in cloud infrastructure to meet rapidly changing customer needs. These appli-
cations adopt an automated, scalable, and highly available architecture. The engineering
team then provides application services through efficient O&M based on cloud comput-
ing platforms and continuously improves the services based on online feedback.
In the database field, cloud-native database services involve database manage-
ment systems built on the cloud infrastructure, highly flexible database software
development and IT operations (DevOps) teams, and supplementary cloud-native
ecological tools. From the user’s perspective, cloud-native database services must
have core capabilities such as compute-storage separation, extreme elasticity, high
availability, high security, and high performance. These database services must also
have intelligent self-awareness capabilities, including self-perception, self-­
diagnosis, self-optimization, and self-recovery. Security, monitoring, and smooth
flow of data can be achieved by using cloud-native ecological tools. A database
technology team that follows DevOps conventions can implement rapid iteration
and achieve functional evolution of database services.

2.2 Challenges Faced by Databases in the Cloud-Native Era

Traditional database architectures rely on high-end hardware. Each database system has a simple architecture with only a few servers and cannot be scaled out to meet the
requirements of new businesses. Cloud-native databases use a distributed database
architecture to achieve high scalability. Each database system spans multiple serv-
ers and virtual machines (VMs), bringing new system management challenges [1].
The fundamental challenge is how to achieve elasticity and high availability, to
attain on-demand resource usage while ensuring efficient resource utilization. In
traditional big data processing, distributed horizontal scaling is implemented at the
expense of some ACID features to meet the requirements in many scenarios.
However, the ACID requirements of applications have always existed and are even
higher in distributed parallel computing scenarios. Therefore, cloud databases face
significant challenges in coordinating distributed transactions, optimizing distrib-
uted queries, and ensuring strong consistency and strong ACID properties. Some of
the other challenges faced by cloud-native databases include the following:
• Operational challenges brought by automated deployment and scaling of multi-
ple servers.
• Real-time monitoring and security auditing in complex cloud environments, covering node failures and performance issues.
• Management of multiple database systems and their business systems.
• Migration of massive amounts of data.

2.3 Characteristics of Cloud-Native Databases

2.3.1 Layered Architecture

The most notable feature of the architecture of cloud-native databases is the decom-
position of the originally monolithic database [2, 3]. The resulting layered architec-
ture includes three layers: the computing service layer, the storage service layer, and
the shared storage layer. The computing service layer parses SQL requests and con-
verts them into physical execution plans. The storage service layer performs data
cache management and transaction processing, to ensure that data updates and reads
comply with the ACID semantics of transactions. In terms of implementation, the
storage service layer may not be physically independent and may be partially inte-
grated into the computing service layer and the shared storage layer. The shared
storage layer is responsible for the persistent storage of data and ensures data con-
sistency and reliability by using distributed consistency protocols.

2.3.2 Resource Decoupling and Pooling

In the cloud-native era, the cloud infrastructure is pooled by using virtualization technologies. By leveraging the layered architecture described in Sect. 2.3.1, cloud-
native databases can effectively decouple computing and storage resources and
scale them separately. After resource pooling, cloud-native databases can use
resources on demand and flexibly schedule them.
Nowadays, CPU and memory resources are still coupled but are decoupled from
SSDs. With the development of NVMs and the RDMA technology, CPU and mem-
ory resources may be decoupled, and the memory resources may be pooled to form
a three-tier resource pool, further enhancing isolation and elasticity and better help-
ing customers achieve on-demand resource usage.

2.3.3 Elastic Scalability

When ACID (atomicity, consistency, isolation, and durability) requirements are high, the distributed architecture of traditional middleware-based distributed shard-
ing solutions and enterprise-level distributed databases faces considerable system
performance challenges. In this architecture, data can only be sharded and
partitioned based on one logical scheme, and the business logic and sharding logic
are not perfectly aligned, resulting in transactions that may cross databases and
shards. For instance, at a high level of isolation, the system throughput is signifi-
cantly compromised when distributed transactions account for over 5% of total
transactions. Perfect sharding strategies are nonexistent. Hence, ensuring high con-
sistency of data in the distributed architecture is a significant challenge that must be
addressed for distributed businesses.
The cloud-native architecture essentially consists of three layers: (1) the underly-
ing layer, which is the shared storage layer for the distributed architecture; (2) the
upper layer, which serves as the shared computing pool for the distributed architec-
ture; and (3) the intermediate layer, which is used for computing and storage decou-
pling. This architecture provides elastic high availability capabilities and facilitates
the centralized deployment of the distributed technology, making the architecture
transparent to applications.

2.3.4 High Availability and Data Consistency

In a distributed system, multiple nodes communicate and coordinate with each other
through message transmission. This process inevitably involves issues such as node
failures, communication exceptions, and network partitioning. A consensus proto-
col can be used to ensure that multiple nodes in a distributed system that may expe-
rience the abovementioned abnormalities can reach a consensus.
In the field of distributed systems, the consistency, availability, and partition tol-
erance (CAP) theorem [4] ensures that any network-based data-sharing system can
deliver only two out of the following three characteristics: consistency, availability,
and partition tolerance. Consistency means that after an update operation is per-
formed, the latest version of data is visible to all nodes, and all nodes have consis-
tent data. Availability refers to the ability of the system to provide services within a
normal response time when some nodes in the cluster are faulty. Partition tolerance
is the ability of the system to maintain service consistency and availability in the
event of node failure or network partitioning. Given the nature of distributed sys-
tems, network partitioning is bound to occur, thereby necessitating partition toler-
ance. Therefore, trade-offs must be made between consistency and availability. In
actual applications, cloud-native databases typically adopt a multi-replica replication approach governed by consensus protocols such as Paxos and Raft to balance system availability and consistency, compromising some strong consistency in exchange for enhanced system availability.
When used online, cloud-native databases provide different high availability
strategies. A high availability strategy is a tailored combination of service prioritiza-
tion strategies and data replication methods selected based on the characteristics of
user businesses. Users can use two service prioritization strategies to balance avail-
ability and consistency:
• Recovery time objective (RTO) first: The database must restore services as soon
as possible to maximize its available time. This strategy is suitable for users who
have high requirements for database uptime.
• Recovery point objective (RPO) first: The database must ensure as much data
reliability as possible to minimize data loss. This strategy is suitable for users
who have high requirements for data consistency.

2.3.5 Multitenancy and Resource Isolation

Multitenancy means that one system can support multiple tenants. A tenant is a group of users with similar access patterns and permissions, typically from the same organization or company. To implement multitenancy effectively, multitenancy at the database layer must be considered.
The multitenancy model at the database layer significantly affects the implementa-
tion of upper-level services and applications. Multitenancy usually involves resource
sharing. Therefore, corresponding measures must be available to prevent one tenant
from exhausting system resources and affecting the response time of other tenants.
In a multitenancy architecture, either a separate database system is deployed for each tenant, or multiple tenants share the same database system and are isolated by using namespaces. However, the O&M and management of these approaches are complex. In
cloud-native scenarios, computing and storage nodes in a database can be bound to
different tenants to achieve resource isolation and scheduling for the tenants.

2.3.6 Intelligent O&M

Intelligent O&M is a crucial characteristic of cloud-native databases. Cloud-native databases offer an easy-to-use operation interface and automated processes to help
users easily complete routine O&M tasks. Cloud-native databases also support the
following features:
• Custom backup strategies, which enable users to restore data to any point in time.
• Automatic online hot upgrades, which fix known bugs in a timely manner.
• Monitoring of resources and engines and integration of custom alerting strategies
in cloud monitoring platforms.
• Node failure detection within seconds and failover within minutes.
• Expert-level self-services, which can solve performance problems in most
scenarios.
References

1. Li F. Cloud-native database systems at Alibaba: opportunities and challenges. Proc VLDB Endow. 2019;12(12):2263–72.
2. Verbitski A, Gupta A, Saha D, et al. Amazon Aurora: design considerations for high throughput cloud-native relational databases. In: Proceedings of the 2017 ACM international conference on management of data; 2017. p. 1041–52.
3. Corbett JC, Dean J, Epstein M, et al. Spanner: Google's globally distributed database. ACM Trans Comput Syst. 2013;31(3):1–22.
4. Gilbert S, Lynch N. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News. 2002;33(2):51–9.
Chapter 3
Architecture of Cloud-Native Database

This chapter presents the architecture of cloud-native databases in three aspects. First, it briefly introduces the essence of cloud computing and databases, analyzes
the limitations of traditional distributed databases, and proposes improvement
methods. Second, it explores the features of the architecture of cloud-native data-
bases. Lastly, it illustrates the design principles of three representative cloud-native
databases.

3.1 Design Principles

3.1.1 Essence of Cloud-Native Databases

Before we discuss the database forms and technological trends in the cloud comput-
ing era, let us first delve into the essence of cloud computing and databases.
Cloud computing pools various IT infrastructure resources to integrate the com-
puting, communication, and storage resources that customers require for centralized
management. This enables customers to build large-scale information systems and
infrastructure without the need to build IDCs (Internet data centers), purchase hard-
ware facilities, deploy basic network, or install operating systems and software,
significantly reducing investment costs at the initial stage. With the resource virtu-
alization and pooling technologies of cloud computing, customers can also elasti-
cally adjust their infrastructure to quickly respond to changes in business traffic. In
addition, cloud service providers can use, maintain, and manage massive resources
in a centralized manner. This greatly improves the technological capabilities and
supply chain management capabilities of cloud service providers and leads to
greater economies of scale, significantly improving the overall resource utilization.
Databases can be analyzed based on database users. Users utilize the computing
and storage capabilities of databases to complete the full-link process starting from
data production to data storage, processing, and consumption. Therefore, database systems must be able to provide functional and nonfunctional support for the full-link
process. In traditional database systems, software runs on hardware systems of the
Von Neumann architecture. The basic concepts involved in the Von Neumann archi-
tecture are “stored program” and “program control.” The “stored program” concept
mandates that the code run by the computer and its data be stored at specific locations, and
the “program control” concept mandates that the computer fetch and effectively exe-
cute instructions in a specific logical order. In the context of database system software,
the essence of database management systems is to provide computing and storage
capabilities. Specifically, the computing capabilities of the compute nodes are lever-
aged to perform user-specified analysis and computation operations on data stored in
the storage to obtain computational results, ultimately achieving data application.
Computing, storage, and communication between components are essential capa-
bilities of database systems. As such, many of the current research works in the era of
cloud computing focus on how to leverage the powerful computing, storage, and com-
munication capabilities provided by cloud computing to achieve high availability, high
performance, elasticity, and high security at all levels of database systems. Different
architectures have different degrees of compatibility with the cloud computing archi-
tecture. A standalone database can be installed on a single cloud host provided by a
cloud service provider. In this case, the computing and storage capabilities of the data-
base are subject to those of the cloud host. At present, the mainstream technology for
deploying cloud hosts is virtualization. Therefore, if a standalone database system is
deployed on a cloud host, the following performance limitation rule applies:
Database < Cloud host (Container) < Host machine (Physical server).
In other words, when a traditional standalone database management system is
deployed on a cloud host, the cloud host is used as an ordinary server, and the advan-
tages of cloud computing cannot be fully utilized. A more advanced option is a distrib-
uted database management system into which appropriate nodes can be added based
on the computational complexity and the data scale to meet computational and storage
requirements. To some extent, this meets the requirements for scalability. Nevertheless,
the performance of individual nodes in the cluster still satisfies the preceding rule.
Although most databases can run in the cloud, whether the advantages of the cloud
platform can be fully leveraged still depends on the system architecture of the data-
base. In the long run, it is beneficial to design, build, and run database systems on
cloud computing platforms. Therefore, it is of vital importance to design a database
system architecture that conforms to the characteristics of elastic resource manage-
ment in cloud computing. Cloud-native databases rise in response.

3.1.2 Separation of Computing and Storage

The lack of elasticity in traditional distributed databases and the performance bot-
tleneck of single nodes are the result of the binding of computing and storage of
individual nodes. This necessitates a technical architecture that separates computing
and storage. Currently, cloud-native databases are developing toward the compute-­
storage-­separated architecture. When implementing this architecture, cloud service
providers typically bind the CPU and memory together while separately deploying
the persistent storage, such as SSDs and HDDs. With the development of NVM
technologies, the CPU and memory can be further isolated in the future, and mem-
ory resources can be pooled to form a three-tier resource pool, helping customers
better achieve on-demand resource usage.
On the basis of the Von Neumann architecture, a database system can be
abstracted into a three-layer architecture that consists of the computing, communi-
cation, and storage layers. Cloud-native databases can ensure that the resources at
each layer can be independently scaled. Computing and communication resources
are stateless infrastructure resources. Therefore, compute nodes and communica-
tion nodes can be quickly started and closed during resource scaling to fully utilize
the elasticity of cloud computing. The storage layer is completely pooled and used
on-demand. In terms of specific processing technologies, the computing layer is
stateless and only processes business logic without persistently storing data and
therefore mainly involves distributed computing technologies, including but not
limited to distributed transaction processing, MPP, and distributed resource sched-
uling. Meanwhile, the storage layer only stores data and does not process business
logic and therefore mainly involves data consistency, security, and multimodal data
storage models in distributed scenarios.
Against the backdrop of cloud computing, new subjects on database system archi-
tectures have been raised and new challenges arise. A cloud-native database system is
designed based on the principle that its core components can fully utilize the resource
pooling feature of cloud computing to deliver more efficient and secure data services.
From the perspective of technical implementation, stateful storage resources and
stateless computing resources must be distinguished and adopt different resource
scheduling and utilization strategies on the premise that security, reliability, and cor-
rectness are guaranteed to minimize data movement and reduce additional computing,
storage, and communication overheads. Programming interfaces compatible with tra-
ditional database systems are preferred to achieve smoother learning curves for users
and enable users to complete the entire process of data production, storage, process-
ing, and consumption in a more convenient and efficient manner.

3.2 Architecture Design

One of the most prominent features of the architecture of cloud-native databases is the decomposition of the originally integrated databases. In the decomposed archi-
tecture, computing and storage resources are completely decoupled, and local stor-
age is replaced with distributed cloud storage, transforming the computing layer
into a stateless (serverless) one. In a cloud-native database, resources are pooled for
each layer of service, and each resource pool is independently scaled in real time
based on the workload to maximize resource utilization.
Fig. 3.1 Layered architecture of a cloud-native database

As shown in Fig. 3.1, SQL requests sent by the client are forwarded by a proxy
layer to any node in the computing service layer for processing. The proxy layer is
a simple load balancing service. The computing service layer parses the SQL
requests and converts the requests into physical execution plans. The execution of a
physical execution plan involves transaction processing and data access and is per-
formed by the storage service layer. As mentioned in Chap. 1, the storage service
layer is responsible for data cache management and transaction processing and
manages and organizes data in the form of data pages to ensure that the updating
and reading of data pages comply with the ACID (atomicity, consistency, isolation,
and durability) semantics of transactions. In practice, the storage service layer may
not be physically separated and may be partially integrated into the computing ser-
vice layer and the shared storage layer.
The shared storage layer persistently stores data pages and ensures the high
availability of database data. Typically, the shared storage layer is implemented as a
distributed file system that uses multiple replicas and distributed consensus proto-
cols to ensure data consistency and reliability. The compute-storage-separated
architecture allows each layer to be independently and elastically scaled to achieve
theoretically optimal allocation of resources. Thanks to the shared storage design,
all data views visible to compute nodes are complete, and the expansion of comput-
ing capabilities can be achieved in real time without the need for extensive data
migration as required in other databases of the MPP architecture. However, this can
also be problematic. If each node in the storage service layer handles write transac-
tions, data conflicts will inevitably occur. In addition, handling cross-node data con-
flicts will require massive network communications and complex processing
algorithms, resulting in a high processing cost. To simplify implementation, some
cloud-native databases designate one of the nodes as the update node and the others
as read-only nodes. The read-only nodes need to provide access to consistent data
pages based on the transaction isolation semantics. The shared storage layer is not
equivalent to a general distributed file system, such as Google File System (GFS) or
Hadoop Distributed File System (HDFS), but is designed to adapt to the paged seg-
mentation structure of databases. The size of a data block is selected based on the
I/O pattern of the database. More importantly, data playback logic is integrated into
the shared storage layer, which uses distributed capabilities to increase concurrency
and improve page update performance.
Different cloud-native databases may use different layering logic. In most cloud-­
native databases, SQL statement parsing, physical plan execution, and transaction
processing are implemented in the computing layer, and transaction-generated log
records and data are stored in the shared storage layer (also known as the storage
layer). In the storage layer, data is stored in multiple replicas to ensure data reliabil-
ity, and consensus protocols, such as Raft, are used to ensure data consistency.
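As a toy illustration of the routing implied by this layering, with a single read-write node and several read-only nodes over shared storage, consider the sketch below. The proxy logic is invented for exposition and is far simpler than a real load balancing service.

```python
# Toy proxy routing for a one-writer, multi-reader cloud-native cluster:
# write statements go to the single read-write node, while reads are
# balanced across read-only nodes that share the same underlying storage.
import itertools

class ProxySketch:
    def __init__(self, primary: str, readers: list[str]):
        self.primary = primary
        self.reader_cycle = itertools.cycle(readers)

    def route(self, sql: str) -> str:
        # Crude read detection, sufficient for the illustration.
        is_read = sql.lstrip().lower().startswith(("select", "show"))
        return next(self.reader_cycle) if is_read else self.primary

proxy = ProxySketch("rw-node", ["ro-node-1", "ro-node-2"])
assert proxy.route("SELECT * FROM t") in {"ro-node-1", "ro-node-2"}
assert proxy.route("UPDATE t SET c = 1") == "rw-node"
```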

3.3 Typical Cloud-Native Databases

3.3.1 AWS Aurora

Aurora [1, 2] is a cloud-native database service launched by AWS. Aurora implements a compute-storage-separated architecture based on MySQL and is intended for OLTP
(online transaction processing) scenarios. The overall architecture of Aurora is shown
in Fig. 3.2. The basic design concept of Aurora is that in a cloud environment, the
database is no longer bottlenecked by computing or storage resources but by the net-
work. Aurora adopts a compute-storage-separated architecture in which log process-
ing is pushed down to the distributed storage layer, thereby solving the network
bottleneck through architectural optimization. Storage nodes in Aurora are loosely
coupled with the database instances (the compute nodes) and have some computing
functionalities.
Fig. 3.2 Overall architecture of Aurora [1]
Core features, such as query processing, transaction execution, locking, cache management, access interfaces, and undo log management, are still
implemented by the database instances. However, features related to the redo log,
including log processing, fault recovery, and backup and recovery, are pushed down
to the storage layer. Compared with traditional databases, Aurora is advantageous in
three aspects: First, the underlying database storage is a distributed storage service
that facilitates fault handling. Second, the database instances only write redo log
records to the underlying storage layer. This greatly reduces the network pressure
between the database instances and the storage nodes and provides a guarantee for
improving database performance. Lastly, some core features, such as fault recovery
and backup restoration, are pushed down to the storage layer and can be executed
asynchronously in the backend without affecting foreground user tasks.

3.3.1.1 Write Amplification in Traditional Databases

Traditional databases suffer from severe write amplification issues. For example, in
standalone MySQL, log records are flushed to the disk each time a write operation
is performed, and the back-end thread asynchronously flushes dirty data to the disk.
In addition, data pages also need to be written to the double-write area during the
flushing of dirty pages to avoid page fragmentation. The write amplification issue
may worsen in a production environment in which primary-standby replication is
implemented. As shown in Fig. 3.3, a MySQL instance is separately deployed in
availability zone (AZ) 1 and AZ 2, and synchronous mirror replication is implemented between the two instances.
Fig. 3.3 Network I/Os in mirror-based MySQL [1]
Amazon Elastic Block Store (EBS) is used
for underlying storage, and each EBS instance has a mirror. Amazon Simple Storage
Service (S3) is also deployed to archive the redo log and binlog to facilitate data
recovery to specific points in time. From the operational perspective, five types of
data, namely, Redo, Binlog, Data-Page, Double-Write, and Frmfiles, must be trans-
ferred in each step. Steps 1, 3, and 5 in the figure are sequentially executed because
of the mirror-based synchronous replication mechanism. This mechanism results in
an excessively long response time as it requires four network I/O operations, three
of which are synchronous serial operations. From a storage perspective, data is
stored in four replicas on EBS, and a write success is returned only after data is suc-
cessfully written to all four replicas. In this architecture, the I/O volume and the
serial model will lead to an extremely poor performance.
To reduce network I/Os, only one type of data (redo log) is written in Aurora, and
data pages are never written at any time. After receiving redo log records, a storage
node replays the log records based on data pages of an earlier version to obtain data
pages of a new version. To avoid replaying the redo log from the beginning each
time, the storage node periodically materializes data page versions. As shown in
Fig. 3.4, Aurora consists of a primary instance and multiple standby instances that
are deployed across AZs. Only the redo log and metadata are transmitted between
the primary instance and standby instances or storage nodes. The primary instance
simultaneously sends the redo log to six storage nodes and standby instances. When
four of the six storage nodes respond, the redo log is considered persisted, regardless of the responses of the remaining nodes.
Fig. 3.4 Network I/Os in Aurora
According to the Sysbench test statistics that are obtained by performing a 30-min stress test in a write-only sce-
nario by using 100 GB of data, Aurora’s throughput is 35 times that of mirror-based
MySQL, and the log volume per transaction is 0.12% less than that of the latter.
Regarding the fault recovery speed, after a traditional database crashes and restarts,
it recovers from the latest checkpoint and reads and replays all redo log records after
the checkpoint to update the data pages corresponding to the committed transac-
tions. In Aurora, the features related to the redo log are pushed down to the storage
layer, and the redo log can be replayed continuously in the backend. If the accessed
data page in any disk read operation is not of the latest version, the storage node is
triggered to replay the log to obtain the latest version of the data page. In this case,
fault recovery operations similar to those in traditional databases are continuously
performed in the backend. When a fault occurs, it can be rapidly rectified.

3.3.1.2 Storage Service Design

A key goal in the storage service design of Aurora is to reduce the response time for
front-end user writes. Therefore, operations are moved as far as possible to the
backend for asynchronous execution, and the storage nodes adaptively allocate
resources for different tasks based on the volume of front-end requests. For exam-
ple, when a large number of front-end requests need to be processed, the storage
nodes slow down the reclamation of data pages of earlier versions. In traditional
databases, back-end threads need to continuously advance checkpoints to avoid
excessively long fault recovery time from affecting front-end user request process-
ing capabilities. Thanks to the independent storage service layer in Aurora, check-
point advancement in the backend does not affect database instances. Faster
checkpoint advancement is more favorable for front-end disk I/O read operations
because this reduces the amount of log data that needs to be replayed.
To ensure database availability and correctness, the replication in the storage
layer of Aurora is based on the Quorum protocol. It is assumed that (1) V nodes
exist in the replication topology, (2) each node has one vote, and (3) success is
returned for a read or write operation only when Vr or Vw votes are obtained. To
ensure consistency, two conditions must be met. First, Vr + Vw > V, which ensures
that each read operation can read from the node with the latest data. Second,
Vw > V/2, which ensures that each write operation is performed on the latest data
obtained after the last write operation, thereby avoiding write conflicts. For exam-
ple, V = 3. To meet the above two conditions, Vr = 2 and Vw = 2. To ensure high
system availability under various abnormal conditions, database instances in
Aurora are deployed in three different AZs, each with two replicas. Each AZ is
equivalent to an IDC that has independent power systems, networks, and software
deployment and serves as an independent fault tolerant unit. Based on the Quorum
model and the two rules mentioned earlier, it is assumed that V = 6, Vw = 4, and
Vr = 3. In this case, Aurora can ensure intact write services when an AZ is faulty
and can still provide read services without data loss when an AZ and a node in
another AZ are faulty.
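The two Quorum conditions and the AZ failure scenarios described above can be checked mechanically, as in the short sketch below. This is a didactic illustration rather than Aurora's implementation.

```python
# Illustrative check of the Quorum conditions used by Aurora (V=6, Vw=4, Vr=3).

def quorum_ok(v: int, vw: int, vr: int) -> bool:
    # Rule 1: every read quorum intersects every write quorum.
    # Rule 2: any two write quorums intersect, avoiding conflicting writes.
    return (vr + vw > v) and (vw > v / 2)

V, VW, VR = 6, 4, 3
assert quorum_ok(V, VW, VR)

def survives(lost: int, quorum: int, total: int = V) -> bool:
    # A quorum can still be assembled if enough replicas remain.
    return total - lost >= quorum

# Losing one whole AZ (2 of 6 replicas): both writes (need 4) and reads (need 3) survive.
assert survives(lost=2, quorum=VW) and survives(lost=2, quorum=VR)

# Losing one AZ plus one more node (3 replicas): reads survive, writes do not.
assert survives(lost=3, quorum=VR)
assert not survives(lost=3, quorum=VW)
```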
Provided that an AZ-level failure (which may be caused by fire, flood, or network
failures) and a node-level failure (e.g., disk failure, power failure, or machine dam-
age) do not occur at the same time, Aurora can maintain the quorum based on the
Quorum protocol and ensure database availability and correctness. How to keep a
database “permanently available” will essentially depend on the reduction of the
probability of the two types of failures occurring at the same time. The mean time
to fail (MTTF) of a database is usually determinate. Therefore, the mean time to
repair (MTTR) can be reduced to lower the probability of the two types of failures
occurring at the same time. Aurora manages storage by partition, with each partition
sized 10 GB and six 10-GB replicas forming a protection group (PG). The storage
layer of Aurora consists of multiple PGs, and each PG comprises Amazon Elastic
Compute Cloud (EC2) servers and local SSDs. Currently, Aurora supports a maxi-
mum of 64 TB of storage space. After partitioning, each partition serves as a failure
unit. On a 10 Gbps network, a 10-GB partition can be restored within 10 s. Database
service availability will be affected only when two or more partitions fail at the
same time within 10 s, which rarely occurs in practice. Simply put, partition man-
agement effectively improves database service availability.
In Aurora, data writes are performed based on the Quorum model. After storage
partitioning, success can be returned when data is written to a majority of partitions,
and the overall write performance remains intact even if a few disks are under heavy
I/O workloads because the data is discretely distributed.
Fig. 3.5 I/O direction on a storage node in Aurora [1]
Figure 3.5 shows the specific write process, which includes the following steps: (1) A storage node receives log records from the primary instance and appends the log records to the memory queue. (2) The storage node persists the log records locally and then sends an
acknowledgment (ACK) to the primary instance. (3) The storage node classifies the
log records by partition and determines the log records that are lost. (4) The storage
node interacts with other storage nodes to obtain the missing log records from these
storage nodes. (5) The storage node replays the log records to generate new data
pages. (6) The storage node periodically backs up data pages and log records to the
S3 system. (7) The storage node periodically reclaims expired data page versions.
(8) The storage node periodically performs cyclic redundancy check (CRC) on data
pages. Only Steps (1) and (2) are serially synchronous and directly affect the
response time of front-end requests. Other steps are asynchronous.
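The split between the synchronous foreground path (steps 1 and 2) and the asynchronous background work (steps 3 to 8) can be sketched as follows. The class and method names are invented for illustration and do not correspond to Aurora's internal code.

```python
# Didactic sketch of an Aurora-style storage node: only steps (1) and (2)
# sit on the foreground path before the ACK; steps (3)-(8) run in the background.
import queue
import threading
import time

class StorageNodeSketch:
    def __init__(self):
        self.incoming = queue.Queue()       # step (1): in-memory log queue
        self.persisted_log = []             # step (2): locally durable log records
        threading.Thread(target=self._background_loop, daemon=True).start()

    def on_log_records(self, records):
        """Foreground path: append to the queue, persist locally, then ACK."""
        for record in records:
            self.incoming.put(record)       # (1) append to the memory queue
        self.persisted_log.extend(records)  # (2) persist locally (fsync omitted)
        return "ACK"                        # the primary counts this toward its quorum

    def _background_loop(self):
        while True:
            # (3) group records by partition and detect missing LSNs
            # (4) gossip with peer storage nodes to fetch the missing records
            # (5) replay log records to materialize new data page versions
            # (6) periodically back up pages and log records to S3
            # (7) reclaim data page versions that are no longer needed
            # (8) run CRC checks over stored pages
            time.sleep(1.0)

node = StorageNodeSketch()
print(node.on_log_records(["lsn-101", "lsn-102"]))  # prints "ACK"
```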

3.3.1.3 Consistency Principle

Currently, almost all databases on the market use the WAL (Write-Ahead Logging)
model. Any change to a data page must first be recorded in the redo log record corre-
sponding to the modified data page. As a MySQL-based database, Aurora is no excep-
tion. During implementation, each redo log record has a globally unique log sequence
number (LSN). To ensure data consistency between multiple nodes, Aurora uses the
Quorum protocol instead of the 2PC protocol because the latter has low tolerance for
errors. In a production environment, each storage node may have some missing log
records. The storage nodes complete their redo log records based on the Gossip proto-
col. During normal operation, database instances are in a consistent state, and only the
storage node with complete redo log records needs to be accessed during disk read.
However, during a fault recovery process, read operations must be performed based on
the Quorum protocol to rebuild the consistent state of the database. Many transactions
are active on a database instance, and the transactions may be committed in an order
different from the order in which they are started. Therefore, when the database crashes
and restarts due to an exception, the database instance must determine whether to com-
mit or roll back each transaction. To ensure data consistency, several concepts regarding
redo log records at the storage service layer are defined in Aurora:
• Volume complete LSN (VCL): the highest LSN for which the storage service can guarantee that all prior log records are complete. During fault recovery, all log records with an LSN greater than the VCL must be truncated.
• Consistency point LSN (CPL): For MySQL (InnoDB), each transaction consists
of multiple minitransactions. A minitransaction is the smallest atomic operation
unit. For example, a B-tree split may involve modifications to multiple data
pages, and the corresponding group of log records for these page modifications
is atomic. Redo log records are also replayed by minitransactions. A CPL repre-
sents the LSN of the last log record in a group of log records, and one transaction
has multiple CPLs.
• Volume durable LSN (VDL) represents the maximum LSN among all persisted CPLs, where VDL ≤ VCL. To ensure the atomicity of minitransactions, all log records with an LSN greater than the VDL must be truncated. For example, if the VCL is 1007 and the CPLs are 900, 1000, and 1100, the VDL is 1000, and all log records with an LSN greater than 1000 must be truncated. The VDL represents the latest LSN at which the database is in a consistent state. Therefore, during fault recovery, the database instance determines the VDL by PG and truncates all log records with an LSN greater than the VDL, as the sketch after this list illustrates.
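The following sketch reproduces the VCL/CPL/VDL example given above; the helper functions are purely illustrative and are not taken from Aurora.

```python
# Illustrative computation of the VDL from the VCL and the set of CPLs,
# mirroring the example in the text (VCL = 1007, CPLs = 900, 1000, 1100).

def compute_vdl(vcl: int, cpls: list[int]) -> int:
    # The VDL is the largest CPL that does not exceed the VCL.
    return max(lsn for lsn in cpls if lsn <= vcl)

def truncate(log_lsns: list[int], vdl: int) -> list[int]:
    # During recovery, log records with an LSN greater than the VDL are discarded.
    return [lsn for lsn in log_lsns if lsn <= vdl]

vdl = compute_vdl(vcl=1007, cpls=[900, 1000, 1100])
assert vdl == 1000
assert truncate([998, 999, 1000, 1001, 1100], vdl) == [998, 999, 1000]
```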

3.3.1.4 Fault Recovery

Most databases perform fault recovery based on the classic ARIES (Algorithm for
Recovery and Isolation Exploiting Semantics) protocol and use the WAL mecha-
nism to ensure that all committed transactions are persisted and uncommitted trans-
actions are rolled back in case of a fault. Such systems typically perform periodic
checkpointing and record checkpoint information in log records. If a fault occurs, a
data page may contain committed and uncommitted data. In this case, the system
must first replay the redo log starting from the last checkpoint during fault recovery
to restore the data pages to the status at the time of the fault and then roll back
uncommitted transactions based on undo logs. Fault recovery is time-consuming
and strongly related to the checkpointing frequency. Increasing the checkpointing
frequency can reduce the fault recovery time but directly affects the front-end
request processing of the system. The checkpointing frequency and fault recovery
time must be balanced, which, however, is not necessary in Aurora.
During fault recovery in a traditional database, the database status advances by
replaying the redo log. The entire database is offline during redo log replay. Aurora
uses a similar approach, but the log replay logic is pushed down to storage nodes
and runs in the backend while the database provides services online. Therefore,
when the database restarts due to a fault, the storage service can quickly recover.
Even under a pressure of 100,000 TPS, the storage service can recover within 10 s.
After a database instance crashes and restarts, fault recovery must be performed to
obtain a consistent runtime status. The instance communicates with Vr storage nodes
to ensure that the latest data is read, calculates a new VDL, and truncates log records
with LSNs greater than the VDL. In Aurora, the range of newly allocated LSNs is
limited. To be specific, the difference between the LSN and VDL cannot exceed
10,000,000. This prevents excessive uncommitted transactions on the instance
because the database needs to roll back uncommitted transactions based on undo
logs after replaying the redo log. In Aurora, the database can provide services after
all redo log records are replayed, and transaction rollback based on undo logs can
be performed after the database provides services online.
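A minimal sketch of the LSN allocation limit described above is shown below. The 10,000,000 bound matches the text; the helper function itself is invented for illustration.

```python
# Illustrative guard on LSN allocation: the head of the log is not allowed to
# run more than 10,000,000 LSNs ahead of the VDL, so that post-recovery rollback
# of uncommitted transactions stays manageable. allocate_lsn is a made-up helper.
LSN_ALLOCATION_LIMIT = 10_000_000

def allocate_lsn(current_lsn: int, vdl: int) -> int:
    next_lsn = current_lsn + 1
    if next_lsn - vdl > LSN_ALLOCATION_LIMIT:
        # Back-pressure the writer until the VDL advances.
        raise RuntimeError("LSN allocation would exceed VDL + 10,000,000")
    return next_lsn

assert allocate_lsn(current_lsn=5_000_000, vdl=4_500_000) == 5_000_001
```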

3.3.2 PolarDB

PolarDB [3, 4] is a cloud-native database developed by Alibaba Cloud. It is fully compatible with MySQL and adopts a compute-storage-separated architecture in
which PolarStore, a distributed shared storage service, is implemented by using the
high-performance RDMA (Remote Direct Memory Access) technology, as shown
in Fig. 3.6. Thanks to the design philosophy of compute-storage separation, PolarDB
can meet user requirements for business elasticity in public cloud computing environments.
Fig. 3.6 Overall architecture of PolarDB
The compute nodes and storage nodes in the database are intercon-
nected over a high-speed network and transmit data to each other based on the
RDMA protocol. This way, the database performance is no longer bottlenecked by
the I/O performance. The database nodes are fully compatible with MySQL. The
primary node and read-only nodes work in active-active mode, and the failover
mechanism is provided to deliver high availability of databases. The data files and
redo log of the database are stored in a user-space file system, routed by the inter-
face between the file system and the block storage device, and transmitted to remote
chunk servers by using the high-speed network and the RDMA protocol. In addi-
tion, only metadata information related to the redo log needs to be synchronized
among database instances. Data is stored in multiple replicas in the chunk servers to
ensure data reliability, and the Parallel-Raft protocol is used to ensure data
consistency.

3.3.2.1 Physical Replication

The binlog in MySQL records changes to data at the tuple (row) level, while the redo log, which records changes to physical file pages, is maintained in the InnoDB engine layer to ensure the ACID properties of transactions.
As a result, the fsync() function needs to be called at least twice during the pro-
cessing of a transaction in MySQL. This directly affects the response time and
throughput performance of the transaction processing system. Although MySQL
employs a Group Commit mechanism to increase the throughput in high-­concurrency
scenarios, the I/O bottleneck cannot be completely eliminated. Additionally, due to the limited computing resources and network bandwidth of a single database
instance, a typical approach is to build multiple read-only instances to share the read
load through horizontal scaling (also known as scale-out). In PolarDB, database
files and the redo log are stored in the shared storage devices. This effectively solves
the data replication problem between the read-only nodes and the primary node.
Due to data sharing, when a read-only node is added, it is unnecessary to replicate
all data to the node. Consistent data can be accessed based on a copy of full data and
the redo log by synchronizing metadata information and implementing basic
MVCC. This reduces the recovery time to less than 30 s when the primary node fails
and failover is initiated to switch services to a read-only node, enhancing system
high availability. Moreover, the data delay between the read-only and primary nodes
can be reduced to milliseconds. Currently, binlog-based replication is implemented
in parallel at the table level, and physical replication is implemented in parallel at
the data page level, which is finer-grained and has higher efficiency. In most cases,
the binlog can be disabled if it is not required in data migration or logical backup for
disaster recovery. The system uses the redo log to implement replication to mitigate
impact on performance. In conclusion, the farther down the I/O path, the easier it is
to decouple from the upper-layer business logic and status to reduce system com-
plexity. Furthermore, the WAL read/write model is well suited for the concurrency
mechanism of distributed file systems and improves the concurrent read performance
of PolarDB.
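As a rough illustration of the commit-path difference discussed in this section, the sketch below contrasts a conventional commit that must flush both the redo log and the binlog with a redo-only commit path. The function names are invented for exposition; real MySQL/PolarDB commit logic (group commit, two-phase commit between the binlog and InnoDB) is considerably more involved.

```python
# Simplified, illustrative commit paths only.

def durable_flush(target: str) -> None:
    # Placeholder standing in for an fsync of the named log.
    pass

def commit_with_binlog(txn) -> str:
    # Conventional MySQL commit: both the redo log (InnoDB prepare) and the
    # binlog must be durable before the commit is acknowledged.
    durable_flush("redo_log")
    durable_flush("binlog")
    return "committed"

def commit_redo_only(txn) -> str:
    # With physical (redo-based) replication over shared storage, the binlog
    # can be disabled when it is not needed for migration or logical backup,
    # leaving a single durable flush on the commit path.
    durable_flush("redo_log")
    return "committed"
```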

3.3.2.2 RDMA Protocol for High-Speed Networks

Before its application in cloud computing, RDMA had been widely used in the high-
performance computing (HPC) field for several years. RDMA typically uses net-
work devices that support high-speed connections, such as switches and network
interface controllers (NICs), to communicate with the NIC driver through a specific
programming interface. With RDMA, data is efficiently transmitted between NICs
and remote applications with a low delay by using the Zero-Copy technology. In
addition, copying data from kernel mode to user mode is not neces-
sary. Hence, the CPU workloads are not interrupted, thereby greatly reducing per-
formance jitter and improving the overall processing capabilities of the system. In
PolarDB, the compute nodes and storage nodes are interconnected over a high-­
speed network and transmit data to each other by using the RDMA protocol. This
way, the system performance is no longer bottlenecked by the I/O performance.

3.3.2.3 Snapshot-Based Physical Backup

Snapshotting is a popular backup solution based on block storage devices. This solution uses a copy-on-write mechanism to record the metadata changes of block
devices. With this mechanism, when a write operation needs to be performed on a
block device, a copy of the affected block is created, and the write operation modifies that copy. This way, data can be recovered to a specific snapshot
point. Snapshotting is a typical postprocessing mechanism based on time and write
load models. In other words, when a snapshot is created, data is not backed up.
Instead, the data backup load is evenly distributed to the time windows in which
actual data writes occur after the snapshot is created, thus achieving fast response to
backup and recovery. PolarDB provides the snapshotting and redo log mechanisms
to implement data recovery to specific points in time, which is more efficient than
the traditional recovery mode in which full data is used together with incremental
binlog data.
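A minimal copy-on-write sketch over a toy in-memory block device is shown below. It illustrates the postprocessing idea described above and is not PolarDB's actual snapshot implementation.

```python
# Toy copy-on-write snapshot over an in-memory "block device": when a block is
# written after a snapshot was taken, the pre-write version is preserved, so the
# backup cost is paid only in the time windows where writes actually occur.

class CowDevice:
    def __init__(self, num_blocks: int):
        self.blocks = {i: b"" for i in range(num_blocks)}
        self.snapshots = []            # each snapshot: {block_id: old_content}

    def take_snapshot(self) -> int:
        self.snapshots.append({})      # metadata only; no data copied yet
        return len(self.snapshots) - 1

    def write(self, block_id: int, data: bytes) -> None:
        for snap in self.snapshots:
            # Preserve the pre-write content the first time a block changes
            # after the snapshot was taken.
            snap.setdefault(block_id, self.blocks[block_id])
        self.blocks[block_id] = data

    def read_snapshot(self, snap_id: int, block_id: int) -> bytes:
        # Unchanged blocks are read from the live device; changed blocks from
        # the preserved copies.
        return self.snapshots[snap_id].get(block_id, self.blocks[block_id])

dev = CowDevice(num_blocks=4)
dev.write(0, b"v1")
snap = dev.take_snapshot()
dev.write(0, b"v2")
assert dev.read_snapshot(snap, 0) == b"v1"   # the snapshot still sees old data
assert dev.blocks[0] == b"v2"
```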

3.3.2.4 User-Space File System

When discussing file systems, we have to mention the Portable Operating System Interface (POSIX) standard defined by IEEE (POSIX.1 has been adopted by ISO). In simple terms, POSIX is to file systems what the SQL standard is to data-
bases. The greatest challenge in implementing a general distributed file system
is to provide strong concurrent file read and write performance while complying
with the POSIX standards. Complying with the POSIX standards inevitably
compromises the system performance and increases the complexity of system
implementation. This ultimately requires trade-offs between general-purpose
design and special-purpose design, as well as a balance between usability and
performance.
Distributed file systems have been continuously remodeled since their emergence and have constantly evolved from the HPC era to the cloud computing era, the Internet era,
and the big data era. Strictly speaking, the revisions and evolutions are customized
implementations for different application I/O scenarios.
In this regard, some file systems that serve specific I/O scenarios may not comply
with POSIX. This is analogous to the trend of development from SQL to NoSQL. File
systems that support POSIX must provide system call interfaces that support stan-
dard file read and write operations. This way, users can implement file operation
applications without modifying the POSIX interfaces. In this case, the Linux Virtual
File System (VFS) layer must be integrated with a specific file system in the kernel
during implementation. This is one of the reasons why the engineering implementa-
tion of a file system is difficult.
For distributed file systems, the kernel module must exchange data with user-­
space daemons to achieve data sharding and data transfer to other machines by
using the daemon processes. However, user-space file systems provide dedi-
cated APIs for users, which do not need to fully comply with POSIX standards
and do not require a 1:1 mapping with system calls in the operating system
kernel. File system metadata management and data read and write support can
be directly implemented in user space. This greatly reduces the implementation
difficulty and is more conducive to interprocess communication in distributed
systems.

3.3.3 Microsoft Socrates

Socrates [5] is a new cloud-based OLTP database for the database-as-a-service (DBaaS) paradigm that has been used in Microsoft SQL Server. Socrates adopts
the idea of compute-storage separation, separates the log storage from the over-
all storage layer (which stores log records and data pages), and treats the log
storage as a separate first-level storage module. Traditional databases usually
maintain multiple copies of data to implement data persistence and high avail-
ability. However, data persistence and high availability are not implemented
under completely identical conditions. For data persistence, a transaction can be
committed only after log records are written to a fixed number of replicas. For
high availability, the rapid replication and recovery of data pages enable the
system to provide services that meet the service-level agreement (SLA) in the
event of a failure. The separated storage of log records and data pages in Socrates
means that the implementation of persistence (implemented by using log
records) and availability (implemented by using data pages and the computing
layer) of the database are decoupled, which is beneficial for selecting the most
suitable mechanism to process tasks. Figure 3.7 shows the overall architecture
of Socrates, which consists of four layers: the computing layer, XLOG Service
layer, Page Server layer, and XStore layer.

Fig. 3.7 Overall architecture of Socrates [5]



3.3.3.1 Computing Layer

Similar to PolarDB, Socrates uses a one-writer, multireader architecture in which only one node (the primary node) in the computing layer can handle all
read and write transactions, but multiple read-only nodes (also known as sec-
ondary nodes) can process read-only transactions. When the primary node is
faulty, one of the read-only nodes can be selected as the primary node to support
read and write operations. In Socrates, the primary node only needs to process
read and write transactions and generate log records and does not need to know
other nodes or the storage location of log records. The primary node stores only
the hot data in memory and on local SSDs; the SSD cache is known as the resilient buffer pool extension (RBPEX). The GetPage@LSN mechanism is required to retrieve data
pages that are not cached locally. The GetPage@LSN mechanism is a remote
procedure call (RPC) mechanism. In this mechanism, a GetPage request con-
sisting of pageId and LSN is sent to a Page Server. pageId uniquely identifies the
page that the primary node needs to read, and LSN specifies the required page version, that is, the page must reflect all log records applied to it up to at least this LSN. Similarly, read-only nodes
use the GetPage@LSN mechanism to retrieve data pages that are not cached
locally. Note that the primary node and the read-only nodes in Socrates do not
interact with each other. Instead, the primary node sends all log records to the
XLOG Service, the XLOG Service broadcasts the log records to each read-only
node, and the read-only nodes replay the log records after receiving them. Read-
only nodes do not store a complete backup of the database. As a result, they may
process log records associated with pages that are not in the local memory or
SSDs. In this case, two different processing strategies are available. The first is to fetch the pages from the Page Server and replay the log records, which keeps the cached data of the read-only nodes consistent with that of the primary node; when the primary node fails, a read-only node can then be smoothly promoted to the primary role with stable performance. The second strategy is to ignore log records that involve uncached pages. Socrates adopts the second strategy.
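The GetPage@LSN interaction can be sketched as a simple stub, as below. The class and method names are illustrative and do not come from the Socrates codebase.

```python
# Illustrative GetPage@LSN stub: a compute node asks a Page Server for a page
# version that reflects all log records up to at least the requested LSN.

class PageServerStub:
    def __init__(self):
        self.applied_lsn = 0           # highest LSN replayed so far
        self.pages = {}                # page_id -> (lsn, page image)

    def get_page(self, page_id: int, lsn: int) -> bytes:
        # Conceptually block until the requested LSN has been applied, so the
        # caller never observes a stale page version.
        while self.applied_lsn < lsn:
            self._apply_next_log_block()
        _, data = self.pages.get(page_id, (0, b""))
        return data

    def _apply_next_log_block(self):
        # Placeholder for replaying the next log block received from XLOG.
        self.applied_lsn += 1

server = PageServerStub()
server.pages[7] = (0, b"initial page image")
print(server.get_page(page_id=7, lsn=3))   # returned once LSN 3 has been applied
```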

3.3.3.2 XLOG Service Layer

Figure 3.8 shows the internal structure of XLOG Service. Log blocks are synchro-
nously written from the primary node to the landing zone (LZ). In the current Socrates
version, the LZ is implemented on XIO, a premium storage service of Azure, which retains three replicas of all data to implement data persistence.
The primary node asynchronously sends log records to the XLOG process, which
then sends the log records to read-only nodes and Page Servers. When the log blocks
are sent to the LZ and XLOG process in parallel, data may reach the read-­only nodes
before being persisted in the LZ, resulting in data inconsistency or loss in the event of
a failure. To avoid this situation, XLOG propagates only log records that have been persisted in the LZ.
Fig. 3.8 Internal structure of the XLOG service [5]
The XLOG process stores the log records in pending blocks, and
the primary node notifies the XLOG process of the log blocks that have been per-
sisted. Then, the XLOG process moves the persisted log blocks from the pending
blocks to the LogBroker, from which the log blocks are broadcast to read-only nodes
and Page Servers. The XLOG process incorporates a Destaging process, which copies
persisted log blocks to a fixed-size local SSD cache to accelerate access and sends a
copy of the log blocks to XStore for long-term retention. Socrates refers to the long-
term retention of log blocks as Long-Term Archive (LT). In Socrates, LZ and LT
retain all log data to meet the requirement for database persistence. The LZ is an
expensive service that can achieve low latency to facilitate fast commits of transac-
tions. It also retains log records for 30 days to facilitate data recovery to specific points
in time. XStore (LT) uses inexpensive and durable storage devices to store massive
data. This tiered storage structure meets performance and cost requirements.
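The rule that only durable log blocks are propagated can be captured in a few lines. The sketch below is an interpretation of the description above, not Socrates code.

```python
# Illustrative XLOG gatekeeping: blocks received from the primary wait in a
# pending area and are handed to the broker (for broadcast to read-only nodes
# and Page Servers) only after the landing zone confirms durability.

class XLogSketch:
    def __init__(self):
        self.pending = {}      # block_id -> log block, not yet durable in the LZ
        self.broker = []       # durable blocks, ready to broadcast

    def on_block_from_primary(self, block_id: int, block: bytes) -> None:
        self.pending[block_id] = block

    def on_lz_durable(self, block_id: int) -> None:
        # The primary reports that the landing zone has persisted this block.
        block = self.pending.pop(block_id, None)
        if block is not None:
            self.broker.append((block_id, block))   # now safe to propagate

xlog = XLogSketch()
xlog.on_block_from_primary(1, b"log block 1")
xlog.on_lz_durable(1)
assert xlog.broker == [(1, b"log block 1")]
```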

3.3.3.3 Page Server Layer

Page Servers are responsible for three tasks: (1) responding to GetPage requests
from compute nodes, (2) maintaining data of a database partition through log
replay, and (3) recording checkpoints and backing up data to XStore. Each Page
Server stores only a portion of the database data pages and focuses only on log
blocks related to the partition handled by the Page Server. To this end, the primary
node adds sufficient annotation information for each log block to indicate the
partitions to which the log records in the log block must be applied. XLOG uses
this filtering information to distribute the relevant log blocks to their correspond-
ing Page Servers. In Socrates, two methods are available for improving system
availability. The first method is to use a more fine-grained sharding strategy,
which allows each Page Server to correspond to a smaller partition, thereby reduc-
ing the average recovery time for each partition and improving system availabil-
ity. Based on current network and hardware parameters, it is recommended that
the partition size be set to 128 GB for Page Servers in Socrates. The second
method is to add a standby Page Server for each existing Page Server. When a
Page Server fails, its standby Page Server can immediately provide services, thus
improving system availability.
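The per-partition filtering of annotated log blocks can be sketched as follows; the 8 KB page size, the 128 GB partition size, and the block format (a list of (page_id, record) pairs) are assumptions of this sketch rather than Socrates specifics:

# Illustrative routing of annotated log records to the Page Servers that own them.
PAGE_SIZE = 8 * 1024
PARTITION_BYTES = 128 * 1024 ** 3
PAGES_PER_PARTITION = PARTITION_BYTES // PAGE_SIZE

def partition_of(page_id):
    return page_id // PAGES_PER_PARTITION

def distribute(log_block, page_server_queues):
    """page_server_queues maps a partition id to the input queue of its Page Server."""
    for page_id, record in log_block:
        page_server_queues[partition_of(page_id)].append((page_id, record))

queues = {0: [], 1: []}
distribute([(5, "rec A"), (PAGES_PER_PARTITION + 9, "rec B")], queues)
assert queues[0] == [(5, "rec A")] and queues[1] == [(PAGES_PER_PARTITION + 9, "rec B")]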

3.3.3.4 XStore Layer

XStore is a highly replicated disk-based storage system that spans multiple zones.
It ensures data durability with minimal data loss. In the Socrates architecture,
XStore plays the same role as disks in traditional databases. Similarly, the mem-
ory and SSD caches (RBPEX) of the compute nodes and Page Servers play the
same role as the main memory in traditional databases. Page Servers periodically
send modified data pages to XStore, and Socrates uses the snapshot feature of
XStore to create backups by simply recording a timestamp. When a user requests
a point-in-time recovery (PITR) operation, Socrates fetches from XStore a complete set
of snapshots taken at or before the requested point in time, as well as the log range
required to roll this set of snapshots forward from the snapshot time to the
requested time.
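The PITR bookkeeping described above amounts to choosing the newest snapshot taken at or before the target time and the log range from that snapshot forward; the following sketch, with its simplified (time, LSN) snapshot tuples, is an illustrative assumption rather than the actual recovery procedure:

# Illustrative PITR selection: newest snapshot not later than the target time,
# plus the log range from the snapshot's LSN up to the target time.
def plan_pitr(snapshots, target_time):
    """snapshots: list of (snapshot_time, snapshot_lsn), sorted by time."""
    candidates = [s for s in snapshots if s[0] <= target_time]
    if not candidates:
        raise ValueError("no snapshot old enough for the requested time")
    snap_time, snap_lsn = candidates[-1]
    return {"restore_snapshot_at": snap_time,
            "replay_log_from_lsn": snap_lsn,
            "replay_until_time": target_time}

# Example: restore to time 1230 using the snapshot at 1200 plus the newer log.
plan = plan_pitr([(900, 10), (1200, 42)], 1230)
assert plan["replay_log_from_lsn"] == 42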
Socrates divides the entire database into multiple service layers that have
respective lifecycles and perform asynchronous communication as far as possi-
ble. Unlike other cloud-native databases, Socrates separately implements dura-
bility and availability. In particular, Socrates uses XLOG and XStore to ensure
system durability and uses the computing layer and Page Servers to ensure sys-
tem availability. In Socrates, the computing layer and Page Servers are stateless.
This way, data integrity is not affected even if a compute node or Page Server
fails because the data of any Page Server can be recovered to the latest status by
using the snapshot versions and log records in XStore and XLOG. This layered
storage architecture can implement more flexible and finer-grained control,
achieving a better balance among system availability, costs, performance, and
other aspects.

References

1. Verbitski A, Gupta A, Saha D, et al. Amazon Aurora: design considerations for high
throughput cloud-native relational databases. In: Proceedings of the 2017 ACM interna-
tional conference on management of data (SIGMOD '17); 2017. p. 1041–52. https://doi.
org/10.1145/3035918.3056101.
2. Verbitski A, Gupta A, Saha D, et al. Amazon Aurora: on avoiding distributed consensus for I/Os,
commits, and membership changes. In: Proceedings of the 2018 ACM international conference
on management of data (SIGMOD '18); 2018. p. 8. https://doi.org/10.1145/3183713.3196937.
3. Cao W, Liu ZJ, Wang P, et al. PolarFS: an ultra-low latency and failure resilient distributed file
system for shared storage cloud database. Proc VLDB Endow. 2018;11:1849–62. https://doi.
org/10.14778/3229863.3229872.
4. Li FF. Cloud-native database systems at Alibaba: opportunities and challenges.
PVLDB. 2019;12(12):2263–72. https://doi.org/10.14778/3352063.3352141.
5. Antonopoulos P, Budovski A, Diaconu C, et al. Socrates: the new SQL server in the cloud. In:
Proceedings of the 2019 ACM international conference on management of data (SIGMOD '19);
2019. p. 14. https://doi.org/10.1145/3299869.3314047.
Chapter 4
Storage Engine

A storage engine provides the technical implementation for storing data in files (or
memory). Different storage engines may use different storage mechanisms, index-
ing techniques, and locking methods and provide extensive functionality. In this
chapter, we will introduce the basic concepts and technologies related to storage
engines from three aspects: data organization, concurrency control, and logging and
recovery. Then, we will discuss the characteristics and advantages of X-Engine, the
storage engine of PolarDB.

4.1 Data Organization

The data organization structure of a database is determined by the target access
efficiency, and data organization focuses on different aspects in different business
scenarios. Data is stored as files on disks and is usually managed by using fixed-length
pages. Each page has a unique identifier, and the DBMS (Database Management System)
maintains the mapping relationship between page IDs and physical addresses. The page
size varies based on the database and is usually a multiple of the disk block size, such
as 8 KB or 16 KB, to make it easier to load the pages into the memory buffer for
management.
Different storage engines organize pages in different ways. The most commonly
used method is heap file organization. In this organization method, pages serve as
the basic unit for data storage, and records on the disk are unordered and are orga-
nized in the form of a linked list or directory.
Another popular organization method is log-structured file organization, which
stores data records in the database in an append-only manner. In this method, data
is organized by merging and sorting small storage areas. Log-structured merge-
tree (LSM-tree)-based LevelDB, HBase, and RocksDB use this method to
store data.


In each data organization method, storage structures can be categorized into
immutable and mutable storage structures. In the immutable storage structure,
files cannot be modified after they are created, and new records can only be
appended to new files. In the mutable storage structure, existing records on the
disk can be directly modified given that they are located first. In this storage struc-
ture, most I/O operations are random, resulting in high overheads for write opera-
tions. Nevertheless, the mutable storage structure features high query efficiency.
Therefore, this storage structure optimizes the read performance at the expense of
the write performance.
In the immutable storage structure, inserted and updated records are appended to
new files in sequential write mode. This storage structure is essentially optimized to
achieve higher write performance. However, a record may have multiple versions
because new records are added in append mode. Consequently, multiple files need
to be queried during data read. In contrast to the mutable storage structure, the
immutable storage structure optimizes write performance at the expense of the read
performance.
This chapter describes the implementation principles of the data organization
methods. To illustrate, the B+ tree and LSM-tree structures are respectively used as
examples of the mutable storage structure that uses heap file organization and the
immutable storage structure that uses log-structured file organization.

4.1.1 B+ Tree

The B+ tree is a variant of the B-tree, which was proposed by Rudolf Bayer and Edward
McCreight in 1970 in the paper Organization and Maintenance of Large Ordered Indices.
Since then, it has become the most common and frequently used index structure in
databases. Built on a mutable storage structure, a B+ tree allows the page on which a
data row resides to be located rapidly based on its key. A B+ tree uses an m-ary tree
structure, which reduces the depth of the index and avoids most of the random accesses
incurred by traditional binary tree structures, effectively reducing the number of disk
head seek operations and mitigating the impact of external storage access latency on
performance. A B+ tree also keeps the key-value pairs within its nodes ordered, which
bounds the time complexity of query, insertion, deletion, and update operations to
O(log n). Given these advantages, B+
trees are widely used as a building module for index structures in many database
and storage systems, including PolarDB, Aurora, and other cloud-native databases.

4.1.1.1 Principles of B+ Trees

This section describes the structure and characteristics of a B+ tree. Due to limited
space, only a brief introduction to the basic structure and operations of a B+ tree is
provided. For more information, refer to the referenced articles for further reading.

1. A traditional B+ tree must meet the following requirements:


(a) All paths from the root node to the leaf nodes have the same length.
(b) All data is stored in the leaf nodes. Nonleaf nodes only serve as indexes to
the leaf nodes.
(c) The root node has at least two key-value pairs.
(d) Each tree node can have up to M key-value pairs.
(e) Each tree node (except the root node) has at least M/2 key-value pairs.
Figure 4.1 is a schematic diagram of a B+ tree.
2. A traditional B+ tree must support the following operations:
(a) Single-key operations: search, insertion, update, and deletion (Search and
insertion operations are used as examples in this chapter, and the other oper-
ations are implemented similarly.)
(b) Range operations: range search.
3. The B+ tree concurrency control mechanism must meet the following
requirements:
(a) A read operation must not observe a key-value pair in an intermediate state
while it is being modified by a write operation, nor miss an existing key-value
pair. A read operation may miss an existing key-value pair if the pair is moved
to another tree node by a concurrent write operation (e.g., a split or merge
operation).
(b) Two write operations do not simultaneously modify the same key-value pair.
(c) Deadlocks do not occur. A deadlock is a situation in which two or more threads
are permanently blocked and waiting for resources occupied by other threads.

Fig. 4.1 Schematic diagram of a B+ tree

4.1.1.2 Disk Data Organization by Using a B+ Tree

The storage structure of a computer consists of the following layers from top to
bottom: registers, high-speed cache, main memory, and auxiliary storage. The main
memory is also called RAM, and the auxiliary storage is also known as external
storage (e.g., a disk used to store files). In this storage structure, the access speed of
each layer is much slower than that of its upper layer, with disk access being the
slowest. This is because disk access involves track seeking and sector locating. In
track seeking, the actuator arm moves the magnetic head to the track in which the data
is located. In sector locating, the head stays over that track while the disk rotates at
a high speed until the desired sector, one of hundreds on the track, passes under
the head.
These are time-consuming mechanical operations. At the disk scheduling level,
various disk scheduling algorithms are employed to reduce the movement of the
actuator arm of the disk, thereby improving efficiency. At the index structure level,
a proper index must be created to improve the disk read efficiency. In most cases,
the performance of an index is evaluated based on the number of disk I/Os. The
performance of a B+ tree index is advantageous in the following aspects:
• Organizing data in the disk by using a B+ tree index is inherently advantageous
because read and write operations in the operating system are performed in disk
blocks. When the size of the leaf nodes of the B+ tree is aligned with the disk
block size, the number of I/Os can be significantly reduced through interaction
between the operating system and the disk.
• A B+ tree consists of nonleaf nodes and leaf nodes. Nonleaf nodes, also known
as index nodes, are mapped to index pages in the physical structure. Leaf nodes
are data nodes that are also mapped to data pages in the physical structure. The
index nodes store only keys and pointers, not data. This way, a single index node
can store a large number of branches, and information about locations of physi-
cal pages on the disk can be read into the memory with just one I/O operation.
• All leaf nodes in a B+ tree contain a pointer to neighboring leaf nodes, which
greatly improves the efficiency of range queries.

4.1.2 B+ Tree in the InnoDB Engine

By default, MySQL 5.5 and later use the InnoDB engine as the storage engine. The
InnoDB engine uses a B+ tree as its index structure, and the primary key is stored
as a clustered index.
The InnoDB engine organizes data in the tablespace structure. Each tablespace
contains multiple segments, each segment contains multiple extents, and each extent
occupies 1 MB of space and contains 64 pages. The smallest access unit in the
InnoDB engine is pages, which can be used to store data or index pointers. In a B+
tree, leaf and nonleaf nodes store data and index pointers, respectively. In the search
for a record, the page that contains the record is determined by using the B+ tree
index, the page is loaded to the memory, and then the memory is traversed to find
the row that contains the record.
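A toy version of this lookup path is sketched below: the search descends one index page per level and then scans the leaf (data) page in memory. The node layout used here is a simplifying assumption and not the actual InnoDB page format:

from bisect import bisect_right

# Toy B+ tree lookup: internal nodes hold sorted separator keys and child pointers;
# leaf pages hold sorted (key, row) pairs.
class Node:
    def __init__(self, keys, children=None, rows=None):
        self.keys = keys            # sorted keys
        self.children = children    # child nodes (internal node) or None
        self.rows = rows            # row payloads (leaf node) or None

    @property
    def is_leaf(self):
        return self.children is None

def search(root, key):
    node = root
    while not node.is_leaf:                      # descend: one page read per level
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_right(node.keys, key) - 1         # scan the leaf page in memory
    if i >= 0 and node.keys[i] == key:
        return node.rows[i]
    return None

leaf1 = Node([10, 20], rows=["row10", "row20"])
leaf2 = Node([30, 40], rows=["row30", "row40"])
root = Node([30], children=[leaf1, leaf2])
assert search(root, 20) == "row20" and search(root, 25) is None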
The read performance of the InnoDB engine depends on the query performance
of the B+ tree. The write performance of the InnoDB engine is guaranteed by using
the WAL mechanism. In this mechanism, each write operation is recorded in the
redo log instead of updating the full index and data content on the B+ tree, and then
the redo log is sequentially written to the disk. At the same time, the dirty page data
that needs to be updated on the B+ tree is recorded in memory, and the dirty pages
are flushed to the disk when specific conditions are met.
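The division of labor between the sequentially written redo log and the lazily flushed dirty pages can be sketched as follows; MiniEngine and its attributes are assumptions for illustration and omit the log buffer, LSNs, and checkpoint details of a real engine:

# Illustrative write path: redo records are forced at commit (sequential I/O),
# while the modified (dirty) pages are flushed back to disk only later.
class MiniEngine:
    def __init__(self):
        self.redo_log_disk = []   # stand-in for the sequential redo log file
        self.redo_log_buffer = []
        self.buffer_pool = {}     # page_id -> page content cached in memory
        self.dirty_pages = set()
        self.data_disk = {}       # stand-in for the random-access data files

    def update(self, txn_id, page_id, new_value):
        self.redo_log_buffer.append((txn_id, page_id, new_value))  # log record first
        self.buffer_pool[page_id] = new_value                      # change only in memory
        self.dirty_pages.add(page_id)

    def commit(self, txn_id):
        # WAL: force the buffered log to disk before acknowledging the commit.
        self.redo_log_disk.extend(self.redo_log_buffer)
        self.redo_log_buffer.clear()

    def flush_dirty_pages(self):
        # Later (checkpoint, eviction, idle time): write dirty pages back to disk.
        for page_id in list(self.dirty_pages):
            self.data_disk[page_id] = self.buffer_pool[page_id]
        self.dirty_pages.clear()

engine = MiniEngine()
engine.update("T1", page_id=3, new_value="v2")
engine.commit("T1")                     # durable via the log, page 3 still dirty
assert 3 in engine.dirty_pages and engine.redo_log_disk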

4.1.2.1 Implementation of a B+ Tree in the MySQL/InnoDB Engine

B+ tree indexes in a database can be categorized into clustered and nonclustered
(secondary) indexes. A leaf node in a clustered index records a complete row of data,
whereas a leaf node in a nonclustered index does not. In a nonclustered index, such as
the one used by the MyISAM engine, the index file (suffixed with .myi) and the data file
(suffixed with .myd) are separate: the leaf nodes of the index store only pointer
addresses that point to the data records in the data file. In a clustered index, as used
by InnoDB, the index and the data are combined in a single .ibd file, the leaf nodes
directly store the complete data records, and the nonleaf nodes store pointers to the
next level. Figure 4.2 shows the structures of a clustered index and a secondary index.

Fig. 4.2 Structures of a clustered index and a secondary index



1. Advantages of Clustered Indexes.


(a) A clustered index stores the index and data rows in the same B+ tree. This
enables direct access to data during queries, thereby increasing efficiency
compared with a nonclustered index. The latter requires a second query.
(b) Clustered indexes are more efficient for range queries, because they sort
data rows based on values. This way, pointers in leaf nodes can be logically
accessed in sequence.
2. Disadvantages of Clustered Indexes.
(a) The cost of updating a clustered index is high. For clustered indexes with
updated rows, data needs to be moved to the corresponding location.
(b) The insertion speed is heavily dependent on the insertion order.
(c) In a clustered index, a page split may occur when new rows are inserted or
the primary key is updated.
(d) A clustered index may slow down full table scans because logically contigu-
ous pages may be physically far apart, and a large number of random reads
can significantly reduce performance.
In the InnoDB storage engine, the leaf nodes of a secondary index store only
pointers to data, not the actual data. Therefore, the leaf nodes point to the clustered
index keys of the corresponding rows. The existence of secondary indexes does not
affect the organization of data in the clustered index, so multiple secondary indexes
can be created on each table. During a data query, the InnoDB storage engine tra-
verses the secondary indexes and obtains the primary key that points to the primary
key index by using leaf-level pointers and then finds a complete row record by using
the primary key index.
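The resulting two-step lookup, first through the secondary index and then back through the clustered index, can be sketched with two plain dictionaries standing in for the two B+ trees; the table contents are made up for illustration:

# Illustrative "back to the clustered index" lookup in an InnoDB-style design:
# the secondary index maps an indexed column value to primary keys, and the
# clustered index maps a primary key to the full row.
secondary_index = {"alice": [101], "bob": [102, 103]}   # name -> primary keys
clustered_index = {101: {"id": 101, "name": "alice", "age": 30},
                   102: {"id": 102, "name": "bob", "age": 25},
                   103: {"id": 103, "name": "bob", "age": 41}}

def lookup_by_name(name):
    rows = []
    for pk in secondary_index.get(name, []):   # step 1: search the secondary index
        rows.append(clustered_index[pk])       # step 2: fetch the row via the primary key
    return rows

assert [r["age"] for r in lookup_by_name("bob")] == [25, 41]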

4.1.2.2 B+ Tree-Based Persistence Strategy in MySQL/InnoDB

When data is read in InnoDB, the page that contains the desired record can be rap-
idly located by using the B+ tree index, and the desired record is fetched by loading
the data to the memory. MySQL uses the WAL mechanism for write operations.
Simply put, the write operations are first recorded in the log and then written to the
disk. Efficient writes in MySQL are achieved by using a buffer pool considering that
the performance of sequential writes is much higher than that of random writes.
Directly reading from the disk or writing data to the disk is costly, and frequent
random I/Os significantly reduce CPU utilization. To reduce accesses to the disk,
pages are cached in the memory based on the principle of locality. The buffer pool
mainly serves to:
• Preserve the content of cached disk pages in the memory.
• Cache modifications to disk pages so that the cached version is modified instead
of the data on the disk.
• Return the cached page if the requested page is in the cache.
• Load the requested page from the disk to the memory when the requested page
is not cached and the memory has free space.
• Call a page replacement strategy to select pages to be swapped out when the
requested page is not cached and the memory has no free space. The contents of
the swapped out pages are written back to the disk.
When the storage engine accesses a page, it first checks whether the content of
the page is cached in the buffer pool. If that is the case, the storage engine directly
returns the requested page. Otherwise, it converts the logical address or page num-
ber of the requested page to a physical address and loads the content of the page
from the disk to the buffer pool.
When the requested page is not in the buffer pool and the memory has no free
space, a page needs to be swapped out from the cache and written back to the disk
before the requested page can be swapped into the memory. The algorithm used to
select the page to be swapped out is called a page replacement algorithm. A good
page replacement algorithm achieves a low page replacement frequency. In other
words, pages that will no longer be accessed or will not be accessed in a long time
are swapped out first. The following four page replacement algorithms are most
commonly used:
• First In First Out (FIFO): This page replacement algorithm first evicts the pages
that have been residing in the memory for the longest time.
• Least Recently Used (LRU): This replacement algorithm first evicts the least
recently used pages from the memory.
• Clock: This replacement algorithm keeps track of the references of pages
and associated access bits in a circular buffer area. The access bit of each
page is updated in a clock-like manner, and the page with an access bit of 0
is evicted.
• Least Frequently Used (LFU): This replacement algorithm sorts pages based
on their request frequency and evicts the page with the lowest request
frequency.
If a page in the buffer pool is modified, for example, appended with a new tuple,
the page is marked as a dirty page to indicate that its content is inconsistent with that
of the disk data page. The dirty page must be flushed to the disk to ensure data
consistency.
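The interplay of cache lookup, LRU replacement, and write-back of dirty pages can be sketched as follows; this toy buffer pool is an illustrative assumption that ignores latching, the free list, and the flush list of a real storage engine:

from collections import OrderedDict

# Toy buffer pool with LRU replacement and write-back of dirty pages on eviction.
class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                      # page_id -> page content on "disk"
        self.cache = OrderedDict()            # page_id -> (content, dirty)

    def get_page(self, page_id):
        if page_id in self.cache:             # cache hit: refresh the LRU position
            self.cache.move_to_end(page_id)
            return self.cache[page_id][0]
        if len(self.cache) >= self.capacity:  # cache full: evict the LRU page
            victim, (content, dirty) = self.cache.popitem(last=False)
            if dirty:
                self.disk[victim] = content   # write back before eviction
        self.cache[page_id] = (self.disk[page_id], False)
        return self.cache[page_id][0]

    def write_page(self, page_id, content):
        self.get_page(page_id)                # make sure the page is resident
        self.cache[page_id] = (content, True) # modify the cached copy, mark dirty
        self.cache.move_to_end(page_id)

disk = {1: "page1", 2: "page2", 3: "page3"}
bp = BufferPool(capacity=2, disk=disk)
bp.write_page(1, "page1-modified")
bp.get_page(2)
bp.get_page(3)                 # evicts page 1 (least recently used) and writes it back
assert disk[1] == "page1-modified"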
In the InnoDB engine, the checkpointing mechanism controls WAL and the buf-
fer pool to ensure that they work in coordination. Related operation logs can be
discarded from the redo log files only when the data pages in the buffer pool are
written to the disk. After the above process is completed, dirty pages can be evicted
from the cache. The InnoDB engine flushes dirty pages to the disks in the follow-
ing cases:
• When the redo log files are full, the system stops writing update operations to the
disk, advances the checkpoint, and needs to flush all dirty pages corresponding
to the log records between the checkpoint and the write position to the disk to
make space for the redo log.
• When a page fault occurs due to insufficient system memory, some pages need to
be swapped out so that new requested pages can be swapped in. A dirty page
must be written to the disk first before it can be swapped out.
• Dirty pages are flushed to the disk when the database is idle.
• When the database is properly closed, all dirty pages in the memory must be
flushed to the disk.

4.1.3 LSM-Tree

The LSM-tree concept was proposed by Professor Patrick O’Neil in 1996 in his
paper The log-structured merge-tree (LSM-tree). The name “log-structured merge-­
tree” originates from the log-structured file system. Similar to the log-structured file
system, an LSM-tree is also implemented based on an immutable storage structure,
which executes sequential write operations by using buffers and the append-only
mode. This avoids most of the random write operations in a mutable storage struc-
ture and mitigates the impact of multiple random I/Os of a write operation on per-
formance. The LSM-tree ensures the orderliness of disk data storage. The immutable
storage structure is conducive to sequential writes. All data can be written to the
disk at the same time in append-only mode. The higher data density of the immu-
table storage structure prevents the generation of external fragments.
In addition, data locations do not need to be determined in advance for write,
insertion, and update operations because the files are immutable. This greatly
reduces the impact of random I/Os and significantly improves the write performance
and throughput. However, immutable files may contain duplicate versions of the same
record. As the amount of appended data increases, the number of disk-resident tables
also increases, so a single read may have to consult multiple files. This problem can
be addressed by maintaining the LSM-tree through compactions.
As mentioned above, a B+ tree organizes data on the disk in the unit of pages and
uses nonleaf and leaf nodes to store index and data files, respectively, to facilitate locat-
ing the page that contains the desired data record. In an LSM-tree, data exists in the form
of sorted string tables (SSTables). An SSTable usually consists of two components: an
index file and a data file. The index file stores keys and their offsets in the data file, and
the data file consists of concatenated key-value pairs. Each SSTable consists of multiple
pages. When a data record is queried, instead of directly locating the page that contains
the data record like in B+ trees, the SSTable is located first. Then, the page that contains
the data record is searched based on the index file in the SSTable.

4.1.3.1 Structure of an LSM-Tree

Fig. 4.3 Overall architecture of an LSM-tree

Figure 4.3 shows the overall architecture of an LSM-tree, which includes memory-
resident components and disk-resident components. When a write request is exe-
cuted to write data, the operation is first recorded in the commit log on the disk for
fault recovery. Then, the record is written to the mutable memory-resident compo-
nent (MemTable). When the size of the MemTable reaches a specific threshold, the
MemTable becomes an immutable memory-resident component (immutable
MemTable), and the data in the MemTable is flushed to the disk in the backend. For
disk-resident components, the written data is divided into multiple levels. Data
flushed from the immutable MemTable is first stored at Level 0 (L0), and a corre-
sponding SSTable is generated. When the data size of L0 reaches a specific thresh-
old, the SSTables at L0 are compacted into Level 1 (L1); subsequent levels are
compacted in a similar way.
• Memory-resident components: Memory-resident components consist of the
MemTable and the immutable MemTable. Data is usually stored in the MemTable
in an ordered skip list to ensure the orderliness of disk data. The MemTable buf-
fers data records and serves as the primary destination for read and write opera-
tions. The immutable MemTable writes data to the disk.
• Disk-resident components: Disk-resident components consist of the commit log
and SSTables. The MemTable exists in memory. To prevent data loss caused by
system failure before data is written to the disk, the operation records must be
written to the commit log before the data is written to the MemTable to ensure
data persistence. An SSTable consists of data records written to the disk by the
immutable MemTable and is immutable, so it can only be read, compacted, and
deleted.

4.1.3.2 Data Update and Deletion in the LSM-Tree

An LSM-tree is based on an immutable storage structure. Therefore, update operations
cannot directly modify the original data. Instead, a new data record marked with a
timestamp is inserted. As a result, insertion and update operations in the LSM-tree
cannot be explicitly distinguished. Similarly, delete operations are implemented by
inserting a special delete marker, which is sometimes called a tombstone. This marker
indicates that the data record corresponding to the key has been deleted.
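A minimal sketch of this append-only write path follows; the plain dict MemTable, the module-level tombstone sentinel, and the flush threshold are arbitrary assumptions made for illustration:

# Illustrative LSM-style writes: updates insert new versions, deletes insert
# tombstones, and a full MemTable is frozen and flushed as a sorted run (SSTable).
TOMBSTONE = object()          # special marker meaning "this key was deleted"
MEMTABLE_LIMIT = 4            # arbitrary flush threshold for the sketch

memtable = {}                 # key -> latest value in memory
sstables = []                 # list of sorted runs on "disk", newest first

def maybe_flush():
    global memtable
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.insert(0, sorted(memtable.items()))  # append-only: write a new run
        memtable = {}

def put(key, value):
    memtable[key] = value     # an update is simply a newer version of the key
    maybe_flush()

def delete(key):
    memtable[key] = TOMBSTONE # deletion is the insertion of a tombstone
    maybe_flush()

put("a", 1)
put("a", 2)                   # newer version of "a"; the old one remains until compaction
delete("b")                   # tombstone for "b"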

4.1.3.3 Lookup in the LSM-Tree

Data lookup in an LSM-tree proceeds as follows:


• Access the MemTable.
• Access the immutable MemTable.
• Access the disk-resident components starting from L0 until the desired data is
found. Data at a lower level is newer than that at an upper level. Therefore, the
first occurrence of the desired data record is its latest value.
To optimize read operations, LSM-trees commonly use Bloom filters to determine
whether an SSTable contains a specific key. The underlying layer of a
Bloom filter is a bitmap, which is used to represent a set and determine whether an
element belongs to this set. Applying Bloom filters can greatly reduce the number
of disk accesses. However, Bloom filters have a certain misjudgment rate because
their bitmap values are determined based on a hash function, which can lead to a
hash collision between multiple values. In other words, Bloom filters have a false-­
positive probability. For example, an element that does not belong to a set may be
mistakenly considered belonging to the set. Conversely, an element is reported as
belonging to a set only when all of its corresponding bits in the bitmap are 1, and
adding an element sets all of these bits, so Bloom filters have no false-negative
probability. To be specific, the following two
cases may exist:
• If a Bloom filter determines that an element does not belong to a set, this element
does not exist in the set.
• If a Bloom filter determines that an element belongs to a set, this element may or
may not exist in the set.
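The following small Bloom filter makes the two cases concrete: a negative answer is always correct, whereas a positive answer may be a false positive. The bit-array size and the double-hashing scheme are arbitrary choices for this sketch:

import hashlib

# Toy Bloom filter: k bit positions per key; unset bits prove absence, while
# set bits only suggest (possibly falsely) that the key was added.
class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key):
        h = hashlib.sha256(key.encode()).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:16], "big")
        return [(a + i * b) % self.m for i in range(self.k)]   # double hashing

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def may_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("key-42")
assert bf.may_contain("key-42")        # an added key is always reported as present
# bf.may_contain("other") is usually False; if it is True, that is a false positive,
# and the SSTable would simply be searched and found not to contain the key.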

4.1.3.4 Compaction Strategies for LSM-Trees

As the amount of disk-resident table data in an LSM-tree increases, duplicate data
can be reduced through periodic compaction operations. Two basic compaction
strategies are available: tiered compaction and leveled compaction.
1. Specific Implementation of the Two Compaction Strategies
(a) Tiered compaction: The maximum allowable number of SSTables is the
same for all levels. When the number of SSTables at a level generated
through continuous data writing by the immutable MemTable reaches a
specified threshold, all SSTables at this level are compacted into a large new
SSTable that is placed at a higher level. This compaction strategy is easy to
implement but results in serious space amplification problems during com-
pactions. In addition, the read amplification problem is more serious at a
higher level because the size of the SSTable is larger.
(b) Leveled compaction: The size of the SSTable at each level is fixed to 2 MB
by default, and the maximum number of SSTables at each level is N times
that at the lower level. (By default, N = 10 in Scylla and Apache Cassandra.)
2. Compaction Procedure
(a) When L0 is full, all SSTables at L0 are compacted with all SSTables at L1,
and duplicate key values are removed. Due to the size limit of SSTables,
multiple SSTables will be generated and placed at L1.
(b) When L1 to LN are full, an SSTable is selected from one of the levels and
compacted with a next level.
This compaction strategy mitigates the space amplification problem but results in
serious write amplification problems during compactions.
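Whatever the strategy, the core of a compaction is merging sorted runs while keeping only the newest version of each key and, at the bottommost level, dropping tombstones; the following sketch assumes the runs are passed in from newest to oldest and uses None as a stand-in deletion marker:

# Illustrative compaction: merge several sorted runs (newest first), keep the
# newest version of each key, and drop tombstones at the bottommost level.
TOMBSTONE = None   # stand-in deletion marker for this sketch

def compact(runs, bottom_level=False):
    """runs: list of sorted (key, value) lists, ordered from newest to oldest."""
    merged = {}
    for run in runs:                      # newest run first
        for key, value in run:
            merged.setdefault(key, value) # keep only the newest version of the key
    out = sorted(merged.items())
    if bottom_level:                      # expired deletions need not be kept forever
        out = [(k, v) for k, v in out if v is not TOMBSTONE]
    return out

new_run = [("a", 2), ("b", TOMBSTONE)]
old_run = [("a", 1), ("b", 7), ("c", 3)]
assert compact([new_run, old_run], bottom_level=True) == [("a", 2), ("c", 3)]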

4.1.3.5 Read, Write, and Space Amplifications

Just as B+ trees based on a mutable storage structure will inevitably encounter write
amplification, LSM-trees based on an immutable storage structure will encounter
read amplification problems. In fact, different compaction strategies for LSM-trees
bring new problems. In the distributed field, the well-known CAP theorem states
that a distributed system can provide only two of the three guarantees of consistency,
availability, and partition tolerance at a time. In 2016, Manos Athanassoulis et al.
proposed a similar result called the RUM conjecture, which states that any data
structure can be optimized to mitigate at most two of the read, write, and space
amplification problems. In short, LSM-trees based
on an immutable storage structure will face the following three problems:
• Read amplification: The LSM-tree is searched layer by layer during data retrieval,
resulting in additional disk I/O operations. The read amplification problem is
more prominent in range queries.
• Write amplification: Data is continuously rewritten to new files during compac-
tions, resulting in write amplification.
• Space amplification: Duplication is allowed and expired data is not immediately
cleaned up, resulting in space amplification.
Due to the difference in implementation methods, the two compaction strategies
result in different amplification problems.
In tiered compaction, the SSTable at a high level is quite large in size, and origi-
nal SSTable files are retained for fault recovery before the compaction is completed.
As a result, the data volume doubles within a short time. Although the old data is
deleted after compaction is completed, this still causes a serious space amplification
problem.
In leveled compaction, the size of SSTables is fixed. An SSTable at a high level
is selected and compacted with SSTables at the next level that have the same keys
instead of compacting all SSTables at the level with those at the next level. This
mitigates the space amplification problems. However, the SSTable may have the
same keys as 10 SSTables at the next level. This necessitates rewriting 10 SSTable
files, resulting in severe write amplification.

4.2 Concurrency Control

4.2.1 Basic Concepts

As the name suggests, the concurrency control mechanism of a database is used to
control concurrent operations in the database. Control is necessary for ensuring the
integrity and consistency of data. In modern database systems, concurrency control
not only ensures data integrity and consistency but also maximizes system concur-
rency to improve system processing capabilities. Concurrency control mainly
includes pessimistic concurrency control, optimistic concurrency control, and
MVCC (multiversion concurrency control). Pessimistic concurrency control is the
most common mechanism, which is achieved through locking. Optimistic concur-
rency control is also known as optimistic locking. MVCC is commonly used to
handle read/write conflicts in modern database engines, including MySQL, Oracle,
and PostgreSQL, to improve the throughput performance of databases in high-­
concurrency scenarios. MVCC can be used together with either pessimistic or opti-
mistic concurrency control to improve the read performance of databases. In a
database for which MVCC is enabled, multiple physical versions are maintained for
each record. When a transaction writes data, a new version of the data is created.
Read requests obtain the latest version of data that already exists at the time the
transaction or statement began based on the snapshot information. The most direct
benefits of MVCC are as follows: writes do not block reads, reads do not block
writes, and read requests never fail or wait due to conflicts. For most databases,
there are often more read requests than write requests. Almost all mainstream data-
bases implement the MVCC mechanism, but the specific implementations of
MVCC may vary in different database storage engines.

4.2.2 Lock-Based Concurrency Control

When a transaction accesses a data item, no other transaction can modify the data
item. To meet this requirement, a transaction is allowed to access only the data items
it holds locks on. Different data locking methods are available, but this section will
describe only two:
• Shared lock (S-lock): If transaction Ti has acquired a shared lock on element D
in the database, Ti can read D but cannot write to it.
• Exclusive lock (X-lock): If transaction Ti has obtained an exclusive lock on ele-
ment D in the database, Ti can read from and write to D.
Before a transaction performs an operation on element D, it needs to apply for an
appropriate lock based on the operation type. The transaction can proceed to exe-
cute the operation only after acquiring the required lock. Using exclusive and shared
locks can allow multiple transactions to access the same element at the same time,
but only one transaction is allowed to perform a write operation. Only two read
requests can be executed on the same object at the same time. A read request and a
write request or two write requests cannot be executed simultaneously on the same
object due to lock conflicts but can be executed in series. Figure 4.4 shows a com-
patibility matrix for shared and exclusive locks. In the matrix, “yes” indicates that
the two locks are compatible.
To perform an operation on a database element, a transaction must add a lock on
this element. If an incompatible lock has been added on the database element by
another transaction, the transaction cannot acquire the lock on the database element
until the incompatible lock is released.
To further explain locking and unlocking operations, the following operations
are defined:
• LS(D): A transaction requests a shared lock on element D in the database.
• LX(D): A transaction requests an exclusive lock on element D in the database.
• UL(D): A transaction releases the lock on element D in the database.
The two-phase locking (2PL) protocol can ensure serializability. This protocol requires
locks to be acquired and released in two phases:
• Growing phase: A transaction can acquire locks but cannot release them.
• Shrinking phase: A transaction can release locks but cannot acquire new locks.
In each transaction, all lock requests precede all unlock requests. A transaction is
in the growing phase when it starts and can acquire locks as needed in this phase.
Once the transaction starts to release locks, it enters the shrinking phase and can no
longer issue lock requests. The 2PL protocol can effectively ensure conflict serializ-
ability. For any transaction, the position at which the transaction acquires its final
lock in the schedule is called the lock point of the transaction. Multiple transactions
can be sorted according to their lock points, and this sorting order is the
serializability order of the transactions.

Fig. 4.4 Compatibility matrix
The 2PL protocol cannot avoid deadlocks. For example, transactions T1 and T2 in
Fig. 4.5 are two-phased, but a deadlock still occurs.
When the 2PL protocol is adopted, transactions may also read uncommitted data.
In the example depicted in Fig. 4.6, transaction T4 reads uncommitted data A of
transaction T3. If transaction T3 rolls back, a cascading rollback is triggered.
Cascading rollbacks can be avoided by using strict 2PL and strong 2PL proto-
cols. The strict 2PL protocol requires that exclusive locks held by a transaction be
released only after the transaction is committed, whereas the strong 2PL protocol
requires that no locks be released before the transaction is committed. This way,
uncommitted data will not be read.
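A minimal lock-table sketch of the strong 2PL discipline follows: shared locks are compatible with one another, exclusive locks conflict with everything else, and all locks held by a transaction are released only at commit. Deadlock detection is omitted, and the data structures are assumptions made for illustration:

# Toy strong 2PL lock table: S locks share, X locks exclude, release only at commit.
class LockTable:
    def __init__(self):
        self.locks = {}   # item -> {"mode": "S" or "X", "holders": set of txn ids}

    def acquire(self, txn, item, mode):
        entry = self.locks.get(item)
        if entry is None:
            self.locks[item] = {"mode": mode, "holders": {txn}}
            return True
        if mode == "S" and entry["mode"] == "S":
            entry["holders"].add(txn)          # shared with other readers
            return True
        if entry["holders"] == {txn}:          # sole holder may upgrade its lock
            entry["mode"] = "X" if mode == "X" else entry["mode"]
            return True
        return False                           # conflict: the caller must wait

    def commit(self, txn):
        for item in list(self.locks):          # strong 2PL: release all locks at commit
            holders = self.locks[item]["holders"]
            holders.discard(txn)
            if not holders:
                del self.locks[item]

lt = LockTable()
assert lt.acquire("T1", "A", "S") and lt.acquire("T2", "A", "S")   # S locks coexist
assert not lt.acquire("T2", "A", "X")                              # X conflicts with T1's S
lt.commit("T1")
assert lt.acquire("T2", "A", "X")                                  # now T2 can upgrade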

Fig. 4.5 Deadlock between transactions

Fig. 4.6 Cascading rollback of transactions

4.2.3 Timestamp-Based Concurrency Control

Aside from the foregoing lock-based concurrency control methods, another method
used to ensure the serializability of transactions is timestamp-based concurrency
control. This method determines the serialization order of the transactions in advance.


Each transaction Ti in the system is assigned a unique number, which is called the
timestamp and denoted by TS(Ti). The system assigns timestamps in ascending
order before transaction Ti is executed. Two methods can be used to generate
timestamps:
• System clock: The timestamp of a transaction is the clock value when the trans-
action enters the system.
• Logical counter: The counter is incremented by 1 each time a transaction starts,
and the value of the counter is assigned to the transaction as its timestamp.
To use timestamps to ensure the serial scheduling of transactions, each data item
Q must be associated with two timestamps and an additional bit.
WT(Q): The maximum timestamp of all transactions that successfully executed a
Write(Q) operation.
RT(Q): The maximum timestamp of all transactions that successfully executed a
Read(Q) operation.
C(Q): The commit bit of Q, which is set to True only when the most recent transac-
tion that wrote data item Q has been committed. This bit is used to prevent
dirty reads.
The Timestamp Ordering Protocol can ensure that any conflicting read and write
operations are executed in order based on their timestamps. The rules are as follows:
1. Assume that transaction Ti performs a Read(Q) operation:
(a) If TS(Ti) < WT(Q), the read operation cannot be completed, and Ti is
rolled back.
(b) If TS(Ti) ≥ WT(Q), the read operation can be executed.
If C(Q) is true, the request is executed, and the value of RT(Q) is set to the value of
TS(Ti) or RT(Q), whichever is greater.
If C(Q) is false, the system waits for Write(Q) to complete or be terminated.
2. Assume that transaction Ti performs a Write(Q) operation:
(a) If TS(Ti) < RT(Q), the value that transaction Ti attempts to write is no longer
needed, and Ti is rolled back.
(b) If TS(Ti) < WT(Q), the value that transaction Ti attempts to write is outdated,
and Ti is rolled back.
(c) If TS(Ti) ≥ RT(Q) or TS(Ti) ≥ WT(Q), the system performs the Write(Q)
operation and sets WT(Q) to TS(Ti) and C(Q) to False.
When transaction Ti issues a commit request, C(Q) is set to True, and a transac-
tion waiting for data item Q to be committed can be executed.
Under the preceding rules, the system assigns new timestamps for read and write
operations of rolled back transactions.
Now consider another case and assume that TS(Ti) < TS(Tj). In this case, the Read(Q)
operation of transaction Ti succeeds first, and the Write(Q) operation of transaction Tj
is then completed. When transaction Ti subsequently attempts to execute its own
Write(Q) operation, TS(Ti) < WT(Q). In this case, the Write(Q) operation of Ti will be
rejected, and transaction Ti will be rolled back. The outdated write
operation will be rolled back according to the rules of the Timestamp Ordering
Protocol. However, the rollback is unnecessary in this case. Therefore, the
Timestamp Ordering Protocol can be modified so that a write operation can be
skipped if a later write operation has been performed. This method is called the
Thomas write rule.
Assume that transaction T issues a Write(Q) request. The basic principles of the
Thomas write rule are as follows:
• When TS(T) < RT(Q), the Write(Q) operation will be rejected and transaction T
will be rolled back.
• When TS(T) < WT(Q), the data item Q that transaction T wants to write is out-
dated, and the Write(Q) operation does not need to be executed.
If neither of the above situations exists, the Write(Q) operation will be executed,
and WT(Q) will be set to TS(T).
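The write-side rules, including the skipped write of the Thomas write rule, can be condensed into the following sketch; the dictionaries standing in for RT(Q), WT(Q), and the stored values are illustrative assumptions:

# Illustrative timestamp-ordering write with the Thomas write rule applied.
RT, WT, VALUE = {}, {}, {}     # read timestamp, write timestamp, current value per item

def write(item, value, ts):
    if ts < RT.get(item, 0):
        return "rollback"      # a younger transaction already read the old value
    if ts < WT.get(item, 0):
        return "skip"          # Thomas write rule: an even newer write already exists
    VALUE[item] = value        # otherwise perform the write
    WT[item] = ts
    return "ok"

assert write("Q", "v1", ts=5) == "ok"
assert write("Q", "v0", ts=3) == "skip"      # outdated write is ignored, no rollback
RT["Q"] = 10
assert write("Q", "v2", ts=7) == "rollback"  # a reader with ts=10 has already read Q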
In the preceding lock-based and timestamp-based concurrency control methods,
when a conflict is detected, the transaction waits or rolls back even if the schedule
is conflict serializable. Therefore, these methods can be categorized as pessimistic
concurrency control methods. In scenarios with many read transactions, the fre-
quency of transaction conflicts is low. If pessimistic concurrency control is used,
system overheads may increase. In most cases, minimal system overheads are
expected. Therefore, a validation mechanism is used to reduce system overheads.
Unlike the lock-based and timestamp-based concurrency control methods, the vali-
dation mechanism is optimistic in executing transactions, so it is also called opti-
mistic concurrency control.
When a transaction is executed by using a validation mechanism, the transaction
will be executed in three phases:
• Read phase: The transaction reads the required database elements from the data-
base and saves them in its local variables.
• Validation phase: The transaction is validated. If the transaction passes the vali-
dation, the third phase will be executed. Otherwise, the transaction will be
rolled back.
• Write phase: The transaction writes modified elements to the database. Read-­
only transactions can ignore this phase.
Each transaction will be executed based on the order of the preceding three
phases. To facilitate validation, the following three timestamps will be used:
• Start(Ti): the time when transaction Ti starts its execution.
• Validation(Ti): the time when transaction Ti finishes its read phase and starts its
validation phase. At this time, the write phase of transaction Ti has not started.
• Finish(Ti): the time when the write phase of transaction Ti is completed.
Assume that transaction Ti is validated against another transaction Tj that was
validated earlier. Transaction Ti is considered to have passed the validation with
respect to Tj when any of the following rules is met:
• Finish(Tj) < Start(Ti): the execution of transaction Tj is completed before
transaction Ti starts. In this case, the two transactions do not overlap at all.
• Start(Ti) < Finish(Tj) < Validation(Ti): transaction Tj completes its write phase
after transaction Ti starts but before transaction Ti starts its validation. In this
case, the data set written by transaction Tj must not intersect with the data set
read by transaction Ti.
• Finish(Tj) > Validation(Ti): transaction Tj completes its write phase after
transaction Ti starts its validation. In this case, the data set written by transaction
Tj must not intersect with the data set read by transaction Ti or the data set
written by transaction Ti.

4.2.4 MVCC

The lock-based and timestamp-based concurrency control mechanisms either
delay an operation or terminate the transaction that issued the operation to main-
tain serializability. Although these two concurrency control mechanisms can fun-
damentally solve the serializability problem of concurrent transactions, most
database transactions in practical environments are read-only, and the number of
read requests is much higher than that of write requests. If there is no concurrency
control mechanism between write and read requests, the worst case is that the read
requests read data that has been written. This scenario is acceptable for many
applications.
Under this premise, another concurrency control mechanism, MVCC, is intro-
duced for database systems. In this mechanism, a new version of data is created for
each write operation, and a read operation selects the most suitable result from a
finite number of versions. In this case, attention no longer needs to be paid to con-
flicts between read and write operations. How to manage data versions and quickly
select the desired versions of data becomes the main problem that MVCC needs to
resolve. MVCC can be used together with any of the preceding concurrency control
mechanisms to improve the read performance of the database.

4.2.4.1 Multiversion 2PL

Multiversion 2PL aims to combine the advantages of MVCC with lock-based con-
currency control. This protocol distinguishes between read-only transactions and
update transactions.
Update transactions are executed according to the 2PL protocol. In other words,
an update transaction holds all locks until the transaction ends. Therefore, they can
be serialized based on the order in which they are committed. Each version of a data
item has a timestamp. The timestamp is not a real clock-based timestamp but a
counter (called a TS-Counter) that increases when a transaction is committed.
Before a read-only transaction is executed, the database system reads the current
value of the TS-Counter and uses the value as the timestamp of the transaction.
Read-only transactions adhere to the Multiversion Timestamp Ordering Protocol
when performing read operations. Therefore, when read-only transaction T1 issues
a Read(Q) request, the return value is the content of the maximum timestamp ver-
sion that is less than TS(T1).
When an update transaction reads a data item, it first obtains a shared lock on the
data item and then reads the latest version of the data item. When an update transac-
tion wants to write a data item, it must obtain an exclusive lock on the data item and
then create a new version for the data item. The write operation is performed on the
new version, and the timestamp of the new version is initially set to infinity (∞).
After update transaction T1 completes its task, it is committed in the following
way: T1 first sets the timestamp of each version it created to the value of the
TS-Counter plus 1. Then, T1 increments the TS-Counter by 1. Only one update
transaction can be committed at a time.
This way, only read-only transactions started after T1 increases the TS-Counter
see the values updated by T1. Read-only transactions started before T1 increases the
TS-Counter see the values before T1 makes the updates. In either case, read-only
transactions do not need to wait for locks. The Multiversion 2PL protocol also
ensures that scheduling is recoverable and noncascading.
Version deletion is similar to the approach used in the Multiversion Timestamp
Ordering Protocol. Assuming a data item has two versions, Q1 and Q2, and the time-
stamps of both versions are less than or equal to the timestamp of the oldest read-­only
transaction in the system, the older version will no longer be used and can be deleted.

4.2.4.2 Multiversion Timestamp Ordering

The Timestamp Ordering Protocol can be extended to a multiversion protocol. Each
transaction Ti in the system is associated with a unique static timestamp denoted as
TS(Ti). The database system assigns this timestamp before the transaction starts.
Each data item Q is associated with a version sequence <Q1, Q2, ..., Qm>, and
each version Qk contains three data fields:
• Content: the value of version Qk.
• W-TS(Qk): the timestamp of the transaction that created version Qk.
• R-TS(Qk): the maximum timestamp of all transactions that successfully read
version Qk.
The multiversion timestamp ordering mechanism works based on the following
rules: Assume that transaction Ti issues a Read(Q) or Write(Q) operation, and let Qk
denote the version of Q whose write timestamp is the largest timestamp that is less
than or equal to TS(Ti):
• If transaction Ti executes the Read(Q) operation, the return value is the con-
tent of Qk.
• If transaction Ti executes the Write(Q) operation and TS(Ti) < R-TS(Qk), the
system rolls back transaction Ti. If TS(Ti) = W-TS(Qk), the system overrides the
content of Qk. If TS(Ti) > R-TS(Qk), the system creates a new version of Q.
According to these rules, a transaction reads the latest version that precedes it. If
a transaction attempts to write to a version that another transaction has already read,
the write operation cannot succeed.
Versions that are no longer needed are deleted based on the following rule:
Assume that a data item has two versions, Qi and Qj, and the W-TS values of both
versions are less than the timestamp of the oldest transaction in the system. In this
case, the older version will no longer be used and can be deleted.
The multiversion timestamp ordering mechanism ensures that read requests
never fail and do not have to wait. However, this mechanism also has some draw-
backs. First, reading a data item requires updating the R-TS field, which results in
two potential disk accesses (instead of one). Second, conflicts between transac-
tions are resolved through rollbacks instead of waiting, which significantly
increases overheads. The Multiversion 2PL Protocol can effectively alleviate
these issues.
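The version-selection rules of multiversion timestamp ordering can be sketched as follows; keeping each version as a [w_ts, r_ts, value] triple in a list is a simplification made for illustration:

# Illustrative multiversion timestamp ordering: reads pick the newest version that
# is not younger than the reader; writes are rejected if a younger transaction has
# already read the version they would overwrite.
versions = {"Q": [[0, 0, "initial"]]}    # item -> list of [w_ts, r_ts, value]

def visible_version(item, ts):
    cands = [v for v in versions[item] if v[0] <= ts]
    return max(cands, key=lambda v: v[0])          # largest w_ts <= ts

def read(item, ts):
    v = visible_version(item, ts)
    v[1] = max(v[1], ts)                           # advance R-TS of that version
    return v[2]

def write(item, value, ts):
    v = visible_version(item, ts)
    if ts < v[1]:
        return "rollback"                          # a younger reader saw the old version
    if ts == v[0]:
        v[2] = value                               # overwrite the version it created
    else:
        versions[item].append([ts, ts, value])     # otherwise create a new version
    return "ok"

assert read("Q", ts=5) == "initial"                # R-TS of the base version becomes 5
assert write("Q", "x", ts=3) == "rollback"         # ts=3 < R-TS=5 on the visible version
assert write("Q", "y", ts=8) == "ok" and read("Q", ts=9) == "y"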

4.2.5 Implementation of MVCC in InnoDB

This section analyzes the implementation of MVCC in InnoDB, which is the default
storage engine of MySQL. The implementation of MVCC relies on two hidden
fields (DATA_TRX_ID and DATA_ROLL_PTR) added to each table, the snapshot
(also known as a read view) created by transactions during querying and the data
version chain (i.e., the undo logs) of the database.

4.2.5.1 Three Hidden Fields

The InnoDB engine adds three hidden fields to each table to implement data multi-
versioning and clustered indexing. Among these fields, DATA_TRX_ID and
DATA_ROLL_PTR are used for data multiversioning. Figure 4.7 shows the table
structure in InnoDB.

Fig. 4.7 Table structure in InnoDB


DATA_TRX_ID occupies 6 bytes and records the ID of the transaction that last updated
or inserted the record. Deletion is treated as an update in the database, with a
deletion flag set at a special location in the row.
DATA_ROLL_PTR is the rollback pointer that points to the undo log written in
the rollback segment. When a row is updated, the undo log records the content of the
row before the update. The InnoDB engine uses this pointer to find the previous
versions of data. All old versions of a row are organized in the form of a linked list
in the undo log. The DATA_ROLL_PTR field itself occupies 7 bytes.
DB_ROW_ID is a row ID that increments monotonically when new rows are
inserted. InnoDB uses a clustered index, which stores data based on the order of the
sizes of the clustered index fields. When a table has no primary key or unique non-
null index, the InnoDB engine automatically generates a hidden primary key that is
used as the clustered index of the table. DB_ROW_ID is not related to MVCC and
has a size of 6 bytes.
The following examples explain the specific operations involving MVCC:
• SELECT operation: InnoDB checks each row based on the following conditions:
(1) InnoDB looks up only data rows whose versions are earlier than or equal to
the current transaction version (i.e., the transaction ID of the row is less than or
equal to the current transaction ID). This ensures that the rows read by the trans-
action either already existed before the transaction started or were inserted or
modified by the transaction itself. (2) The deletion version of the row is either
undefined or greater than the current transaction ID. This ensures that the rows
read by the transaction were not deleted before the transaction started. Only
records that meet the above two conditions can be returned as the query results.
• INSERT operation: InnoDB saves the current transaction ID as the row version
number for each newly inserted row.
• DELETE operation: InnoDB saves the current transaction ID as the row deletion
flag for each deleted row.
• UPDATE operation: InnoDB inserts a new row for the updated row, saves the
current transaction ID as the row version number of the new row, and saves the
current transaction ID as the row deletion flag of the original row.
An undo log is used to record data before the data is modified. Before a row is
modified, its data is first copied to an undo log. When a transaction needs to read a
row and the row is invisible, the transaction can use the rollback pointer to find the
visible version of the row along the version chain in the undo log. When a transac-
tion rolls back, data can be restored by using records in the undo log.
On the one hand, undo logs can be used to construct records during snapshot
reads in MVCC. In MVCC, different transaction versions can have their indepen-
dent snapshot data versions by reading the historical versions of data in an undo log.
On the other hand, undo logs ensure the atomicity and consistency of transactions
during rollback. When a transaction is rolled back, the data can be restored by using
the data in the undo log.

4.2.5.2 Implementation of Consistent Reads

A read view is a snapshot that records the ID array and related information of cur-
rently active transactions in the system. It is used for visibility judgment, that is, to
check whether the current transaction is eligible to access a row. A read view has
multiple variables, including the following:
• trx_ids: This variable stores the list of active transactions, namely, the IDs of
other uncommitted active transactions when the read view was created. For
example, if transaction B and transaction C in the database have not been com-
mitted or rolled back when transaction A creates a read view, trx_ids will record
the transaction IDs of transaction B and transaction C. If a record that contains
the ID of the current transaction exists in trx_ids, the record is invisible.
Otherwise, it is visible.
• low_limit_id: The maximum transaction ID +1. The value of this variable is
obtained from the max_trx_id variable of the transaction system. If the transac-
tion ID contained in a record is greater than the value of low_limit_id of the read
view, the record is invisible in the current transaction.
• up_limit_id: The minimum transaction ID in trx_ids. If trx_ids is empty, up_
limit_id is equal to low_limit_id. Although the field name is up_limit_id, the last
active transaction ID in trx_ids is the smallest one because the active transaction
IDs in trx_ids are sorted in descending order. Records with a transaction ID less
than the value of up_limit_id are visible to this view.
• creator_trx_id: The ID of the transaction that created the current read view.
MVCC supports only the read committed and repeatable read isolation lev-
els. The difference in their implementations lies in the number of generated
read views.
The repeatable read isolation level avoids dirty reads and nonrepeatable reads
but has phantom read problems. For this level, MVCC is implemented as follows:
In the current transaction, a read view is generated only for the first ordinary
SELECT query; all subsequent SELECT queries reuse this read view. The trans-
action always uses this read view for snapshot queries until the transaction ends.
This avoids nonrepeatable reads but cannot prevent the phantom read problem,
which can be solved by using gap locks and record locks of the next-key lock
algorithm.
For the read committed isolation level, MVCC is implemented as follows: A new
snapshot is generated for each ordinary SELECT query. Each time a SELECT state-
ment starts, all active transactions in the current system are recopied to the list to
generate a read view. This achieves higher concurrency and avoids dirty reads but
cannot prevent nonrepeatable reads and phantom read problems.
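The visibility test that a read view applies to a record's DATA_TRX_ID can be sketched as follows; the ReadView fields mirror the variables described above, but the function is an illustrative simplification rather than InnoDB source code:

# Illustrative read-view visibility check on a record's DATA_TRX_ID.
class ReadView:
    def __init__(self, trx_ids, low_limit_id, up_limit_id, creator_trx_id):
        self.trx_ids = set(trx_ids)          # transactions active when the view was created
        self.low_limit_id = low_limit_id     # IDs at or above this started after the view
        self.up_limit_id = up_limit_id       # IDs below this committed before the view
        self.creator_trx_id = creator_trx_id

def is_visible(record_trx_id, view):
    if record_trx_id == view.creator_trx_id:
        return True                          # the transaction sees its own changes
    if record_trx_id < view.up_limit_id:
        return True                          # committed before the view was created
    if record_trx_id >= view.low_limit_id:
        return False                         # started after the view was created
    return record_trx_id not in view.trx_ids # visible only if not active at view creation

view = ReadView(trx_ids=[7, 9], low_limit_id=11, up_limit_id=7, creator_trx_id=9)
assert is_visible(5, view) and not is_visible(7, view) and is_visible(9, view)
assert is_visible(8, view) and not is_visible(12, view)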

4.3 Logging and Recovery

4.3.1 Basic Concepts

Transaction persistence and fault recovery are essential features of a database man-
agement system. The design and implementation of fault recovery often affect the
architecture and performance of the entire system. In the past decade, ARIES-style
WAL has become the standard for implementing logging and recovery, especially in
disk-based database systems.
As the state of the database changes, the objects, data, and procedures of write
operations are recorded; this recording process is known as logging. Logging
technology can achieve the atomicity and persistence of transactions. Atomicity
means that when a system fails, committed transactions can be reexecuted by
replaying the logs, and old data of uncommitted transactions can be revoked.
Persistence means that after logs are written to the disk, the results of transaction
execution can be recovered by using the logs on the disk.
Unlike the swapping in and out of dirty pages, writing logs to a disk is a sequen-
tial process. LSNs indicate the relative sequence of log records generated by differ-
ent transactions. Sequentially writing logs is faster than randomly writing dirty
pages to disk. This is why most transactions are committed after logs are flushed to
disk rather than after dirty pages are flushed to disk.
In a database storage engine, generated logs are typically written to a log buffer.
The logs of multiple transaction threads are written to the same log buffer and then
flushed to external storage, such as a disk, by I/O threads. The progress of log flush-
ing must be checked before transactions are committed. Only transactions that meet
the WAL conditions can be committed. In most cases, multiple log files exist in the
system. Two mechanisms are available for managing the log files. One is to not
reuse log files, as in PostgreSQL. With this mechanism, log files continuously
increase in quantity and size. The other mechanism is to reuse log files, as in
MySQL. With this mechanism, two or more log files are used alternately. The mech-
anism that does not reuse log files can tolerate long transactions but requires addi-
tional mechanisms for clearing the continuously increasing log files. The mechanism
that reuses log files mandates that the size of the log files be configured based on the
length of the longest transaction. Otherwise, the database system stops providing
services once the log files are used up because it cannot commit transactions.
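The following sketch illustrates the WAL condition check at commit time. It assumes a single log buffer flushed by background I/O threads, and names such as flushed_lsn are illustrative rather than the identifiers of any particular engine.

```cpp
// A minimal sketch of waiting for log durability before commit; illustrative only.
#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>

using lsn_t = std::uint64_t;

class LogSystem {
 public:
  // Called by an I/O thread after the log buffer has reached the disk up to `upto`.
  void OnFlushed(lsn_t upto) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      flushed_lsn_ = std::max(flushed_lsn_, upto);
    }
    cv_.notify_all();
  }

  // A transaction may commit only after its last log record is durable (the WAL condition).
  void WaitForDurability(lsn_t commit_lsn) {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] { return flushed_lsn_ >= commit_lsn; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  lsn_t flushed_lsn_ = 0;
};
```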

4.3.2 Logical Logs

Logical logs are quite common, but not all databases support logical logs. A logical
log records database modification operations, which are often a simple variant of
user input. Such operations are not parsed in detail and are only related to the logical
views provided by the database. They are irrelevant to the underlying data organiza-
tion structure of the database.

Taking a relational database as an example, the storage engine may read and
write data in traditional page storage managed by using B+ trees or in compacted
storage managed by using LSM-trees. However, for users, data is always organized
in the form of tables rather than pages or key-value pairs. The logical logs of a rela-
tional database record user operations on tables. These operations may be SQL
statements or simple variants of SQL statements and do not involve the actual form
of data storage.
In general, logical logs are not crucial because many database systems use physi-
cal logs as the primary basis for fault recovery. Besides, not using logical logs does
not affect normal database operation. So, what is the significance of logical logs?
For starters, logical logs are independent of the physical storage and thus more por-
table than physical logs. When data is migrated between database systems that use
different physical log formats, logical logs in a universal format are of vital impor-
tance. Provided that two systems use a uniform format for logical logs, data can be
migrated from one system to another by parsing and replaying the logical logs.
Support for logical logs can greatly simplify the workflow in scenarios such as data
migration, log replication, and coordination of multiple storage engines.
Logical logs often come with additional overheads because transactions also
need to wait for the logical logs, in addition to physical logs, to persist before they
can be committed. Physical logs can be generated and flushed to disk during trans-
action execution, whereas logical logs are often generated and flushed to disk all at
once when a transaction commits. Using logical logs can affect the system through-
put. Moreover, the replay of logical logs is much slower than that of physical logs,
and the parsing cost of the former is higher than that of the latter.

4.3.3 Physical Logs

Physical logs are the foundation of fault recovery and are therefore a must-have
feature for any mature database system. These logs record write operations on data,
and the description of such write operations is often directly related to the way data
is organized on physical storage. By parsing physical logs, the system can determine
the actual modifications made to the physical storage, but it cannot recover the content
of the logical operations that produced them. When physical logs are used for recovery, all
physical logs must be parsed and replayed to obtain the final status of the database.
Physical logs include the redo log and undo logs. Undo logs record only old values
of database elements. This means that an undo log can only use the old values to over-
write the current values of the database elements to undo the modifications made by a
transaction to the database state. An undo log is commonly used to undo uncommitted
changes that a transaction made before the system crashed. During undo logging, a
transaction cannot be committed before all modifications are written to a disk. As a
result, the transaction must wait for all I/O operations to be completed before it is
committed. Redo logging does not have this problem. The redo log records the new
values of database elements. During recovery based on the redo log, uncommitted
transactions are ignored and the modifications made by committed transactions are
reapplied. Provided that the redo log is persisted to the disk before a transaction is
committed, the modifications of the transaction can be recovered by using the redo log
after the system crashes. Sequentially writing logs naturally incurs lower I/O opera-
tion costs than random writes and reduces the waiting time before transaction commits.
The redo log and undo logs are not mutually exclusive and can be used in com-
bination in some databases. Later, we will discuss how the redo log and undo logs
are used in combination in MySQL.
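The following sketch illustrates the difference with simplified physical log records that carry both the old value and the new value of a modified location. The record layout and the in-memory page store are illustrative assumptions, not the format of a real engine.

```cpp
// A minimal sketch contrasting undo (old value) and redo (new value) application.
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

struct PhysicalLogRecord {
  std::uint64_t page_id = 0;
  std::size_t offset = 0;
  std::string before_image;  // old value, needed for undo
  std::string after_image;   // new value, needed for redo
};

// Stand-in for writing into a buffer pool page.
std::map<std::pair<std::uint64_t, std::size_t>, std::string> g_pages;
void WriteBytes(std::uint64_t page_id, std::size_t offset, const std::string& bytes) {
  g_pages[{page_id, offset}] = bytes;
}

// Undo: restore the old value to revoke an uncommitted change.
void ApplyUndo(const PhysicalLogRecord& rec) {
  WriteBytes(rec.page_id, rec.offset, rec.before_image);
}

// Redo: reapply the new value so that a committed change survives a crash.
void ApplyRedo(const PhysicalLogRecord& rec) {
  WriteBytes(rec.page_id, rec.offset, rec.after_image);
}
```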

4.3.4 Recovery Principles

A database may encounter the following types of faults during operation: transaction
errors, process errors, system failures, and media damage. The first two are self-
explanatory. System failures refer to failures of the operating system or hardware, and
media damage refers to irreversible damage to the storage media. These faults must be
properly handled to ensure the correctness of the entire system. As such, the database
system must support two major features: transaction persistence and atomicity.
Transaction persistence ensures that the updates made by a committed transac-
tion still exist after failure recovery. Transaction atomicity means that all modifica-
tions made by an uncommitted transaction are invisible. The sequential access
performance of traditional disks is much better than the random access performance.
Therefore, a log-based fault recovery mechanism is used. In this mechanism, write
operations on the database are sequentially written to log records, and the database
is recovered to a correct state by using the log records after a fault. To ensure that
the latest database state can be obtained from logs during recovery, the logs must be
flushed to a disk before the data content. This action is known as the WAL principle.
A fault recovery process usually includes three phases: analysis, redo, and undo.
The analysis phase includes the following tasks: (1) The scope of redo and undo oper-
ations in subsequent redo and undo phases is confirmed based on the checkpoint and
log information. (2) The dirty page set information recorded in checkpoints is cor-
rected based on logs. (3) The position of the smallest LSN in the checkpoint is deter-
mined and used as the start position of the redo phase. (4) The set of active transactions
(uncommitted transactions) recorded in checkpoints is corrected, where the transac-
tions will be rolled back in the undo phase. In the redo phase, all log records are
redone one by one based on the start position determined in the analysis phase. Note
that the modifications made by uncommitted transactions are also reapplied in this
phase. In the undo phase, uncommitted transactions are rolled back based on undo
logs to revoke the modifications made by these transactions.
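The following sketch outlines these three phases over simplified in-memory structures. The checkpoint fields and the helper functions are illustrative assumptions; they are not the recovery code of a real database system.

```cpp
// A minimal sketch of the analysis, redo, and undo phases; illustrative only.
#include <cstdint>
#include <set>
#include <vector>

using lsn_t = std::uint64_t;
using trx_id_t = std::uint64_t;

struct LogRecord {
  lsn_t lsn = 0;
  trx_id_t trx = 0;
  bool is_commit = false;  // true if this is the commit record of trx
};

struct Checkpoint {
  lsn_t redo_start_lsn = 0;        // smallest LSN of a dirty page at checkpoint time
  std::set<trx_id_t> active_trxs;  // transactions open at checkpoint time
};

void RedoRecord(const LogRecord&) { /* reapply the page change (omitted) */ }
void UndoTransaction(trx_id_t) { /* roll back via undo log (omitted) */ }

void Recover(const Checkpoint& cp, const std::vector<LogRecord>& log) {
  // Analysis: correct the set of uncommitted (loser) transactions using the log tail.
  std::set<trx_id_t> losers = cp.active_trxs;
  for (const auto& rec : log) {
    if (rec.lsn < cp.redo_start_lsn) continue;
    if (rec.is_commit) losers.erase(rec.trx); else losers.insert(rec.trx);
  }
  // Redo: repeat history from the start position, including changes made by losers.
  for (const auto& rec : log) {
    if (rec.lsn >= cp.redo_start_lsn && !rec.is_commit) RedoRecord(rec);
  }
  // Undo: roll back every transaction that never committed.
  for (trx_id_t trx : losers) UndoTransaction(trx);
}
```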

4.3.5 Binlog of MySQL

The binary log (binlog) in MySQL is a type of logical log that records changes made
to data in the MySQL service layer. Binary logging can be enabled at the startup of
the MySQL service by specifying related parameters.

The binlog contains all statements that update data and statements that have the
potential to update data. It also includes the duration of each statement used to
update data. In addition, the binlog contains service layer status information required
for correctly reexecuting statements, error codes, and metadata information required
for maintaining the binlog.
The binlog serves two important purposes. The first one is replication. The bin-
log is usually sent to replica servers during leader-follower replication. Many details
of the format and handling methods of the binlog are designed for this purpose. The
leader sends the update events contained in the binlog to the followers. The follow-
ers store these update events in the relay log, which has the same format as the
binlog. The followers then execute these update events to redo the data modifica-
tions made on the leader. The second purpose is data recovery. After backup files
are restored, the events recorded in the binlog after the backup was completed are
reexecuted to bring the database up to date from the point of the backup.
As a logical log, the binlog must be consistent with the physical logs. This can be
ensured by using the two-phase commit (2PC) protocol in MySQL. Regular trans-
actions are treated as internal eXtended Architecture (XA) transactions in MySQL
and each is assigned with an XID. A transaction is committed in two phases. In the
first phase, the InnoDB engine writes the redo log to a disk, and the transaction
enters the Prepare state. In the second phase, the binlog is written to the disk, and
the transaction enters the Commit state. The binlog of each transaction records an
XID event at the end to indicate whether the transaction is committed. During fault
recovery, content after the last XID event in the binlog must be cleared.
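The following sketch illustrates this commit order. The functions are illustrative stand-ins for the engine and binlog subsystems, not MySQL's actual internal interfaces.

```cpp
// A minimal sketch of the internal two-phase commit order; illustrative only.
#include <cstdint>

struct Transaction { std::uint64_t xid = 0; };

// Phase 1: make the engine's redo log durable and mark the transaction PREPARED.
void EngineWriteRedoAndPrepare(Transaction&) { /* flush redo, set state = PREPARED */ }
// Write the transaction's binlog event group, ending with an XID event, and sync it.
void BinlogWriteAndSync(Transaction&) { /* append events, sync the binlog */ }
// Phase 2: mark the transaction committed inside the engine.
void EngineCommit(Transaction&) { /* set state = COMMITTED, release resources */ }

void CommitWithBinlog(Transaction& trx) {
  EngineWriteRedoAndPrepare(trx);
  // If a crash happens here, recovery finds a prepared transaction: it is committed
  // only if its XID event is present in the binlog, and rolled back otherwise.
  BinlogWriteAndSync(trx);
  EngineCommit(trx);
}
```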

4.3.6 Physical Logs of InnoDB

As the default storage engine for MySQL, the InnoDB engine has two essential
logs: undo logs and the redo log. Undo logs are used to ensure atomicity and isola-
tion of transactions, and the redo log is used to ensure persistence of transactions.
Undo logs are essential for transaction isolation. Each time a data record is modi-
fied, an undo log record is generated and subsequently recorded in the system
tablespace by default. However, MySQL 5.6 and later support the use of an inde-
pendent undo log tablespace. Undo logs store old versions of data. When an old
transaction needs to read data, it needs to traverse the version chain in the undo log
to find the records visible to it. This can be time-consuming if the version chain
is long.
The common data modification operations include INSERT, DELETE, and
UPDATE operations. Data inserted by an INSERT operation is visible only to the
current transaction; other transactions cannot find the newly inserted data by using
an index before the current transaction is committed. The generated undo log can
then be deleted after the transaction is committed. For UPDATE and DELETE oper-
ations, multiple versions of data need to be maintained. The undo log records of
these operations in the InnoDB engine are of the Update_Undo type and cannot be
directly deleted.

To ensure transaction concurrency, all transactions can simultaneously write their
undo logs. InnoDB uses rollback segments to maintain undo logs. Each rollback seg-
ment contains multiple undo log slots. In MySQL 5.6, rollback segment 0 is reserved
in the system tablespace, rollback segments 1–32 are stored in the system tablespace
for temporary tables, and rollback segments 33–128 are stored in the independent
undo log tablespace or in the system tablespace if the independent undo log tablespace
is not enabled. Each rollback segment maintains a segment header page, which con-
tains 1024 slots. Each slot corresponds to one undo log object. Theoretically, InnoDB
can support up to 96 × 1024 concurrent regular transactions.
When a read/write transaction is started or a read-only transaction is converted
into a read/write transaction, a rollback segment needs to be allocated in advance to
the transaction. When a data change occurs, the data before the change is recorded
in an undo log to maintain multiversioning information. INSERT operations are
recorded in an undo log that is independent of an undo log that records a DELETE
or UPDATE operation. Therefore, a separate undo log slot needs to be allocated
from the rollback segment. Once an undo log slot is allocated, undo records can be
written to the slot.
If a transaction is rolled back due to an exception or explicitly rolled back, all
data changes made by the transaction are undone. The transaction is rolled back
based on undo logs. The old versions of the records are obtained by parsing the undo
logs, and reverse operations are performed: for a record with a deletion flag, the
deletion mark is removed; for an in-place update, the data is restored to its previous
version; and for an INSERT operation, the clustered index and secondary index
records are directly deleted.
Multiversioning in InnoDB is implemented based on undo logs. The undo logs
include the mirrors of data before updates. If a transaction that changes data is not
committed, the modified data is invisible to other transactions whose isolation levels
are greater than or equal to read committed, and an old version of the data is returned.
When a clustered index record is modified, the rollback segment pointer and trans-
action ID are always stored. The corresponding undo log record can be found by
using the pointer, and the visibility of the data record can be determined by using the
transaction ID. When the transaction ID recorded in a version is invisible to the
current transaction, the search continues along the version chain to older versions
until it finds a visible record or reaches the end of the version chain.
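The following sketch illustrates this traversal of a version chain. The types are illustrative, and the visibility predicate stands in for the read view check discussed in Sect. 4.2.5.2.

```cpp
// A minimal sketch of walking an undo version chain until a visible version is found.
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

using trx_id_t = std::uint64_t;

struct RowVersion {
  trx_id_t trx_id = 0;                // transaction that wrote this version
  std::string payload;                // column values of this version
  const RowVersion* older = nullptr;  // rollback pointer toward older versions
};

std::optional<std::string> ReadVisibleVersion(
    const RowVersion* latest, const std::function<bool(trx_id_t)>& is_visible) {
  for (const RowVersion* v = latest; v != nullptr; v = v->older) {
    if (is_visible(v->trx_id)) return v->payload;  // the newest visible version wins
  }
  return std::nullopt;  // no version of this row is visible to the reader
}
```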
To manage dirty pages, InnoDB maintains a flush list on each buffer pool
instance. Pages on the flush list are sorted by the LSNs of page modifications. When
periodic checkpointing is performed, the selected LSN is always that of the oldest
page on the flush list; the oldest page has the smallest LSN. A mini transaction is the
smallest transaction unit for physical data operations in InnoDB. After each mini
transaction is completed, the logs generated locally must be copied to the public log
buffer, and the modified dirty pages are put to the flush list. The redo log of InnoDB
is generated by mini transactions, written to the mini transaction buffer, and then
committed to the public log buffer. The public log buffer follows a specific format
and is 512-byte aligned, which is consistent with the block size of redo log files. A
log block may contain records submitted by multiple mini transactions, or the log of
one mini transaction may occupy multiple log blocks.
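The following sketch illustrates a mini transaction commit under these assumptions: the local log is appended to the public log buffer under a latch, and the dirty pages are registered on the flush list. The structures and names are simplified illustrations, not the InnoDB source.

```cpp
// A minimal sketch of a mini transaction commit; illustrative only.
#include <cstdint>
#include <list>
#include <mutex>
#include <string>
#include <vector>

using lsn_t = std::uint64_t;

struct DirtyPage { std::uint64_t page_id; lsn_t oldest_modification; };

struct LogBuffer {
  std::mutex mu;
  std::string bytes;      // public redo log buffer, later flushed in 512-byte blocks
  lsn_t current_lsn = 0;
};

struct FlushList {
  std::mutex mu;
  std::list<DirtyPage> pages;  // ordered by oldest_modification (smallest LSN at the front)
};

struct MiniTransaction {
  std::string local_log;                 // redo records generated by this mtr
  std::vector<std::uint64_t> dirty_ids;  // pages modified by this mtr

  lsn_t Commit(LogBuffer& log, FlushList& flush) {
    lsn_t start_lsn;
    {
      std::lock_guard<std::mutex> lk(log.mu);
      start_lsn = log.current_lsn;
      log.bytes.append(local_log);           // copy into the public log buffer
      log.current_lsn += local_log.size();
    }
    std::lock_guard<std::mutex> lk(flush.mu);
    for (auto id : dirty_ids)
      flush.pages.push_back({id, start_lsn});  // newly dirtied pages go to the tail
    return start_lsn;
  }
};
```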

Writing of redo log files can be triggered by any of the following conditions:
insufficient space of the redo log buffer, transaction commits, back-end threads,
checkpointing, instance shutdown, and binlog switching. The redo log in InnoDB is
written in circular overwrite mode and does not have infinite space. Although a large
redo log space is theoretically available, checkpointing in a timely manner is still
essential for rapid recovery from crashes. The master thread of InnoDB performs
redo log checkpointing roughly every 10 s.
In addition to the regular redo log, InnoDB provides a file log type that allows
you to create specific files and assign specific names to the files to indicate specific
operations. Currently, two operations are supported: undo log tablespace truncate
operation and user tablespace truncate operation. The file logs can ensure the atomi-
city of these operations.

4.4 LSM-Tree Storage Engine

PolarDB X-Engine [1] (hereinafter referred to as X-Engine) is an OLTP database-
oriented storage engine developed by the Database Products Business Unit of
Alibaba Cloud. As one of the optional storage engines for PolarDB, a database
product independently developed by Alibaba Cloud, X-Engine has been widely
used in many business systems within the Alibaba Group, including the transaction
history database, DingTalk history database, and other core applications. X-Engine
has significantly reduced costs for businesses and served as a crucial database tech-
nology that empowers Alibaba Group to withstand traffic bursts that can surge to
hundreds of times greater than the average during the Double 11 shopping festival
in China. This section introduces the design principles, innovative technologies, and
leading achievements of X-Engine.

4.4.1 PolarDB X-Engine

X-Engine is a low-cost, high-performance storage engine based on the LSM-tree
architecture. As shown in Fig. 4.8, X-Engine includes the following components:
1. Hot data tier: The hot data tier is a collection of data structures stored in memory,
including the active MemTable, immutable MemTables, caches, and indexes.
Newly inserted records are inserted into the active MemTable after related logs
are persisted to the disk. The active MemTable is implemented by using a skip
list, which provides high insertion performance. When an active MemTable is
full, it is switched to an immutable MemTable and no longer receives new data.
At the same time, a new active MemTable is created to store newly inserted
records. The immutable MemTable is gradually flushed to persistent storage
media, such as SSDs. In addition, the memory stores row caches and block
caches that respectively cache disk records and multiversioned data indexes
based on the LRU rule.
2. Cold data tier: The cold data tier is a multilevel structure stored on disks. Records
in immutable MemTables flushed from the memory are inserted into the first
level (L0) in the form of data blocks, also known as extents. When L0 is full,
some of the data blocks are moved out and compacted with the data blocks in L1
through an asynchronous compaction operation. Similarly, data blocks in L1 are
eventually compacted into those in L2.
3. Heterogeneous FPGA accelerator: X-Engine can offload the compaction opera-
tions in the cold data tier from the CPU to a dedicated heterogeneous FPGA
accelerator [2] to improve the efficiency of compaction operations, reduce inter-
ference with other computing tasks handled by the CPU, and achieve stable sys-
tem performance and higher average throughput.

Fig. 4.8 Architecture of X-Engine
X-Engine employs a series of innovative technologies to reduce storage costs
and ensure system performance. Table 4.1 describes the main technological innova-
tions and achievements of X-Engine. X-Engine is mainly optimized to achieve
higher transaction processing performance, reduce data storage costs, improve
query performance, and reduce overheads of backend asynchronous tasks. To
achieve these optimization goals, X-Engine is designed and developed through
in-depth software and hardware collaboration and combines the technical charac-
teristics of modern multicore CPUs, DRAM memory, and heterogeneous FPGA
processors. The specific design, applicable scope, and experimental results of the
technologies will be described in detail in the following sections.

Table 4.1 Technological innovations and achievements related to X-Engine
• High-performance transaction processing pipeline: computing and memory access are
decoupled for each transaction and are separately optimized in terms of concurrency
method and degree.
• Intelligent tiered storage: cold data and hot data are identified based on the LSM-tree
architecture in combination with statistical methods and machine learning methods and
are stored in different tiers.
• Multiversioned skiplist: the storage structure is optimized for hotspot records that have
multiple versions, thereby improving query performance.
• Efficient data structures: a compact data block format is used in combination with
matching multiversioned indexes and metadata.
• Lightweight compaction: data block reusing technologies and various compaction
strategies are adopted to reduce the compaction overheads.
• High-performance caching: row caches and block caches are optimized in terms of
format and access path, thereby improving the hotspot data access performance.
Achievements for the above technologies: published a paper at ACM SIGMOD 2019,
granted a national invention patent in China, and filed a patent application.
• Heterogeneous FPGA accelerator for compactions: compaction operations are offloaded
to the heterogeneous FPGA accelerator to achieve improved performance and stability
at lower costs. Achievements: published a paper at the USENIX Conference on File and
Storage Technologies 2020 (FAST 20) and granted a national invention patent in China.
• Machine learning-based cache prefetch: the data access status during compactions is
predicted by using machine learning models, and hot data is prefetched to the cache to
reduce the query latency. Achievement: published a paper at VLDB 2020.

4.4.2 High-Performance Transaction Processing

This section describes the transaction processing mechanism, performance optimi-
zation, and corresponding fault recovery mechanism of X-Engine. Transaction pro-
cessing is one of the most important tasks of OLTP database-oriented storage
engines. A transaction is a collection of SQL statements that must succeed or fail
together. As per common standards, transaction processing must guarantee the
ACID (atomicity, consistency, isolation, and durability) properties even in the case
of exceptions, such as database errors or power failure. The transaction processing
mechanism of X-Engine can achieve high transaction processing performance by
fully utilizing the hardware characteristics of multicore processors and memory
while guaranteeing the ACID properties.

4.4.2.1 Read Path and Write Path

CREATE, READ, UPDATE, and DELETE (CRUD) are the fundamental capabili-
ties required for transaction processing. Record modification operations, such as
CREATE, UPDATE, and DELETE, are performed along a write path, whereas
record query operations, such as READ, are performed along a read path.
1. Write path: Fig. 4.8 shows that to ensure the persistence of stored data in the
database in the case of DRAM power failure in X-Engine, all modifications to
database records must be recorded in the log and stored on persistent storage
media (such as SSDs) and then stored in the active MemTable in memory.
X-Engine adopts a two-phase mechanism to ensure that the modifications made
by a transaction to records conform to the ACID properties and are visible to and
can be queried by other transactions after the transaction is committed. In the
two-phase mechanism, transactions are completed in two phases: the read/write
phase and the commit phase. After the active MemTable is full, it is converted
into an immutable MemTable, which is then flushed to disk for persistence.
Multiversion active MemTable data structure: MVCC results in many ver-
sions of hotspot records in high-concurrency transaction processing scenarios.
Querying these versions incurs additional overheads. To solve this problem,
X-Engine is designed with a multiversion active MemTable data structure, as
shown in Fig. 4.9. In this structure, the upper layer (the blue part in the figure)
consists of a skiplist in which all records are sorted by primary key values. For a
hot record with multiple versions (such as the record with key = 300 in the figure),
X-Engine adds a dedicated single linked list (the green part in the figure) to store
all its versions, which are sorted by version number. Due to the temporal locality
of data access, the latest version (version 99) is most likely to be accessed by
queries and is therefore stored at the top, thereby reducing the linked list scan
overheads during querying for hotspot records.
Fig. 4.9 Multiversion active MemTable data structure in X-Engine
2. Read path: As shown in Fig. 4.10, a query operation in X-Engine queries data in
the following sequence: active MemTable/immutable MemTable, row cache,
block cache, and disk. As mentioned above, the multiversion skiplist structure
used in the MemTable can reduce the overhead of hotspot record queries. The
row cache and block cache can cache hot data records or record blocks in the
disk. The block cache stores the metadata of user tables, which includes the
bloom filters that can reduce disk accesses, as well as corresponding index blocks.
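The read sequence above can be sketched as follows. The component interfaces are illustrative placeholders, not X-Engine's actual API.

```cpp
// A minimal sketch of the lookup order on the read path; illustrative only.
#include <optional>
#include <string>

struct MemTables  { std::optional<std::string> Get(const std::string&) const { return std::nullopt; } };
struct RowCache   { std::optional<std::string> Get(const std::string&) const { return std::nullopt; } };
struct BlockCache { std::optional<std::string> Get(const std::string&) const { return std::nullopt; } };
struct DiskLevels { std::optional<std::string> Get(const std::string&) const { return std::nullopt; } };

std::optional<std::string> PointLookup(const std::string& key, const MemTables& memtables,
                                       const RowCache& row_cache, const BlockCache& block_cache,
                                       const DiskLevels& disk) {
  if (auto v = memtables.Get(key)) return v;    // active and immutable MemTables first
  if (auto v = row_cache.Get(key)) return v;    // cached hot rows
  if (auto v = block_cache.Get(key)) return v;  // cached data blocks, indexes, bloom filters
  return disk.Get(key);                         // finally the L0/L1/L2 extents on disk
}
```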

4.4.2.2 High-Performance Transaction Processing

As shown in Fig. 4.11, X-Engine is designed with a multistage parallel transaction
processing pipeline for processing each transaction. The transaction processing pro-
cess is divided into the read/write phase and the commit phase. All necessary query
and calculation operations are completed in the read/write phase, and then the
required modifications are temporarily stored in the transaction buffer.

Fig. 4.10 Cache query path in X-Engine

Fig. 4.11 Transaction processing pipeline in X-Engine

In the commit phase, multiple writer threads write the content in the transaction
buffer to the lock-free task queues. The consumer tasks in the task queues of the
multistage pipeline push the corresponding write task content through the following
stages: log buffering, log flushing, MemTable writing, and commit.
This two-stage design decouples the front-end and back-end threads. After com-
pleting the read/write phase of a transaction, the front-end threads can immediately
proceed to process the next transaction. The back-end writer threads access the
memory to complete the operations of writing data to disk and memory. The front
end and the backend exchange data through the transaction buffer to achieve paral-
lel execution on different data. This also improves the instruction cache hit rate of
each thread, ultimately improving the system throughput. In the commit phase, each
task queue is handled by one back-end thread, and the number of task queues is
limited by hardware conditions, such as the available I/O bandwidth in the system.
In the four-stage transaction pipeline, the granularity of parallelism for each
stage is optimized based on the characteristics of the stage. Log buffering (which
collects relevant logs for all write contents in a task queue) at the first stage and log
flushing at the second stage are sequentially executed by a single thread because
data dependencies exist between these two stages. MemTable writing at the third
stage is completed concurrently by multiple threads. At the fourth stage, the transac-
tion is committed, and related resources (such as the acquired locks and memory
space) are released to make all modifications visible. This stage is executed in paral-
lel by multiple threads. All writer threads obtain the required tasks from any stage
in active pull mode. This design allows X-Engine to allocate more threads to handle
memory accesses with high bandwidth and low latency while using fewer threads to
handle disk writes with relatively low bandwidth and high latency, thereby improv-
ing the utilization of hardware resources.
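The following sketch models the four stages as functions applied to a batch pulled from one task queue. It is a deliberate simplification with illustrative names, not X-Engine's implementation.

```cpp
// A minimal sketch of the four commit stages; illustrative only.
#include <string>
#include <vector>

struct WriteTask {
  std::string log_records;   // redo generated during the read/write phase
  std::string memtable_ops;  // key-value changes to apply to the active MemTable
};

void AppendToLogBuffer(const std::vector<WriteTask>&) {}  // stage 1: collect logs (single thread)
void FlushLogBuffer() {}                                  // stage 2: sequential disk I/O (single thread)
void ApplyToMemTable(const WriteTask&) {}                 // stage 3: memory writes (parallel)
void MakeVisibleAndRelease(const WriteTask&) {}           // stage 4: commit, release resources (parallel)

void ProcessBatch(std::vector<WriteTask>& batch) {
  AppendToLogBuffer(batch);                        // logs for all writes in the queue
  FlushLogBuffer();                                // WAL: durable before commit
  for (auto& t : batch) ApplyToMemTable(t);        // may be fanned out to several threads
  for (auto& t : batch) MakeVisibleAndRelease(t);  // publish the changes to other transactions
}
```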

4.4.3 Hardware-Facilitated Software Optimization

4.4.3.1 Background

The back-end compaction threads in X-Engine merge memory data and disk data.
When the amount of data at each level reaches the specified threshold, the data is
merged with the data of the next level. This operation is called compaction. Timely
compactions are essential for LSM-trees. Under continuous high-intensity write
pressure, an LSM-tree deforms as the data at L0 accumulates. This severely affects
read operations because these operations need to scan all layers of data and return
a combined result due to the existence of multiple data versions. Compaction of data
at L0 and data of multiple versions helps maintain a healthy read path length, which
is crucial for storage space release and system performance.

Figure 4.12 shows the compaction execution time under different value lengths.
When the value length is less than or equal to 64 bytes, the CPU time accounts for
over 50%. This phenomenon happens because the rapid improvement in the read/
write performance of storage devices in recent years caused CPUs to become the
performance bottleneck of traditional I/O intensive operations.

4.4.3.2 Offloading to the FPGA Accelerator

Although compactions are essential, they do not necessarily have to be handled by
CPUs. A compaction task involves reading, decoding, sorting, encoding, and writ-
ing back multiple data blocks. For the same compaction task, data dependencies
exist between consecutive stages but do not exist between multiple tasks.
Therefore, multiple tasks can be fully pipelined to improve the throughput. By
offloading compactions to an FPGA accelerator that is suitable for processing
pipeline tasks, the CPUs can be dedicated to handle complex transaction process-
ing to prevent back-­end tasks from encroaching on computational resources. This
way, the database system can always process transactions with peak
performance.
To adapt to the FPGA accelerator, a CPU-based compaction operation must be
first converted to a batch mode. The CPU splits a compaction operation into tasks
with specific sizes that can be executed in parallel. This fully leverages the parallel
ability of multiple compaction units (CUs) on the FPGA accelerator. X-Engine is
designed with a task queue to cache the compaction tasks that need to be executed.
The tasks delivered by the driver to the FPGA accelerator for execution and the
results of the execution are cached in the result queue and wait for the CPU to write
to persistent storage. Figure 4.13 shows how compactions are offloaded to the
FPGA accelerator.

Fig. 4.12 Execution time overheads of compactions



Fig. 4.13 Offloading of compactions to the FPGA accelerator

Fig. 4.14 Compaction scheduler

4.4.3.3 Compaction Scheduler

The compaction scheduler is responsible for creating compaction tasks, distributing
the tasks to the CUs on the FPGA accelerator for execution, and writing the compaction
results to the disk.
Three threads are designed to complete the entire process (Fig. 4.14):
• Compaction task creation thread: splits the data blocks that need to be compacted
by range to create compaction tasks. Each compaction task structure maintains
necessary metadata, including the task queue pointer, start address of input data
(where CRC check is performed to ensure data correctness when the data is
transferred to the FPGA accelerator), write address of compaction results, and
callback function pointer for subsequent logic. In addition, the compaction task
includes a return value that indicates whether the compaction task is successful.
Failed tasks will be reexecuted by calling the CPU. According to online opera-
tion data, only about 0.03% of compaction tasks will be reexecuted by the CPU
mainly because the involved key-value pairs are excessively long.
• Distribution thread: distributes the compaction tasks to CUs for execution. The
FPGA accelerator has multiple CUs. Therefore, corresponding distribution algo-
rithms must be designed. Currently, a simple round-robin distribution strategy is
adopted. All compaction tasks are similar in size. Therefore, the CUs have bal-
anced utilization, according to the experimental results.
• Drive thread: transfers data to the FPGA accelerator and instructs the CUs to start
working. When the CUs complete the tasks, the drive thread is instructed to
transfer the result back to the memory, and the compaction tasks are put into the
result queue.

4.4.3.4 CUs

Figure 4.15 shows the logical implementation of CUs on an FPGA. Multiple CUs
can be deployed on an FPGA accelerator, which are scheduled by the driver. A CU
consists of a decoder, a key-value ring buffer, a key-value transferrer, a key buffer, a
merger, an encoder, and a controller.

Fig. 4.15 Logical implementation of CUs on an FPGA
The key-value ring buffer consists of 32 8-KB slots. Each slot allocates 6 KB
for storing key-value pairs and the remaining 2 KB for storing metadata of the
key-­value pairs (such as the key-value pair length). Each key-value ring buffer has
three states, FLAG_EMPTY, FLAG_HALF_FULL, and FLAG_FULL, which,
respectively, indicate that the key-value ring buffer is empty, half full, and full.
Whether to carry forward the pipeline or to pause the decoding and wait for down-
stream consumption is determined based on the number of cached key-value
pairs. The key-­value transferrer and key-value buffer are responsible for key-value
pair transmission. Value comparison is not required in merge-sorting; only the
keys need to be cached. The merger is responsible for merging the keys in the key
buffer. In Fig. 4.15, Way 1 has the smallest key. The controller instructs the key-
value transferrer to transfer the corresponding key-value pair from the key-value
ring buffer to the key-value output buffer (which has a similar structure as the
key-value ring buffer) and moves the read pointer of the key-value ring buffer to
the next key-value pair. The controller then instructs the merger to perform the
next round of compaction. The encoder performs prefix encoding on the key-
value pairs output by the merger and writes the encoded key-value pairs in the
format of data blocks to the FPGA memory.
To control the processing speed at each stage, a controller module is introduced.
The controller module maintains the read and write pointers of the key-value ring
buffers, detects the difference in the processing speed between the upstream and
downstream of the pipeline based on the states of the key-value ring buffers, and
maintains efficient operation of the pipeline by pausing or restarting corresponding
modules.
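As a software analog of the merger, the following sketch performs the same pick-the-smallest-key loop in C++. The real CU implements this control flow in FPGA logic; the sketch omits version resolution and deduplication.

```cpp
// A minimal k-way merge over sorted runs; illustrative only.
#include <queue>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;  // (key, value); each way is sorted by key

std::vector<KV> MergeWays(const std::vector<std::vector<KV>>& ways) {
  using Cursor = std::pair<std::string, std::pair<size_t, size_t>>;  // key -> (way, index)
  auto cmp = [](const Cursor& a, const Cursor& b) { return a.first > b.first; };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(cmp)> heap(cmp);

  for (size_t w = 0; w < ways.size(); ++w)
    if (!ways[w].empty()) heap.push({ways[w][0].first, {w, 0}});

  std::vector<KV> out;
  while (!heap.empty()) {
    auto [key, pos] = heap.top();  // the way with the smallest current key wins
    heap.pop();
    auto [w, i] = pos;
    out.push_back(ways[w][i]);     // transfer the key-value pair to the output buffer
    if (i + 1 < ways[w].size()) heap.push({ways[w][i + 1].first, {w, i + 1}});
  }
  return out;
}
```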

4.4.4 Cost-Effective Tiered Storage

X-Engine adopts an optimized tiered storage structure for storing data in the cold
data layer, which significantly reduces the storage costs while ensuring the query
performance. This section describes the design and optimization of X-Engine in
terms of data flush, compactions, and data placement.

4.4.4.1 Flush Optimization

The flush operation in X-Engine converts the data in the immutable MemTable
in the memory into data blocks and stores the data blocks to disk for persistent
storage. Flush operations are crucial for the stability, performance, and space
efficiency of the storage engine. First, the flush operations move data out of the
memory, thereby freeing up memory space for new data or caches. If flush oper-
ations are not performed in a timely manner and new data continues to be writ-
ten to the memory, the memory usage keeps increasing until the system can no
longer accommodate new data, resulting in database unavailability risks.
Figure 4.16 demonstrates the flush operation and multiple compaction task
types in X-Engine.

Fig. 4.16 Flush operation and multiple compaction task types in X-Engine
Second, trade-offs must be made to achieve a balance between flush overheads
and query overheads. To ensure that data on the disk is always sorted by primary
key values and data blocks at the same level do not overlap in terms of primary
key ranges, it is crucial to ensure that a primary key value exists only in one data
block within any data range. This way, a point query needs to read at most one
data block at each level, minimizing the number of data blocks that need to be
read in range queries. However, to sort data by primary key values, each flush task
must merge the immutable MemTable data moved from memory with the data
blocks on disk whose primary key range overlaps with that of the immutable
MemTable data. This process consumes a significant amount of CPU and I/O
resources. In addition, repeated compactions result in I/O write amplification,
exacerbating I/O consumption. This results in high flush overheads, long process-
ing time, and excessive resource consumption, thus affecting the stability of data-
base performance. If lower requirements are imposed on data sorting by primary
key values, the flush overheads will be reduced, but the query overheads will
increase. Therefore, X-Engine is optimized to achieve a balance between flush
overheads and query overheads.
Figure 4.16 shows that after converting the data in the immutable MemTable into
data blocks, the flush operation in X-Engine directly appends the data blocks to L0
on disk without merging the data blocks with other data at L0. This significantly
reduces flush overheads. However, this causes overlapping primary key ranges of
the data blocks at L0. As a result, a record within a primary key range may exist in
multiple data blocks, which increases query overheads. To mitigate the impact on
the query performance, X-Engine controls the total data size of L0 within an
extremely small range (about 1% of the total data size on disk). Primary keys of
common transactional data, such as order numbers, serial numbers, and timestamps,
in OLTP databases usually increase monotonically. If no update operations exist in
the load, primary keys of newly inserted data do not overlap with those of existing
data. In this case, the flush design of X-Engine does not increase query overheads.

4.4.4.2 Lightweight Asynchronous Compaction

A compaction operation in X-Engine involves merging data blocks based on primary
key ranges within the same level or between adjacent levels in the tiered storage struc-
ture. X-Engine performs these operations asynchronously in the backend. However, a
compaction operation needs to read input data blocks from the disk, perform compac-
tion calculations, and write the results back to the target level on the disk. This process
consumes a significant amount of CPU and disk I/O resources and results in a write
amplification problem. This problem occurs in a tiered storage structure in which key
value ranges at different levels may overlap and a newly inserted record is repeatedly
read from the disk and involved in compactions, resulting in multiple I/O operations for
the same record. Additionally, compaction operations consume a considerable amount of
CPU and I/O resources. For low-specification database instances on the cloud, compac-
tion operations significantly reduce the system resources available for processing user
queries and transactions, leading to database performance degradation.
As shown in Fig. 4.12, the compaction of short-length records (i.e., records
whose lengths are less than 64 bytes) results in high execution overheads and is
bottlenecked by CPU performance. In comparison, the compaction of records with
longer lengths requires reading and writing back more data and is bottlenecked by
disk I/O performance. This phenomenon is not in line with the traditional notion
that compactions are I/O-intensive operations and offers valuable insights into the
optimization of X-Engine. X-Engine employs techniques such as data block reuse,
streaming compaction, and asynchronous I/O to mitigate write amplification and
reduce compaction overheads from the algorithm and implementation aspects.

4.4.4.3 Data Block Reuse

X-Engine logically reduces the number of compactions by reusing extents or data
blocks in extents. Figure 4.17 demonstrates that during the compaction of data
blocks at L1 and L2, some data blocks are reused, and the chances for reuse are
increased through splitting.
• An overlap exists between data block [1,35] at L1 and data block [1,30] at L2,
but the second data block [32,35] does not overlap with data block [1,30].
Therefore, only the overlapping data blocks need to be merged. Data block [32,35] can be
copied to the new data block to reduce the computation workload during the
compaction.
• Data block [210,280] at L1 and data block [50,70] at L2 do not have correspond-
ing data blocks with which they overlap in terms of primary key range and thus
can be directly reused without being merged or moved. For these two data blocks,
only the indexes need to be updated.
• For data block [80,200] at L1, the second data block [106,200] is relatively
sparse and has no records in the range [135,180]. Therefore, data block
[106,200] can be split into data block [106,135] and data block [180,200]. This
way, data block [150,170] at L2 can be directly reused without compaction.
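The reuse decision can be sketched as a key-range overlap test, as shown below. The types are illustrative, and keys are compared lexicographically for simplicity.

```cpp
// A minimal sketch of deciding whether an extent can be reused without merging.
#include <string>
#include <vector>

struct Extent {
  std::string min_key;
  std::string max_key;
};

bool Overlaps(const Extent& a, const Extent& b) {
  return !(a.max_key < b.min_key || b.max_key < a.min_key);
}

// Returns true if `candidate` from one level overlaps no extent selected from the
// other level, in which case it can be linked into the output by updating indexes only.
bool CanReuse(const Extent& candidate, const std::vector<Extent>& other_level) {
  for (const auto& e : other_level)
    if (Overlaps(candidate, e)) return false;
  return true;
}
```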

Fig. 4.17 Data block reuse in compaction operations in X-Engine

4.4.4.4 Compaction Task Optimization

To reduce resource consumption for compaction operations, a compaction task is
divided into multiple subtasks with isolated data that are concurrently executed. The
I/O operations of one subtask are overlapped with the compaction computation opera-
tions of another subtask. This reduces the overheads of compaction computation,
thereby shortening the execution time of the compaction task. Moreover, asynchro-
nous I/O is used to reduce the overheads of data reads and writes.

4.4.4.5 Data Placement Optimization

The location of a target record for a query in the tiered storage structure directly
affects the overhead of the query.
Given the storage separation of hot and cold data and the storage of the latter in
a tiered storage structure, a data record can reside in different memory segments
(such as the active MemTable, immutable MemTable, and cache) and at different
levels on disk. The query overhead varies based on the location of the target data
record. For instance, hot data is placed at L0 and L1 on a disk. This shortens the
query paths for such data, mitigating read amplification during access to L2 and
reducing query latency. This section covers the design and optimization of X-Engine
in terms of data placement.

4.4.4.6 Compaction Strategy Design

X-Engine is designed with a range of compaction types to cater to diverse needs.
Table 4.2 describes the various types of compaction tasks in X-Engine, as well
as their features and triggering methods.

Table 4.2 Compaction task types in X-Engine
• Self-compaction at L0: merges and organizes data blocks that are produced through
flushing and have an overlapping primary key range. Trigger: the number of data
blocks with overlapping primary key ranges reaches the specified threshold.
• Compaction from L0 to L1: merges some data from L0 with corresponding data at L1.
Trigger: the number of data blocks at L0 reaches the specified threshold.
• Compaction from L1 to L2: merges some data from L1 with corresponding data at L2.
Trigger: the number of data blocks at L1 reaches the specified threshold.
• Deleted data clearing: clears logically deleted data. Trigger: the number of deletion
flags in the MemTable reaches the specified threshold.
• Fragment organization: organizes disk fragments within L1 or L2. Trigger: the
fragmentation ratio of the corresponding level reaches the specified threshold.
• Manual compaction: manually triggers compaction tasks for maintenance purposes.
Trigger: the DBA manually issues related commands.

Self-compactions at L0, compactions from L0 to L1, and compactions from L1 to L2
are triggered when the number of data blocks at the corresponding level reaches the
specified threshold. Such
compaction operations are intended to limit the data volume at each level of the
tiered storage structure within an expected range, thereby reducing read ampli-
fication during queries, write amplification during compactions, and space
amplification caused by inter-level primary key range overlap. A delete-trig-
gered compaction is triggered when the number of deletion flags in the MemTable
reaches the specified threshold. X-Engine processes all write operations in
append-only mode, and delete operations are implemented by inserting deletion
flags for target records in MemTables. When the number of deletion flags
reaches the specified threshold, a considerable amount of logically deleted data
is present; the data must be cleared. In this case, X-Engine triggers a compac-
tion task tailored to clear such records. A fragment compaction is triggered
based on the space fragmentation status at each level. Fragments may be caused
by data block reuse, disk space allocation, or data deletion. An excessive num-
ber of fragments result in increased query overheads and reduced space effi-
ciency. Therefore, the fragments must be cleared in a timely manner. Manual
compactions provide a necessary database maintenance means for database
administrators. Administrators can execute specific instructions to trigger cor-
responding compactions based on the current status of the storage engine and
database requirements. With these compaction strategies, X-Engine can sched-
ule asynchronous compaction tasks to achieve a balance among data write per-
formance, query performance, and storage overheads. X-Engine also provides
related parameters to achieve specific optimization goals (e.g., maximizing
query performance or minimizing space usage), thus ensuring that the perfor-
mance and storage overheads of the storage engine remain within the
expected ranges.
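The following sketch illustrates threshold-based scheduling over the compaction types in Table 4.2. The statistics and threshold values are illustrative placeholders rather than X-Engine's actual parameters.

```cpp
// A minimal sketch of picking an asynchronous compaction to schedule; illustrative only.
#include <cstddef>
#include <optional>
#include <string>

struct LevelStats {
  std::size_t l0_extents = 0;
  std::size_t l1_extents = 0;
  std::size_t delete_marks = 0;
  double l1_fragmentation = 0.0;  // fraction of wasted space at L1
};

struct Thresholds {
  std::size_t max_l0_extents = 64;
  std::size_t max_l1_extents = 1000;
  std::size_t max_delete_marks = 10000;
  double max_fragmentation = 0.3;
};

std::optional<std::string> PickCompaction(const LevelStats& s, const Thresholds& t) {
  if (s.l0_extents > t.max_l0_extents) return "compaction from L0 to L1";
  if (s.l1_extents > t.max_l1_extents) return "compaction from L1 to L2";
  if (s.delete_marks > t.max_delete_marks) return "deleted data clearing";
  if (s.l1_fragmentation > t.max_fragmentation) return "fragment organization";
  return std::nullopt;  // nothing to schedule; manual compactions come from the DBA
}
```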

4.4.4.7 Intelligent Separation of Hot and Cold Data

X-Engine accurately separates hot data and cold data by analyzing the access char-
acteristics of workloads and implements automatic archiving of cold data by using
a hybrid storage architecture to provide users with an optimal price-performance
ratio. OLTP businesses are sensitive to access latency. Therefore, cloud service pro-
viders typically use local SSDs or enhanced SSDs (ESSDs) as storage media. In
practice, most data generated by flow-type businesses, such as transaction logistics
and instant messaging, is accessed less frequently over time or may never be
accessed again. If such data is also stored in high-speed storage media such as Non-­
Volatile Memory Express (NVMe) and SSDs like hot data, the overall price-­
performance ratio can be significantly reduced.
For businesses that support separation of cold data and hot data, X-Engine sup-
ports automatic archiving of cold data by analyzing log information. It is the first
storage engine in the industry that supports automatic archiving of row-level data
[3]. The hybrid storage edition of X-Engine supports multiple types of hybrid stor-
age media, as shown in Fig. 4.18. ESSDs or local SSDs are recommended for L0
and L1 to ensure the access performance of hot data. High-efficiency cloud disks or
local HDDs are recommended for L2. Archiving of cold data significantly reduces
the storage costs.

Fig. 4.18 Intelligent separation of hot data and cold data
X-Engine employs a unique method for predicting data archiving that differs
from traditional cache replacement policies like LRU. X-Engine uses a longer time
window and takes into account a wider range of features to predict when data should
be archived:
• X-Engine aggregates access frequency over a specific time window, providing
insights into changes in data popularity.
• By analyzing semantic information from SQL logs, X-Engine can accurately
predict the lifecycle of a data record. For instance, in e-commerce transactions,
patterns of order table accesses reflect user shopping behavior. For virtual orders,
such as top-up orders, a record may no longer be accessed after it is created.
However, orders for physical products may involve logistics tracking, delivery,
delivery signature, or even after-sales services such as returns, resulting in a
complex distribution of the data lifecycle. Moreover, rules for shipment and
package receipt may be adjusted during major promotion events, such as Double
11 and Double 12. This may cause changes to the data lifecycle for the same
load, making it hard to distinguish between hot and cold data based on simple
rules. However, for the same business, the lifecycle of a record in the database
can be learned based on the updates and reads of the record. Therefore, fields
accessed by SQL statements may be encoded to accurately depict the lifecycle of
a record.
• Additionally, timestamp-related features, such as insertion time and last update
time, provide insights into the data lifecycle.
X-Engine uses different feature combinations for different businesses by lever-
aging machine learning technologies, ultimately achieving a cold data recall and
precision of over 90% for these businesses. Cold data compactions are triggered
during off-peak hours to compact predicted cold data to cold levels on a daily basis
and subsequently minimize the impact of cold data migration on normal businesses.
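The following sketch illustrates how such features might be assembled for a per-record prediction. The feature set and the stand-in rule are illustrative assumptions, not X-Engine's actual model.

```cpp
// A minimal sketch of per-record hot/cold features; illustrative only.
#include <cstdint>
#include <vector>

struct RecordStats {
  std::uint64_t reads_in_window = 0;       // access frequency aggregated over a time window
  std::uint64_t writes_in_window = 0;
  std::uint64_t seconds_since_insert = 0;  // timestamp-related lifecycle features
  std::uint64_t seconds_since_update = 0;
};

std::vector<double> BuildFeatures(const RecordStats& s) {
  return {static_cast<double>(s.reads_in_window),
          static_cast<double>(s.writes_in_window),
          static_cast<double>(s.seconds_since_insert),
          static_cast<double>(s.seconds_since_update)};
}

// A trained model would consume these features; a trivial rule stands in for it here.
bool PredictCold(const RecordStats& s) {
  return s.reads_in_window == 0 && s.seconds_since_update > 30 * 24 * 3600;
}
```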

4.4.5 Dual Storage Engine Technology

PolarDB supports dual engines online, with InnoDB handling the hybrid read/write
requirements of online transactions and X-Engine handling requests to read/write
less frequently accessed archived data. Figure 4.19 shows the dual-engine architec-
ture of PolarDB for MySQL.

Fig. 4.19 Dual-engine architecture of PolarDB for MySQL
The first version of PolarDB was designed based on InnoDB. This version imple-
mented physical replication by using InnoDB and supported the one-writer, multi-
reader architecture, which was technologically challenging. Nonetheless, it is more
challenging to integrate X-Engine into PolarDB to support the one-writer, multi-
reader architecture based on dual engines because X-Engine is a complete, indepen-
dent transaction engine with its own redo log, disk data management, cache
management, and MVCC modules. Through remarkable innovative efforts, the
PolarDB team ushered PolarDB into the dual-engine era by introducing the follow-
ing engineering advances:
• The WAL stream of X-Engine is integrated with the redo log stream of InnoDB
without modifying the control logic and the interaction logic of shared storage.
This way, one log stream and one transmission path are sufficient for X-Engine
and InnoDB. In addition, this architecture can be reused when other engines are
introduced.
• The I/O module of X-Engine is interconnected with the user-mode file process-
ing system (FPS) of InnoDB. This allows InnoDB and X-Engine to share the
same distributed block device and implement fast backup based on the underly-
ing distributed storage.
• X-Engine implements physical replication based on WAL and provides the WAL
replay mechanism. This ensures millisecond-level replication latency between read/
write nodes and read-only nodes and supports consistent reads on read-only nodes.
The introduction of X-Engine into PolarDB involves considerable engineering
modifications, such as the modification of X-Engine to support the one-writer, mul-
tireader architecture and the rectification of issues related to DDL operations on
large tables in history databases. In addition to online DDL, X-Engine also supports
parallel DDL to accelerate DDL operations that involve table replication. The dual-­
engine architecture of PolarDB implements the one-writer, multi-reader architec-
ture based on two engines with one set of code. This ensures the simplicity of the
product architecture and provides consistent user experience.

4.4.6 Experimental Evaluation

4.4.6.1 Comparison of Space Efficiency

This section compares the space efficiency of X-Engine with the following related
products in the cloud computing market: InnoDB, which is the default storage
engine for MySQL databases, and TokuDB, which is a storage engine product with
high space efficiency that is used by many space-sensitive customers on pub-
lic clouds.

4.4.6.2 Comparison with InnoDB

Figure 4.20 shows the disk space usage of InnoDB and X-Engine. Both storage
engines are tested with the default settings and the default table structure of the
Sysbench benchmark. Each table contains ten million records, and the total number
of tables gradually increases from 32 to 736. The test result shows that as the amount
of data increases, the space occupied by X-Engine grows more slowly than that
occupied by InnoDB, so the space savings become larger. At most, the space occupied
by X-Engine is only 58% of that occupied by InnoDB. For scenarios with longer
single-record lengths, X-Engine is even more efficient in terms of storage space
usage. For example, after a Taobao image space database is migrated from InnoDB
to X-Engine, the required storage space is only 14% of that required in InnoDB.
Data compression is not enabled for InnoDB in most business scenarios. If com-
pression is enabled for InnoDB, the storage space required is 67% of that before
compression. Moreover, the query performance sharply deteriorates, seriously
affecting the business. Taking primary key updates as an example, the throughput is
only 10% of that before compression. Compared with InnoDB, whose performance
seriously deteriorates after compression is enabled, X-Engine is a high-performance
and cost-effective storage engine with an excellent balance between space compres-
sion and performance.
Unlike X-Engine, InnoDB does not have a tiered storage structure and uses a
single storage mode to store all table data in the database. In this mode, data is
stored in the form of pages in a B+ tree structure. Moreover, data is not stored by using
different modes based on the locality characteristic and frequency of data access,
and data in a user table (e.g., seldom accessed cold data) cannot be selectively com-
pressed in depth. In addition, X-Engine performs prefix encoding for data blocks,
which logically reduces the amount of data to be stored and implements data storage
in a compact format. This reduces space fragments and improves the compression
ratio. Therefore, X-Engine has higher space efficiency than InnoDB.

Fig. 4.20 Comparison of disk space usage between InnoDB and X-Engine

4.4.6.3 Comparison with TokuDB

Figure 4.21 shows the disk space usage of TokuDB and X-Engine. TokuDB used to
provide storage services at low overheads, but its developer, Percona, discontinued
its maintenance. The results revealed that X-Engine has lower storage overheads
than TokuDB. Therefore, Alibaba Cloud recommends that you migrate your data in
TokuDB-based databases to X-Engine-based databases.
TokuDB uses the fractal tree structure, whose leaf nodes and corresponding data
blocks are filled with more data than those of the B+ tree structure used by the
InnoDB engine. The former can therefore achieve a higher compression ratio than
the latter. However, TokuDB lacks the tiered storage design of X-Engine, and
X-Engine also fills its data blocks densely with records. Hence, combined with other
space optimizations, X-Engine can achieve lower storage overheads than TokuDB.

4.4.6.4 Comparison of Performance

X-Engine can reduce the storage space occupied by cold data without compromis-
ing the hot data query performance, consequently reducing the total storage costs,
specifically the following:
• X-Engine uses a tiered storage structure to store hot and cold data at different
levels. By default, the level where cold data is stored is compressed.
• Techniques such as prefix encoding are implemented for each record to reduce
storage overheads.
• Tiered data access is implemented based on the omnipresent locality characteris-
tic and data access skewness (where the volume of hot data is usually far less
than that of cold data) in actual business scenarios.
Figure 4.22 shows the performance comparison between X-Engine and InnoDB
in terms of processing point queries on skewed data. The test used the Zipf distribu-
tion to control the degree of data access skewness. When the skewness (Zipf factor)
is high, more point queries hit the hot data in the cache rather than the cold data on
the disk, resulting in lower access latency and higher overall query performance. In
this case, compressing cold data has minimal impact on the query performance.

Fig. 4.21 Comparison of disk space usage between TokuDB and X-Engine

Fig. 4.22 Performance comparison between X-Engine and InnoDB in terms of processing point
queries on skewed data

Fig. 4.23 Performance comparison between InnoDB and X-Engine in various scenarios
In summary, the tiered storage structure and tiered access mode of X-Engine
enable most SQL queries on hot data to ignore cold data. As a result, the achieved
QPS is 2.7 times higher than that obtained when all data is accessed uniformly.
If a large amount of inventory data (especially archived data and historical data)
is stored in X-Engine, X-Engine demonstrates slightly inferior performance (QPS
or TPS) to InnoDB when querying inventory data. Figure 4.23 shows the perfor-
mance comparison between InnoDB and X-Engine in various scenarios. The com-
parison reveals that X-Engine and InnoDB have almost the same performance.
In most OLTP workloads, updates and point queries are frequently executed.
X-Engine and InnoDB are basically on par with each other in these two aspects.

Given its tiered storage structure, X-Engine needs to scan or access multiple
levels when performing a range scan or checking whether a record is unique.
Therefore, X-Engine has a slightly inferior performance compared with InnoDB in
terms of performing range queries and inserting new records.
In hybrid scenarios, X-Engine and InnoDB exhibit almost the same
performance.

References

1. Huang G, Cheng XT, Wang JY, et al. X-Engine: an optimized storage engine for large-scale
e-commerce transaction processing. In: Proceedings of the 2019 International Conference on
Management of Data (SIGMOD’19). New York: Association for Computing Machinery; 2019.
p. 651–65. https://doi.org/10.1145/3299869.3314041.
2. Zhang T, Wang JY, Cheng XT, et al. FPGA-accelerated compactions for LSM based key-value
store. In: 18th USENIX Conference on File and Storage Technologies (FAST20); 2020.
3. Yang L, Wu H, Zhang TY, et al. Leaper: a learned prefetcher for cache invalidation in LSM-tree
based storage engines. Proc VLDB Endow. 2020;13(11):1976–89.
Chapter 5
High-Availability Shared Storage System

High availability is one of the factors that must be considered in the design of dis-
tributed systems. This chapter introduces consensus algorithms for distributed sys-
tems and compares the methods used by MySQL and PolarDB to achieve high
availability. This chapter also discusses the implementation of shared storage archi-
tectures like Aurora and PolarFS and presents some of the ongoing optimization
work concerning the file system in PolarDB.

5.1 Basics of High Availability

In a distributed system, multiple nodes communicate and coordinate with each other
through message passing, which inevitably involves issues such as node failures,
communication abnormalities, and network partitions. Consensus protocols ensure
that in a distributed system in which these exceptions may occur, multiple nodes
reach an agreement on a specific value.
In the field of distributed systems, the CAP (consistency, availability, and parti-
tion tolerance) theorem states that any network-based data sharing system can sat-
isfy at most two of the following three characteristics: consistency, availability, and
partition tolerance.
Network partitioning inevitably occurs in a distributed system, thereby necessi-
tating the satisfaction of the partition tolerance characteristic. Therefore, trade-offs
must be made between consistency and availability. In practice, an asynchronous
multireplica replication approach is often used to ensure system availability. This
compromises strong consistency in exchange for enhanced system availability.
From the perspective of the client, if all replicas reach a consistent state immedi-
ately after an update operation is completed and subsequent read operations can
immediately read the most recently updated data, strong consistency is
implemented. If the system does not guarantee that subsequent read operations can
immediately read the most recently updated data after an update operation is com-
pleted, weak consistency is implemented. If no new update operations are per-
formed subsequently, the system guarantees that the most recently updated data can
be read after a specific period of time. This means that eventual consistency is
implemented, which is a special case of weak consistency. Compromised strong
consistency does not mean that consistency is not guaranteed. It means less strict
requirements are imposed in terms of consistency, and an “inconsistency window”
is allowed. This way, consistency within a time range acceptable to the user can be
achieved, thereby ensuring eventual consistency. The size of this inconsistency win-
dow depends on the time it takes for multiple replicas to reach a consistent state.

5.1.1 Leader and Follower Replicas

Compared with single-node systems, distributed systems are more unstable and often
experience node or link failures, resulting in one or several nodes being in a failed state
and unable to provide normal services. This requires distributed systems to have a robust
fault tolerance mechanism to continue responding to client requests when such prob-
lems occur, ideally without users noticing any failures. Service-­level high availability
does not require all nodes to be available after a failure, but rather that the system can
automatically coordinate the remaining functioning nodes to ensure service continuity.
In the database field, RTO and RPO are often used to measure system high avail-
ability. RTO refers to the time required for the system to restore normal services after
a disaster occurs. For example, if the system needs to restore normal operations within
1 h after a disaster, the RTO is 1 h. If the RTO is zero, the system can recover instantly
after a disaster and has strong disaster recovery capabilities. Otherwise, the system
may remain in a failed state for a long time or even indefinitely. RPO refers to the
amount of data that the system can tolerate losing when a disaster occurs. If the RPO
is 0, no data is lost. Figure 5.1 illustrates the RPO and RTO.

Fig. 5.1 RPO and RTO

In a distributed storage system, the replication technology is often used to store
multiple replicas. When a node or link failure occurs, the system can automatically
switch services to other replicas to ensure service continuity. In leader-follower
replication, the data has multiple replicas, one of which is the leader and the others
are followers.
Leader-follower is categorized into synchronous leader-follower replication and
asynchronous leader-follower replication. Taking a simple asynchronous leader-­
follower replication process as an example, the client sends a transaction request to
the server, and the leader replica modifies the local information, informs the client
that the transaction request is completed, and synchronizes the log to the follower
replicas. Then, the follower replicas modify their local information according to the
log and inform the leader replica that the synchronization is successful.
The difference between synchronous leader-follower replication and asynchro-
nous leader-follower replication is that in the former, the transaction can be commit-
ted only after the leader replica receives responses from all follower replicas, as
shown in Fig. 5.2. According to the principles of asynchronous leader-follower rep-
lication, an RPO of 0 cannot be guaranteed, but faster responses and better system
performance can be achieved at the cost of data losses and compromised data con-
sistency. Synchronous leader-follower replication can guarantee an RPO of 0 and
reduce the risk of data losses but may result in slow responses because (1) the leader
replica must wait for responses from all follower replicas and (2) the response time
is determined by the slowest follower replica.
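To make the contrast concrete, the following Python sketch models the two commit paths with toy in-memory replicas. The class and function names are illustrative only and are not taken from any real replication implementation.

# Toy model of asynchronous versus synchronous leader-follower replication.

class Replica:
    def __init__(self):
        self.log = []

    def append_log(self, entry):
        self.log.append(entry)   # a real replica would persist the entry to disk
        return True              # acknowledgment to the leader

def commit_async(leader, followers, entry):
    leader.append_log(entry)
    # Reply to the client first; followers are updated afterwards, so a leader
    # failure at this point can lose acknowledged data (RPO > 0).
    result = "committed"
    for f in followers:
        f.append_log(entry)
    return result

def commit_sync(leader, followers, entry):
    leader.append_log(entry)
    # Reply only after every follower acknowledges: RPO = 0, but the latency is
    # bounded by the slowest follower.
    acks = [f.append_log(entry) for f in followers]
    return "committed" if all(acks) else "failed"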

5.1.2 Quorum

According to the CAP theorem, ensuring consistency between replicas is a critical
issue in distributed systems. Network partitions cannot be avoided because node
and network failures are inevitable, so partition tolerance must be satisfied.
Therefore, trade-offs are often made between consistency and availability during
the design of a distributed system to better meet business needs. Introduction of the
multireplica scheme improves availability but also brings consistency issues when
data is modified across replicas.

Fig. 5.2 Synchronous leader-follower replication



Write-all, read one (WARO) is a replica control protocol with simple principles.
As the name suggests, it requires all replicas to be updated successfully during an
update; data can be read from any replica during data query. WARO ensures consis-
tency among all replicas but also creates new problems. Although it enhances read
availability, it leads to system load imbalance and significant update latency.
Moreover, an update must be implemented for all replicas. As a result, the update of
the system fails if an exception occurs on one node.
Quorum is a consistency protocol proposed by Gifford in 1979. Based on the
pigeonhole principle, high availability and eventual consistency can be guaranteed
through trade-offs between consistency and availability.
According to the Quorum mechanism, in a system with N replicas, an update
operation is considered successful only when it is successfully executed on W rep-
licas, and a read operation is considered successful when at least R replicas are read.
To ensure that the most recently updated data can be read every time, the Quorum
protocol requires W + R > N and W > N/2. To be specific, the set of written replicas
and the set of read replicas must have an intersection, and the written replicas must
account for more than half of the total replicas. When W = N and R = 1, Quorum is
equivalent to WARO. Therefore, WARO can be seen as a special case of Quorum,
and Quorum can balance reads and updates on the basis of WARO. For update
operations, Quorum can tolerate exceptions on N–W replicas. For read operations, it
can tolerate exceptions on N–R replicas. Update and read operations cannot be
performed simultaneously on the same data.
Nonetheless, the Quorum mechanism cannot guarantee strong consistency. After
an update operation is completed, the replicas cannot immediately achieve a consis-
tent state, and subsequent read operations cannot immediately read the most recently
updated commit. The Quorum mechanism can only guarantee that the most recently
updated data is read each time but cannot determine whether the data has been com-
mitted. Therefore, if the latest version of data appears less than W times after R
replicas are read, the system proceeds to read other replicas until the latest version
of data appears W times. At this point, it can be considered that the latest version of
data is successfully committed. If the number of occurrences of this version is still
less than W after the other replicas are read, the second latest version in R is consid-
ered the most recently committed data.
In a distributed storage system, the values of N, W, and R can be adjusted based
on different business requirements. For example, for systems with frequent read
requests, W = N and R = 1. This ensures that the result can be quickly obtained by
reading just one replica. For systems requiring fast writes, R = N and W = 1. This
achieves better write performance at the cost of consistency.
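The constraints above can be captured in a few lines of code. The following Python sketch, with function names of our own choosing, checks whether an (N, W, R) configuration satisfies the Quorum requirements and how many replica failures reads and writes can tolerate.

# Quorum configuration check based on the constraints W + R > N and W > N / 2.

def quorum_config_ok(n, w, r):
    return (w + r > n) and (2 * w > n)

def tolerated_failures(n, w, r):
    # A write needs W healthy replicas, so it tolerates N - W failures;
    # a read needs R healthy replicas, so it tolerates N - R failures.
    return {"write": n - w, "read": n - r}

print(quorum_config_ok(3, 2, 2))    # True: a classic majority quorum
print(quorum_config_ok(3, 1, 1))    # False: read and write sets may not intersect
print(quorum_config_ok(5, 5, 1))    # True: W = N, R = 1 degenerates to WARO
print(tolerated_failures(5, 3, 3))  # {'write': 2, 'read': 2}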

5.1.3 Paxos

Paxos [1] is a consensus algorithm proposed by Leslie Lamport in 1990. It is based
on message passing and has a high degree of fault tolerance. It facilitates quick and
correct consensus on a specific value in a distributed system that may experience
node failures or network anomalies without considering the Byzantine Generals
Problem, thereby ensuring consistency between nodes in the system. It should be
noted that in actual systems, this value may not necessarily be a number and can be
a log or command on which consensus must be reached.

Fig. 5.3 Paxos execution process
In Paxos, nodes are divided into three roles: Proposer, Acceptor, and Learner.
The Proposer proposes a proposal, which includes a proposal number and a pro-
posed value. The Acceptor can accept the proposal upon receiving it. If the proposal
is accepted by a majority of Acceptors, the proposal is considered approved (cho-
sen). The Learner can only “learn” approved proposals. In Paxos, each node can
simultaneously assume multiple roles.
Paxos has two properties: Safety and Liveness. Safety requires that (1) only one
value is approved and (2) a node can only learn one approved value to ensure system
consistency. Liveness requires Paxos to eventually approve a proposed value, pro-
vided that a majority of nodes are alive and can normally communicate with each
other. After a value is approved, other nodes will eventually learn this value.
The main idea of Paxos is that the Proposer needs to know the proposal most
recently accepted by the majority of Acceptors before making a proposal, deter-
mines the proposed value, and then initiates a vote. Once a majority of Acceptors
accept the proposal, the proposal is considered approved. Then, the Learner is
informed of the information about the proposal. Paxos has two phases: Prepare
phase and Accept phase. Figure 5.3 shows the Paxos execution process.

5.1.3.1 Prepare Phase

A Proposer chooses a new proposal number n and broadcasts a Prepare request
containing n to all Acceptors. The request does not include the proposed value. Note
that n must be unique and greater than any other value used or observed by the
Proposer.

After receiving the request, an Acceptor compares n with the highest proposal
number that it has seen. If the Acceptor has not replied to a request whose proposal
number is greater than or equal to n in this round of the Paxos process, it returns the
previously accepted proposal number and proposed value (if any) and promises not
to accept a proposal whose number is less than n.

5.1.3.2 Accept Phase

When the Proposer receives ACKs of the Prepare request from a majority of
Acceptors, the Proposer chooses the value of the highest-numbered proposal
accepted in the ACKs and uses it as the proposed value for the current round. If no
accepted proposal is reported in the ACKs, the Proposer is free to choose its own
proposed value. Then, the Proposer broadcasts an Accept request carrying the
proposal number n and the proposed value to all Acceptors.
An Acceptor checks the proposal number upon receiving the proposal. If the
promise, made in the Prepare phase, to never accept proposals whose numbers are
less than n is not violated, the Acceptor accepts the proposal and returns the pro-
posal number. Otherwise, the Acceptor rejects the proposal and requests the
Proposer to return to Step 1 and reexecute the Paxos process.
After the Acceptor accepts the proposal, it sends the proposal to all Learners.
After confirming that the proposal has been accepted by a majority of Acceptors, the
Learners determine that the proposal is approved. Then, the Paxos round ends. A
Learner can also broadcast the approved proposal to other Learners.
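The promise and acceptance rules described above can be condensed into the following Python sketch of an Acceptor. It is a simplified illustration of single-decree Paxos rather than the implementation of any particular system; message transport, persistence, and Learner notification are omitted.

# Simplified single-decree Paxos Acceptor: handles Prepare and Accept requests.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised so far
        self.accepted_n = -1        # number of the last accepted proposal
        self.accepted_value = None  # value of the last accepted proposal

    def on_prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            # Reply with the previously accepted proposal (if any) so that the
            # Proposer can adopt its value in the Accept phase.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, value):
        # Accept only if it does not violate the promise made in the Prepare phase.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("rejected", self.promised_n)
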
Paxos is used to enable multiple replicas to reach consensus on a specific value.
For example, Paxos can be used in primary node reelection when the primary node
is faulty or in log synchronization among multiple nodes. Although Paxos is theo-
retically feasible, it is difficult to understand and lacks pseudocode-level implemen-
tation. The huge gap between algorithm description and system implementation
results in the final system being built on an unproven protocol. There are few imple-
mentations similar to Paxos in actual systems.
A typical actual scenario is to reach consensus on a series of consecutive values.
A direct approach is to execute the Paxos process for each value. However, each
round of the Paxos process requires two RPCs, which is costly. In addition, two
Proposers may propose incrementally numbered proposals, leading to potential
livelocks. To resolve this issue, the Multi-Paxos algorithm is developed to introduce
a leader role. Only the leader can make a proposal, which eliminates most Prepare
requests and ensures that each node eventually has complete and consistent data.
Taking log replication as an example, the leader can initiate a Prepare request
that contains the entire log rather than just a value in the log. The leader then initi-
ates an Accept request to confirm multiple values, thereby reducing RPCs by half.
In the Prepare phase, the proposal number is used to block old proposals for the
entire log rather than for individual log entries. One leader election method is as
follows: Each node has
an ID, and the node with the greatest ID value is the leader by default. Each node
sends a heartbeat at an interval T. If a node receives no heartbeat information from
any node with a greater ID value within 2T, the node becomes the leader. To ensure
that all nodes have the complete latest log, Multi-Paxos is specifically designed
based on the following aspects:
• The system continuously sends Accept RPCs in the background to ensure that
responses are received from all Acceptors, thereby ensuring that the log of a node
can be synchronized to other nodes.
• Each node marks whether each log entry is approved and marks the first unap-
proved log entry to facilitate tracking of the approved log entries.
• The Proposer needs to inform the Acceptors of the approved log entries to help
the Acceptors update logs.
• When an Acceptor replies to the Proposer, it informs the latter of the index of its
first unapproved log entry. If the index of the first unapproved log entry of the
Proposer is larger, the Proposer sends the Acceptor the log entries that the
Acceptor is missing.
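The simple election rule mentioned above, in which the live node with the greatest ID becomes the leader, can be sketched as follows. The class, timing values, and method names are illustrative assumptions, not part of any real Multi-Paxos system.

import time

# "Greatest live ID wins" election rule: a node claims leadership if it has heard
# no heartbeat from any node with a greater ID for 2 * T.

class Node:
    def __init__(self, node_id, heartbeat_interval):
        self.node_id = node_id
        self.t = heartbeat_interval
        self.last_heard = {}           # node_id -> timestamp of last heartbeat

    def on_heartbeat(self, sender_id):
        self.last_heard[sender_id] = time.monotonic()

    def is_leader(self):
        now = time.monotonic()
        higher = [ts for nid, ts in self.last_heard.items() if nid > self.node_id]
        # Vacuously true when no node with a greater ID has ever been heard.
        return all(now - ts > 2 * self.t for ts in higher)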

5.1.4 Raft

Raft [2] is an easy-to-understand and practical consensus protocol that decomposes
a consistency algorithm into three major modules: leader election, log replication,
and safety. In Raft, an elected leader performs log management to implement con-
sistency. This greatly reduces the states that need to be considered, enhances under-
standability, and simplifies engineering implementation.

5.1.4.1 Node Roles

A node in a Raft cluster is either a leader or a follower and can be a candidate in
leader election when no leader is available. There is only one leader in a cluster, and
all other nodes are followers. Raft first elects a leader who is responsible for manag-
ing log replication. The leader receives log entries from all clients, replicates the log
entries to the follower nodes, and when determining that the log entries have been
replicated to a majority of follower nodes instructs the followers to apply these log
entries to their respective state machines. If the leader is faulty, a new leader is
reelected from the followers. Followers do not send any requests themselves and
only respond to requests from the leader and candidates. If a follower cannot receive
any messages, it becomes a candidate and initiates a leader election. A candidate
that receives a majority of votes in the cluster becomes the leader.
Raft divides time into arbitrary-length terms that have specific term numbers.
Each time a leader is elected, a new term begins. If the leader or a candidate discov-
ers that its term number has expired, it immediately switches to the follower role. If
a node receives a request that contains an expired term number, it will reject the
request. Figure 5.4 illustrates the transition between the follower, candidate, and
leader roles.

Fig. 5.4 Transition between the follower, candidate, and leader roles

5.1.4.2 Leader Election

Raft triggers leader elections by using a heartbeat mechanism. In the initial state,
each node is a follower. Followers communicate with the leader by using a heartbeat
mechanism. If a follower receives no heartbeat messages within a specific period of
time, the follower believes that no leader is available in the system and initiates a
leader election.
The follower that initiates the election increases its current local term num-
ber, switches to the candidate role, votes for itself as the new leader, and sends
a vote request to other followers. Each follower may receive multiple vote
requests but can only cast one vote on a first-come, first-served basis. In addition, a
follower votes for a candidate only if the candidate's log is at least as up-to-date as
its own.
The candidates wait for votes from other followers. The vote results for the can-
didates vary depending on the following cases:
• A candidate that receives more than half of the votes wins the election and
becomes the leader. Then, the new leader sends heartbeat messages to other
nodes to maintain its Leader status and prevent new elections from tak-
ing place.
• If a candidate receives a message that contains a larger term number from
another node, the sender node has been elected as the leader. In this case, the
candidate switches to the follower role. If a candidate receives a message that
contains a smaller term number, it rejects this message and maintains the can-
didate role.
• If no candidate receives more than half of the votes, the election times out. In this
case, each candidate starts a new election by increasing its current term number.
To prevent multiple election timeouts, Raft uses a random election timeout algo-
rithm. Each candidate sets a random election timeout when starting an election.
This prevents concurrent timeouts and concurrent initialization of new elections
by multiple candidates, thereby reducing the possibility of votes being divided
up in the new election.
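The election round and the randomized timeout described above can be sketched as follows. This is a didactic fragment rather than code from any real Raft implementation; the peer stubs and field names are assumptions.

import random

# Sketch of a Raft candidate's election round with a randomized timeout.
# "peers" are hypothetical stubs exposing request_vote(...) -> (granted, term).

def election_timeout_ms(low=150, high=300):
    return random.uniform(low, high)     # each candidate picks its own timeout

def run_election(node, peers):
    node["term"] += 1
    node["role"] = "candidate"
    node["voted_for"] = node["id"]
    votes = 1                            # the candidate votes for itself
    for peer in peers:
        granted, peer_term = peer.request_vote(
            term=node["term"],
            candidate_id=node["id"],
            last_log_index=node["last_log_index"],
            last_log_term=node["last_log_term"])
        if peer_term > node["term"]:     # a newer term exists: step down
            node["term"], node["role"] = peer_term, "follower"
            return False
        if granted:
            votes += 1
    if votes > (len(peers) + 1) // 2:    # majority of the whole cluster
        node["role"] = "leader"
        return True
    return False                         # wait election_timeout_ms() and retry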

5.1.4.3 Log Replication

Each server node has a replicated state machine implemented based on the repli-
cated log mechanism. If the state machines of the server nodes are in the same initial
state and the server nodes obtain identical execution commands from the logs,
which are executed in the same order, the final states of the state machines are also
the same.
After the leader is elected, the system provides services externally. The leader
receives requests from clients, each containing a command that acts on a replicated
state machine. The leader encapsulates the requests into log entries, appends the log
entries to the end of the log, and sends these log entries in order to the followers in
parallel. Each log entry contains a state machine command and the current term
number when the leader receives the request, as well as the position index of the log
entry in the log file. When the log entries are safely replicated to a majority of
nodes, the log entries are committed. The leader then returns a success to the clients
and instructs each node to apply the state machine commands in the log entries to
the replicated state machines in the same order as the log entries in the leader. At
this point, the log entries are applied, as shown in Fig. 5.5.
As shown in the figure, log replication in Raft is a Quorum-based process: with n
replicas, it can tolerate the failure of a minority of replicas, that is, at most (n – 1)/2
replicas (rounded down). The leader will supplement the logs for out-of-sync
replicas in the background.

Fig. 5.5 Log replication in Raft

To ensure that the logs of the followers are consistent with the log of the leader,
the leader must find the index position at which the logs of the followers are
consistent with its log. Then, the leader instructs the followers to delete their log
entries after the index position and sends its log entries after the index position to
the followers. In addition, the leader maintains a nextIndex for each follower,
which indicates the index of the next log entry that the leader will send to the fol-
lower. When the leader begins its term, it initializes the nextIndexes to its latest
log entry index +1. If a follower finds during consistency check that its log content
corresponding to the log entry index is inconsistent with the content of the log
entry that the leader sends to it, it will reject the log entry. After receiving the
response, the leader decrements the nextIndex and retries until the nextIndex is
consistent with the log entry index of the follower. At this point, the log entry
from the leader is successfully appended, and the logs of the leader and follower
become consistent. Therefore, the log replication mechanism of Raft has the fol-
lowing characteristics:
• If two log entries in different logs have the same log index and term number,
they store the same state machine command. This characteristic originates
from the fact that the leader can create at most one log entry at a specified log
index position within a term, and the position of the log entry in the log does
not change.
• If two log entries in different logs have the same log index and term number, all
their previous log entries are also the same. This phenomenon can be ascribed to
consistency checks. When the leader sends a new log entry, it also sends the log
index and term number of the previous log entry to the follower. If the follower
cannot find a log entry with the same log index and term number in its log, it will
reject the new log entry.
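The consistency check and nextIndex back-off described above can be sketched as follows. The follower interface and the log representation are illustrative assumptions, not the data structures of a specific implementation.

# Sketch of the nextIndex back-off used to bring a follower's log in sync.
# "follower.append_entries" is a hypothetical RPC returning True when the
# follower finds a matching (index, term) for the entry preceding the batch.

def replicate_to(leader_log, leader_term, follower, next_index):
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index]["term"] if prev_index >= 0 else 0
        ok = follower.append_entries(
            term=leader_term,
            prev_log_index=prev_index,
            prev_log_term=prev_term,
            entries=leader_log[next_index:])
        if ok:
            return len(leader_log)   # the follower is now consistent with the leader
        next_index -= 1              # consistency check failed: retry one entry earlier
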
To prevent committed logs from being overwritten, Raft requires candidates to
have all committed log entries. When a node is newly elected as the leader, it can
only commit logs of the current term that have been replicated to a majority of
nodes. Logs with an old term number cannot be directly committed by the current
leader even if they have been replicated to a majority of nodes. These logs need to
be indirectly committed through log matching when the leader commits logs with
the current term number.

5.1.5 Parallel Raft

Parallel Raft is a consistency protocol designed and developed for PolarFS [3] to
ensure the reliability and consistency of stored data.
For simplicity and protocol comprehensibility, Raft adopts a highly serialized
design, which does not allow holes in logs of either the leader or followers. Log
entries are acknowledged by the follower, committed by the leader, and applied to
all replicas in a serialized manner. When a large number of concurrent write requests
are executed, they are committed in sequence: a request at the end of the queue can
be committed and its result returned only after all previous requests have been
persisted to disk and their results returned, as shown in Fig. 5.6. This increases the
average latency and reduces the throughput.

Fig. 5.6 Design philosophy of Parallel Raft
Parallel Raft removes the serialization constraint and implements performance
optimization for log replication through out-of-order ACKs and out-of-order com-
mits. It also ensures protocol correctness based on the Raft framework and imple-
ments out-of-order application based on actual application scenarios.
Out-of-order ACK: In Raft, after receiving a log entry from the leader, a follower
sends an ACK only after the current log entry and all its previous log entries are
persisted. In Parallel Raft, a follower returns an ACK immediately after receiving
any log entry, thereby reducing the average system latency.
Out-of-order commit: In Raft, the leader commits log entries in series. To be
specific, a log entry is committed only after all its previous log entries are commit-
ted. In Parallel Raft, the leader can commit a log entry as soon as the log entry is
acknowledged by a majority of replicas.
Out-of-order application: In Raft, all log entries are applied in strict order to
ensure the consistency of the data files of all replicas. In Parallel Raft, holes may
occur at different replica log positions due to out-of-order ACKs and out-of-order
commits. Therefore, it is necessary to ensure that a log entry can be safely applied
when preceding log entries are missing, as shown in Fig. 5.7.
To this end, Parallel Raft introduces a new data structure called “look-behind
buffer” to address the issue of missing log entries during application. Each log entry
in Parallel Raft comes with a look-behind buffer, which stores the summary of logi-
cal block addresses (LBAs) modified by the previous N log entries. A follower can
determine whether a log entry conflict exists (i.e., whether the log entry modifies
LBAs that are modified by a missing previous log entry) by using the look-­behind
buffer. If no log entry conflict exists, the log entry can be safely applied. Otherwise,
it is added to a pending list and will be applied after the previous log entry that is
missing is applied.
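The conflict test performed with the look-behind buffer can be illustrated by the following Python sketch. The data layout (each entry carries the set of LBAs it writes and an LBA summary for its N predecessors) is a simplification and not the actual PolarFS data structure.

# Simplified illustration of the look-behind buffer check in Parallel Raft.
# An out-of-order entry may be applied only if none of the LBAs it writes
# overlaps with the LBAs written by missing predecessor entries.

def can_apply(entry, applied_indexes):
    """entry: dict with 'index', 'lbas' (set of written LBAs), and
    'look_behind' (list of (index, lbas) pairs for the previous N entries)."""
    for prev_index, prev_lbas in entry["look_behind"]:
        missing = prev_index not in applied_indexes
        if missing and (entry["lbas"] & prev_lbas):
            return False    # conflicts with a hole: put the entry on the pending list
    return True             # safe to apply even if earlier entries are missing

entry = {"index": 7,
         "lbas": {1024, 1025},
         "look_behind": [(6, {2048}), (5, {1025})]}
print(can_apply(entry, applied_indexes={6}))      # False: entry 5 is missing and touches LBA 1025
print(can_apply(entry, applied_indexes={5, 6}))   # True
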
Through the aforementioned out-of-order ACKs, out-of-order commits, and
out-of-order application, Parallel Raft can avoid the extra waiting time
caused by sequencing during log entry writing and committing, thus effectively
reducing the average latency in high-concurrency multireplica synchronization
scenarios.

Fig. 5.7 Application process

5.2 High Availability of Clusters

Databases are the cornerstone of business systems, and their availability is of vital
importance. Therefore, online databases are rarely deployed in standalone mode
because in this mode, services may become unavailable for seconds or even min-
utes or hours in severe cases (e.g., when accidents such as instance failure, host
failure, or network failure occur). If the disk is corrupted, data may be completely
lost, which is fatal for upper-level businesses that use the database. Hence, a data-
base cluster that implements high availability through leader-follower replication
is usually deployed in the production environment. The following section takes
MySQL as an example to introduce the general practices for implementing high
availability for databases in the industry and then discusses the high availability
architecture of PolarDB with reference to the advantages and disadvantages
of MySQL.

5.2.1 High Availability of MySQL Clusters

MySQL supports the leader-follower mode. In this mode, several independent
database instances are started. One of the instances serves as the leader instance
to receive user write requests, and other instances are connected to the leader
instance as follower instances and synchronize data written to the leader instance
by using the binlog. This way, if the leader instance is unavailable, services can
be switched to a follower instance to ensure service continuity and high
availability.
Figure 5.8 shows the MySQL leader-follower replication process between the
leader instance and a follower instance. This process involves three threads: the
binlog dump thread on the leader instance and the I/O thread and SQL thread on the
follower instance.

Fig. 5.8 Leader-follower replication process in MySQL

5.2.1.1 Binlog Dump Thread

When the leader instance receives a write request and needs to update data, it writes
the event content of this update to its binlog file. At this time, the binlog dump
thread (created when the leader-follower relationship was established) on the leader
instance notifies the follower instance of the data update and passes the content
written to the binlog to the I/O thread of the follower instance.

5.2.1.2 I/O Thread

The I/O thread on the follower instance connects to the leader instance, requests a
connection point at a specified binlog file position from the leader instance, and then
continuously saves the binlog content sent by the leader instance to the local relay
log. Like the binlog, the relay log records data update events. Multiple relay log files
are generated and named in the host_name-relay-bin.000001 format with incremen-
tal suffixes. The follower instance uses an Index file (host_name-relay-bin.index) to
track the currently used relay log file.

5.2.1.3 SQL Thread

Once the SQL thread detects that the relay log is updated, it reads and parses the
update content and locally reexecutes the events that occurred on the leader
instance to ensure that data is synchronized between the leader and follower
instances. The binlog records the SQL statement executed by the user. Therefore,
parsing the binlog content sent by the leader instance is equivalent to receiving the
user request. Then, the SQL thread starts to reexecute the statement, starting from
SQL parsing.
Asynchronous replication is the most common binlog synchronization mode. In
this mode, after the leader instance writes the binlog, it directly returns a success
without waiting for acknowledgment of receipt of the binlog entry from follower
instances. If the leader instance breaks down, data for which a write success has
been returned to the user may have not been synchronized to follower instances.
When services are switched to a follower instance, such data will be lost.
MySQL can address this issue by using a semisynchronous mode. In this mode,
after the leader instance writes the binlog, it must wait for at least one follower
instance to acknowledge that it has received the binlog entry before returning a
write success to the user. This improves data consistency to some extent, but the
overhead of waiting for follower instance synchronization compromises the write
efficiency.
To efficiently achieve high availability, MySQL implements a MySQL Group
Replication (MGR) cluster based on the Paxos consistency protocol. Quorum-based
binlog replication is achieved by using the Paxos protocol to prevent data loss after
service switchover.
MySQL was designed as a database management system that supports multiple
engines. Different storage engines can be quickly integrated into MySQL in the
form of plug-ins. You can choose appropriate storage engines for different busi-
ness scenarios. For example, MyISAM features high insertion and query speed
but does not support transactions, MEMORY puts all data into memory but does
not support data persistence, and InnoDB provides complete transaction proper-
ties and persistence capabilities and is currently the most widely used storage
engine. Data cannot be shared between multiple storage engines; the data format
may vary based on the storage engine. This hinders replication across databases.
The binlog shields the heterogeneity of storage engines and provides a unified
data format to facilitate data synchronization to the downstream and thus serves
as a cornerstone for data replication. MySQL has been widely used in the Internet
era. In addition to its stability and efficiency, its fast and flexible horizontal scal-
ing capability brought upon by the binlog-based replication technology is consid-
ered the key to its success.

Replication Mode

Before MySQL 5.6, data is replicated by using the binlog file position-based repli-
cation protocol. In this method, data is replicated based on binlog file positions,
which are file names and file offsets of the binlog on the leader node. When a fol-
lower node initiates replication, it sends an initial position, pulls logs from the
leader, and applies the logs. This protocol is not flexible and cannot be used to build
complex topologies.

Fig. 5.9 Transaction replication topology

The Global Transaction ID (GTID)-based replication protocol has been available
since MySQL 5.6. A GTID can uniquely identify a transaction on a node. The basic
format of a GTID is

GTID = source_id:transaction_id,

where source_id identifies a node in the replication topology and transaction_id is
the serial number of a transaction on the node. When a replica node
initiates replication, it sends its GTID set to the leader node. After receiving the
GTID set, the leader node can calculate an initial position. With GTIDs, when a
transaction is routed in the replication topology, whether the transaction has
been executed on a node can be determined. For example, in a leader/follower
topology, after a transaction is replicated to the downstream, it may be pulled
back again. However, this transaction has been executed and will be filtered out
based on its GTID. In the transaction replication topology shown in Fig. 5.9, it
is assumed that leader 1 executes transaction (1,1), which is then replicated to
leader 2 and the follower and eventually routed back to leader 1. When leader 1
receives the transaction again, it can determine, based on the GTID of the trans-
action, that the transaction has already been executed locally and filter it out.
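The filtering decision can be expressed in a few lines. The sketch below models a GTID as a (source_id, transaction_id) pair and the executed set as per-source ID sets; this is a simplification of the real gtid_executed bookkeeping in MySQL, and the class and function names are ours.

# Simplified GTID bookkeeping: a transaction is applied only if its GTID has
# not been recorded in the local executed set.

class ExecutedGtids:
    def __init__(self):
        self.by_source = {}            # source_id -> set of transaction_ids

    def contains(self, source_id, transaction_id):
        return transaction_id in self.by_source.get(source_id, set())

    def record(self, source_id, transaction_id):
        self.by_source.setdefault(source_id, set()).add(transaction_id)

def maybe_apply(executed, gtid, apply_fn):
    source_id, transaction_id = gtid
    if executed.contains(source_id, transaction_id):
        return False                   # already executed locally: filter it out
    apply_fn()
    executed.record(source_id, transaction_id)
    return True

executed = ExecutedGtids()
maybe_apply(executed, ("leader-1", 1), lambda: None)          # applied
print(maybe_apply(executed, ("leader-1", 1), lambda: None))   # False: filtered on the way back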

Data Consistency

The binlog will be pulled by the downstream and contains data of committed trans-
actions, with one transaction possibly spanning across multiple storage engines.
Therefore, consistency between the binlog and one or more storage engines must be
guaranteed. MySQL uses the two-phase commit algorithm for distributed data-
bases, with the binlog as the coordinator and the storage engine as the participant.
With this algorithm, a transaction is committed in the following order: Prepare by
the storage engine (persisted) → Commit the binlog (persisted) → Commit by the
storage engine (not persisted).
A transaction commit involves two persistence operations. This way, during
crash recovery, whether the prepared transactions in each storage engine need to be
committed or rolled back can be determined based on whether the binlog has been
completely persisted. Persistence is a time-consuming operation, and transactions
in the binlog are ordered. As a result, the write performance will significantly dete-
riorate when binary logging is enabled in MySQL.
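The commit order and the recovery rule described above can be summarized in the following sketch. It only models the protocol at a high level; the engine and binlog objects and their methods are hypothetical, and the real MySQL implementation adds group commit and many other optimizations.

# Model of binlog/engine two-phase commit and the crash-recovery decision.
# "engine" and "binlog" are hypothetical objects with the listed methods.

def commit(engine, binlog, trx):
    engine.prepare(trx)        # phase 1: the engine prepares and persists its state
    binlog.write(trx)          # phase 2a: the binlog entry is written and persisted
    engine.commit(trx)         # phase 2b: engine commit; persistence may be deferred

def recover(engine, binlog):
    for trx in engine.prepared_transactions():
        if binlog.contains(trx.xid):
            engine.commit(trx)     # binlog persisted: the transaction is durable
        else:
            engine.rollback(trx)   # binlog missing: roll the prepared transaction back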

5.2.2 High Availability of PolarDB

Like AWS Aurora, PolarDB adopts a shared storage architecture that supports one
writer and multiple readers. This architecture is advantageous over the traditional
architecture in which the primary and standby nodes maintain their independent
data. First, it reduces the storage costs. One copy of shared data can support one
read-write node and multiple read-only nodes at the same time. Second, it pro-
vides extreme flexibility. To add a read-only node in an independent data storage
architecture, data needs to be replicated, which is time-consuming (the duration
depends on the total data volume) and may take hours or even days. In the shared storage archi-
tecture, data does not need to be replicated, and a read-only node can be created
within several minutes. Lastly, it significantly reduces the synchronization latency.
Only the memory status needs to be updated during synchronization because the
same disk data is visible to both the read-only nodes and the primary node. Details
will be discussed later. Meanwhile, the following section focuses on some key
technologies PolarDB uses to achieve high availability in the shared storage archi-
tecture that supports one writer and multiple readers. Figure 5.10 shows the shared
storage architecture of PolarDB. In the figure, RW represents a read-write node,
RO represents a read-only node, and PolarStore hosts the distributed file system
PolarFS.

5.2.2.1 Physical Replication

Logical Replication

In a database system of the primary/standby architecture, standby nodes need to
provide read services. In most cases, the primary and standby nodes maintain con-
sistency through log synchronization. In the traditional architecture in which each
node maintains independent data, data consistency is maintained through logical
replication (i.e., through synchronization of logical logs, such as the binlog
in MySQL).

Fig. 5.10 Shared storage architecture of PolarDB
In the shared storage architecture, logical replication may be problematic. The
binlog records the operations performed by transactions on the database, and the
binlog entries are generated when transactions are committed. However, modifica-
tions to database data are made during transaction execution. When concurrent
access requests are received, the database data modification sequence may be incon-
sistent with the transaction commitment sequence. Taking a B+ tree as an example,
even if the same batch of data is inserted, a different B+ tree structure may be gener-
ated when the insertion sequence is different. Inserting the same batch of data means
submitting insert transactions in the same sequence. However, the transactions may
be executed in a different sequence due to contention between transactions, result-
ing in inconsistent insertion sequences. Consequently, the physical data structure on
a standby node may be different from that on the primary node. This is acceptable
in the traditional architecture in which standby nodes have exclusive data but prob-
lematic in the shared storage architecture because the redo log records modifica-
tions to disk pages. If the physical data structures are inconsistent, the results
obtained by replaying the same redo log are inconsistent. Therefore, a new primary-­
standby replication scheme is required for the shared storage architecture.

Physical Logs

In addition to logical logs like the binlog, all database systems have a write-ahead
log (WAL), such as the redo log in MySQL. Such logs were initially designed to
support fault recovery of databases. Before actual data pages in the database are
modified, the modification content is written to the redo log. This way, once the
database fails due to an exception, the database status before the failure can be
restored by replaying the redo log during database restart. Each entry in the redo log
records only the modification to a single disk page. Such logs are called physical
logs. Logical logs may affect data in a large number of different locations during
replay. For example, replaying an INSERT operation may split the B+ tree, modify
an undo page, or even modify some metadata. As the name suggests, physical logs
record direct modifications of physical page information. Such logs can naturally
maintain the consistency of physical data and can be renovated and used for syn-
chronization from the primary node to standby nodes in the shared storage architec-
ture, as shown in Fig. 5.11.
In this architecture, the primary and standby nodes see the same data and the
same redo log on the shared storage. The primary node only needs to inform a
standby node of the position at which the current log write ends. Then, the standby
node reads the redo log from the shared storage and updates its memory status. The
physical structure of the primary node can be obtained by replaying the redo log,
which also ensures that the information in the memory structure of the standby node
completely corresponds to the persistent data in the shared storage.

Fig. 5.11 Synchronization from the primary node to a standby node in the shared storage
architecture

Implementation of Physical Replication

The redo log is located at the underlying level of the database system engine and
records the final modification of the data page. Corresponding to the logical rep-
lication mechanism mentioned above, replication is implemented based on a bot-
tom-up approach in the shared storage architecture that uses the physical
replication scheme, as shown in Fig. 5.12. The standby node reads the redo log
from the shared storage, parses and applies the redo log, and then updates the
cached data page, transaction information, index information, and other status
information in the memory.

Comparison of Replication Latencies

In addition to the differences in replication implementation logic, physical replica-
tion has a significantly shorter replication latency than logical replication. Logical
replication is implemented at the transaction level. To be specific, a transaction
starts to be executed on a standby node only after it is committed on the primary
node. Therefore, the replication latency can be calculated as follows:

Latency = Transaction time + Transmission time + Replay time

The latency becomes excessively long if it takes a long time to execute a transac-
tion. Physical replication uses a different approach as it is intended to maintain data
consistency between the primary and standby nodes at the physical page level. The
redo log can be continuously written during transaction execution. Therefore, trans-
action rollback and MVCC can be implemented for standby nodes in the same mode

Fig. 5.12 Synchronization in the shared storage architecture

as that for the primary node. Physical replication can be performed in real time dur-
ing the execution of a transaction. The replication delay for physical replication can
be calculated as follows:

Latency = Transmission time + Replay time

The transmission time can be very small because the same redo log is accessed,
and the replay time accounts only for the time taken to replay the content of a single
page, which is much smaller compared with the entire binlog. As a result, the repli-
cation latency of physical replication is much shorter than that of logical replication,
even reaching the millisecond level. In addition, the replication latency of physical
replication is irrelevant to transactions. Figure 5.13 shows the latency comparison
between physical replication and logical replication.

Binlog in Physical Replication

Cloud-native databases implement a more efficient physical replication technology,
and synchronization within a cluster can be completed by using only physical repli-
cation. However, the redo log for physical replication is strongly dependent on the
InnoDB storage engine and cannot be recognized by other engines or synchroniza-
tion tools. In most scenarios, database data needs to be synchronized to the down-
stream in real time, for example, to the downstream analytics database for report
generation or to user-built standby databases. After years of development, the bin-
log has become a standard log format for the database ecosystem in the upstream

Fig. 5.13 Latency comparison between physical replication and logical replication

and downstream. Therefore, in the physical replication scheme, the database still
needs to support the binlog.
This can be easily implemented in the shared storage architecture by writing the
binlog to the shared storage. Figure 5.14 shows physical replication in a nonshared
storage architecture, in which the binlog can be transferred to a standby node by
using a replication link (which is a logical replication link) other than that used to
transfer the redo log. However, these two log links are not synchronized. If a swi-
tchover is performed due to an exception, the binlog and data on the standby node
may be inconsistent.
To solve this problem, Alibaba Cloud proposed the concept of logic redo log,
which integrates the capabilities of the binlog and redo log. This avoids the data
consistency issue that arises due to the synchronization of the redo log and the bin-
log. Figure 5.15 shows the logic redo architecture.
The binlog is stored in a distributed manner but is presented as a complete file to
external interfaces. The runtime binlog system maintains the memory file structure,
parses log files, and provides a centralized interface for the binlog and redo log.

5.2.2.2 Logical Consistency

In the one writer, multireader architecture of cloud-native databases, consistency
between the RW node and RO nodes is implemented based on the redo log. The
redo log is continuously written during transaction execution. Therefore, when an
RO node reads the redo log, it must be in the same transaction status as the RW
node. This necessitates concurrency control on the RO nodes, so that the RO nodes
are logically consistent with the RW node in terms of transaction status. After the
redo log is replayed on the RO nodes, physical structures on the RO nodes may be
different because of different transaction commitment sequences and data modifica-
tion sequences. Therefore, concurrency control must be implemented to make sure
that all RO nodes read the same physical structure.

Fig. 5.14 Physical replication in a nonshared storage architecture

Fig. 5.15 Logic redo architecture
This section describes the snapshot and MVCC implementation for logical con-
sistency, as well as how to ensure the physical structure consistency in the B+ tree
structure.

Implementation of Consistent Reads: Read View

A read view is a snapshot that records the ID array and related information about
currently active transactions in the system. It is used for visibility judgment, that is,
for checking whether the current transaction is eligible to access a row. A read view
involves multiple variables, including the following:
trx_ids: This variable stores the list of active transactions, namely, the IDs of other
uncommitted active transactions when the read view was created. For example,
if transaction B and transaction C in the database have not been committed or
rolled back when transaction A creates a read view, trx_ids will record the trans-
action IDs of transaction B and transaction C. If the transaction ID recorded in a
data row exists in trx_ids, the corresponding row version is invisible. Otherwise, it
is visible.
low_limit_id: The latest maximum transaction ID +1, where the maximum transac-
tion ID is obtained from the max_trx_id variable of the transaction system. If the
transaction ID contained in a record is greater than the value of low_limit_id of
the read view, the record is invisible to the current transaction.
up_limit_id: The minimum transaction ID in trx_ids. If trx_ids is empty, up_limit_
id is equal to low_limit_id. Although the field name is up_limit_id, the last active
transaction ID in trx_ids is the smallest one because the active transaction IDs in
trx_ids are sorted in descending order. Records with a transaction ID less than
the value of up_limit_id are visible to this view.
creator_trx_id: The ID of the transaction that created the current read view.
When a transaction accesses a row of data, the record visibility can be deter-
mined based on the following rules:
• Create a read view. (The number of read views created varies based on the isola-
tion level in MVCC. For more information, see related content in the following
sections.)
• If the trx_id value (the version number of the data) of a record is less than the
value of up_limit_id, the transaction that generates this version has been commit-
ted before the read view is created. In this case, the record is visible.
• If trx_id > low_limit_id, the transaction that generates this version is created
after the read view is created. In this case, the record is invisible.
• If up_limit_id < trx_id < low_limit_id, and trx_id is in trx_ids, the transaction
that generates this version is still active and the record is invisible. If trx_id is not
in trx_ids, the transaction that generates this version has been committed and the
record is visible.
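The visibility rules above map almost directly to code. The following Python sketch mirrors them; the field names follow the description in this section, and the treatment of the boundary value low_limit_id is a simplification of the actual InnoDB logic.

# Sketch of read-view visibility checking, following the rules listed above.

class ReadView:
    def __init__(self, trx_ids, low_limit_id, up_limit_id, creator_trx_id):
        self.trx_ids = set(trx_ids)          # transactions active at snapshot time
        self.low_limit_id = low_limit_id     # smallest id not yet assigned at snapshot time
        self.up_limit_id = up_limit_id       # smallest id in trx_ids
        self.creator_trx_id = creator_trx_id

    def is_visible(self, record_trx_id):
        if record_trx_id == self.creator_trx_id:
            return True                      # a transaction sees its own changes
        if record_trx_id < self.up_limit_id:
            return True                      # committed before the snapshot was taken
        if record_trx_id >= self.low_limit_id:
            return False                     # started after the snapshot was taken
        return record_trx_id not in self.trx_ids   # active at snapshot time -> invisible

view = ReadView(trx_ids=[8, 10], low_limit_id=12, up_limit_id=8, creator_trx_id=11)
print(view.is_visible(7))    # True: committed before the snapshot
print(view.is_visible(10))   # False: still active when the snapshot was taken
print(view.is_visible(9))    # True: between the limits but not active, hence committed
print(view.is_visible(12))   # False: created after the snapshot
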
MVCC supports only the read committed and repeatable read isolation lev-
els. The difference in their implementations lies in the number of generated
read views.
The repeatable read isolation level avoids dirty reads and nonrepeatable reads
but experiences phantom read problems. MVCC for this level is implemented as
follows: In the current transaction, a read view is generated only for the first
ordinary SELECT query; all subsequent SELECT queries reuse this read view.
The transaction always uses this read view for snapshot queries until the trans-
action ends. This avoids nonrepeatable reads but not the phantom read problem,
which can be solved by using gap locks and record locks of the next-key lock
algorithm.
MVCC for the read committed isolation level is implemented as follows: A new
snapshot is generated for each ordinary SELECT query. Each time a SELECT state-
ment starts, all active transactions in the current system are recopied to a list to
generate a read view. This achieves higher concurrency and avoids dirty reads but
cannot avoid nonrepeatable reads and phantom read problems.

Implementation of Version Correctness in the Shared Storage Architecture

PolarDB adopts the redo log-based physical replication scheme to implement the
shared storage architecture that supports one writer and multiple readers. The RW
node and RO nodes share the same data. Therefore, the hidden fields in the record
are exactly the same in the RW node and RO nodes. To guarantee that the correct
data version is read during data access in the one-writer, multireader architecture,
the consistency of the transaction status of the RW node and the RO nodes must be
ensured. The transaction status is synchronized by using the redo log. The start of a
transaction can be identified by the MLOG_UNDO_HDR_REUSE or MLOG_
UNDO_HDR_CREATE record, and the commit of a transaction can be identified
by adding an MLOG_TRX_COMMIT record in PolarDB. This way, the committed
transactions and active transactions can be clearly identified on RO nodes by apply-
ing the redo log records, thereby ensuring a consistent transaction status between
the RW node and RO nodes.
Figure 5.16 shows the transaction status of an RW node and an RO node in the
repeatable read isolation level.
In the figure, the left-side column shows the transaction status of an RW node,
and the right-side column shows the transaction status of an RO node. MVCC-­
facilitated consistent nonlocking reads are supported in the repeatable read and read
committed isolation levels.

Fig. 5.16 Transaction status of an RW node and an RO node in the repeatable read isolation level

5.2.2.3 Physical Consistency

As one of the key factors affecting system performance, the index structure has a sig-
nificant impact on the performance of database systems in high-concurrency scenar-
ios. In addition to conventional operations, such as query, insert, delete, and update
operations, the B+ tree structure supports structural modification operations (SMOs).
For example, when a tree node does not have sufficient space to accommodate a new

record, the node will be split into two nodes, and the new node will be inserted to the
upper-level parent node. This changes the tree structure. Without a proper concur-
rency control mechanism, other operations that are performed at the same time an
SMO is performed on the B+ tree can see a tree structure in an intermediate state.
Moreover, corresponding records that should exist cannot be found or the access may
fail because an invalid memory address is accessed. In cloud-native databases, physi-
cal consistency means that even if multiple threads access or modify the same B+ tree
at the same time, all threads must see a consistent structure of the B+ tree.
This can be achieved by using a large index lock, which seriously compromises
the concurrency performance. Since the introduction of the B+ tree structure in
1970, many studies on how to optimize the performance of B+ trees in multithread
scenarios have been published at top conferences in the database and system fields,
such as VLDB, SIGMOD, and EuroSys.
ture that features separation of computing and storage and supports one writer and
multiple readers, the RW nodes and RO nodes have independent memory and main-
tain different replicas of the B+ tree. However, threads on the RW nodes and RO
nodes may access the same B+ tree at the same time. This poses a problem in terms
of the physical consistency across nodes.
This section describes the concurrency control mechanism for B+ trees in
InnoDB, which is the method used to ensure the physical consistency of a B+ tree
in the traditional single-node architecture, and tackles the method used to ensure the
physical consistency of a B+ tree in the one-writer, multireader architecture in
PolarDB.

Physical Consistency of a B+ Tree in a Traditional Architecture

A proper concurrency control mechanism for a B+ tree must meet the following
requirements:
• The read operations are correct. R.1: A key-value pair in an intermediate state
will not be read. In other words, a read operation will not read a key-value pair
that is being modified by another write operation. R.2: An existing key-value pair
must be present. If a key-value pair on a tree node being accessed by a read
operation is moved to another tree node by a write operation (e.g., in a splitting
or merging operation), the read operation may fail to find the key-value pair.
• The write operations are correct. W.1: Two write operations will not modify the
same key-value pair at the same time.
• No deadlocks exist. D.1: Deadlocks, which are a situation in which two or more
threads are permanently blocked and wait for resources occupied by other
threads, will not occur.
PolarDB for MySQL 5.6 and earlier versions adopt a relatively basic concur-
rency mechanism that uses locks of two granularities: S/X locks on indexes and S/X
locks on pages (pages are equivalent to tree nodes in this book). An S/X lock on an
index is used to avoid conflicts in tree structure access and modification operations,
and an S/X lock on a page is used to avoid conflicts in data page access and modifi-
cation operations.
The following lists the notations that will be used in pseudocode in this book:
• SL adds a shared lock.
• SU releases a shared lock.
• XL adds an exclusive lock.
• XU releases an exclusive lock.
• SXL adds a shared exclusive lock.
• SXU releases a shared exclusive lock.
• R.1/R.2/W.1/D.1: correctness requirements that concurrency mechanisms need
to satisfy.
The following section analyzes the processes of read and write operations by
using pseudocode.
In Algorithm 1, the read operation adds an S lock to the entire B+ tree (Step 1),
traverses the tree structure to find the corresponding leaf node (Step 2), adds an S
lock to the page of the leaf node (Step 3), releases the S lock on the index (Step
4), accesses the content of the leaf node (Step 5), and then releases the S lock on
the leaf node (Step 6). The read operation adds an S lock to the index to prevent
the tree structure from being modified by other write operations, thus meeting
R.2. After the read operation reaches the leaf node, it applies for a lock on the
page of the leaf node and then releases the lock on the index. This prevents a key-
value pair from being modified by other write operations, thereby meeting R.1.
The read operation adds an S lock to the B+ tree. This way, other read operations
can access the tree structure in parallel, thereby reducing concurrency conflicts
between read threads.

/* Algorithm 1. Read operation */


Step 1. SL(index)
Step 2. Traverse the tree structure to find the target leaf node
Step 3. SL(leaf)
Step 4. SU(index)
Step 5. Read the content of the leaf node
Step 6. SU(leaf)

A write thread may modify the entire tree structure. Therefore, it is necessary to
prevent two write threads from accessing the same B+ tree at the same time. To this
end, Algorithm 2 adopts a more pessimistic solution. Each write operation adds an
X lock to the B+ tree (Step 1) to prevent other read or write operations from access-
ing the B+ tree during the execution of the write operation and from accessing an
incorrect intermediate state. Then, the write operation traverses the tree structure to
find the corresponding leaf node (Step 2) and adds an X lock to the page of the leaf
node (Step 3). Next, the write operation determines whether it will trigger an opera-
tion that modifies the tree structure, such as a splitting or merging operation. If yes,
the write operation modifies the entire tree structure (Step 4) and then releases the
lock on the index (Step 5). Lastly, it modifies the content of the leaf node (Step 6)
and then releases the X lock on the leaf node (Step 7). Although the pessimistic
write operation satisfies W.1 by using an exclusive lock on the index, the exclusive
lock on the B+ tree blocks other read and write operations. This results in poor mul-
tithreading scalability in high-concurrency scenarios. The following discussion will
reveal if there is room for optimization.

/* Algorithm 2. Pessimistic write operation */


Step 1. XL(index)
Step 2. Traverse the tree structure to find the target leaf node
Step 3. XL(leaf) /* lock prev/curr/next leaves */
Step 4. Determine whether the operation will trigger a splitting or merging operation that modifies the tree
structure
Step 5. XU(index)
Step 6. Modify the content of the leaf node
Step 7. XU(leaf)

Each tree node page can store a large number of key-value pairs. Therefore, a
write operation on a B+ tree does not usually trigger an operation that modifies the
tree structure, such as splitting or merging. Compared with the pessimistic idea of
Algorithm 2, Algorithm 3 adopts an optimistic approach that assumes most write
operations will not modify the tree structure. In Algorithm 3, the whole process of
the write operation is roughly the same as that in Algorithm 1. The write operation
holds an S lock on the tree structure during access, so that other read operations and
optimistic write operations can also access the tree structure at the same time. The
main difference between Algorithm 3 and Algorithm 1 is that in the former, the
write operation holds an X lock on the leaf node. In MySQL 5.6, a B+ tree often
performs an optimistic write operation first and only performs a pessimistic write
operation when the optimistic write operation fails. This reduces conflicts and
blocking between operations. Both a pessimistic write operation and an optimistic
write operation prevent write conflicts by using a lock on the index or page to
meet W.1.

/* Algorithm 3. Optimistic write operation */


Step 1. SL(index)
Step 2. Traverse the tree structure to find the target leaf node
Step 3. XL(leaf)
Step 4. SU(index)
Step 5. Modify the content of the leaf node
Step 6. XU(leaf)

In MySQL 5.6, locks are added from top to bottom and from left to right. This
prevents locks added by any two threads from forming a loop to prevent deadlocks
and meet D.1.
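
To make the interplay between these algorithms concrete, the following Python sketch combines them: a write first takes the optimistic path (Algorithm 3) under an S lock on the index and falls back to the pessimistic path (Algorithm 2) under an X lock only if the leaf would have to be split. The RWLock class and the tree methods find_leaf, would_split, put, and split_and_insert are illustrative assumptions, not InnoDB's actual interfaces.

import threading

class RWLock:
    # Minimal reader-writer lock: many S (shared) holders or one X (exclusive) holder.
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def s_lock(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def s_unlock(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def x_lock(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def x_unlock(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

def insert(tree, key, value):
    # Optimistic attempt (Algorithm 3): S lock on the index, X lock on the leaf page.
    tree.index_lock.s_lock()
    leaf = tree.find_leaf(key)
    leaf.latch.acquire()                     # X lock on the leaf page
    tree.index_lock.s_unlock()
    if not leaf.would_split(key, value):     # common case: no SMO is triggered
        leaf.put(key, value)
        leaf.latch.release()
        return
    leaf.latch.release()                     # optimistic attempt failed

    # Pessimistic attempt (Algorithm 2): X lock on the whole index blocks other operations.
    tree.index_lock.x_lock()
    leaf = tree.find_leaf(key)
    leaf.latch.acquire()
    tree.split_and_insert(leaf, key, value)  # perform the SMO and the insert
    tree.index_lock.x_unlock()
    leaf.latch.release()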

After PolarDB for MySQL is upgraded from 5.6 to 5.7, the concurrency mecha-
nism of the B+ tree significantly changed in the following aspects: First, SX locks
are introduced, which conflict with X locks but do not conflict with S locks, thereby
reducing blocked read operations. Second, a write operation locks only the modifi-
cation branch to reduce the scope of locking. The read operations and optimistic
write operations in MySQL 5.7 are similar to those in MySQL 5.6. Hence, this sec-
tion describes only the pseudocode for a pessimistic write operation in MySQL 5.7.
In Algorithm 4, a write operation adds an SX lock to the tree structure (Step 1),
adds an X lock to the branches affected during the traversal of the tree structure
(Steps 2–4), adds an X lock to the leaf node (Step 5), modifies the tree structure and
releases the locks on nonleaf nodes and the index (Steps 6–8), and then modifies the
leaf node and releases the lock on the leaf node (Steps 9 and 10). The correctness of the write operations and
the deadlock-free requirement are similar to those in the preceding sections.
Therefore, details will not be described repeatedly here. Compared with that in
PolarDB for MySQL 5.6, a pessimistic write operation in PolarDB for MySQL 5.7
no longer locks the entire tree structure but locks only the modified branches. This
way, read operations that do not conflict with the write operation can be performed
in parallel with the write operation, thereby reducing conflicts between threads.
PolarDB for MySQL 8.0 uses a locking mechanism similar to that of PolarDB for
MySQL 5.7.

/* Algorithm 4. Pessimistic write operation */


Step 1. SXL(index)
Step 2. While current is not leaf do {
Step 3. XL(modified non-leaf)
Step 4. }
Step 5. XL(leaf) /* lock prev/curr/next leaf */
Step 6. Modify the tree structure
Step 7. XU(non-leaf)
Step 8. SXU(index)
Step 9. Modify the leaf node
Step 10. XU(leaf)
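
The behavioral difference introduced by SX locks can be summarized by a compatibility matrix. The following minimal sketch encodes the rules assumed in the preceding discussion; it is an illustration rather than InnoDB's lock manager, and the function name is hypothetical.

# Lock-mode compatibility assumed above: S is shared, X is exclusive, and SX
# blocks other writers (X and SX) but does not block readers holding S locks.
COMPATIBLE = {
    ("S", "S"): True,   ("S", "SX"): True,   ("S", "X"): False,
    ("SX", "S"): True,  ("SX", "SX"): False, ("SX", "X"): False,
    ("X", "S"): False,  ("X", "SX"): False,  ("X", "X"): False,
}

def can_grant(requested, held_modes):
    # A requested mode is granted only if it is compatible with every held mode.
    return all(COMPATIBLE[(requested, held)] for held in held_modes)

# Example: a pessimistic write holds SX on the index; readers can still enter,
# but another structure-modifying write cannot.
assert can_grant("S", ["SX"])
assert not can_grant("X", ["SX"])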

Physical Consistency of a B+ Tree in the One-Writer, Multireader Architecture

Unlike the traditional InnoDB engine, which needs to ensure only the physical con-
sistency of the B+ tree of a single node, PolarDB adopts the one-writer, multireader
architecture and must ensure that the concurrent threads on multiple RO nodes can
read consistent B+ trees. In PolarDB, an SMO is synchronized to B+ trees in the
memory of RO nodes by replaying the redo log based on physical replication.
Physical replication synchronizes the redo log in the unit of disk pages. However,
one SMO affects multiple tree nodes, which may breach the atomicity of applying
the redo log entry of the SMO. Consequently, concurrent threads on the RO nodes
may read inconsistent tree structures.

The simplest solution is to forbid user threads from retrieving from the B+ tree
when an RO node discovers that the redo log contains a log entry of an SMO (i.e.,
when the structure of the B+ tree is changed, such as when page merging or splitting
occurs). When a minitransaction that holds an exclusive lock on an index and modi-
fies data across pages is committed on the primary node, the ID of the index is writ-
ten to the log. When the log is parsed on a standby node, a synchronization point is
generated each time when the following tasks are completed: (1) parsing of the log
is completed, (2) the exclusive lock on the index is obtained, (3) the log group is
replayed, and (4) the exclusive lock on the index is released.
Although this method can effectively solve the foregoing problem, too many
synchronization points significantly affect the synchronization speed of the redo
log. This may lead to high synchronization latencies of RO nodes. To address this
issue, PolarDB introduces a versioning mechanism that maintains a global counter
Sync_counter for all RO nodes. This counter is used to coordinate the redo log syn-
chronization mechanism and the concurrent execution of user read requests, thereby
ensuring the consistency of B+ trees.
• During parsing of the redo log, an RO node collects the IDs of all indexes on
which an SMO is performed and increments Sync_counter:
• X locks on all indexes affected by an SMO are acquired, the latest copy of Sync_
counter is maintained on the index memory structure, and the X locks on the
indexes are released. A request to access the B+ tree needs to hold an S index
lock, and an X lock can ensure that the B+ tree cannot be accessed before the X
lock is released.
• When a user request traverses the B+ tree, it checks whether the copy of Sync_
counter on the index is consistent with the global Sync_counter. If they are consistent,
an SMO has been performed on this B+ tree, and the index page being accessed needs to be
updated to the latest version by using the redo log. Otherwise, the redo log does
not need to be replayed.
By using this optimistic approach, PolarDB greatly reduces the interference to
concurrent requests to the B+ tree during application of the SMO log entries in the
redo log. This significantly improves the performance of read-only nodes while
ensuring the physical consistency across nodes.
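
The following Python sketch illustrates, from the perspective of an RO node, the versioning check described above. The class names, the LSN-based replay, and the data layout are illustrative assumptions rather than PolarDB's actual implementation.

class Page:
    def __init__(self):
        self.lsn = 0          # version of the page currently in the RO node's memory
        self.data = {}

class IndexMeta:
    def __init__(self):
        self.smo_counter_copy = 0   # copy of Sync_counter taken when an SMO log
                                    # entry for this index was parsed

def replay_redo_for_page(redo_records, page):
    # Apply buffered redo records that are newer than the in-memory page.
    for rec in redo_records:
        if rec["lsn"] > page.lsn:
            page.data.update(rec["change"])
            page.lsn = rec["lsn"]

def read_page(index_meta, page, redo_records, global_sync_counter):
    # Only indexes touched by the latest SMO batch pay the replay cost;
    # all other B+ trees are traversed without touching the redo log.
    if index_meta.smo_counter_copy == global_sync_counter:
        replay_redo_for_page(redo_records, page)
    return dict(page.data)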

5.2.2.4 DDL

Basic Concepts and Core Logic of DDL

Data Definition Language (DDL) is an essential component of the standard
SQL. DDL defines data schemas and establishes a logical structure for the target
dataset, so that all entries in the target dataset are processed according to the struc-
ture. Dataset processing includes database creation, table and table index creation,
and view creation, modification, and deletion. The most common DDL commands
in MySQL include CREATE, ALTER, and DROP.

In the trade-off between time, space, and flexibility, MySQL sacrifices flexibility
and uses a design in which data definitions are separated from data storage. In this
design, each piece of physical data stored in MySQL does not
contain all information required for interpreting itself and can be correctly inter-
preted and manipulated only in combination with the independently stored data
definitions. As a result, DDL operations are often accompanied by modifications
of full table data, making them the most time-consuming operations in
MySQL. In scenarios with large data volumes, the execution of a single DDL
statement can take days.
DDL operations are often executed concurrently with Data Query Language
(DQL) and Data Manipulation Language (DML). To control concurrency and
ensure the correctness of database operations, MySQL introduced metadata locks
(MDLs). A DDL operation can block DML and DQL operations by acquiring
exclusive MDLs, thus achieving concurrency control. However, in production
environments, blocking DML operations and other operations for a long time
severely affects the business logic. To address this problem, MySQL 5.6 intro-
duced the online DDL feature. Online DDL enables concurrent execution of DDL
and DML operations by introducing the Row_log object. The Row_log object
records the DML operations executed during the execution of a DDL operation,
and the incremental data generated by the DML operations is replayed after the
DDL operation is executed. This simple solution effectively solves the problem
that arises in the concurrent execution of DDL and DML operations. With this
solution, the basic DDL logic of MySQL began to take shape. DDL has been
adopted in MySQL databases ever since and is still used in the latest MySQL ver-
sion (i.e., MySQL 8.0).
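
The following Python sketch illustrates the Row_log idea in miniature: the table is rebuilt under the new definition while DML executed during the rebuild is recorded in a row log, which is replayed once the bulk copy finishes. The function and the in-memory tables are illustrative assumptions, not MySQL's internal interfaces.

def online_rebuild(old_table, transform_row, row_log):
    # Phase 1: bulk-copy existing rows into the new table definition.
    new_table = {key: transform_row(row) for key, row in old_table.items()}
    # Phase 2: replay the incremental DML recorded in the row log during the copy.
    for op, key, row in row_log:
        if op in ("insert", "update"):
            new_table[key] = transform_row(row)
        else:                                  # "delete"
            new_table.pop(key, None)
    return new_table

def add_email_column(row):
    return {**row, "email": ""}

old = {1: {"name": "a"}, 2: {"name": "b"}}
log = [("delete", 2, None), ("insert", 3, {"name": "c"})]   # DML that arrived mid-DDL
print(online_rebuild(old, add_email_column, log))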

Instant DDL

MySQL has effectively solved the concurrency issue between DDL and other oper-
ations. However, the manipulation of full data during DDL operations remains a
significant burden on the storage engine. To address this issue, MySQL 8.0 intro-
duced the Instant DDL feature. This feature enables MySQL to modify only the data
definition during DDL operations without modifying the actual physical data stored.
This solution, which is completely different from the previous DDL logic, stores
more information in data definitions and physical data to facilitate correct interpre-
tation and manipulation of the physical data. Essentially, the instant DDL feature
is a simplified version of the data dictionary multiversioning technique. However,
due to various complex engineering issues, such as compatibility, instant DDL now
supports only adding columns to the end of a table. Sustained efforts are still
required to enable instant DDL to support other operations. PolarDB faces scenarios
involving massive cloud data. Therefore, the instant DDL feature is especially
important to PolarDB. Through considerable efforts, Alibaba Cloud has imple-
mented instant DDL in earlier versions, such as PolarDB for MySQL 5.7, and
enabled instant DDL to support more operations.
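
A minimal sketch of the metadata-only idea behind instant DDL follows. It assumes a simplified row format in which a row stores only the columns that existed when it was written and missing trailing columns are filled from defaults kept in the data definition; the class and function names are hypothetical.

class TableDef:
    def __init__(self, columns, defaults):
        self.columns = columns        # ordered column names
        self.defaults = defaults      # column name -> default value

def instant_add_column(table_def, name, default):
    # Only the data definition changes; no physical row is rewritten.
    table_def.columns.append(name)
    table_def.defaults[name] = default

def read_row(table_def, stored_values):
    # stored_values holds only the columns that existed when the row was written.
    row = dict(zip(table_def.columns, stored_values))
    for col in table_def.columns[len(stored_values):]:
        row[col] = table_def.defaults[col]    # materialize new columns on read
    return row

t = TableDef(["id", "name"], {})
old_row = [1, "alice"]                        # row written before the DDL
instant_add_column(t, "email", None)          # O(1), no table rewrite
print(read_row(t, old_row))                   # {'id': 1, 'name': 'alice', 'email': None}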

Parallel DDL

Some DDL operations, such as index creation, inevitably involve manipulation of
full data, which cannot be avoided even by using the instant DDL feature.
Therefore, the parallel DDL feature was introduced. Parallel DDL is a typical use
case of the unique parallel service system of the PolarDB storage engine in DDL
scenarios. Facilitated by the unique parallel service system of the storage engine,
an I/O-­intensive DDL operation, such as a full data scan or creation of an index,
is divided into several execution units. These execution units are scheduled by the
parallel service system to correctly complete the entire DDL operation. This
greatly reduces the execution time of the DDL operation, especially in scenarios
involving a large amount of data. Combination of parallel DDL and instant DDL
can be explored and leveraged to alleviate the heavy system load caused by DDL
operations.
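
The following sketch illustrates the general idea of dividing an I/O-intensive DDL task, such as an index build, into execution units that are processed in parallel and then merged. The ThreadPoolExecutor used here is only a stand-in for the storage engine's parallel service system.

from concurrent.futures import ThreadPoolExecutor

def build_index_parallel(rows, key_func, units=4):
    # Split the full scan into execution units of roughly equal size.
    chunk = max(1, len(rows) // units)
    ranges = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]

    def scan_and_sort(part):                       # one execution unit
        return sorted(((key_func(r), r) for r in part), key=lambda kv: kv[0])

    with ThreadPoolExecutor(max_workers=units) as pool:
        partials = list(pool.map(scan_and_sort, ranges))

    # Final merge of the partial, sorted results into the index.
    return sorted((kv for part in partials for kv in part), key=lambda kv: kv[0])

rows = [{"id": i, "v": i % 7} for i in range(20)]
index = build_index_parallel(rows, key_func=lambda r: r["v"])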

Challenges and Solutions for DDL in the One-Writer, Multireader Architecture

In the one-writer, multireader architecture, introduction of regular logic and new
features will pose new challenges and issues for DDL.
In this architecture, the data definition consistency issues during DDL operations
become prominent because multiple nodes share the same copy of physical data.
When a write node executes a DDL operation, all nodes must read consistent data
definition information. Otherwise, errors will occur during data operations. To
address this issue, PolarDB adopts a parallel MDL synchronization technology.
When a write node performs a DDL operation, MDL information is synchronized
to other RO nodes in time. This ensures that all nodes are in a consistent state in the
target table of the DDL operation throughout the entire cycle of the operation, fur-
ther guaranteeing consistent data operation results on all nodes. Additionally, the
parallel MDL synchronization technology allows a write node to asynchronously
execute some DDL operations. The write node only needs to transfer the MDL sta-
tus information to the read nodes; it does not need to wait until the read nodes reach
the same state before proceeding to execute the DDL logic. This greatly improves
the processing capabilities of the entire cluster during the execution of a DDL
operation.
PolarDB may need to implement cross-zone deployment and point-in-time
recovery. Therefore, operations such as index creation also need to be recorded in
the redo log so that these operations can be replayed as needed. However, these redo
log records are not essential for read-only nodes, and parsing and replaying these
redo log records consume considerable CPU and memory resources. To address this
issue, PolarDB marks the redo log records so that read-only nodes can identify and
discard them, thereby reducing CPU and memory consumption. In practice, han-
dling different types of redo log records by using different logic for different node
roles and application scenarios greatly enhances feature support and ensures smooth
performance of PolarDB.

5.3 Shared Storage Architectures

The compute-storage-separated architecture is one of the key features of cloud-
native databases. In this architecture, different instances of a database can access the
same distributed storage system and share the same data. The underlying storage
system provides high-reliability, high-availability, and high-performance storage
services to the database. The database can be horizontally scaled at the lowest cost,
greatly improving the elasticity of the entire system. Figure 5.17 shows a typical
shared storage architecture of a database. All nodes of the database share data in the
same storage system. The primary node provides read-write services externally, and
RO nodes provide read-only services externally. Users can add read-only nodes at
any time without replicating any data.
Database systems based on shared storage have the following advantages:
• The hardware for computing and storage nodes can be customized indepen-
dently. For example, compute nodes generally require larger and more power-
ful CPU and memory capacities, whereas storage nodes prefer larger disk
capacities.
• The resources on multiple storage nodes can form a unified storage pool, which
is conducive to solving issues such as fragmentation of the storage space, uneven
load between nodes, and waste of space. In addition, the capacity and throughput
of the storage system are easier to scale horizontally.
• The storage system stores all data of the database, so database instances can be
quickly migrated between compute nodes and scaled without replicating data.
This section introduces Amazon Aurora, a storage system for databases that uses
a shared storage architecture, and PolarFS, a distributed file system of the next-­
generation cloud-native database PolarDB.

Fig. 5.17 Shared storage architecture of a database



5.3.1 Aurora Storage System

Aurora [4] is a relational database service launched by AWS specifically for the cloud.
It adopts a compute-storage-separated architecture in which compute nodes and stor-
age nodes are separately located in different virtual private clouds (VPCs). As shown
in Fig. 5.18, users access applications through the user VPC, and the RW node and RO
node communicate with each other in the Relational Database Service (RDS)
VPC. The data buffer and persistent storage are located in the storage VPC. This
achieves the physical separation of computing and storage in Aurora. The storage
VPC consists of multiple storage nodes mounted with local SSDs, which form an
Amazon Elastic Compute Cloud (EC2) VM cluster. This storage architecture provides
a unified storage space to support the one-writer, multireader architecture. Read-only
replicas can be quickly added by transferring the redo log over a network.
Aurora is built on products such as EC2, VPC, Amazon S3, Amazon DynamoDB,
and Amazon Simple Workflow Service (SWF) but does not have a dedicated file
system like PolarFS for PolarDB. To understand the Aurora storage system, you
need to understand the following basic products.

5.3.1.1 Amazon S3

Amazon S3 is a global storage area network that acts like a hard drive with an enor-
mous capacity. It can provide storage infrastructure for any application. The basic
entities stored in S3 are called objects, which are stored in buckets.

Fig. 5.18 Overall architecture of Aurora

The storage architecture of S3 consists of only two layers. An object includes data and metadata,
where the metadata is often key-value pairs describing the object. A bucket provides
a way for organizing, storing, and classifying data. The operation UI of S3 is user-­
friendly, and users can use simple commands to operate data objects in buckets.

5.3.1.2 Amazon DynamoDB

Amazon DynamoDB is a key-value storage system that achieves high availability,
high scalability, and decentralization at the expense of consistency. The storage
model of DynamoDB is a key-value mapping table that uses a consistent hashing
algorithm to reduce the overheads of data migration when a node change occurs.
DynamoDB does not guarantee strong consistency between multiple replicas but
guarantees eventual consistency, which ensures that all replicas will be synchro-
nized to the latest version within a limited time frame when data updates are stopped.
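
The following Python sketch shows the basic mechanics of consistent hashing as used by DynamoDB-style systems: nodes and keys are placed on the same hash ring, and each key is served by the first node encountered clockwise, so a node change only remaps the keys in one arc of the ring. The hash function and node names are illustrative.

import bisect
import hashlib

def ring_hash(value):
    # Map a string onto the hash ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # Find the first node clockwise from the key's position (with wraparound).
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))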

5.3.1.3 Amazon SWF

Amazon SWF helps developers easily build applications that coordinate work
across distributed components and can be viewed as a fully managed task coordina-
tor in the cloud. With Amazon SWF, developers can control the execution and coor-
dination of tasks without the need to track the progress of the tasks or maintain their
status or concern themselves with the complex underlying implementation.
As shown in Fig. 5.19, the underlying storage system of Aurora is responsible for
persisting the redo log, updating pages, and clearing expired log records. It regu-
larly uploads backup data to Amazon S3. The storage nodes are mounted to local
SSDs. Therefore, the persistence of the redo log and page updates do not require
cross-network transmission, and only the redo log needs to be transmitted over the
network. The metadata of the storage system, for example, the data that describes
how data is distributed and the running status of the software, is stored in Amazon
DynamoDB. Aurora’s long-term automated management, such as database recov-
ery and data replication, is implemented by using Amazon SWF.

5.3.2 PolarFS

PolarDB is a next-generation cloud-native relational database independently devel-
oped by Alibaba Cloud that uses the compute-storage-separated architecture. The
database stores logs and data in the distributed file system PolarFS. The database
focuses on the internal logic, whereas PolarFS ensures the high reliability, high
availability, and high performance of storage services. PolarFS adopts a lightweight
user-space network stack and I/O stack in place of the corresponding kernel stack.
This fully utilizes the potentials of emerging technologies and hardware, such as
RDMA and NVMe SSDs, greatly reducing the end-to-end latency of distributed
data access.

Fig. 5.19 Storage architecture of Aurora

The shared storage design of PolarFS enables all compute nodes to
share the same underlying data. This way, read-only instances can be quickly added
in PolarDB without data replication.
PolarFS is internally divided into two layers. The underlying layer is responsible
for virtualization management of storage resources and provides a logical storage
space (in the form of a volume) for each database instance. The upper layer is
responsible for file system metadata management in the logical storage space and
controls synchronization and mutual exclusion for concurrent access to metadata.
PolarFS abstracts and encapsulates storage resources into volumes, chunks, and
blocks for efficient organization and management of resources.
A volume provides an independent logical storage space for each database
instance. Its capacity can dynamically change based on database needs and reaches
up to 100 TB. A volume appears as a block device to the upper layer. In addition to
database files, it also stores metadata of the distributed file system.
A volume is internally divided into multiple chunks. A chunk is the smallest
granularity of data distribution. Each chunk is stored on a single NVMe SSD on a
storage node, which is conducive to implementing high reliability and high
availability of data. A typical chunk size is 10 GB, which is much larger than the
chunk size in other systems. The larger chunk size reduces the amount of metadata
that needs to be maintained for chunks. For example, for a 100-TB volume, meta-
data records need to be maintained only for 10,000 chunks. In addition, the storage
layer can cache metadata in memory to effectively avoid additional metadata access
overhead on critical I/O paths.
A chunk is further divided into multiple blocks. The physical space on SSDs is
allocated to corresponding chunks in the unit of blocks as needed. The typical block
size is 64 KB. The information about mappings between chunks and blocks is man-
aged and stored in the storage layer and cached in memory to further accelerate
data access.
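
Because chunk and block sizes are fixed, translating a byte offset in a volume into chunk and block coordinates is simple arithmetic, as the following sketch (using the typical sizes quoted above) illustrates.

CHUNK_SIZE = 10 * 1024 ** 3      # typical chunk size: 10 GB
BLOCK_SIZE = 64 * 1024           # typical block size: 64 KB

def locate(volume_offset):
    # Translate a byte offset in the volume into (chunk, block, offset-in-block).
    chunk_id = volume_offset // CHUNK_SIZE
    offset_in_chunk = volume_offset % CHUNK_SIZE
    block_id = offset_in_chunk // BLOCK_SIZE
    offset_in_block = offset_in_chunk % BLOCK_SIZE
    return chunk_id, block_id, offset_in_block

# With 10 GB chunks, a 100 TB volume needs metadata for only about 10,000 chunks.
print(locate(25 * 1024 ** 3 + 5))    # lands inside chunk 2
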
As shown in Fig. 5.20, PolarFS consists of Libpfs, PolarSwitch, chunk servers,
PolarCtrl, and other components. Libpfs is a lightweight user-space file system
library that provides a POSIX-like interface to databases for managing and access-
ing files in volumes. PolarSwitch is a routing component deployed on a compute
node. It maps and forwards I/O requests to specific backend storage nodes. A chunk
server is deployed on a storage node and is responsible for responding to I/O
requests and managing resources of chunks. A chunk server replicates write requests
to other replicas of a chunk. Chunk replicas ensure data synchronization in various
faulty conditions by using the ParallelRaft consistency protocol, thereby preventing
data loss. PolarCtrl is a control component of the system and is used for task man-
agement and metadata management.
Libpfs converts the file operation issued by the database into a block device I/O
request and delivers the I/O request to PolarSwitch. Based on the locally cached chunk
route information, PolarSwitch forwards the I/O request to the chunk server on which
the leader chunk is located. After the chunk server on which the leader chunk is
located receives the request by using an RDMA NIC (network interface card), it deter-
mines whether the operation is a read operation or a write operation. If it is a read
operation, the leader chunk directly reads local data and returns the result to
PolarSwitch.

Fig. 5.20 Abstract architecture of the storage layer in PolarFS

If it is a write operation, the leader chunk writes the operation content to
the local WAL and then sends the operation content to the follower chunks, which
subsequently write the operation content to their respective WALs. After receiving
responses from a majority of the follower chunks, the leader chunk returns a write success to
PolarSwitch and asynchronously applies the log to the data area of the follower chunks.
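
The following simplified sketch captures only the majority-acknowledgment rule of this write path; it is not an implementation of ParallelRaft, and the in-memory lists stand in for persistent WALs.

def replicate_write(leader_wal, follower_wals, record):
    leader_wal.append(record)                     # 1. persist locally on the leader
    acks = 1                                      #    the leader counts as one replica
    for wal in follower_wals:
        wal.append(record)                        # 2. forward to a follower (assumed to succeed here)
        acks += 1
        if acks > (1 + len(follower_wals)) // 2:  # 3. majority reached: acknowledge the write;
            return "write success"                #    data pages are applied asynchronously later
    return "write failed"

leader, f1, f2 = [], [], []
print(replicate_write(leader, [f1, f2], {"lsn": 1, "op": "put"}))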

5.4 File System Optimization

5.4.1 User Space I/O Computing

Propelled by technological development, new hardware and protocols continue to
emerge. For example, NVMe SSDs offer lower I/O latency and greater disk through-
puts than traditional hard drives, and RDMA allows computers to directly access the
memory of other computers without CPU intervention, which greatly reduces inter-
machine network communication latency compared with traditional TCP/IP proto-
col stacks. The emergence of such high-performance hardware poses considerable
challenges to the traditional I/O stack and network stack.

5.4.1.1 Kernel Space and User Space

To prevent application programs from occupying excessive system hardware
resources and ensure system security and stability, the Linux system divides the
entire system into user space and kernel space. The user space provides necessary
space for applications, which can only access their memory space in the user space
and cannot directly access other system resources. The kernel space is responsible
for managing system resources, such as CPU, memory, and disk resources. It further
provides necessary system call interfaces, such as read(), write(), and send() in the
C language library, to allow applications to access hardware resources.
As shown in Fig. 5.21, when an application needs to access the hardware
resources of the system, it calls the system call interface provided by the operating
system.

Fig. 5.21 Relationship between the operating system and hardware resources

The CPU switches from the user mode to the kernel mode and accesses the
hardware resources in kernel mode. After the operation is completed, the CPU
switches to the user mode and returns the result to the application. When the CPU
switches from the user mode to the kernel mode to execute the system call, it swaps
out the state of the application from the register and saves the state in memory and
then swaps the process information related to this system call from memory into the
register to execute the process. This procedure is called context switching.

5.4.1.2 Challenges Faced by the Traditional I/O Stack

In the traditional hard disk era, the system uses an interrupt-based I/O model. After
an application initiates an I/O request, the CPU switches from the user mode to the
kernel mode, initiates a data request to the disk, and then switches back to the user
mode to continue processing other work. After the disk data is ready, the disk initi-
ates an interrupt request to the CPU. After receiving the request, the CPU switches
to the kernel mode to read the data and replicates the data from the kernel space to
the user space and then switches back to the user mode. Multiple context switches
and data replication operations occur during the I/O process, which undoubtedly
generates overheads. However, the overheads are negligible compared to the read/
write latency of traditional hard disks. With the successful commercialization and
advancement of NVM technologies, the hard disk speed has significantly improved.
For example, an NVMe SSD can complete 500,000 I/O operations per second with
a latency of less than 100 μs. Therefore, the system performance is no longer bottle-
necked by hardware but by software. To address the issue that the traditional I/O
stack no longer matches the capabilities of high-speed disk devices such as NVMe
SSDs, Intel developed a development kit named Storage Performance Development
Kit (SPDK) for NVMe devices. SPDK enables all necessary drivers to be moved to
the user space to avoid context switches for the CPU and data replication. In addi-
tion, interrupts are replaced with polling by the CPU, thereby further lowering the
latencies of I/O requests.

5.4.1.3 Challenges Faced by the Traditional Network Stack

As shown in Fig. 5.22, in traditional TCP/IP network communication, the data
sender needs to first replicate the data from the user space to the kernel space,
encapsulates the data into a data packet in the kernel space, and then sends the data
packet to a remote machine over the network. After receiving the data packet, the remote
machine also needs to parse the packet in the kernel space before it can replicate the
data to the user space for use by the application. The data is repeatedly replicated
between the user space and the kernel space, which greatly increases the communi-
cation latency and significantly strains the server CPU and memory. In a system that
uses a compute-storage-separated architecture in which a storage node has multiple
replicas, a large number of data transfers take place between nodes. Therefore, the
system performance is bottlenecked by the traditional network stack.

Fig. 5.22 Traditional network communication model

RDMA allows applications to directly access remote machines in the user space. With an RDMA
NIC, read and write requests are completed in registered memory regions, thereby
avoiding the overhead associated with the traditional network stack and enabling
low-latency high-performance network communication. RDMA also frees the CPU
from data transfers, allowing the free CPU resources to handle more tasks.

5.4.1.4 User-Space I/O Stack and Network Stack Based on New Hardware in PolarFS

This section takes the distributed shared file system PolarFS as an example to
describe the application of a user-space I/O stack and network stack that are based
on new hardware in a storage system. PolarFS, which is the underlying storage sys-
tem of PolarDB, provides databases with high-performance and high-availability
storage services at a low latency. PolarFS adopts a lightweight user-space I/O stack
and network stack to utilize the potentials of emerging hardware and technologies,
such as NVMe SSDs and RDMA.
To avoid the overheads of message transfers between the kernel space and user
space in a traditional file system, especially the overheads of data replication,
PolarFS provides the database with a lightweight user-space file system library
named Libpfs. Libpfs replaces the standard file system interface and enables all I/O
operations of the file system to run in the user space. To ensure that I/O events are
handled in a timely manner, PolarFS constantly polls and listens to hardware
devices. In addition, to avoid CPU-level context switching, PolarFS binds each
worker thread to a CPU, so that each I/O thread runs on a specified CPU and each
I/O thread handles different I/O requests and is bound to different I/O devices. In
essence, each I/O request is scheduled by the same I/O thread and processed by the
same CPU in its entire lifetime.

Fig. 5.23 I/O execution process in PolarDB

The I/O execution process in PolarDB is shown in Fig. 5.23. When the routing
component PolarSwitch pulls an I/O request issued by PolarDB from the ring buf-
fer, it immediately sends the request to the buffer zone of the leader node (chunk
server 1) in the storage layer through an RDMA NIC. The buffer zone of chunk
server 1 is registered with the local RDMA NIC. I/O threads in chunk server 1 will
keep pulling requests from the buffer zone. When a new request is found, it is writ-
ten to an NVMe SSD by using SPDK and sent to buffer zones in chunk server 2 and
chunk server 3 over an RDMA network for synchronization.
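
A minimal sketch of the run-to-completion polling model follows. The queue and handler are stand-ins for PolarFS's ring buffers and SPDK/RDMA operations, and os.sched_setaffinity is a Linux-specific call used here to pin the worker to one CPU.

import os
import queue

def polling_io_loop(cpu_id, request_queue, handle_request):
    # Pin this worker to one CPU so that a request is scheduled and processed on
    # the same core for its whole lifetime (os.sched_setaffinity is Linux-only).
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu_id})
    while True:
        try:
            request = request_queue.get_nowait()   # poll instead of sleeping on interrupts
        except queue.Empty:
            continue                               # keep spinning (a real system may back off)
        if request is None:                        # shutdown sentinel
            return
        handle_request(request)                    # stand-in for SPDK/RDMA processing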

5.4.2 Near-Storage Computing

The storage scalability unique to an OLTP cloud-native relational database can support
a capacity of 100 TB for a single instance. Therefore, efficient OLTP is
more important in scenarios with large amounts of data. However, a cloud-native
database uses a storage-compute separated architecture, and all interactions between
compute and storage nodes are carried out over a network. Therefore, the system
performance can be bottlenecked by the network throughput. In an OLTP cloud-­
native relational database based on row storage, a table scan brings unnecessary I/O
reads of rows and columns, and a table access by index primary key generates
unnecessary I/O reads of columns. These additional data reads further exacerbate
the network bandwidth bottleneck.

The only feasible solution to this problem is to reduce network traffic between
compute and storage nodes by pushing down some data-intensive access tasks,
such as table scans, to the storage nodes. This requires the storage nodes to have
higher data processing capabilities to handle the additional table scan tasks. Several
approaches are available. One is to improve the specifications of the storage nodes.
However, this results in extremely high costs, and the current CPU architecture is
unsuitable for scanning tables that store data by row. Another approach is to use
a heterogeneous computing framework in which storage nodes are equipped with
special cost-effective hardware, such as FPGAs and GPUs, to allow them to per-
form table scan tasks. However, as shown in Fig. 5.24, a conventional centralized
heterogeneous computing architecture uses a single standalone FPGA card that
is based on PCIe. As a result, the system performance is bottlenecked by the I/O
and computing bandwidths of a single FPGA. Each storage node contains mul-
tiple SSDs, each of which can achieve a data throughput of several GB/s. During
analytical processing, multiple SSDs simultaneously access the raw data, and the
aggregated data is sent to a single FPGA card for processing. This not only leads
to excessive data traffic on the DRAM/PCIe channel but also results in a data
throughput that far exceeds the I/O bandwidth of a single PCIe card, making the
FPGA card a hotspot during data processing and compromising the overall system
performance.
Therefore, a distributed heterogeneous computing architecture is a better option,
as shown in Fig. 5.25. Multiple storage nodes can be equipped with special hard-
ware so that they can perform table scan tasks. This way, a query request can be
decomposed and sent to the storage nodes for processing. In addition, only the nec-
essary target data is transferred back to the compute nodes. This avoids excessively
large data traffic and prevents a single FPGA card from becoming a hotspot in data
processing.
PolarDB [5] builds an efficient processing architecture with integrated software
and hardware in a cloud-native environment based on the preceding principle. It
takes advantage of the emerging near-storage computing SSD media devices to
push data processing down to the hard disk on which the data is located, thereby
supporting efficient data queries while saving CPU computing resources on the
storage side. This section describes the specific implementation of the efficient pro-
cessing architecture in PolarDB from the software and hardware aspects.

Fig. 5.24 Centralized heterogeneous computing architecture

Fig. 5.25 Distributed heterogeneous computing architecture

5.4.2.1 FPGA

During the design of an application-specific integrated circuit (ASIC), the circuitry
is permanently embedded in the silicon wafer and cannot be changed after the
design is completed. This results in poor flexibility. An FPGA not only realizes all
logic functionalities of an ASIC but also supports updates of the logic functional-
ities based on application requirements after the chip is manufactured. Compared
with ASICs, FPGAs can greatly shorten the development cycle and reduce the
development costs.
Storing data by column can fully leverage hardware resources of the CPU, such
as the cache and the single instruction, multiple data (SIMD) mechanism. For exam-
ple, in analytical processing that usually involves processing of specific columns,
the CPU can read only the useful columns from the cache. The SIMD mechanism
allows the CPU to execute the same instruction for different columns in one clock
cycle, thereby increasing the throughput. However, in an OLTP database based on
row storage, these features cannot be fully utilized. For example, the data read by
the CPU into the cache by row may contain other useless columns.
As a general-purpose processor, the CPU is more suitable for handling tasks with
complex control logic. However, the CPU can achieve only data parallelism and
complete one job in one clock cycle, making it unsuitable for data processing tasks
with simple logic but high requirements for parallelism. For example, when the
CPU processes an SQL statement, it can process only one operator in a relational
algebraic expression of the SQL statement per clock cycle. As a result, it takes a
large number of clock cycles to process the SQL statement. As a dedicated proces-
sor, an FPGA can be programmed in a way that its circuitry can be reorganized to
achieve pipeline parallelism and data parallelism. This way, multiple operators can
be processed in parallel in one clock cycle, greatly reducing processing latency.
Therefore, FPGAs can be used as coprocessors of CPUs to free the latter from
data processing tasks.

5.4.2.2 Computational Storage Drive (CSD)

CSDs are data storage devices that can perform data processing tasks. The CPU and
CSD of a storage node form a heterogeneous system that can free the CPU from
table scan tasks so that the CPU can handle other requests, thereby improving sys-
tem performance. In PolarDB, a CSD is implemented based on an FPGA, which
implements flash control and computation. A storage node manages its CSD by
using mechanisms such as address mapping, request scheduling, and garbage col-
lection (GC). In addition, the CSD is integrated into the Linux I/O stack so that it
can serve normal I/O requests like traditional storage devices.

5.4.2.3 Optimization of Software and Hardware

PolarDB faces software- and hardware-related challenges after computing is pushed
down to the CSDs. In terms of software, the storage engine of PolarDB accesses
data by specifying offsets in files, whereas the CSDs provide scan services by
manipulating logical block addresses (LBAs). Therefore, the entire software or
driver stack needs to be modified to support pushing down of scan tasks to specific
physical blocks. At the hardware level, FPGAs are expensive, so an economical and
efficient way to implement and deploy CSDs must be identified.

Software Stack Optimization

The PolarDB MPP engine is a front-end analytic processing engine of PolarDB. It
parses, optimizes, and rewrites SQL queries and converts SQL queries into directed
acyclic graph (DAG) execution plans that contain operators and data flow topolo-
gies. It supports pushing down of scan tasks to the storage engine without additional
modifications. However, the storage engine, PolarFS, and CSDs still need to be
modified:
Optimization of the storage engine: A CSD does not support all query conditions.
Therefore, when the storage engine receives a scan request from MPP, the stor-
age engine analyzes the query conditions of the scan request, extracts and pushes
down the query conditions that the CSDs can execute, and then locally executes
the query conditions that the CSDs do not support.
Then, the storage engine converts the scan request into block offsets of the to-be-
scanned data in the data file and the schema of the table involved in the scan
operation. Such information is forwarded to PolarFS for processing, along with
the extracted query conditions.
The storage engine allocates a memory space to store the data returned by the CSDs
after executing the scan request. The address of this memory space is also
included in the request sent to PolarFS. After receiving the data returned by the
CSDs, the storage engine checks the data against the complete query conditions
and then returns the data to the upper-level application.
Optimization of PolarFS: The scan request from the storage engine specifies the
location of the data to be scanned in the form of file offsets, and the data
may be distributed across multiple CSDs. However, each CSD can only perform
a scan task on its own data, locates data by using LBAs, and scans data block
by block. Therefore, PolarFS needs to break the scan request into multiple
subscan requests based on the number of CSDs the request spans and translate
the data address in each subrequest into an LBA that the CSD can recognize
(a sketch of this decomposition follows this list).
Optimization of CSDs: CSDs are managed in a centralized manner by the kernel
driver of the storage node host where they reside. A CSD first analyzes the query
conditions of each scan request sent by PolarFS and reorders the query condi-
tions if necessary to better utilize the hardware pipeline to improve the through-
put. Then, the LBA in the request is translated into corresponding physical
address (PBA) associated with the NAND flash memory.
To further improve the throughput, a CSD internally divides each scan request
into multiple smaller scan subtasks. This is to prevent large scan tasks from taking
up too much flash bandwidth, so that normal I/O requests do not need to wait for a
long time. This further reduces hardware resource usage for internal buffering and
takes full advantage of the parallel access nature of the NAND flash array. In addi-
tion, NAND flash devices perform GC when too much expired data exist. This can
severely disrupt the execution of scan tasks. Therefore, CSDs are optimized to mini-
mize the disruption caused by GC. If heavy workloads exist, a CSD adaptively
reduces or even suspends GC operations.
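
The decomposition performed by PolarFS in the list above can be sketched as follows, assuming an illustrative extent map that records which CSD stores each contiguous file range and at which starting LBA; offsets are assumed to be block-aligned for simplicity.

BLOCK_SIZE = 64 * 1024

def split_scan(file_offset, length, extent_map):
    # extent_map: list of (file_start, file_end, csd_id, lba_start) extents.
    subrequests = []
    end = file_offset + length
    for f_start, f_end, csd_id, lba_start in extent_map:
        lo, hi = max(file_offset, f_start), min(end, f_end)
        if lo < hi:                                   # this extent overlaps the scan range
            lba = lba_start + (lo - f_start) // BLOCK_SIZE
            subrequests.append({"csd": csd_id, "lba": lba, "bytes": hi - lo})
    return subrequests

extents = [(0, 1 << 20, "csd-0", 100), (1 << 20, 2 << 20, "csd-1", 500)]
print(split_scan(file_offset=0, length=2 << 20, extent_map=extents))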

Hardware Optimization

FPGA-friendly data block format: A scan task requires a large number of compari-
son operations (e.g., =, ≥, and ≤). A comparator that supports many different
data types is difficult to implement solely on an FPGA. Therefore,
for most data types, the data storage format of the storage engine of PolarDB
must be modified, so that data can be compared directly in memory. This way, a
CSD needs to implement only a comparator that can execute the memcmp()
function without considering the data types in different fields of a table. This
greatly reduces the resource usage of an FPGA.
The storage engine of PolarDB is designed based on the LSM-tree structure. In
this structure, data in each data block is stored in ascending order of key val-
ues. Therefore, prefix compression can be implemented for the key values by
leveraging the characteristic that adjacent key values in an ordered data array
may have the same prefix. For example, assuming that the key value of the first
data record in a data block is “abcde” and the key value of the second data
record is “abcdf,” the common prefix of the key values of these two records is
“abcd.” When the key value of the second data record is stored, only the length
of the common prefix 4 and “f” need to be stored. This method of compressing
key values based on their common prefix is called prefix compression. Prefix
compression greatly reduces the required storage space but hinders search effi-
ciency to some extent. Therefore, a record is left uncompressed every k
keys. This record is called a restart point. Prefix compression is implemented
for the records following this restart point. This way, during a record search, the last
restart point whose key value is less than the search keyword can be found
through binary search, and the lookup then proceeds forward starting
from that restart point (a short sketch of this scheme is given after this item).
To further improve hardware utilization, as shown in Fig. 5.26, the compression
type (Type), number of key-value pairs (# of keys), and number of restart points
(# of restarts) are added to the header of a data block, so that a CSD can decom-
press each data block and perform cyclic redundancy checks (CRC) by itself
without the need for the storage engine to pass the size information of each
block. In addition, the Type and # of keys fields facilitate data search when prefix
compression is implemented. These fields also facilitate easy identification of the
header and trailer of each block, thereby simplifying FPGA-based program
implementation.
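
A minimal sketch of prefix compression with restart points, as described in this item, is shown below; the encoding is illustrative and does not reproduce the storage engine's actual block format.

def compress_block(keys, k=4):
    # Every k-th key is stored in full (a restart point); other keys store only
    # the length of the prefix shared with the previous key plus the suffix.
    entries = []
    prev = ""
    for i, key in enumerate(keys):
        if i % k == 0:
            entries.append((0, key))               # restart point: full key
        else:
            shared = 0
            while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
                shared += 1
            entries.append((shared, key[shared:]))
        prev = key
    return entries

def decompress_block(entries):
    keys, prev = [], ""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

block = ["abcde", "abcdf", "abd", "abda", "abdb"]   # keys in ascending order
assert decompress_block(compress_block(block)) == block
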
FPGA implementation: To reduce costs and improve performance, mid-range FPGA
chips are used for flash memory control and scan task execution. In addition, a
parallel pipeline architecture is used to increase the throughput of scan process-
ing. As shown in Fig. 5.27, each FPGA contains two parallel data decompression
engines and four scan engines. Each scan engine contains a memory comparison
(memcmp) module and a result evaluation (RE) module. Let P = Σ_{i=1}^{m} ( Π_{j=1}^{n_i} c_{i,j} )
denote the entire scan task, where c_{i,j} denotes a query condition on a field of a
table, and Σ and Π, respectively, represent logical OR and logical AND. The
memcmp and RE modules are used to recursively evaluate each condition c_{i,j} in
the predicate. Specifically, the memcmp module compares data in memory.
When the RE module detects that the final result P (0 or 1) can be determined
based on the current output (all conditions c_{i,j} that have been evaluated so far) of
the memcmp module, the RE module stops the scan of the current row and pro-
ceeds to scan the next row. The query conditions that can be implemented by an
FPGA in this architecture are =, !=, >, ≥, <, ≤, NULL, and !NULL.
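
The short-circuit behavior of the RE module can be sketched as follows for a predicate of the form P = Σ_i Π_j c_{i,j}; the row and condition representations are illustrative assumptions.

def evaluate_row(row, predicate):
    # predicate: list of AND groups, each a list of (column, condition) pairs.
    # As soon as one AND group is fully satisfied, the row matches (P = 1);
    # if every group contains a failed condition, the row is rejected (P = 0).
    for and_group in predicate:
        if all(cond(row[col]) for col, cond in and_group):
            return 1                      # stop early: the result is already determined
    return 0

row = {"a": 5, "b": "x"}
predicate = [
    [("a", lambda v: v > 10)],                               # this group fails
    [("a", lambda v: v >= 5), ("b", lambda v: v == "x")],    # this group matches -> P = 1
]
print(evaluate_row(row, predicate))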

Fig. 5.26 Improvement of the data block structure: (a) original structure; (b) improved structure

Fig. 5.27 FPGA implementation in a parallel pipeline structure

References

1. Lamport L. Paxos made simple. ACM SIGACT News. 2001;32(4):18–25.
2. Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. In: 2014 USENIX
Annual Technical Conference (USENIX ATC 14); 2014. p. 305–19.
3. Cao W, Liu Z, Wang P, et al. PolarFS: an ultra-low latency and failure resilient distributed file
system for shared storage cloud database. Proc VLDB Endow. 2018;11(12):1849–62.
4. Verbitski A, Gupta A, Saha D, et al. Amazon aurora: design considerations for high throughput
cloud-native relational databases. In: Proceedings of the 2017 ACM International Conference
on Management of Data; 2017. p. 1041–52.
5. Cao W, Liu Y, Cheng Z, et al. POLARDB meets computational storage: efficiently support
analytical workloads in cloud-native relational database. In: 18th USENIX Conference on File
and Storage Technologies (FAST 20); 2020. p. 29–41.
Chapter 6
Database Cache

As a crucial part of a database, the buffer pool retains some data in the memory to
reduce data exchanges between the memory and external storage, thereby improv-
ing data access performance of the database. This chapter outlines the significance
of buffer pools to databases, depicts the challenges of database cache management
in the cloud era and provides corresponding solutions, and finally shares the practi-
cal application of PolarDB in buffer pool management and dives into the implemen-
tation of RDMA-based shared memory.

6.1 Introduction to the Database Cache

6.1.1 Role of the Database Cache

In most cases, a database system cannot directly operate on data in a disk. Therefore,
frequently used data needs to be stored in the cache to reduce pauses caused by
reading data from the disk and ensure fast data access.
In a database, the buffer pool of the storage system and the redo log buffer of the
logging system are the two main users of the caching mechanism. The buffer pool
is an internal memory area allocated within the database and is used to store pages
read from the disk, so that the pages can be accessed directly in the memory. This
improves system performance because the access speed of the memory is much
higher than that of the disk. The redo log buffer is used to store redo log entries,
which are periodically flushed to the log files on disk.


6.1.2 Buffer Pool

The buffer pool caches data and indexes. To efficiently utilize the memory, it is
divided into pages, each of which can accommodate multiple rows [1]. When a
data block is cached from a medium, such as a hard drive, into the buffer pool, the
pointers in the data block can be converted from the hard drive address space to
the buffer pool address space. This scheme is known as pointer swizzling. Due to
the limited size of the buffer pool, data pages in the buffer pool need to be periodi-
cally replaced. Basic page replacement algorithms include CLOCK [2] and
LRU. CLOCK and LRU are similar in concept, and both evict pages that have not
been accessed recently and retain recently accessed pages. LRU-K is an improve-
ment on LRU and can prevent the buffer pool from being polluted by sequential
access [3].
After a page is swapped out, whether the page needs to be written back to the
disk must be considered. If the page has been modified, it needs to be written back
to the disk. If the page has not been modified, it is directly discarded.
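
A minimal sketch of LRU replacement with a dirty check on eviction follows; the page and disk representations are illustrative and do not correspond to a specific engine.

from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity, self.disk = capacity, disk
        self.pages = OrderedDict()          # page_id -> {"data": ..., "dirty": bool}

    def get(self, page_id):
        if page_id not in self.pages:
            if len(self.pages) >= self.capacity:
                victim_id, victim = self.pages.popitem(last=False)   # least recently used
                if victim["dirty"]:
                    self.disk[victim_id] = victim["data"]            # write back before eviction
            self.pages[page_id] = {"data": self.disk.get(page_id), "dirty": False}
        self.pages.move_to_end(page_id)     # mark as most recently used
        return self.pages[page_id]

disk = {1: "p1", 2: "p2", 3: "p3"}
bp = BufferPool(capacity=2, disk=disk)
bp.get(1); bp.get(2)
bp.get(1)["dirty"] = True                   # modify page 1
bp.get(3)                                   # evicts clean page 2, keeps dirty page 1
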
Pages pointed to by swizzled pointers may exist in the buffer pool. These pages
are called pinned pages and cannot be safely written back to the disk [4]. To write
out these pages, they need to be unswizzled and then evicted. To be specific,
addresses pointing to the buffer pool are changed to addresses pointing to the disk.

6.2 Cache Recovery

6.2.1 Challenges of Caching in the Cloud Environment

In a traditional database, a cache is created when a database process starts and is
destroyed when the database process exits. If the database process is restarted, the
cache needs to be reinitialized and preheated. For a cache that stores hundreds of
GB of data, the preheating process is very slow and may take several minutes or
even dozens of minutes. Optimal database performance cannot be achieved during
the period from initialization to preheating due to a low cache hit rate, resulting in
compromised user business performance.
Nonetheless, this issue has not become a bottleneck in a traditional database
primarily because the R&D and release periods of a traditional database are usually
long and the database is seldom restarted unless due to business needs.
However, the R&D and release periods of cloud databases are relatively short,
and a new version may be released every month or even every 2 weeks. Bug fixes or
new features in new versions propel users to quickly upgrade their databases.
Moreover, cloud database instances are restarted based on user business needs.
Therefore, cloud databases are restarted more frequently than traditional databases.
User businesses in the cloud change rapidly. At the early stage of business develop-
ment, database instances with small specifications may be sufficient to meet the
business requirements. The database instances may need to be upgraded to larger
specifications as the business grows. Specification upgrades often necessitate data-
base restarts.
As a general-purpose basic software service, a cloud database often has a large
number of instances. How to smoothly upgrade databases and minimize the impact
on user businesses during restarts is an issue that cloud database service providers
need to resolve.
The lifecycles of the cache (memory) and the process (CPU) are closely coupled.
As a result, the cache must be restarted when the database process is restarted. On
the one hand, if their lifecycles are decoupled, the cache persists when the database
is restarted. The data in the cache is still hot, so the cache does not need to be pre-
heated after the database is restarted. On the other hand, the CPU-memory-separated
architecture significantly accelerates database recovery in the case of database
crashes, greatly reducing the fault recovery time. In general, the CPU-memory-­
separated design can greatly mitigate the impact of restarts on user businesses.

6.2.2 Cache Recovery Based on CPU-Memory Separation

Multiple technical solutions are available for implementing CPU and memory sepa-
ration (e.g., shared memory-based separation, NVM (nonvolatile memory)-based
separation, and RDMA-based separation).
The key to CPU and memory separation is to adapt the restart mechanism of the
database to the memory after the separation. In a traditional database, memory data
is lost after a restart, and the memory needs to be reinitialized. After the memory is
separated from the CPU, it has persistence capabilities. How to adapt to this archi-
tecture and optimize the database based on this architecture is the core problem that
needs to be solved after CPU-memory separation in databases. These solutions vary
in implementation difficulties, complexity, and benefits.

6.2.2.1 Shared Memory-Based Separation

Shared memory is a capability provided by traditional operating systems, as shown in
Fig. 6.1. When the CPU exits, the memory can be decoupled from the CPU and
managed by the operating system or other processes. After the CPU is restarted, the
memory can be coupled with the CPU again, and the data in the memory will not
be lost.
If the memory data still exists after a restart, the database cache does not need
to be preheated again. In the case of abnormal restarts, although the dirty data
pages in the database cache are not written to persistent storage, the data pages in
memory still exist after the restart. This way, the data pages do not need to be
reloaded from storage, thereby accelerating database startup and mitigating the impact on user
businesses.

Fig. 6.1 Shared memory-based separation

Fig. 6.2 RDMA-based separation

6.2.2.2 NVM-Based Separation

NVM is a new type of hardware device with an access speed similar to that of ordinary
memory. However, unlike ordinary memory, NVM provides the feature of retaining
data after a power failure. For example, Optane DC, an NVM product provided by
Intel, has a read latency of approximately two to three times that of ordinary mem-
ory and a write latency that is approximately the same as that of the latter [5].
The shared memory technology relies on the capabilities of the operating system.
If the operating system needs to be restarted when the host is restarted, the shared
memory will be destroyed. However, NVM can provide persistence capabilities
after the host is restarted. Therefore, a higher level of memory separation can be
achieved by using NVM.

6.2.2.3 RDMA-Based Separation

RDMA [6] is a technology that allows direct access to the memory of a remote host
without affecting the operation of the CPU of the host, as shown in Fig. 6.2. RDMA
(remote direct memory access) transmits data directly between application memory
and the network by using a network adapter without the need to copy data between
the data cache and application memory of the operating system. Common RDMA
implementations include Virtual Interface Architecture (VIA), RDMA over
Converged Ethernet (RoCE), InfiniBand, Omni-Path, and iWARP [7].
Shared memory-based separation and NVM-based separation can only be imple-
mented within a single host. If the host is faulty and cannot be started, the system
needs to perform complete initialization on a new host. The popularity of the RDMA
technology makes it possible to achieve memory and CPU separation across hosts
in a cloud database.

6.3 PolarDB Practices

6.3.1 Optimization of the Buffer Pool

6.3.1.1 Buffer Pool of the InnoDB Engine

The buffer pool is a crucial module in the InnoDB engine. All data interactions,
including various CRUD operations, generated by user requests are implemented
based on the buffer pool. Upon startup, the InnoDB engine allocates a contiguous
memory area to the buffer pool. For better management, the memory area is typi-
cally divided into multiple buffer pool instances. All instances are equal in size, and
an algorithm is used to ensure that each page is managed by exactly one instance. This
division improves the concurrency performance of the buffer pool when multiple
instances are configured.
The InnoDB engine initializes the buffer pool instances in parallel based on the
setting of the srv_buf_pool_instances parameter at startup and allocates a continu-
ous memory area to these instances. This continuous memory area is divided into
multiple chunks, with each chunk sized 128 MB by default. Each buffer pool
instance contains locks to ensure the reliability of concurrent access, a buffer chunk
to store physical storage block arrays, various page lists (such as the free list, LRU
list, and flush list), and mutual exclusion (mutex) locks to ensure mutual exclusion
during access to these page lists. The instances are independent of each other, and
each supports concurrent access from multiple threads.
During the initialization of each buffer pool instance, three lists, namely, the free
list, LRU list, and flush list, and a critical page hash table are also initialized. The
specific functionalities of these data structures are described as follows:

Free List

The free list stores unused idle pages. When the InnoDB engine needs a page, it
retrieves the page from the free list. If the free list is empty (i.e., no idle pages exist),
the InnoDB engine reclaims pages by evicting old pages and flushing dirty pages
from the LRU list and flush list. During initialization, the InnoDB engine adds all
pages in the buffer chunks to the free list.

LRU List

All data pages read from data files are cached in the LRU list and managed by using
the LRU strategy. The LRU list is divided into two parts: a young sublist and an old
sublist. The young sublist stores hot data, whereas the old sublist stores data recently
read from a data file. If the LRU list contains less than 512 pages, it will not be split
into a young sublist and an old sublist. When the InnoDB engine attempts to read a
data page, it looks up the data page in the hash table of the buffer pool instance and
performs subsequent handling according to the actual case.
• When the data page is found in the hash table, namely, in the LRU list, it deter-
mines whether the data page is in the old sublist or the young sublist. If the data
page is in the old sublist, it adds the data page to the head of the young sublist
after reading the data page.
• If the data page is found in the hash table and is located in the young sublist, the
position of the data page in the young sublist needs to be determined. Only when
the data page is at about one-fourth of the total length of the young sublist can it
be added to the head of the young sublist.
• If the data page is not in the hash table, it needs to be read from a data file and
added to the head of the old sublist.
The LRU list manages data pages by using an ingenious LRU-based eviction
strategy to avoid frequent adjustment of the LRU list, thereby improving access
efficiency.
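The following C++ sketch illustrates this midpoint-insertion policy. The BufPage structure, the fixed one-quarter threshold, and the list bookkeeping are simplifications for illustration only and do not correspond to InnoDB's actual data structures.

#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <unordered_map>

// Minimal sketch of the midpoint-insertion LRU described above.
struct BufPage {
  uint64_t page_id;
  bool in_old;  // true if the page currently sits in the old sublist
};

class MidpointLru {
 public:
  // Hash-table hit: decide whether the page should move to the young head.
  void touch(uint64_t page_id) {
    auto it = index_.at(page_id);  // assumes the page is present
    if (it->in_old || position(it) > young_size_ / 4) {
      promote(it);
    }
  }

  // Hash-table miss: a page read from a data file enters the head of the old
  // sublist rather than the head of the whole list.
  void insert_new(uint64_t page_id) {
    auto it = list_.insert(old_head(), BufPage{page_id, true});
    index_[page_id] = it;
  }

 private:
  std::list<BufPage> list_;  // young sublist first, old sublist after it
  std::unordered_map<uint64_t, std::list<BufPage>::iterator> index_;
  std::size_t young_size_ = 0;

  std::list<BufPage>::iterator old_head() {
    auto it = list_.begin();
    std::advance(it, young_size_);
    return it;
  }
  std::size_t position(std::list<BufPage>::iterator it) {
    return static_cast<std::size_t>(std::distance(list_.begin(), it));
  }
  void promote(std::list<BufPage>::iterator it) {
    if (it->in_old) {
      it->in_old = false;
      ++young_size_;
    }
    list_.splice(list_.begin(), list_, it);  // move to the head of the young sublist
  }
};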

Flush List

All modified dirty pages that have not been written to the disk are saved in this list.
Note that all data in the flush list is also in the LRU list, but not all data in the LRU
list is in the flush list. Each data page in the flush list contains the LSN of the earliest
modification to the page, which is equal to the value of the oldest_modification field
in the buf_page_t data structure. An LSN is an integer of the unsigned long long
data type that continuously increases. LSNs are ubiquitously used in the logging
system of the InnoDB engine. For example, an LSN is used when a dirty page is
modified; checkpoints are also recorded by using LSNs. Specific locations of log
entries in redo log files can be determined by using LSNs. A page may be modified
multiple times, but only its earliest modification is recorded. Pages in the flush list
are sorted in descending order of their oldest_modification values, and the page with
the smallest oldest_modification value is saved at the end of the list. When pages
need to be reclaimed from the flush list, the reclamation process starts from the end
of the list and reclaimed pages are put back to the free list. The flush list is cleaned
by using a dedicated back-end page_cleaner thread that writes dirty pages to the
disk for data persistence. The corresponding redo log entries for the dirty pages are
also cleaned to advance the checkpoint.

Page Hash Table

All pages in the buffer pool are stored in this hash table. When a page is read, the
page can be directly located in the LRU list by using the page hash table without the
need to scan the entire LRU list, thereby greatly improving page access efficiency.
If the data page is not in the hash table, it needs to be read from the disk.

6.3.1.2 Processing of Read and Write Requests in the Buffer Pool of the InnoDB Engine

When a user initiates a CRUD operation at the client, the InnoDB engine translates
it into a page access. Queries correspond to read requests, while the insert, delete,
and update operations correspond to write requests. Read and write requests need to
be processed by using the buffer pool. The following section discusses the read and
write access processes in the buffer pool.

Process of Handling a Read Request

Step 1: Determine the buffer pool instance in which a page is located based on the
space ID and page number. In the InnoDB engine, each table with unique logical
semantics is mapped to an independent tablespace that has a unique space
ID. Starting from MySQL 8.0, all system tables use InnoDB as the default
engine. Therefore, each system table and undo tablespace have respective unique
space IDs.
Step 2: Read the page from the hash table. If the page is found, jump to Step 5. If it
is not found, proceed to Step 3.
Step 3: Read the corresponding page from the disk.
Step 4: Get a free page from the free list and fill it with the data read from the disk.
Step 5: If the page is already in the buffer pool, adjust its position in the LRU list
based on the LRU strategy. If it is a new page, add it to the old sublist of the
LRU list.
Step 6: Return the page to the user thread.
Step 7: Return the result to the client.
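A simplified C++ sketch of this read path is shown below. The types and helper functions (BufferPoolInstance, read_from_disk, make_key, and so on) are hypothetical stand-ins rather than InnoDB's real interfaces; the sketch only mirrors the control flow of Steps 1–6.

#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Illustrative sketch of the read path in Steps 1-6 (Step 7, returning the
// result to the client, happens upstream).
struct Page {
  uint64_t newest_modification = 0;  // plus 16 KB of page data in a real engine
};

using PageKey = uint64_t;  // (space_id, page_no) packed into one key

inline PageKey make_key(uint32_t space_id, uint32_t page_no) {
  return (static_cast<uint64_t>(space_id) << 32) | page_no;
}

struct BufferPoolInstance {
  std::unordered_map<PageKey, Page> page_hash;  // page hash table (simplified)

  Page *lookup(PageKey key) {
    auto it = page_hash.find(key);
    return it == page_hash.end() ? nullptr : &it->second;
  }
};

// Step 3 stand-in: a real engine reads into a page taken from the free list
// (or reclaimed from the LRU/flush lists when the free list is empty).
Page read_from_disk(PageKey) { return Page{}; }

Page *get_page_for_read(uint32_t space_id, uint32_t page_no,
                        BufferPoolInstance *pools, std::size_t n_pools) {
  PageKey key = make_key(space_id, page_no);
  // Step 1: the instance is chosen by hashing the (space_id, page_no) pair.
  BufferPoolInstance &pool = pools[key % n_pools];
  if (Page *page = pool.lookup(key)) {
    // Step 2 (hit) + Step 5: the LRU position would be adjusted here.
    return page;  // Step 6: hand the page to the user thread
  }
  // Steps 3-4: read the page from disk and place it in the buffer pool; Step 5:
  // the new page would be added to the head of the old sublist of the LRU list.
  Page &page = pool.page_hash.emplace(key, read_from_disk(key)).first->second;
  return &page;  // Step 6
}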

Process of Handling a Write Request

Step 1: Determine the buffer pool instance in which a page is located based on the
space ID and page number.
Step 2: Read the page from the page hash table. If the page is found, jump to Step
5. If it is not found, proceed to Step 3.
Step 3: Read the corresponding page from the disk.
Step 4: Get a free page from the free list and fill it with the data read from the disk.
Step 5: If the page is already in the buffer pool, adjust its position in the LRU list
based on the LRU strategy. If it is a new page, add it to the old sublist of the
LRU list.
Step 6: Return the page to the user thread.
Step 7: The user thread modifies the page and adjusts the flush list. If the page is
already in the flush list, only its newest_modification field needs to be updated. If
it is a newly dirtied page, it is directly added to the head of the flush list.
Step 8: Return the result to the client.
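The main difference from the read path lies in Step 7. The following sketch shows, under the same simplifying assumptions as above, how a page is marked dirty and how the flush list is maintained; the DirtyPage struct and the LSN plumbing are illustrative placeholders.

#include <cstdint>
#include <deque>

// Sketch of Step 7: marking a page dirty and maintaining the flush list.
struct DirtyPage {
  uint64_t page_id;
  uint64_t oldest_modification;  // LSN of the first unflushed change
  uint64_t newest_modification;  // LSN of the latest change
};

// New dirty pages carry the current LSN, which only grows, so pushing them at
// the head keeps the smallest oldest_modification at the tail of the flush
// list, which is where page reclamation and flushing start.
void mark_dirty(std::deque<DirtyPage> &flush_list, DirtyPage *already_dirty,
                uint64_t page_id, uint64_t current_lsn) {
  if (already_dirty != nullptr) {
    // Already in the flush list: only newest_modification advances; the
    // oldest_modification value (and hence the list position) is unchanged.
    already_dirty->newest_modification = current_lsn;
  } else {
    flush_list.push_front(DirtyPage{page_id, current_lsn, current_lsn});
  }
}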

6.3.1.3 Optimization of PolarDB

PolarDB adopts a one writer, multireader architecture. The primary node, also
known as the read-write node, is responsible for handling read and write requests,
generating redo log entries, and persisting data pages. The data generated is stored
on the shared storage PolarFS. Replica nodes, also known as read-only nodes, are
only responsible for handling read requests. A read-only node replays the redo log
on the shared storage PolarFS to update the pages in its buffer pool to the latest ver-
sion. This ensures that subsequent read requests can get the latest data in a
timely manner.
Compared with InnoDB, the shared storage architecture supports fast scale-out
to cope with heavier read request load without adding disks, as well as quick addi-
tion and deletion of read-only nodes. HA switchover can also be implemented
between the read-write node and the read-only nodes. This greatly improves instance
availability. Therefore, the shared storage architecture naturally fits the cloud-native
architecture.
In the InnoDB engine architecture, data persistence is achieved by the page
cleaner thread by periodically flushing dirty pages to the disk. This avoids the per-
formance impact caused by synchronously flushing dirty pages by user threads. In
the PolarDB architecture, when the read-write node flushes a data page to the disk,
it must ensure that the newest modification LSN of the page does not exceed the
minimum LSN of the redo log applied to all read-only nodes. Otherwise, when a
read-only node reads the page from the shared storage, data consistency cannot be
guaranteed because the data version of the page has exceeded the data version
obtained by replaying the redo log. To ensure the continuity and consistency of disk
data and prevent users from retrieving data of a later version or data undergoing
SMOs from read-only nodes, the read-write node must consider the LSN of the redo
log applied to read-only nodes when it flushes dirty pages. The system defines the
minimum LSN of the redo log applied to all read-only nodes as the safe LSN. When
the read-write node flushes a page, it must ensure that the newest modification LSN
of the page (newest_modification) is less than the safe LSN. However, in some
cases, the safe LSN may fail to advance normally. As a result, dirty pages on the
read-write node cannot be flushed to the disk in a timely manner, and the oldest
flush LSN (oldest_flush_lsn) cannot advance. To improve the efficiency of physical
replication, a runtime apply mechanism is introduced to read-only nodes. With the
runtime apply mechanism, the redo log is not applied to a page that is not in the
buffer pool. This prevents the redo log application thread on a read-only node from
frequently reading pages from the shared storage. However, read-only nodes need
to cache the parsed redo log entries in the parse buffer. This way, when a read
request is received from a user, the page can be read from the shared storage, and all
redo log entries recording modifications to this page are applied during runtime, so
that the latest version of the page is returned.
The redo log entries cached in the parse buffer of the read-only node can be
cleared only after the oldest_flush_lsn value of the read-write node has advanced. In
other words, the redo log entries can be discarded after the dirty pages
corresponding to the modifications recorded in the redo log entries are flushed to the
disk. With this constraint, if a hotspot page is frequently updated (i.e., newest_modi-
fication is constantly updated) or the read-write node flushes dirty pages at a slow
speed, a large number of parsed redo log entries may accumulate in the parse buffers
of the read-only nodes, slowing down the speed of applying redo log entries and the
advancement of the LSNs of the redo log entries. In addition, dirty page flushing by
the read-write node is further constrained by the safe LSN, which ultimately affects
the write operations of user threads. If the redo log application speed of the read
nodes is too slow, the difference between the redo log application LSN and the new-
est LSN of the read-write node progressively increases, eventually leading to a con-
tinuous increase in the replication latency.
To solve the various problems that arise due to the preceding constraints, the buf-
fer pool of the read-write node in PolarDB has been optimized as follows:
• To enable the read-write node to flush generated dirty pages to the disk in a
timely manner and reduce the redo log entries cached in the parse buffer of a
read-only node, the read-only node synchronizes the LSNs of applied log entries
to the read-write node in real time. If the difference between the write_lsn value
of the read-write node and safe LSN of the write node exceeds a specified thresh-
old, the system increases the frequency of flushing dirty pages of the read-write
node and actively advances the oldest_flush_lsn value. In addition, the read-only
nodes can release the redo log entries cached in their parse buffers to reduce the
amount of redo log information that needs to be applied during runtime, thereby
improving the performance of the read-only nodes.
• When a page is frequently updated, the newest modification LSN continuously
increases and is always greater than the safe LSN, which fails to meet the flush-
ing condition. Consequently, the log entries of the read nodes accumulate in the
log cache, leaving no free space to receive new log entries. To solve this problem,
the system introduces the copy page mechanism. When a data page cannot be
written to the disk in a timely manner because it does not meet the flushing con-
ditions, the system generates a copy page for the data page. The information in
the copy page is a snapshot of the data page. The copy page stores fixed data that
will no longer be modified, and the oldest modification LSN of the original data
page is updated to the newest modification LSN of the copy page. The newest
modification LSN of the copy page no longer changes. When this LSN is smaller
than the safe LSN, the data of the copy page can be written to the disk, allowing
the read-write node to advance oldest_flush_lsn and consequently release the log
caches of the read-only nodes (a simplified sketch of this flushing decision is
given after this list).
• Several data pages, such as pages in the system tablespace and rollback segment
header pages in the rollback segment tablespace, are frequently accessed. To
improve execution efficiency and performance, these frequently accessed pages are
read into memory after an instance is started and are never swapped out. In other
words, these pages are pinned and retained in the buffer pool, hence called “pinned
pages.” This prevents the log application efficiency of read-only nodes from being
affected by frequent swap-in and swap-out operations; the data pages will not be
swapped out by read-only nodes. When these data pages are needed again, they are
already in memory and do not need to be read from the disk again. This way, when
the read-write node writes out these pages during page flushing, these pages can be
smoothly flushed to the disk without being subject to the constraint that the newest
modification LSN (newest_modification) of a page cannot be greater than the log
application LSN (min_replica_applied_lsn) of read-only nodes. This avoids long
waiting time for user requests, which may be caused if page flushing operations are
triggered to release free pages when the user thread cannot obtain free pages.
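A simplified sketch of the resulting flush decision on the read-write node is shown below. The structures, the FlushAction enumeration, and the decision function are illustrative only; they merely summarize the safe-LSN constraint, the copy page fallback, and the exemption for pinned pages described above, and are not PolarDB's actual code.

#include <cstdint>
#include <optional>

struct PageState {
  uint64_t newest_modification;             // LSN of the latest change
  std::optional<uint64_t> copy_newest_lsn;  // frozen LSN of the copy page, if any
};

enum class FlushAction { FlushOriginal, FlushCopy, MakeCopy, Skip };

// safe_lsn is the minimum redo-apply LSN across all read-only nodes.
FlushAction decide_flush(const PageState &page, uint64_t safe_lsn, bool pinned) {
  // Pinned pages are never swapped out by read-only nodes, so they can be
  // flushed without the safe-LSN check.
  if (pinned) return FlushAction::FlushOriginal;
  if (page.newest_modification < safe_lsn) return FlushAction::FlushOriginal;
  if (page.copy_newest_lsn && *page.copy_newest_lsn < safe_lsn)
    return FlushAction::FlushCopy;  // the frozen snapshot is now safe to write
  if (!page.copy_newest_lsn)
    return FlushAction::MakeCopy;   // hot page: freeze a snapshot for later
  return FlushAction::Skip;         // wait for the safe LSN to advance
}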

6.3.2 Optimization of the Data Dictionary Cache and the File System Cache

6.3.2.1 Data Dictionary Cache in InnoDB

The data dictionary is a collection of information about database objects, such as
tables, views, and stored procedures. This information is also known as the metadata
of the database. For example, the data dictionary stores information that describes
the table schemas, including the columns and indexes of each table. The data dic-
tionary stores the table information in the INFORMATION_SCHEMA and
PERFORMANCE_SCHEMA tables in memory, which are dynamically populated
and updated by the InnoDB engine during operation but are not persisted to the stor-
age engine.
Starting from MySQL 8.0, the data dictionary no longer uses MyISAM as the default
storage engine but is directly stored in the InnoDB engine. The write and update opera-
tions of data dictionary tables conform to the ACID (atomicity, consistency, isolation,
and durability) properties. Each time the SHOW DATABASES or SHOW TABLES
statement is executed, the data dictionary is queried or, more precisely, corresponding
table information is obtained from the data dictionary cache (DD cache). However, the
SHOW CREATE TABLE statement does not access the DD cache but directly accesses
data in the INFORMATION_SCHEMA and PERFORMANCE_SCHEMA tables. As
a result, some tables are visible to the SHOW TABLES statement but invisible to the
SHOW CREATE TABLE statement. This usually occurs because outdated table
information is still retained in the DD cache.
When the system executes an SQL statement to access a table, MySQL first
attempts to open the table. In this process, tablespace information is retrieved from
the DD cache by table name. If the information is not found in the DD cache, the
system will attempt to read it from a data dictionary table. Generally, the data in the
DD cache and the data in the data dictionary table completely match the data in the
tablespace in the InnoDB engine. The execution of a DDL (data definition lan-
guage) operation usually triggers the clearing of the DD cache. This process must
hold an exclusive MDL on the entire tablespace. After the DDL operation is com-
pleted, the information in the data dictionary table will be modified. When the user
initiates a read request next time, the information will be read from the data diction-
ary table and cached in the DD cache.

6.3.2.2 File System Cache in InnoDB

When the innodb_file_per_table parameter is set to ON, each table corresponds to a
separate .ibd file. When the InnoDB engine starts, it scans all .ibd files in the datadir
directory; each .ibd file consists of several pages. The InnoDB engine first parses
Page 0 to Page 3 (which are index pages configured to facilitate convenient manage-
ment of partitions and pages) of each .ibd file to obtain the metadata of the file and
then creates a hash mapping between the file name and the tablespace. The hash
mapping is stored in the mdirs field of the Fil_system object. The mdirs field is used
to record the scanned paths and discovered files during startup. During crash recov-
ery, the InnoDB engine parses corresponding log entries, retrieves and opens the
corresponding .ibd files based on the mdirs field by using the space IDs recorded in
the log entries, reads the corresponding pages based on the page numbers, and
finally applies the corresponding redo log entries to restore the database to the
moment of the crash. During the operation of the InnoDB engine, the mappings
between the space_id and space_name values of all tablespaces and the correspond-
ing “.ibd” files are stored in the memory in the Fil_system object of InnoDB.
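The following minimal sketch illustrates the kind of in-memory mapping described above. The class and member names are invented for illustration and do not reflect the actual layout of the Fil_system object.

#include <cstdint>
#include <string>
#include <unordered_map>

class TablespaceMap {
 public:
  // Recorded for every .ibd file discovered under the data directory.
  void register_file(uint32_t space_id, const std::string &space_name,
                     const std::string &path) {
    path_by_id_[space_id] = path;
    id_by_name_[space_name] = space_id;
  }

  // Used during crash recovery: a redo record carries a space ID, which is
  // resolved to the file that must be opened before the record is applied.
  const std::string *path_for(uint32_t space_id) const {
    auto it = path_by_id_.find(space_id);
    return it == path_by_id_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<uint32_t, std::string> path_by_id_;
  std::unordered_map<std::string, uint32_t> id_by_name_;
};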

6.3.2.3 Optimization of PolarDB

In the current PolarDB architecture, DDL consistency also faces new challenges. In
InnoDB, a DDL operation needs to handle only the status of the target object. To be
specific, before a DDL operation is performed, an MDL on the corresponding table
must be obtained, and the cache must be cleared. However, in the shared storage
architecture, the system also needs to synchronize the MDL to the read-only nodes, so
that requests on the read-only node will not access table data that is being modified by
a DDL operation. After the MDL on a read-only node is acquired, the redo log entries
before the current MDL are applied, and the data dictionary cache of the read-only
node is cleared. During the execution of a DDL operation, the read-write node per-
forms file operations, but the read-only nodes only need to update their respective file
system caches because the read-write node has already completed the file operations
in the shared storage. After the DDL operation is executed, the read-write node records
a redo log entry about MDL release. When a read-only node obtains this log entry
through parsing, it releases the MDL on the table. At this time, the table information
in memory is updated, and the table can provide normal access services.

6.3.3 RDMA-Based Shared Memory Pool

6.3.3.1 Principles

CPU-memory separation in PolarDB is achieved based on an RDMA-based shared
memory pool architecture. In the traditional storage architecture shown in Fig. 6.3,
a node provides computing and memory capabilities, the buffer pool is located on
each standalone PolarDB instance, and the CPU is private to each PolarDB instance
and directly accesses the buffer pool by using a memory bus.

Fig. 6.3 Traditional storage architecture

Fig. 6.4 Shared memory pool storage architecture
As shown in Fig. 6.4, in the RDMA-based shared memory pool storage architec-
ture, a compute node contains only a small-sized local buffer pool (LBP) that serves
as the local cache, and a global buffer pool (GBP) in the remote cache cluster serves
as a remote cache of the compute node. The GBP (global buffer pool) contains all
pages of PolarDB. The compute node and GBP communicate with each other over
a high-speed interconnected network by using RDMA to read or write pages. In
addition, RDMA ensures low latency for remote access operations. The GBP and
compute node are separated. Therefore, multiple PolarDB instances (compute
nodes) can simultaneously connect to and share the GBP, thereby forming the one
writer, multireader architecture and multiwriter architecture.

6.3.3.2 Advantages

The shared memory architecture offers considerable advantages for PolarDB. First,
this architecture efficiently implements the separation of the compute and memory
nodes. This enables on-demand scaling for commercial applications of Alibaba
Cloud, achieving almost continuously available instance elasticity (i.e., scale-up
and scale-down). This architecture can also allocate CPU and memory resources
separately based on specific customer requirements to facilitate separate pay-as-
you-go billing for CPU and memory resources. Second, this architecture can be
shared by multiple PolarDB instances, laying a solid foundation for the one-writer,
multireader architecture and multiwriter architecture and improving the computing
capabilities of PolarDB instances. Third, this architecture frees PolarDB from the
constraints of limited memory of a standalone instance and efficiently utilizes the
large memory capacity of remote nodes. It also decouples the dirty page flushing
logic from the compute nodes, which improves the write performance to some
extent and improves the performance of individual instances. Lastly, this architec-
ture enables fast system recovery because the shared memory pool contains all
memory pages and the buffer pool is always hot after a restart.

6.3.3.3 Implementation

Data Structure

The shared memory consists of multiple GBP instances and the RDMA service
module that is responsible for network communication. Each GBP instance consists
of four parts: (1) an LRU list, which stores the metadata of recently used pages, such
as page IDs, LSNs, and memory addresses; (2) a free list, which stores free shared
memory pages; (3) a hash table, which is used to quickly locate memory pages; and
(4) a data area of the shared memory, which is organized and managed in the form
of pages.

Network Communication

The RDMA-based shared memory framework can handle two types of requests. The
first type does not involve the CPU of the memory node. In this type of request, the
remote address of the data page is already known, and read and write operations can
be directly performed on the data page. The CPU of the memory node can be bypassed
by using one-sided RDMA primitives. The second type is control requests, such as
memory page registration requests and invalidation requests. Such requests are han-
dled as follows: Network requests are received from compute nodes and handled by
using registered RPC (remote procedure call) functions. To ensure low latency, RPCs
of the second type are also handled based on RDMA. The entire network data flow is
handled by using a unified and efficient RDMA communication framework. The fol-
lowing examples are common network I/O operations on the shared memory:
Registration: Before RDMA-based data transmission is performed, memory regis-
tration needs to be performed. A remote key (r_key) and a local key (l_key) are
generated in each memory registration. The local key is used by the local host
channel adapter (HCA) to access the local memory. The remote key is provided
to the remote HCA to allow remote processes to access the local system memory
during RDMA operations. The compute node sends a registration request to a
shared memory node, and the shared memory node allocates a remote memory
page to the compute node. After receiving the registration request, the shared
memory node selects a memory page based on metadata, such as the page ID,
and returns the corresponding RDMA address and key to the compute node.
Reading: After the registration, the compute node already knows the remote address
of the corresponding page in the shared memory and can directly initiate a read
request by using an RDMA-based remote read operation.
Writing: The compute node first writes to a shared memory node by using an
RDMA-based remote write operation based on the known remote address of a
page. Then, the compute node sends the metadata of the relevant page to the
shared memory node. The shared memory node obtains and reads the metadata
of the page. Multiple writable nodes may exist in the cluster. When a data page
is modified by a node, other nodes need to be instructed to invalidate this data
page in their respective memory based on the address and key in invalid_bit.
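The following abstract sketch summarizes the three flows. RdmaEndpoint and its methods are hypothetical stand-ins for an RDMA verbs wrapper; they are not a real API, and the stub bodies only mark where one-sided operations and RPCs would be issued in a real system.

#include <cstddef>
#include <cstdint>

struct RemotePage {
  uint64_t remote_addr;  // address returned by the shared memory node
  uint32_t rkey;         // remote key granting access to that memory region
};

class RdmaEndpoint {
 public:
  // Registration: an RPC to the shared memory node, which selects a page based
  // on its metadata (page ID, ...) and returns the RDMA address and key.
  RemotePage register_page(uint64_t page_id) { return rpc_register(page_id); }

  // Reading: a one-sided RDMA read; the memory node's CPU is not involved.
  void read_page(const RemotePage &rp, void *local_buf, std::size_t len) {
    one_sided_read(rp.remote_addr, rp.rkey, local_buf, len);
  }

  // Writing: a one-sided RDMA write of the page data, followed by an RPC that
  // hands the page metadata to the memory node (so that, for example, other
  // nodes can be told to invalidate their copies of the page).
  void write_page(const RemotePage &rp, const void *local_buf, std::size_t len,
                  uint64_t page_id, uint64_t newest_lsn) {
    one_sided_write(rp.remote_addr, rp.rkey, local_buf, len);
    rpc_update_metadata(page_id, newest_lsn);
  }

 private:
  // A real implementation would post RDMA work requests or issue RDMA-based
  // RPCs here; they are left as no-op stubs in this sketch.
  RemotePage rpc_register(uint64_t) { return RemotePage{0, 0}; }
  void rpc_update_metadata(uint64_t, uint64_t) {}
  void one_sided_read(uint64_t, uint32_t, void *, std::size_t) {}
  void one_sided_write(uint64_t, uint32_t, const void *, std::size_t) {}
};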

Crash Recovery

If a compute node crashes, additional operations do not need to be performed on the
shared memory node because the GBP includes the data of all LBPs. If the shared
memory crashes, the database service, including the compute nodes, needs to be
restarted. The shared memory is recovered from crashes by using the ARIES algorithm,
which inherits the WAL (write-ahead logging) mechanism of PolarDB and uses
records such as the redo log, undo log, and checkpoints to achieve crash recovery.

6.3.3.4 Performance Tests

This section describes tests run on PolarDB instances with the scalable specification
of polar.mysql.x4.xlarge, which is configured with 8 CPU cores, 32 GB of memory,
and a 24-GB buffer pool. CPU isolation is implemented by using the taskset
command. The baseline PolarDB instance is configured with a 24-GB buffer pool.
The test PolarDB instances are of the GBP edition with LBP sizes, respectively, set to
1 GB, 3 GB, and 5 GB. To control the variables, the total size of the LBP and the GBP
is set to 24 GB for each instance. For example, if the size of the LBP is 1 GB, the size
of the GBP is set to 23 GB. The PolarDB (LBP) process and the cache cluster (GBP)
process communicate with each other over a 100 Gbit/s RDMA connection.
The PolarDB instances are tested against the oltp_read_only, oltp_read_write,
and oltp_write_only scenarios of Sysbench [8]. Each scenario is tested with 8, 16,
32, 64, 128, and 256 concurrent threads. The dataset consists of 250 tables that yield
a 17-GB basic dataset, and each table contains 30 million rows of data.

oltp_read_only

As shown in Fig. 6.5, in the oltp_read_only scenario, the throughputs of the test
PolarDB instances are basically consistent with that of the baseline PolarDB
instance, with a difference of less than 2%.

oltp_read_write

As shown in Fig. 6.6, in the oltp_read_write scenario, the throughput of the test
PolarDB instance whose LBP size is 5 GB is about 10% higher than that of the base-
line PolarDB instance. However, the throughput of the test PolarDB instance whose
LBP size is 1 GB is basically the same as that of the baseline PolarDB instance.
The results indicate that the PolarDB cluster of the GBP edition outperforms the
baseline PolarDB instance. In the PolarDB cluster of the GBP edition, the dirty page
flushing feature is decoupled from the compute nodes. By offloading this core fea-
ture of the InnoDB engine, write performance is improved to some extent. Data can
be flushed from the GBP to PolarFS by any available CPU.

Fig. 6.5 oltp_read_only

Fig. 6.6 oltp_read_write

Fig. 6.7 oltp_write_only

oltp_write_only

As shown in Fig. 6.7, in the oltp_write_only scenario, the throughput of the test
PolarDB instance whose LBP size is 5 GB is basically the same as that of the base-
line PolarDB instance. However, the throughputs of the test PolarDB instances
whose LBP sizes are 3 GB and 1 GB are lower than that of the baseline PolarDB
instance; the throughput of the test PolarDB instance whose LBP size is 1 GB is
only 78% of that of the baseline PolarDB instance.

References

1. MySQL 8.0 Reference Manual: 15.5.1 Buffer pool. https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html. Accessed 17 Feb 2021.
2. Corbato FJ. A paging experiment with the multics system. Cambridge, MA: Massachusetts Inst
of Tech Cambridge Project Mac; 1968.
3. O’Neil EJ, O’Neil PE, Weikum G. The LRU-K page replacement algorithm for database disk
buffering. ACM SIGMOD Rec. 1993;22(2):297–306.
4. Garcia-Molina H, Ullman D, Widom J. Database systems: the complete book (trans: Yang D,
Wu Y, et al.). 2nd ed. Beijing: China Machine Press; 2010.
5. Yang J, Kim J, Hoseinzadeh M, et al. An empirical guide to the behavior and use of scal-
able persistent memory. In: 18th USENIX Conference on File and Storage Technologies
(FAST’20); 2020. p. 169–82.
6. What is RDMA? 2021. https://community.mellanox.com/s/article/. Accessed 17 Feb 2021.
7. Remote direct memory access. 2021. https://en.wikipedia.org/wiki/Remote_direct_memory_access. Accessed 17 Feb 2021.
8. Kopytov A. Sysbench manual. In: MySQL AB; 2012. p. 2–3.
Chapter 7
Computing Engine

The database computing engine, also known as the database query engine, is respon-
sible for processing database queries. Query processing is one of the most crucial
features of a database and consists of query execution and query optimization. This
chapter discusses the database query processing process, introduces the three mod-
els for database query execution and the main methods for database query optimiza-
tion, and explores the practical application of the PolarDB query engine.

7.1 Overview of Query Processing

This section briefly describes the query processing process of traditional relational
database management systems and the implementation of parallel query processing.

7.1.1 Overview of Database Query Processing

Query processing is the process of executing a query statement in a database man-
agement system [1]. Query processing translates query statements into efficient
query execution plans through the following basic steps: query parsing, query vali-
dation, query optimization, and query execution.
Query processing first translates a query statement into an internal representation in
the system (i.e., an equivalent relational algebraic expression). During this process, the
syntax analyzer performs (1) query analysis to identify language symbols in the query
statement and (2) syntax checking and parsing to determine whether the query statement
conforms to the SQL syntax rules. If the query statement is valid, the system checks its
semantics against the data dictionary, builds a parse tree for the query statement, and
translates the query statement into an equivalent relational algebraic expression.


Query optimization means selecting an efficient query processing strategy. This
procedure is categorized into algebraic optimization and physical optimization.
Algebraic optimization optimizes relational algebraic expressions, whereas physi-
cal optimization selects optimal access paths and underlying operation algorithms.
The cost of a query plan is generally measured by the number of disk blocks trans-
ferred and the number of disk accesses required [2] and varies based on the query
method used and the way the CPU handles it. Different query methods and CPU
handling approaches can incur different costs for the same query.
Query optimization aims to minimize the query cost by selecting the most effi-
cient processing strategy for a given query.

7.1.1.1 Query Operations

Selection Operations

Selection is the process of finding tuples that satisfy a given condition from a rela-
tion. Typical selection algorithms include full table scan and index (or hash) scan.

Sorting Operations

Data sorting plays an important role in database systems. On the one hand, query
results may need to be sorted in a specific way (e.g., by using the ORDER BY key-
word). On the other hand, sorting can be used in other operations to achieve efficient
implementation even if a sorting method is not specified in the query. For example,
loading sorted tuples in batches into a B+ tree index is more efficient than loading
unsorted tuples.
Data that can fit in memory is sorted by using standard sorting techniques, such
as quicksort. Data that cannot be entirely held in memory can be sorted by using
external merge sort algorithms.

Join Operations

Join is the process of selecting tuples whose attributes meet given conditions from
the Cartesian product of two relations. The algorithms used to compute the join of
two relations include nested loop join (NLJ), block nested loop (BNL) join, index
NLJ, merge join, and hash join. A join algorithm is selected depending on the physi-
cal storage form of the relations and whether indexes are present.

Other Operations

Other relational operations include deduplication, projection, set operations, outer
join, and aggregation, all of which can be implemented through sorting and hashing.

7.1.1.2 Expression Evaluation

Two expression evaluation methods are available. One method is to perform one
operation at a time in a specific order and materialize the result of each evaluation
into a temporary relation. The evaluation and materialization costs include the cost
of all operations and the cost of writing intermediate results back to the disk, result-
ing in high disk I/O costs. The other method is to perform multiple operations
simultaneously on a pipeline. In this method, the result of one operation is passed to
the next operation without the need to save the result to a temporary relation, thereby
eliminating the cost of reading and writing temporary relations.

7.1.2 Overview of Parallel Queries

As the number of computer cores supported in a system increases, parallel systems
have become popular. In the era of big data, an increasing number of applications need
to query a large amount of data or process a large number of transactions every sec-
ond. Database management systems can use a parallel mechanism to accelerate the
processing of these queries and transactions, effectively improving the processing
speed and I/O speed by using multiple processors and disks in parallel. This section
uses read-only queries as examples to illustrate the processing of parallel queries.
Two parallel query processing mechanisms are available in a database system:
interquery parallelism and intraquery parallelism.
In interquery parallelism, multiple queries or transactions are executed in paral-
lel across multiple processors and disks, but each individual query or transaction is
still executed in a serial manner on a specific processor. Interquery parallelism can
effectively improve the transaction throughput and expand the transaction process-
ing capacity of the system, so that the system can support a larger number of trans-
actions per unit of time.
In intraquery parallelism, a single query is executed in parallel across multiple
processors and disks. This mechanism is particularly effective for queries that have
a long execution time. In intraquery parallelism, different parts of a single query are
executed in parallel, thereby shortening the overall execution time of the query.
The execution of a single query may involve different operations, such as selec-
tion, projection, join, and aggregation. In general, intraquery parallelism can be
achieved through two different parallelization approaches: intraoperator parallelism
and interoperator parallelism.

7.1.2.1 Intraoperator Parallelism

Intraoperator parallelism refers to the parallel processing of a single operator across
multiple processors, such as selection, projection, and join, to speed up query pro-
cessing. In this section, we will take parallel sorting and parallel join as examples to
illustrate intraoperator parallelism. To simplify the description, we assume that
there are n nodes, N0, N1, N2, ⋯, Nn − 1, and each node corresponds to one processor.

Parallel Sorting

A classic scenario for parallel sorting is to sort a relation R stored on n nodes N0, N1,
N2, ⋯, Nn − 1. If R has already been range-partitioned and the partitioning attribute is
the attribute based on which R is to be sorted, partitions on each node can be sorted
in parallel. Then, the sorted partitions can be merged to obtain the complete sorted
relation.
If the relation is partitioned by using other methods, the relation can be range-­
partitioned again by using the attribute based on which R is to be sorted. This
ensures that all tuples within the ith range are sent to the same node Ni. Then, parti-
tions on each node can be sorted in parallel, and the sorted partitions can be merged
to obtain the complete sorted relation.
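A minimal sketch of this scheme, with threads standing in for nodes, is shown below. It assumes the input has already been range-partitioned on the sort attribute, so concatenating the locally sorted partitions yields the globally sorted relation; the data types are illustrative only.

#include <algorithm>
#include <thread>
#include <vector>

// Range-partitioned parallel sort: each "node" (thread) sorts its partition
// locally; the sorted partitions are then concatenated because range
// partitioning already orders them globally.
std::vector<int> parallel_range_sort(std::vector<std::vector<int>> partitions) {
  std::vector<std::thread> workers;
  for (auto &part : partitions) {
    workers.emplace_back([&part] { std::sort(part.begin(), part.end()); });
  }
  for (auto &w : workers) w.join();

  std::vector<int> result;  // partitions satisfy P0 < P1 < ... < Pn-1
  for (const auto &part : partitions) {
    result.insert(result.end(), part.begin(), part.end());
  }
  return result;
}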

Parallel Join

The join operation in relational algebra checks whether a pair of tuples satisfies a
join condition. If yes, the tuples are output to the join result. The parallel join algo-
rithm allocates the tuples to be examined to different processors, and each processor
computes the local join results. All processors work in parallel to compute the join,
and the system collects their results to produce the final result.
Taking the most commonly used natural join as an example, suppose that two rela-
tions R and S are to be joined. The parallel join algorithm separately partitions R and
S into n partitions, R0, R1, R2, ⋯, Rn − 1 and S0, S1, S2, ⋯, Sn − 1. The system sends parti-
tions Ri and Si to node Ni to compute the join result. All nodes work in parallel to
compute the join results, which are then merged to obtain the final join result.
Parallelization methods are also available for join methods such as hash join and
NLJ (nested-loop join) in centralized databases.
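The following sketch illustrates the partition-and-join idea, again with threads standing in for nodes and hash partitioning on the join key; the Row type and the in-memory representation are simplifications for illustration, not a database's actual join implementation.

#include <cstddef>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<int, int>;  // (join key, payload)

std::vector<std::pair<Row, Row>> parallel_hash_join(
    const std::vector<Row> &r, const std::vector<Row> &s, std::size_t n_nodes) {
  // Partition R and S so that tuples with the same key land in the same partition.
  std::vector<std::vector<Row>> r_parts(n_nodes), s_parts(n_nodes);
  for (const Row &t : r) r_parts[static_cast<std::size_t>(t.first) % n_nodes].push_back(t);
  for (const Row &t : s) s_parts[static_cast<std::size_t>(t.first) % n_nodes].push_back(t);

  // Each "node" i joins (Ri, Si) locally with a hash join.
  std::vector<std::vector<std::pair<Row, Row>>> local(n_nodes);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < n_nodes; ++i) {
    workers.emplace_back([&, i] {
      std::unordered_multimap<int, Row> build;
      for (const Row &t : r_parts[i]) build.emplace(t.first, t);
      for (const Row &t : s_parts[i]) {
        auto range = build.equal_range(t.first);
        for (auto it = range.first; it != range.second; ++it)
          local[i].emplace_back(it->second, t);
      }
    });
  }
  for (auto &w : workers) w.join();

  std::vector<std::pair<Row, Row>> result;  // merge the local join results
  for (const auto &part : local) result.insert(result.end(), part.begin(), part.end());
  return result;
}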

Parallelism of Other Common Operations

Many commonly used relational operators can be evaluated in parallel:


1. Selection: The parallel implementation of selection depends on the selection con-
dition. Assuming the attribute involved in the selection condition is Ai, if relation R
is partitioned based on attribute Ai, the selection operation is executed in parallel
on different processors for the partitions involved in the selection condition.
Otherwise, the selection operation is executed in parallel on all processors.
2. Deduplication: Deduplication can be achieved through parallel sorting by using
any parallel sorting technique. Specifically, duplicates are removed as soon as
they appear during the sorting process. Alternatively, deduplication can be
achieved by partitioning tuples in parallel and removing duplicates during the
partitioning process.
3. Projection: Projection without deduplication can be performed in parallel as
tuples are read from the disk.
4. Aggregation: Aggregation can be performed in parallel by partitioning the rela-
tion based on a grouping attribute and computing the aggregate value on each
processor.

7.1.2.2 Interoperator Parallelism

Interoperator parallelism refers to the parallel execution of multiple different opera-
tions in a query across processors to speed up query processing. Interoperator paral-
lelism can be implemented in two forms: pipeline parallelism and independent
parallelism.
In pipeline parallelism, operator A is used by the next operator, operator B, before
the former produces a complete output tuple set. Pipeline parallelism can effectively
reduce the computational cost of database query processing and does not require
saving any results to the disk. Hence, operations are not interrupted. Pipeline paral-
lelism uses a pipeline in a parallel system. For instance, two different processors can
respectively execute operations A and B at the same time. Once operation A pro-
duces a result, operation B immediately uses the result.
Independent parallelism refers to the parallel execution of operators that do not
depend on each other in a query expression. For example, assume that three tables
are joined. If a natural join is performed on the first two tables and the resulting table
is then joined with a resulting table obtained by performing selection and filtering
on the third table, that is, A ⋈ B ⋈ σθ(C), operations A ⋈ B and σθ(C) are
independent of each other. Therefore, they can be executed through independent
parallelism to improve the execution efficiency.

7.2 Query Execution Models

This section describes the query execution models of databases. The query execu-
tion model of a database determines how the database executes a given query plan.
In this section, the most commonly used model, the Volcano model, is introduced
along with its advantages and disadvantages. Then, the compiled execution model
and the vectorized execution model, which partially compensate for the shortcom-
ings of the Volcano model, are presented.

7.2.1 Volcano Model

The Volcano model [3] (also known as the iterator model) is the most common
execution model. Databases such as MySQL and PostgreSQL use the Volcano model.
In the Volcano model, each operation in relational algebra is abstracted as an
operator, and an operator tree is built for an SQL query. Each operator in the tree,
such as a join and sorting operator, implements a Next() function. The Next() func-
tion of the parent node in the tree calls the Next() functions of its subnodes, and the
subnodes return results to the parent node. Next() functions are recursively called
from the root node to leaf nodes in a top-down approach, and data is pulled for pro-
cessing in reverse order. This processing method of the Volcano model is also known
as the pull-based execution model [4].
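A minimal C++ sketch of this pull-based interface is shown below; the Tuple type and the two operators are illustrative only. Note that every tuple returned by the root triggers a chain of virtual Next() calls down to the leaves, which is the per-tuple overhead discussed next.

#include <cstddef>
#include <memory>
#include <optional>
#include <vector>

using Tuple = std::vector<int>;  // a toy tuple type

struct Operator {
  virtual ~Operator() = default;
  virtual std::optional<Tuple> Next() = 0;  // one tuple per call, none at the end
};

// Leaf operator: scans an in-memory "table".
struct Scan : Operator {
  explicit Scan(std::vector<Tuple> rows) : rows_(std::move(rows)) {}
  std::optional<Tuple> Next() override {
    if (pos_ >= rows_.size()) return std::nullopt;
    return rows_[pos_++];
  }
  std::vector<Tuple> rows_;
  std::size_t pos_ = 0;
};

// Filter operator: keeps pulling from its child until a tuple qualifies.
struct Filter : Operator {
  Filter(std::unique_ptr<Operator> child, bool (*pred)(const Tuple &))
      : child_(std::move(child)), pred_(pred) {}
  std::optional<Tuple> Next() override {
    while (auto t = child_->Next()) {
      if (pred_(*t)) return t;  // each qualifying tuple flows up one at a time
    }
    return std::nullopt;
  }
  std::unique_ptr<Operator> child_;
  bool (*pred_)(const Tuple &);
};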
In the Volcano model, an operator can be implemented separately without
considering the implementation logic of other operators. Despite this benefit,
the disadvantages of this model are apparent. For instance, only one tuple is
computed each time, resulting in low utilization of the CPU cache. Moreover,
the recursive calling of Next() functions of subnodes by the parent node causes
a large number of virtual function calls and consequently leads to low CPU
utilization.
In the era when the system performance mainly depended on disk I/Os, the query
processing performance of the Volcano model was considered high. However, with
the development of hardware, data storage becomes increasingly faster, and system
performance is no longer bottlenecked by disk I/Os. Therefore, the research focus
has switched to improving the computation efficiency; fruitful results are also
achieved. Moreover, two optimization methods have emerged: the compiled execu-
tion model and the vectorized execution model. Compared with the Volcano model,
the preceding models greatly improve the execution performance of database
queries.

7.2.2 Compiled Execution Model

Compiled execution, also known as data-centric code generation, was first proposed
by HyPer [5]. The compiled execution model uses the LLVM (low-level virtual
machine) compiler framework to transform queries into compact and efficient code
that can be quickly executed and is compatible with the modern CPU architecture.
As a result, excellent query performance can be achieved with moderate code com-
pilation, greatly improving the query execution efficiency.
The data-centric compilation approach is attractive for all new databases. Backed
by mainstream compilation frameworks, database management systems automati-
cally benefit from future improvements of compilers and processors without the
need to redesign the query engine. For more details about the compiled execution
model, see Ref. [6].

7.2.3 Vectorized Execution Model

The vectorized execution model is designed based on the Volcano model. In this
model, each operator implements a Next() function. Unlike in the Volcano model,
the Next() function of each operator in the vectorized model returns a batch of data
rather than a single tuple in the iteration process. Returning data in batches greatly
reduces the number of times the Next() function is called, thereby reducing virtual
function calls. Additionally, the vectorized execution model allows the SIMD (sin-
gle instruction, multiple data) mechanism to be used in each operator to simultane-
ously process multiple rows of data. Furthermore, vectorized execution processes
data in blocks, improving the CPU cache hit rate. Sompolski et al. [7] demonstrates
how the vectorized execution model is suitable for complex analytical query pro-
cessing and can improve the OLAP (online analytical processing) query perfor-
mance by up to 50 times.
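The following sketch contrasts with the iterator sketch in the previous section: Next() now returns a batch, so the virtual-call overhead is amortized over many values and the inner loop is amenable to SIMD. The types are again illustrative only.

#include <vector>

using Batch = std::vector<int>;  // one column chunk (e.g., about 1024 values)

struct VecOperator {
  virtual ~VecOperator() = default;
  virtual Batch Next() = 0;  // an empty batch signals the end of the data
};

struct VecAddConst : VecOperator {
  VecAddConst(VecOperator *child, int delta) : child_(child), delta_(delta) {}
  Batch Next() override {
    Batch in = child_->Next();
    // One virtual call per batch instead of per value; the tight, branch-free
    // loop below is the kind of code a compiler can auto-vectorize with SIMD.
    for (int &v : in) v += delta_;
    return in;
  }
  VecOperator *child_;
  int delta_;
};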

7.3 Overview of Query Optimization

7.3.1 Introduction to Query Optimization

The module responsible for query optimization in a database is called a query opti-
mizer. This module aims to find the most efficient query execution plan for a given
query. The goal of query optimization is to minimize the total cost of the query.
Query optimization can be achieved in various ways and categorized into logical
query optimization and physical query optimization according to the level of opti-
mization. Logical query optimization refers to the optimization of relational algebra
expressions, which involves rewriting queries based on equivalence rules. Physical
query optimization involves selecting the access path and underlying operation
algorithm, which can be achieved through rule-based heuristic optimization, cost-
based optimization, or a combination of both [1].
Query optimization plays an important role in databases as it relieves users from
the burden of selecting access paths, eliminating the need for high-level database
expertise and programming skills. Achieving higher query execution efficiency is
now a system task.
The query optimizer generates multiple query execution plans equivalent to the
given query and ultimately selects the plan with the lowest cost.
The general steps for query optimization are as follows:
• Rewrite the given query to an equivalent but more efficient query based on query
rewriting rules.
• Generate different query execution plans based on different underlying operation
algorithms.
• Select the query execution plan with the lowest cost.

7.3.2 Logical Query Optimization

Logical query optimization is implemented based on the equivalence transformation
rules of relational algebra, also known as query rewriting rules. Some of these
rules are described below:

7.3.2.1 Commutative Property of Join Operations

The join result remains unchanged if the positions of the two tables to be joined are
exchanged, but the execution cost may differ. For example, in a block-based NLJ, the
cost can be reduced by treating the table with fewer tuples as the outer table.

7.3.2.2 Associative Property of Join Operations

Owing to the associativity of join operations, when multiple tables are to be joined,
some tables may be joined in advance to significantly reduce the size of the interme-
diate result set without changing the final join result.

7.3.2.3 Distributive Property of Selection and Join Operations

If all selection conditions are pushed down to the tables to which they relate and
selection is performed before the join operation, the size of the intermediate result
set can be greatly reduced. This rule is generally the most effective logical optimization
method.
In summary, the purpose of logical query optimization is to use equivalent trans-
formation rules of relational algebra to transform a given query into an equivalent
but more efficient query.

7.3.3 Physical Query Optimization

Physical query optimization is implemented based on the cost model. The following
concepts are involved in physical query optimization:

7.3.3.1 Table Access Path

The table access method, such as sequential scan, index scan, or parallel scan.

7.3.3.2 Join Algorithm

The algorithm used to join two tables, such as NLJ, hash join, and sort-merge join.

7.3.3.3 Join Order of Tables

The order of joining multiple tables to minimize the cost.


The primary task of the cost model is to estimate the cost of executing a query
plan and the costs of all equivalent query execution plans. The query optimizer
chooses the query plan with the lowest cost for the final execution. This type of
optimizer is also called a cost-based optimizer (CBO).
Several heuristic rules are available for physical query optimization. These rules
provide guidance on the selection of underlying operation algorithms. For example,
if an index is created on Column B of Table A and a selection operation is to be
performed on Column B of Table A, an index scan can be performed, which may
have a lower scan cost than a sequential scan.
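The following toy example shows the shape of such a cost-based access-path decision for a single table. The constants and cost formulas are placeholders, not any real optimizer's model.

#include <cstdint>

struct TableStats {
  uint64_t pages;          // number of disk blocks in the table
  double selectivity;      // estimated fraction of rows matching the predicate
  bool has_index_on_pred;  // an index exists on the predicate column
};

enum class AccessPath { SeqScan, IndexScan };

AccessPath choose_access_path(const TableStats &t) {
  double seq_cost = static_cast<double>(t.pages);  // read every block once
  if (!t.has_index_on_pred) return AccessPath::SeqScan;
  // Index scan: a few index blocks plus roughly one block per matching row
  // (purely illustrative constants).
  double index_cost = 3.0 + t.selectivity * static_cast<double>(t.pages) * 4.0;
  return index_cost < seq_cost ? AccessPath::IndexScan : AccessPath::SeqScan;
}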

7.3.4 Other Optimization Methods

7.3.4.1 Materialized Views

Materialized views are views whose results have been precomputed and stored.
They are generally used to improve the performance of complex queries that are
frequently executed and take a long time to execute [2]. Assuming that a material-
ized view A = B ⋈ C is available, B ⋈ C ⋈ D can be rewritten to A ⋈ D,
which lowers the execution cost for the database.

7.3.4.2 Plan Caches

A query optimizer requires several steps to generate an ideal query execution plan.
This consumes a considerable amount of computing resources. After a frequently
used query undergoes the query optimization and query execution processes, the
database caches its execution plan. The next time the same query is executed, the
database directly uses the cached execution plan for the query, thereby improving
the execution efficiency. Section 7.4.2 describes the execution plan management
and plan caching of PolarDB in detail.

7.4 Practical Application of PolarDB Query Engine

This section describes the practical application of the PolarDB query engine, mainly
focusing on three aspects: parallel query technology, execution plan management,
and vectorized execution.

7.4.1 Parallel Query Technology in PolarDB

7.4.1.1 Parallel Queries in PolarDB

Parallel Execution

One of the biggest problems in MySQL is that the query performance continuously
deteriorates as the data volume grows, and the query execution time may increase
from milliseconds to hours. The reason for such long execution time is that in
MySQL, one query can be executed only in one thread even as the data volume
continuously increases. Hence, even if the system resources are sufficient, the mul-
ticore capabilities of modern CPUs cannot be utilized.
As shown in Fig. 7.1, when a query is executed on a node with 64 cores, only one
thread is involved in the execution of the query. The other 63 threads remain in an
idle state.
Against this backdrop, PolarDB for MySQL 8.0 is launched with a powerful
parallel query framework. Once the amount of query data reaches a specified
threshold, the parallel execution framework will be automatically enabled. Then,
the data in the storage layer will be divided into multiple partitions, which are
allocated to different threads. The threads compute the results in parallel. The
results are pipelined to the leader thread for aggregation and then returned to
the user.
The PolarDB query optimizer determines whether to generate a serial execution
plan or a parallel execution plan based on the execution cost of the statement. Take
the following query as an example:

SELECT SUM(a) FROM T1 WHERE b LIKE '%xxx%'

Multiple worker threads scan and perform filtering on Table T1, separately cal-
culate a sum, and return the results to the leader thread. The leader thread then
performs a summation operation and returns the result to the client.
As shown in Fig. 7.2, the query is executed in two stages (from right to left). In
the first stage, the 64 threads participate in table scanning and computation. In the
second stage, the computation results of the threads are summed up. It can be seen
that parallel computing can fully mobilize the computing capabilities, with each
thread handling less than 2% of the total workload. This significantly reduces the
end-to-end time of the entire query.
Fig. 7.1 Serial execution in MySQL

Fig. 7.2 Parallel execution example

In general, parallel execution in PolarDB has the following advantages:

• Zero need for business adaptation, SQL modification, data migration, or data
partitioning changes.
• 100% compatibility with MySQL.
• Significant improvement in query performance, enabling users to realize the per-
formance improvement brought by enhanced computing power.

Design of Parallel Execution

Architecture Design
Table 7.1 lists several terms related to parallel execution.
Figure 7.3 shows the parallel execution architecture of PolarDB, which includes
four execution modules (from top to bottom):
• Leader generates the parallel execution plan, performs computing pushdown,
and aggregates computation results. In Fig. 7.3, the Gather node in the execution
plan is the leader and receives data from various message queues.
• Message queue is responsible for data communication between the leader and
the workers. Each message queue represents a communication relationship
between a worker and the leader. If N workers are present, N message queues
are needed.
• Worker receives execution plans issued by the leader and returns the execution
results to the leader. The execution plans of the workers are identical; the workers
differ only in the specific data that they scan. For example, in Fig. 7.3, the five
worker threads are indicated by different colors, but they all contain the same
execution plan.
• Parallel scanning is implemented by the InnoDB engine. Data in a table is divided
into multiple partitions, and each worker scans data in one partition. As shown in
Fig. 7.3, partitions indicated by different colors are in a one-to-one correspondence
with the upper-layer workers. When a worker completes scanning a partition, it
requests to be bound to an unscanned partition and then continues to scan that
partition.

Table 7.1 Terms related to parallel execution

• Parallel group: A set of threads that execute the same query in parallel.
• Leader (parallel group leader): A leader can also be called a Gather thread. Each
parallel group has only one leader thread, which is mainly responsible for generating
parallel execution plans, initializing worker threads, and collecting the results from
worker threads to perform further computation.
• Worker (parallel group worker): A worker thread receives an execution plan from
the leader thread, executes the plan, and returns the result to the leader thread. One
parallel group may have multiple worker threads.
• Message queue (MQ): A message queue serves as the data transfer channel between
a worker thread and the leader thread. Each worker thread generates data and puts
the data into a message queue. Then, the leader thread consumes the data. In the
parallel query framework, each message queue is a one-to-one channel that is
associated with only one worker and one leader. PolarDB will soon support the
multistage parallel query feature, which allows a message queue to be flexibly
configured as a one-to-one, one-to-n, or n-to-one channel.

Fig. 7.3 Parallel execution architecture of PolarDB


Generation and Execution of Parallel Execution Plans


PolarDB supports parallel execution by modifying the original serial execution
framework. Figure 7.4 shows the process of generating a parallel execution plan.
After serial execution optimization, the PolarDB optimizer determines whether
to perform parallel execution based on the current cost. The optimizer checks

Fig. 7.4 Process of generating a parallel execution plan

whether the cost of serial execution is greater than the cost threshold that triggers
parallel execution, whether the table supports parallel scanning, and whether the
number of rows scanned exceeds the specified threshold. The optimizer also esti-
mates the cost of parallel execution and compares it with the cost of serial
execution.
The optimizer generates a parallel execution plan only when it determines that paral-
lel execution is more efficient than serial execution. As mentioned earlier, the parallel
execution framework of PolarDB has only one leader/Gather thread. The purpose of a
parallel execution plan is to push down the computing of as many operators and expres-
sions as possible to worker threads for parallel execution. This reduces the cost of data
transmission and enables parallel execution for more computation workloads.
The PolarDB optimizer determines whether to push down computing based on
the following considerations: (1) whether new execution methods are needed
(e.g., an aggregation function may need to be transformed into a two-stage aggre-
gation function) and (2) whether the expression (including its parameters) is
parallel-safe.
Parallel-safe operations do not conflict with parallel queries. Whether an expression
is parallel-safe needs to be determined based on its specific implementation. For
example, in MySQL, the Rand() function is parallel-safe, but Rand(10) is not. In
MySQL, a constant argument is used as the random seed, which is initialized once
and then used to compute every subsequent value. As a result, if such a function were
pushed down, all worker threads would return identical data columns. In contrast, a
Rand() function without a constant argument creates its random seed based on the
current thread, so it remains parallel-safe when pushed down. Using the query in
Fig. 7.1 as an example, the parallel execution plan generated by the optimizer can be
described in terms of the logic executed by the Gather thread and by the worker threads.
The execution logic of the Gather thread is as follows:

select sum(xx) from gather_table;

In the syntax, gather_table specifies a temporary table created for receiving and
transmitting data, and the sum() function is formulated as a two-stage function in
consideration of the pushdown of aggregation operations.
The execution logic of a worker thread is as follows:

select count(*) as xx from t1;

The worker thread calculates the count value of Table t1. After optimization,
worker threads are only aware of the logical tasks and do not know the specific part
of data they will process. The data processed by each worker thread is determined
during the execution phase.
After tasks are allocated to the Gather and worker threads, the next step is to cre-
ate a temporary table for receiving and transmitting data. The worker threads write
data to the temporary table, thereby shielding the underlying operations. The Gather
and worker threads then scan the temporary table for further processing.
After optimization, the execution phase begins. Fig. 7.5 describes the workflows
of the Gather and worker threads. After the Gather thread is initialized, it sets the
actual partitions for parallel scanning, creates the required message queues, starts
the worker threads, and then waits to be awakened to read data. Data in PolarDB is
physically managed in the form of a B+ tree. During partitioning, the Gather thread
only needs to access some of the nodes in the B+ tree index, traverse the tree by
using a breadth-first search algorithm from top to bottom, and perform partitioning
level by level. Taking the tree in Fig. 7.6 as an example, the three levels of the tree
contain 32 data records. If two partitions are required, the Gather thread accesses
the root node and divides the root node into two data partitions. If eight partitions
are required, the Gather thread continues to explore nodes at the next level (e.g.,
nodes at Level 1 in Fig. 7.6) to determine the partitions. To control the additional
overhead brought by partitioning, PolarDB avoids traversing the leaf nodes.

Fig. 7.5 Workflows of the Gather and worker threads



Fig. 7.6 Partitioning process

During the optimization phase, PolarDB determines only the DOP (degree of
parallelism) and does not know how the physical data is partitioned. On the one hand,
an excessively small number of partitions may result in low concurrency and
extremely uneven workload allocation. On the other hand, an excessively large
number of partitions causes frequent switching of worker threads.
Currently, partitioning in PolarDB is implemented based on the following rule:
The number of partitions is 100 times the DOP. After a worker thread completes
processing one partition, it sends a request to the leader thread for permission to
access the next accessible partition. This process repeats until all data is processed.
After the optimization, the worker threads share an execution plan template. A
worker thread creates related environment variables during initialization, clones the
expression and execution plan, and then executes the plan and outputs the result to
a message queue. At this point, the Gather thread is awakened to process the data.
Expressions in PolarDB do not have complete abstract representations. Therefore, a
proper cloning scheme must be provided for each expression to ensure that the
execution plan of each worker thread is complete and not affected by other threads.

Parallel Operations

PolarDB supports a variety of parallel execution scenarios to meet customer needs.


This section briefly describes the execution processes of several parallel operations.

Parallel Scanning
In a parallel scan, worker threads independently scan data in a data table in parallel.
The intermediate result set generated by a worker thread through scanning is
returned to the leader thread. The leader thread then collects the generated interme-
diate results through a Gather operation and returns the results to the client.

Parallel Multitable Join


In a parallel query, the multi-table join operation is completely pushed down to
worker threads for execution. The PolarDB optimizer selects one table that it con-
siders to be the most suitable for parallel scanning; other tables are scanned in a

regular manner. Each worker thread returns the result set after the join to the leader
thread. The leader thread then collects the result sets through a Gather operation and
returns the results to the client.

Parallel Sorting
The PolarDB optimizer determines whether to push the ORDER BY operation
down to each worker thread for execution based on the query status. Each worker
thread returns the sorted result to the leader thread. The leader gathers, merges, and
sorts the results and then returns the sorted result to the client.

Parallel Grouping
The PolarDB optimizer determines whether to push the GROUP BY operation
down to worker threads for parallel execution based on the query status. If the target
table can be partitioned based on all the attributes of the GROUP BY clause or on
the first several attributes of the GROUP BY clause, the grouping operation can be
completely pushed down to worker threads for execution. In this case, the HAVING,
ORDER BY, and LIMIT operations can also be pushed down to worker threads for
execution to improve query performance. The leader thread aggregates the gener-
ated intermediate results through a Gather operation and returns the aggregated
result to the client.
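For illustration, the following sketch uses the PARALLEL hint described in Sect. 7.4.1.3 (the table t1 and column c1 are hypothetical); if t1 can be partitioned on c1, the grouping, ordering, and limiting steps may all be pushed down to the worker threads, subject to the optimizer's cost decision:

SELECT /*+PARALLEL(4)*/ c1, COUNT(*) AS cnt
FROM t1
GROUP BY c1
ORDER BY c1
LIMIT 10;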

Parallel Aggregation
In parallel query execution, the aggregation function is pushed down to worker
threads for parallel execution. Parallel aggregation is completed through two stages.
In the first stage, all worker threads that participate in the parallel query execute an
aggregation step. In the second stage, the Gather or Gather Merge operator collects
the results generated by the worker threads and sends the results to the leader thread.
The leader thread then aggregates the results of all worker threads to obtain the
final result.

Parallel Counting
The PolarDB optimizer determines whether to push the counting operation down to
worker threads for parallel execution based on the query status. Each worker thread
finds the corresponding data based on its primary key range and executes the Select
count(*) operation. The Select count(*) operation has been optimized at the engine
layer. Therefore, the engine can quickly traverse the data to obtain the result. Each
worker thread returns the intermediate result of the counting operation to the leader
thread, and the leader thread aggregates all data and performs counting. In addition
to supporting clustered indexes, parallel counting supports parallel search of sec-
ondary indexes.

Parallel Semijoin
Semijoin supports five strategies: materialization-lookup, materialization-scan, first
match, weedout, and loose scan. PolarDB supports parallel processing for all these
five strategies. Two parallelization approaches are available for the materialization-­
lookup and materialization-scan strategies. One is to push down the semi-join oper-
ation to worker threads for parallel execution. Each worker thread is responsible for
the semi-join of part of the data and materialized table. The other is to push the
parallel materialization operation down to worker threads in advance. The worker
threads share the materialized table during the semi-join operation. The PolarDB
optimizer selects the optimal parallelization approach based on the query status.
Only one parallelization approach is available for the other three strategies, namely,
to push the semi-join operation down to worker threads. Then, each worker thread
returns the result set of the semi-join operation to the leader thread, and the leader
thread aggregates the results and returns the aggregated result to the client.
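As an illustrative shape of such a query (table and column names are hypothetical), an IN subquery like the following is typically executed as a semijoin, and the optimizer picks one of the five strategies above together with a parallelization approach:

SELECT /*+PARALLEL(4)*/ * FROM t1 WHERE t1.a IN (SELECT t2.a FROM t2);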

Support for Subqueries


Four execution strategies are available for subqueries in a parallel query: serial exe-
cution on the leader thread, parallel execution on the leader thread (in this case, the
leader thread will start another group of worker threads), shared access for early
parallel execution, and pushdown.
The first execution strategy is applicable when a subquery cannot be executed in
parallel. For example, when the condition for joining two tables references a user-­
defined function (UDF), the subquery will be executed in series on the leader thread.
The second execution strategy is applicable in the following case: After the parallel
execution plan is generated, the execution plan for the leader thread contains a subquery
that can be executed in parallel. However, the subquery cannot be executed in parallel in
advance (i.e., cannot be executed by using the shared access strategy). For example, if
the current subquery contains a window function, this execution strategy is applicable.
The third execution strategy is applicable in the following case: After the parallel
execution plan is generated, the execution plan for the worker threads references a
subquery that can be executed in parallel. In this case, the PolarDB optimizer will
choose to execute the subquery in parallel in advance, so that the worker threads can
directly access the results of the subquery.
The fourth execution strategy is applicable in the following case: After the paral-
lel execution plan is generated, the execution plan for the worker threads references
related subqueries. In this case, these subqueries will be pushed down as a whole to
the worker threads for execution.

Limitation of the Parallel Query Feature

PolarDB will continue to upgrade the parallel query feature. However, the parallel
query feature is unavailable for the following cases:

• Queries on system tables or non-InnoDB tables.


• Queries using full-text indexes.
• Select count(*) operations without conditions on temporary tables.
• Parallel scans on temporary tables in the memory engine.
• Stored procedures.
• Recursive common table expressions (CTEs).
• Window functions (functions cannot be evaluated in parallel, but the query can
be executed in parallel).
• GIS (geographic information system) functions (functions cannot be evaluated
in parallel, but the query can be executed in parallel).
• Index merges.
• Query statements in transactions of the serializable isolation level.

7.4.1.2 Resource Management for Parallel Execution

To ensure better system stability and monitor the parallel execution status of the
system, PolarDB further provides a rich variety of resource management features
for parallel execution.

DOP Control

The maximum DOP for each query can be specified by using the max_parallel_
workers parameter:

set max_parallel_workers = n

PolarDB determines a proper DOP that is less than n based on factors such as the
thread count, memory resources, and CPU resources.

Memory Constraints

set query_memory_hard_limit = n;
set query_memory_soft_limit = m;

The query_memory_hard_limit and query_memory_soft_limit parameters are


provided to control memory resources, including the temporary tablespace, sort buf-
fer, and join buffer, for parallel execution. The sort and join buffers are used by the
system to sort data and perform join operations, respectively. When the memory
usage exceeds the hard limit, execution of some queries will be terminated based on
the memory management strategies. When the memory usage exceeds the soft limit,
the PolarDB optimizer will no longer choose parallel execution.

Execution Status Monitoring

Users can check the current status of parallel execution by querying system tables:

select * from performance_schema.events_parallel_query_current

For example, the parallel execution status is as follows:

**************** 1.row *****************


THREAD_ID: 94
PARENT_THREAD_ID: 0
PARALLEL_TYPE: GATHER
EVENT_ID: 11
END_EVENT_ID: NULL
EVENT_NAME: parallel query
STATE: COMPLETED
PLANNED_DOP: 16
ACTUAL_DOP: 16
NUMBER_OF_PARTITIONS: 36
PARTITIONED_OBJECT: t1
ROWS_SCANED: 10189
ROWS_SENT: 77
ROWS_SORTED: 0
EXECUTION_TIME: 435373818
NESTING_EVENT_ID: 9
NESTING_EVENT_TYPE: STATEMENT

**************** 2.row *****************


THREAD_ID: 95
PARENT_THREAD_ID: 94
PARALLEL_TYPE: WORKER
EVENT_ID: 2
END_EVENT_ID: NULL
EVENT_NAME: parallel query
STATE: COMPLETED
PLANNED_DOP: 0
ACTUAL_DOP: 0
NUMBER_OF_PARTITIONS: 0
PARTITIONED_OBJECT:
ROWS_SCANED: 718
ROWS_SENT: 8
ROWS_SORTED: 0
EXECUTION_TIME: 423644016
NESTING_EVENT_ID: 1
NESTING_EVENT_TYPE: STATEMENT

7.4.1.3 Implementation of Parallel Execution

PolarDB for MySQL 8.0 supports the parallel query feature, which can be enabled
or disabled by using system parameters or hints; apart from adding hints, no
modification of SQL statements is required.
As supplementary SQL syntax, hints play a pivotal role in relational databases.
They allow users to specify the way an SQL statement is executed to optimize the
SQL statement. PolarDB also provides particular hint syntax.

Configure the Parallel Query Feature by Using System Parameters

PolarDB specifies the maximum number of parallel execution threads for each SQL
statement by using the global parameter max_parallel_degree. The default value of
this parameter is 0. As shown in Fig. 7.7, this parameter can be modified in the con-
sole any time during system operation without the need to restart the database.

Recommended Settings
We recommend that you gradually increase the value of the max_parallel_degree
parameter. For example, you can set this parameter to 2 at first and then check the
CPU load after the setting runs for 1 day. If the CPU load is not high, you can con-
tinue to increase the value. Otherwise, do not increase the value. The value of this
parameter cannot exceed one-fourth of the number of CPU cores. We recommend
that you enable the parallel query feature only when the system has at least eight
CPU cores and do not enable this feature for small-specification instances.
When max_parallel_degree is set to 0, parallel execution is disabled. When
max_parallel_degree is set to 1, parallel execution is enabled, but the DOP is only 1.
The max_parallel_degree parameter serves to maintain compatibility with
MySQL configuration files. PolarDB also provides the loose_max_parallel_degree
parameter in the console to ensure that other versions do not throw errors when
receiving this parameter.
When you enable the parallel query feature, disable the innodb_adaptive_hash_
index parameter because it affects the performance of parallel queries.

Fig. 7.7 System parameter settings



In addition to the cluster-level DOP, you can also configure session-level DOPs
by using related session-level environment variables. For example, you can add the
following command to the JDBC (Java Database Connectivity) connection string of
an application to set a separate DOP for the application:

set max_parallel_degree = n

Control the Parallel Query Feature by Using Hints

Hints
Hints allow you to control individual statements. For example, when the parallel
query feature is disabled for the system but a frequently used slow SQL statement
needs to be processed in parallel, you can use a hint to enable parallel execution for
that particular SQL statement. You can enable parallel execution by using either of
the following syntaxes:

SELECT /*+PARALLEL(x)*/ ... FROM ...; -- x >0


SELECT /*+ SET_VAR(max_parallel_degree=n) */ * FROM ... // n > 0
You can disable parallel execution by using either of the following
syntaxes:
SELECT /*+NO_PARALLEL()*/ ... FROM ...
SELECT /*+ SET_VAR(max_parallel_degree=0) */ * FROM ...

Advanced Hints
The parallel query feature provides two advanced hints: PARALLEL and NO_
PARALLEL. The PARALLEL hint can force a query to execute in parallel and
specify the DOP and the name of the table to be scanned in parallel. The NO_
PARALLEL hint can force a query to execute in series or specify the tables that will
not be scanned in parallel. The syntaxes for the PARALLEL and NO_PARALLEL
hints are as follows:

/*+ PARALLEL [( [query_block] [table_name] [degree] )] */


/*+ NO_PARALLEL [( [query_block] [table_name][,table_name] )] */

In the preceding syntaxes, query_block specifies the name of a query block to


which the hint is to be applied, table_name specifies the name of a table to which
the hint is to be applied, and degree specifies the DOP. The following shows a
sample statement:

SELECT /*+PARALLEL()*/ * FROM t1,t2;



The following two parameters must be specified when the parallel query feature
is enabled:
• force_parallel_mode: Set this parameter to true to force parallel execution even
if a table contains a small number of records.
• max_parallel_degree: Use the default setting.
The following section provides several examples of parallel execution:
Example 1

SELECT /*+PARALLEL(8)*/ * FROM t1,t2;// Forcibly enables parallel execution and sets the DOP to 8.

Set force_parallel_mode to true to force parallel execution even if a table contains a small number of records.
Set max_parallel_degree to 8.
Example 2

SELECT /*+ SET_VAR(max_parallel_degree=8) */ * FROM ...

Set max_parallel_degree to 8.
Set force_parallel_mode to false so that parallel execution is disabled when the
number of records in a table is smaller than the specified threshold.
Example 3

SELECT /*+PARALLEL(t1)*/ * FROM t1,t2;

The PARALLEL hint is applied to Table t1 (i.e., Table t1 is scanned in parallel).
Example 4

SELECT /*+PARALLEL(t1 8)*/ * FROM t1,t2;

The PARALLEL hint is applied to Table t1 (i.e., Table t1 is forcibly scanned in parallel, with a DOP of 8).
Example 5

SELECT /*+PARALLEL(@subq1)*/ SUM(t.a) FROM t WHERE t.a =


(SELECT /*+QB_NAME(subq1)*/ SUM(t1.a) FROM t1);

Subqueries are forcibly executed in parallel, with a DOP equal to the default
max_parallel_degree setting.

Example 6

SELECT /*+PARALLEL(@subq1 8)*/ SUM(t.a) FROM t WHERE t.a =


(SELECT /*+QB_NAME(subq1)*/ SUM(t1.a) FROM t1);

Subqueries are forcibly executed in parallel, with a DOP of 8 specified by


max_parallel_degree.
Example 7

SELECT SUM(t.a) FROM t WHERE t.a =


(SELECT /*+PARALLEL()*/ SUM(t1.a) FROM t1);

Subqueries are forcibly executed in parallel.


The DOP is equal to the default max_parallel_degree setting.
Example 8

SELECT SUM(t.a) FROM t WHERE t.a =


(SELECT /*+PARALLEL(8)*/ SUM(t1.a) FROM t1);

Subqueries are forcibly executed in parallel, with a DOP of 8 specified by


max_parallel_degree.
Example 9

SELECT /*+NO_PARALLEL()*/ * FROM t1,t2;

Parallel execution is disabled.


Example 10

SELECT /*+NO_PARALLEL(t1)*/ * FROM t1,t2;

Parallel execution is disabled only for Table t1. When the parallel query feature
of the system is enabled, Table t2 may be scanned in parallel.
Example 11

SELECT /*+NO_PARALLEL(t1,t2)*/ * FROM t1,t2;

Parallel execution is disabled for Tables t1 and t2.


Example 12

SELECT /*+NO_PARALLEL(@subq1)*/ SUM(t.a) FROM t WHERE t.a =


(SELECT /*+QB_NAME(subq1)*/ SUM(t1.a) FROM t1);

Parallel execution of subqueries is disabled.


Example 13

SELECT SUM(t.a) FROM t WHERE t.a =


(SELECT /*+NO_PARALLEL()*/ SUM(t1.a) FROM t1);

Parallel execution of subqueries is disabled.


Notice: The PARALLEL hint does not take effect on queries that do not support
parallel execution or tables that do not support parallel scans. The parallel execution
strategy for subqueries can also be controlled by using hints. The syntaxes are
described below:
• /*+ PQ_PUSHDOWN [([query_block])] */: Use the pushdown strategy for the
parallel execution of subqueries.
• /*+ NO_PQ_PUSHDOWN [([query_block])] */: Use the shared access strategy
for the parallel execution of subqueries.
The following shows some examples:
Example 1
# Use the pushdown strategy for the parallel execution of subqueries.

EXPLAIN SELECT /*+ PQ_PUSHDOWN(@qb1) */ * FROM t2 WHERE t2.a = (SELECT /*+ qb_name(qb1)
*/ a FROM t1);

Example 2
# Use the shared access strategy for the parallel execution of subqueries.

EXPLAIN SELECT /*+ NO_PQ_PUSHDOWN(@qb1) */ * FROM t2 WHERE t2.a = (SELECT /*+


qb_name(qb1) */ a FROM t1);

Example 3
# Specify the parallel execution strategy without specifying the query blocks.

EXPLAIN SELECT * FROM t2 WHERE t2.a =


(SELECT /*+ NO_PQ_PUSHDOWN() */ a FROM t1);

Force the Optimizer to Choose Parallel Execution

The PolarDB optimizer may not choose to execute a query in parallel (e.g., when
the table has less than 20,000 rows). If you expect the optimizer to choose a parallel
execution plan without considering the cost, use the following setting:

set force_parallel_mode = on

Note: This is a debugging parameter and is not recommended for production


environments. Moreover, the optimizer may not choose parallel execution in some
scenarios even if this variable is specified due to the limitations on parallel query
scenarios.

7.4.2 Execution Plan Management in PolarDB

7.4.2.1 Execution Plan Management

Execution Plan Management

The cost-based query optimizer attempts to find the optimal plan for execution,
which is generally characterized by short execution time and low resource con-
sumption. On the one hand, database developers invest continuous efforts to find
better execution plans by using more accurate cost models and cardinality estima-
tion (CE). On the other hand, the overheads brought by the optimization process,
especially for OLTP (online transaction processing) queries, also need to be consid-
ered. For example, MySQL always performs full optimization for the same SQL
statement, regardless of whether the same plan is generated. This approach is also
known as hard parsing in the commercial database Oracle. Oracle, however, uses a
plan cache, so that a plan can be reused and repeated optimization can be avoided.
Still, caching only one plan for a query is inadequate. The per-
formance of the plan may deteriorate due to changes in parameters, data insertion
and deletion, or changes in the database system status.
One feasible solution is to cache multiple execution plans. When multiple poten-
tial plans are available for different parameters, the optimizer selects the most effec-
tive plan for a particular input. This scheme is called adaptive plan caching. In
adaptive plan caching, whether to generate a new plan and which cached plan is the
most suitable are determined based on the selectivity of the query predicate.
Adaptive plan caching can effectively alleviate the problem of plan performance
degradation caused by different parameters.
However, the degradation of plan performance is related not only to the selectiv-
ity of the query predicate. The join order, table access mode, and materialization
strategy also affect the plan performance. In addition, as system parameters change
and data updates occur, better execution plans may be generated in the system,
which, however, are not cached.
Therefore, databases require a more complete plan evolution management solu-
tion, which is usually called SQL plan management (SPM). SPM is mainly imple-
mented in database upgrades, statistics updates, and optimizer parameter adjustments
to prevent significant performance degradation when the database executes the same
query. SPM ensures the performance baseline by maintaining a collection of plan
baselines for queries. However, due to database upgrades and data changes, better
execution plans may be generated. This necessitates timely evolution of the plan

baselines. An execution plan of a query that has been verified and proven to have bet-
ter performance is added to the collection of plan baselines and becomes an alternative
plan for the query the next time it is executed. Therefore, SPM needs to maintain plan
baselines to prevent performance degradation while actively evolving them to ensure
timely discovery of better execution plans without affecting the system performance.
In SPM, execution plans typically have three states: new, accepted, and verified.
A plan in the new state is newly generated and has not been verified, a plan in the
verified state has been verified, and a plan in the accepted state has been verified and
proven to be advantageous and is usually added to the collection of plan baselines.
Users can also manually set the status of a plan to “accepted.”
SPM includes three jobs:
1. SQL plan baseline capture creates SQL plan baselines for parameterized SQL
queries. These baselines are accepted execution plans for the corresponding SQL
statements, which are the current optimal plans or plans forcibly selected by the
DBA (database administrator). A query can have multiple plan baselines because
the optimal plan varies based on the parameter value of the query.
2. SQL plan selection and routing performs the following:
(a) Ensure that most workloads are routed to accepted plans for execution.
(b) Route a small portion of the workloads to unaccepted plans to verify
these plans.
3. SQL plan evolution evaluates the performance of unaccepted plans. If a plan
significantly improves the query performance, the plan evolves into an accepted
plan and is added to the collection of plan baselines.
Figure 7.8 shows the SPM process. When the system receives a query, it gen-
erates an execution plan and determines whether it is necessary to maintain an
execution plan for the query. If not, the system directly executes the plan. If yes,
the system checks whether the plan exists in the collection of plan baselines. If
the plan exists in the collection of plan baselines, the plan is directly executed.
Otherwise, it is added to the plan history database, and the system selects a plan
with the lowest cost from the plan baselines.
PolarDB provides three plan management strategies: plan caching (which
caches one plan for each query), adaptive plan caching (which caches multiple
plans for each query), and SPM (which caches multiple plans for each query and
evolves the plans online or offline). In PolarDB, these three strategies are dynam-
ically combined to form a complete plan management solution. Users can choose
the most appropriate strategy based on their business needs.

Plan Management Architecture

Three plan management strategies are described above (plan caching, adaptive plan
caching, and SPM), along with their respective foci. Figure 7.9 shows all modules
related to plan management. The following section describes how the three plan
management strategies can be dynamically combined on the same platform from
the perspectives of plan storage, plan representation and capture, plan caching, and
plan management.

Fig. 7.8 SPM process

Fig. 7.9 Execution plan management modules in PolarDB



The blocks in yellow, blue, and orange in Fig. 7.9 represent common components
that can support all strategies. The blocks in green represent the implementations
of the three strategies:

Plan Storage
The blocks in yellow in Fig. 7.9 represent the SQL and plan storage modules. The
SQL history database is used to detect duplicate SQL statements. When the system
runs in automatic baseline capture mode, only SQL statements that have appeared
at least twice are collected, and the first plan is marked as the plan baseline. A plan
baseline is the baseline plan stored for an SQL statement and is a subset of the plan
history database. Only plans marked as “accepted” can become baseline plans. The
plan history database is used to store information about historical execution plans.
After an SQL statement is sent to PolarDB, the optimizer searches for its baseline
plans, calculates the cost of each plan, and executes the optimal plan. At the same
time, the optimizer performs a regular optimization process to timely detect whether
better execution plans have been generated.

Plan Representation and Capture


Blocks in blue in Fig. 7.9 represent the plan representation, capture, and reuse mod-
ules. PolarDB does not provide a proper abstraction scheme to represent execution
plans, making it impossible to obtain structure information stripped of the execution
status. Therefore, execution plans must be represented by using existing informa-
tion in accordance with the minimum intrusion rule.
• Execution Plan Representation
There is no abstract scheme for representing execution plans in
PolarDB. Therefore, an execution plan is represented by using an optimization
primitive and a plan tree. The optimization primitives include the table access
method, join order, join algorithm, and SQL conversion rules. The plan tree
includes information about each execution node and the expression structure. One
challenge is to ensure that all information of the execution plan can be captured
from the optimization primitive and plan tree and that the original execution plan
can be completely restored after serialization and deserialization.
• Execution Context
The execution context describes other system information, such as system
parameter settings and mode information of the current query, when an execution
plan is generated. In addition, the plan invalidation event source module is needed
to evict invalid execution plans in time. Execution plan invalidation is usually
caused by changes in table schemas.
• Plan Infrastructure

After optimization, PolarDB captures and saves the execution plan to the plan
history database for future reuse. In addition, to facilitate plan reuse, the system
may need to reproduce the plan based on the representation in the cache and esti-
mate the cost of the restored plan.

Plan Management
The blocks in green in Fig. 7.9 represent the plan management strategies, including
plan caching, adaptive plan caching, and SPM.
In the plan caching strategy, only one plan is cached for each query. If a query
misses the cache, a new plan is generated for the query and added to the cache. If a
query hits the cache, the corresponding plan is directly executed.
Adaptive plan caching includes automatic plan selection, selectivity feedback
collection, and selectivity-based plan selection. In automatic plan selection, whether
a matching cache exists is determined based on the predicate selectivity of the cur-
rent query. If a matching plan exists in the cache, the matching plan is directly
executed. Otherwise, a new execution plan needs to be generated. A "match" occurs
when the difference between the predicate selectivity of the new query and the pred-
icate selectivity of an existing plan in the cache is less than a specified threshold
(which is usually 5%).
SPM includes SQL plan baseline capture, baseline evolution, and cost-based plan
selection. SPM is similar to the previous two strategies. In SPM, a captured plan is
added to the plan history database and is marked as “new” or “unverified” while
waiting to be executed. Baseline evolution can be implemented online or offline.
Cost-based plan selection selects the optimal plan from multiple baseline plans
based on their costs or performance estimates.
In the online evolution scheme, PolarDB uses an a + 2b strategy. With this strategy,
a% of the executions of a query select a plan from the baselines, b% try a new,
unverified plan, and another b% execute the corresponding baseline plan for
comparison. PolarDB compares the results of the latter two to better estimate the performance
of the plans. More details of online evolution will be discussed below. Offline evolu-
tion is relatively simple and is triggered when specific conditions are met (e.g., trig-
gered periodically or when the number of data updates reaches a specified threshold).

Plan Optimization and Execution


Backed by the preceding modules, PolarDB can implement plan management based
on multiple strategies. In the optimization phase, if an SQL statement has never
been executed before, the execution plan for this statement needs to be captured and
added to the plan baselines in any strategy. This procedure is not elaborated here. If
a plan has been cached for the SQL statement, the subsequent process varies based
on the selected plan management strategy. Figure 7.10 shows the execution logic of
this process.
PolarDB calls corresponding modules based on the selected plan management
strategy, such as plan caching, adaptive plan caching, or SPM:

Fig. 7.10 Multistrategy execution logic

• If the plan caching strategy is selected, the plan cache is hit and the correspond-
ing plan is directly executed.
• If the adaptive plan caching strategy is selected, whether the query hits the plan
cache is determined. In this strategy, the predicate selectivity R of the current
query is estimated based on statistical information, and the corresponding
predicate ranges of all baseline plans are queried. The query is considered to
have hit the plan cache if a plan P is found and the difference between the
predicate selectivity R’ of the plan and R is less than 5%. If the query hits the
plan cache, a plan is selected based on the predicate selectivities R’ and
R. Then, the predicate coverage range of the baseline plan is properly adjusted
(i.e., split). If the query misses the plan cache, a plan needs to be selected
through conventional cost-based query optimization, which is then added to
the plan baselines.
• With the SPM strategy, the optimizer selects an accepted execution plan, esti-
mates its cost, and then determines whether it is necessary to generate a new
execution plan based on the evolution strategy. If a new execution plan needs to
be generated, the cost-based query optimization process starts, and the new exe-
cution plan is added to the plan history database.
In the execution phase, if the plan caching strategy is selected, PolarDB directly
executes the cached plan that is hit. If the adaptive plan caching strategy is selected,
PolarDB collects predicate selectivity data during execution and sends the data to
the feedback module. If the SPM strategy is selected, PolarDB updates the required
feedback based on the evolution strategy.

7.4.2.2 Plan Evolution

Traditional SPM has two issues. First, after cost-based query optimization, a plan
selection module is required, which reevaluates the costs of all accepted plans with-
out considering the feedback during execution. Second, traditional SPM does not
consider the possibility of generating better execution plans under the current
workloads.
As shown in Fig. 7.11, the optimal execution plan for a parameterized query var-
ies based on the arguments passed in. When C1 > 5, the optimal execution plan is a
full table scan. When C1 > 50, the optimal plan is an index range scan. Therefore,
SPM needs to consider not only the optimal plan for a particular query but also the
proportion of optimal plans in the cached plans under the current workloads, to
optimize the overall execution time of the workloads.
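As a concrete illustration of Fig. 7.11 (the table t and column c1 are only illustrative), the following two statements are instances of the same parameterized query, yet their optimal plans differ:

SELECT * FROM t WHERE c1 > 5;   -- matches most rows: a full table scan is cheaper
SELECT * FROM t WHERE c1 > 50;  -- matches few rows: an index range scan on c1 is cheaper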
In view of this, PolarDB proposes an online evolution algorithm that uses the
SQL plan routing module and the execution feedback mechanism to ensure that
most workloads are routed to accepted plans while a small portion of the workloads
are routed to unaccepted plans through online evolution to verify these plans. From
the perspective of reinforcement learning, an agent that interacts with the database
management system can be designed. The agent takes an action by following the
policy and observes the reward. It then improves the policy based on the reward,
thereby maximizing the cumulative reward.
In the routing design of the existing online SPM strategy, only two routing
options are available: executing an accepted plan and performing regular cost-based
query optimization. This strategy is advantageous as it avoids the costs of blindly

Fig. 7.11 Parameterized query and execution plan selection



trying unaccepted plans. However, it cannot find the optimal solution for the current
workloads. Hence, considering only these two actions is not enough. Therefore, the
following terms are defined in the online SPM evolution system of PolarDB:
• Action: the options available for plan routing in SPM, including selecting an
accepted plan, selecting an unaccepted plan, and using a regular optimizer.
• State: the passed in parameterized query.
• Reward: the execution time.
• Goal: to find a policy that can minimize the overall execution time.
After an SQL statement enters the system, the query parser parameterizes the
statement and passes it to the SPM router. The router retrieves the Q-value for each
possible action from the current policy and selects an unaccepted plan with a
probability of ε to verify the performance of that plan; otherwise, with a probability
of 1 − ε, the router selects a baseline plan to improve stability, where 0 < ε < 1.
The optimizer generates a physical query plan and passes it to the execution
engine. After the plan is executed, the database management system returns the
query result to the client and triggers the evolution logic of SPM. The execution
plan and its latency (other dimensions, such as the CPU and memory overhead and
number of rows scanned, will be supported in the future) are added to the experi-
ence of the agent as the execution feedback and serve as the starting point for
Q-value iteration. During evolution in the context of SPM, the policy is improved by
using the latest experience. When sufficient statistics indicate that an unaccepted
plan is clearly better than the corresponding baseline plan, the baseline is updated.
Each query sent by the client undergoes parameterization and is routed to an
action. Then, the execution results are collected and used as experience data. The
above process is executed multiple times for a query, forming a feedback and cor-
rection loop. In the exploration phase of evolution in the context of SPM, if a plan
has a higher latency than expected, the agent learns how to assign a lower weight to
the plan to reduce its chance of being selected.
The online SPM evolution technology integrates a lightweight reinforcement
learning framework, which improves router decisions and facilitates evolution toward
the optimal plan based on the execution feedback obtained. In the long run, selecting
the execution plan that is expected to achieve the maximum benefit alleviates the
problem of a suboptimal execution plan becoming a baseline plan. Test results for balanced, skewed,
and variable workloads show that compared with traditional plan management frame-
works, the online execution plan evolution technology of PolarDB can correctly
obtain the optimal plan and consequently adapt efficiently to various workloads.

7.4.3 Vectorized Execution in PolarDB

7.4.3.1 Vectorized Execution

Traditional execution engines that use the tuple-at-a-time (TAT) approach cannot
fully utilize the features of modern processors, such as SIMD operations, data
prefetching, and branch prediction. Vectorized execution and compiled execution

are two commonly used acceleration solutions for database execution engines. This
section describes vectorized execution in PolarDB.
Vectorized execution can reuse the pull-based Volcano model. However, the
Next() function of each operator in the Volcano model is replaced with the corre-
sponding NextBatch() function, which returns a batch of data (such as 1024 rows)
each time instead of one row of data. Vectorized execution has the following
advantages:
• The number of virtual function calls, especially those for expression evaluation,
is lower than that in the Volcano model.
• Data is processed by using a batch or chunk as the basic unit, and the data to be
processed is continuously stored, greatly improving the hit rate of the modern
CPU cache.
• Multiple rows of data (usually 1024 rows) are processed at the same time by
operators, fully leveraging the SIMD technology.
The PolarDB optimizer determines whether to use vectorized execution based on
the estimated cost and characteristics of the operator. However, some operators,
such as sort operators and hash operators, cannot benefit from vectorized execution.
Therefore, PolarDB supports hybrid execution plans. A hybrid execution plan
allows vectorized and nonvectorized operators to coexist.

7.4.3.2 Vectorized Execution Framework

Figure 7.12 shows the vectorized execution framework implemented in the PolarDB
architecture. The underlying layer of PolarDB is a row-oriented storage layer that
provides a batch read interface to the upper layer. The batch read interface can
return multiple rows of data at the same time. When the execution layer receives the
data, it will convert the data into a columnar layout in memory, which facilitates
access from the upper layer.

Fig. 7.12 Vectorized execution framework in PolarDB

A vector represents multiple data records (which are usually contiguous) from
the same column. In the execution state, each vector is bound to a fixed position
in the columnar layout in memory based on the column information and partici-
pates in subsequent calculations without additional data materialization opera-
tions. After the data of all vectors is processed, PolarDB will call the batch read
interface again to fetch data. With vectors as the basic operation unit, the system
needs to support vectorized expressions and operators. Additionally, a vectorized
expression must support processing multiple data records at the same time. To
ensure compatibility and reduce dependency on hardware, PolarDB enhances the
existing expression framework and introduces the for loop to facilitate accelera-
tion, leaving more compilation and optimization work to the compiler. PolarDB
now supports vectorized table scans, vectorized filtering operations, and vector-
ized hash joins. The following two operations in a vectorized hash join are opti-
mized: key extraction and insertion in the build phase and key search in the
probe phase.

7.4.3.3 Parameter Settings for Vectorized Execution

PolarDB provides multiple system variables to control the implementation of vec-


torized execution. Table 7.2 describes the parameter for enabling vectorized execu-
tion, Table 7.3 describes the parameter for controlling the vector size, and Table 7.4
describes the parameter for specifying whether to display vectorization information.
The variables are described as follows:

Table 7.2 vectorized_execution_enable
System variable: vectorized_execution
Scope: Global and session
Dynamic: Yes
Type: Boolean
Default value: Off

Table 7.3 vector_execution_batch_size
System variable: vectorized_execution_batch_size
Scope: Global and session
Dynamic: Yes
Type: Integer
Default value: 1024
Value range: [1, 1024]

Table 7.4 vectorized_explain_enabled
System variable: vectorized_execution_explain
Scope: Global and session
Dynamic: Yes
Type: Integer
Default value: Off

1. vectorized_execution_enable specifies whether to enable vectorized execution.


2. vector_execution_batch_size controls the size of each vector, which is 1024 by
default.
3. vectorized_explain_enabled specifies whether to display vectorization informa-
tion when an EXPLAIN statement is executed.
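As a usage sketch, these variables can be set at the session level and combined with EXPLAIN. The variable names below follow the system-variable rows of Tables 7.2 to 7.4 (the names in the table captions and in the list above are not fully consistent, so verify the exact names and accepted values against the PolarDB version in use); the table t1 and its columns are hypothetical:

SET vectorized_execution = ON;               -- enable vectorized execution for this session
SET vectorized_execution_batch_size = 1024;  -- vector (batch) size, up to 1024 rows
SET vectorized_execution_explain = ON;       -- show vectorization information in EXPLAIN output
EXPLAIN SELECT SUM(a) FROM t1 WHERE b > 100;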

Chapter 8
Integration of Cloud-Native
and Distributed Architectures

In some cases, a single cloud-native database instance encounters performance bot-


tlenecks and requires MPP (massively parallel processing) to accelerate queries.
This chapter outlines the basic principles of distributed databases in conjunction
with cloud-native technologies and elaborates how PolarDB-X integrates cloud-­
native and distributed technologies to leverage the computing power of multi-
ple nodes.
Cloud-native databases feature excellent scalability at the storage layer and sup-
port the one-writer, multiread architecture. They are also applicable to most data-
base scenarios. However, in some extreme situations, a single cloud-native database
instance may experience performance bottlenecks. For example, the core transac-
tion databases of many Internet businesses can experience peak traffic reaching mil-
lions of transactions per second, far exceeding the processing capacity of a single
physical node. Additionally, for HTAP (hybrid transactional/analytical processing)
scenarios, a single machine may be inadequate to meet the processing speed require-
ment. Therefore, MPP is introduced to accelerate query processing. This way, dis-
tributed technologies can be introduced in the preceding scenarios to fully utilize
the computing power of multiple nodes.

8.1 Basic Principles of Distributed Databases

A distributed database is a logically unified database management system that


connects multiple physical database nodes together by using a computer network.
Compared with a single-node database, a distributed database has better scalabil-
ity and supports node addition to improve the overall computing and storage per-
formance without being limited by the hardware specification of a single
physical node.


8.1.1 Architecture of Distributed Databases

Figure 8.1 shows the two typical architectures available for distributed databases,
namely, integrated architecture and compute-storage-separated architecture.

8.1.1.1 Integrated Architecture

In the integrated architecture, each node serves as a compute node and a storage
node, and data is distributed across multiple nodes. Each node in the cluster can
provide services externally. If the data accessed by the client is not on the current
node, the current node communicates with other nodes to request the corresponding
data. Distributed databases that adopt the integrated architecture include
Postgres-XL, OceanBase, and CockroachDB.

8.1.1.2 Compute-Storage-Separated Architecture

In the compute-storage-separated architecture, a compute node is responsible for


SQL parsing and optimization, whereas a storage node is responsible for data stor-
age. When a compute node receives a user request, it obtains a physical execution
plan based on the SQL (structured query language) statement and writes data to or
reads data from the storage node on which the data resides. Several simple opera-
tions, such as filtering and aggregation, can be pushed down to a storage node.
Distributed databases that adopt the compute-storage-separated architecture include
PolarDB-X, TiDB, and Apache Trafodion. Many database middleware products,
such as Alibaba Cloud Distribute Relational Database Service (DRDS) and open-­
source ShardingSphere, also fall into this category.
The compute-storage-separated architecture is more flexible than the integrated
architecture. Compute nodes and storage nodes handle different workloads.
Therefore, compute nodes and storage nodes can be developed by using different
programming languages, use different models, and have different quantities during
deployment. For example, storage nodes are highly sensitive to latency and
frequently initiate system calls to read and write the disk. Therefore, storage nodes
can be developed by using a system programming language, such as C or C++.
These nodes also have high requirements for disk I/O performance during
deployment.

Fig. 8.1 Two typical architectures for distributed databases



The integrated architecture outperforms the compute-storage-separated
architecture in processing local queries that do not involve remote data access. To
ensure that user queries hit local data whenever possible, a lightweight
partition-awareness feature is typically introduced into the load balancer or proxy
node at the forefront of the cluster. This feature routes queries to the node on which
the data is located as far as possible, thereby reducing unnecessary RPC overheads.

8.1.2 Data Partitioning

In a distributed database, a logical relational table is horizontally split into multiple


physical partitions (also known as shards) according to specific rules through data
partitioning. These partitions can be distributed across multiple physical nodes, as
shown in Fig. 8.2. During data access, the physical partition in which the data is
located can be calculated based on the partitioning rules to retrieve the correspond-
ing data.
Common partitioning rules include hash partitioning and range partitioning.

8.1.2.1 Hash Partitioning

In hash partitioning, hash values are calculated based on the partition key, and the
partition in which a row of data is located is calculated by using the mod(hash value,
N) operation, where N represents the total number of partitions. For example,
assuming N = 4, the partitions can be defined as follows:

Fig. 8.2 Data partitioning in a distributed database



• HASH(partition_key) = 0 → Partition 0
• HASH(partition_key) = 1 → Partition 1
• HASH(partition_key) = 2 → Partition 2
• HASH(partition_key) = 3 → Partition 3
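A minimal sketch of this scheme in MySQL-style DDL is shown below (the table and column names are illustrative, and the DDL syntax of a specific distributed database such as PolarDB-X may differ). MySQL's HASH partitioning places each row in partition MOD(user_id, 4), a simple instance of the modulo rule described above; KEY partitioning additionally applies an internal hash function to the column:

CREATE TABLE orders (
    order_id BIGINT NOT NULL,
    user_id  BIGINT NOT NULL,
    amount   DECIMAL(10, 2),
    PRIMARY KEY (order_id, user_id)   -- the partitioning column must be part of every unique key
) PARTITION BY HASH (user_id) PARTITIONS 4;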
Consistent hashing [1] is usually performed in hash partitioning to calculate hash
values. When a data node needs to be added or removed, the change in the total
number N of data nodes will cause extensive data redistribution if a regular hash
function is used; consistent hashing can ensure minimal data movement. In consis-
tent hashing, data nodes are mapped through hashing onto a large ringed space
called a hash ring. Then, the data to be stored is hashed based on the partition key
and mapped to a position on the hash ring. The data is stored on the first data node
that comes after the mapped-to position in the clockwise direction along the hash
ring. As shown in Fig. 8.3, data mapped to a position indicated by a specific color is
stored on the node with the same color. For example, data mapped to a position in
blue is stored on the node in blue. This design ensures that when a node is removed
or added, only data mapped to a position between the changed node and its next
node in the clockwise direction along the hash ring needs to be migrated.

8.1.2.2 Range Partitioning

Range partitioning maps data to several partitions based on the range to which the
value of the partition key belongs. For example, if the partition key is of the integer
data type, the partitions can be defined as follows:
• partition_key <= 10,000 → Partition 0
• 10,000 < partition_key <= 20,000 → Partition 1
• 20,000 < partition_key <= 30,000 → Partition 2
• 30,000 < partition_key → Partition 3.
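A corresponding sketch in MySQL-style range-partitioning DDL is shown below (names are illustrative; MySQL's VALUES LESS THAN bounds are exclusive, so each boundary is shifted by one to approximate the ranges above):

CREATE TABLE orders_by_range (
    order_id BIGINT NOT NULL,
    amount   DECIMAL(10, 2),
    PRIMARY KEY (order_id)
) PARTITION BY RANGE (order_id) (
    PARTITION p0 VALUES LESS THAN (10001),    -- order_id <= 10,000
    PARTITION p1 VALUES LESS THAN (20001),    -- 10,000 < order_id <= 20,000
    PARTITION p2 VALUES LESS THAN (30001),    -- 20,000 < order_id <= 30,000
    PARTITION p3 VALUES LESS THAN (MAXVALUE)  -- order_id > 30,000
);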

Fig. 8.3 Principle of consistent hashing

Hash partitioning and range partitioning both support dynamically changing the number of nodes to scale the storage layer out or in. With consistent hashing, hash partitioning can migrate some data from each existing partition to a new partition, or migrate the data of a removed partition to the remaining partitions. In range partitioning, a large partition can be further split, or two adjacent small partitions can be merged.
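For illustration, MySQL-style DDL supports such maintenance on the range-partitioned table sketched earlier (the partition names p1, p1a, and p1b are illustrative); a large partition can be split, and adjacent partitions can be merged, with REORGANIZE PARTITION:

-- Split partition p1 (10,000 < order_id <= 20,000) into two smaller partitions.
ALTER TABLE orders_range REORGANIZE PARTITION p1 INTO (
  PARTITION p1a VALUES LESS THAN (15001),
  PARTITION p1b VALUES LESS THAN (20001)
);

-- Merge the two adjacent partitions back into one.
ALTER TABLE orders_range REORGANIZE PARTITION p1a, p1b INTO (
  PARTITION p1 VALUES LESS THAN (20001)
);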
Hash partitioning and range partitioning have respective advantages and disad-
vantages, as summarized in Table 8.1.
Several distributed databases are designed to support a specific partitioning
scheme. For example, YugaByte supports hash partitioning, whereas TiDB supports
range partitioning. Other databases, such as PolarDB-X and OceanBase, support
multiple partitioning schemes and allow users to choose the most appropriate parti-
tioning scheme as needed.

8.1.3 Distributed Transactions

ACID (atomicity, consistency, isolation, and durability) transactions are an important feature of relational databases. In distributed databases, data is distributed across multiple nodes, which inevitably necessitates the introduction of distributed transactions to ensure the ACID properties. For the definition of ACID transactions and the implementation principles of transactions in a standalone database, see Sect. 1.3.3 of this book.
Various implementation models are available for distributed transactions. The
following section describes several representative models.

8.1.3.1 XA Protocol

XA is a two-phase commit (2PC) protocol that defines two main roles: the resource manager (RM) and the transaction manager (TM). A resource manager is usually a physical node of the database, whereas the transaction manager is also known as the transaction coordinator. The XA protocol also specifies the interaction interfaces between a transaction manager and a resource manager, such as XA_START, XA_END, XA_COMMIT, and XA_ROLLBACK.
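As a concrete illustration, the sketch below shows how these interfaces surface as SQL statements in MySQL's XA implementation; the transaction identifier 'xa_trx_1' and the table are placeholders, and the transaction manager would drive the same sequence on every participating resource manager:

XA START 'xa_trx_1';
UPDATE account SET balance = balance - 7 WHERE user_id = 1;
XA END 'xa_trx_1';
XA PREPARE 'xa_trx_1';
-- Phase 2: commit if every participant prepared successfully;
-- otherwise roll back with XA ROLLBACK 'xa_trx_1';
XA COMMIT 'xa_trx_1';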

Table 8.1 Comparison between hash partitioning and range partitioning

Autoincrement partition key:
• Hash partitioning: New data is evenly distributed across all partitions
• Range partitioning: New data is aggregated on only one or a few partitions, resulting in write hotspots on these partitions

Range query on partition keys:
• Hash partitioning: All partitions are scanned and sorted, resulting in poor performance
• Range partitioning: Only partitions in the query range are scanned, achieving better performance

Currently, the XA protocol is implemented on most mainstream commercial databases, such as Oracle and MySQL. Database middleware also implements distributed transactions based on the XA protocol.
The 2PC protocol is designed based on the following idea: The execution of a dis-
tributed transaction involves multiple nodes, and each node is aware of its execution
status but not of the execution status of other nodes. In this case, the 2PC protocol
introduces a coordinator (i.e., the transaction manager in the XA protocol). The nodes
participating in the transaction feed back the operation results to the coordinator, and
the coordinator determines whether to commit or abort the transaction based on the
results from all nodes. In essence, the 2PC protocol ensures strong data consistency by
implementing modifications to all replica data in an all-or-nothing manner.
The 2PC protocol has the following shortcomings:
• Synchronous blocking: All participating nodes are synchronously blocked. For
example, when a public resource is occupied by a participant, a third-party node
that applies for the resource can be blocked.
• Single point of failure (SPOF): If the coordinator fails, all participants will be
blocked.
• Data inconsistency: In the second phase of the 2PC protocol, if the coordinator
or a participant fails, data inconsistency may occur. For example, if the coordina-
tor fails to send a commit message to participants, some participants may receive
the commit message while others do not, resulting in data inconsistency.
To address the SPOF issue of the transaction manager and the data inconsistency
issue in the commit phase, PolarDB-X introduces the concepts of transaction log
table and commit point to the XA protocol implementation. As shown in Fig. 8.4,
when the prepare phase ends, a transaction commit record is added to the global
transaction log table as a commit point. When the transaction manager fails, a new
transaction manager is selected to continue the two-phase commit. The new transac-
tion manager restores the transaction status or rolls back the transaction based on
whether a commit point record exists in the primary database.
• If no commit point exists, no resource manager has entered the commit phase.
All resource managers can be safely rolled back.
• If a commit point exists, all resource managers have completed the prepare phase
and can proceed to the commit phase.

Fig. 8.4 Implementation of the XA protocol



8.1.3.2 Percolator

In 2010, Google engineers proposed the Percolator transaction model [2] to solve
the atomicity issue in incremental index construction. Percolator is built based on
BigTable, a distributed wide-column storage system based on the key-value model.
Percolator achieves cross-row transaction processing capabilities at the snapshot
isolation level through row-level transactions and a multiversioning mechanism
without changing the internal implementation of BigTable.
Percolator introduces the timestamp oracle (TSO), which is a global clock, allo-
cates start and commit timestamps that monotonically increase for global transac-
tions, and performs visibility judgment based on the timestamps. To support 2PC,
Percolator adds two columns, Lock and Write, in addition to the original data. The
Lock and Write columns are respectively used to store lock information and map-
pings between transaction commit timestamps and data timestamps. Percolator
ensures transaction consistency and isolation based on such information. Figure 8.5
shows the implementation of the TSO.
The detailed process of the above example is as follows:
• Initial state: At this point, Bob and Joe, respectively, have $10 and $2 in their accounts. The Write column indicates that the timestamp of the latest data version is 5.
• Prewrite and locking: A transaction that requires transferring $7 from Bob’s
account to Joe’s account is initiated. This transaction involves multiple rows of
data. Percolator randomly selects one main record row and adds a main record
lock on the main record. In this case, a main record lock is written to Bob’s
account at Timestamp 7, and the value of the Data column is 3 (10–7). Lock
information that includes a reference to the main record lock is written to Joe’s
account at Timestamp 7, and the value of the Data column is 9 (2 + 7).

Fig. 8.5 Implementation of the TSO



• Commit of the main record: A row with a timestamp of 8 is written to the Write
column, indicating that the data at Timestamp 7 is the latest data. Then, the lock
record is deleted from the Lock column to release the lock.
• Commit of other records: The operation logic is the same as that of committing
the main record.
A transaction is considered successful after Percolator commits the main record.
Remedies are still available even if other records fail to be committed. However,
such exceptions are handled only for read operations because Percolator imple-
ments a decentralized two-phase commit and does not have transaction managers
like the XA protocol. The method for handling such exceptions is to search for the
main record based on the lock in the abnormal record. If the main record lock exists,
the transaction is not completed. If the main record lock has been cleared, the record
can be committed and becomes visible.
The Percolator model favors writes over reads. In a write transaction, the commit decision is first persisted in the main record and then asynchronously applied to the other participants, which prevents multiple participants from waiting on one another. For reads, however, this asynchronous commit means that locks on the secondary records may be held for longer; if the main record fails to commit in the commit phase, the other participants become unusable.

8.1.3.3 Omid

Omid [3] is a transaction processing system developed by Yahoo! for Apache HBase, which is based on the key-value model. Compared with Percolator's locking method, Omid uses an optimistic approach, and its architecture is relatively simple and elegant. In recent years, several papers on Omid have been published at ICDE, FAST, and PVLDB.
The designers of Omid argue that although Percolator's lock-based approach simplifies transaction conflict checking, handing the transaction processing logic over to the client can leave lingering, uncleared locks that block other transactions when a client fails. In addition, maintaining the additional Lock and Write columns incurs significant overheads. In the Omid solution, the central node is solely responsible for determining whether a transaction can commit, which greatly strengthens the role of the central node. At commit time, validation is performed on the write set of the transaction to check whether the transaction-related rows have been modified during the transaction's execution period and thus whether a conflict exists.

8.1.3.4 Calvin

First proposed in 2012, Calvin [4] adopts a deterministic database concept that is unorthodox compared with traditional databases. Calvin globally presorts transactions and then executes them in the sorted order.

Table 8.2 Comparison between several transaction models

XA
• Data model: any
• Concurrency control scheme: 2PL (pessimistic)
• Isolation level: all
• Limitation: adding a read lock will result in performance degradation

Percolator
• Data model: key-value
• Concurrency control scheme: locking (pessimistic) and MVCC
• Isolation level: snapshot isolation

Omid
• Data model: key-value
• Concurrency control scheme: conflict detection (optimistic) and MVCC
• Isolation level: snapshot isolation

Calvin
• Data model: any
• Concurrency control scheme: deterministic database
• Isolation level: serializable
• Limitation: only one-shot transactions are supported

This mode assumes that a globally ordered transaction log exists and that the multiple partitions in a distributed system process data in their local shards in strict accordance with the global transaction log, ensuring the consistency of the processing results of all shards.
The Calvin model requires all transactions to be "one-shot" transactions, in which the entire transaction logic is executed at once by calling a stored procedure. However, common transactions are interactive: the client executes several statements in succession before finally issuing the COMMIT instruction. Therefore, the Calvin model is applicable only to specific fields. The commercial database VoltDB borrows Calvin's deterministic database concept. VoltDB is an in-memory database that is designed for high-throughput and low-latency scenarios and is widely used in the IoT and financial fields.
The foregoing typical transaction models have respective advantages and disad-
vantages, as presented in Table 8.2.

8.1.4 MPP

In the early days, relational databases were limited by the I/O capabilities of computers. Computation took up only a small portion of the total processing time of a query, so optimizing the executor had little impact on overall performance. With the rapid development of hardware and the increasing maturity of distributed technologies, accelerating and optimizing executors that process large amounts of data have become increasingly important.
With the emergence of multiprocessor hardware, executors gradually evolved
toward the symmetric multiprocessing (SMP) architecture for standalone parallel
computing to fully utilize the multicore capability to accelerate computation.
However, an executor of the SMP architecture has poor scalability and can utilize
the resources of only one SMP server during computation. As the amount of data to
be processed increases, the disadvantage of poor scalability becomes more apparent.
In MPP (massively parallel processing), multiple nodes in a distributed database cluster are interconnected over a network and collaboratively compute the query results.

Fig. 8.6 Principle of MPP

Fig. 8.7 Implementation of MPP

Figure 8.6 shows the principle of MPP. Compared with SMP, MPP can utilize the computing power of multiple nodes to accelerate complex analytical queries and overcome the limitations of the hardware resources (such as CPU and memory) of a single physical node.
When a query is executed in MPP mode, the SQL execution plan is distributed to
multiple nodes. Multiple instances are allocated to each operator to handle a portion
of the data. For example, a join operation needs to partition data by using the join
key as the partition key. Before the join operator is executed, the exchange operator
needs to shuffle data on two sides of the join. After that, the data in each partition
can be joined separately. The implementation of MPP is shown in Fig. 8.7.

8.2 Distributed and Cloud-Native Architectures

Compute-storage separation is one of the technical characteristics of cloud-native databases. Almost all cloud-native databases adopt the compute-storage-separated architecture. This design thoroughly transforms databases for cloud scenarios, enabling compute and storage nodes to be scaled independently while reducing costs. This section focuses on the storage architectures of cloud-native databases and the advantages and limitations of these storage forms.

Fig. 8.8 Storage architectures of cloud-native databases
Two storage architectures are available for cloud-native databases: shared stor-
age architecture and shared-nothing architecture. As shown in Fig. 8.8, the shared
storage architecture provides a centralized data access interface for the upper layer,
and the compute nodes in the shared-nothing architecture need to identify the stor-
age shard in which the data to be accessed is located.

8.2.1 Shared Storage Architecture

Representative databases that use the shared storage architecture include Amazon
Aurora and Alibaba Cloud PolarDB. Aurora for MySQL transforms the write path
of MySQL by replacing the original stand-alone storage based on local disks with a
multi-replica and scalable distributed storage, thereby improving system availabil-
ity and scalability while enhancing performance. This achieves complete compati-
bility with open-source databases by simply transforming the storage module in an
existing database.
For the storage layer, the shared storage architecture usually adopts a multirep-
lica mechanism to enhance high availability. Replica consistency protocols, such as
Quorum and Paxos, have been applied in various cloud-native database systems.
The shared storage architecture provides a centralized data access interface for
upper-layer compute nodes. This way, the compute nodes do not need to be con-
cerned with the actual distribution of data in storage or with the load balancing of
data distribution.
By using the shared storage architecture, cloud service providers can pool disk
resources and allow multiple users to share a distributed storage cluster and use
resources in a pay-as-you-go fashion. Taking Aurora for MySQL as an example, the
storage cost is $0.1 per GB per month, and users do not need to preplan capacity
when creating instances and only need to pay for the actual capacity used.
As shown in Fig. 8.8a, compute nodes are categorized into the RW (primary)
node and RO nodes. This categorization is necessary because although the storage

layer has been transformed into a distributed architecture, the computing layer
(including the transaction management and query processing modules) retains the
standalone structure, and the concurrent transaction processing capability (write
throughput) is limited by the performance of a single node. The shared storage
architecture enables elastic scalability for the computing and storage layers. However, in this architecture, only read-only nodes can be added to share the read workloads; the write performance is still bottlenecked by the processing capacity of a single node, which constitutes a serious performance bottleneck. In addition, although the industry favors the scalability of the shared storage architecture, in practical engineering implementations the storage capacity of an entire cluster is usually limited to dozens or hundreds of terabytes.

8.2.2 Shared-Nothing Architecture

With the rise of NewSQL [5] databases in recent years, the shared-nothing architec-
ture has attracted increasing attention. In the shared-nothing architecture, each node
is an independent process that does not share resources with other nodes and com-
municates and exchanges data with other nodes through network RPCs. This sec-
tion describes distributed databases that use the shared-nothing architecture.
Cloud Spanner launched by Google is a typical representative of distributed
cloud databases that feature impeccable scale-out and high-availability capabilities.
Compared with the shared storage architecture, the shared-nothing architecture is
more advantageous in terms of the scalability of the computing layer. In the shared-­
nothing storage architecture, each node is an independent process that does not
share resources. In addition, the computing layer and the storage layer can be hori-
zontally scaled by simply adding more nodes. For a stateless computing layer, new
nodes can be started in seconds by using the container technology.
The shared-nothing storage architecture divides data into shards, enabling hori-
zontal scaling of compute nodes. However, the storage layer of the shared-nothing
architecture is disadvantageous compared with that of the shared storage architec-
ture in two aspects.
The first is high costs. In addition to the migration costs of replicating data, users
need to consider the storage costs. By default, the high-efficiency cloud disks
mounted to cloud hosts in the shared-nothing architecture implement three-replica
high availability. Combining the three-replica high-availability implementation and
the traditional database three-replica and virtualization technologies yields nine
(3 × 3) replicas in the system, which results in a waste of storage space. The design
philosophy of the shared storage architecture is to push down the implementation of
three replicas to the storage layer, which is more economically feasible than that of
the shared-nothing storage architecture.
Second, the storage layer has poor elasticity. The horizontal scaling of the stor-
age layer is more complicated. New nodes need to copy data from the original nodes

and can provide services externally only after the data is synchronized between the
new nodes and original nodes. This process is not only time-consuming but also
occupies the I/O bandwidth of existing nodes. Moreover, the capacity needs to be
planned in advance, resulting in poorer scalability than the shared-nothing storage
architecture. In addition, scaling can only be implemented by nodes, rendering the
pay-as-you-go payment model infeasible.

8.3 Cloud-Native Distributed Database: PolarDB-X

PolarDB-X is a cloud-native distributed database independently developed by Alibaba. It adopts a compute-storage-separated architecture and is available in a
local storage edition and a shared storage edition based on the deployment mode of
the storage layer. In the local storage edition, each storage node is an independently
deployed process, and data is stored on the local disk. PolarDB-X guarantees high
availability by using the Paxos algorithm. In the shared storage edition, a distributed
storage architecture is employed. This section describes the PolarDB-X shared stor-
age edition.

8.3.1 Architecture Design

PolarDB-X combines the advantages of the shared storage and shared-nothing architectures. PolarDB-X has nearly unlimited computing and storage scalability
like the shared-nothing storage architecture and is not limited by a single RW node.
By utilizing the container technology and the advantages of shared storage, it
achieves rapid scale-outs and scale-ins in seconds and allows users to use resources
in a pay-as-you-go fashion.
PolarDB-X reuses the distributed storage technology of PolarDB. To bypass the
limitation of a single RW node, PolarDB-X introduces the multitenancy technology,
which allocates a logical database to different tenants (RW nodes) by table. Only
table owners can write to the tables; other RW nodes cannot write to the tables.
Moreover, physical partitioned tables can be allocated to different RW nodes,
thereby achieving scalability of write capabilities.
As depicted in Fig. 8.9, the architecture of PolarDB-X includes three key com-
ponents: the computing layer, storage layer, and global meta service (GMS).
The compute nodes provide the distributed SQL engine, the distributed transaction coordinator, the optimizer, the executor, and other modules.
The storage nodes support local disks and shared storage and provide the data storage engine, such as InnoDB or an Alibaba-developed storage engine. This layer implements data consistency and persistence and provides the computation pushdown feature (e.g., pushdown of the Project, Filter, Join, and Agg operators) to meet the requirements of the distributed architecture.

Fig. 8.9 Architecture of PolarDB-X

Fig. 8.10 Partitioning example in PolarDB-X

GMS provides distributed metadata, such as metadata of tables, and a TSO. GMS
can adjust data distribution based on the workload to achieve load balancing between
nodes. GMS can also manage compute nodes and data nodes, for example, putting
a node online or pulling a node offline.

8.3.2 Partitioning Schemes

PolarDB-X supports hash partitioning and range partitioning and allows users to
define table groups. Tables in the same table group have the same partition key and
partitioning scheme. This way, joins of tables in the table group can be directly pushed
down to storage nodes, as shown in Fig. 8.10. Taking an online shopping business as
an example, the user and orders table can be added to the same table group and parti-
tioned by using user IDs as the hash partition key. When a transaction queries all
orders of a user, all data that needs to be joined in the distributed transaction is located

on the same physical node. Therefore, the query can be pushed down to a storage node
and considered a standalone transaction, thereby achieving higher performance.
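The following sketch illustrates this co-location idea with hypothetical table definitions; both tables are partitioned by user_id in the same way (PolarDB-X additionally allows them to be placed in the same table group, whose exact DDL syntax is omitted here), so the join below can be pushed down to a single storage node:

CREATE TABLE user (
  user_id BIGINT NOT NULL,
  name    VARCHAR(64),
  PRIMARY KEY (user_id)
) PARTITION BY HASH (user_id) PARTITIONS 4;

CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id  BIGINT NOT NULL,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id, user_id)
) PARTITION BY HASH (user_id) PARTITIONS 4;

-- All rows with the same user_id land in the same partition, so this query
-- touches only one storage node and can run as a standalone transaction.
SELECT o.order_id, o.amount
FROM user u JOIN orders o ON u.user_id = o.user_id
WHERE u.user_id = 42;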

8.3.3 GSIs

In addition to secondary indexes within partitioned tables, PolarDB-X supports creating global secondary indexes (GSIs) for logical tables. When users create
tables, they often expect support for multiple partitioning schemes. However,
only one partitioning scheme can be specified, which fails to meet all require-
ments. GSIs provide users with an additional partitioning dimension.
For example, a user table typically uses the user ID column as the primary key
and partition key. However, when a user logs on with a mobile number, a query
needs to be performed by using the mobile number field as the filtering condition.
When the mobile number field is used as the index key, the corresponding data
may exist on all nodes. Without a GSI, the database needs to traverse all parti-
tioned tables and use the local index of each partitioned table to find the corre-
sponding user; this is costly. In PolarDB-X, a global secondary index can be
created for the mobile number field. This way, the database can find the corre-
sponding user without traversing the partitions, as shown in Fig. 8.11.
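A hedged sketch of what this looks like in DDL follows; the table, index, and column names are hypothetical, and the exact PolarDB-X clauses for global secondary indexes (such as covering columns or the index partitioning specification) may differ from this simplified form, so consult the product documentation for the precise syntax:

-- Create a global secondary index on the mobile-number column of a user table
-- that is itself partitioned by user_id.
CREATE GLOBAL INDEX g_i_mobile ON user_info (mobile);

-- A point query on the mobile number can now locate the row through the GSI
-- instead of scanning every partition of user_info.
SELECT user_id, name FROM user_info WHERE mobile = '13800000000';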
GSIs also face several issues, mainly due to their complex architecture and
implementation. For example, when the indexed data and main table data are
located on different machines, data consistency and performance issues may arise.
Nevertheless, these deficiencies are a fair exchange for the convenience GSIs offer.

8.3.4 Distributed Transactions

For a transaction that involves multiple partitions, different partitioned tables may
be located on different RW nodes. In this case, the transaction needs to be imple-
mented as a distributed transaction to ensure the ACID properties. PolarDB-X sup-
ports TSO-based global MVCC transactions.

Fig. 8.11 Sample GSI in PolarDB-X



The implementation of MVCC transactions requires a global clock to sequence the transactions. The global clock requires a strategy to generate monotonically
increasing timestamps for global transactions. Common strategies include TrueTime,
hybrid logical clock (HLC), and TSO. PolarDB-X uses the TSO strategy to main-
tain a global time server to generate strictly monotonically increasing timestamps.
Given that all timestamps come from the same global time server, the strict order of
all timestamps can be ensured.
A timestamp is represented in the physical clock + logical clock format, with the
physical clock accurate to milliseconds, as shown in Table 8.3.
When a transaction starts, a compute node obtains a transaction start timestamp
(start_ts), also known as a snapshot timestamp (snapshot_ts), from the TSO server.
This timestamp is used in all read requests in the transaction. The storage node finds
the corresponding data record version based on the snapshot timestamp to ensure
that a consistent view can always be read in the transaction.
When the transaction is committed, the compute node acts as a coordinator and
initiates a two-phase commit to the storage node. After the prepare phase and before
the commit phase, the compute node obtains the transaction commit timestamp
(commit_ts) from the TSO server. This timestamp will be used as the version of all
data records written in the transaction. This way, an atomic state can be restored
regardless of whether the compute node or the storage node is faulty (Fig. 8.12).

8.3.5 HTAP

The concept of HTAP (hybrid transactional/analytical processing) was first used by Gartner in a 2014 report to describe a new application framework that breaks down the barrier between OLTP (online transaction processing) and OLAP (online analytical processing). This architecture can be applied in both transactional and analytical database scenarios.
Table 8.3 Timestamp format
• Physical clock: 42 bits
• Logical clock: 16 bits
• Reserved bits: 6 bits

Fig. 8.12 2PC process



HTAP not only avoids complicated extract, transform, and load (ETL) operations but also enables faster analysis of the latest data.
PolarDB-X provides the intelligent routing feature for HTAP. In addition,
PolarDB-X supports the processing of HTAP loads, with guaranteed low latency in
transactional processing and full utilization of computing resources in analytical
processing, and ensures strong data consistency. The optimizer of PolarDB-X ana-
lyzes the consumption of core resources, such as CPU, memory, I/O, and network
resources, for each query based on the costs and categorizes requests into OLTP
requests and OLAP requests.
PolarDB-X routes OLTP requests to the primary replica for execution, achieving
lower latency than the traditional read-write separated solution.
The compute nodes of PolarDB-X support MPP. The query optimizer automati-
cally identifies a complex analytical SQL query as an OLAP request and executes
the request in MPP mode. In other words, the optimizer generates a distributed plan
that is to be executed across multiple nodes.
To better isolate resources and prevent analytical queries from affecting OLTP
traffic, PolarDB-X allows users to create independent read-only clusters. The com-
pute nodes and storage nodes in a read-only cluster are deployed on physical hosts
that are different from those of the primary cluster. Through intelligent routing,
users can transparently use PolarDB-X to handle OLTP and OLAP loads.

References

1. Karger D, Lehman E, Leighton T, et al. Consistent hashing and random trees: distributed cach-
ing protocols for relieving hot spots on the world wide web. In: Proceedings of the Twenty-­
Ninth Annual ACM Symposium on Theory of Computing; 1997. p. 654–63.
2. Peng D, Dabek F. Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI 10); 2010.
3. Bortnikov E, Hillel E, Keidar I, et al. Omid, reloaded: Scalable and highly-available transac-
tion processing. In: 15th USENIX Conference on File and Storage Technologies (FAST 17);
2017. p. 167–80.
4. Thomson A, Diamond T, Weng SC, et al. Calvin: fast distributed transactions for partitioned
database systems. In: Proceedings of the 2012 ACM SIGMOD International Conference on
Management of Data; 2012. p. 1–12.
5. Pavlo A, Aslett M. What's really new with NewSQL? ACM SIGMOD Rec. 2016;45(2):45–55.
Chapter 9
Practical Application of PolarDB

PolarDB is a next-generation cloud-native relational database independently developed by the Alibaba Cloud. It has three independent engines that are, respectively,
100% compatible with MySQL, 100% compatible with PostgreSQL, and highly
compatible with the Oracle syntax. This chapter takes PolarDB as an example to
describe how to create instances on the cloud, access cloud databases, perform basic
operations, and migrate cloud data.

9.1 Creating Instances on the Cloud

The construction of self-managed databases involves server procurement and database software installation and deployment. This requires significant manpower and
material resources. However, cloud database instances can be created within several
minutes by configuring and submitting several parameter settings.

9.1.1 Related Concepts

• Instance: An instance is a virtualized database server. Users can create and man-
age multiple databases within an instance.
• Series: When you create a PolarDB instance, you can select a series (e.g., the
Cluster Edition or Single Node Edition) that suits your business needs.
• Cluster: PolarDB mainly adopts a cluster architecture, which consists of a pri-
mary node and multiple read-only nodes.


• Specification: The resource configuration of each node, such as 2 CPU cores and
8 GB of memory.
• Region: A region is a physical data center. In general, PolarDB instances are
located in the same region as Elastic Compute Service (ECS)1 instances to
achieve optimal access performance.
• Availability zone: An availability zone (or “zone” for short) is a physical area in
a region with independent power and network. There is no substantial difference
between different zones in the same region.
• Database engine: PolarDB has three independent engines that are, respectively,
100% compatible with MySQL, 100% compatible with PostgreSQL, and highly
compatible with the Oracle syntax.

9.1.2 Prerequisites

Register for an Alibaba Cloud account or obtain a RAM account assigned by the
administrator of an Alibaba Cloud account. Then, log on to the PolarDB console
and navigate to the PolarDB purchase page.

9.1.3 Billing Method

• Subscription: In this billing method, you must pay for the compute nodes when
you create a cluster. The storage space is charged by hour based on the actual
data amount, and fees are deducted from the account on an hourly basis.
• Pay-as-you-go: In this billing method, you do not need to make advance pay-
ments. Compute nodes and storage space (based on the actual data amount) are
charged by hour, and fees are deducted from the account on an hourly basis.

9.1.4 Region and Availability Zone

The region and availability zone specify the geographic location where the cluster
is located and cannot be changed after the purchase.
Note: Make sure that the PolarDB instance and the ECS instance to which the
PolarDB instance is to be connected are located in the same region. Otherwise, they
can communicate only via the Internet, which may compromise the performance.

1. ECS is a cloud server service provided by the Alibaba Cloud that is usually deployed in coordination with cloud databases to form a typical business access architecture.

9.1.5 Creation Method

Choose one of the following creation methods: (1) create a new PolarDB instance;
(2) if an RDS for MySQL instance exists, upgrade the instance to a PolarDB for
MySQL instance; or (3) create a new cluster by restoring a backup of a deleted
cluster from the recycle bin.

9.1.6 Network Type

The value is fixed to VPC and cannot be modified.


Note: Make sure that the PolarDB instance and the ECS instance to which the
PolarDB instance is to be connected are located in the same VPC. Otherwise, they
cannot communicate via the intranet.

9.1.7 Series

• Cluster Edition: The Cluster Edition is the recommended mainstream series that
offers rapid data backup and recovery and global database deployment free of charge.
This edition also provides enterprise-level features, such as quick elastic scaling and
parallel query acceleration, and thus is recommended for production environments.
• Single Node Edition: This edition is the best choice for individual users who
want to test and learn more about PolarDB. It can also be used as an entry-level
product for startup businesses.
• History Database Edition: This edition serves as an archive database and features a high data compression ratio. It is suitable for businesses that do not have high computing requirements but need to store archived data.

9.1.8 Compute Node Specification

Select the compute node specification based on your business requirements. Each
node uses exclusive resources to achieve stable and reliable performance. Each speci-
fication has corresponding CPU and memory capacities, maximum storage capacity,
maximum number of connections, intranet bandwidth, and maximum IOPS.

9.1.9 Storage Space

PolarDB adopts a compute-storage-separated architecture, and the storage capacity automatically scales elastically with the increase or decrease in the data amount. Therefore, you do not need to specify the storage capacity when you create a cluster.

Notice: The storage fees are charged by hour based on the actual data amount.
The maximum storage capacity varies based on the selected compute node
specification.

9.1.10 Creation

After you complete the payment, the cluster is created in 10–15 min. The created
cluster is displayed on the Clusters page.
Notice: Make sure that you selected the region where the cluster is deployed.
Otherwise, you cannot view the cluster.

9.2 Database Access

9.2.1 Account Creation

Table 9.1 describes the details of privileged and standard accounts.


Log on to the PolarDB console. In the upper-left corner of the console, select the
region where the cluster that you want to manage is deployed. Find the cluster that
you want to manage and click the cluster ID. In the left-side navigation pane, choose
Settings and Management > Accounts. Click Create Account. In the Create
Account panel, choose the type of account you want to create and specify the pass-
word for the account.

Table 9.1 Privileged and standard accounts

Privileged account:
• A privileged account can only be created and managed in the PolarDB console
• You can create only one privileged account for each cluster
• The permissions of a privileged account have been increased to allow you to implement fine-grained control on user permissions based on your business requirements. For example, you can grant different users the permissions to query different tables
• A privileged account has all permissions on all databases in the cluster
• A privileged account has the permission to disconnect any standard account

Standard account:
• A standard account can be created and managed in the PolarDB console or by using SQL (structured query language) statements
• You can create multiple standard accounts for each cluster. The maximum number of standard accounts that you can create varies based on the database engine
• Specific database permissions must be manually granted to a standard account
• A standard account cannot create or manage other accounts or disconnect the connections of other accounts

Notice: If you already created a privileged account, you cannot create another
privileged account because each cluster can have only one privileged account. You
do not need to grant permissions on databases to the privileged account because the
privileged account has all permissions on all databases in the cluster. For a standard
account, you must grant permissions on specific databases.

9.2.2 GUI-Based Access

Data Management (DMS) is a graphical data management tool provided by the Alibaba Cloud. It supports centralized management of multiple databases, includ-
ing relational and NoSQL databases, as well as various features, such as data man-
agement, structure management, user authorization, security auditing, data trends,
data tracking, business intelligence (BI) charts, and database performance evalua-
tion and optimization.
Perform the following steps to access PolarDB from DMS: Find the target cluster
and click the cluster ID to go to the basic information page of the cluster. In the
upper-right corner of the page, click Log On to Database and enter the account and
password of the database.
Note: For PolarDB for PostgreSQL and the PolarDB edition that is compatible
with Oracle, you must specify the database to which you want to log on. You can
create a database on the database management page of a PolarDB cluster. The first
time you access the instance from DMS, the system will prompt you to configure an
allowlist.

9.2.3 CLI-Based Access
9.2.3.1 Configuring an Allowlist

After you create a PolarDB cluster, you must configure an allowlist and create an
initial account for the cluster before you connect to and use the cluster.

Two Types of Allowlists

1. IP allowlist contains IP addresses that are allowed to access the cluster. The
default IP allowlist contains only the default IP address 127.0.0.1, indicating that
no device can access the cluster. Only IP addresses that have been added to the
IP allowlist can access the cluster.
2. ECS security group contains ECS instances that can access the cluster. An ECS
security group is a virtual firewall used to control inbound and outbound traffic
of ECS instances in the security group.

Note: You can configure an IP allowlist and an ECS security group for the same
cluster. The IP addresses in the IP allowlist and ECS instances in the security group
can access the PolarDB cluster.

Configuring an IP Allowlist

Log on to the PolarDB console. In the upper-left corner of the console, select the
region where the cluster that you want to manage is deployed. Find the cluster that
you want to manage and click the cluster ID. In the left-side navigation pane, choose
Settings and Management > Whitelists. On the Whitelists page, you can add an
IP allowlist or modify an existing IP allowlist.
Notice: The ali_dms_group (for DMS), hdm_security_ips (for Database
Autonomy Service [DAS]), and dtspolardb (for Data Transmission Service [DTS])
IP allowlists are automatically generated when you use the relevant services. To
ensure normal use of the services, do not modify or delete these IP allowlists.
Add the IP address of the devices that need to access the PolarDB cluster to the
allowlist. If an ECS instance needs to access the PolarDB cluster, you can view the
IP address of the ECS instance in the configuration information section on the
details page of the ECS instance and add the IP address to the allowlist.
Note: If the ECS instance and the PolarDB cluster are deployed in the same
region, such as the China (Hangzhou) region, add the private IP address of the ECS
instance to the IP allowlist. If the ECS instance and the PolarDB cluster are deployed
in different regions, add the public IP address of the ECS instance to the IP allowlist.
Alternatively, you can migrate the ECS instance to the region where the PolarDB
cluster is deployed and then add the private IP address of the ECS instance to the IP
allowlist.
If you want to connect on-premises servers, computers, or other cloud instances
to the PolarDB cluster, add their IP addresses to the IP allowlist of the cluster.

Configuring a Security Group

Log on to the PolarDB console. In the upper-left corner of the console, select the
region where the cluster that you want to manage is deployed. Find the cluster that
you want to manage and click the cluster ID. In the left-side navigation pane, choose
Settings and Management > Whitelists. Navigate to the Select Security Groups
panel, select one or more security groups, and click OK.

9.2.3.2 Obtaining Endpoints

The endpoints of a PolarDB cluster are classified into two types: cluster endpoint
and primary endpoint.

Cluster Endpoints and Primary Endpoints

Figure 9.1 compares cluster endpoints and primary endpoints, and Table 9.2 sum-
marizes the details of both endpoint types.

Public Endpoints and Private Endpoints

Cluster endpoints and primary endpoints have public endpoints for the Internet and
private endpoints for internal networks:
1. Use a private endpoint in the following scenario: If your application or client is
deployed on an ECS instance that is deployed in the same region as the PolarDB
cluster and supports the same network type as the cluster, the ECS instance can
connect to the PolarDB cluster by using a private endpoint. You do not need to
apply for a public endpoint. A PolarDB cluster achieves optimal performance
when it is connected by using a private endpoint.
2. Use a public endpoint in the following scenario: If you cannot connect to the PolarDB cluster over the internal network due to specific reasons (e.g., the ECS instance and the PolarDB cluster are located in different regions or support different network types, or you access the PolarDB cluster from a device that is not deployed on the Alibaba Cloud), you must apply for a public endpoint. Using a public endpoint compromises the security of the cluster. Exercise caution when you use a public endpoint.

Fig. 9.1 Comparison between cluster endpoints and primary endpoints



Table 9.2 Details of cluster endpoints and primary endpoints

Cluster endpoint (recommended):
• An application can access multiple nodes by connecting only to one cluster endpoint
• PolarDB PolarProxy provides the read/write splitting feature, which automatically forwards write requests to the primary node and forwards read requests to the primary node or read-only nodes based on the node loads
• Supported network types: internal network and Internet
Note: By default, a PolarDB cluster provides one cluster endpoint and allows you to create multiple custom cluster endpoints based on your business requirements. When you create a cluster endpoint, you can configure the read/write mode for the cluster endpoint and specify the nodes to which the cluster endpoint can connect

Primary endpoint:
• The primary endpoint allows you to connect to the primary node of the cluster. The primary endpoint can be used for read and write operations
• When the primary node is faulty, the primary endpoint is automatically switched to a new primary node
Table 9.3 Required information for connecting to a PolarDB cluster

Hostname/IP:
Enter a public or private endpoint of the PolarDB cluster. To view the endpoint and port information of the PolarDB cluster, perform the following steps:
1. Log on to the PolarDB console
2. In the upper-left corner of the console, select the region where the cluster is deployed
3. Find the cluster and click the cluster ID
4. On the Overview page, view the endpoint and port information

Port:
The port number in the public or private endpoint that is used to connect to the PolarDB cluster. The default port number varies based on the database edition:
• PolarDB for MySQL: 3306
• PolarDB for PostgreSQL: 1921
• PolarDB edition compatible with Oracle: 1521

Database:
You can view or create a database on the database management page. You must specify this parameter only for PolarDB for PostgreSQL and the PolarDB edition that is compatible with Oracle

Username:
The name of the account that is used to connect to the PolarDB cluster

Password:
The password of the account


9.2.3.3 Connecting to a PolarDB Cluster

You can connect to a PolarDB cluster by using an application, a client, or a CLI (command-line interface). Table 9.3 lists the required information.

9.3 Basic Operations

9.3.1 Database and Table Creation

9.3.1.1 Creating a Database

Multiple databases can be created in a PolarDB cluster. To create a database, perform the following steps: Log on to the PolarDB console. In the upper-left corner of
the console, select the region where the cluster that you want to manage is deployed.
Find the cluster that you want to manage and click the cluster ID. In the left-side
navigation pane, choose Settings and Management > Databases. Click Create
Database. In the Create Database panel, configure the database parameters.
Table 9.4 describes the parameters of PolarDB for MySQL.
Table 9.5 describes the parameters of PolarDB for PostgreSQL or the PolarDB
edition that is compatible with Oracle.
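For PolarDB for MySQL, a database can also be created with a standard SQL statement once you are connected with an account that has the required permissions; the database name below is a placeholder:

-- Create a database with the utf8mb4 character set.
CREATE DATABASE test_db DEFAULT CHARACTER SET utf8mb4;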

9.3.1.2 Creating a Table

This section describes how to create a table on the SQLConsole tab of DMS.
Log on to the PolarDB console. In the upper-left corner of the console, select the
region where the cluster that you want to manage is deployed. Find the cluster that

Table 9.4 Parameters of PolarDB for MySQL

Database name:
• The name must start with a letter and end with a letter or a digit
• The name can contain lowercase letters, digits, underscores (_), and hyphens (-)
• The name can be 2 to 64 characters in length
• The name must be unique in your PolarDB cluster

Supported character set:
Select the character set supported by the database, such as utf8mb4, UTF8, GBK, or Latin1. If you need another character set, select the desired character set from the drop-down list on the right

Authorized account:
Select the account that you want to authorize to access the database. You can leave this parameter empty and bind an account after the database is created
Note: Only standard accounts are available in the drop-down list. Privileged accounts have all permissions on all databases. You do not need to authorize the privileged account to access the database

Account permission:
Select the permission that you want to grant to the selected account. Valid values: Read&Write, ReadOnly, DMLOnly, DDLOnly, and ReadOnly&Index

Description:
Enter a description for the database to facilitate database management. The description must meet the following requirements:
• It cannot start with http:// or https://
• It must start with a letter
• It can contain uppercase letters, lowercase letters, digits, underscores (_), and hyphens (-)
• It can be 2–256 characters in length

Table 9.5 Parameters of PolarDB for PostgreSQL or the PolarDB edition that is compatible with Oracle

Database name:
• The name must start with a letter and end with a letter or a digit
• The name can contain lowercase letters, digits, underscores (_), and hyphens (-)
• The name can be 2 to 64 characters in length
• The name must be unique in your PolarDB cluster

Database owner:
The owner of the database. The owner has all permissions on the database

Supported character set:
The character set supported by the database. Default value: UTF8. You can select another character set from the drop-down list

Collate:
The rule based on which character strings are sorted

Ctype:
The type of characters supported by the database

Description:
Enter a description for the database to facilitate database management. The description must meet the following requirements:
• It cannot start with http:// or https://
• It must start with a letter
• It can contain uppercase letters, lowercase letters, digits, underscores (_), and hyphens (-)
• It can be 2–256 characters in length

you want to manage and click the cluster ID. In the left-side navigation pane, choose
Settings and Management > Databases. Find the target database and click SQL
Queries.
Notice: If the Log On to Database dialog box appears, enter the account and
password of the database. For PolarDB for PostgreSQL and the PolarDB edition
that is compatible with Oracle, you must also specify the database to which you
want to log on. You can create a database on the database management page of the
PolarDB cluster.
On the SQLConsole tab, enter the command to create a table and click Execute.
For example, execute the following command to create a table named big_table:

CREATE TABLE `big_table` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Primary key',
  `name` varchar(64) NOT NULL COMMENT 'Name',
  `long_text_a` varchar(1024) DEFAULT NULL COMMENT 'Text A',
  `long_text_b` varchar(1024) DEFAULT NULL COMMENT 'Text B',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Modify big_table';

Fig. 9.2 Test data build dialog box

9.3.2 Test Data Creation

In this example, one million rows of test data are generated in batches for the big_
table table by using the test data building feature.
Log on to the DMS console. In the instance list, click the target PolarDB instance
and double-click the target database to go to the SQLConsole tab. On the
SQLConsole tab, right-click the big_table table and choose Data Plans > Test
Data Generation. In the Test data build dialog box, configure the parameters and
then click Submit, as shown in Fig. 9.2. Then, wait for the approval result.
After the ticket is approved, DMS automatically generates and executes SQL
statements. You can view the execution progress on the Ticket Details page. After
SQL statements are executed, go to the SQLConsole tab and execute the following
command in the database to query the test data generation status:

SELECT COUNT(*) FROM `big_table`;

9.3.3 Account and Permission Management

9.3.3.1 Creating a Database Account

Table 9.6 presents the details of privileged and standard accounts.


Log on to the PolarDB console. In the upper-left corner of the console, select the region where the cluster that you want to manage is deployed. Find the cluster that you want to manage and click the cluster ID. In the left-side navigation pane, choose Settings and Management > Accounts. Click Create Account. In the Create Account panel, choose the type of account you want to create and specify the password for the account.

Table 9.6 Details of privileged and standard accounts

Privileged account:
• A privileged account can only be created and managed in the console
• You can create only one privileged account for each cluster
• The permissions of a privileged account have been increased to allow you to implement fine-grained control on user permissions based on your business requirements. For example, you can grant different users the permissions to query different tables
• A privileged account has all permissions on all databases in the cluster
• A privileged account has the permission to disconnect any standard account

Standard account:
• A standard account can be created and managed in the PolarDB console or by using SQL statements
• You can create multiple standard accounts for each cluster. The maximum number of standard accounts that you can create varies based on the database engine
• Specific database permissions must be manually granted to a standard account
• A standard account cannot create or manage other accounts or disconnect the connections of other accounts


9.3.3.2 Managing Account Permissions

Modify Account Permissions in the Console

Log on to the PolarDB console. In the upper-left corner of the console, select
the region where the cluster that you want to manage is deployed. Find the clus-
ter that you want to manage and click the cluster ID. In the left-side navigation
pane, choose Settings and Management > Accounts. Find the target account
and click Modify Permissions in the Actions column. In the dialog box that
appears, modify the permissions of authorized and unauthorized databases and
then click OK.

Modify Account Permissions by Using a Command

You can log on to the cluster to which the privileged account belongs and run the
following command to change the permissions of an account. Table 9.7 describes
the parameters in the command.

GRANT privileges ON databasename.tablename TO 'username'@'host' WITH GRANT OPTION;



Table 9.7 Parameters in the command

• privileges: The operation permissions that are granted to the account, such as SELECT, INSERT, and UPDATE. If you set this parameter to ALL, the account can perform all operations
• databasename: The name of the database on which the account has the granted permissions. If you set this parameter to an asterisk (*), the account has the granted permissions on all databases
• tablename: The name of the table on which the account has the granted permissions. If you set this parameter to an asterisk (*), the account has the granted permissions on all tables
• username: The account that you want to authorize
• host: The host from which the account can be used to log on to the database. If you set this parameter to a percent sign (%), the account can be used to log on to the database from any host
• WITH GRANT OPTION: This parameter grants the account the permission to run the GRANT command. This parameter is optional
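For example, the following statement (the database and account names are placeholders) grants the SELECT and INSERT permissions on all tables in db_test to user_test when it connects from any host:

-- Grant read and insert permissions on every table in db_test.
GRANT SELECT, INSERT ON db_test.* TO 'user_test'@'%';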

9.3.4 Data Querying

9.3.4.1 Configuring Cluster Endpoints

A PolarDB cluster consists of one primary node and at least one read-only node.
Users can connect to the PolarDB cluster by using a primary endpoint or a cluster
endpoint to perform CRUD operations. The primary endpoint is always connected
to the primary node, and a cluster endpoint is connected to all its associated nodes.
The following section describes how to configure a cluster endpoint:
Log on to the PolarDB console, go to the basic information page of the target
cluster, find a cluster endpoint, and open the Edit dialog box for the cluster endpoint.

Automatic Read/Write Splitting

Open the Edit dialog box for the cluster endpoint. Set the read/write mode of the
endpoint to Read and Write (Automatic Read-write Splitting). Then, select the
nodes that you want to add to the endpoint for handling read requests.
Notice: When the read/write mode is set to read and write, write requests are sent
to the primary node, regardless of whether the primary node is selected.
If necessary, you can disable the primary node from receiving read requests, so
that read requests are sent only to read-only nodes. This reduces the load on the
primary node and ensures the stability of the node.
In traditional databases, users need to configure the connection endpoints of the
primary node and each read-only node in the application and then split the business
logic to achieve read/write splitting (i.e., write requests are sent to the primary node
and read requests are sent to any suitable node). For PolarDB, you only need to con-
nect to a cluster endpoint for write requests to be automatically sent to the primary

node and read requests to be automatically sent to the primary node or read-only
nodes based on the node load (i.e., currently unhandled requests).

Consistency Level

Open the Edit dialog box for the cluster endpoint and configure the consistency level.
PolarDB uses an asynchronous physical replication mechanism to achieve data synchronization between the primary node and the read-only nodes. After the data of the primary node is updated, the updates are applied to the read-only nodes with a latency (usually at the millisecond level) that is related to the write pressure. Because the data on read-only nodes lags behind, the queried data may not be the most recent.
To meet the requirements for consistency levels in different scenarios, PolarDB
provides three consistency levels: eventual consistency, session consistency, and
global consistency. Leader-follower replication latency may lead to inconsistent
data queried from different nodes. To reduce the pressure on the primary node, you
can route as many read requests as possible to read-only nodes and choose the even-
tual consistency level.
Session consistency ensures that data updated before the execution of the read
request can be queried in the same session. When a connection pool is used, requests
from the same thread may be sent through different connections. For the database,
these requests belong to different sessions, but they successively depend on each
other in terms of business logic. In this case, session consistency cannot guarantee
the consistency of query results, and global consistency is needed.
A high consistency level causes higher pressure on the primary node and lower
cluster performance.
Note: Session consistency is recommended because this level has minimal
impact on performance and can meet the needs of most application scenarios. If you
have high requirements for consistency between different sessions, you can choose
global consistency or use hints (e.g., /*FORCE_MASTER*/ select * from user) to
forcibly send specific queries to the primary node.

Transaction Splitting

Open the Edit dialog box for the cluster endpoint and enable transaction splitting.
When read/write splitting is enabled for the cluster endpoint, all requests in the
transactions will be sent to the primary node to ensure the read and write consis-
tency of transactions in a session. This may cause high pressure on the primary
node; the pressure on the read-only nodes remains low. After transaction splitting is
enabled, some read requests in the transactions can be sent to read-only nodes on the
premise that read/write consistency is not compromised, to reduce the pressure on
the primary node. Transaction splitting is supported only for transactions of the read
committed isolation level.

9.3.4.2 Using Hints

You can add the /*FORCE_MASTER*/ or /*FORCE_SLAVE*/ hint before an SQL
statement to specify the routing direction of the SQL statement. For example, if you
want to forcibly route select * from test to the primary node, you can add the
/*FORCE_MASTER*/ hint.
You can add the /*force_node='<node ID>'*/ hint before an SQL statement to
specify the node on which the SQL statement is to be executed.
For example, /*force_node='pi-bpxxxxxxxx'*/ show processlist specifies that the
show processlist command can be executed only on the pi-bpxxxxxxxx node. If the
node fails, the error message “force hint server node is not found, please check” will
be returned.
You can execute /*force_proxy_internal*/set force_node = '<node ID>' to forcibly
send all subsequent query commands to a specific node.
For example, after the /*force_proxy_internal*/set force_node = 'pi-bpxxxxxxxx'
statement is executed, all subsequent query commands will be sent only to the
pi-bpxxxxxxxx node. If the node fails, the error message "set force node
'pi-bpxxxxxxxx' is not found, please check" will be returned.
Note: If you execute an SQL statement with any of the preceding hints by using the
official MySQL CLI, add the -c parameter. Otherwise, the hint will be ignored and
become invalid. A hint has the highest routing priority and bypasses the consistency
level and transaction splitting settings. A hint cannot be added to statements that modify
environment variables. Otherwise, business errors may occur. For example, do not exe-
cute an SQL statement like /*FORCE_SLAVE*/ set names utf8. The /*force_proxy_
internal*/ hint is generally not recommended because this hint will cause all subsequent
query requests to be sent to a specific node, invalidating the read/write splitting setting.
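
The following sketch shows how these hints might be issued from the official MySQL CLI started with the -c option (so that the client does not strip the comments); the table names are placeholders:

-- Force this query to the primary node, bypassing read/write splitting.
/*FORCE_MASTER*/ SELECT * FROM user WHERE id = 1;

-- Route this query to a read-only node.
/*FORCE_SLAVE*/ SELECT COUNT(*) FROM orders;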

9.3.4.3 Other Features

PolarDB provides a parallel query framework. When the amount of data scanned
by a query reaches the specified threshold, the parallel query framework is automatically
enabled to significantly reduce the query time. For more information about the
parallel query framework, see the related sections in this book.

9.4 Cloud Data Migration

9.4.1 Migrating Data to the Cloud

This section describes how to migrate data from a self-managed MySQL database
to PolarDB for MySQL by using DTS. DTS is a real-time data streaming service
that supports RDBMS, NoSQL, and OLAP data sources. DTS seamlessly integrates
data migration, subscription, and synchronization to ensure a stable and secure
transmission infrastructure.

9.4.1.1 Prerequisites

Create a self-managed MySQL database of version 5.1, 5.5, 5.6, 5.7, or 8.0 and a
destination PolarDB for MySQL cluster. If the source MySQL database is an
on-premises database, add the CIDR block of the DTS server to the IP allowlist of the
database to ensure that the DTS server can access the source MySQL database. Lastly,
create an account and configure binary logging for the self-managed MySQL
database.
Table 9.8 describes the required permissions for database accounts.
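A minimal sketch of this preparation on the self-managed MySQL database follows; the account name, password, and configuration values are placeholders, and the exact settings depend on your MySQL version and deployment:

-- Create a migration account and grant the permissions listed in Table 9.8.
CREATE USER 'dts_user'@'%' IDENTIFIED BY 'your_password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SHOW VIEW ON *.* TO 'dts_user'@'%';

-- Incremental migration also requires binary logging in ROW format, which is
-- typically configured in my.cnf, for example:
--   [mysqld]
--   server-id        = 1
--   log_bin          = mysql-bin
--   binlog_format    = ROW
--   binlog_row_image = FULL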
DTS uses the read and write resources of the source and destination data-
bases during full data migration. This may increase the loads of the database
servers. In some cases, the database service may become unavailable due to
poor database performance, low specifications, or large data amounts (e.g., a
large number of slow SQL statements exist in the source database, tables with-
out primary keys exist, or deadlocks exist in the destination database). Therefore,
you must evaluate the impact of data migration on the performance of the source
and destination databases before you migrate data. We recommend that you
migrate data during off-peak hours. For example, you can migrate data when the
CPU utilization of the source and destination databases is less than 30%.
The source database must have PRIMARY KEY or UNIQUE constraints, and all
fields must be unique. Otherwise, the destination database may contain duplicate
data records.
DTS uses the ROUND(COLUMN, PRECISION) function to retrieve values
from columns of the FLOAT or DOUBLE data type. If you do not specify the preci-
sion level, DTS sets the precision levels for the FLOAT and DOUBLE data types to
38 and 308 digits, respectively. You must check whether the precision settings meet
your business requirements.
If a data migration task fails, DTS initiates automatic recovery. Therefore, before
you switch your workloads to the destination cluster, you must stop or release the
data migration task. Otherwise, the data in the source database overwrites the data
in the destination cluster after automatic recovery.
PolarDB supports schema migration, full data migration, and incremental data
migration. You can employ these migration types to smoothly complete database
migration without interrupting services.

Table 9.8 Required permissions for database accounts

Self-managed MySQL database
• Schema migration/full data migration: SELECT permission
• Incremental data migration: REPLICATION CLIENT, REPLICATION SLAVE, SHOW VIEW, and SELECT permissions
PolarDB for MySQL cluster
• Schema migration/full data migration: Read and write permissions
• Incremental data migration: Read and write permissions

9.4.1.2 Billing

Table 9.9 lists the migration fees.

9.4.1.3 Procedure

Log on to the DTS console. Select the region where the destination cluster resides
and go to the Create Migration Task page. Configure the connection information
of the source and destination databases for the migration task, as described in
Table 9.10.
In the lower-right corner of the page, click Set Whitelist and Next.
Note: This step will automatically add the IP address of the DTS server to the
allowlist of the destination PolarDB for MySQL cluster to ensure that the DTS
server can connect to the destination cluster.
Select the required migration types and the objects that you want to migrate.
Table 9.11 describes the parameters that must be configured.
In the lower-right corner of the page, click Precheck. You must perform a pre-
check before you start the data migration task. You can start the data migration task
only after the task passes the precheck. If the task fails the precheck, you can click
the icon to the right of each failed check item to view details. You can trouble-
shoot the issues based on the details and then run a precheck again. After the task
passes the precheck, confirm the purchase and start the migration task.
Schema migration and full data migration: We recommend that you do not
manually stop the task. Otherwise, the data migrated to the destination database
may be incomplete. You can wait until the data migration task automati-
cally stops.
Schema migration, full data migration, and incremental data migration: A
migration task that implements these migration types does not automatically stop.
You must manually stop the task at an appropriate time (e.g., during off-peak hours
or before you switch your workloads to the destination cluster).
Wait until the migration task proceeds to the Incremental Data Migration step
and enters the nondelayed state. Then, stop writing data to the source database for a
few minutes. During this period, the status of incremental data migration may
display the delay time.

Table 9.9 Migration fees

Schema migration and full data migration
• Instance configuration fee: Free of charge
• Internet traffic fee: Charged only when data is migrated from Alibaba Cloud over the Internet. For more information, see the billing overview of DTS
Incremental data migration
• Instance configuration fee: Charged. For more information, see the billing overview of DTS
• Internet traffic fee: Charged only when data is migrated from Alibaba Cloud over the Internet. For more information, see the billing overview of DTS

Table 9.10 Details of the parameters of the source and destination databases

N/A
• Task name: The name of the task. DTS automatically generates a task name. We recommend that you specify a descriptive task name to make identifying the task easy. Duplicate task names are allowed
Source database
• Instance type: The instance type of the source database. In this example, User-Created Database with Public IP Address is selected for this parameter. Note: If you select other instance types, you must deploy the network environment for the self-managed database
• Instance region: The region where the source database resides. If you selected User-Created Database with Public IP Address for Instance Type, you do not need to configure this parameter. Note: If an allowlist is configured for the self-managed MySQL database, you must add the CIDR block of the DTS server to the allowlist. You can click Get IP Address Segment of DTS to the right of Instance Region to obtain the CIDR block of the DTS server
• Database type: Select MySQL
• Hostname or IP address: The endpoint that is used to connect to the self-managed MySQL database. In this example, the public IP address is used
• Port: The service port number of the self-managed MySQL database. Default value: 3306
• Database account: The account of the self-managed MySQL database. For more information about the permissions that are required for the account, see Table 9.8
• Database password: The password of the database account. Note: After you configure the parameters of the source database, click Test Connectivity to the right of the Database Password parameter to verify that the parameters are valid. If the parameters are valid, the Passed message is displayed. If the Failed message is displayed, click Check to the right of Failed and modify the parameters based on the check results
Destination database
• Instance type: Select PolarDB
• Instance region: The region where the destination PolarDB cluster resides
• PolarDB instance ID: Select the ID of the destination PolarDB cluster
• Database account: The database account of the destination PolarDB cluster. For information about the permissions that are required for the account, see Table 9.8
• Database password: The password of the database account. Note: After you configure the parameters of the destination database, click Test Connectivity to the right of the Database Password parameter to verify that the parameters are valid. If the parameters are valid, the Passed message is displayed. If the Failed message is displayed, click Check to the right of Failed and modify the parameters based on the check results

Table 9.11 Migration types and migration objects

Migration types: To perform only full data migration, select Schema Migration and Full Data Migration. To ensure service continuity during data migration, select Schema Migration, Full Data Migration, and Incremental Data Migration. Notice: If you do not select Incremental Data Migration, we recommend that you do not write data to the source database during full data migration to ensure data consistency
Migration objects: Select one or more objects from the Available section and click the icon to add the objects to the Selected section. Notice: You can select columns, tables, or databases as the objects to be migrated. By default, after an object is migrated to the destination database, the name of the object remains unchanged. You can use the object name mapping feature to rename migrated objects. If you use the object name mapping feature to rename an object, other objects that are dependent on the object may fail to be migrated

Table 9.12 SQL operations that can be synchronized during incremental data migration

DML: INSERT, UPDATE, DELETE, and REPLACE
DDL:
• ALTER TABLE and ALTER VIEW
• CREATE FUNCTION, CREATE INDEX, CREATE PROCEDURE, CREATE TABLE, and CREATE VIEW
• DROP INDEX and DROP TABLE
• RENAME TABLE
• TRUNCATE TABLE

In this case, wait until incremental data migration reenters the nondelayed state.
Then, manually stop the migration task.
Switch your workloads to the destination PolarDB cluster. Table 9.12 lists the
SQL operations that can be synchronized during incremental data migration.

9.4.2 Exporting Data from the Cloud

9.4.2.1 Exporting Data by Using DMS

Log on to the PolarDB instance in the DMS console. In the left-side instance list of
the DMS console, expand the destination PolarDB instance, and double-click a
database on the instance. You can export tables or query results. For example, in the
SQL console, right-click the target table and select Export to export the schema or
data of the table. You can export multiple tables in the database. To export query
results, execute the query statement in the SQL console and then export the query
result displayed in the execution result section.

9.4.2.2 Migrating Data to Other Databases by Using DTS

For PolarDB for MySQL, you must enable binary logging by enabling the loose_
polar_log_bin parameter on the parameter settings page in the PolarDB console.
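After the parameter takes effect, you can verify that binary logging is enabled by running a standard MySQL statement such as the following (a sketch; the expected value of the variable is ON):

SHOW VARIABLES LIKE 'log_bin';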
Log on to the DTS console. Create a migration task and configure the connection
information of the source and destination databases. For example, you can migrate
data from PolarDB to a self-built database on premises or in ECS. Proceed to the
next step to select the migration types and migration objects. To ensure service con-
tinuity, you must select incremental migration. After the task passes the precheck,
confirm the creation of the migration task.
Chapter 10
PolarDB O&M

The lifecycle of a database can be roughly divided into four stages: planning, devel-
opment, deployment, and O&M. After a database is deployed, it enters the O&M
stage, which includes three tasks: resource scaling, backup and recovery, and moni-
toring and diagnostics. This chapter provides an overview of PolarDB O&M man-
agement and describes the procedures for the resource scaling, backup and recovery,
and monitoring and diagnostics of PolarDB.

10.1 Overview

The lifecycle of a database [1] can be roughly divided into four stages: planning,
development, deployment, and O&M. The database enters the O&M stage after it is
deployed. Database O&M is a popular research field [2–5] that typically covers the
following aspects:
• Environment deployment, including database installation, parameter configura-
tion, and permission assignment.
• Backup and recovery: It is of crucial importance for a database to have a
backup available to prevent data loss caused by data corruption or user
misoperations.
• Monitoring and diagnostics: O&M personnel need to ensure normal operation of
the database and then ensure the performance of the system during operation.
Monitoring includes database running status monitoring and database perfor-
mance monitoring.


10.2 Resource Scaling

10.2.1 System Scaling

PolarDB supports online scaling: the database does not need to be locked during
configuration changes. PolarDB supports scaling in three dimensions: vertical
scaling of computing capabilities (i.e., upgrading or downgrading node specifications),
horizontal scaling of computing capabilities (i.e., addition or removal of
read-only nodes), and horizontal scaling of the storage space. In addition, PolarDB
adopts a serverless architecture. Therefore, you do not need to manually set, expand,
or reduce the capacity of the storage space; the capacity is automatically adjusted
online as the amount of data changes. When the amount of data is large, you can use
the PolarDB storage package to reduce storage costs.

10.2.2 Manual Scaling

Upgrading or downgrading cluster specifications does not affect data already stored
in the cluster. During a cluster specification change, PolarDB may be interrupted
for a few seconds, and some operations cannot be performed, so it is recommended
that you change cluster specifications during off-peak hours. After an interruption
occurs, the application needs to reestablish the connection to the database. While
the specification of a PolarDB cluster is being changed, the latency of requests served
by read-only nodes, compared with those served by the primary node, may be longer
than during normal cluster operation.
Perform the following steps to manually upgrade or downgrade cluster
specifications:
1. Log on to the PolarDB console.
2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Open the Change Configurations dialog box on the cluster list page or basic
information page.
4. Select Upgrade or Downgrade.
5. Select the required node specification and complete the purchase. The specifica-
tions of all nodes in the same cluster must be consistent. The new specification
takes effect after approximately 10 min.

10.2.3 Manual Addition and Removal of Nodes

A PolarDB cluster can support up to 15 read-only nodes. To ensure high availability,
the cluster must have at least one read-only node. All nodes in the same cluster must
have consistent specifications.

10.2.3.1 Billing

If the billing method of the cluster is subscription (also known as prepayment), the
billing method of an added node is also subscription. If the billing method of the
cluster is pay-as-you-go (also known as post-payment or pay-by-the-hour), the billing
method of an added node is also pay-as-you-go. You are charged node specification
fees for each newly added node. The storage fee depends on the actual usage,
regardless of the number of nodes.
Read-only nodes that use either the subscription or the pay-as-you-go billing method
can be released at any time. After the release, the remaining balance is refunded or
billing stops. Read/write splitting connections that are established before the addition
of a read-only node is completed do not forward requests to that node. If you want
such a connection to forward requests to the new node, you must disconnect and then
reestablish the connection (e.g., restart the application). Read/write splitting
connections that are created after the read-only node is added forward requests to it
automatically. Read-only nodes can be added or removed only when the cluster has
no ongoing configuration changes.

10.2.3.2 Procedure

1. Log on to the PolarDB console.


2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Find the target cluster. On the cluster list page or the basic information page of
the cluster, open the wizard for adding and removing nodes.
4. Add or remove read-only nodes. If a node is removed, the billing for the node
will be stopped or the payment balance will be refunded. Node addition or
removal takes effect after approximately 5 min.

10.2.4 Automatic Scaling and Node Addition and Removal

If the business workload significantly and frequently fluctuates, it is recommended
that you purchase the PolarDB computing package and use it together with the
automatic scaling service. When the cluster configuration is adjusted, fees are auto-
matically deducted from the computing package based on the current specifications.
However, only PolarDB for MySQL clusters that use the pay-as-you-go billing
method support automatic scaling.
Procedure
1. Log on to the PolarDB console.
2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.

3. On the cluster list page, find the cluster that you want to manage and click the
cluster ID.
4. In the left-side navigation pane, choose Diagnostics and Optimiza-
tion > Diagnosis.
5. On the page that appears, click the Autonomy Center tab. In the lower-right
corner, click Autonomy Service Settings. On the Autonomous Function
Management page, click the Autonomous Function Settings tab.
6. Enable auto scaling as needed and specify the corresponding trigger conditions,
maximum specifications, and maximum number of read-only nodes.

10.3 Backup and Recovery

10.3.1 Backup

A reliable backup feature can effectively prevent data loss. PolarDB supports peri-
odic automatic backup and instant manual backup. When you delete a PolarDB
cluster, you can choose to retain the backup data to avoid data loss caused by
misoperations.
PolarDB allows you to use the backup and recovery features free of charge.
However, backup files occupy storage space. PolarDB charges a fee based on the
storage capacity used and the retention period of backup files, including data files
and log files.

10.3.1.1 Backup Methods

See Table 10.1.

Table 10.1 Backup methods

Automatic backup
• By default, PolarDB automatically backs up data once a day. You can configure parameters such as the frequency of automatic backup and the retention period of backup files in the console
• Automatically created backup files cannot be deleted
• Note: To ensure security, automatic backup must be performed at least twice a week
Manual backup
• You can manually trigger backup at any time. You can manually create up to three backup sets for a cluster
• Manually generated backup files can be deleted

10.3.1.2 Backup Types

Level-1 Backup (Data Backup)

Level-1 backup creates redirect-on-write (ROW) snapshots that are directly stored
in the distributed file system of PolarDB. The system does not replicate data when
it creates a snapshot. When a data block is modified, the system writes the changes
to a new data block and keeps the original data block unchanged. This way, a
database can be backed up within a few seconds regardless of the
data amount. Level-1 backup facilitates fast backup and recovery but results in high
storage costs. The backup and recovery features of PolarDB clusters use multi-
thread parallel processing to improve efficiency. Currently, the speed of recovery
(cloning) based on backup sets (snapshots) is 40 min per terabyte. To ensure data
security, the level-1 backup feature is enabled by default. A level-1 backup set is
retained for at least 7 days and at most 14 days.

Level-2 Backup (Data Backup)

Level-2 backup compresses level-1 backup files and stores the compressed files in
on-premises storage. Recovery by using level-2 backup data is slower than recovery
by using level-1 backup data but incurs lower storage costs. By default, the level-2
backup feature is disabled. A level-2 backup set is retained for at least 30 days and
at most 7300 days. You can also enable the Permanently Retain All Backups
option to permanently save level-2 backup files. After level-2 backup is enabled, an
expired level-1 backup set will be automatically dumped to on-premises storage at
a rate of approximately 150 MB/s and stored as a level-2 backup set. If a level-1
backup set expires before the previous one has been dumped as a level-2 backup set,
this level-1 backup set is deleted and will no longer be dumped as a level-2 backup
set. For example, a PolarDB cluster creates a level-1 backup set at 01:00 every day
and retains the backup set for 24 h. If the PolarDB cluster creates level-1 backup set
A at 01:00 on January 1 and creates level-1 backup set B at 01:00 on January 2,
level-1 backup A expires at 01:00 on January 2 and starts to be dumped as a level-2
backup set. Suppose level-1 backup set A stores a large amount of data, and the
dumping task has not been completed by 01:00 on January 3. In this case, level-1
backup set B is directly deleted after it expires at 01:00 on January 3 and will no
longer be dumped as a level-2 backup set.

Log Backup

The log backup feature allows you to upload redo log entries to OSS in parallel in
real time. You can perform PITR for a PolarDB cluster based on a full backup set
(snapshot) and redo log entries generated within a specific period of time after the
backup set is created, to ensure data security and prevent data loss caused by
misoperations. A log backup set is retained for at least 7 days and at most 7300 days.
You can enable the Retained Before Cluster Is Deleted option to permanently
store the logs.

10.3.1.3 Configuring an Automatic Backup Policy

1. Log on to the PolarDB console.


2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Find the cluster that you want to manage and click the cluster ID.
4. In the left-side navigation pane, choose Settings and Management > Backup
and Restore.
5. On the Backup Policy Settings tab, click Edit.
6. In the Backup Policy Settings dialog box, configure related parameters.

10.3.1.4 Manually Creating Backups

1. Log on to the PolarDB console.


2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Find the cluster that you want to manage and click the cluster ID.
4. In the left-side navigation pane, choose Settings and Management > Backup
and Restore.
5. On the Data Backups page, click Create Backup. You can manually create up
to three backup sets for each cluster.

10.3.1.5 FAQ

Question: Why is the total size of level-1 backup sets smaller than the size of a
single backup set?
Answer: Level-1 backup sets in PolarDB are measured based on two aspects: the
logical size of each backup set and the total physical size of all backup sets. PolarDB
uses snapshot chains to store level-1 backup sets, and only one record is generated
for each data block. Therefore, the total physical size of level-1 backup sets is some-
times smaller than the logical size of a single backup set.

10.3.2 Recovery

10.3.2.1 Recovery Methods

See Table 10.2.



Table 10.2 Recovery methods

Recovery source
• Point-in-time recovery (PITR): recovers the cluster to a specific point in time in the past based on a backup set and the redo log
• Recovery from a backup set: recovers data from the selected backup set
Granularity
• Recovery of the entire cluster
• Recovery of specific databases or tables
Destination
• Recovery to a new cluster: recovers historical data of a cluster to a new cluster (also known as a cloned instance); the data can then be migrated back to the original cluster after it is verified in the new cluster
• Recovery to the current cluster
Note: Several recovery options, such as recovery to the current cluster, are supported only by specific types of clusters

10.3.2.2 Procedure

1. Log on to the PolarDB console.


2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Find the cluster that you want to manage and click the cluster ID.
4. In the left-side navigation pane, choose Settings and Management > Backup
and Restore.
5. Choose whether to perform PITR for the cluster or recover the cluster from a
backup, and choose whether to recover data to a new cluster or the current clus-
ter: To recover data to the current cluster, specify the databases and tables that
need to be recovered. To recover data to a new cluster, specify the billing method
for the new cluster and purchase required instances.

10.4 Monitoring and Diagnostics

10.4.1 Monitoring and Alerting

10.4.1.1 Monitoring

The PolarDB console provides a variety of monitoring metrics and updates monitor-
ing data every second, to help you understand the cluster running status in real time
and facilitate rapid fault location based on fine-grained monitoring data.

10.4.1.2 Alerting

The PolarDB console allows you to create and manage threshold-based alerting
rules. The alerting feature helps you detect cluster or node exceptions and handle
the exceptions at the earliest opportunity.

10.4.1.3 Procedure

1. Log on to the PolarDB console.


2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Find the cluster that you want to manage and click the cluster ID.
4. In the left-side navigation pane, choose Diagnostics and Optimiza-
tion > Monitoring.
5. View the monitoring information of the cluster or cluster nodes, specify the
monitoring data collection interval, and manage alert rules based on your busi-
ness requirements.

10.4.2 Diagnostics and Optimization

Combined with Alibaba Cloud DAS, PolarDB provides a variety of autonomous
features to help users quickly diagnose and respond to database performance issues
caused by various reasons [4, 5].

10.4.2.1 Automatic SQL Tuning

Slow SQL statements can greatly affect the stability of a database. When a data-
base encounters problems such as high load or performance jitters, the database
administrator or developer first checks whether slow SQL statements are being
executed. DAS provides the slow SQL analysis feature, which displays slow
SQL trends and statistics and provides SQL tuning suggestions and diagnostic
analysis.
Table 10.3 shows the comparison of slow SQL viewing methods.

Table 10.3 Slow SQL viewing methods

View directly
• The slow SQL statements have not undergone parameterization, aggregation, or sampling, resulting in poor readability
• Issues cannot be quickly located and rectified, so losses cannot be prevented in time
View on a self-managed slow SQL platform
• Requires self-managed collection, computing, and storage platforms, which are costly
• Requires dedicated development and maintenance personnel, consequently raising entry barriers
Use the slow SQL analysis feature of DAS
• Provides a closed-loop process involving slow SQL discovery, analysis, diagnosis, tuning, and tracking, which facilitates full lifecycle management
• Low entry barriers, so slow SQL analysis and tuning can be performed without the help of professional database administrators

10.4.2.2 Procedure

1. Log on to the PolarDB console.


2. In the upper-left corner of the console, select the region where the cluster that
you want to manage is located.
3. Find the cluster that you want to manage and click the cluster ID.
4. In the left-side navigation pane, choose Diagnostics and Optimization > Slow
SQL Query.
5. View slow SQL trends in a specific time range.
6. Click a point in time in the slow SQL trend graph to view the slow SQL statis-
tics, SQL details, and tuning suggestions.

10.4.2.3 Other Features

Autonomy Center

You can enable DAS from the Autonomy Center tab. After DAS is enabled, DAS
automatically analyzes the root cause when the database becomes abnormal, pro-
vides optimization or rectification suggestions, and automatically performs optimi-
zation or rectification operations (optimization operations can be performed only
when authorization is granted).

Session Management

You can use the session management feature to view the session details and session
statistics of the target instance.

Real-Time Performance

The real-time performance feature allows you to view various information in real
time, such as the QPS, TPS, and network traffic information of the target cluster.

Storage Analysis

The storage analysis feature provides the overview information of the entire cluster
(e.g., the number of days for which the remaining storage capacity will last) and the
storage details of a specific table in a database (e.g., space usage, space fragments,
and space exception diagnostics information).

Lock Analysis

The lock analysis feature allows you to view and analyze the latest deadlocks in the
database in a simple and direct manner.

Performance Insight

The performance insight feature enables you to quickly evaluate the database load
and find the root cause of performance issues to improve database stability.

Diagnostic Report

The diagnostic report feature allows you to specify custom criteria for generating
diagnostic reports and view diagnostic reports.

References

1. Garcia-Molina H, Ullman JD, Widom J. In: Dongqing Y, Yuqing W, et al., editors. Database systems: the complete book. 2nd ed. Beijing: China Machine Press; 2010.
2. Zhang J, Liu Y, Zhou K, et al. An end-to-end automatic cloud database tuning system using
deep reinforcement learning. In: SIGMOD Conference; 2019. p. 415–32.
3. Van Aken D, Pavlo A, Gordon GJ, et al. Automatic database management system tuning through large-scale machine learning. In: SIGMOD Conference; 2017. p. 1009–24.
4. Tan J, Zhang TY, Li FF, et al. iBTune: individualized buffer tuning for large-scale cloud data-
bases. Proc VLDB Endow. 2019;12(10):1221–34.
5. Ma MH, Yin ZH, Zhang SHL, et al. Diagnosing root causes of intermittent slow queries in
large-scale cloud databases. Proc VLDB Endow. 2020;13(8):1176–89.
