Recently, PolarDB topped the TPC-C benchmark ranking with a result 2.5 times higher than the previous record, setting the TPC-C world records for both performance and cost-effectiveness: 2.055 billion transactions per minute (tpmC) at a unit cost of CNY 0.8 per tpmC (price/tpmC).
Behind each seemingly simple number lies the relentless pursuit of database performance, cost-effectiveness, and stability by countless engineers. The pace of innovation in PolarDB has never stopped. This series of articles, "PolarDB's Technical Secrets of Topping TPC-C", tells the story behind this double first place. Stay tuned!
This article, the first in the series, covers standalone performance optimization.
TPC-C is a benchmark developed by the Transaction Processing Performance Council (TPC) to measure the performance of OLTP systems. It covers the typical processing paths of a database, including inserts, deletes, updates, and queries, and the final result is measured in tpmC (transactions per minute). The TPC-C model evaluates database performance directly and objectively and is widely regarded as the most credible benchmark standard in the industry.
In this TPC-C benchmark, we used PolarDB for MySQL 8.0.2. As the flagship product of Alibaba Cloud ApsaraDB, PolarDB raised single-core performance to 1.8 times the previous record through more than 90 standalone optimizations, a milestone in the database field. This article uncovers the technical details behind PolarDB's standalone optimization.
During the benchmark, based on an in-depth analysis of the database workload model, we identified the following four core characteristics and the corresponding optimization solutions:
• Massive user connections → High concurrency optimization
• High CPU usage and memory access → CPU and memory efficiency optimization
• High I/O throughput → I/O link optimization
• Longer log write link → Replication performance optimization
These four characteristics are also common performance bottlenecks in real user workloads. We first describe PolarDB's overall performance path, and then introduce the key standalone optimizations made along this path for each of the four characteristics.
As a cloud-native database built on shared storage, PolarDB provides high ease of use (fast backup and recovery, high availability), strong elasticity (decoupled storage and compute), and, through software-hardware co-evolution, I/O latency on par with local disks together with higher IOPS. It combines performance, ease of use, and scalability.
The traditional MySQL architecture deployed on local disks benefits from their low I/O latency, but faces limited storage capacity and difficulty in scaling up. In addition, the high latency of cross-machine primary/secondary replication offsets much of the performance benefit of local disks. A MySQL architecture deployed directly on cloud disks can take advantage of the scalability and high availability of cloud storage, but the higher latency of cloud disks prevents MySQL from reaching its full performance, and compute resources still cannot be scaled out.
To address performance and scalability issues, users may turn to distributed databases, but these introduce their own problems, such as significant business refactoring and high O&M costs. PolarDB addresses the shortcomings of all three architectures through end-to-end software and hardware collaborative optimization, from the proxy down to the underlying storage.
Figure 1: PolarDB with high performance, ease of use, and scalability
Figure 2 shows an overview of PolarDB's end-to-end performance path. User SQL queries are forwarded by the proxy to the database kernel, where they are parsed and executed against the indexes, and finally reach disk storage through the file system. The optimization spans from the upper-layer proxy to the underlying storage, allowing PolarDB to maintain efficient transaction processing under the heavy load of TPC-C. This article mainly introduces standalone optimization at the database kernel level; later articles will introduce how PolarDB integrates software and hardware to achieve collaborative evolution.
Figure 2: Overview of PolarDB performance links
The first typical characteristic of the TPC-C workload is a massive number of user connections. In the PolarDB cluster test, a total of 1.6 billion user connections were created on the clients. Even with a multi-level connection pool, the number of user connections to a single database node still exceeded 7,000, which severely tests the concurrent processing capability of the database.
To remove the lock bottleneck caused by a large number of concurrent index writes, PolarDB provides the high-performance Polar Index to improve index read and write performance under multi-threaded concurrency.
Polar Index 1.0 removes the global index lock that previously prevented concurrent split and merge operations (SMOs). Each index traversal follows the latch coupling principle: the write lock on a parent node is released only after the write lock on the next node has been acquired. An SMO is decomposed into multiple stages, allowing the nodes being split to be read concurrently during the SMO.
In Polar Index 2.0, PolarDB further reduces the locking granularity of indexes. Instead of latch coupling, PolarDB holds the lock of only one node at a time while traversing the B-tree from top to bottom, and SMOs are optimized to take write locks bottom-up. This shortens the time that upper-level nodes, which cover a larger lock range, are held locked and allows higher read and write concurrency.
The high-performance read and write of Polar Index makes concurrent read and write of indexes no longer a write bottleneck in TPC-C benchmark tests.
Figure 3: Multi-stage SMO process of Polar Index
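For illustration, below is a minimal, self-contained C++ sketch of the latch-coupled descent described for Polar Index 1.0. The `Node` layout and the key search are simplified assumptions; this is not PolarDB's actual index code.

```cpp
// Minimal sketch of latch-coupled B-tree descent (the Polar Index 1.0 behavior
// described above). Node layout and key comparison are simplified assumptions.
#include <shared_mutex>
#include <vector>

struct Node {
    std::shared_mutex latch;
    bool leaf = false;
    std::vector<Node*> children;   // key array and record payloads elided
};

// Descend from the root to a leaf while holding at most two write latches:
// the parent latch is released only after the child latch has been acquired,
// so concurrent traversals can never observe a half-modified path.
Node* descend_latch_coupled(Node* root) {
    root->latch.lock();
    Node* current = root;
    while (!current->leaf) {
        Node* child = current->children.front();  // key comparison elided
        child->latch.lock();      // acquire the child first ...
        current->latch.unlock();  // ... then release the parent (latch coupling)
        current = child;
    }
    return current;  // the leaf is returned still write-latched to the caller
}
```

Polar Index 2.0, by contrast, latches only the node currently being visited during the descent; schemes of this kind typically revalidate and restart the descent if a concurrent SMO invalidates the path, which is what permits the bottom-up write locking described above.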
To solve the scalability bottleneck caused by a large number of concurrent operations in the transaction system, PolarDB provides a lock-free transaction management system, PolarTrans, which improves the efficiency of transaction commit and visibility determination.
In the native InnoDB storage engine, the transaction system maintains the list of active transactions in a global set protected by a global lock, which creates a scalability bottleneck for transaction operations. PolarTrans uses the Commit Timestamp Store (CTS) to map each transaction ID to its commit sequence number (CSN) in the CTS log, and applies extensive lock-free optimizations to make transaction state handling more lightweight.
When a transaction starts, it only needs to be registered in the CTS log. When it commits, its CSN is recorded in the CTS log. When visibility is determined, the maximum commit timestamp of the transaction system is used instead of the native active-transaction array, which greatly improves the efficiency of transaction management.
Figure 4: Implementation of the CTS log
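The following is a minimal C++ sketch of CSN-based commit and visibility checking in the spirit of the CTS log described above. The slot layout, the modulo mapping, and the names are assumptions for illustration, not PolarTrans's actual data structures.

```cpp
// Sketch: lock-free commit publication and visibility check via atomic CSN slots.
#include <atomic>
#include <cstdint>
#include <vector>

constexpr uint64_t kNotCommitted = 0;

class CtsLog {
public:
    explicit CtsLog(size_t slots) : slots_(slots) {
        for (auto& s : slots_) s.store(kNotCommitted, std::memory_order_relaxed);
    }

    // Transaction start: register the trx id with an empty (uncommitted) slot.
    void begin(uint64_t trx_id) {
        slots_[trx_id % slots_.size()].store(kNotCommitted, std::memory_order_release);
    }

    // Transaction commit: publish the commit sequence number without any global lock.
    void commit(uint64_t trx_id, uint64_t csn) {
        slots_[trx_id % slots_.size()].store(csn, std::memory_order_release);
    }

    // Visibility check: a row version written by trx_id is visible to a snapshot
    // taken at snapshot_csn iff the writer committed with a CSN <= snapshot_csn.
    bool visible(uint64_t trx_id, uint64_t snapshot_csn) const {
        uint64_t csn = slots_[trx_id % slots_.size()].load(std::memory_order_acquire);
        return csn != kNotCommitted && csn <= snapshot_csn;
    }

private:
    std::vector<std::atomic<uint64_t>> slots_;
};
```

With this layout, a read view only needs a single snapshot CSN (the maximum commit timestamp at the time the view is taken) instead of a copy of the active-transaction array, which is the lightweight behavior described above.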
PolarDB uses a multi-level sharded buffer pool to resolve the scalability bottleneck caused by heavy concurrent access to the buffer pool.
PolarDB shards buffer pool access across multiple LRU lists and uses asynchronous LRU manager threads to evict pages from the tail of each LRU list. Front-end user threads do not evict LRU pages themselves, so no extra CPU time is added to transaction processing.
Each LRU manager thread serves multiple LRU lists. By maintaining a heap ordered by the number of free pages, the LRU manager scans the eviction area at the tail of the LRU list with the fewest free pages, flushes dirty pages, and supplies free pages to front-end threads, thereby avoiding a large number of background thread switches.
For I/O-intensive workloads, to reduce the lock overhead caused by the eviction of hot pages and the frequent movement of pages within the list, PolarDB adds a hot cache list at the head of the LRU list. Specific pages (intermediate index pages and metadata pages) and hotspot tables identified by access frequency are preferentially placed in the hot cache to reduce how often they are evicted.
Figure 5: Implementation of a multi-level sharded buffer pool
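Below is a minimal C++ sketch of a sharded buffer pool with a background LRU manager in the spirit of the design above. The shard count, the `Page` structure, and the simple min search standing in for the free-page heap are illustrative assumptions.

```cpp
// Sketch: per-shard LRU lists with a background eviction pass.
#include <algorithm>
#include <cstdint>
#include <list>
#include <mutex>
#include <vector>

struct Page { uint64_t page_id = 0; bool dirty = false; };

struct BufferShard {
    std::mutex mtx;        // per-shard latch instead of one global buffer pool latch
    std::list<Page> lru;   // head = hot end, tail = eviction area
    size_t free_pages = 0;
};

class ShardedBufferPool {
public:
    explicit ShardedBufferPool(size_t n) : shards_(n) {}

    // Pages are hashed to shards, so concurrent accesses spread across latches.
    BufferShard& shard_for(uint64_t page_id) {
        return shards_[page_id % shards_.size()];
    }

    // Background LRU manager pass: pick the shard with the fewest free pages
    // (the real design keeps a heap for this) and evict from its tail, so
    // front-end user threads never pay the eviction cost themselves.
    void lru_manager_pass(size_t batch) {
        auto it = std::min_element(shards_.begin(), shards_.end(),
            [](const BufferShard& a, const BufferShard& b) {
                return a.free_pages < b.free_pages;  // racy read, used only as a heuristic
            });
        std::lock_guard<std::mutex> guard(it->mtx);
        for (size_t i = 0; i < batch && !it->lru.empty(); ++i) {
            Page victim = it->lru.back();
            it->lru.pop_back();
            if (victim.dirty) { /* flush to storage before reuse (elided) */ }
            ++it->free_pages;  // the frame goes back to the shard's free list
        }
    }

private:
    std::vector<BufferShard> shards_;
};
```

The hot cache list described above would sit at the head of each shard's LRU so that intermediate index pages, metadata pages, and hotspot-table pages are skipped by this eviction scan.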
To resolve issues such as CPU contention and frequent context switching caused by a large number of concurrent requests under the traditional threading model, PolarDB designs a fully asynchronous execution architecture based on coroutines, which realizes the asynchronous execution of core logic such as authentication, transaction commit, and lock wait. This significantly improves the high-concurrency processing capability of the database.
PolarDB uses the coroutine technology to decouple the lifecycle of user requests from physical threads and reconstruct it into a fully asynchronous execution model:
• Coroutine-based requests: Transaction requests are encapsulated as independent coroutines and managed by the user-mode scheduler.
• Proactive release mechanism: After a coroutine suspends, the execution right is released, and the scheduler immediately switches to another ready coroutine.
• Efficient resource reuse: A single thread can process hundreds of coroutines concurrently, reducing thread scheduling overhead.
PolarDB designs a lightweight communication protocol based on eventfd. Each coroutine is bound to an independent eventfd as its signaling channel. When a coroutine suspends, the epoll thread monitors its eventfd; once the awaited resource is ready, a write to that eventfd triggers a signal that immediately wakes the suspended coroutine. This mechanism avoids the broadcast wake-ups of traditional threading and achieves three major improvements: zero invalid wake-ups, nanosecond-level response, and the ability to manage millions of concurrent operations.
Figure 6: Asynchronous execution logic
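The snippet below is a minimal, self-contained Linux example of the eventfd plus epoll wake-up channel described above: one eventfd per waiter, and a single 8-byte write wakes exactly that waiter. It only demonstrates the signaling mechanism and is not PolarDB's coroutine scheduler.

```cpp
// Sketch: per-waiter eventfd registered with epoll; a write() wakes only that waiter.
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <thread>

int main() {
    int efd = eventfd(0, EFD_NONBLOCK);          // per-coroutine signaling channel
    int epfd = epoll_create1(0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = efd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);    // register the channel with epoll

    // "Scheduler" side: block until some suspended coroutine's fd becomes ready.
    std::thread scheduler([&] {
        epoll_event ready[16];
        int n = epoll_wait(epfd, ready, 16, -1); // wakes only for fds that fired
        for (int i = 0; i < n; ++i) {
            uint64_t val = 0;
            read(ready[i].data.fd, &val, sizeof(val));  // consume the signal
            std::printf("coroutine on fd %d is ready to resume\n", ready[i].data.fd);
        }
    });

    // "Resource ready" side: a single 8-byte write wakes the specific waiter,
    // avoiding the broadcast wake-ups of a shared condition variable.
    uint64_t one = 1;
    write(efd, &one, sizeof(one));

    scheduler.join();
    close(efd);
    close(epfd);
    return 0;
}
```

Because each waiter owns its own fd, a signal wakes only the coroutine that needs it, which is what the text above means by eliminating invalid wake-ups.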
In TPC-C benchmark tests, the parsing and execution of a large number of SQL statements and the access to data tables still consume a lot of CPU and memory resources. PolarDB conducts an in-depth analysis of modules such as table metadata management and stored procedures to save CPU resources and improve computing efficiency.
In a relational database, a DML operation must hold the table's metadata lock while the row structure is resolved, to prevent the inconsistency that concurrent DDL operations could otherwise cause.
With traditional pessimistic locking, before executing a SQL statement, the user thread must build the metadata of every table to be accessed and acquire the corresponding metadata locks (MDLs), which are released only after the transaction finishes. This consumes a lot of CPU time, and since a transaction typically consists of dozens of SQL statements, repeating this work for every transaction further increases the overhead.
PolarDB implements an optimistic open-table reuse mechanism to avoid repeatedly building and destroying table metadata and repeatedly acquiring and releasing MDL locks for every transaction on a connection. A connection-private cache stores the data tables accessed by a transaction together with the user connection information, so that the next transaction can reuse them. If the tables accessed by a transaction are a subset of the private cache, the cached table metadata and MDL locks are reused directly. To avoid deadlocks, when a new data table is accessed or the connection is closed, the private cache is cleared, the corresponding MDL locks are released, and the pessimistic locking process is resumed.
Figure 7: Optimistic open table reuse optimization
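Here is a minimal C++ sketch of a connection-private open-table cache along the lines described above. `TableMeta`, `open_table_pessimistic()`, and `release_mdl()` are hypothetical stand-ins for the kernel's real metadata and MDL routines.

```cpp
// Sketch: reuse cached table metadata and MDLs when a transaction's table set
// is a subset of the connection's private cache; otherwise fall back.
#include <string>
#include <unordered_map>
#include <vector>

struct TableMeta { /* table definition, MDL ticket, ... */ };

// Stubs standing in for the real kernel routines.
TableMeta open_table_pessimistic(const std::string& /*name*/) { return {}; }  // builds metadata and takes the MDL
void release_mdl(TableMeta& /*meta*/) {}                                       // releases the cached MDL

class ConnectionTableCache {
public:
    // Fast path: every table of the new transaction is already cached, so both
    // the table metadata and the MDL locks are reused without touching global state.
    bool try_reuse(const std::vector<std::string>& tables) const {
        for (const auto& t : tables)
            if (cache_.find(t) == cache_.end()) return false;
        return true;
    }

    // Slow path: a table outside the cache is accessed, so clear the cache and
    // release its MDLs first (avoiding deadlocks), then reopen pessimistically
    // and repopulate the cache for the next transaction.
    void open_pessimistic(const std::vector<std::string>& tables) {
        clear();
        for (const auto& t : tables)
            cache_.emplace(t, open_table_pessimistic(t));
    }

    void clear() {
        for (auto& entry : cache_) release_mdl(entry.second);
        cache_.clear();
    }

private:
    std::unordered_map<std::string, TableMeta> cache_;
};
```

On the fast path, a repeated transaction touching the same tables skips both the metadata build and the MDL acquisition entirely; any access outside the cached set, or a disconnect, falls back to the pessimistic path as described above.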
TPC-C transactions are executed through stored procedures, so the efficiency of stored procedure parsing and execution largely determines transaction processing performance.
To improve the execution efficiency of stored procedures, PolarDB optimizes their caching mechanism in the following ways (a simplified sketch of the resulting statement/plan cache appears after Figure 8):
• Convert the per-connection structure cache into a global structure cache, so that a large number of connections does not consume excessive memory; the memory saved is returned to buffer pool pages, reducing I/O overhead.
• Cache the prepare results of the SQL statements inside a stored procedure and, combined with the optimistic open table mechanism, cache the column binding information of the data tables in the items of each SQL expression. This avoids the repeated prepare overhead, and the associated waste of CPU resources, on every stored procedure call.
• Implement an execution plan cache that solidifies the execution paths of simple SQL statements, such as single primary key lookups and range queries without indexes, based on index statistics. This prevents the optimizer from repeatedly diving down into the storage engine and consuming additional I/O and CPU resources.
Figure 8: Caching mechanism of the PolarDB stored procedure
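As referenced above, the following is a simplified C++ sketch of a statement/plan cache for stored procedures. The digest key, the `PlanKind` values, and the `build_plan()` stub are illustrative assumptions rather than PolarDB's actual plan-cache design.

```cpp
// Sketch: build a plan once per statement digest, then reuse it on every
// subsequent stored procedure call.
#include <string>
#include <unordered_map>

enum class PlanKind { PrimaryKeyLookup, FullRangeScan, Other };

struct CachedPlan {
    PlanKind kind;
    // prepared items, bound column metadata, solidified access path, ...
};

// Stub standing in for the optimizer: classify the statement once.
CachedPlan build_plan(const std::string& stmt) {
    return {stmt.find("WHERE id =") != std::string::npos
                ? PlanKind::PrimaryKeyLookup : PlanKind::Other};
}

class ProcedurePlanCache {
public:
    // On the first call the statement is prepared/optimized once; later calls
    // of the same stored procedure reuse the cached plan and skip the optimizer.
    const CachedPlan& get_or_build(const std::string& stmt_digest) {
        auto it = cache_.find(stmt_digest);
        if (it == cache_.end())
            it = cache_.emplace(stmt_digest, build_plan(stmt_digest)).first;
        return it->second;
    }

private:
    std::unordered_map<std::string, CachedPlan> cache_;
};
```

This sketch only shows the lookup-or-build pattern; in PolarDB it is combined with the global structure cache, the prepared column bindings, and the optimistic open table mechanism described in the list above.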
The execution of one TPC-C transaction may involve dozens of read I/Os, making the data access performance of the transaction highly dependent on the performance of the disk I/O.
PolarDB proposes the PolarIO solution for the I/O path, as shown in Figure 9 (a). Starting from the storage engine's two main I/O types, page I/O and redo I/O, it reworks the buffer pool, the redo buffer, and the I/O queues, and finally persists data to the underlying elastic storage through the self-developed user-mode file system PFS.
In addition to the buffer pool module described in the preceding section, Figure 9 (b) shows PolarDB's parallel redo writing design. The redo buffer is divided into multiple shards that issue asynchronous I/O tasks concurrently, combined with asynchronous redo preparation such as redo alignment and checksum computation. Measured under high redo pressure, PolarDB's redo throughput reaches 4 GB/s.
Figure 9: PolarIO solution and parallel redo log storage
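The sketch below illustrates, in simplified C++, a sharded redo buffer with per-shard asynchronous flushing in the spirit of Figure 9 (b). The shard count, the 512-byte alignment, the toy checksum, and the `submit_async_write()` stub are assumptions for illustration.

```cpp
// Sketch: per-shard redo buffers, flushed as independent asynchronous writes.
#include <array>
#include <cstdint>
#include <mutex>
#include <numeric>
#include <vector>

// Stub: in the real system this would enqueue an asynchronous write to the
// user-mode file system / underlying storage.
void submit_async_write(const std::vector<uint8_t>& /*buf*/) {}

struct RedoShard {
    std::mutex mtx;
    std::vector<uint8_t> buffer;
};

class ShardedRedoBuffer {
public:
    // Writers append to one shard each, so writers on different shards never
    // contend on the same latch.
    void append(size_t shard_id, const std::vector<uint8_t>& record) {
        RedoShard& s = shards_[shard_id % shards_.size()];
        std::lock_guard<std::mutex> g(s.mtx);
        s.buffer.insert(s.buffer.end(), record.begin(), record.end());
    }

    // Flusher: prepare each shard (alignment, checksum) and issue one async
    // write per shard so the shards reach storage in parallel.
    void flush_all() {
        for (RedoShard& s : shards_) {
            std::lock_guard<std::mutex> g(s.mtx);
            if (s.buffer.empty()) continue;
            s.buffer.resize((s.buffer.size() + 511) / 512 * 512);  // pad to 512-byte blocks
            uint8_t checksum = std::accumulate(s.buffer.begin(), s.buffer.end(), uint8_t{0});
            (void)checksum;  // in a real block format this would go into the block header
            submit_async_write(s.buffer);
            s.buffer.clear();
        }
    }

private:
    std::array<RedoShard, 8> shards_;  // shard count chosen arbitrarily here
};
```

Splitting the buffer lets alignment, checksum computation, and the writes themselves proceed concurrently across shards, which is the effect behind the 4 GB/s figure quoted above.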
Another key part of the PolarIO solution is persistence through the file system and the underlying storage. On the write path, data is written through PolarFS to PolarStore and onto AliSCM, a high-speed device with latency close to that of memory, over a 100 Gb lossy RDMA network. On the read path, PolarDB uses an elastic memory pool of hundreds of terabytes built on DRAM and AliSCM. The entire I/O path bypasses the kernel, performs no memory copies, and incurs extremely low software stack overhead. Through this software-hardware coordination, PolarDB delivers I/O latency comparable to, or even better than, that of local disks.
In the disaster recovery test, TPC-C requires the database to use semi-synchronous transaction commit: before a transaction commits, its redo log must be transmitted over the network to the host of a secondary node, so that after a failure of the primary node the logs can be recovered from the secondary.
However, the cross-machine replication link has two impacts:
• The replication link prolongs the time that transactions wait for log persistence.
• Under heavy write load on the primary node, the secondary node needs correspondingly stronger replication capability.
To address the performance issues caused by cross-machine replication, PolarDB leverages the low latency of RDMA to reduce network transmission latency. In addition, a dedicated redo cache is maintained for replication, asynchronous concurrent links are used to improve redo transmission throughput, and a wake-up mechanism based on multiple semaphores prevents invalid transaction wake-ups.
On the secondary node, PolarDB uses an end-to-end parallel replication framework: the entire replication process of reading, parsing, and applying redo logs is parallelized. Thanks to this parallel replication optimization, the replication latency of the secondary node stays at the millisecond level under the TPC-C test pressure of 2 billion tpmC.
Figure 10: PolarDB primary/secondary synchronization link and parallel replication framework
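To make the apply side concrete, here is a minimal C++ sketch of hash-partitioned parallel redo apply: records are routed by page ID to per-worker queues so that different pages are applied concurrently while records for the same page stay ordered. `RedoRec`, the worker count, and `apply()` are illustrative stand-ins, not PolarDB's replication code.

```cpp
// Sketch: dispatch parsed redo records by page id to parallel apply workers.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct RedoRec { uint64_t page_id; /* record body ... */ };

void apply(const RedoRec&) { /* stub: replay the change on the page */ }

class ParallelApplier {
public:
    explicit ParallelApplier(size_t n) : queues_(n) {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this, i] { run(i); });
    }

    // Dispatcher (the "parse" stage): the same page always goes to the same
    // worker, so per-page ordering is preserved.
    void dispatch(const RedoRec& rec) {
        Queue& q = queues_[rec.page_id % queues_.size()];
        { std::lock_guard<std::mutex> g(q.mtx); q.items.push(rec); }
        q.cv.notify_one();
    }

    // Call before destruction to drain and join the workers.
    void shutdown() {
        done_ = true;
        for (Queue& q : queues_) q.cv.notify_all();
        for (std::thread& t : workers_) t.join();
    }

private:
    struct Queue {
        std::mutex mtx;
        std::condition_variable cv;
        std::queue<RedoRec> items;
    };

    void run(size_t i) {
        Queue& q = queues_[i];
        while (true) {
            std::unique_lock<std::mutex> lk(q.mtx);
            q.cv.wait(lk, [&] { return done_ || !q.items.empty(); });
            if (q.items.empty()) { if (done_) return; else continue; }
            RedoRec rec = q.items.front(); q.items.pop();
            lk.unlock();
            apply(rec);   // the apply stage runs in parallel across workers
        }
    }

    std::vector<Queue> queues_;
    std::vector<std::thread> workers_;
    std::atomic<bool> done_{false};
};
```

Partitioning by page keeps the per-page ordering that redo replay requires while letting the read, parse, and apply stages run as a pipeline, as the framework above describes.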
The TPC-C stress model comprehensively exercises CPU and I/O resources and thoroughly tests the concurrent execution and coordination efficiency of most modules inside the database. Adhering to the principle of putting customers first, PolarDB continuously optimizes database performance for customers' real application scenarios, so that every core is put to its best use and delivers maximum performance.