What Is Patroni
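In many production environments, distributed databases are widely adopted for their high availability, data distribution, and load balancing. Patroni is a high-availability solution designed specifically for PostgreSQL: a high-availability architecture template implemented in Python. It relies on an external distributed configuration store (Kubernetes, etcd, etcd3, ZooKeeper, Consul, and others) to give a PostgreSQL cluster automatic failure recovery and automatic failover.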
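Key features:
1. Automatic failure detection and recovery: Patroni monitors the health of the PostgreSQL cluster. When it detects that the primary has failed, it automatically runs recovery and promotes one of the replicas to primary.
2. Automatic failover: Once a new primary has been elected, the remaining replicas are reconfigured to follow it and clients can be redirected to it, which keeps failover fast and largely transparent.
3. Consistency and data integrity: Patroni pays close attention to data consistency and integrity. During a failover it checks that a candidate is sufficiently up to date before it takes over, to guard against data loss or corruption.
4. External shared configuration store: Patroni uses an external key-value store (such as ZooKeeper, etcd, or Consul) to hold configuration and cluster state. This keeps the configuration consistent and accessible, and lets multiple Patroni instances cooperate.
5. Support for cloud environments and physical hardware: Patroni runs in cloud environments and can also be deployed on physical hardware, which gives a wide range of deployment options.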
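The data-integrity check in item 3 can be pictured as a replication-lag filter. The sketch below is a simplification and not Patroni's code: `eligible_candidates` and the node names are hypothetical, LSNs are treated as plain byte offsets, and the threshold is the `maximum_lag_on_failover` value that appears in the sample configuration later in this article.

```python
# Threshold in bytes (32 MiB), taken from the sample configuration in this article.
MAXIMUM_LAG_ON_FAILOVER = 33554432


def eligible_candidates(last_leader_lsn, replica_lsns, max_lag=MAXIMUM_LAG_ON_FAILOVER):
    """Return the replicas whose lag behind the last known leader LSN is acceptable.

    replica_lsns maps a (hypothetical) node name to its replayed LSN in bytes.
    """
    return sorted(
        name
        for name, lsn in replica_lsns.items()
        if last_leader_lsn - lsn <= max_lag
    )


if __name__ == "__main__":
    replicas = {"pg-1": 99_000_000, "pg-2": 10_000_000}
    print(eligible_candidates(100_000_000, replicas))
```

A replica that is too far behind is simply left out of the candidate list, so it cannot be promoted until it catches up.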
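Patroni Architecture
●DCS (Distributed Configuration Store): where the distributed configuration is stored. Supported backends include Kubernetes, etcd, etcd3, ZooKeeper, and Consul. Patroni reads and writes the distributed configuration there.
●Patroni core: writes the distributed configuration to the DCS, sets each PostgreSQL node's role and PostgreSQL configuration, and manages the PostgreSQL lifecycle.
●PostgreSQL nodes: each node uses the PostgreSQL configuration set by Patroni to establish the primary/replica relationships and synchronizes data through streaming replication, which together forms a PostgreSQL cluster.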
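To make this division of labor concrete, here is a minimal sketch of how a leader key with a TTL in a DCS can drive role assignment. `InMemoryDCS`, `try_acquire_leader`, `role_for`, and the node names are hypothetical and are not Patroni APIs. A real DCS performs the compare-and-set atomically on the server side, which this single-process toy does not attempt.

```python
import time


class InMemoryDCS:
    """Toy stand-in for a DCS: one leader key that expires after ttl seconds."""

    def __init__(self, ttl, clock=time.time):
        self.ttl = ttl
        self._clock = clock
        self._leader = None      # name of the current lock holder
        self._expires_at = 0.0   # time at which the lock becomes stale

    def leader(self):
        """Return the lock holder, or None if the lock is absent or expired."""
        if self._leader is not None and self._clock() < self._expires_at:
            return self._leader
        return None

    def try_acquire_leader(self, name):
        """Take or renew the lock; fail if another live holder exists."""
        current = self.leader()
        if current is None or current == name:
            self._leader = name
            self._expires_at = self._clock() + self.ttl
            return True
        return False


def role_for(dcs, name):
    """A node acts as primary while it holds the leader key, replica otherwise."""
    return "primary" if dcs.try_acquire_leader(name) else "replica"
```

If the primary stops renewing the key, the key expires after the TTL and the next node that asks for it becomes the primary.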
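Patroni High-Availability Source Code Analysis
Patroni HA Startup Flow
Flow description:
●Load the cluster information: fetch it through the API exposed by the DCS. The main items are:
○config: records the PostgreSQL cluster ID and configuration (PostgreSQL parameters, timeout settings, and so on); used for cluster validation, node rebuilds, and similar tasks;
○leader: records the primary's election time, heartbeat time, election period, latest LSN, and so on; written once a node has won the leader race;
○sync: records the primary and its synchronous replicas; written by the primary and used to validate the synchronous replica during switchover and failover;
○failover: records a manually requested failover or switchover (for example the candidate node and an optional scheduled time).
●Check the cluster state: validates the cluster configuration, inspects the overall cluster state and the state of each node, and decides how PostgreSQL should be started;
●Start PostgreSQL: initializes the PostgreSQL data directory, applies the PostgreSQL configuration derived from the cluster information, and starts the server;
●Form the PostgreSQL cluster: assigns primary and replica roles to the started nodes and links them together into a complete cluster.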
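The "decide how PostgreSQL should be started" step can be sketched as a small decision function. This is a heavily simplified illustration under stated assumptions, not Patroni's actual logic: `ClusterInfo`, `choose_startup_action`, and the action names are hypothetical, and only two inputs (the initialize key and the leader key) are considered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterInfo:
    """Simplified view of what was loaded from the DCS (hypothetical type)."""
    initialize: Optional[str] = None  # cluster ID, set once the cluster is bootstrapped
    leader: Optional[str] = None      # name of the node holding the leader key


def choose_startup_action(cluster, data_dir_is_empty):
    """Pick a startup path from the cluster state and the local data directory."""
    if data_dir_is_empty:
        if cluster.initialize is None:
            return "bootstrap"          # nobody has initialized the cluster yet
        if cluster.leader is not None:
            return "clone_from_leader"  # build a replica from the running primary
        return "wait_for_leader"        # initialized, but nothing to copy from yet
    return "start_existing"             # data directory is present: start it as is


if __name__ == "__main__":
    print(choose_startup_action(ClusterInfo(), data_dir_is_empty=True))
```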
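Patroni HA Startup Flow Analysis
Loading the Cluster Information
Loading the cluster information is the first step of the HA startup flow, and it provides the most critical input for forming the PostgreSQL cluster.
Step 1: load the cluster information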
…
try:
    self.load_cluster_from_dcs()
    self.state_handler.reset_cluster_info_state(self.cluster, self.patroni.nofailover)
except Exception:
    self.state_handler.reset_cluster_info_state(None, self.patroni.nofailover)
    raise
…
Loading the cluster information through the DCS interface
def load_cluster_from_dcs(self):
    cluster = self.dcs.get_cluster()

    # We want to keep the state of cluster when it was healthy
    if not cluster.is_unlocked() or not self.old_cluster:
        self.old_cluster = cluster
    self.cluster = cluster

    if not self.has_lock(False):
        self.set_is_leader(False)

    self._leader_timeline = None if cluster.is_unlocked() else cluster.leader.timeline
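The "keep the state of the cluster when it was healthy" rule in this function can be isolated in a few lines. The sketch below uses hypothetical names (`Snapshot`, `ClusterTracker`) and assumes that "unlocked" simply means "no leader is recorded".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical cluster snapshot; only the leader name matters here."""
    leader: Optional[str]

    def is_unlocked(self):
        # Assumption: a cluster without a leader counts as unlocked.
        return self.leader is None


class ClusterTracker:
    def __init__(self):
        self.cluster = None      # most recent snapshot, healthy or not
        self.old_cluster = None  # last snapshot that had a leader

    def update(self, cluster):
        # Remember the snapshot if it is healthy, or if nothing is stored yet.
        if not cluster.is_unlocked() or not self.old_cluster:
            self.old_cluster = cluster
        self.cluster = cluster


if __name__ == "__main__":
    tracker = ClusterTracker()
    tracker.update(Snapshot(leader="pg-0"))
    tracker.update(Snapshot(leader=None))
    print(tracker.cluster, tracker.old_cluster)
```

After the leader disappears, `cluster` reflects the leaderless state while `old_cluster` still describes the last healthy topology.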
The cluster interface
def get_cluster(self, force=False):
    if force:
        self._bypass_caches()
    try:
        cluster = self._load_cluster()
    except Exception:
        self.reset_cluster()
        raise

    self._last_seen = int(time.time())

    with self._cluster_thread_lock:
        self._cluster = cluster
        self._cluster_valid_till = time.time() + self.ttl
        return cluster

@abc.abstractmethod
def _load_cluster(self):
    """Internally this method should build `Cluster` object which
    represents current state and topology of the cluster in DCS.
    this method supposed to be called only by `get_cluster` method.

    raise `~DCSError` in case of communication or other problems with DCS.
    If the current node was running as a master and exception raised,
    instance would be demoted."""
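The pair above follows the template-method pattern: the base class owns caching and error handling, and each backend only implements `_load_cluster`. Here is a minimal, self-contained sketch of that shape. `AbstractStore`, `StaticStore`, and `cached_cluster` are hypothetical names, and the reset behavior on failure is an assumption about what clearing the cache would look like.

```python
import abc
import threading
import time


class AbstractStore(abc.ABC):
    """Base class: caches the last loaded cluster for ttl seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._lock = threading.Lock()
        self._cluster = None
        self._valid_till = 0.0

    @abc.abstractmethod
    def _load_cluster(self):
        """Backend-specific load; must raise on communication problems."""

    def get_cluster(self):
        try:
            cluster = self._load_cluster()
        except Exception:
            # Assumption: on failure the cached view is discarded.
            with self._lock:
                self._cluster = None
                self._valid_till = 0.0
            raise
        with self._lock:
            self._cluster = cluster
            self._valid_till = time.time() + self.ttl
        return cluster

    def cached_cluster(self):
        """Return the cached cluster while it is still considered valid."""
        with self._lock:
            return self._cluster if time.time() < self._valid_till else None


class StaticStore(AbstractStore):
    """Toy backend: serves a fixed dict, or fails when data is None."""

    def __init__(self, ttl, data):
        super().__init__(ttl)
        self.data = data

    def _load_cluster(self):
        if self.data is None:
            raise RuntimeError("DCS unreachable")
        return dict(self.data)
```

Because the exception is re-raised after the cache is cleared, the caller still sees the failure and can react to it, for example by demoting the node as the docstring describes.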
Taking Kubernetes as the DCS as an example
def _load_cluster(self):
    stop_time = time.time() + self._retry.deadline
    self._api.refresh_api_servers_cache()
    try:
        with self._condition:
            self._wait_caches(stop_time)

            members = [self.member(pod) for pod in self._pods.copy().values()]
            nodes = self._kinds.copy()

        config = nodes.get(self.config_path)
        metadata = config and config.metadata
        annotations = metadata and metadata.annotations or {}

        # get initialize flag
        initialize = annotations.get(self._INITIALIZE)

        # get global dynamic configuration
        config = ClusterConfig.from_node(metadata and metadata.resource_version,
                                         annotations.get(self._CONFIG) or '{}',
                                         metadata.resource_version if self._CONFIG in annotations else 0)

        # get timeline history
        history = TimelineHistory.from_node(metadata and metadata.resource_version,
                                            annotations.get(self._HISTORY) or '[]')

        leader = nodes.get(self.leader_path)
        metadata = leader and leader.metadata
        self._leader_resource_version = metadata.resource_version if metadata else None
        annotations = metadata and metadata.annotations or {}

        # get last known leader lsn
        last_lsn = annotations.get(self._OPTIME)
        try:
            last_lsn = 0 if last_lsn is None else int(last_lsn)
        except Exception:
            last_lsn = 0

        # get permanent slots state (confirmed_flush_lsn)
        slots = annotations.get('slots')
        try:
            slots = slots and json.loads(slots)
        except Exception:
            slots = None

        # get leader
        leader_record = {n: annotations.get(n) for n in (self._LEADER, 'acquireTime',
                         'ttl', 'renewTime', 'transitions') if n in annotations}
        if (leader_record or self._leader_observed_record) and leader_record != self._leader_observed_record:
            self._leader_observed_record = leader_record
            self._leader_observed_time = time.time()

        leader = leader_record.get(self._LEADER)
        try:
            ttl = int(leader_record.get('ttl')) or self._ttl
        except (TypeError, ValueError):
            ttl = self._ttl

        if not metadata or not self._leader_observed_time or self._leader_observed_time + ttl < time.time():
            leader = None

        if metadata:
            member = Member(-1, leader, None, {})
            member = ([m for m in members if m.name == leader] or [member])[0]
            leader = Leader(metadata.resource_version, None, member)

        # failover key
        failover = nodes.get(self.failover_path)
        metadata = failover and failover.metadata
        failover = Failover.from_node(metadata and metadata.resource_version,
                                      metadata and (metadata.annotations or {}).copy())

        # get synchronization state
        sync = nodes.get(self.sync_path)
        metadata = sync and sync.metadata
        sync = SyncState.from_node(metadata and metadata.resource_version, metadata and metadata.annotations)

        return Cluster(initialize, config, leader, last_lsn, members, failover, sync, history, slots)
    except Exception:
        logger.exception('get_cluster')
        raise KubernetesError('Kubernetes API is not responding properly')
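The leader handling in this function is the subtle part: the leader is judged by when this node last saw the leader record change, measured on the local clock, rather than by comparing timestamps written by another node. The sketch below isolates that idea. `LeaderObserver` and the injectable `clock` are hypothetical, and the annotation key `"leader"` is an assumption standing in for `self._LEADER`.

```python
import time

LEADER_KEYS = ("leader", "acquireTime", "ttl", "renewTime", "transitions")


class LeaderObserver:
    """Tracks when the leader record last changed, using the local clock."""

    def __init__(self, default_ttl, clock=time.time):
        self._default_ttl = default_ttl
        self._clock = clock
        self._observed_record = None
        self._observed_time = None

    def current_leader(self, annotations):
        record = {k: annotations[k] for k in LEADER_KEYS if k in annotations}

        # Any change in the record (for example a new renewTime) resets the timer.
        if (record or self._observed_record) and record != self._observed_record:
            self._observed_record = record
            self._observed_time = self._clock()

        # Fall back to the default TTL when the annotation is missing or malformed.
        try:
            ttl = int(record.get("ttl")) or self._default_ttl
        except (TypeError, ValueError):
            ttl = self._default_ttl

        # No observation yet, or no change seen for longer than ttl: no leader.
        if self._observed_time is None or self._observed_time + ttl < self._clock():
            return None
        return record.get("leader")
```

With this approach a leader that stops renewing its record is treated as gone after `ttl` seconds of local time, even if the clocks of the nodes disagree.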
In the cluster information above, four objects carry the configuration: xxx-config, xxx-leader, xxx-failover, and xxx-sync. Their contents are as follows:
●xxx-config
% kubectl get cm pg142-1013-postgresql-config -oyaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    config: '{"loop_wait":10,"maximum_lag_on_failover":33554432,"postgresql":{"parameters":{"archive_command":"/bin/true","archive_mode":"on","archive_timeout":"1800s","autovacuum":"on","autovacuum_analyze_scale_factor":0.02,"autovacuum_max_workers":"3","autovacuum_naptime":"5min","autovacuum_vacuum_cost_delay":"2ms","autovacuum_vacuum_cost_limit":"-1","autovacuum_vacuum_scale_factor":0.05,"autovacuum_work_mem":"128MB","backend_flush_after":"0","bgwriter_delay":"200ms","bgwriter_flush_after":"256","bgwriter_lru_maxpages":"100","bgwriter_lru_multiplier":"2","checkpoint_completion_target":"0.9","checkpoint_flush_after":"256kB","checkpoint_timeout":"5min","commit_delay":"0