Environment

| Service | Version |
|:--------|:--------|
| PostgreSQL | PostgreSQL 9.6.8 |
| Patronictl | patronictl version 1.5.5 |
| Etcd | etcdctl version 2.2.5 |

Check the Patroni status
root@localhost:~# patronictl -c /data/scripts/patroni_postgresql.yml list "9.6/main"
+----------+---------+--------------+------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+----------+---------+--------------+------+---------+----+-----------+
| 9.6/main | pgsql_1 | 192.168.1.1 | | running | 7 | |
| 9.6/main | pgsql_2 | 192.168.1.2 | | stopped | | unknown |
+----------+---------+--------------+------+---------+----+-----------+
The second node, 192.168.1.2, is the primary, yet after startup its state is reported as stopped.
Check the Patroni log
tail -f /data/logs/patroni.log
# Key log lines:
2020-04-03 14:40:35,434 INFO: Lock owner: None; I am pgsql_2
2020-04-03 14:40:35,524 INFO: PAUSE: postgres is not running
2020-04-03 14:40:45,433 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:45,434 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:45,434 WARNING: Postgresql is not running.
2020-04-03 14:40:45,434 INFO: Lock owner: None; I am pgsql_2
2020-04-03 14:40:45,523 INFO: PAUSE: postgres is not running
2020-04-03 14:40:55,435 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:55,435 INFO: Process 77888 is not postmaster, too much difference between PID file start time 1565058216.95 and process start time 1565058213
2020-04-03 14:40:55,435 WARNING: Postgresql is not running.
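The log shows that the process start time and the start time recorded in the pid file differ too much, so Patroni reports PostgreSQL as not running even though PostgreSQL has in fact started normally. In other words, the state is misjudged purely because of the time gap. How large does the gap have to be before the process is treated as not running?
Compute the difference with Python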
root@localhost:~# python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> print abs(1565058216.95 - 1565058213)
3.95000004768
The two timestamps differ by about 3.95 seconds.
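The same arithmetic can be done in Python 3 directly from the log line, which avoids copying the numbers by hand. This is only a sketch: the regular expression and the helper name `extract_time_difference` are assumptions made for illustration and are not part of Patroni.

```python
import re

# Log line copied from the Patroni log shown earlier.
LOG_LINE = (
    "2020-04-03 14:40:45,433 INFO: Process 77888 is not postmaster, too much "
    "difference between PID file start time 1565058216.95 and process start "
    "time 1565058213"
)


def extract_time_difference(line):
    """Return the absolute gap in seconds between the two timestamps in the line."""
    match = re.search(
        r"start time (\d+(?:\.\d+)?) and process start time (\d+(?:\.\d+)?)", line
    )
    if match is None:
        raise ValueError("line does not look like the 'not postmaster' message")
    first, second = (float(value) for value in match.groups())
    return abs(first - second)


difference = extract_time_difference(LOG_LINE)
print(round(difference, 2))  # 3.95
```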
Look at the source code (postmaster.py)
def _is_postmaster_process(self):
    try:
        start_time = int(self._postmaster_pid.get('start_time', 0))
        if start_time and abs(self.create_time() - start_time) > 3:  # this is the key line
            logger.info('Process %s is not postmaster, too much difference between PID file start time %s and '
                        'process start time %s', self.pid, self.create_time(), start_time)
            return False
    except ValueError:
        logger.warning('Garbage start time value in pid file: %r', self._postmaster_pid.get('start_time'))
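The check is spelled out here: start_time is the value recorded in the pid file and self.create_time() is the start time of the process. If the two differ by more than 3 seconds, that is, when abs(self.create_time() - start_time) > 3 holds, the method returns False and the member is reported in the default stopped state instead of running. In this case the gap is about 3.95 seconds, so the condition is met.
[Note] The logger.info call passes the two times in the opposite order from the labels in the message: the first value printed is the process start time, and the second is the start time from the pid file.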
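To see the effect of the threshold in isolation, here is a minimal Python 3 sketch of the time comparison only. It is not Patroni's implementation: the function name, its parameters, and the `max_diff` argument are hypothetical, and any other checks that Patroni performs on the process are left out. The log arguments are passed in the order that matches the message labels.

```python
import logging

logger = logging.getLogger("postmaster_check_sketch")


def is_postmaster_process(pid, pid_file_start_time, process_create_time, max_diff=3):
    """Return False when the pid file start time and the process start time
    differ by more than max_diff seconds (hypothetical stand-alone version)."""
    if pid_file_start_time and abs(process_create_time - pid_file_start_time) > max_diff:
        logger.info(
            "Process %s is not postmaster, too much difference between PID file "
            "start time %s and process start time %s",
            pid, pid_file_start_time, process_create_time,
        )
        return False
    return True


# Values taken from the log: pid file start time and process start time.
print(is_postmaster_process(77888, 1565058213, 1565058216.95))              # False
print(is_postmaster_process(77888, 1565058213, 1565058216.95, max_diff=4))  # True
```

With a tolerance of 3 seconds the 3.95-second gap is rejected. With a tolerance of 4 it is accepted.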
Compare the pid file with the process start time
# Process start time: 10:23:36
root@localhost:~# ps -eo pid,lstart | grep 77888
77888 Tue Aug 6 10:23:36 2019
# pid file creation time: 10:23:33
root@localhost:~# ls --full-time | grep pid
-rw------- 1 postgres postgres 87 2019-08-06 10:23:33.597849566 +0800 postmaster.pid
# Start time recorded inside the pid file (third line)
root@localhost:~# cat postmaster.pid
77888
/data/pg9.6/main
1565058213 # this timestamp converts to 2019-08-06 10:23:33 (+0800)
5432
/var/run/postgresql
0.0.0.0
5432001 786435
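The conversion of the third line of postmaster.pid can be checked with a short Python 3 sketch. The pid file text is pasted in as a string so the example is self-contained. The helper name `read_start_time` and the fixed UTC+8 offset (taken from the `+0800` in the `ls --full-time` output) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Contents of postmaster.pid as shown above (without the inline comment).
PID_FILE_TEXT = """77888
/data/pg9.6/main
1565058213
5432
/var/run/postgresql
0.0.0.0
5432001 786435
"""


def read_start_time(pid_file_text):
    """Return the postmaster start time (epoch seconds) from the third line."""
    lines = pid_file_text.splitlines()
    return int(lines[2].strip())


start_time = read_start_time(PID_FILE_TEXT)
local_tz = timezone(timedelta(hours=8))  # assumed server offset, +0800
formatted = datetime.fromtimestamp(start_time, tz=local_tz).strftime("%Y-%m-%d %H:%M:%S")
print(formatted)  # 2019-08-06 10:23:33
```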
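Solution 1
Edit postmaster.py and change the threshold from 3 to 4 (the gap here is about 3.95 seconds), start Patroni, and change it back once startup has succeeded.
if start_time and abs(self.create_time() - start_time) > 3:
# change to
if start_time and abs(self.create_time() - start_time) > 4:
# then start Patroni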
root@localhost:~# patronictl -c /data/scripts/patroni_postgresql.yml list "9.6/main"
+----------+---------+--------------+------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+----------+---------+--------------+------+---------+----+-----------+
| 9.6/main | pgsql_1 | 192.168.1.1 | | running | 8 | 0.0 |
| 9.6/main | pgsql_2 | 192.168.1.2 | Leader| running | 8 | 0.0 |
+----------+---------+--------------+------+---------+----+-----------+
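Once Patroni has started successfully, revert the change to postmaster.py described above.
Solution 2
Restart PostgreSQL so that the pid file is regenerated and the two times agree. If the workload can tolerate a restart, this is the recommended approach: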
root@localhost:~# patronictl -c /data/scripts/patroni_postgresql.yml pause "9.6/main" --wait
root@localhost:~# /etc/init.d/postgresql stop
root@localhost:~# /etc/init.d/postgresql start
root@localhost:~# patronictl -c /data/scripts/patroni_postgresql.yml list "9.6/main"
+----------+---------+--------------+------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+----------+---------+--------------+------+---------+----+-----------+
| 9.6/main | pgsql_1 | 192.168.1.1 | | running | 8 | 0.0 |
| 9.6/main | pgsql_2 | 192.168.1.2 | Leader| running | 8 | 0.0 |
+----------+---------+--------------+------+---------+----+-----------+
Maintenance mode: on
The listing still shows Maintenance mode: on, because the cluster was paused before the restart. Once both members are confirmed running, the cluster presumably needs to be taken out of maintenance mode again:
root@localhost:~# patronictl -c /data/scripts/patroni_postgresql.yml resume "9.6/main"