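Version v3.7-DRAFT of the documentation is in DRAFT status. For the latest stable documentation, see v3.6.
Downgrade etcd from 3.5 to 3.4
In the general case, downgrading from etcd 3.5 to 3.4 can be a zero-downtime, rolling downgrade:
- one by one, stop the etcd 3.5 processes and replace them with etcd 3.4 processes
- after starting any 3.4 process, new features in 3.5 are no longer available to the cluster
Before starting a downgrade, read through the rest of this guide to prepare.
Downgrade checklists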
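NOTE: If your cluster has auth enabled, rolling downgrade from 3.5 is not supported, because 3.5 changes the format of auth-related WAL entries. You can follow the authentication instructions (content/en/docs/v3.5/op-guide/authentication/rbac.md) to disable auth and delete all users first.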
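Highlighted breaking changes from 3.5 to 3.4:
Difference in flags
If you are using any of the following flags in your 3.5 configuration, make sure to remove it, rename it, or change its default value when downgrading to 3.4.
NOTE: The diff is based on versions 3.5.14 and 3.4.33. The actual diff depends on the patch versions you use, so check it first with diff <(etcd-3.5/bin/etcd -h | grep \\-\\-) <(etcd-3.4/bin/etcd -h | grep \\-\\-).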
# flags not available in 3.4
-etcd --socket-reuse-port
-etcd --socket-reuse-address
-etcd --raft-read-timeout
-etcd --raft-write-timeout
-etcd --v2-deprecation
-etcd --client-cert-file
-etcd --client-key-file
-etcd --peer-client-cert-file
-etcd --peer-client-key-file
-etcd --self-signed-cert-validity
-etcd --enable-log-rotation --log-rotation-config-json=some.json
-etcd --experimental-enable-distributed-tracing --experimental-distributed-tracing-address='localhost:4317' --experimental-distributed-tracing-service-name='etcd' --experimental-distributed-tracing-instance-id='' --experimental-distributed-tracing-sampling-rate='0'
-etcd --experimental-compact-hash-check-enabled --experimental-compact-hash-check-time='1m'
-etcd --experimental-downgrade-check-time
-etcd --experimental-memory-mlock
-etcd --experimental-txn-mode-write-with-shared-buffer
-etcd --experimental-bootstrap-defrag-threshold-megabytes
-etcd --experimental-stop-grpc-service-on-defrag
# same flag with different names
-etcd --backend-bbolt-freelist-type=map
+etcd --experimental-backend-bbolt-freelist-type=array
# same flag different defaults
-etcd --pre-vote=true
+etcd --pre-vote=false
-etcd --logger=zap
+etcd --logger=capnslog
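The list above can also be checked mechanically against an existing 3.5 command line. The following is a minimal sketch, not an official tool: `flags_not_in_34` and `FLAGS_NOT_IN_34` are hypothetical names, and the flag list is copied from the 3.5.14 versus 3.4.33 diff in this guide, so it may not match other patch releases. Renamed flags and changed defaults are not covered.

```shell
# Flags that the diff in this guide marks as unavailable in 3.4
# (assumption: taken from 3.5.14 vs 3.4.33; re-check for your patch versions).
FLAGS_NOT_IN_34="
--socket-reuse-port
--socket-reuse-address
--raft-read-timeout
--raft-write-timeout
--v2-deprecation
--client-cert-file
--client-key-file
--peer-client-cert-file
--peer-client-key-file
--self-signed-cert-validity
--enable-log-rotation
--log-rotation-config-json
--experimental-enable-distributed-tracing
--experimental-distributed-tracing-address
--experimental-distributed-tracing-service-name
--experimental-distributed-tracing-instance-id
--experimental-distributed-tracing-sampling-rate
--experimental-compact-hash-check-enabled
--experimental-compact-hash-check-time
--experimental-downgrade-check-time
--experimental-memory-mlock
--experimental-txn-mode-write-with-shared-buffer
--experimental-bootstrap-defrag-threshold-megabytes
--experimental-stop-grpc-service-on-defrag
"

# flags_not_in_34 ARG...: print each argument whose flag name is in the list.
flags_not_in_34() {
  for arg in "$@"; do
    name=${arg%%=*}   # drop "=value" so --flag=value matches --flag
    for known in $FLAGS_NOT_IN_34; do
      if [ "$name" = "$known" ]; then
        echo "$name"
      fi
    done
  done
}
```

For example, `flags_not_in_34 --name s1 --experimental-memory-mlock` prints `--experimental-memory-mlock`, which would have to be removed before starting the 3.4 binary.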
etcd --logger zap
3.4 defaults to --logger=capnslog, while 3.5 defaults to --logger=zap.
If you want to keep using zap, specify it explicitly.
+etcd --logger=zap --log-outputs=stderr
+# to write logs to stderr and a.log file at the same time
+etcd --logger=zap --log-outputs=stderr,a.log
Difference in Prometheus metrics
# metrics not available in 3.4
-etcd_debugging_mvcc_db_compaction_last
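Server downgrade checklists
Downgrade requirements
To ensure a smooth rolling downgrade, the running cluster must be healthy. Check the health of the cluster with the etcdctl endpoint health command before proceeding.
The 3.4 version to downgrade to must be >= 3.4.32.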
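The version requirement can be checked in a script before any member is touched. This is a minimal sketch; `version_ge` is a hypothetical helper, and it assumes both arguments are plain `x.y.z` strings with exactly three numeric fields (no suffixes such as `-rc.1`).

```shell
# version_ge A B: succeed when version A is greater than or equal to B.
# The body runs in a subshell, so the IFS change does not leak to the caller.
version_ge() (
  IFS=.
  set -- $1 $2            # split both versions into $1..$3 and $4..$6
  [ "$1" -gt "$4" ] && return 0
  [ "$1" -lt "$4" ] && return 1
  [ "$2" -gt "$5" ] && return 0
  [ "$2" -lt "$5" ] && return 1
  [ "$3" -ge "$6" ]
)
```

A call such as `version_ge 3.4.33 3.4.32 || echo "target version too old"` could gate the rest of a downgrade script. The version string itself would come from the target binary, for instance from the `etcdserver` field of `/version` on a test instance.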
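Preparation
Before downgrading etcd, always test the services that rely on etcd in a staging environment before deploying the downgrade to production.
Before beginning, download the snapshot backup. Should something go wrong with the downgrade, this backup can be used to roll back to the existing etcd version. Note that the snapshot command only backs up the v3 data. For v2 data, see backing up the v2 datastore.
Before beginning, download the latest release of etcd 3.4, and make sure its version is >= 3.4.32.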
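Mixed versions
While downgrading, an etcd cluster supports mixed versions of etcd members and operates with the protocol of the lowest common version. The cluster is considered downgraded once any of its members is downgraded to version 3.4. Internally, etcd members negotiate with each other to determine the overall cluster version, which controls the reported version and the supported features.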
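One way to see the negotiated cluster version from the outside is to compare the two fields of a member's `/version` response. The sketch below is illustrative only: `json_field`, `minor_of`, and `member_state` are hypothetical helpers, the JSON is parsed with `sed` on the assumption that it is a flat object like the responses shown in this guide, and the mixed-state input used in the example is inferred from the description above rather than captured from a real cluster.

```shell
# json_field JSON KEY: print the string value of KEY from a flat JSON object.
json_field() {
  printf '%s\n' "$1" | sed -n 's/.*"'"$2"'":"\([^"]*\)".*/\1/p'
}

# minor_of VERSION: reduce "3.5.13" to "3.5".
minor_of() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# member_state JSON: compare a member's own version with the cluster version.
member_state() {
  server=$(minor_of "$(json_field "$1" etcdserver)")
  cluster=$(minor_of "$(json_field "$1" etcdcluster)")
  if [ "$server" = "$cluster" ]; then
    echo "member=$server cluster=$cluster aligned"
  else
    echo "member=$server cluster=$cluster mixed"
  fi
}
```

Under these assumptions, `member_state '{"etcdserver":"3.5.13","etcdcluster":"3.4.0"}'` reports `mixed`, meaning the member still runs a 3.5 binary while the cluster already operates at 3.4.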
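Limitations
Note: If the cluster only has v3 data and no v2 data, it is not subject to this limitation.
If the cluster is serving a v2 data set larger than 50MB, each newly downgraded member may take up to two minutes to catch up with the existing cluster. Check the size of a recent snapshot to estimate the total data size. In other words, it is safest to wait for 2 minutes between downgrading each member.
For a much larger total data size, 100MB or more, this one-time process might take even longer. Administrators of etcd clusters of this size can contact the etcd team before downgrading, and we will be happy to provide advice on the procedure.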
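Rollback
Once any member has been downgraded to 3.4, the cluster version is downgraded to 3.4, and all operations are "3.4" compatible. To roll back, follow the instructions in Upgrade etcd from 3.4 to 3.5.
Download the snapshot backup so that a rollback remains possible even after the cluster has been completely downgraded.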
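Downgrade procedure
This example shows how to downgrade a 3-member 3.5 etcd cluster running on a local machine.
Step 1: check downgrade requirements
Is the cluster healthy and running 3.5.x?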
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint health
<<COMMENT
localhost:2379 is healthy: successfully committed proposal: took = 2.118638ms
localhost:22379 is healthy: successfully committed proposal: took = 3.631388ms
localhost:32379 is healthy: successfully committed proposal: took = 2.157051ms
COMMENT
curl http://localhost:2379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT
curl http://localhost:22379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT
curl http://localhost:32379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT
Step 2: download snapshot backup from leader
Download the snapshot backup to provide a recovery path should any problems occur.
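As a sketch of this step, the snapshot can be taken with `etcdctl snapshot save` against a single endpoint. `save_snapshot` is a hypothetical wrapper, and the refusal to overwrite an existing file is a design choice of this sketch rather than etcdctl behavior.

```shell
# save_snapshot ENDPOINT FILE: save a v3 snapshot from one endpoint.
# Refuses to overwrite an existing file so that an older backup is not lost.
save_snapshot() {
  if [ -e "$2" ]; then
    echo "refusing to overwrite $2" >&2
    return 1
  fi
  etcdctl --endpoints="$1" snapshot save "$2"
}
```

With the example cluster in this guide, where `localhost:2379` is the leader, the call would be `save_snapshot localhost:2379 backup.db`. The file name `backup.db` is only an example.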
Step 3: stop one existing etcd server
Before stopping the server, check whether it is the leader
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint status -w=table
<<COMMENT
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| localhost:2379 | 8211f1d0f64f3269 | 3.5.13 | 20 kB | true | false | 2 | 9 | 9 | |
| localhost:22379 | 91bc3c398fb3c146 | 3.5.13 | 20 kB | false | false | 2 | 9 | 9 | |
| localhost:32379 | fd422379fda50e48 | 3.5.13 | 20 kB | false | false | 2 | 9 | 9 | |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
COMMENT
If the server to be stopped is the leader, you can avoid some downtime by using move-leader to transfer leadership to another server before stopping it.
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 move-leader 91bc3c398fb3c146
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint status -w=table
<<COMMENT
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| localhost:2379 | 8211f1d0f64f3269 | 3.5.13 | 20 kB | false | false | 3 | 11 | 11 | |
| localhost:22379 | 91bc3c398fb3c146 | 3.5.13 | 20 kB | true | false | 3 | 11 | 11 | |
| localhost:32379 | fd422379fda50e48 | 3.5.13 | 20 kB | false | false | 3 | 11 | 11 | |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
COMMENT
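The leader check can be scripted by reading the table output. This is a rough sketch: `leader_of_table` is a hypothetical helper, and it assumes the column layout shown above (ENDPOINT, ID, VERSION, DB SIZE, IS LEADER, ...), which may differ in other etcdctl versions.

```shell
# leader_of_table: read `etcdctl endpoint status -w=table` output on stdin and
# print the endpoint and member ID of the row whose IS LEADER column is true.
leader_of_table() {
  awk -F'|' '
    NF > 5 {
      # Fields are padded with spaces; trim them before comparing.
      for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
      # $1 is the empty text before the first "|", so IS LEADER is $6.
      if ($6 == "true") print $2, $3
    }'
}
```

Piping the second table above into `leader_of_table` prints `localhost:22379 91bc3c398fb3c146`. If the printed endpoint is the member about to be stopped, pass another member's ID to `move-leader` first.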
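When each etcd process is stopped, the other cluster members log expected errors. This is normal, since a cluster member connection has been (temporarily) broken.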
{"level":"info","ts":"2024-05-14T20:25:47.051124Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 became leader at term 3"}
{"level":"info","ts":"2024-05-14T20:25:47.051139Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 91bc3c398fb3c146 elected leader 91bc3c398fb3c146 at term 3"}
^C{"level":"warn","ts":"2024-05-14T20:27:09.094119Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269","error":"EOF"}
{"level":"warn","ts":"2024-05-14T20:27:09.09427Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269","error":"EOF"}
{"level":"warn","ts":"2024-05-14T20:27:09.095535Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"8211f1d0f64f3269","error":"failed to dial 8211f1d0f64f3269 on stream MsgApp v2 (peer 8211f1d0f64f3269 failed to find local node 91bc3c398fb3c146)"}
{"level":"warn","ts":"2024-05-14T20:27:09.43915Z","caller":"rafthttp/stream.go:223","msg":"lost TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269"}
{"level":"warn","ts":"2024-05-14T20:27:11.085646Z","caller":"etcdserver/cluster_util.go:294","msg":"failed to reach the peer URL","address":"http://127.0.0.1:12380/version","remote-member-id":"8211f1d0f64f3269","error":"Get \"http://127.0.0.1:12380/version\": dial tcp 127.0.0.1:12380: connect: connection refused"}
{"level":"warn","ts":"2024-05-14T20:27:11.085718Z","caller":"etcdserver/cluster_util.go:158","msg":"failed to get version","remote-member-id":"8211f1d0f64f3269","error":"Get \"http://127.0.0.1:12380/version\": dial tcp 127.0.0.1:12380: connect: connection refused"}
{"level":"warn","ts":"2024-05-14T20:27:13.557385Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"8211f1d0f64f3269","rtt":"416.079µs","error":"dial tcp 127.0.0.1:12380: connect: connection refused"}
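Step 4: restart the etcd server with the same configuration and the --next-cluster-version-compatible flag
Restart the etcd server with the same configuration but with the new etcd binary, and add the --next-cluster-version-compatible flag.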
-etcd-3.5/bin/etcd --name s1 \
+etcd-3.4/bin/etcd --name s1 \
--data-dir /tmp/etcd/s1 \
--listen-client-urls http://localhost:2379 \
--advertise-client-urls http://localhost:2379 \
--listen-peer-urls http://localhost:2380 \
--initial-advertise-peer-urls http://localhost:2380 \
--initial-cluster s1=http://localhost:2380,s2=http://localhost:22380,s3=http://localhost:32380 \
--initial-cluster-token tkn \
--initial-cluster-state existing \
--next-cluster-version-compatible
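The new 3.4 etcd will publish its information to the cluster. At this point, the cluster starts to operate with the 3.4 protocol, which is the lowest common version.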
> `{"level":"info","ts":"2024-05-13T21:05:43.981445Z","caller":"membership/cluster.go:561","msg":"set initial cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","cluster-version":"3.0"}`
> `{"level":"info","ts":"2024-05-13T21:05:43.982188Z","caller":"api/capability.go:77","msg":"enabled capabilities for version","cluster-version":"3.0"}`
> `{"level":"info","ts":"2024-05-13T21:05:43.982312Z","caller":"membership/cluster.go:549","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.0","to":"3.5"}`
> `{"level":"info","ts":"2024-05-13T21:05:43.982376Z","caller":"api/capability.go:77","msg":"enabled capabilities for version","cluster-version":"3.5"}`
> `{"level":"info","ts":"2024-05-13T21:05:44.000672Z","caller":"etcdserver/server.go:2152","msg":"published local member to cluster through raft","local-member-id":"8211f1d0f64f3269","local-member-attributes":"{Name:infra1 ClientURLs:[http://127.0.0.1:2379]}","request-path":"/0/members/8211f1d0f64f3269/attributes","cluster-id":"ef37ad9dc622a7c4","publish-timeout":"7s"}`
> `{"level":"info","ts":"2024-05-13T21:05:46.452631Z","caller":"membership/cluster.go:549","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.5","to":"3.4"}`
Verify that each member, and then the entire cluster, becomes healthy with the new 3.4 etcd binary:
etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
<<COMMENT
localhost:32379 is healthy: successfully committed proposal: took = 2.337471ms
localhost:22379 is healthy: successfully committed proposal: took = 1.130717ms
localhost:2379 is healthy: successfully committed proposal: took = 2.124843ms
COMMENT
Members that have not been downgraded yet will log info like the following:
{"level":"info","ts":"2024-05-13T21:05:46.450764Z","caller":"etcdserver/server.go:2633","msg":"updating cluster version using v2 API","from":"3.5","to":"3.4"}
{"level":"info","ts":"2024-05-13T21:05:46.452419Z","caller":"membership/cluster.go:576","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"91bc3c398fb3c146","from":"3.5","to":"3.4"}
{"level":"info","ts":"2024-05-13T21:05:46.452547Z","caller":"etcdserver/server.go:2652","msg":"cluster version is updated","cluster-version":"3.4"}
Step 5: repeat step 3 and step 4 for the rest of the members
When all members are downgraded, check the health and version of the cluster:
etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
<<COMMENT
localhost:2379 is healthy: successfully committed proposal: took = 492.834µs
localhost:22379 is healthy: successfully committed proposal: took = 1.015025ms
localhost:32379 is healthy: successfully committed proposal: took = 1.853077ms
COMMENT
curl http://localhost:2379/version
<<COMMENT
{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
COMMENT
curl http://localhost:22379/version
<<COMMENT
{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
COMMENT
curl http://localhost:32379/version
<<COMMENT
{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
COMMENT
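The three `curl` checks above can be folded into one loop. This is a minimal sketch under stated assumptions: `all_on_34` is a hypothetical helper, the endpoints are reachable over plain `http://` as in this example, and the `/version` response lists `etcdserver` before `etcdcluster` as in the output shown above.

```shell
# all_on_34 ENDPOINT...: succeed only if every endpoint reports a 3.4.x
# server version and a 3.4.x cluster version on its /version URL.
all_on_34() {
  for ep in "$@"; do
    body=$(curl -s "http://$ep/version") || return 1
    case $body in
      *'"etcdserver":"3.4.'*'"etcdcluster":"3.4.'*) ;;   # this member is done
      *) echo "$ep not fully downgraded: $body" >&2; return 1 ;;
    esac
  done
}
```

For the example cluster, `all_on_34 localhost:2379 localhost:22379 localhost:32379 && echo "downgrade complete"` prints the message only once every member runs a 3.4 binary.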