Note

This document is for a development version of Ceph.

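Peering

Concepts

Peering

The process of bringing all of the OSDs (Object Storage Devices) that store a placement group (PG) into agreement about the state of all of the objects in that PG and all of the metadata associated with those objects. Two OSDs can agree on the state of the objects in the placement group and still not have the latest contents.

Acting set

The ordered list of OSDs that are (or were as of some epoch) responsible for a particular PG.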

Up set

The ordered list of OSDs responsible for a particular PG for a particular epoch, according to CRUSH. This is the same as the acting set except when the acting set has been explicitly overridden via PG temp in the OSDMap.

PG temp

A temporary placement group acting set that is used while backfilling the primary OSD. Assume that the acting set is [0,1,2] and we are active+clean. Now assume that something happens and the acting set becomes [3,1,2]. Under these circumstances, OSD 3 is empty and can't serve reads even though it is the primary. osd.3 will respond by requesting a PG temp of [1,2,3] from the monitors using a MOSDPGTemp message, and osd.1 will become the primary temporarily. osd.1 will select osd.3 as a backfill peer and will continue to serve reads and writes while osd.3 is backfilled. When backfilling is complete, PG temp is discarded. The acting set changes back to [3,1,2] and osd.3 becomes the primary.
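The relationship between the up set, PG temp, and the acting set can be sketched in a few lines. This is only an illustration of the definitions above, not Ceph code: the function names `acting_set` and `primary` are hypothetical, and OSDs are represented as plain integers.

```python
def acting_set(up_set, pg_temp=None):
    """Return the acting set for a PG.

    Hypothetical helper: the acting set equals the up set unless a
    PG temp override is present, in which case the override wins.
    """
    return list(pg_temp) if pg_temp else list(up_set)


def primary(acting):
    """By convention, the primary is the first member of the acting set."""
    return acting[0] if acting else None


up = [3, 1, 2]                                # osd.3 is new and still empty
during_backfill = acting_set(up, pg_temp=[1, 2, 3])
after_backfill = acting_set(up)               # PG temp has been discarded

print(primary(during_backfill))               # osd.1 serves as temporary primary
print(primary(after_backfill))                # osd.3 takes over again
```

While the override is in place, the up set is unchanged; only the acting set (and therefore the primary) differs.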

current interval or past interval

A sequence of OSD map epochs during which the acting set and the up set for a particular PG do not change.
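As a rough illustration of this definition, the sketch below groups a per-epoch history into intervals. The input format (a list of `(epoch, up, acting)` tuples sorted by epoch) and the name `split_intervals` are assumptions made for the example; Ceph derives this information from OSD maps.

```python
def split_intervals(history):
    """Group consecutive epochs with identical up and acting sets.

    history: list of (epoch, up_set, acting_set) tuples, sorted by epoch.
    Returns a list of dicts with keys 'first', 'last', 'up', 'acting'.
    """
    intervals = []
    for epoch, up, acting in history:
        last = intervals[-1] if intervals else None
        if last and last["up"] == up and last["acting"] == acting:
            last["last"] = epoch          # same sets: extend the interval
        else:
            intervals.append(
                {"first": epoch, "last": epoch, "up": up, "acting": acting}
            )
    return intervals


history = [
    (10, [0, 1, 2], [0, 1, 2]),
    (11, [0, 1, 2], [0, 1, 2]),
    (12, [3, 1, 2], [3, 1, 2]),   # up set and acting set change
    (13, [3, 1, 2], [1, 2, 3]),   # only the acting set changes (PG temp)
    (14, [3, 1, 2], [1, 2, 3]),
]
for interval in split_intervals(history):
    print(interval["first"], interval["last"], interval["acting"])
```

The last element of the result corresponds to the current interval; the earlier ones are past intervals.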

primary

The member of the acting set that is responsible for coordinating peering. It is the only OSD that accepts client-initiated writes to the objects in a placement group. By convention, the primary is the first member of the acting set.

replica

A non-primary OSD in the acting set of a placement group. A replica has been recognized as a non-primary OSD and has been activated by the primary.

stray

An OSD that is not a member of the current acting set and has not yet been told to delete its copies of a particular placement group.

recovery

The process of ensuring that copies of all of the objects in a PG are on all of the OSDs in the acting set. After peering has been performed, the primary can begin accepting write operations and recovery can proceed in the background.

PG info

Basic metadata about the PG's creation epoch, the version of the most recent write to the PG, the last epoch started, the last epoch clean, and the beginning of the current interval. Any inter-OSD communication about PGs includes the PG info, so that any OSD that knows a PG exists (or once existed) also has a lower bound on last epoch clean or last epoch started.

PG log

A list of recent updates made to objects in a PG. These logs can be truncated after all OSDs in the acting set have acknowledged the changes.

missing set

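The set of all objects that have not yet had their contents updated to match the log entries. The missing set is collated by each OSD. Missing sets are tracked on an <OSD,PG> basis.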
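A minimal sketch of how a missing set could be derived from a PG log, assuming a simplified model in which each log entry is an `(object_name, version)` pair and versions are positive integers. The name `missing_set` and the data layout are hypothetical and much simpler than the structures Ceph uses.

```python
def missing_set(log_entries, on_disk_versions):
    """Return {object_name: needed_version} for one <OSD,PG> pair.

    log_entries: iterable of (object_name, version) pairs from the PG log.
    on_disk_versions: dict mapping object_name to the version stored locally
        (objects absent from the dict are treated as version 0).
    """
    newest = {}
    for name, version in log_entries:
        # Keep only the newest logged version of each object.
        newest[name] = max(version, newest.get(name, version))
    return {
        name: version
        for name, version in newest.items()
        if on_disk_versions.get(name, 0) < version
    }


log = [("a", 1), ("b", 2), ("a", 3)]
disk = {"a": 1, "b": 2}          # the update to "a" at version 3 is not on disk
print(missing_set(log, disk))
```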

Authoritative History

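A complete and fully ordered set of operations that brings an OSD's copy of a placement group up to date.

epoch

A (monotonically increasing) OSD map version number.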

last epoch start

The last epoch at which all nodes in the acting set for a given placement group agreed on an authoritative history. At the start of the last epoch, peering is deemed to have been successful.

up_thru

Before a primary can successfully complete the peering process, it must inform a monitor that it is alive through the current OSD map epoch by having the monitor set its up_thru in the OSD map. This helps peering ignore previous acting sets for which peering never completed after certain sequences of failures, such as the second interval below:

  • acting set = [A,B]

  • acting set = [A]

  • acting set = [] very shortly after (for example, simultaneous failure, but staggered detection)

  • acting set = [B] (B restarts, A does not)
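The sequence above can be turned into a small check. The sketch assumes that an interval can only have accepted writes if its primary's up_thru, as recorded in the OSD map of the interval's last epoch, had reached the interval's first epoch. The function name, the dict layout, and the epoch numbers are hypothetical.

```python
def may_have_accepted_writes(interval, up_thru_at_last_epoch):
    """Decide whether peering could have completed during an interval.

    interval: dict with 'first' (first epoch) and 'acting' (ordered OSD list).
    up_thru_at_last_epoch: dict mapping OSD to its up_thru value in the OSD
        map of the interval's last epoch (missing OSDs count as 0).
    """
    acting = interval["acting"]
    if not acting:
        return False                      # nobody could serve the PG
    primary = acting[0]
    return up_thru_at_last_epoch.get(primary, 0) >= interval["first"]


# Epoch numbers are made up for illustration.
first = {"first": 1, "acting": ["A", "B"]}
second = {"first": 5, "acting": ["A"]}    # A failed before up_thru was updated
third = {"first": 6, "acting": []}

print(may_have_accepted_writes(first, {"A": 1, "B": 0}))
print(may_have_accepted_writes(second, {"A": 1}))
print(may_have_accepted_writes(third, {}))
```

Under these assumptions B does not need to wait for A when it restarts, because the interval in which A was alone never completed peering and so cannot contain acknowledged writes.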

last epoch clean

The last epoch at which all nodes in the acting set for a given placement group were completely up to date (this includes both the PG's logs and the PG's object contents). At this point, recovery is deemed to have been completed.

Description of the Peering Process

The Golden Rule is that no write operation to any PG is acknowledged to a client until it has been persisted by all members of the acting set for that PG. This means that if we can communicate with at least one member of each acting set since the last successful peering, someone will have a record of every (acknowledged) operation since the last successful peering. It should therefore be possible for the current primary to construct and disseminate a new authoritative history.
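Stated as a predicate, the Golden Rule is a subset check. The following sketch is illustrative only; `can_acknowledge` and its arguments are hypothetical names.

```python
def can_acknowledge(acting, persisted_by):
    """Return True only if every acting set member has persisted the write.

    acting: ordered list of OSDs in the acting set.
    persisted_by: collection of OSDs that reported the write as persisted.
    """
    return bool(acting) and set(acting) <= set(persisted_by)


print(can_acknowledge([0, 1, 2], {0, 1}))       # one member still pending
print(can_acknowledge([0, 1, 2], {0, 1, 2}))    # safe to acknowledge
```

Because an acknowledged write is on every member, any single surviving member of that acting set is enough to recover it.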

It is also important to appreciate the role of the OSD map (list of all known OSDs and their states, as well as some information about the placement groups) in the peering process:

When OSDs go up or down (or get added or removed), this has the potential to affect the acting sets of many placement groups.

Before a primary successfully completes the peering process, the OSD map must reflect that the OSD was alive and well as of the first epoch in the current interval.

Changes can only be made after successful peering.

Thus, a new primary can use the latest OSD map along with a recent history of past maps to generate a set of past intervals to determine which OSDs must be consulted before we can successfully peer. The set of past intervals is bounded by last epoch started, the most recent past interval for which we know peering completed. The process by which an OSD discovers a PG exists in the first place is by exchanging PG info messages, so the OSD always has some lower bound on last epoch started.
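The bounding role of last epoch started can be sketched as follows. The interval dicts (`first`, `last`, `acting`) and both function names are assumptions for the example; intervals with an empty acting set are skipped because no OSD could have accepted writes during them.

```python
def intervals_to_consult(past_intervals, last_epoch_started):
    """Keep only past intervals that end at or after last epoch started."""
    return [
        iv for iv in past_intervals
        if iv["last"] >= last_epoch_started and iv["acting"]
    ]


def can_peer(past_intervals, last_epoch_started, reachable):
    """True if at least one OSD from every relevant interval is reachable."""
    needed = intervals_to_consult(past_intervals, last_epoch_started)
    return all(
        any(osd in reachable for osd in iv["acting"]) for iv in needed
    )


past = [
    {"first": 1, "last": 4, "acting": [0, 1]},
    {"first": 5, "last": 9, "acting": [1, 2]},
    {"first": 10, "last": 12, "acting": [2, 3]},
]
print(can_peer(past, 5, reachable={2}))      # osd.2 covers both intervals
print(can_peer(past, 5, reachable={0, 3}))   # nobody from epochs 5-9
```

Learning of a newer last epoch started from a peer shrinks the list returned by `intervals_to_consult`, which is why it reduces the number of OSDs that must be contacted.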

The high level process is for the current PG primary to:

  1. get a recent OSD map (to identify the members of all the interesting acting sets, and confirm that we are still the primary).

  2. generate a list of past intervals since last epoch started. Consider the subset of those for which, according to the OSD map of the interval's last epoch, up_thru was greater than or equal to the interval's first epoch; that is, the subset for which peering could have completed before the acting set changed to another set of OSDs.

    Successful peering will require that we be able to contact at least one OSD from the acting set of each of these past intervals.

  3. ask every node in that list for its PG info, which includes the most recent write made to the PG, and a value for last epoch started. If we learn about a last epoch started that is newer than our own, we can prune older past intervals and reduce the peer OSDs we need to contact.

  4. if anyone else has (in its PG log) operations that I do not have, instruct them to send me the missing log entries so that the primary's PG log is up to date (includes the newest write).

  5. for each member of the current acting set:

    1. ask it for copies of all PG log entries since last epoch start so that I can verify that they agree with mine (or know what objects I will be telling it to delete).

      If the cluster failed before an operation was persisted by all members of the acting set, and the subsequent peering did not remember that operation, and a node that did remember that operation later rejoined, its logs would record a different (divergent) history than the authoritative history that was reconstructed in the peering after the failure.

      Since the divergent events were not recorded in other logs from that acting set, they were not acknowledged to the client, and there is no harm in discarding them (so that all OSDs agree on the authoritative history). But we will have to instruct any OSD that stores data from a divergent update to delete the affected (and now deemed to be apocryphal) objects.

    2. ask it for its missing set (object updates recorded in its PG log, but for which it does not have the new data). This is the list of objects that must be fully replicated before we can accept writes.

  6. at this point, the primary's PG log contains an authoritative history of the placement group, and the OSD now has sufficient information to bring any other OSD in the acting set up to date.

  7. if the primary's up_thru value in the current OSD map is not greater than or equal to the first epoch in the current interval, send a request to the monitor to update it, and wait until we receive an updated OSD map that reflects the change.

  8. for each member of the current acting set:

    1. send them log updates to bring their PG logs into agreement with my own (authoritative history), which may involve deciding to delete divergent objects.

    2. await acknowledgment that they have persisted the PG log entries.

  9. at this point all OSDs in the acting set agree on all of the metadata, and would (in any future peering) return identical accounts of all updates.

    1. start accepting client write operations (because we have unanimous agreement on the state of the objects into which those updates are being accepted). Note, however, that if a client tries to write to an object, that object will be promoted to the front of the recovery queue, and the write will be applied after the object is fully replicated to the current acting set.

    2. update the last epoch started value in our local PG info, and instruct other acting set OSDs to do the same.

    3. start pulling object data updates that other OSDs have, but I do not. We may need to query OSDs from additional past intervals prior to last epoch started (the last time peering completed) and following last epoch clean (the last epoch that recovery completed) in order to find copies of all objects.

    4. start pushing object data updates to other OSDs that do not yet have them.

      We push these updates from the primary (rather than having the replicas pull them) because this allows the primary to ensure that a replica has the current contents before sending it an update write. It also makes it possible for a single read (from the primary) to be used to write the data to multiple replicas. If each replica did its own pulls, the data might have to be read multiple times.

  10. once all replicas store copies of all objects (that existed prior to the start of this epoch), we can update last epoch clean in the PG info, and we can dismiss all of the stray replicas, allowing them to delete their copies of objects for which they are no longer in the acting set.

    We could not dismiss the strays prior to this because it was possible that one of those strays might hold the sole surviving copy of an old object (all of whose copies disappeared before they could be replicated on members of the current acting set).
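The log comparison performed for each acting set member can be sketched as two list differences. This is a simplified model, not the Ceph implementation: log entries are assumed to be `(epoch, version, object_name)` tuples, and the function names are hypothetical.

```python
def divergent_entries(authoritative, replica):
    """Entries the replica logged that are absent from the authoritative log."""
    known = set(authoritative)
    return [entry for entry in replica if entry not in known]


def entries_to_send(authoritative, replica):
    """Authoritative entries the replica has not yet logged."""
    have = set(replica)
    return [entry for entry in authoritative if entry not in have]


authoritative = [(1, 1, "a"), (1, 2, "b"), (3, 3, "c")]
replica = [(1, 1, "a"), (1, 2, "b"), (2, 3, "d")]   # logged an unacknowledged write

divergent = divergent_entries(authoritative, replica)
print(divergent)                                    # entries to discard
print(entries_to_send(authoritative, replica))      # entries to push
print({name for _, _, name in divergent})           # objects needing cleanup
```

The divergent entries were never acknowledged to a client, so discarding them is safe; the objects they touched are the ones the replica must be told to delete or repair.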

Generate a State Model

Use the gen_state_diagram.py script to generate a copy of the latest peering state model:

$ git clone https://github.com/ceph/ceph.git
$ cd ceph
$ cat src/osd/PeeringState.h src/osd/PeeringState.cc | doc/scripts/gen_state_diagram.py > doc/dev/peering_graph.generated.dot
$ sed -i 's/7,7/1080,1080/' doc/dev/peering_graph.generated.dot
$ dot -Tsvg doc/dev/peering_graph.generated.dot > doc/dev/peering_graph.generated.svg

Sample state model:

../../_images/peering_graph.generated.svg
