Note

This document is for a development version of Ceph.

Troubleshooting Monitors

Even if a cluster experiences Monitor-related issues, the cluster is not necessarily in danger of going down. If a cluster has lost multiple Monitors, it can remain up and running as long as there are enough surviving Monitors to form a quorum.

If your cluster has experienced Monitor-related issues, we recommend that you consult the following troubleshooting information.

Initial Troubleshooting

The first steps in troubleshooting Ceph Monitors involve making sure that the Monitors are running and that they can communicate with and over the network. Follow the steps in this section to rule out the simplest causes of Monitor failure.

  1. Make sure that the Monitors are running.

    Make sure that the Monitor (mon) daemon processes (ceph-mon) are running. It might be the case that the mons have not been restarted after an upgrade. Checking for this simple oversight can save hours of painstaking troubleshooting. (A combined command sketch follows this list.)

    You must also make sure that the Manager daemons (ceph-mgr) are running. Remember that typical cluster configurations provide one Manager (ceph-mgr) for each Monitor (ceph-mon).

    Note

    In releases prior to v1.12.5, Rook will not run more than two managers.

  2. Make sure that you can reach the Monitor nodes.

    In certain rare cases, iptables rules might be blocking access to Monitor nodes or TCP ports. These rules might be left over from earlier stress testing or rule development. To check for the presence of such rules, SSH into each Monitor node and use telnet, nc, or a similar tool to attempt to connect to each of the other Monitor nodes on ports tcp/3300 and tcp/6789.

  3. Make sure that the “ceph status” command runs and receives a reply from the cluster.

    If the “ceph status” command receives a reply from the cluster, then the cluster is up and running. Monitors answer a status request only if there is a formed quorum. Confirm that one or more mgr daemons are reported as running. In a cluster with no deficiencies, ceph status will report that all mgr daemons are running.

    If the “ceph status” command does not receive a reply from the cluster, then there are probably not enough Monitors up to form a quorum. If the ceph -s command is run with no further options specified, it connects to an arbitrarily selected Monitor. In certain cases, however, it might be helpful to connect to a specific Monitor (or to several specific Monitors in sequence) by adding the -m flag to the command: for example, ceph status -m mymon1.

  4. None of this worked. What now?

    If the above solutions have not resolved your problem, you might find it helpful to examine each individual Monitor in turn. Even if no quorum has been formed, it is possible to contact each Monitor individually and request its status by using the ceph tell mon.ID mon_status command (here ID is the Monitor’s identifier).

    Run the ceph tell mon.ID mon_status command for each Monitor in the cluster. For more information about the output of this command, see Understanding mon_status.

    There is another way to contact each Monitor individually: SSH into each Monitor node and query the daemon’s admin socket. See Using the Monitor’s Admin Socket.
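
As a concrete illustration, here is a minimal command sketch that combines the checks above. It assumes systemd-managed daemons, a Monitor ID equal to the short hostname, and hypothetical peer hostnames mon2 and mon3; adjust all of these to match your own deployment (containerized deployments, for example, use different unit names).

# 1. Is the local ceph-mon daemon running? (systemd-based deployments)
systemctl status ceph-mon@$(hostname -s)

# 2. Can the other Monitor nodes be reached on the Monitor TCP ports?
nc -vz mon2 3300
nc -vz mon2 6789
nc -vz mon3 3300
nc -vz mon3 6789

# 3. Does the cluster answer a status request, via a specific Monitor?
ceph status -m mon2

# 4. Ask one Monitor directly for its own view of the cluster
ceph tell mon.$(hostname -s) mon_status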

Using the Monitor’s Admin Socket

A Monitor’s admin socket allows you to interact directly with a specific daemon by using a Unix socket file. This socket file is found in the Monitor’s run directory.

The admin socket’s default directory is /var/run/ceph/ceph-mon.ID.asok. It is possible to override the admin socket’s default location. If the default location has been overridden, then the admin socket will be elsewhere. This is often the case when a cluster’s daemons are deployed in containers.

To find the directory of the admin socket, check your ceph.conf for an alternative path or run the following command:

ceph-conf --name mon.ID --show-config-value admin_socket

The admin socket is available for use only when the Monitor daemon is running. Every time the Monitor is properly shut down, the admin socket is removed. If the Monitor is not running and yet the admin socket persists, it is likely that the Monitor has been improperly shut down. If the Monitor is not running, it will be impossible to use the admin socket, and the ceph command is likely to return Error 111: Connection Refused.

To access the admin socket, run a ceph tell command of the following form (specifying the daemon that you are interested in):

ceph tell mon.<id> mon_status

This command passes a mon_status command to the specified running Monitor daemon <id> via its admin socket. If you know the full path to the admin socket file, this can be done more directly by running the following command:

ceph --admin-daemon <full_path_to_asok_file> <command>

Running ceph help shows all supported commands that are available through the admin socket. See especially config get, config show, mon stat, and quorum_status.
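
As an illustration, the following sketch queries a Monitor through its admin socket. It assumes the default socket path and a Monitor ID equal to the short hostname; if yours differ, use the path reported by the ceph-conf command above.

# Query the Monitor's status directly through its admin socket
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok mon_status

# Other commonly useful admin-socket commands
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok quorum_status
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok config show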

Understanding mon_status

The status of a Monitor (as reported by the ceph tell mon.X mon_status command) can be obtained via the admin socket. The ceph tell mon.X mon_status command outputs a great deal of information about the Monitor, including the information found in the output of the quorum_status command.

Note

The command ceph tell mon.X mon_status is not meant to be input literally. The X portion of mon.X is meant to be replaced with a value specific to your Ceph cluster when you run the command.

To understand this command’s output, let us consider the following example, in which we see the output of ceph tell mon.c mon_status:

{ "name": "c",
  "rank": 2,
  "state": "peon",
  "election_epoch": 38,
  "quorum": [
        1,
        2],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 3,
      "fsid": "5c4e9d53-e2e1-478a-8061-f543f8be4cf8",
      "modified": "2013-10-30 04:12:01.945629",
      "created": "2013-10-29 14:14:41.914786",
      "mons": [
            { "rank": 0,
              "name": "a",
              "addr": "127.0.0.1:6789\/0"},
            { "rank": 1,
              "name": "b",
              "addr": "127.0.0.1:6790\/0"},
            { "rank": 2,
              "name": "c",
              "addr": "127.0.0.1:6795\/0"}]}}

This output reports that there are three monitors in the monmap (a, b, and c), that quorum is formed by only two monitors, and that c is in quorum as a peon.

Which monitor is out of quorum?

The answer is a (that is, mon.a). mon.a is out of quorum.

How do we know, in this example, that mon.a is out of quorum?

We know that mon.a is out of quorum because its rank, 0, does not appear in the quorum set.

If we examine the quorum set, we can see that there are two monitors in the set: 1 and 2. But these are not monitor names. They are monitor ranks, as established in the current monmap. The quorum set does not include the monitor that has rank 0, and according to the monmap that monitor is mon.a.

How are monitor ranks determined?

Monitor ranks are calculated (or recalculated) whenever Monitors are added to or removed from the cluster. The calculation of ranks follows a simple rule: the greater the IP:PORT combination, the lower the rank. In this case, because 127.0.0.1:6789 (mon.a) is numerically less than the other two IP:PORT combinations (127.0.0.1:6790 for “Monitor b” and 127.0.0.1:6795 for “Monitor c”), mon.a has the highest rank: namely, rank 0.
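
To see the ranks that your cluster has actually assigned, you can dump the current monmap. The commands below are standard; the ranks and addresses shown will reflect your own Monitors rather than this example.

# Print the current monmap, including each Monitor's rank, name, and address
ceph mon dump

# A terser summary of quorum membership
ceph mon stat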

Most Common Monitor Issues

The Cluster Has Quorum but at Least One Monitor Is Down

When the cluster has quorum but at least one Monitor is down, ceph health detail returns a message similar to the following:

$ ceph health detail
[snip]
mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)

How do I troubleshoot a Ceph cluster that has quorum but also has at least one monitor down?

  1. Make sure that mon.a is running.

  2. Make sure that you can connect to mon.a’s node from the other Monitor nodes. Check the TCP ports as well. Check iptables and nf_conntrack on all nodes and make sure that you are not dropping or rejecting connections (see the sketch below).
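
A quick firewall and connection-tracking check on each node might look like the following sketch. The port numbers are the Ceph defaults, and the sysctl names assume that the nf_conntrack module is loaded.

# Look for rules that could drop or reject Monitor traffic (default ports 3300 and 6789)
sudo iptables -L -n -v | grep -E '3300|6789|REJECT|DROP'

# Check whether the connection-tracking table is close to its limit
sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max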

If this initial troubleshooting doesn’t solve your problem, then further investigation is necessary.

First, check the problematic Monitor’s mon_status via the admin socket as explained in Using the Monitor’s Admin Socket and Understanding mon_status.

If the Monitor is out of the quorum, then its state will be one of the following: probing, electing, or synchronizing. If the state of the Monitor is leader or peon, then the Monitor believes itself to be in quorum but the rest of the cluster believes that it is not. It is possible that a Monitor that is in one of the probing, electing, or synchronizing states has entered the quorum during the process of troubleshooting. Check ceph status again to determine whether the Monitor has entered quorum during your troubleshooting. If the Monitor remains out of the quorum, then proceed with the investigations described in this section of the documentation.

What does it mean when a Monitor’s state is ``probing``?

If ceph health detail shows that a Monitor’s state is probing, then the Monitor is still looking for the other Monitors. Every Monitor remains in this state for some time when it is started. When a Monitor has connected to the other Monitors specified in the monmap, it ceases to be in the probing state. The amount of time that a Monitor is in the probing state depends upon the parameters of the cluster of which it is a part. For example, when a Monitor is a part of a single-monitor cluster (never do this in production), the Monitor passes through the probing state almost instantaneously. In a multi-monitor cluster, the Monitors stay in the probing state until they find enough Monitors to form a quorum. This means that if two out of three Monitors in the cluster are down, the one remaining Monitor stays in the probing state indefinitely until you bring one of the other Monitors up.

If quorum has been established, then the Monitor daemon should be able to find the other Monitors quickly, as long as they can be reached. If a Monitor is stuck in the probing state and you have exhausted the procedures above that describe the troubleshooting of communications between the Monitors, then it is possible that the problem Monitor is trying to reach the other Monitors at a wrong address. mon_status outputs the monmap that is known to the monitor: determine whether the other Monitors’ locations as specified in the monmap match the locations of the Monitors in the network. If they do not, see Recovering a Monitor’s Broken monmap. If the locations of the Monitors as specified in the monmap match the locations of the Monitors in the network, then the persistent probing state could be related to severe clock skews among the monitor nodes. See Clock Skews. If the information in Clock Skews does not bring the Monitor out of the probing state, then prepare your system logs and ask the Ceph community for help. See Preparing your logs for information about the proper preparation of logs.

What does it mean when a Monitor’s state is ``electing``?

If ceph health detail shows that a Monitor’s state is electing, the Monitor is in the middle of an election. Elections typically complete quickly, but sometimes the Monitors can get stuck in what is known as an election storm. See Monitor Elections for more on monitor elections.

The presence of an election storm might indicate clock skew among the monitor nodes. See Clock Skews for more information.

If your clocks are properly synchronized, search the mailing lists and bug tracker for issues similar to your issue. The electing state is not likely to persist. In versions of Ceph after the release of Cuttlefish, there is no obvious reason other than clock skew that explains why an electing state would persist.

It is possible to investigate the cause of a persistent electing state if you put the problematic Monitor into a down state while you investigate. This is possible only if there are enough surviving Monitors to form quorum.

What does it mean when a Monitor’s state is ``synchronizing``?

If ceph health detail shows that the Monitor is synchronizing, the Monitor is catching up with the rest of the cluster so that it can join the quorum. The amount of time that it takes for the Monitor to synchronize with the rest of the quorum is a function of the size of the cluster’s monitor store, the cluster’s size, and the state of the cluster. Larger and degraded clusters generally keep Monitors in the synchronizing state longer than do smaller, new clusters.

A Monitor that changes its state from synchronizing to electing and then back to synchronizing indicates a problem: the cluster state may be advancing (that is, generating new maps) too fast for the synchronization process to keep up with the pace of the creation of the new maps. This issue presented more frequently prior to the Cuttlefish release than it does in more recent releases, because the synchronization process has since been refactored and enhanced to avoid this dynamic. If you experience this in later versions, report the issue in the Ceph bug tracker. Prepare and provide logs to substantiate any bug you raise. See Preparing your logs for information about the proper preparation of logs.

What does it mean when a Monitor’s state is ``leader`` or ``peon``?

During normal Ceph operations when the cluster is in the HEALTH_OK state, one monitor in the Ceph cluster is in the leader state and the rest of the monitors are in the peon state. The state of a given monitor can be determined by examining the value of the state key returned by the command ceph tell <mon_name> mon_status.
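
For example, the state key can be read directly as in the sketch below. The Monitor name mon.a is only an example, and the jq variant assumes that jq is installed.

# Show only the "state" field ("leader", "peon", "probing", ...)
ceph tell mon.a mon_status | jq -r .state

# Without jq, simply search the JSON output
ceph tell mon.a mon_status | grep '"state"'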

If ceph health detail shows that the Monitor is in the leader state or in the peon state, it is likely that clock skew is present. Follow the instructions in Clock Skews. If you have followed those instructions and ceph health detail still shows that the Monitor is in the leader state or the peon state, report the issue in the Ceph bug tracker. If you raise an issue, provide logs to substantiate it. See Preparing your logs for information about the proper preparation of logs.

Recovering a Monitor’s Broken “monmap”

A monmap can be retrieved by running a command of the form ceph tell mon.c mon_status, as described in Understanding mon_status.

Here is an example of a monmap:

epoch 3
fsid 5c4e9d53-e2e1-478a-8061-f543f8be4cf8
last_changed 2013-10-30 04:12:01.945629
created 2013-10-29 14:14:41.914786
0: 127.0.0.1:6789/0 mon.a
1: 127.0.0.1:6790/0 mon.b
2: 127.0.0.1:6795/0 mon.c

This monmap is in working order, but your monmap might not be in working order. The monmap in a given node might be outdated because the node was down for a long time, during which the cluster’s Monitors changed.

There are two ways to update a Monitor’s outdated monmap:

  1. Scrap the monitor and redeploy.

    Do this only if you are certain that you will not lose the information kept by the Monitor that you scrap. Make sure that you have other Monitors in good condition, so that the new Monitor will be able to synchronize with the surviving Monitors. Remember that destroying a Monitor can lead to data loss if there are no other copies of the Monitor’s contents.

  2. Inject a monmap into the monitor.

    It is possible to fix a Monitor that has an outdated monmap by retrieving an up-to-date monmap from surviving Monitors in the cluster and injecting it into the Monitor that has a corrupted or missing monmap.

    Implement this solution by carrying out the following procedure:

    1. Retrieve the monmap in one of the two following ways:

      1. IF THERE IS A QUORUM OF MONITORS:

        Retrieve the monmap from the quorum:

        ceph mon getmap -o /tmp/monmap
        
      2. IF THERE IS NO QUORUM OF MONITORS:

        Retrieve the monmap directly from a Monitor that has been stopped:

        ceph-mon -i ID-FOO --extract-monmap /tmp/monmap
        

        In this example, the ID of the stopped Monitor is ID-FOO.

    2. Stop the Monitor into which the monmap will be injected:

      service ceph -a stop mon.{mon-id}
      
    3. Inject the monmap into the stopped Monitor:

      ceph-mon -i ID --inject-monmap /tmp/monmap
      
    4. Start the Monitor.

      Warning

      Injecting a monmap into a Monitor can cause serious problems. Injecting a monmap overwrites the latest existing monmap stored on the monitor. Be careful!
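
Before performing step 3 above, it can be worthwhile to confirm that the monmap you retrieved contains the Monitor addresses you expect. A minimal sketch:

# Print the epoch, fsid, and Monitor addresses contained in the retrieved monmap
monmaptool --print /tmp/monmap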

Clock Skews

The Paxos consensus algorithm requires close time synchronization, which means that clock skew among the monitors in the quorum can have a serious effect on monitor operation. The resulting behavior can be puzzling. To avoid this issue, run a clock synchronization tool on your monitor nodes: for example, use Chrony or the legacy ntpd utility. Configure each monitor node so that the iburst option is in effect and so that each monitor has multiple peers, including the following:

  • Each other

  • Internal NTP servers

  • Multiple external, public pool servers

Note

The iburst option sends a burst of eight packets instead of the usual single packet, and is used during the process of getting two peers into initial synchronization.
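
As an illustration, a minimal chrony configuration on a Monitor node might look like the following sketch. The hostnames are hypothetical; replace them with your own NTP servers and Monitor peers.

# /etc/chrony.conf (excerpt)
server ntp1.example.internal iburst    # internal NTP server
server ntp2.example.internal iburst    # second internal NTP server
pool pool.ntp.org iburst               # external public pool
peer mon2.example.internal iburst      # the other Monitor nodes
peer mon3.example.internal iburst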

Furthermore, it is advisable to synchronize all nodes in your cluster against internal and external servers, and perhaps even against your monitors. Run NTP servers on bare metal: VM-virtualized clocks are not suitable for steady timekeeping. See https://www.ntp.org for more information about the Network Time Protocol (NTP). Your organization might already have quality internal NTP servers available. Sources for NTP server appliances include the following:

  • Microsemi (formerly Symmetricom) https://microsemi.com

  • EndRun https://endruntechnologies.com

  • Netburner https://www.netburner.com

Clock Skew Questions and Answers

What’s the maximum tolerated clock skew?

By default, monitors allow clocks to drift up to a maximum of 0.05 seconds (50 milliseconds).

Can I increase the maximum tolerated clock skew?

Yes, but we strongly recommend against doing so. The maximum tolerated clock skew is configurable via the mon-clock-drift-allowed option, but it is almost certainly a bad idea to make changes to this option. The clock skew maximum is in place because clock-skewed monitors cannot be relied upon. The current default value has proven its worth at alerting the user before the monitors encounter serious problems. Changing this value might cause unforeseen effects on the stability of the monitors and overall cluster health.
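
You can confirm the value currently in effect without changing it; for example:

# Show the configured maximum tolerated clock drift, in seconds
ceph config get mon mon_clock_drift_allowed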

How do I know whether there is a clock skew?

The monitors will warn you via the cluster status HEALTH_WARN. When clock skew is present, the ceph health detail and ceph status commands return output resembling the following:

mon.c addr 10.10.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)

In this example, the monitor mon.c has been flagged as suffering from clock skew.

In Luminous and later releases, it is possible to check for a clock skew by running the ceph time-sync-status command. Note that the lead monitor typically has the numerically lowest IP address. It will always show 0: the reported offsets of other monitors are relative to the lead monitor, not to any external reference source.

What should I do if there is a clock skew?

Synchronize your clocks. Using an NTP client might help. However, if you are already using an NTP client and you still encounter clock skew problems, determine whether the NTP server that you are using is remote to your network or instead hosted on your network. Hosting your own NTP servers tends to mitigate clock skew problems.
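
If you use chrony, the following sketch shows how to verify on each Monitor node that the clock is actually being disciplined (ntpd users can run ntpq -p instead).

# Summary of the current offset, stratum, and synchronization status
chronyc tracking

# List the configured time sources and their reachability
chronyc sources -v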

Client Can’t Connect or Mount

If a client can’t connect to the cluster or mount, check your iptables. Some operating-system install utilities add a REJECT rule to iptables. These rules will reject all clients other than ssh that try to connect to the host. If your monitor host’s iptables have a REJECT rule in place, clients that connect from a separate node will fail, and this will raise a timeout error. Look for iptables rules that reject clients that are trying to connect to Ceph daemons. For example:

REJECT all -- anywhere anywhere reject-with icmp-host-prohibited

It might also be necessary to add rules to iptables on your Ceph hosts to ensure that clients are able to access the TCP ports associated with your Ceph monitors (default: port 6789) and Ceph OSDs (default: 6800 through 7568). For example:

iptables -A INPUT -m multiport -p tcp -s {ip-address}/{netmask} --dports 6789,6800:7568 -j ACCEPT

Monitor Store Failures

Symptoms of store corruption

Ceph Monitors maintain the cluster map in a key-value store. If key-value store corruption causes a Monitor to fail, then the Monitor log might contain one of the following error messages:

Corruption: error in middle of record

or:

Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.foo/store.db/1234567.ldb

Recovery using healthy monitor(s)

If the cluster contains surviving Monitors, the corrupted Monitor can be replaced with a new Monitor. After the new Monitor boots, it will synchronize with a healthy peer. After the new Monitor is fully synchronized, it will be able to serve clients.

Recovery using OSDs

Even if all monitors fail at the same time, it is possible to recover the Monitor store by using information that is stored in OSDs. You are encouraged to deploy at least three (and preferably five) Monitors in a Ceph cluster. In such a deployment, complete Monitor failure is unlikely. However, unplanned power loss in a data center whose disk settings or filesystem settings are improperly configured could cause the underlying filesystem to fail and this could kill all of the monitors. In such a case, data in the OSDs can be used to recover the Monitors. The following is a script that can be used in such a case to recover the Monitors:

ms=/root/mon-store
mkdir $ms

# collect the cluster map from stopped OSDs
for host in $hosts; do
  rsync -avz $ms/. user@$host:$ms.remote
  rm -rf $ms
  ssh user@$host <<EOF
    for osd in /var/lib/ceph/osd/ceph-*; do
      ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path $ms.remote
    done
EOF
  rsync -avz user@$host:$ms.remote/. $ms
done

# rebuild the monitor store from the collected map, if the cluster does not
# use cephx authentication, we can skip the following steps to update the
# keyring with the caps, and there is no need to pass the "--keyring" option.
# i.e. just use "ceph-monstore-tool $ms rebuild" instead
ceph-authtool /path/to/admin.keyring -n mon. \
  --cap mon 'allow *'
ceph-authtool /path/to/admin.keyring -n client.admin \
  --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
# add one or more ceph-mgr's key to the keyring. in this case, an encoded key
# for mgr.x is added, you can find the encoded key in
# /etc/ceph/${cluster}.${mgr_name}.keyring on the machine where ceph-mgr is
# deployed
ceph-authtool /path/to/admin.keyring --add-key 'AQDN8kBe9PLWARAAZwxXMr+n85SBYbSlLcZnMA==' -n mgr.x \
  --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
# If your monitors' ids are not sorted by ip address, please specify them in order.
# For example. if mon 'a' is 10.0.0.3, mon 'b' is 10.0.0.2, and mon 'c' is  10.0.0.4,
# please passing "--mon-ids b a c".
# In addition, if your monitors' ids are not single characters like 'a', 'b', 'c', please
# specify them in the command line by passing them as arguments of the "--mon-ids"
# option. if you are not sure, please check your ceph.conf to see if there is any
# sections named like '[mon.foo]'. don't pass the "--mon-ids" option, if you are
# using DNS SRV for looking up monitors.
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring --mon-ids alpha beta gamma

# make a backup of the corrupted store.db just in case!  repeat for
# all monitors.
mv /var/lib/ceph/mon/mon.foo/store.db /var/lib/ceph/mon/mon.foo/store.db.corrupted

# move rebuild store.db into place.  repeat for all monitors.
mv $ms/store.db /var/lib/ceph/mon/mon.foo/store.db
chown -R ceph:ceph /var/lib/ceph/mon/mon.foo/store.db

This script performs the following steps:

  1. Collect the map from each OSD host.

  2. Rebuild the store.

  3. Fill the entities in the keyring file with appropriate capabilities.

  4. Replace the corrupted store on mon.foo with the recovered copy.

Known limitations

The above recovery tool is unable to recover the following information:

  • Certain added keyrings: All of the OSD keyrings added using the ceph auth add command are recovered from the OSD’s copy, and the client.admin keyring is imported using ceph-monstore-tool. However, the MDS keyrings and all other keyrings will be missing in the recovered Monitor store. It might be necessary to manually re-add them.

  • Creating pools: If any RADOS pools were in the process of being created, that state is lost. The recovery tool operates on the assumption that all pools have already been created. If there are PGs that are stuck in the unknown state after the recovery for a partially created pool, you can force creation of the empty PG by running the ceph osd force-create-pg command (a command sketch follows this list). This creates an empty PG, so take this action only if you are certain that the pool is empty.

  • MDS Maps: The MDS maps are lost.
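
If you do need to force-create an empty PG for a partially created pool, the command looks like the following sketch. The PG ID 1.0 is only a placeholder; use the ID of the stuck PG, and note that recent releases also require a confirmation flag.

# Force creation of an empty PG -- do this only if you are certain that the pool is empty.
# Recent releases require the confirmation flag; older releases may not accept it.
ceph osd force-create-pg 1.0 --yes-i-really-mean-it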

Everything Failed! Now What?

Reaching out for help

You can find help on IRC in #ceph and #ceph-devel on OFTC (server irc.oftc.net), or at dev@ceph.io and ceph-users@lists.ceph.com. Make sure that you have prepared your logs and that you have them ready upon request.

The upstream Ceph Slack workspace can be joined at this address: https://ceph-storage.slack.com/

See https://ceph.io/en/community/connect/ for current (as of December 2023) information on getting in contact with the upstream Ceph community.

Preparing your logs

The default location of the Monitor logs is /var/log/ceph/ceph-mon.FOO.log*. It is possible that the location of the Monitor logs has been changed from the default. If the location of the Monitor logs has been changed from the default location, find the location of the Monitor logs by running the following command:

ceph-conf --name mon.FOO --show-config-value log_file

The amount of information in the logs is determined by the debug levels in the cluster’s configuration files. If Ceph is using the default debug levels, then your logs might be missing important information that would help the upstream Ceph community address your issue.

Raise debug levels to make sure that your Monitor logs contain relevant information. Here we are interested in information from the Monitors. As with other components, the Monitors have different parts that output their debug information on different subsystems.

If you are an experienced Ceph troubleshooter, we recommend raising the debug levels of the most relevant subsystems. This approach might not be easy for beginners. In most cases, however, enough information to address the issue will be logged if the following debug levels are entered:

debug_mon = 10
debug_ms = 1

Sometimes these debug levels do not yield enough information. In such cases, members of the upstream Ceph community will ask you to make additional changes to these or to other debug levels. In any case, it is better for us to receive at least some useful information than to receive an empty log.

Do I need to restart a monitor to adjust debug levels?

No. It is not necessary to restart a Monitor when adjusting its debug levels.

There are two different methods for adjusting debug levels. One method is used when there is quorum. The other is used when there is no quorum.

Adjusting debug levels when there is a quorum

Either inject the debug option into the specific monitor that needs to be debugged:

ceph tell mon.FOO config set debug_mon 10/10

Or inject it into all monitors at once:

ceph tell mon.* config set debug_mon 10/10

Adjusting debug levels when there is no quorum

Use the admin socket of the specific monitor that needs to be debugged and directly adjust the monitor’s configuration options:

ceph daemon mon.FOO config set debug_mon 10/10

Returning debug levels to their default values

To return the debug levels to their default values, run the above commands using the debug level 1/10 rather than the debug level 10/10. To check a Monitor’s current values, use the admin socket and run either of the following commands:

ceph daemon mon.FOO config show

or:

ceph daemon mon.FOO config get 'OPTION_NAME'

I Reproduced the problem with appropriate debug levels. Now what?

Send the upstream Ceph community only the portions of your logs that are relevant to your Monitor problems. Because it might not be easy for you to determine which portions are relevant, the upstream Ceph community accepts complete and unabridged logs. But don’t send logs containing hundreds of thousands of lines with no additional clarifying information. One common-sense way to help the Ceph community help you is to write down the current time and date when you are reproducing the problem and then extract portions of your logs based on that information.
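
For example, if you noted that the problem was reproduced at about 14:00 on 2024-01-15, a rough way to cut the relevant window out of a Monitor log might look like the following sketch; the path, date, and time are placeholders.

# Keep only log lines from the 14:00-14:09 window on the given date
awk '/^2024-01-15[T ]14:0/' /var/log/ceph/ceph-mon.FOO.log > mon-foo-snippet.log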

Contact the upstream Ceph community on the mailing lists or IRC or Slack, or by filing a new issue on the tracker.
