
FS volumes and subvolumes

The volumes module of the Ceph Manager daemon (ceph-mgr) provides a single source of truth for CephFS exports. The OpenStack shared file system service (manila) and the Ceph Container Storage Interface (CSI) storage administrators use the common CLI provided by the ceph-mgr volumes module to manage CephFS exports.

The ceph-mgr volumes module implements the following file system export abstractions:

  • FS volumes, an abstraction for CephFS file systems

  • FS subvolume groups, an abstraction for a directory level higher than FS subvolumes. Used to effect policies (for example, file layouts) across a set of subvolumes

  • FS subvolumes, an abstraction for independent CephFS directory trees

Possible use-cases for the export abstractions:

  • FS subvolumes used as Manila shares or CSI volumes

  • FS subvolume groups used as Manila share groups

Requirements

  • Nautilus (14.2.x) or later Ceph release

  • Cephx client user (see User Management) with at least the following capabilities:

    mon 'allow r'
    mgr 'allow rw'
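
For example, a client user with these capabilities can be created with the following command (a minimal sketch; the client name client.fsadmin is hypothetical):

ceph auth get-or-create client.fsadmin mon 'allow r' mgr 'allow rw'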
    

FS Volumes

Create a volume by running the following command:

ceph fs volume create <vol_name> [placement] [--data-pool <data-pool-name>] [--meta-pool <metadata-pool-name>]

This creates a CephFS file system and its data and metadata pools. Alternatively, if the data pool and/or metadata pool needed to create the CephFS volume already exist, the names of those pools can be passed to this command so that the volume is created using those existing pools. This command can also deploy MDS daemons for the file system using a Ceph Manager orchestrator module (for example Rook). See also Orchestrator CLI.

<vol_name> is the volume name (an arbitrary string). [placement] is an optional string that specifies the daemon placement for the MDS. See also Deploying CephFS for more examples of placement.

Note

Specifying placement via a YAML file is not supported through the volume interface.
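
For example (a sketch; the volume name vol01 and the hosts host1 and host2 are hypothetical, and a configured orchestrator backend such as cephadm is assumed), a volume could be created with two MDS daemons placed on specific hosts:

ceph fs volume create vol01 "2 host1 host2"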

To remove a volume, run the following command:

ceph fs volume rm <vol_name> [--yes-i-really-mean-it]

This command removes the file system and its data and metadata pools. It also tries to remove MDS daemons using the enabled Ceph Manager orchestrator module.

Note

After a volume has been deleted, we recommend restarting ceph-mgr if a new file system is created on the same cluster and the subvolume interface is being used. See https://tracker.ceph.com/issues/49605#note-5 for more details.

Note

If the snap-schedule Ceph Manager module is being used for a volume and the volume is deleted, then the snap-schedule Ceph Manager module will continue to hold references to the old pools. This will lead to the snap-schedule Ceph Manager module faulting and logging errors. To remedy this scenario, we recommend that the snap-schedule Ceph Manager module be restarted after volume deletion. If the faults still persist, then we recommend restarting ceph-mgr.

List volumes by running the following command:

ceph fs volume ls

Rename a volume by running the following command:

ceph fs volume rename <vol_name> <new_vol_name> [--yes-i-really-mean-it]

Renaming a volume can be an expensive operation that requires the following:

  • Renaming the orchestrator-managed MDS service to match <new_vol_name>. This involves launching an MDS service with <new_vol_name> and bringing down the MDS service with <vol_name>.

  • Renaming the file system from <vol_name> to <new_vol_name>.

  • Changing the application tags on the data and metadata pools of the file system to <new_vol_name>.

  • Renaming the metadata and data pools of the file system.

The CephX IDs that are authorized for <vol_name> must be reauthorized for <new_vol_name>. Any ongoing operations of clients that are using these IDs may be disrupted. Ensure that mirroring is disabled on the volume.

To fetch the information of a CephFS volume, run the following command:

ceph fs volume info vol_name [--human_readable]

The --human_readable flag shows used and available pool capacities in KB/MB/GB.

The output format is JSON and contains the following fields:

  • pools: attributes of the data and metadata pools
    • avail: the amount of free space available, in bytes

    • used: the amount of storage consumed, in bytes

    • name: the name of the pool

  • mon_addrs: list of Ceph monitor addresses

  • used_size: current used size of the CephFS volume, in bytes

  • pending_subvolume_deletions: number of subvolumes pending deletion

Sample output of the volume info command:

ceph fs volume info vol_name
{
    "mon_addrs": [
        "192.168.1.7:40977"
    ],
    "pending_subvolume_deletions": 0,
    "pools": {
        "data": [
            {
                "avail": 106288709632,
                "name": "cephfs.vol_name.data",
                "used": 4096
            }
        ],
        "metadata": [
            {
                "avail": 106288709632,
                "name": "cephfs.vol_name.meta",
                "used": 155648
            }
        ]
    },
    "used_size": 0
}

FS Subvolume groups

Create a subvolume group by running a command of the following form:

ceph fs subvolumegroup create <vol_name> <group_name> [--size <size_in_bytes>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>]

The command succeeds even if the subvolume group already exists.

When you create a subvolume group, you can specify its data pool layout (see File layouts), uid, gid, file mode in octal numerals, and size in bytes. The size of the subvolume group is specified by setting a quota on it (see CephFS Quotas). By default, the subvolume group is created with octal file mode 755, uid 0, gid 0, and the data pool layout of its parent directory.
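
For example (a sketch; the volume name cephfs and the group name group1 are hypothetical), the following creates a 10 GiB group with a restricted mode and a non-root owner:

ceph fs subvolumegroup create cephfs group1 --size 10737418240 --mode 750 --uid 1000 --gid 1000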

Remove a subvolume group by running a command of the following form:

ceph fs subvolumegroup rm <vol_name> <group_name> [--force]

The removal of a subvolume group fails if the subvolume group is not empty or is non-existent. The --force flag allows the command to succeed when its argument is a non-existent subvolume group.

Fetch the absolute path of a subvolume group by running a command of the following form:

ceph fs subvolumegroup getpath <vol_name> <group_name>

List subvolume groups by running a command of the following form:

ceph fs subvolumegroup ls <vol_name>

Note

The subvolume group snapshot feature is no longer supported in mainline CephFS (existing group snapshots can still be listed and deleted).

Fetch the metadata of a subvolume group by running a command of the following form:

ceph fs subvolumegroup info <vol_name> <group_name>

The output format is JSON and contains the following fields:

  • atime: access time of the subvolume group path in the format YYYY-MM-DD HH:MM:SS

  • mtime: time of the most recent modification of the subvolume group path in the format YYYY-MM-DD HH:MM:SS

  • ctime: time of the most recent change of the subvolume group path in the format YYYY-MM-DD HH:MM:SS

  • uid: uid of the subvolume group path

  • gid: gid of the subvolume group path

  • mode: mode of the subvolume group path

  • mon_addrs: list of monitor addresses

  • bytes_pcent: quota used in percentage if quota is set; otherwise displays "undefined"

  • bytes_quota: quota size in bytes if quota is set; otherwise displays "infinite"

  • bytes_used: current used size of the subvolume group in bytes

  • created_at: creation time of the subvolume group in the format "YYYY-MM-DD HH:MM:SS"

  • data_pool: data pool to which the subvolume group belongs

Check for the presence of a given subvolume group by running a command of the following form:

ceph fs subvolumegroup exist <vol_name>

The exist command outputs:

  • subvolumegroup exists: if any subvolume group is present

  • no subvolumegroup exists: if no subvolume group is present

Note

This command checks for the presence of custom groups, not the presence of the default group. A check for the presence of subvolume groups alone is not sufficient to validate that the volume is empty: subvolume presence must also be checked, because there might be subvolumes in the default group.

Resize a subvolume group by running a command of the following form:

ceph fs subvolumegroup resize <vol_name> <group_name> <new_size> [--no_shrink]

This command resizes the subvolume group quota, using the size specified by new_size. The --no_shrink flag prevents the subvolume group from shrinking below the currently used size.

The subvolume group may be resized to an infinite (but sparse) logical size by passing inf or infinite as the new_size.

Remove a snapshot of a subvolume group by running a command of the following form:

ceph fs subvolumegroup snapshot rm <vol_name> <group_name> <snap_name> [--force]

Supplying the --force flag allows the command to succeed even when it would otherwise fail because the snapshot does not exist.

List the snapshots of a subvolume group by running a command of the following form:

ceph fs subvolumegroup snapshot ls <vol_name> <group_name>

FS Subvolumes

Creating a subvolume

Use a command of the following form to create a subvolume:

ceph fs subvolume create <vol_name> <subvol_name> [--size <size_in_bytes>] [--group_name <subvol_group_name>] [--pool_layout <data_pool_name>] [--uid <uid>] [--gid <gid>] [--mode <octal_mode>] [--namespace-isolated] [--earmark <earmark>] [--normalization <form>] [--casesensitive <bool>]

The command succeeds even if the subvolume already exists.

When creating a subvolume, you can specify its subvolume group, data pool layout, uid, gid, file mode in octal numerals, and size in bytes. The size of the subvolume is specified by setting a quota on it (see CephFS Quotas). The subvolume can be created in a separate RADOS namespace by specifying the --namespace-isolated option. By default, a subvolume is created within the default subvolume group with an octal file mode of 755, the uid of its subvolume group, the gid of its subvolume group, the data pool layout of its parent directory, and no size limit. You can also assign an earmark to a subvolume using the --earmark option. An earmark is a unique identifier that tags a subvolume for a specific purpose, such as NFS or SMB services. By default no earmark is set, allowing flexible assignment based on administrative needs. An empty string ("") can be used to remove any existing earmark from a subvolume.

The earmarking mechanism ensures that subvolumes are correctly tagged and managed, helping to avoid conflicts and ensuring that each subvolume is associated with the intended service or use case.

Valid Earmarks

  • For NFS:
    • The valid earmark format is the top-level scope: 'nfs'.

  • For SMB:
    • The valid earmark formats are:
      • The top-level scope: 'smb'.

      • The top-level scope with an intra-module level scope: 'smb.cluster.{cluster_id}', where cluster_id is a short string that uniquely identifies the cluster.

      • Example without intra-module scope: smb

      • Example with intra-module scope: smb.cluster.cluster_1

Note

If you are changing the earmark from one scope to another (for example, from nfs to smb or vice versa), be aware that user permissions and ACLs associated with the previous scope might still apply. Make sure to update any necessary permissions to maintain proper access control.
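
For example (a sketch; the volume, subvolume, and cluster names are hypothetical), a subvolume intended for a specific SMB cluster could be created as follows:

ceph fs subvolume create cephfs subvol1 --earmark smb.cluster.cluster_1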

When creating a subvolume, you can also specify a Unicode normalization form by using the --normalization option. The chosen form is used to process file names internally, so that Unicode characters which can be represented by different sequences of Unicode code points are all mapped to a single representation, which means that they will access the same file. However, users will continue to see the same names that they used when creating the files.

The valid values for the Unicode normalization form are:

  • nfd: canonical decomposition (default)

  • nfc: canonical decomposition, followed by canonical composition

  • nfkd: compatibility decomposition

  • nfkc: compatibility decomposition, followed by canonical composition

To learn more about Unicode normalization forms, see https://unicode.org/reports/tr15.

A subvolume can also be configured for case-insensitive access by using the --casesensitive=0 option. When this option is added, file names that differ only in the case of their characters are mapped to the same file. The case used when a file was created is preserved.

Note

Setting the --casesensitive=0 option implicitly enables Unicode normalization on the subvolume.
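
For example (a sketch; the volume and subvolume names are hypothetical), the following creates a case-insensitive subvolume that uses NFC normalization; the effective settings can be read back with the charmap commands described later in this document:

ceph fs subvolume create cephfs subvol_ci --normalization nfc --casesensitive=0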

Removing a subvolume

Use a command of the following form to remove a subvolume:

ceph fs subvolume rm <vol_name> <subvol_name> [--group_name <subvol_group_name>] [--force] [--retain-snapshots]

This command removes the subvolume and its contents. This is done in two steps. First, the subvolume is moved to a trash folder. Then, the contents of that trash folder are purged asynchronously.

The removal of a subvolume fails if the subvolume has snapshots or does not exist. The --force flag allows the "non-existent subvolume remove" command to succeed.

To remove a subvolume while retaining its snapshots, use the --retain-snapshots flag. If snapshots associated with a given subvolume are retained, then the subvolume is considered empty for all operations that do not involve the retained snapshots.

Note

A subvolume that has retained snapshots can be recreated by using ceph fs subvolume create.
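
For example (a sketch; the names are hypothetical), a snapshotted subvolume can be removed while keeping its snapshots and then recreated under the same name:

ceph fs subvolume rm cephfs subvol1 --retain-snapshots
ceph fs subvolume create cephfs subvol1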

Resizing a subvolume

Use a command of the following form to resize a subvolume:

ceph fs subvolume resize <vol_name> <subvol_name> <new_size> [--group_name <subvol_group_name>] [--no_shrink]

This command resizes the subvolume quota, using the size specified by new_size. The --no_shrink flag prevents the subvolume from shrinking below the currently used size.

The subvolume can be resized to an infinite (but sparse) logical size by passing inf or infinite as <new_size>.
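
For example (a sketch; the names are hypothetical), the following grows subvol1 to 10 GiB without allowing it to shrink, and then removes the size limit entirely:

ceph fs subvolume resize cephfs subvol1 10737418240 --no_shrink
ceph fs subvolume resize cephfs subvol1 inf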

Authorizing CephX auth IDs

Use a command of the following form to authorize CephX auth IDs. This provides read or read-write access to file system subvolumes:

ceph fs subvolume authorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>] [--access_level=<access_level>]

The <access_level> option takes r or rw as a value.
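
For example (a sketch; the auth ID user1 and the other names are hypothetical), read-write access to a subvolume can be granted as follows:

ceph fs subvolume authorize cephfs subvol1 user1 --access_level=rw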

Deauthorizing CephX auth IDs

Use a command of the following form to deauthorize CephX auth IDs. This removes read or read-write access to file system subvolumes:

ceph fs subvolume deauthorize <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]

Listing CephX auth IDs

Use a command of the following form to list the CephX auth IDs that are authorized to access a file system subvolume:

ceph fs subvolume authorized_list <vol_name> <sub_name> [--group_name=<group_name>]

Evicting file system clients (auth ID)

Use a command of the following form to evict file system clients based on the auth ID and the subvolume mounted:

ceph fs subvolume evict <vol_name> <sub_name> <auth_id> [--group_name=<group_name>]

Fetching the absolute path of a subvolume

Use a command of the following form to fetch the absolute path of a subvolume:

ceph fs subvolume getpath <vol_name> <subvol_name> [--group_name <subvol_group_name>]

Fetching a subvolume's information

Use a command of the following form to fetch a subvolume's information:

ceph fs subvolume info <vol_name> <subvol_name> [--group_name <subvol_group_name>]

The output format is JSON and contains the following fields.

  • atime: access time of the subvolume path in the format YYYY-MM-DD HH:MM:SS

  • mtime: modification time of the subvolume path in the format YYYY-MM-DD HH:MM:SS

  • ctime: change time of the subvolume path in the format YYYY-MM-DD HH:MM:SS

  • uid: uid of the subvolume path

  • gid: gid of the subvolume path

  • mode: mode of the subvolume path

  • mon_addrs: list of monitor addresses

  • bytes_pcent: quota used in percentage if quota is set; otherwise displays undefined

  • bytes_quota: quota size in bytes if quota is set; otherwise displays infinite

  • bytes_used: current used size of the subvolume in bytes

  • created_at: creation time of the subvolume in the format YYYY-MM-DD HH:MM:SS

  • data_pool: data pool to which the subvolume belongs

  • path: absolute path of the subvolume

  • type: subvolume type, indicating whether it is a clone or a subvolume

  • pool_namespace: RADOS namespace of the subvolume

  • features: features supported by the subvolume

  • state: current state of the subvolume

  • earmark: earmark of the subvolume

If a subvolume has been removed but its snapshots have been retained, the output contains only the following fields.

  • type: subvolume type, indicating whether it is a clone or a subvolume

  • features: features supported by the subvolume

  • state: current state of the subvolume

A subvolume's features are based on the internal version of the subvolume and are a subset of the following:

  • snapshot-clone: supports cloning using a subvolume's snapshot as the source

  • snapshot-autoprotect: supports automatic protection of snapshots from deletion when they are active clone sources

  • snapshot-retention: supports removing subvolume contents while retaining any existing snapshots

A subvolume's state is based on the current state of the subvolume and contains one of the following values.

  • complete: the subvolume is ready for all operations

  • snapshot-retained: the subvolume has been removed but its snapshots have been retained

Listing subvolumes

Use a command of the following form to list subvolumes:

ceph fs subvolume ls <vol_name> [--group_name <subvol_group_name>]

Note

Subvolumes that have been removed but whose snapshots have been retained are also listed.

Checking for the presence of a subvolume

Use a command of the following form to check for the presence of a given subvolume:

ceph fs subvolume exist <vol_name> [--group_name <subvol_group_name>]

These are the outputs of the exist command:

  • subvolume exists: if any subvolume of the given group_name is present

  • no subvolume exists: if no subvolume of the given group_name is present

Setting custom metadata on a subvolume

Use a command of the following form to set custom metadata on a subvolume as a key-value pair:

ceph fs subvolume metadata set <vol_name> <subvol_name> <key_name> <value> [--group_name <subvol_group_name>]

Note

If the key_name already exists, the old value is replaced by the new value.

Note

key_name and value should be a string of ASCII characters (as specified in Python's string.printable). key_name is case-insensitive and is always stored in lowercase.

Note

Custom metadata on a subvolume is not preserved when the subvolume is snapshotted, and is therefore also not preserved when the subvolume snapshot is cloned.

Getting custom metadata that has been set on a subvolume

Use a command of the following form to get the custom metadata that has been set on a subvolume, using the metadata key:

ceph fs subvolume metadata get <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>]

Listing custom metadata that has been set on a subvolume

Use a command of the following form to list the custom metadata (key-value pairs) that has been set on a subvolume:

ceph fs subvolume metadata ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]
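
For example (a sketch; all names and values are hypothetical), a key can be set, read back, and listed as follows:

ceph fs subvolume metadata set cephfs subvol1 owner alice
ceph fs subvolume metadata get cephfs subvol1 owner
ceph fs subvolume metadata ls cephfs subvol1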

Removing custom metadata from a subvolume

Use a command of the following form to remove custom metadata from a subvolume, using the metadata key:

ceph fs subvolume metadata rm <vol_name> <subvol_name> <key_name> [--group_name <subvol_group_name>] [--force]

Using the --force flag allows the command to succeed when it would otherwise fail (because the metadata key does not exist).

Getting the earmark of a subvolume

Use a command of the following form to get the earmark of a subvolume:

ceph fs subvolume earmark get <vol_name> <subvol_name> [--group_name <subvol_group_name>]

Setting the earmark of a subvolume

Use a command of the following form to set the earmark of a subvolume:

ceph fs subvolume earmark set <vol_name> <subvol_name> [--group_name <subvol_group_name>] <earmark>

Removing the earmark of a subvolume

Use a command of the following form to remove the earmark of a subvolume:

ceph fs subvolume earmark rm <vol_name> <subvol_name> [--group_name <subvol_group_name>]

Creating a snapshot of a subvolume

Use a command of the following form to create a snapshot of a subvolume:

ceph fs subvolume snapshot create <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
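
For example (a sketch; the names are hypothetical), a snapshot can be created and then verified with the snapshot ls command shown later in this section:

ceph fs subvolume snapshot create cephfs subvol1 snap1
ceph fs subvolume snapshot ls cephfs subvol1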

Removing a snapshot of a subvolume

Use a command of the following form to remove a snapshot of a subvolume:

ceph fs subvolume snapshot rm <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>] [--force]

Using the --force flag allows the command to succeed when it would otherwise fail (because the snapshot does not exist).

Note

If the last snapshot within a snapshot-retained subvolume is removed, the subvolume is also removed.

Fetching the path of a subvolume's snapshot

Use a command of the following form to fetch the absolute path of a subvolume's snapshot:

ceph fs subvolume snapshot getpath <volname> <subvol_name> <snap_name> [<group_name>]

Listing the snapshots of a subvolume

Use a command of the following form to list the snapshots of a subvolume:

ceph fs subvolume snapshot ls <vol_name> <subvol_name> [--group_name <subvol_group_name>]

Fetching a snapshot's information

Use a command of the following form to fetch a snapshot's information:

ceph fs subvolume snapshot info <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]

The output format is JSON and contains the following fields.

  • created_at: creation time of the snapshot in the format YYYY-MM-DD HH:MM:SS:ffffff

  • data_pool: data pool to which the snapshot belongs

  • has_pending_clones: yes if snapshot clones are in progress or pending, otherwise no

  • pending_clones: list of in-progress or pending snapshot clones and their target groups (if any exist); otherwise this field is not shown

  • orphan_clones_count: count of orphan clones if the snapshot has orphan clones; otherwise this field is not shown

Sample output when snapshot clones are in progress or pending:

ceph fs subvolume snapshot info cephfs subvol snap
{
    "created_at": "2022-06-14 13:54:58.618769",
    "data_pool": "cephfs.cephfs.data",
    "has_pending_clones": "yes",
    "pending_clones": [
        {
            "name": "clone_1",
            "target_group": "target_subvol_group"
        },
        {
            "name": "clone_2"
        },
        {
            "name": "clone_3",
            "target_group": "target_subvol_group"
        }
    ]
}

Sample output when no snapshot clones are in progress or pending:

ceph fs subvolume snapshot info cephfs subvol snap
{
    "created_at": "2022-06-14 13:54:58.618769",
    "data_pool": "cephfs.cephfs.data",
    "has_pending_clones": "no"
}

Setting custom key-value pair metadata on a snapshot

Use a command of the following form to set custom key-value metadata on a snapshot:

ceph fs subvolume snapshot metadata set <vol_name> <subvol_name> <snap_name> <key_name> <value> [--group_name <subvol_group_name>]

Note

If the key_name already exists, the old value is replaced by the new value.

Note

The key_name and value should be a string of ASCII characters (as specified in Python's string.printable). The key_name is case-insensitive and is always stored in lowercase.

Note

Custom metadata on a snapshot is not preserved when the subvolume is snapshotted, and is therefore also not preserved when the subvolume snapshot is cloned.

Getting custom metadata that has been set on a snapshot

Use a command of the following form to get the custom metadata that has been set on a snapshot, using the metadata key:

ceph fs subvolume snapshot metadata get <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>]

Listing custom metadata that has been set on a snapshot

Use a command of the following form to list the custom metadata (key-value pairs) that has been set on a snapshot:

ceph fs subvolume snapshot metadata ls <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]

Removing custom metadata from a snapshot

Use a command of the following form to remove custom metadata from a snapshot, using the metadata key:

ceph fs subvolume snapshot metadata rm <vol_name> <subvol_name> <snap_name> <key_name> [--group_name <subvol_group_name>] [--force]

Using the --force flag allows the command to succeed when it would otherwise fail (because the metadata key does not exist).

Cloning snapshots

Subvolumes can be created by cloning subvolume snapshots. Cloning is an asynchronous operation that copies data from a snapshot to a subvolume. Because cloning is an operation that involves bulk copying, it is slow for very large data sets.

Note

Removing a snapshot (source subvolume) fails when there are pending or in-progress clone operations.

Protecting snapshots prior to cloning was a prerequisite in the Nautilus release. Commands that made possible the protection and unprotection of snapshots were introduced for this purpose. This prerequisite is being deprecated and may be removed from a future release.

The commands being deprecated are:

ceph fs subvolume snapshot protect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]
ceph fs subvolume snapshot unprotect <vol_name> <subvol_name> <snap_name> [--group_name <subvol_group_name>]

Note

Using the above commands will not result in an error, but they have no useful purpose.

Note

Use the subvolume info command to fetch subvolume metadata regarding supported features. Based on the availability of the snapshot-autoprotect feature, this helps decide whether protecting and unprotecting snapshots is required.

Run a command of the following form to initiate a clone operation:

ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name>

Note

The subvolume snapshot clone command depends upon the snapshot_clone_no_wait config option, which is described in the Configurables section below.

Run a command of the following form when the snapshot (source subvolume) is part of a non-default group. Note that the group name needs to be specified:

ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --group_name <subvol_group_name>

Cloned subvolumes can be part of a different group than the source snapshot (by default, cloned subvolumes are created in the default group). Run a command of the following form to clone into a particular group:

ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --target_group_name <subvol_group_name>

Pool layout can be specified when creating a cloned subvolume in a way that is similar to specifying a pool layout when creating a subvolume. Run a command of the following form to create a cloned subvolume with a specific pool layout:

ceph fs subvolume snapshot clone <vol_name> <subvol_name> <snap_name> <target_subvol_name> --pool_layout <pool_layout>

Run a command of the following form to check the status of a clone operation:

ceph fs clone status <vol_name> <clone_name> [--group_name <group_name>]

A clone can be in one of the following states:

  1. pending : Clone operation has not started

  2. in-progress : Clone operation is in progress

  3. complete : Clone operation has successfully finished

  4. failed : Clone operation has failed

  5. canceled : Clone operation has been canceled by the user

The reason for a clone failure is shown in the following fields:

  1. errno : error number

  2. error_msg : failure error string

Here is an example of an in-progress clone:

ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
ceph fs clone status cephfs clone1
{
  "status": {
    "state": "in-progress",
    "source": {
      "volume": "cephfs",
      "subvolume": "subvol1",
      "snapshot": "snap1"
    },
    "progress_report": {
      "percentage cloned": "12.24%",
      "amount cloned": "376M/3.0G",
      "files cloned": "4/6"
    }
  }
}

A progress report is also printed in the output while the clone is in progress. Here the progress is reported only for the specific clone. For the collective progress made by all ongoing clones, a progress bar is printed at the bottom of the output of the ceph status command:

progress:
  3 ongoing clones - average progress is 47.569% (10s)
    [=============...............] (remaining: 11s)

If the number of clone jobs is greater than the number of cloner threads, two progress bars are printed: one for ongoing clones (same as above) and the other for all (ongoing plus pending) clones:

progress:
  4 ongoing clones - average progress is 27.669% (15s)
    [=======.....................] (remaining: 41s)
  Total 5 clones - average progress is 41.667% (3s)
    [===========.................] (remaining: 4s)

Note

The failure section will be shown only if the clone's state is failed or canceled.

Here is an example of a failed clone:

ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
ceph fs clone status cephfs clone1
{
    "status": {
        "state": "failed",
        "source": {
            "volume": "cephfs",
            "subvolume": "subvol1",
            "snapshot": "snap1"
            "size": "104857600"
        },
        "failure": {
            "errno": "122",
            "errstr": "Disk quota exceeded"
        }
    }
}

Note

Because subvol1 is in the default group, the source object's clone status does not include the group name.

Note

Cloned subvolumes are accessible only after the clone operation has successfully completed.

After a successful clone operation, clone status will look like the following:

ceph fs clone status cephfs clone1
{
    "status": {
        "state": "complete"
    }
}

If a clone operation is unsuccessful, the state value will be failed.

To retry a failed clone operation, the incomplete clone must be deleted and the clone operation must be issued again.

Run a command of the following form to delete a partial clone:

ceph fs subvolume rm <vol_name> <clone_name> [--group_name <group_name>] --force

Note

Cloning synchronizes only directories, regular files and symbolic links. inode timestamps (access and modification times) are synchronized up to a second’s granularity.

An in-progress or a pending clone operation may be canceled. To cancel a clone operation, use the clone cancel command:

ceph fs clone cancel <vol_name> <clone_name> [--group_name <group_name>]

On successful cancellation, the cloned subvolume is moved to the canceled state:

ceph fs subvolume snapshot clone cephfs subvol1 snap1 clone1
ceph fs clone cancel cephfs clone1
ceph fs clone status cephfs clone1
{
    "status": {
        "state": "canceled",
        "source": {
            "volume": "cephfs",
            "subvolume": "subvol1",
            "snapshot": "snap1"
        }
    }
}

Note

Delete the canceled clone by supplying the --force option to the fs subvolume rm command.

Configurables

Configure the maximum number of concurrent clone operations. The default is 4:

ceph config set mgr mgr/volumes/max_concurrent_clones <value>
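
For example, the following raises the limit to eight concurrent clone operations:

ceph config set mgr mgr/volumes/max_concurrent_clones 8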

Pause the threads that asynchronously purge trashed subvolumes. This option is useful during cluster recovery scenarios:

ceph config set mgr mgr/volumes/pause_purging true

To resume purging threads:

ceph config set mgr mgr/volumes/pause_purging false

Pause the threads that asynchronously clone subvolume snapshots. This option is useful during cluster recovery scenarios:

ceph config set mgr mgr/volumes/pause_cloning true

To resume cloning threads:

ceph config set mgr mgr/volumes/pause_cloning false

Configure the snapshot_clone_no_wait option:

The snapshot_clone_no_wait config option is used to reject clone-creation requests when cloner threads (which can be configured using the above options, for example, max_concurrent_clones) are not available. It is enabled by default. This means that the value is set to True, but it can be configured by using the following command:

ceph config set mgr mgr/volumes/snapshot_clone_no_wait <bool>

The current value of snapshot_clone_no_wait can be fetched by running the following command.

ceph config get mgr mgr/volumes/snapshot_clone_no_wait

Pinning subvolumes and subvolume groups

Subvolumes and subvolume groups may be automatically pinned to ranks according to policies. This can distribute load across MDS ranks in predictable and stable ways. Review Manually pinning directory trees to a particular rank and Setting subtree partitioning policies for details on how pinning works.

Run a command of the following form to configure pinning for subvolume groups:

ceph fs subvolumegroup pin <vol_name> <group_name> <pin_type> <pin_setting>

Run a command of the following form to configure pinning for subvolumes:

ceph fs subvolume pin <vol_name> <group_name> <pin_type> <pin_setting>

Under most circumstances, you will want to set subvolume group pins. The pin_type may be export, distributed, or random. The pin_setting corresponds to the extended attribute "value" as described in the pinning documentation referenced above.

Here is an example of setting a distributed pinning strategy on a subvolume group:

ceph fs subvolumegroup pin cephfilesystem-a csi distributed 1

This enables distributed subtree partitioning policy for the “csi” subvolume group. This will cause every subvolume within the group to be automatically pinned to one of the available ranks on the file system.
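
As another example (a sketch; the volume and group names are hypothetical), the whole subtree of a group can be pinned to a single rank with an export pin:

ceph fs subvolumegroup pin cephfs group1 export 0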

Normalization and case sensitivity

The subvolumegroup and subvolume interfaces have a porcelain-layer API to manipulate the ceph.dir.charmap configurations (see also CephFS Directory Entry Name Normalization and Case Folding).

Configuring the charmap

To configure the charmap, for a subvolumegroup:

ceph fs subvolumegroup charmap set <vol_name> <group_name> <setting> <value>

Or for a subvolume:

ceph fs subvolume charmap set <vol_name> <subvol> <--group_name=name> <setting> <value>

For example:

ceph fs subvolumegroup charmap set vol csi normalization nfd

outputs:

{"casesensitive":true,"normalization":"nfd","encoding":"utf8"}

Reading the charmap

To read the configuration, for a subvolumegroup:

ceph fs subvolumegroup charmap get <vol_name> <group_name> <setting>

Or for a subvolume:

ceph fs subvolume charmap get <vol_name> <subvol> <--group_name=name> <setting>

For example:

ceph fs subvolume charmap get vol subvol --group_name=csi casesensitive
0

To read the full charmap, for a subvolumegroup:

ceph fs subvolumegroup charmap get <vol_name> <group_name>

Or for a subvolume:

ceph fs subvolume charmap get <vol_name> <subvol> <--group_name=name>

For example:

ceph fs subvolumegroup charmap get vol csi

outputs:

{"casesensitive":false,"normalization":"nfd","encoding":"utf8"}

Removing the charmap

To remove the configuration, for a subvolumegroup:

ceph fs subvolumegroup charmap rm <vol_name> <group_name>

Or for a subvolume:

ceph fs subvolume charmap rm <vol_name> <subvol> <--group_name=name>

For example:

ceph fs subvolumegroup charmap rm vol csi

outputs:

{}

Note

A charmap can only be removed when a subvolumegroup or subvolume is empty.

Subvolume quiesce

Note

The information in this section applies only to Squid and later releases of Ceph.

CephFS snapshots do not provide strong-consistency guarantees in cases involving writes performed by multiple clients, which makes consistent backups and disaster recovery a serious challenge for distributed applications. Even in a case where an application uses file system flushes to synchronize checkpoints across its distributed components, there is no guarantee that all acknowledged writes will be part of a given snapshot.

The subvolume quiesce feature has been developed to provide enterprise-level consistency guarantees for multi-client applications that work with one or more subvolumes. The feature makes it possible to pause IO to a set of subvolumes of a given volume (file system). Enforcing such a pause across all clients makes it possible to guarantee that any persistent checkpoints reached by the application before the pause will be recoverable from the snapshots made during the pause.

The volumes plugin provides a CLI to initiate and await the pause for a set of subvolumes. This pause is called a quiesce, which is also used as the command name:

ceph fs quiesce <vol_name> --set-id myset1 <[group_name/]sub_name...> --await
# perform actions while the IO pause is active, like taking snapshots
ceph fs quiesce <vol_name> --set-id myset1 --release --await
# if successful, all members of the set were confirmed as still paused and released

The fs quiesce functionality is based on a lower level quiesce db service provided by the MDS daemons, which operates at a file system path granularity. The volumes plugin merely maps the subvolume names to their corresponding paths on the given file system and then issues the corresponding quiesce db command to the MDS. You can learn more about the low-level service in the developer guides.

Operations

The quiesce can be requested for a set of one or more subvolumes (i.e. paths in a filesystem). This set is referred to as quiesce set. Every quiesce set is identified by a unique set id. A quiesce set can be manipulated in the following ways:

  • include one or more subvolumes - quiesce set members

  • exclude one or more members

  • cancel the set, asynchronously aborting the pause on all its current members

  • release the set, requesting the end of the pause from all members and expecting an ack from all clients

  • query the current state of a set by id or all active sets or all known sets

  • cancel all active sets in case an immediate resume of IO is required.

The operations listed above are non-blocking: they attempt the intended modification and return with an up-to-date version of the target set, whether the operation was successful or not. The set may change states as a result of the modification, and the version that's returned in the response is guaranteed to be in a state consistent with this and potentially other successful operations from the same control loop batch.

Some set states are awaitable. We will discuss those below, but for now it’s important to mention that any of the commands above can be amended with an await modifier, which will cause them to block on the set after applying their intended modification, as long as the resulting set state is awaitable. Such a command will block until the set reaches the awaited state, gets modified by another command, or transitions into another state. The return code will unambiguously identify the exit condition, and the contents of the response will always carry the latest known set state.

[Diagram: quiesce set states (quiesce-set-states.svg)]

Awaitable states on the diagram are marked with (a) or (A). Blocking versions of the operations will pend while the set is in an (a) state and will complete with success if it reaches an (A) state. If the set is already in an (A) state, the operation completes immediately with success.

Most of the operations require a set-id. The exceptions are:

  • creation of a new set without specifying a set id,

  • query of active or all known sets, and

  • the cancel all

Creating a new set is achieved by including member(s) via the include or reset commands. It's possible to specify a set id, and if it's a new id then the set will be created with the specified member(s) in the QUIESCING state. When no set id is specified while including or resetting members, then a new set with a unique set id is created. The set id will be known to the caller by inspecting the output:

ceph fs quiesce fs1 sub1 --set-id=unique-id
{
    "epoch": 3,
    "set_version": 1,
    "sets": {
        "unique-id": {
            "version": 1,
            "age_ref": 0.0,
            "state": {
                "name": "TIMEDOUT",
                "age": 0.0
            },
            "timeout": 0.0,
            "expiration": 0.0,
            "members": {
                "file:/volumes/_nogroup/sub1/b1fcce76-3418-42dd-aa76-f9076d047dd3": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                }
            }
        }
    }
}

The output contains the set we just created successfully, however it’s already TIMEDOUT. This is expected, since we have not specified the timeout for this quiesce, and we can see in the output that it was initialized to 0 by default, along with the expiration.

Timeouts

The two timeout parameters, timeout and expiration, are the main guards against accidentally causing a DOS condition for our application. Any command to an active set may carry the --timeout or --expiration arguments to update these values for the set. If present, the values will be applied before the action that the command requests.

ceph fs quiesce fs1 --set-id=unique-id --timeout=10 > /dev/null
Error EPERM:

It’s too late for our unique-id set, as it’s in a terminal state. No changes are allowed to sets that are in their terminal states, i.e. inactive. Let’s create a new set:

ceph fs quiesce fs1 sub1 --timeout 60
{
    "epoch": 3,
    "set_version": 2,
    "sets": {
        "8988b419": {
            "version": 2,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCING",
                "age": 0.0
            },
            "timeout": 60.0,
            "expiration": 0.0,
            "members": {
                "file:/volumes/_nogroup/sub1/b1fcce76-3418-42dd-aa76-f9076d047dd3": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                }
            }
        }
    }
}

This time, we haven’t specified a set id, so the system created a new one. We see its id in the output, it’s 8988b419. The command was a success and we see that this time the set is QUIESCING. At this point, we can add more members to the set

ceph fs quiesce fs1 --set-id 8988b419 --include sub2 sub3
{
    "epoch": 3,
    "set_version": 3,
    "sets": {
        "8988b419": {
            "version": 3,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCING",
                "age": 30.7
            },
            "timeout": 60.0,
            "expiration": 0.0,
            "members": {
                "file:/volumes/_nogroup/sub1/b1fcce76-3418-42dd-aa76-f9076d047dd3": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 30.7
                    }
                },
                "file:/volumes/_nogroup/sub2/bc8f770e-7a43-48f3-aa26-d6d76ef98d3e": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                },
                "file:/volumes/_nogroup/sub3/24c4b57b-e249-4b89-b4fa-7a810edcd35b": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 0.0
                    }
                }
            }
        }
    }
}

The --include bit is optional: if no operation is given while members are provided, then "include" is assumed.

As we have seen, the timeout argument specifies how much time we are ready to give the system to reach the QUIESCED state on the set. However, since new members can be added to an active set at any time, it wouldn't be fair to measure the timeout from the set creation time. Hence, the timeout is tracked per member: every member has timeout seconds to quiesce, and if any one takes longer than that, the whole set is marked as TIMEDOUT and the pause is released.

Once the set is in the QUIESCED state, it will begin its expiration timer. This timer is tracked for the set as a whole, not per member. Once the expiration seconds elapse, the set will transition into an EXPIRED state, unless it was actively released or canceled by a dedicated operation.

It’s possible to add new members to a 暂停状态,因此释放命令 set. In this case, it will transition back to QUIESCING, and the new member(s) will have their own timeout to quiesce. If they succeed, then the set will again be 暂停状态,因此释放命令 and the expiration timer will restart.

Warning

  • The expiration timer doesn't apply when a set is QUIESCING; it is reset to the value of the expiration property when the set becomes QUIESCED

  • The timeout doesn’t apply to members that are 暂停状态,因此释放命令

Awaiting

Note that the commands above are all non-blocking. If we want to wait for the quiesce set to reach the QUIESCED state, we should await it at some point. --await can be given along with other arguments to let the system know our intention.

There are two types of await: quiesce await and release await. The former is the default, and the latter can only be achieved with --release present in the argument list. To avoid confusion, it is not permitted to issue a quiesce await when the set is not QUIESCING. Trying to --release a set that is not QUIESCED is an EPERM error as well, regardless of whether await is requested alongside. However, it's not an error to release await an already released set, or to quiesce await a QUIESCED one - those are successful no-ops.

Since a set is awaited after the application of the --await-augmented command, the await operation may mask a successful result with its own error. A good example is trying to cancel-await a set:

ceph fs quiesce fs1 --set-id set1 --cancel --await
{
    // ...
    "sets": {
        "set1": {
            // ...
            "state": {
                "name": "CANCELED",
                "age": 0
            },
            // ...
        }
    }
}
Error EPERM:

Although --cancel will succeed synchronously for a set in an active state, awaiting a canceled set is not permitted, hence this call will result in an EPERM. This is deliberately different from returning an EINVAL error, denoting an error on the user's side, to simplify the system's behavior when --await is requested. As a result, it's also a simpler model for the user to work with.

When awaiting, one may specify a maximum duration that they would like this await request to block for, orthogonally to the two intrinsic set timeouts discussed above. If the target awaited state isn't reached within the specified duration, then EINPROGRESS is returned. For that, one should use the argument --await-for=<seconds>. One could think of --await as equivalent to --await-for=Infinity. While it doesn't make sense to specify both arguments, it is not considered an error. If both --await and --await-for are present, then the former is ignored and the time limit from --await-for is honored.

time ceph fs quiesce fs1 sub1 --timeout=10 --await-for=2
{
    "epoch": 6,
    "set_version": 3,
    "sets": {
        "c3c1d8de": {
            "version": 3,
            "age_ref": 0.0,
            "state": {
                "name": "QUIESCING",
                "age": 2.0
            },
            "timeout": 10.0,
            "expiration": 0.0,
            "members": {
                "file:/volumes/_nogroup/sub1/b1fcce76-3418-42dd-aa76-f9076d047dd3": {
                    "excluded": false,
                    "state": {
                        "name": "QUIESCING",
                        "age": 2.0
                    }
                }
            }
        }
    }
}
Error EINPROGRESS:
ceph fs quiesce fs1 sub1 --timeout=10 --await-for=2  0.41s user 0.04s system 17% cpu 2.563 total

(there is a ~0.5 sec overhead that the ceph client adds, at least in a local debug setup)

Quiesce-await and expiration

Quiesce-await has a side effect: it resets the internal expiration timer. This makes it possible to keep an already QUIESCED set active by repeatedly issuing --await. Consider the following example script:

set -e   # (1)
ceph fs quiesce fs1 sub1 sub2 sub3 --timeout=30 --expiration=10 --set-id="snapshots" --await # (2)
ceph fs subvolume snapshot create a sub1 snap1-sub1  # (3)
ceph fs quiesce fs1 --set-id="snapshots" --await  # (4)
ceph fs subvolume snapshot create a sub2 snap1-sub2  # (3)
ceph fs quiesce fs1 --set-id="snapshots" --await  # (4)
ceph fs subvolume snapshot create a sub3 snap1-sub3  # (3)
ceph fs quiesce fs1 --set-id="snapshots" --release --await  # (5)

Warning

This example uses arbitrary timeouts to convey the concept. In real life, the values must be carefully chosen in accordance with the actual system requirements and specifications.

The goal of the script is to take consistent snapshots of 3 subvolumes. We begin by setting the bash -e option (1) to exit the script if any of the commands returns a non-zero status.

We then request an IO pause for the three subvolumes (2). The timeout gives the system up to 30 seconds to reach the quiesced state across all members, and the expiration keeps the pause for up to 10 seconds before it expires and IO is resumed. We also specify --await so that the script proceeds only once the pause has been reached.

We then proceed with pairs of commands that take the next snapshot and call --await to extend the expiration timeout by another 10 seconds (3,4). This approach gives us up to 10 seconds per snapshot, but it also allows us to take as many snapshots as we need without losing the IO pause and, with it, the consistency. If we wanted to, we could update the expiration on every await call.

If any of the snapshots gets stuck and takes longer than 10 seconds to complete, then the next call to --await will return an error because the set will have become EXPIRED.

We could have set the expiration to 30 at step (2), but that would mean that a single stuck snapshot would keep the applications pending for all that time.

If-version

Sometimes, observing a successful quiesce or release is not enough. The reason could be a concurrent change to the set made by another client. Consider this example:

ceph fs quiesce fs1 sub1 sub2 sub3 --timeout=30 --expiration=60 --set-id="snapshots" --await  # (1)
ceph fs subvolume snapshot create a sub1 snap1-sub1  # (2)
ceph fs subvolume snapshot create a sub2 snap1-sub2  # (3)
ceph fs subvolume snapshot create a sub3 snap1-sub3  # (4)
ceph fs quiesce fs1 --set-id="snapshots" --release --await  # (5)

The sequence looks good, and the release (5) completes successfully. However, before the snapshot of sub3 (4) was taken, another session might have excluded sub3 from the set, resuming its IOs:

ceph fs quiesce fs1 --set-id="snapshots" --exclude sub3

Because the set remains in the QUIESCED state, the release command (5) has no reason to fail. It will acknowledge the two members that were not excluded and report success.

To address this and similar problems, the quiesce command supports an optimistic concurrency mode. To activate it, pass an --if-version=<version> argument; it is compared to the set's db version, and the operation proceeds only if the values match. Otherwise, the command is not executed and the return status is ESTALE.

It's easy to know which version to expect of a set, because any command that modifies a set returns that set on stdout, regardless of the exit status. As can be seen in the examples above, every set carries a "version" property that is updated whenever the set is modified, whether explicitly by the user or implicitly during the quiesce process.

At the beginning of this subsection, the initial quiesce command (1) would have returned the newly created set with id "snapshots" and some version, let's say 13. Since we expect no other changes to the set while we are taking snapshots with commands (2,3,4), the release command (5) could have looked like this:

ceph fs quiesce fs1 --set-id="snapshots" --release --await --if-version=13 # (5)

The result of the release command would then have been ESTALE rather than 0, and we would have known that something about the set was not right and that our snapshots might not be consistent.

Tip

When --if-version is given and the command returns ESTALE, the requested operation is not executed.

There is another use of the --if-version argument that could come in handy for automation software. As we have discussed earlier, it is possible to create a new quiesce set with a given set id. Drivers like the CSI for Kubernetes could use their internal request id to eliminate the need to keep an additional mapping to the quiesce set id. However, to guarantee uniqueness, the driver may want to verify that the set is indeed new. For that, if-version=0 can be used: it will create the new set only if no other set with this id is present in the database:

ceph fs quiesce fs1 sub1 sub2 sub3 --set-id="external-id" --if-version=0

Disabling the volumes plugin

By default the volumes plugin is enabled and set to always on. However, in certain cases it might be appropriate to disable it. For example, when CephFS is in a degraded state, the volumes plugin commands may accumulate in the MGR instead of being served. This eventually causes policy throttles to be triggered and the MGR to become unresponsive.

In this event, the volumes plugin can be disabled even though it is an always on module. To do so, run ceph mgr module disable volumes --yes-i-really-mean-it. Note that this command disables all operations and commands of the volumes plugin, because it disables all CephFS services on the Ceph cluster that are accessed through this plugin.

Before taking such a drastic measure, it is a good idea to try less drastic measures and then assess whether the file system experience has improved because of them. One example of a less drastic measure is to disable the asynchronous threads launched by the volumes plugin for cloning and purging trash. See Pausing purge threads and Pausing clone threads for details.
