Running Pipelines

To run a pipeline, you can use dvc repro or dvc exp run. Both will run the pipeline, but dvc exp run will also save the results as an experiment (and supports other experiment-related features, such as modifying parameters from the command line):

$ dvc exp run --set-param featurize.ngrams=3

Reproducing experiment 'funny-dado'
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize':
> python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'

Running stage 'train':
> python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py model.pkl data/features
Updating lock file 'dvc.lock'

Ran experiment(s): funny-dado
Experiment results have been applied to your workspace.
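
The same parameter change can also be made without saving an experiment: edit the value in params.yaml by hand and run dvc repro. A minimal sketch, assuming featurize.ngrams is defined in params.yaml as in the example project:

# params.yaml (hypothetical excerpt)
featurize:
  ngrams: 3

$ dvc repro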

Before executing the command that generates a stage's outputs, those outputs are deleted from the workspace (unless persist: true is set for them in dvc.yaml).
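
Persistence is configured per output in dvc.yaml. A minimal sketch of the syntax (the stage and file names here are hypothetical):

stages:
  aggregate:
    cmd: python src/aggregate.py
    outs:
      - results/history.csv:
          persist: true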

DAG

DVC runs the stages in the order defined by their dependencies and outputs in the DAG. Consider the following example dvc.yaml:

stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features

The prepare stage will always run before the featurize stage, because data/prepared is an output of prepare and a dependency of featurize.
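
Continuing the pattern, a train stage that consumes data/features would always run after featurize. A sketch of how it might be appended to the dvc.yaml above, matching the train command shown in the log earlier (the exact deps and outs are assumptions):

  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - data/features
      - src/train.py
    outs:
      - model.pkl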

Caching Stages

DVC avoids recomputing stages that have already been run. If you run a stage again without changing its command, dependencies, or parameters, DVC will skip that stage:

Stage 'prepare' didn't change, skipping

DVC will also recover the outputs of previous runs from the run cache:

Stage 'prepare' is cached - skipping run, checking out outputs

If you want a stage to run on every pipeline execution, use always_changed in dvc.yaml:

stages:
  pull_latest:
    cmd: python pull_latest.py
    deps:
      - pull_latest.py
    outs:
      - latest_results.csv
    always_changed: true
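
If you only need a one-off rerun rather than a stage that always runs, dvc repro also accepts a --force flag to reproduce a stage even when nothing has changed, for example:

$ dvc repro --force featurize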

Pulling Missing Data

By default, DVC expects that all the data needed to run the pipeline already exists locally. Any missing data is considered deleted and may cause the pipeline to fail. To avoid this, use the following flags:

  • --pull downloads missing data as needed, so you don't have to pull everything in advance.
  • --allow-missing skips stages that have no changes other than missing data, avoiding unnecessary downloads.

You can run the pipeline with both the --pull and --allow-missing flags to pull only the data that is actually needed to run the changed stages.
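
For example (a fuller walk-through of this form follows below):

$ dvc exp run --pull --allow-missing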

In DVC>=3.0, --allow-missing does not skip data saved with DVC<3.0, because the hash type changed in DVC 3.0 and DVC treats this as a data change. To migrate data to the new hash type, run dvc cache migrate --dvc-files. See the documentation on upgrading from DVC 2.x to 3.0 for more information.

Take the pipeline used in example-get-started-experiments:

$ dvc dag
      +--------------------+
      | data/pool_data.dvc |
      +--------------------+
                 *
                 *
                 *
          +------------+
          | data_split |
          +------------+
           **         **
         **             **
        *                 **
  +-------+                 *
  | train |*                *
  +-------+ ****            *
      *         ***         *
      *            ****     *
      *                **   *
+-----------+         +----------+
| sagemaker |         | evaluate |
+-----------+         +----------+

If we are on a machine where all of the data is missing:

$ dvc status
data_split:
        changed deps:
                deleted:            data/pool_data
        changed outs:
                not in cache:       data/test_data
                not in cache:       data/train_data
train:
        changed deps:
                deleted:            data/train_data
        changed outs:
                not in cache:       models/model.pkl
                not in cache:       models/model.pth
                not in cache:       results/train
evaluate:
        changed deps:
                deleted:            data/test_data
                deleted:            models/model.pkl
        changed outs:
                not in cache:       results/evaluate
sagemaker:
        changed deps:
                deleted:            models/model.pth
        changed outs:
                not in cache:       model.tar.gz
data/pool_data.dvc:
        changed outs:
                not in cache:       data/pool_data

We can modify the evaluate stage, and DVC will pull only the data needed to run that stage (models/model.pkl and data/test_data/), skipping the rest of the stages:

$ dvc exp run --pull --allow-missing --set-param evaluate.n_samples_to_save=20
Reproducing experiment 'hefty-tils'
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Running stage 'evaluate':
...

Once the pipeline finishes, the evaluate stage has been updated, but all the other stages are still missing data:

$ dvc status
data_split:
        changed deps:
                deleted:            data/pool_data
        changed outs:
                not in cache:       data/train_data
train:
        changed deps:
                deleted:            data/train_data
        changed outs:
                not in cache:       models/model.pth
                not in cache:       results/train
sagemaker:
        changed deps:
                deleted:            models/model.pth
        changed outs:
                not in cache:       model.tar.gz
data/pool_data.dvc:
        changed outs:
                not in cache:       data/pool_data

We can run again with --pull (but without --allow-missing) to download the data needed for the unchanged stages in the pipeline:

$ dvc exp run --pull

Once the pipeline finishes, all stages are up to date:

$ dvc status
Data and pipelines are up to date.

Verifying Pipeline Status

In scenarios like CI jobs, you may want to check that the pipeline is up to date without pulling or running anything. dvc repro --dry will check which pipeline stages would run, without actually running them. However, --dry will fail if data is missing, because DVC cannot tell whether that data needs to be pulled or is missing for some other reason. To check which stages would run while ignoring any missing data, use dvc repro --dry --allow-missing.

If there are no changes, the command will succeed:

In the example below, data is missing because nothing has been pulled, but otherwise the pipeline is up to date.

$ dvc status
data_split:
        changed deps:
                deleted:            data/pool_data
        changed outs:
                not in cache:       data/test_data
                not in cache:       data/train_data
train:
        changed deps:
                deleted:            data/train_data
        changed outs:
                not in cache:       models/model.pkl
evaluate:
        changed deps:
                deleted:            data/test_data
                deleted:            models/model.pkl
data/pool_data.dvc:
        changed outs:
                not in cache:       data/pool_data
$ dvc repro --allow-missing --dry
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping

If anything is not up to date, the command will fail:

In the example below, the data_split parameters in params.yaml have been modified, so the pipeline is not up to date.

$ dvc status
data_split:
        changed deps:
                deleted:            data/pool_data
                params.yaml:
                        modified:           data_split
        changed outs:
                not in cache:       data/test_data
                not in cache:       data/train_data
train:
        changed deps:
                deleted:            data/train_data
        changed outs:
                not in cache:       models/model.pkl
evaluate:
        changed deps:
                deleted:            data/test_data
                deleted:            models/model.pkl
data/pool_data.dvc:
        changed outs:
                not in cache:       data/pool_data
$ dvc repro --allow-missing --dry
'data/pool_data.dvc' didn't change, skipping
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '.../example-get-started-experiments/data/pool_data'

To be sure the missing data actually exists somewhere, you can also check that all of the data can be found in remote storage. The following command will succeed (exit code 0) if all data is found in remote storage, and fail (exit code 1) otherwise.

$ dvc data status --not-in-remote --json | grep -v not_in_remote
true
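
Putting the two checks together, a minimal CI sketch (the shell wrapper and error messages are illustrative; both commands behave as described above):

$ dvc repro --dry --allow-missing || { echo "pipeline is out of date" >&2; exit 1; }
$ dvc data status --not-in-remote --json | grep -v not_in_remote || { echo "data is missing from remote storage" >&2; exit 1; }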

Debugging Stages

If you are using advanced features to interpolate values in your pipeline, such as templating or Hydra composition, you can get the interpolated values by running dvc repro -vv or dvc exp run -vv, which will include output like:

2023-05-18 07:38:43,955 TRACE: Hydra composition enabled.
Contents dumped to params.yaml: {'model': {'batch_size':
512, 'latent_dim': 8, 'lr': 0.01, 'duration': '00:00:30:00',
'max_epochs': 2}, 'data_path': 'fra.txt', 'num_samples':
100000, 'seed': 423}
2023-05-18 07:38:44,027 TRACE: Context during resolution of
stage download: {'model': {'batch_size': 512, 'latent_dim':
8, 'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
2023-05-18 07:38:44,073 TRACE: Context during resolution of
stage train: {'model': {'batch_size': 512, 'latent_dim': 8,
'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
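
For reference, the values in the trace above come from ${} interpolation in dvc.yaml. A minimal sketch of a templated stage, assuming the keys shown above (model.batch_size, data_path) are defined in params.yaml or composed by Hydra; the script and its flags are hypothetical:

stages:
  train:
    cmd: python train.py --batch-size ${model.batch_size} --data ${data_path}
    deps:
      - ${data_path}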