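Running pipelines
To run a pipeline, use either dvc repro or dvc exp run. Both run the pipeline, but dvc exp run also saves the results as an experiment and adds other experiment-related features, such as modifying parameters from the command line: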
$ dvc exp run --set-param featurize.ngrams=3
Reproducing experiment 'funny-dado'
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize':
> python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'
Running stage 'train':
> python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'
Running stage 'evaluate':
> python src/evaluate.py model.pkl data/features
Updating lock file 'dvc.lock'
Ran experiment(s): funny-dado
Experiment results have been applied to your workspace.
Before running the command that produces a stage's outputs, DVC deletes those outputs from the workspace, unless persist: true is set for them in dvc.yaml.
DAG
DVC runs stages in the order defined by the DAG, which is built from the dependencies and outputs of each stage. Consider this example dvc.yaml:
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
The prepare stage always runs before the featurize stage, because data/prepared is an output of prepare and a dependency of featurize.
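The ordering rule above can be illustrated with a short Python sketch. It is not DVC's implementation: it only shows how an execution order could be derived by linking each dependency to the stage that produces it. The stages dictionary mirrors the example dvc.yaml, the helper name execution_order is hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Dependencies and outputs copied from the example dvc.yaml.
stages = {
    "prepare": {
        "deps": ["data/data.xml", "src/prepare.py"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["data/prepared", "src/featurization.py"],
        "outs": ["data/features"],
    },
}


def execution_order(stages):
    """Return stage names so that every producer precedes its consumers."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on whichever stages produce its dependencies.
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))  # ['prepare', 'featurize']
```

Files such as data/data.xml that no stage produces add no edges, so they do not affect the order.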
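Caching stages
DVC tries to avoid recomputing stages that have already been run. If you run a stage again without changing its command, dependencies, or parameters, DVC skips it: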
Stage 'prepare' didn't change, skipping
DVC can also restore the outputs of earlier runs from the run cache:
Stage 'prepare' is cached - skipping run, checking out outputs
If you want a stage to run every time, set always_changed in dvc.yaml:
stages:
  pull_latest:
    cmd: python pull_latest.py
    deps:
      - pull_latest.py
    outs:
      - latest_results.csv
    always_changed: true
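As a rough mental model of the skip decision described in this section, the sketch below fingerprints the three inputs the text names (command, dependencies, parameters) and compares the result with a previously recorded value. This is an illustration only, not how DVC stores or compares state: the function names, the dependency hash values, and the use of SHA-256 are all assumptions.

```python
import hashlib
import json


def stage_signature(cmd, dep_hashes, params):
    """Fingerprint the inputs that determine whether a stage changed."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_run(current_signature, recorded_signature, always_changed=False):
    """Decide whether a stage has to be executed again."""
    if always_changed:
        return True  # mirrors always_changed: true in dvc.yaml
    return current_signature != recorded_signature


# Hypothetical values: the dependency hash "abc123" is made up.
cmd = "python src/prepare.py data/data.xml"
recorded = stage_signature(cmd, {"data/data.xml": "abc123"}, {"prepare.seed": 1})
unchanged = stage_signature(cmd, {"data/data.xml": "abc123"}, {"prepare.seed": 1})
new_seed = stage_signature(cmd, {"data/data.xml": "abc123"}, {"prepare.seed": 2})

print(needs_run(unchanged, recorded))  # False: the stage would be skipped
print(needs_run(new_seed, recorded))   # True: a parameter changed
print(needs_run(unchanged, recorded, always_changed=True))  # True
```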
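Pulling missing data
By default, DVC expects all data needed to run the pipeline to be available locally. Any missing data is treated as deleted and may cause the pipeline to fail. To avoid this, use the following flags:
--pull downloads missing data as needed, so you do not have to pull all data in advance.
--allow-missing skips stages whose only change is missing data, which avoids downloading data unnecessarily.
Use --pull and --allow-missing together to run a pipeline while pulling only the data that the changed stages actually need.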
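In DVC>=3.0, --allow-missing does not skip data saved with DVC<3.0, because the hash type changed in DVC 3.0 and DVC treats that as a change to the data. To migrate data to the new hash type, run dvc cache migrate --dvc-files. See the documentation on upgrading from DVC 2.x to 3.0 for more information.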
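The interaction between the two flags can be reasoned about with a deliberately simplified model. The sketch below is not DVC's algorithm: plan_run and the stage dictionaries are hypothetical and encode only the behavior described in this section. The stage names and paths are taken from the example-get-started-experiments pipeline used in the walkthrough on this page, with evaluate marked as the only changed stage.

```python
def plan_run(stages, pull=False, allow_missing=False):
    """Return (stages_to_run, data_to_pull) for a simplified pipeline model.

    Each stage is a dict with:
      changed: True if its command, dependencies, or parameters changed
      deps:    paths the stage reads
      missing: dependency and output paths that are absent locally
    """
    to_run, to_pull = [], set()
    for name, stage in stages.items():
        missing = set(stage["missing"])
        if stage["changed"]:
            to_run.append(name)
            if pull:
                # Only inputs are fetched; outputs are regenerated by the run.
                to_pull |= missing & set(stage["deps"])
        elif missing:
            if allow_missing:
                continue  # the only difference is missing data: skip
            if pull:
                to_pull |= missing  # restore the data instead of rerunning
            else:
                to_run.append(name)  # missing data is treated as deleted
    return to_run, to_pull


stages = {
    "data_split": {
        "changed": False,
        "deps": ["data/pool_data"],
        "missing": ["data/pool_data", "data/test_data", "data/train_data"],
    },
    "train": {
        "changed": False,
        "deps": ["data/train_data"],
        "missing": ["data/train_data", "models/model.pkl",
                    "models/model.pth", "results/train"],
    },
    "evaluate": {
        "changed": True,
        "deps": ["data/test_data", "models/model.pkl"],
        "missing": ["data/test_data", "models/model.pkl", "results/evaluate"],
    },
    "sagemaker": {
        "changed": False,
        "deps": ["models/model.pth"],
        "missing": ["models/model.pth", "model.tar.gz"],
    },
}

run, fetched = plan_run(stages, pull=True, allow_missing=True)
print(run)              # ['evaluate']
print(sorted(fetched))  # ['data/test_data', 'models/model.pkl']
```

With neither flag set, the same model schedules all four stages, because every stage then looks as if its data had been deleted.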
Consider the pipeline used in example-get-started-experiments:
$ dvc dag
         +--------------------+
         | data/pool_data.dvc |
         +--------------------+
                    *
                    *
                    *
             +------------+
             | data_split |
             +------------+
              **        **
            **            **
           *                **
     +-------+                *
     | train |*               *
     +-------+ ****           *
          *        ***        *
          *           ****    *
          *               **  *
    +-----------+         +----------+
    | sagemaker |         | evaluate |
    +-----------+         +----------+
On a machine where all data is missing:
$ dvc status
data_split:
    changed deps:
        deleted:            data/pool_data
    changed outs:
        not in cache:       data/test_data
        not in cache:       data/train_data
train:
    changed deps:
        deleted:            data/train_data
    changed outs:
        not in cache:       models/model.pkl
        not in cache:       models/model.pth
        not in cache:       results/train
evaluate:
    changed deps:
        deleted:            data/test_data
        deleted:            models/model.pkl
    changed outs:
        not in cache:       results/evaluate
sagemaker:
    changed deps:
        deleted:            models/model.pth
    changed outs:
        not in cache:       model.tar.gz
data/pool_data.dvc:
    changed outs:
        not in cache:       data/pool_data
We can modify the evaluate stage, and DVC will pull only the data needed to run that stage (models/model.pkl and data/test_data/) and skip the other stages:
$ dvc exp run --pull --allow-missing --set-param evaluate.n_samples_to_save=20
Reproducing experiment 'hefty-tils'
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Running stage 'evaluate':
...
After the pipeline finishes, the evaluate stage is up to date, but all other stages are still missing data:
$ dvc status
data_split:
    changed deps:
        deleted:            data/pool_data
    changed outs:
        not in cache:       data/train_data
train:
    changed deps:
        deleted:            data/train_data
    changed outs:
        not in cache:       models/model.pth
        not in cache:       results/train
sagemaker:
    changed deps:
        deleted:            models/model.pth
    changed outs:
        not in cache:       model.tar.gz
data/pool_data.dvc:
    changed outs:
        not in cache:       data/pool_data
We can run again with --pull (but without --allow-missing) to download the data for the unchanged stages in the pipeline:
$ dvc exp run --pull
After the pipeline finishes, all stages are up to date:
$ dvc status
Data and pipelines are up to date.
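Verifying pipeline status
In scenarios such as CI jobs, you may want to check that the pipeline is up to date without pulling or running anything. dvc repro --dry checks which pipeline stages need to run without actually executing them. However, if data is missing, --dry fails, because DVC cannot tell whether that data simply needs to be pulled or is missing for some other reason. To check which stages need to run while ignoring any missing data, use dvc repro --dry --allow-missing.
If nothing has changed, the command succeeds:
In the example below, the data is missing because nothing has been pulled, but the pipeline is otherwise up to date.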
$ dvc status
data_split:
    changed deps:
        deleted:            data/pool_data
    changed outs:
        not in cache:       data/test_data
        not in cache:       data/train_data
train:
    changed deps:
        deleted:            data/train_data
    changed outs:
        not in cache:       models/model.pkl
evaluate:
    changed deps:
        deleted:            data/test_data
        deleted:            models/model.pkl
data/pool_data.dvc:
    changed outs:
        not in cache:       data/pool_data
$ dvc repro --allow-missing --dry
'data/pool_data.dvc' didn't change, skipping
Stage 'data_split' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
If anything is not up to date, the command fails:
In the example below, the data_split parameter in params.yaml has been modified, so the pipeline is not up to date.
$ dvc status
data_split:
    changed deps:
        deleted:            data/pool_data
        params.yaml:
            modified:           data_split
    changed outs:
        not in cache:       data/test_data
        not in cache:       data/train_data
train:
    changed deps:
        deleted:            data/train_data
    changed outs:
        not in cache:       models/model.pkl
evaluate:
    changed deps:
        deleted:            data/test_data
        deleted:            models/model.pkl
data/pool_data.dvc:
    changed outs:
        not in cache:       data/pool_data
$ dvc repro --allow-missing --dry
'data/pool_data.dvc' didn't change, skipping
ERROR: failed to reproduce 'data_split': [Errno 2] No such file or directory: '.../example-get-started-experiments/data/pool_data'
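In a CI job, the success or failure of this command can be turned into a simple check. The following is a minimal sketch that assumes the dvc executable is on PATH and that the job runs inside the project directory. The function name pipeline_is_up_to_date and its run parameter are hypothetical; run exists only so that the command runner can be replaced in tests.

```python
import subprocess


def pipeline_is_up_to_date(run=subprocess.run):
    """Return True when `dvc repro --dry --allow-missing` exits with code 0."""
    # The command succeeds if no stage needs to run and fails otherwise,
    # so the exit code alone carries the answer.
    result = run(["dvc", "repro", "--dry", "--allow-missing"])
    return result.returncode == 0
```

A CI script could call pipeline_is_up_to_date() and exit with a non-zero status when it returns False, which fails the job whenever dvc.lock is out of date.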
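To make sure that all missing data still exists, you can also check that all of it is present in remote storage. The command below succeeds (exit code 0) if all data is found in the remote and fails (exit code 1) otherwise.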
$ dvc data status --not-in-remote --json | grep -v not_in_remote
true
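The same check can be done by parsing the JSON instead of using grep. This sketch assumes that the JSON printed by dvc data status --not-in-remote --json has a top-level not_in_remote key that lists the paths absent from the remote, and that the key is empty or absent when nothing is missing. That layout is inferred from the grep pattern above and should be verified against your DVC version. The function name all_data_in_remote is hypothetical.

```python
import json


def all_data_in_remote(status_json):
    """Return True if the status JSON reports nothing under not_in_remote."""
    status = json.loads(status_json)
    # Assumed layout: {"not_in_remote": ["path", ...], ...}
    return not status.get("not_in_remote")


# Made-up sample payloads that follow the assumed layout.
print(all_data_in_remote("{}"))                                    # True
print(all_data_in_remote('{"not_in_remote": ["data/pool_data"]}'))  # False
```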
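Debugging stages
If you use advanced features that interpolate values into your pipeline, such as templating or Hydra composition, you can see the interpolated values by running dvc repro -vv or dvc exp run -vv. The output will include entries like these: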
2023-05-18 07:38:43,955 TRACE: Hydra composition enabled.
Contents dumped to params.yaml: {'model': {'batch_size':
512, 'latent_dim': 8, 'lr': 0.01, 'duration': '00:00:30:00',
'max_epochs': 2}, 'data_path': 'fra.txt', 'num_samples':
100000, 'seed': 423}
2023-05-18 07:38:44,027 TRACE: Context during resolution of
stage download: {'model': {'batch_size': 512, 'latent_dim':
8, 'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
2023-05-18 07:38:44,073 TRACE: Context during resolution of
stage train: {'model': {'batch_size': 512, 'latent_dim': 8,
'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}