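Consider a pipeline that looks something like this (a minimal `dvc.yaml` sketch; the stage names come from my project, but the commands and paths are simplified illustrations):

```yaml
stages:
  train_models:
    cmd: python src/models/train.py
    deps:
      - src/models/train.py
      - params.yaml
    outs:
      - models
  predict:
    cmd: python src/models/predict.py
    deps:
      - src/models/predict.py
      - models
      - params.yaml
    outs:
      - predictions
  analyze:
    cmd: python src/models/multires_stats.py
    deps:
      - src/models/multires_stats.py
      - models
      - params.yaml
    outs:
      - analysis
  validate:
    cmd: python src/models/validate.py
    deps:
      - src/models/validate.py
      - models
      - params.yaml
    outs:
      - validation
  publish:
    cmd: python src/models/publish.py
    deps:
      - predictions
      - analysis
      - validation
    outs:
      - published
```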
Each time `dvc status` (or `commit` or `repro`) is run, DVC collects files and computes hashes for every single dependency, independently of other stages that have already been checked, committed, or reproduced. This means that the output of `train_models` is re-hashed/checksummed first for `train_models`, then for `predict`, then for `analyze`, and then for `validate`, even though nothing between any of those stages (after the initial `train_models` stage) could have changed the output of `train_models`.
Additionally, when `predict`, `analyze`, and `validate` are collected/hashed/checksummed during their own execution/status/etc., the hashes are still recomputed for the `publish` stage.
For a project like mine that has rather large files as the outputs of each stage (~66 files totaling ~120GB), this ends up taking at least an hour, if not more. This is especially problematic when I make minor updates to upstream files that don't affect outputs and need to recommit or re-check `dvc status`.
It seems like there should be some persistence between downstream stages that share dependencies to reduce this redundancy.
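For illustration, this is the kind of per-invocation reuse I mean (a hypothetical sketch, not DVC code; `dep_md5` and the cache are made up):

```python
import hashlib

# Hypothetical sketch of the suggestion, not DVC's current behavior:
# memoize each dependency's hash by path for the lifetime of a single
# status/commit/repro invocation, so stages that share a dependency
# (e.g. the outputs of train_models) reuse the result instead of
# being collected and hashed once per stage.
_this_run: dict[str, str] = {}  # path -> md5, lives for one invocation

def dep_md5(path: str) -> str:
    if path not in _this_run:
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large outputs are not loaded whole.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        _this_run[path] = md5.hexdigest()
    return _this_run[path]
```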
Side note, though perhaps indicative of the underlying issue: when committing a stage whose dependencies have changed, DVC builds the dependency tree twice: once to determine that the deps have changed, and once more after answering "yes" to confirm committing the stage.
For example:
```
$ dvc commit analyze -v
2024-10-24 11:36:35,380 DEBUG: v3.56.0 (pip), CPython 3.10.12 on Linux-5.15.0-122-generic-x86_64-with-glibc2.35
2024-10-24 11:36:35,381 DEBUG: command: <HOME_DIR>/.local/bin/dvc commit analyze -v
2024-10-24 11:36:35,576 DEBUG: Checking if stage 'analyze' is in 'dvc.yaml'
2024-10-24 11:36:35,640 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2024-10-24 11:37:13,915 DEBUG: built tree 'object 3772f172da6d2560f8b1922c7300ee46.dir'  <----- Computes once
dependencies ['src/models/multires_stats.py', 'models/Shrub_Tree_Grass/001', 'params.yaml'] of stage: 'analyze' changed. Are you sure you want to commit it? [y/n] y
2024-10-24 11:37:42,012 DEBUG: built tree 'object 3772f172da6d2560f8b1922c7300ee46.dir'  <----- Computes again
2024-10-24 11:37:42,057 DEBUG: Computed stage: 'analyze' md5: '3213ffd0fb4741f26c242b6f78879476'
Updating lock file 'dvc.lock'
2024-10-24 11:37:42,215 DEBUG: Analytics is disabled.
```
Note that the hash was computed for the dependency twice...
DVC caches checksums, so it does not recompute them for each file (even though the message may suggest so; the message refers to the Output, not to individual files). DVC does, however, stat files multiple times, once per stage dependency as you said, which may be expensive depending on the filesystem, hardware, OS, etc.
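To illustrate the distinction, here is a simplified sketch of stat-keyed checksum caching (not DVC's actual implementation; the cache key and `checksum` function are illustrative). The hash is reused when the file is unchanged, but the `os.stat()` call itself still happens once per stage that lists the file as a dependency:

```python
import hashlib
import os

# Simplified sketch: hashes are cached keyed by the file's stat info
# (inode, mtime, size), so an unchanged file is never re-read or
# re-hashed -- but os.stat() still runs every time the file is
# checked, i.e. once per stage that depends on it.
_state: dict[tuple[int, float, int], str] = {}

def checksum(path: str) -> str:
    st = os.stat(path)  # repeated for every stage that depends on `path`
    key = (st.st_ino, st.st_mtime, st.st_size)
    if key in _state:  # cache hit: skip reading/hashing the file
        return _state[key]
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    _state[key] = md5.hexdigest()
    return _state[key]
```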