status: don't recompute checksums and hashes for each stage with a shared dependency #10604

Labels
triage Needs to be triaged

Comments


Oct 24, 2024

Consider a pipeline that looks something like this:

"prep_data" -> "train_models";
"train_models" -> "predict";
"train_models" -> "analyze";
"train_models" -> "validate";
"predict" -> "publish";
"analyze" -> "publish";
"validate" -> "publish";

Each time dvc status (or commit or repro) is run, DVC collects files and computes hashes for every single dependency independently of the other stages that have already been checked, committed, or reproduced. This means the output of train_models is re-hashed/checksummed first for train_models, then for predict, then for analyze, and then for validate, even though nothing between those stages (after the initial train_models stage) could have changed the output of train_models.

Additionally, when predict, analyze, and validate are collected/hashed/checksummed during their own execution/status/etc., the hashes are still recomputed for the publish stage.

For a project like mine that has rather large files as the outputs of each stage (~66 files totaling ~120GB), this ends up taking at least an hour if not more. This is especially problematic when I make minor updates to upstream files that don't affect outputs and need to recommit or re-check dvc status.

It seems like there should be some persistence between downstream stages that share dependencies to reduce this redundancy.
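
A minimal sketch of the kind of run-scoped memoization I have in mind (illustrative only, not DVC's internal API), keyed on stat info so a shared dependency is hashed at most once per invocation:

import hashlib
import os

# Cache of already-computed hashes for this run, keyed by (path, mtime, size).
_hash_cache: dict[tuple[str, float, int], str] = {}

def file_md5(path: str) -> str:
    st = os.stat(path)
    key = (path, st.st_mtime, st.st_size)
    if key not in _hash_cache:  # hash the bytes only on first sight
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        _hash_cache[key] = md5.hexdigest()
    return _hash_cache[key]

With something like this, the output of train_models would be hashed once and then reused when predict, analyze, and validate each check it as a dependency within the same run.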


Side note, though perhaps indicative of the underlying issue: when committing a stage whose dependencies have changed, DVC builds the dependency tree twice: once to determine that the deps have changed, and once more after the user confirms with "yes" to commit the stage.

For example:

$ dvc commit analyze -v
2024-10-24 11:36:35,380 DEBUG: v3.56.0 (pip), CPython 3.10.12 on Linux-5.15.0-122-generic-x86_64-with-glibc2.35
2024-10-24 11:36:35,381 DEBUG: command: <HOME_DIR>/.local/bin/dvc commit analyze -v
2024-10-24 11:36:35,576 DEBUG: Checking if stage 'analyze' is in 'dvc.yaml'
2024-10-24 11:36:35,640 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2024-10-24 11:37:13,915 DEBUG: built tree 'object 3772f172da6d2560f8b1922c7300ee46.dir'    <----- Computes once
dependencies ['src/models/multires_stats.py', 'models/Shrub_Tree_Grass/001', 'params.yaml'] of stage: 'analyze' changed. Are you sure you want to commit it? [y/n] y
2024-10-24 11:37:42,012 DEBUG: built tree 'object 3772f172da6d2560f8b1922c7300ee46.dir'    <----- Computes again
2024-10-24 11:37:42,057 DEBUG: Computed stage: 'analyze' md5: '3213ffd0fb4741f26c242b6f78879476'
Updating lock file 'dvc.lock'
2024-10-24 11:37:42,215 DEBUG: Analytics is disabled.

Note that the hash was computed for the dependency twice...

Author

For some additional context, in my case the DVC cache is located on a NAS drive (CIFS, unfortunately) and the cache type is "symlink".
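
That setup corresponds to roughly the following configuration (the mount path here is a placeholder):

$ dvc cache dir /mnt/<NAS DRIVE>/dvc-cache
$ dvc config cache.type symlink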

$ dvc doctor
DVC version: 3.56.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.6
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.3.0
        scmrepo = 3.3.8
Supports:
        gs (gcsfs = 2024.3.1),
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3)
Config:
        Global: <USER DIR>/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: cifs on //<NAS DRIVE>
Caches: local
Remotes: local
Workspace directory: ext4 on <LOCAL SYSTEM>
Repo: dvc, git
Repo.site_cache_dir: <LOCAL SYSTEM>

@skshetry (Member)

Can you please share profiling data? See https://github.com/iterative/dvc/wiki/Debugging,-Profiling-and-Benchmarking-DVC#generating-cprofile-data.
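
For example, with the standard-library profiler (the wiki has the exact, current invocation; the output filename here is arbitrary):

$ python -m cProfile -o dvc-status.prof "$(command -v dvc)" status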

DVC caches checksums, so it does not recompute checksums for each file (even though the message may suggest so; the message refers to the Output, not to individual files). DVC does, as you said, stat files multiple times, once per stage dependency, which may be expensive depending on the filesystem, hardware, OS, etc.
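
Roughly what validating a cached checksum implies per file, as a sketch (hypothetical shapes, not DVC's actual state database):

import os

def checksum_is_current(path: str, cached: tuple[float, int, str] | None) -> bool:
    st = os.stat(path)  # one round-trip to the filesystem; slow on CIFS/NFS mounts
    if cached is None:
        return False
    mtime, size, _md5 = cached
    return (st.st_mtime, st.st_size) == (mtime, size)

With N files referenced by S dependent stages, that is on the order of N * S stat() calls per command, even when no bytes are re-hashed.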

@skshetry skshetry added the awaiting response we are waiting for your reply, please respond! :) label Oct 24, 2024
Author

Hopefully I did this right. Let me know if I should change something about my profiling setup.

cprofile dump: https://file.io/c2ChbfBj8Z1H

@shcheklein shcheklein added triage Needs to be triaged and removed awaiting response we are waiting for your reply, please respond! :) labels Nov 10, 2024