Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for annex add --batch and use it if more than a single file (and not directories) is provided #6977

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Aug 24, 2022

Resurrects #5722 which could not be reopened since I force-pushed

If ever finished, Closes #5721, #6977 (hopefully)

If to proceed

  • detected/reported an inconsistency against git-annex: https://git-annex.branchable.com/todo/make_add_--batch_add_directories_content/ . Worked around in the code by still using non-batched version if any path is a directory. Adds "checking overhead" but I do not see any other way ATM. Most likely would not be relevant much since higher level interfaces likely "analyze" folders and point to specific files
  • remove TEMP if True to always try batch mode
  • seek @christian-monch wisdom on properly using BatchedCommand as a generator
  • see if any positive effect on OSX with bleeding edge annex in relation to recent regression on our tests on OSX
    • yeap, seems to mitigate it well, see the git-annex issue above for fresh comments on that.
  • no effect on benchmarks... if anything only 0.92 plugins.addurls.addurls1.time_addurls('*') is happier
  • see what fails and either it could be addressed here or there is some roadblock
  • I think added batch=None logic handling is sound (taken from get_file_key), so I would just keep it but could be improved to just reuse a batched add if already one running.
  • interface that total_nbytes (I did not even look into what it is about)

@yarikoptic yarikoptic added cmd-save semver-performance Changes only improve performance, no impact on version labels Aug 24, 2022
@yarikoptic yarikoptic changed the title WiP Q&D AnnexRepo.add_ batched (by default) if more than a single file passed Add support for annex add --batch and use it if more than a single file (and not directories) is provided Aug 24, 2022
@yarikoptic
Copy link
Member Author

Note: benchmarks are ok to fail ATM since we are adding all the --batch overhead for **every** `add` ATM due to TEMP `if True`. But they fail not because of slow down but because some calls/ fail (TODO)
before           after         ratio
     [dc77a088]       [428f3ccc]
     <bm/merge-target>       <bm/pr>   
!      1.87±0.06s           failed      n/a  api.supers.time_createadd
!      1.97±0.09s           failed      n/a  api.supers.time_createadd_to_dataset
           failed           failed      n/a  api.supers.time_diff
           failed           failed      n/a  api.supers.time_diff_recursive
!      6.98±0.01s           failed      n/a  api.supers.time_installr
!        487±10ms           failed      n/a  api.supers.time_ls
!      2.65±0.04s           failed      n/a  api.supers.time_ls_recursive
!         2.85±0s           failed      n/a  api.supers.time_ls_recursive_long_all
           failed           failed      n/a  api.supers.time_remove
!      1.49±0.01s           failed      n/a  api.supers.time_status
!      2.76±0.06s           failed      n/a  api.supers.time_status_recursive
!         120±5ms           failed      n/a  api.supers.time_subdatasets
!         4[11](https://github.com/datalad/datalad/runs/7998607905?check_suite_focus=true#step:8:12)±4ms           failed      n/a  api.supers.time_subdatasets_recursive
!         [12](https://github.com/datalad/datalad/runs/7998607905?check_suite_focus=true#step:8:13)3±3ms           failed      n/a  api.supers.time_subdatasets_recursive_first
!        421±10ms           failed      n/a  api.supers.time_uninstall
         806±10ms         817±10ms     1.01  api.testds.time_create_test_dataset1
!      5.44±0.01s           failed      n/a  api.testds.time_create_test_dataset2x2
          290±3ms          300±8ms     1.04  cli.startup.time_command_execution
          286±9ms          295±9ms     1.03  cli.startup.time_command_help_np
         292±10ms          296±4ms     1.01  cli.startup.time_command_short_help
          625±6ms         640±10ms     1.02  cli.startup.time_help_np
          293±7ms          289±2ms     0.99  cli.startup.time_short_help
          274±3ms          272±5ms     1.00  cli.startup.time_usage_advice
          184±8ms          188±6ms     1.02  core.startup.time_import
          6[13](https://github.com/datalad/datalad/runs/7998607905?check_suite_focus=true#step:8:14)±5ms          619±3ms     1.01  core.startup.time_import_api
      3.19±0.03ms      3.20±0.08ms     1.01  core.witlessrunner.time_echo
      3.76±0.07ms      3.64±0.04ms     0.97  core.witlessrunner.time_echo_gitrunner
      3.98±0.04ms       4.40±0.3ms    ~1.10  core.witlessrunner.time_echo_gitrunner_fullcapture
         738±10ms         757±20ms     1.03  plugins.addurls.addurls1.time_addurls('*')
       2.38±0.05s       2.46±0.05s     1.03  plugins.addurls.addurls1.time_addurls(None)
!        20.4±2ms           failed      n/a  repo.gitrepo.time_get_content_info
       8.89±0.4ms      8.68±0.02ms     0.98  support.path.get_parent_paths.time_allsubmods_toplevel
       8.11±0.2ms         8.01±0ms     0.99  support.path.get_parent_paths.time_allsubmods_toplevel_only
        392±0.3ns          4[15](https://github.com/datalad/datalad/runs/7998607905?check_suite_focus=true#step:8:16)±5ns     1.06  support.path.get_parent_paths.time_no_submods
      6.96±0.03ms      6.97±0.03ms     1.00  support.path.get_parent_paths.time_one_submod_subdir
      7.06±0.09ms       7.[21](https://github.com/datalad/datalad/runs/7998607905?check_suite_focus=true#step:8:22)±0.6ms     1.02  support.path.get_parent_paths.time_one_submod_toplevel
        57.4±0.2s        57.0±0.3s     0.99  usecases.study_forrest.time_make_studyforrest_mockup

# python -m nose -s -v datalad/support/tests/test_annexrepo.py:test_files_split
#
# It seems that BatchedCommand lost `yield_` in
# 9f066efa6a0b316a133b6d6d0d4e4c2e1b856ce0 and ability to be used as a
Copy link
Contributor

@christian-monch christian-monch Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yield_ was removed because it was not used anymore. It is no problem to bring it back, the proc1-method still exists and exhibits the original semantics.

If you have specific requirements that go beyond the original yield_, I am happy to incorporate them :-) (the current runner-based BatchedCommand-implementation does faithfully recreate the behavior of the previous implementation with no further extensions).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no immediate other requirements -- just want to have a generator to avoid the overhead of pre/post logic which we otherwise invoke per each file. Do you think you could provide needed changes within this PR @christian-monch ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @christian-monch on this elderly request -- I am back to this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking into the required changes

Comment on lines +1765 to +1834
for f in files:
yield batched_get.proc1(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is the replacement for yield_, if files is a list of strings.

@codecov
Copy link

codecov bot commented Oct 15, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (1d4ff45) 91.59% compared to head (59f2a5a) 73.00%.

❗ Current head 59f2a5a differs from pull request most recent head 05f9560. Consider uploading reports for the commit 05f9560 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6977       +/-   ##
===========================================
- Coverage   91.59%   73.00%   -18.59%     
===========================================
  Files         325      354       +29     
  Lines       43427    46347     +2920     
  Branches     5821     6613      +792     
===========================================
- Hits        39777    33837     -5940     
- Misses       3635    12486     +8851     
- Partials       15       24        +9     
Files Coverage Δ
datalad/support/annexrepo.py 79.46% <88.00%> (-11.19%) ⬇️

... and 245 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@yarikoptic
Copy link
Member Author

FTR: rebased, pushed.

@adswa
Copy link
Member

adswa commented Oct 24, 2023

Yarik said: If he doesn't get to it in the next week, feel free to close

@yarikoptic
Copy link
Member Author

I pushed into master a merge of maint, then rebased this PR, let's see what fails and address it with hope to provide a fix for

@yarikoptic yarikoptic force-pushed the enh-annex-add-batch branch from fdd1df9 to 02ba927 Compare June 16, 2024 21:52
Copy link

codeclimate bot commented Jun 16, 2024

Code Climate has analyzed commit 02ba927 and detected 3 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 3

View more on Code Climate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmd-save semver-performance Changes only improve performance, no impact on version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use annex add --batch instead of a sequence of annex add ...chunk-of-files
3 participants