Baxpr queue handling #365

Merged
main from
May 18, 2022


baxpr
Member

@bud42 can you look at this?

Instead of limiting launches based on the total number of ACCRE jobs, this limits them based on the number of pending ACCRE jobs. This might help avoid hitting XNAT with too many ACCRE job starts at once.

It would also remove the limit on total running jobs, unless we add that in additionally. Not sure what effect that would have, but I think ACCRE's own scheduling will keep that under control just by keeping the pending list full.

If we install this, we would also need to drop the queue_limit setting in the instance dashboard to more like 20-50 instead of 200-500.

Member Author

btw I have not tested

Member

Do jobs always enter the pending phase long enough for this to work?
And if so, are we limiting the number of new jobs that launch each time dax runs? Is that what we want?

Member Author

Ah, I don't know. Probably not.

The goal would be to limit the number of pending jobs to ~50, or whatever avoids too large a hit when ACCRE launches them all at once.

I'll add back a limit for total jobs as well, I think

Member Author

... in which case, what do you think about a 2 sec sleep between individual job launches as well, in case ACCRE is launching them immediately?

Member Author

So overall, dax will launch as many jobs as possible each hour subject to

  • queue_limit total jobs on accre (500 right now)
  • number of pending jobs is not too large (say 50)
  • slight delay before each launch

All three of these settings can be exposed in the instance redcap.
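
Roughly, the launch loop would look something like the sketch below. This is only an illustration of the three limits, not the actual dax/launcher.py code: `launch_jobs`, `count_jobs`, `submit`, and `launch_delay` are made-up names, and the defaults are just the numbers mentioned above.

```python
import time


def launch_jobs(tasks, count_jobs, submit, queue_limit=500,
                pending_limit=50, launch_delay=2.0, sleep=time.sleep):
    """Launch tasks until a limit is reached; return the number launched.

    count_jobs() is a hypothetical callable returning (total, pending)
    job counts on the cluster. submit(task) is a hypothetical callable
    that submits one job. Neither is part of the real dax API.
    """
    launched = 0
    for task in tasks:
        # Re-check the queue before every launch so the limits apply
        # to the current cluster state, not a stale snapshot.
        total, pending = count_jobs()
        if total >= queue_limit or pending >= pending_limit:
            break
        # Slight delay before each launch to spread out job starts.
        sleep(launch_delay)
        submit(task)
        launched += 1
    return launched
```

Passing `sleep` in as a parameter is only there so the delay can be skipped when testing.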

Member Author

Could throttle by pending uploads as well, for that matter - no new launches until e.g. pending uploads are <2000

Member Author

Testing full dax manager run on ROGERSTEST (rogersbp@hickory)

Member Author

Requires new fields main_queuelimit_pending, main_limit_pendinguploads in the instance redcap. We need to document the instances panel and provide an initial data dictionary for it, similar to the project settings info in docs/dax_manager.rst
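
As a starting point for that documentation, here is a sketch of how the two new fields might be read from an instance record. The field names are the ones above; `read_limits`, the fallback values (50 and 2000, taken from the numbers floated earlier in this thread), and the assumption that blank fields arrive as empty strings are all hypothetical.

```python
# Fallback values are assumptions taken from this discussion, not dax defaults.
DEFAULTS = {
    "main_queuelimit_pending": 50,
    "main_limit_pendinguploads": 2000,
}


def read_limits(record):
    """Return integer limits from a dict-like instance record.

    Assumes values arrive as strings and that a blank or missing
    field should fall back to the default.
    """
    limits = {}
    for field, default in DEFAULTS.items():
        raw = record.get(field)
        if raw is None or str(raw).strip() == "":
            limits[field] = default
        else:
            limits[field] = int(str(raw).strip())
    return limits
```

For example, `read_limits({"main_queuelimit_pending": "20"})` would give 20 for the pending limit and fall back to 2000 for pending uploads.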

Member Author

... have not implemented a delay yet

Member Author

Tested ok for a full build/launch/upload cycle on a single project, a few assessors. Next, test thresholds

Member Author

With thresholds set to 1, only 1 job got launched. Would be helpful to report why launching stopped in the log.
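
One way to do that, as a sketch only: have the threshold check return a human-readable reason and log it when launching stops. `stop_reason`, its parameter names, and the logger name are hypothetical and not taken from dax/launcher.py.

```python
import logging

logger = logging.getLogger("launcher_sketch")  # hypothetical logger name


def stop_reason(total, pending, pending_uploads,
                queue_limit, pending_limit, upload_limit):
    """Return a string explaining why launching should stop, or None."""
    if total >= queue_limit:
        return f"total jobs {total} >= queue limit {queue_limit}"
    if pending >= pending_limit:
        return f"pending jobs {pending} >= pending limit {pending_limit}"
    if pending_uploads >= upload_limit:
        return f"pending uploads {pending_uploads} >= upload limit {upload_limit}"
    return None


def can_launch(**counts_and_limits):
    """Log the reason (if any) and report whether another launch is allowed."""
    reason = stop_reason(**counts_and_limits)
    if reason is not None:
        logger.info("Launching stopped: %s", reason)
        return False
    return True
```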

Member Author

Launch delay is working. @bud42 this is ready for another look. Not sure how it will interact with #369 though

Member Author

No. I meant to do that via Template, hang on

dax/launcher.py (outdated review thread, resolved)
Member

Looks great! Let's merge this prior to #369. Hopefully, git will work its magic!

Member Author

@bud42 see if that should do it?

Member

@bud42

looks good!
