-
Notifications
You must be signed in to change notification settings - Fork 183
Implement startup_order and stop_criteria #2714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Do you plan to also support |
@peterschmidt85 in a separate PR |
and then we can update the NCCL tests example |
if run.run_spec.merged_profile.stop_criteria != StopCriteria.MASTER_DONE: | ||
return False | ||
for job in run.jobs: | ||
if job.job_spec.job_num == 0 and job.job_submissions[-1].status == JobStatus.DONE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Can also check for termination_reason == JobTerminationReason.DONE_BY_RUNNER
to terminate the run faster, without waiting for the terminating
-> done
master job transition. See line 241
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we want to terminate the run before the master is really done.
class StartupOrder(str, Enum): | ||
ANY = "any" | ||
MASTER_FIRST = "master-first" | ||
WORKERS_FIRST = "workers-first" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I'm not sure about calling non-master nodes "workers", because the master node is also a "worker" - it performs the same work other nodes do.
I can suggest to use "secondary" (secondary-first
) or avoid any names (master-last
). Although we might still need a name to use in the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
master/worker is standard terminalogy used for pytorch, mpi, etc. Let's not reinvent.
A part of #2467
The PR introduces new run configuration properties:
startup_order
allows specifying the order in which master and workers jobs are started.stop_criteria
allows specifying the criteria determining when a multi-node run should be considered finished.They simplify running mpirun with
dstack
:Other multi-node tasks such as iperf may require
startup_order: master-first
. Most such as pytorch will work with the defaultstartup_order: any
.TODO: