Implement startup_order and stop_criteria #2714

Merged: 8 commits merged into master from issue_2467_simpler_mpi on Jun 2, 2025

Conversation

@r4victor (Collaborator)

Part of #2467.

This PR introduces two new run configuration properties:

  • startup_order specifies the order in which the master and worker jobs are started.
  • stop_criteria specifies when a multi-node run should be considered finished.

They simplify running mpirun with dstack:

type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

image: dstackai/efa
commands:
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      cd /root/nccl-tests/build
      : > hostfile
      for ip in ${DSTACK_NODES_IPS}; do
        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
      done
      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
      # Run NCCL Tests
      ${MPIRUN} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

Other multi-node tasks, such as iperf, may require startup_order: master-first. Most, such as PyTorch, work with the default startup_order: any.
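
For instance, a minimal iperf sketch under master-first (not part of this PR; it assumes an image with iperf3 installed and that the first entry in DSTACK_NODES_IPS is the master's IP):

type: task
name: iperf-test

nodes: 2
startup_order: master-first
stop_criteria: master-done

commands:
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      # --one-off: the server exits after serving a single client,
      # so stop_criteria: master-done ends the run
      iperf3 -s --one-off
    else
      # assumption: the first IP in DSTACK_NODES_IPS belongs to the master
      set -- ${DSTACK_NODES_IPS}
      iperf3 -c "$1" -t 30
    fi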

TODO:

  • use the new properties in the NCCL tests examples

@peterschmidt85 (Contributor)

Do you plan to also support DSTACK_MPI_HOSTFILE? I guess we just need to mount this file and pass the environment variable on container start.

@r4victor (Collaborator, Author)

@peterschmidt85 in a separate PR

@r4victor (Collaborator, Author)

and then we can update the NCCL tests example
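
For reference, a rough sketch of how the example's commands could shrink once DSTACK_MPI_HOSTFILE is supported; since that work is deferred to a separate PR, the variable's exact semantics here are an assumption:

commands:
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      cd /root/nccl-tests/build
      # hostfile generation would no longer be needed; the server provides it
      mpirun --allow-run-as-root --hostfile ${DSTACK_MPI_HOSTFILE} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi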

r4victor requested a review from jvstme on May 30, 2025 at 10:45
if run.run_spec.merged_profile.stop_criteria != StopCriteria.MASTER_DONE:
    return False
for job in run.jobs:
    if job.job_spec.job_num == 0 and job.job_submissions[-1].status == JobStatus.DONE:
@jvstme (Collaborator), Jun 2, 2025

(nit) Can also check for termination_reason == JobTerminationReason.DONE_BY_RUNNER to terminate the run faster, without waiting for the terminating -> done master job transition. See line 241
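
A minimal sketch of the suggested check, assuming the fields visible in the excerpt above (job_submissions, status, termination_reason):

if run.run_spec.merged_profile.stop_criteria != StopCriteria.MASTER_DONE:
    return False
for job in run.jobs:
    if job.job_spec.job_num == 0:
        last = job.job_submissions[-1]
        # consider the run finished as soon as the runner reports the master
        # job done, without waiting for the terminating -> done transition
        return last.status == JobStatus.DONE or (
            last.termination_reason == JobTerminationReason.DONE_BY_RUNNER
        )
return False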

@r4victor (Collaborator, Author)

I'm not sure if we want to terminate the run before the master is really done.
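
For reference, a sketch of the StopCriteria enum implied by the check above; MASTER_DONE is confirmed by the excerpt, while an all-done default value is an assumption:

class StopCriteria(str, Enum):
    # assumed default: wait for all jobs to finish
    ALL_DONE = "all-done"
    # confirmed by this PR: finish when the master job is done
    MASTER_DONE = "master-done"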

class StartupOrder(str, Enum):
    ANY = "any"
    MASTER_FIRST = "master-first"
    WORKERS_FIRST = "workers-first"
@jvstme (Collaborator)

(nit) I'm not sure about calling non-master nodes "workers", because the master node is also a "worker": it performs the same work the other nodes do.

I'd suggest using "secondary" (secondary-first) or avoiding names altogether (master-last), although we might still need a name to use in the code.

@r4victor (Collaborator, Author)

master/worker is standard terminology used by PyTorch, MPI, etc. Let's not reinvent it.

r4victor merged commit 2ddae6e into master on Jun 2, 2025
25 checks passed
r4victor deleted the issue_2467_simpler_mpi branch on Jun 2, 2025 at 08:56