0.19.12

@r4victor r4victor released this 04 Jun 11:22
· 145 commits to master since this release
8732138

Clusters

Simplified use of MPI

startup_order and stop_criteria

New run configuration properties are introduced:

  • startup_order (any, master-first, or workers-first) specifies the order in which the master and worker jobs are started.
  • stop_criteria (all-done or master-done) specifies when a multi-node run is considered finished.

These properties simplify running certain multi-node workloads. For example, MPI requires that workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done, and dstack won't wait for workers to exit.
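
As a minimal sketch, the two properties sit at the top level of a multi-node task configuration. The task name, node count, and the launcher script are hypothetical placeholders, not taken from this release.

```yaml
type: task
name: mpi-task              # hypothetical name
nodes: 2                    # assumed node count

# Start the worker jobs first so they are up when the master runs mpirun
startup_order: workers-first
# Consider the run finished as soon as the master job is done
stop_criteria: master-done

commands:
  - ./launch.sh             # hypothetical placeholder for the actual workload
```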

DSTACK_MPI_HOSTFILE

dstack now automatically creates an MPI hostfile and exposes the DSTACK_MPI_HOSTFILE environment variable with the hostfile path. It can be used directly as mpirun --hostfile $DSTACK_MPI_HOSTFILE.

Below is the updated NCCL tests example.
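
A configuration along these lines might look as follows. This is a sketch under stated assumptions: the node count, the path to the NCCL tests binary, the mpirun and all_reduce_perf flags, and the resources block are illustrative and may differ from the example shipped with dstack.

```yaml
type: task
name: nccl-tests

nodes: 2                            # assumed node count

startup_order: workers-first        # workers must be up before the master runs mpirun
stop_criteria: master-done          # the run ends when the master job finishes

env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      # Master node: launch the test across all nodes using the generated hostfile.
      # The binary path below is an assumption about the image contents.
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      # Worker nodes: stay alive until the master is done
      sleep infinity
    fi

resources:
  gpu: nvidia:1..8                  # assumed GPU spec
  shm_size: 16GB
```

Because stop_criteria is master-done, the workers can simply sleep; dstack stops them once the master job exits.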

CLI

We've also updated how the CLI displays run and job status. Previously, the CLI displayed the internal status code, which was hard to interpret. Now, the STATUS column in dstack ps and dstack apply displays a status code that makes it easy to understand why a run or job was terminated.

Examples

Distributed training

TRL

The new TRL example walks you through how to run distributed fine-tuning using TRL, Accelerate, and DeepSpeed.

Axolotl

The new Axolotl example walks you through how to run distributed fine-tuning using Axolotl with dstack.

What's changed

  • [Feature] Update .gitignore logic to catch more cases by @colinjc in #2695
  • [Bug] Increase upload_code client timeout by @r4victor in #2709
  • [Bug] Fix missing apt-get update by @r4victor in #2710
  • [Internal]: Update git hooks and package.json by @olgenn in #2706
  • [Examples] Add distributed Axolotl and TRL example by @Bihan in #2703
  • [Docs] Update dstack-proxy contributing guide by @jvstme in #2683
  • [Feature] Implement DSTACK_MPI_HOSTFILE by @r4victor in #2718
  • [Feature] Implement startup_order and stop_criteria by @r4victor in #2714
  • [Bug] Fix CLI exiting while master starting by @r4victor in #2720
  • [Examples] Simplify NCCL tests example by @r4victor in #2723
  • [Examples] Update TRL Single Node example to uv by @Bihan in #2715
  • [Bug] Fix backward compatibility when creating fleets by @jvstme in #2727
  • [UX]: Make run status in UI and CLI easier to understand by @peterschmidt85 in #2716
  • [Bug] Fix relative paths in dstack apply --repo by @jvstme in #2733
  • [Internal]: Drop hardcoded regions from the backend template by @jvstme in #2734
  • [Internal]: Update backend template to match ruff formatting by @jvstme in #2735

Full changelog: 0.19.11...0.19.12