Stas Bekman ee286e53c8 set device_id in torch's init_process_group (#7266)
This PR fixes the following warning, which appears on `torch.distributed` calls
when running under deepspeed:
```
[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0
to perform barrier as devices used by this process are currently unknown. This can
potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
barrier() to force use of a particular device, or call init_process_group() with a device_id.
```
by setting `device_id` to the device corresponding to the `LOCAL_RANK` env
var.
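
A minimal sketch of the idea in plain `torch.distributed` (illustrative only,
not DeepSpeed's actual init path):

```
import os

import torch
import torch.distributed as dist

# LOCAL_RANK is set by the launcher (torchrun, the deepspeed launcher, etc.)
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Passing device_id binds the process group to this GPU up front, so NCCL
# no longer has to guess the rank-to-GPU mapping (the source of the warning
# above). The device_id arg is accepted by recent PyTorch releases.
dist.init_process_group(backend="nccl", device_id=device)
```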

-------------------

Update: discovered that `torch.dist` deadlocks with `torch>=2.7.0` when using
the `device_id` arg - switching to draft for now, as we can't commit this
until we know how to work around it.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-07-16 08:32:20 -07:00