Commit Graph

172 Commits

Author SHA1 Message Date
e06ee4aa9f Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))
2025-02-14 16:44:46 +00:00
06f4a5c0e5 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use pinned version of NCCL rather then submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 15:29:59 +00:00
dea7ad3371 PEP585 update - torch/testing (#145200)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200
Approved by: https://github.com/bobrenjc93
2025-01-20 22:42:42 +00:00
2c4281d7da Make MultiProcContinuousTest timeout configurable (#145099)
Allows test classes using MPCT to set their own timeout as a class
property, which is good enough since the processgroup is shared across
test instances and the timeout is set at processgroup init.

Also sets a default timeout of 2 minutes, which is probably (?) long
enough for reasonable tests, but can be changed if it causes flakyness.
It's preferable to have as short default timeout as possible, since when
debugging tests getting a timeout quickly helps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
ghstack dependencies: #145010, #145011
2025-01-18 04:37:12 +00:00
3b6b306b71 Migrate from Tuple -> tuple in torch/testing (#144256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256
Approved by: https://github.com/aorenste
2025-01-10 06:37:55 +00:00
080f992d68 Revert "[CI] Reduce distributed test timeout to 60s (#141168)"
This reverts commit e8de8f3969bf935442378efd125442de90e78431.

Reverted https://github.com/pytorch/pytorch/pull/141168 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think we missed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/141168#issuecomment-2494060624))
2024-11-22 15:46:37 +00:00
e8de8f3969 [CI] Reduce distributed test timeout to 60s (#141168)
Pulling a PR to test viability.
Today's timeout is 300s, which could waste quite some machine time if a hang happens in CI.

Differential Revision: [D66275756](https://our.internmc.facebook.com/intern/diff/D66275756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141168
Approved by: https://github.com/clee2000
2024-11-22 00:59:55 +00:00
66476617bf [Dist][CI] Easier override of destroy-upon-exit setting (#141192)
Adding `destroy_pg_upon_exit` property to allow derived Test classes to control whether auto destroy is desired.
(Otherwise, derived test classes will need to rewrite the `_run()` method, leading to duplicated code of `_run()` and if one needs to add things to `_run` in the future, more code change is needed.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
2024-11-21 07:32:56 +00:00
e0482fdf95 Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-21 01:40:11 +00:00
496c1e78c5 Revert "Implements user buffer registration using MemPool (#133603)"
This reverts commit 25d9be37bef949c675e42b4929ddcb6997af2a7b.

Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))
2024-11-19 22:42:26 +00:00
b379a28a95 Generalization of distributed test cases for non-CUDA devices (#138216)
# Motivation
This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types.

It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758(https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784)

This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstracts out the process group creation /deletion and other functionality for a given device.

The new generalized content can be added by deriving from this base class.
Also includes other misc changes for gaudi support

The second changed file is test_functional_api. a test file in common distributed. This file is a POC for how we can use this new class to write more device agnostic distributed test cases.

The following changes have been made to test_functional_api.py:
-Functionality has been added to test for non cuda devices using intel HPU as an example
-Multiple set up steps previously required by MultiProcessTestCase have been abstracted out
-Misc adaptations to allow for general call to accelerators while adding test skips instead explicitly skipping for multiple GPUs
-Skipifhpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs

NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator.

I will be raising a separate PR to improve usability pf said decorators in a device agnostic setting in the manner suggested by @kwen2501 in a comment on this PR.

This pr is a cleaned up version of a previous PR(#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2024-11-18 09:38:00 +00:00
1d5a8ee8fb [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
2024-11-18 04:26:21 +00:00
bf8709b08a Revert "[C10D] call destroy_process_group after MultiProcess tests (#140820)"
This reverts commit 77d1f076dadec7a77c4bcf807c4efbef6ca5a8f1.

Reverted https://github.com/pytorch/pytorch/pull/140820 on behalf of https://github.com/wconstab due to failures on trunk not on PR CI ([comment](https://github.com/pytorch/pytorch/pull/140820#issuecomment-2480644227))
2024-11-16 16:32:14 +00:00
77d1f076da [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
2024-11-16 14:24:52 +00:00
25d9be37be Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-15 12:47:49 +00:00
4e647871d6 Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786
Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305
ghstack dependencies: #139716
2024-11-07 01:58:05 +00:00
e4ad02892f Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-20 23:48:54 +00:00
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
c88b77af9c [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 21:39:39 +00:00
0ff6f7a040 Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)"
This reverts commit 1581a93e8705dc23f649573d4404cd6816d614af.

Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))
2024-10-18 13:21:17 +00:00
1581a93e87 [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 09:10:01 +00:00
24ee4af86b Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit 2b7c7a20b9c0e8e7f2773ffc5c9f79c3cae2070b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))
2024-10-16 20:05:38 +00:00
2b7c7a20b9 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 16:42:57 +00:00
78632b97b1 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit f43c4d28b8f955fe1f2b80f193815edadc95507b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))
2024-10-16 07:26:34 +00:00
f43c4d28b8 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 05:03:08 +00:00
b55ff476bd Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit cdd8fa98c77b052085cca65dd54769ae18b72104.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
cdd8fa98c7 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
2024-10-10 17:16:34 +00:00
cd17b2645c Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit a93d3873e97973fbc0009245579ee4e4fa7f155a.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))
2024-10-04 13:58:57 +00:00
a93d3873e9 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
2024-10-04 00:44:02 +00:00
3efaa016b1 [c10d] Make test compatible for new pytest (#136158)
Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517.

Short-term fix following CPython: 51aefc5bf9/Lib/unittest/case.py (L419-L426)

Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158
Approved by: https://github.com/fegin
2024-09-18 17:10:55 +00:00
b5be4d8c05 Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
skip_if_rocm is used only in multiprocess case (when UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If used for single process UT, it will cause the UT to fail as the process returns a non-zero exit code. Use skipIfRocm in single process UTs.

To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
2024-09-18 11:01:23 +00:00
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predicable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
fc3d2b26cd Use fake PG for test_compute_comm_reordering.py unit tests (#131415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415
Approved by: https://github.com/yifuwang
2024-07-23 22:53:23 +00:00
fff92d4f18 Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494)"
This reverts commit 9f392f8294e928aec49599ad649aa899e1356102.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))
2024-07-18 17:42:05 +00:00
9f392f8294 Use inductor TestCase for test_replicate_with_compiler.py (#129494)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-18 03:08:32 +00:00
8458dc8966 Revert "Use inductor TestCase for distributed tests (#129494)"
This reverts commit 3cd2ae331a5ed6839456bb0025c729a1ee50bc84.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))
2024-07-17 00:32:48 +00:00
3cd2ae331a Use inductor TestCase for distributed tests (#129494)
Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-16 01:24:35 +00:00
6e43897912 [BE][ptd_fb_test][3/N] Enable TestSlide for MultiThreadedTestCase (#128843)
Enabling testslide for MultiThreadedTestCase, similar to https://github.com/pytorch/pytorch/pull/127512.

Differential Revision: [D58677457](https://our.internmc.facebook.com/intern/diff/D58677457/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128843
Approved by: https://github.com/wz337
2024-06-18 07:05:31 +00:00
85758fa5ae [c10d][TCPStore] make TCPStore server use libuv by default (#127957)
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">

We hope this improvement would benefit users from a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.

**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to cause any user notice except significant faster TCPStore startup for large-scale jobs.

One thing to note is, we do not support the initialization approach where user passes in a socket for libuv backend. We plan to support it as a next step but we choose to disable it before fully testing. If you are initializing TCPStore in this approach, you can see the next section to remain using the old TCPStore server.

**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there're 3 ways:

1. If user is directly instantiating TCPStore object, user can pass in argument `use_libuv=False` to use the old TCPStore server backend e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, user can set environment variable `USE_LIBUV` to `"0"` when launching.

These 3 approach are in order of precedence. That being said, if user specifies `use_libuv=0` in `init_method` and also sets environment var `USE_LIBUV="1"`, the former will take effect and the TCPStore backend instantiated will be the old one instead of the one using libuv.

**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

**TODO**
1. Update the doc at

- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization

2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
2024-06-07 16:53:01 +00:00
bb68b54be0 [BE][ptd_fb_test][1/N] Enable testslide (#127512)
This change allows to enable Testslide, which gives us more readable output, import time, etc. The PR is previously stamped https://github.com/pytorch/pytorch/pull/126460 but the old PR has some ghexport issue.

Differential Revision: [D57919583](https://our.internmc.facebook.com/intern/diff/D57919583/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127512
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-06-05 17:45:15 +00:00
6454e95824 [FSDP2] enable CI for torch.compile(root Transformer) (#127832)
This CI showcases FSDP2 works with `torch.compile` root model, since FSDP1 can do the same

compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`

compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832
Approved by: https://github.com/awgu
2024-06-05 17:29:46 +00:00
04bf7713e8 [c10d] Reduce test time by reusing ProcessGroup (#125648)
## Problem this PR resolves
Today, most of distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?

If we look deeper, we would find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and probably fits distributed well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()

def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are "**test-scope fixtures"**, meaning, they are called per test, each test will have brand new processes. Of course we'd have to recreate ProcessGroup every time.

## How we are fixing it
First, obviously, we need to put a PG's lifetime into a longer scope. Python `unittest` provides such a helper, called **"class-scope fixtures."** It is embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class.  Therefore, we would do:
```
@classmethod
def setUpClass(self):
    dist.init_process_group(...)

@classmethod
def tearDownClass(self):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**

Second, we'd need to avoid per-test process spawn and terminate. That's easy, we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...` or
2. use `mp.spawn()` under `if __name__ == "__main__":`.

Point is, launch the processes only once.

## Result
We moved the "positive tests" from test_c10d_nccl.py to test_c10d_ops_nccl.py.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.

## Limitation
For tests intended to test destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests where we would reuse the PGs, and the other one for abort/destroy and negative tests like watchdog timeout.

## Next step
Migrate the tests of distributed that would fit with this test style!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125648
Approved by: https://github.com/wconstab
2024-05-08 22:33:40 +00:00
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
2a2e1d8e4f [functional collective] change the Python APIs to only use the native funcol ops (#123777)
## Summary

After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR:
- Removed `use_native_funcol()`.
- Removed the code path in the Python APIs when `use_native_funcol()` is `False`.
- Changed the CI tests that runs on both native funcol and legacy funcol through the Python API to only run with native funcol.

## Test Changes

`test_functional_api.py`
- Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable has been removed from the native funcol.
- Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)

`test/distributed/_tensor/test_dtensor.py`
`test/distributed/_tensor/test_dtensor_compile.py`
`test/distributed/test_device_mesh.py`
`test/distributed/_tensor/experimental/test_tp_transform.py`
`test/distributed/_tensor/test_matrix_ops.py`
`test/distributed/test_inductor_collectives.py`
- All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol.

`test/distributed/test_c10d_functional_native.py`
- Removed the `run_with_native_funcol` decorators.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777
Approved by: https://github.com/wanchaol
ghstack dependencies: #123776
2024-04-13 03:08:36 +00:00
b78e8c0d37 remove duplicate method run_subtests (#122421)
Fixes #121654

I have removed the duplicate test `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421
Approved by: https://github.com/soulitzer
2024-03-22 07:00:49 +00:00
22cd2658b4 Disable GroupRegistry's thread isolation by default (#121457)
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).

However, there's a problem - there are python use cases that initializes/registers process groups in one thread, and runs collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.

This PR fixes the issue by:
- Make `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry.
- Introduces `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly.

Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
2024-03-08 19:31:24 +00:00
fad228c7cc Fix a potential race condition in the test decorators for enabling/disabling native funcol (#120833)
Previous, we parametrize some tests to run with both native and py funcol by flipping a global variable. However, some of these tests are multi-threaded tests, and the parametrization mechanism could lead to race condition.

This PR changes the mechansim to use `mock.patch` which is applied on a per-thread basis.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
2024-02-29 03:19:44 +00:00
637cf4a3f2 Test parametrization utils for native funcol migration (#119950)
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.

run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).

run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.

run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```

This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
2024-02-19 02:46:03 +00:00
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00