2500 Commits

Author SHA1 Message Date
2a58d2a155 StringCordView: make iterator fast when there is only one piece (#151810)
This makes the StringCordView iterator a variant holding
either the existing implementation (when there is more than one piece)
or a simple `std::string_view::iterator` (when there is only one
piece). The latter seems to be significantly cheaper.

Differential Revision: [D73379178](https://our.internmc.facebook.com/intern/diff/D73379178/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151810
Approved by: https://github.com/Skylion007
ghstack dependencies: #151801, #151802, #151803, #151804, #151805, #151806, #151807
2025-04-24 04:43:34 +00:00
aa61707a56 Fix extra heap allocation in Source constructor (#151800)
This was a sneaky one: the StringCordView default constructor allocates.

Differential Revision: [D73129448](https://our.internmc.facebook.com/intern/diff/D73129448/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151800
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #151682
2025-04-22 23:36:06 +00:00
bf28d1cafc Expose bicubic mode for torch::nn::functional::grid_sample in LibTorch (#150817)
When bicubic interpolation was added to grid_sampler in #44780, `GridSampleFuncOptions` was not updated to allow a user to use bicubic mode in LibTorch, even though the function could handle it. This PR fixes the parity such that LibTorch's  `torch::nn::functional::grid_sample` behaves the same as PyTorch's `torch.nn.functional.grid_sample`.

Existing users can directly use `torch::grid_sampler` but must know what int to pass for the interpolation (2 for bicubic) and padding mode parameters, which is not ideal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150817
Approved by: https://github.com/Skylion007
2025-04-21 08:55:27 +00:00
3ed5f1fb77 [CUDA][cuBLAS] Aten GEMM overload for FP32 output from FP16/BF16 inputs (#150812)
Enable FP32 output from FP16/BF16 GEMMs in aten with cuBLAS. Accumulation for these GEMMs are generally already done in FP32. Adds the functionality to the following aten operators:
* mm
* bmm
* addmm
* baddmm

Follow up of customer issue: https://github.com/pytorch/pytorch/issues/146241#issuecomment-2781889390

Differential Revision: [D73126191](https://our.internmc.facebook.com/intern/diff/D73126191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150812
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-04-18 01:53:26 +00:00
c3a18f6126 [AOTInductor] Add states for constant folding process (#151273)
Summary:
We add states in the constant folding process for AOTInductor.
Basically, there's 3 states, which is
(1) None: The state when no constants are loaded and uninitialized.
(2) Initialized: The state when constants are loaded, but not yet
folded.
(3) Folded: The state where the model is fully ready with folded
constants.

Note that even if constant folding is not enabled, we still only run
when state is FOLDED, this is okay because without constant folding, the
transition from INITIALIZED to FOLDED is just a pass-throught.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_constant_folding_with_update

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D73002538](https://our.internmc.facebook.com/intern/diff/D73002538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151273
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-04-17 16:41:38 +00:00
23a3cef5d9 [c10d] Add _allgather_base , reduce_scatter , and _reduce_scatter_base into ProcessGroupMPI to enable FSDP with MPI backend (#150162)
This PR implements _allgather_base, reduce_scatter, and _reduce_scatter_base in the MPI backend (ProcessGroupMPI), enabling support for Fully Sharded Data Parallel (FSDP) in environments that use MPI for distributed communication.

### Context

As noted in https://github.com/pytorch/pytorch/issues/85628, FSDP currently supports only the NCCL backend. Due to this limitation, FSDP cannot run on legacy HPC environments or clusters that rely on MPI.

By implementing just these three collective operations, we can enable FSDP to work with the MPI backend. These collectives are implemented in a similar manner to existing operations such as allgather.

### Testing

We validated this PR using pytorch/build/bin/ProcessGroupMPITest with OpenMPI, and all tests passed successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150162
Approved by: https://github.com/H-Huang
2025-04-14 19:31:38 +00:00
ad5e9065ac [Profiler/Easy] Remove temp flag for on-demand Memory Snapshot (#151068)
Summary: Now that we have profiler impl in we don't need the temporary flag. submodule update too.

Test Plan: CI

Reviewed By: sanrise

Differential Revision: D72672186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
f663aa4e81 [c10d][tcp_store] Fix connection reset caused by wrong socket close (#150987)
While fixing the memory leak in https://github.com/pytorch/pytorch/pull/145757, we accidentally close the socket for the case when nread == 0 and thought it is the case when connection is closed. This is not true. According to libuv doc: https://docs.libuv.org/en/v1.x/stream.html#c.uv_read_cb.

> nread might be 0, which does not indicate an error or EOF. This is equivalent to EAGAIN or EWOULDBLOCK under read(2).

We found this bug when debugging a broken pipe issue when users first call a set and then wait for all keys right afterwards on 128 ranks. This might also cause other broken pipe issues we have seen in the prod jobs recently.

Added a unit test to test this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150987
Approved by: https://github.com/d4l3k, https://github.com/XilunWu
2025-04-10 18:48:57 +00:00
f3cf3ec591 [AOTInductor] Add User Managed buffer for AOTI constant buffer. (#150276)
Summary:
We add the functionality to allow users to directly pass in a at::Tensor
into AOTInductor, that would be used as the constant.
This user managed buffer skips the copying step in AOTInductor, and let
users to directly manage the memory usage themselve.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/data/users/$USER/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D72589514](https://our.internmc.facebook.com/intern/diff/D72589514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150276
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2025-04-10 00:15:44 +00:00
99c9a31386 [submodule] [Snapshot/Profiler] Memory Snapshot On Demand (#150559)
Summary:
Profiler side of memory snapshot.

1. Add API to actually do snapshot when client interface is called
2. Add ifdefs to builds so that kineto hooks snapshot correctly.

Design Philosophy: There is one interesting part of this implementation and it is during export. For export we are callign the python impl of the export rather than CPP even though we are already in CPP. This is because it is better to simply have one path of export rather than 2. Personally, I want there to be parity between auto-trace and on-demand so it if we can limit the side paths then we will have an easier time maintaining this relationship

Test Plan: {F1976563426}

Reviewed By: sanrise

Differential Revision: D70733247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
063ea5d669 [AOTInductor] Modify test for Memory tracking for memory-related (#150269)
operations

Summary:
Fix the test for memory tracking. This PR does:
(1) Add tracking before and after for all memory-related operations.
Make sure the operation do indeed captures memory both in CUDA and
torch's CUDACachAllocator Make sure the operation do indeed captures
consumed memory both in CUDA and torch's CUDACachAllocator.
(2) Keep track of memory being reserved by CUDACacheAllocator in
torch and it's relationship with global CUDA memory consumption.

Test Plan:
This PR is adding tests.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150269
Approved by: https://github.com/jingsh, https://github.com/chenyang78, https://github.com/desertfire
2025-04-02 04:18:18 +00:00
35c45a4a31 [Reland] Launch kernel on current stream & remove record_stream entirely (#150398)
Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
such stream synchronization become expensive in Inference world (cpu overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
this helps shave off 50% CPU overhead **(70us -> 35us)**, which reduce total CPU/GPU from **230us to 195us by 15%**

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: https://github.com/pytorch/pytorch/pull/149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: https://github.com/pytorch/pytorch/pull/150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: https://github.com/pytorch/pytorch/pull/150130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150398
Approved by: https://github.com/atalman
2025-04-01 16:46:07 +00:00
a2070e2fd5 [AOTInductor] Free tensors in test (#150274)
Summary:
This PR frees tensor that were new-ed within the test itself to prevent
memory leak.

Test Plan:
Fixing tests itself.

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150274
Approved by: https://github.com/chenyang78
2025-03-31 23:28:13 +00:00
f3c77b2458 Set requires grad in TensorMaker::make_tensor() (#148255)
Fixes #146419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148255
Approved by: https://github.com/soulitzer
2025-03-29 08:06:42 +00:00
03313c6619 [AOTInductor] Add function for users to extract constants in container (#150163)
Summary: Add extract_constant_map that allows users to inspect the constants being used by AOTInductor

Test Plan:
`python test/inductor/test_aot_inductor.py -k extract_constants_map`

`LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib /data/users/$USER/pytorch/build/bin/test_aoti_inference`

Differential Revision: D72020400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150163
Approved by: https://github.com/chenyang78
2025-03-29 03:36:12 +00:00
e6afb51805 [AOTInductor] Free folded constants that's managed by AOTInductor (#149825)
internally.

Summary:
This diff allows freeing the usage of folded constants that's created by
AOTInductor through CUDACachingAllocator instead of the constant blob
from cudaMalloc directly.

Test Plan:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149825
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jingsh
2025-03-27 06:05:50 +00:00
12628ba24d [AOTInductor] Bug fix for freeing buffers when freeing multiple times (#149810)
Summary:
We might free the active buffer if we free the buffer twice.

Test Plan:
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149810
Approved by: https://github.com/chenyang78
2025-03-25 20:26:36 +00:00
c73a526599 Extract reusable portions of elu_kernel into header (#149673)
Similar to #140425, we are making the implementation usable via header-only code sharing.

Review note: #62546 by @yanbing-j removed expm1 usage from this path. I don't know why and expm1 should be more efficient, so I've put it back. Please let me know if there is a good reason I shouldn't.

Testing: existing correctness tests should cover.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149673
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-21 23:54:26 +00:00
46dd226702 Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind (#149529)
Summary:
We need to properly fakify torchbind objects, including the ones in graph module attributes, so the resgitered fake implementation works properly.

- _fakify_script_objects in `compile_fx`
- Allow fake torchbind objects in `torchbind_constants`

Remove `node.meta["unbacked_bindings"]` for `aot_compile` in `compile_fx`. Otherwise `ShapeProp` will fail when trying to resolve the `unbacked_bindings` of `with_effect` tokens.

Update `sigrid_transforms_test` to use the latest `torch._inductor.aot_compile` API.

Add a test for `Fakify torchbind objects in compile_fx and add tests for SigridTransformsInstanceTorchBind` in `e2e_test`.

Test Plan:
```
buck run //caffe2/torch/fb/sparsenn:sigrid_test -- -r test_transform_torch_bind

buck run //sigmoid/inference/test:e2e_test_cpu -- -r SigridTransforms

buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id 545017754 --test_suite ads_all --mode test_preproc

```

Differential Revision: D70013257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149529
Approved by: https://github.com/angelayi
2025-03-21 18:58:28 +00:00
04e251a7dd [AOTI] Add num_runners to AOTIModelPackageLoader (#149364)
Summary: AOTIModelContainerRunner takes a num_runners argument for multi-threaded inference, but AOTIModelPackageLoader forgot to take the same parameter, although its run() API already expects to take an optional cudaStream_t parameter for multi-threaded inference.

Differential Revision: [D71357418](https://our.internmc.facebook.com/intern/diff/D71357418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149364
Approved by: https://github.com/angelayi
2025-03-19 02:28:06 +00:00
bb42e4d137 [AOTInductor] Add function to free buffer (#149161)
Summary:
We add a function that allows users to free the unused buffer.

Test Plan:
Testing correctness:
    python test/inductor/test_aot_inductor.py -k free_inactive

    Testing memory consumption:
    LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
    /home/$USER/local/pytorch/build/bin/test_aoti_inference

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
5e1b715dda BC fix for AOTIModelPackageLoader() constructor defaults (#149082)
The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire
2025-03-13 18:40:53 +00:00
b9803a5c81 [AOTI] Re-enable AOTI cpp unit test (#149085)
Summary: test_inductor_aoti was removed by accident previously. Add it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149085
Approved by: https://github.com/jbschlosser
2025-03-13 16:00:38 +00:00
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c6845d00711ffab648132b7377e8cd3edb.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db7afbab792ad76c24840c1552a0e76d.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
4abff4b271 Introduce cache clearing APIs for the lazy graph executor (#144489)
This PR introduces two new methods to the LazyGraphExecutor class:

- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.

The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.

On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash.

Simultaneously, **another motivation**, is that users could also clear some computation hashes for an added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489
Approved by: https://github.com/wconstab
2025-01-29 17:38:01 +00:00
c0861d092c [PGNCCL] Add an API to get the status/error code at the PG level (#144498)
Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused some perf
drop due to frequent TCPStore check in watchdog thread. The fix is to move the
tcpstore check in monitoring thread

If unhealthy, the user should be able to get the type of errors, e.g.,
timeout,nccl error or remote error.

This API is applied to PG level, compared to the
work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle
the PG failure as a whole, e.g, restarting the PG.

Error handling at the work level is still useful for users to attach
work specific context and debug the RC of the specific failing
work/collective

Note it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a
local error) to all other ranks in the PG, the broadcast is done through
TCPStore currently

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
2025-01-24 16:47:32 +00:00
ae7df51232 [c10d] Fix CudaEventCache for dangling references (#144496)
Reported in https://github.com/pytorch/pytorch/issues/143470, we have a dangling references in `CudaEventCache`. So we want to fix it.
1. We add a unit test to repro the issue mentioned in the issue.
2. Instead of converting variables to shared pointers as suggested in the issue, we then make the cache itself a shared pointer. So if the thread creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent get deleted. (thanks for the suggestion from @kwen2501 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
2025-01-15 05:11:48 +00:00
b80ecc4457 Revert "Fix poision child process issue when call getAccelerator() (#144368)"
This reverts commit 2583d831d40d6fa64f0b637d5bc7598e484a3283.

Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))
2025-01-10 23:36:43 +00:00
2583d831d4 Fix poision child process issue when call getAccelerator() (#144368)
# Motivation
fix https://github.com/pytorch/pytorch/issues/144152

# Solution

- Align `at::globalContext()::hasXXX` to determine if accelerator XXX is built with PyTorch or an extension already registered to PyTorch.
- Define `at::hasXXX` to determine if accelerator XXX is available at runtime.
- Use `at::globalContext()::hasXXX` in `getAccelerator` rather than `at::hasXXX` to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144368
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/gujinghui
2025-01-10 09:28:27 +00:00
bbec35f028 [BE]: Replace clone detach with detach clone to be more efficient (#144469)
Follow up to #144270 and fix some vulkan code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144469
Approved by: https://github.com/awgu
2025-01-09 18:28:39 +00:00
1365ae859c [ROCm][CI] upgrade CI to ROCm 6.3 (#142152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142152
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-09 17:14:16 +00:00
bf7009d839 [rpc] Fix unit test after c10::nullopt removal (#143690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143690
Approved by: https://github.com/yifuwang, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-12-20 23:36:07 +00:00
f9da639950 [codemod] Fix a few unused-variable issues in pytorch (#143517)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143517
Approved by: https://github.com/mhorowitz
2024-12-19 00:18:08 +00:00
d4ed5941db Fix floating point literals in IRPrinter (#142119)
Fixes #114035
This is a recreation of #140002 with approval from its author. Original description:
>when v larger than 1e16, the format will be error. example: v is 1.2e17, the output is 1.2e17.f, it have two point '.'

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-18 21:59:48 +00:00
cyy
075905b7bd [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-12-13 06:22:13 +00:00
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
2f0fe82f6d Revert "[14/N] Fix extra warnings brought by clang-tidy-17 (#141644)"
This reverts commit 24a5a2ef258d2b482ded674cdb9555afaf081402.

Reverted https://github.com/pytorch/pytorch/pull/141644 on behalf of https://github.com/clee2000 due to failing internally D67112938 ([comment](https://github.com/pytorch/pytorch/pull/141644#issuecomment-2539602023))
2024-12-12 17:43:36 +00:00
7667235a23 c10::optional -> std::optional (#142514)
Fixes issues introduced in https://github.com/pytorch/pytorch/pull/141348 and https://github.com/pytorch/pytorch/pull/139578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142514
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-12 17:23:46 +00:00
cyy
24a5a2ef25 [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang
2024-12-11 18:40:42 +00:00
7e41717a26 c10::string_view -> std::string_view in caffe2/jit (#142383)
Test Plan: Sandcastle

Differential Revision: D66939979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142383
Approved by: https://github.com/malfet
2024-12-10 15:42:28 +00:00
d3d1a78774 [AOTInductor] Add standalone test for compilation from ExportedProgram (#142327)
Summary: Provide a standalone path to compile and run a ExportedProgram in C.

Test Plan:
(1) Generate a compiled model from ExportedProgram
```
python generate_lowered_cpu.py --input-path /tmp/$USER/ep.pt --output-path /tmp/$USER/final.pt
```
(2) Compile a standalone test runner
```
TORCH_ROOT_DIR=/data/users/$USER/pytorch sh standalone_compile.sh standalone_test.cpp standalone_test.out
```
(3) Run test for the compiled model in step (1)
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib ./standalone_test.out /tmp/$USER/final.pt
```

Differential Revision: D66872380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142327
Approved by: https://github.com/hl475
2024-12-10 06:50:09 +00:00
1cb2ebd740 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
6fcb294e18 Revert "[AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)"
This reverts commit 91d30546a4338b17f31d31a674662aa53d61b1aa.

Reverted https://github.com/pytorch/pytorch/pull/140664 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
91d30546a4 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269
2024-12-07 19:22:04 +00:00
215f5d77b5 [functional autograd] Refactor validate_outputs into a functional variant (#141348)
Today, validate_outputs is stateful (it depends on the autograd graph).
This PR refactors it into a stateless form that just depends on
InputMetadata.

Test Plan:
- new unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141348
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278
2024-12-04 18:06:31 +00:00