- Add a `kwargs` option; add the `dynamic_shapes` option so users can supply it directly to `torch.export`.
- Make the options keyword-only arguments (bc-breaking)
- Deprecate the `training` and `operator_export_type` options and include a warning message. The exact time for removal is TBD but the message should discourage users from using the options.
- Deprecate two functions `exportable_ops` and pretty print export
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131501
Approved by: https://github.com/titaiwangms
`inductor` and `rocm` workflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: https://github.com/pytorch/pytorch/pull/131489#issue-2425804464
* Move rocm.yml to cron frequency
* Move ROCm CI jobs from inductor.yml to inductor-rocm.yml
* Introduce `ciflow/inductor-rocm` as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)
* After this PR, only `trunk` workflow jobs for ROCm will run on every commit and PR merge, but since they take 45min*3 time on average, I decided to leave them as-is since it will provide us some basic insulation against ROCm breakage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131637
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/huydhn
Summary: Since WaitCounter frontend itself has minimal depdendencies it's fine to be moved into c10. Specific backends can be registered/linked separately.
Test Plan: unit test
Reviewed By: jamesperng, asiab4, c-p-i-o
Differential Revision: D59842868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen, https://github.com/justinchuby
This PR adds an API `FSDPModule.set_reduce_scatter_divide_factor` to allow setting a custom gradient divide factor for reduce-scatter. This can be useful when using parallelisms in combination with FSDP (e.g. expert parallelism), where gradients need to be divided by a custom factor (e.g. an extra `EP` factor).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129286
Approved by: https://github.com/weifengpy
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:
1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks
The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.
Here is how automatic dynamic synchronization works:
1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.
For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
High level goals:
- Cover the all-gather and reduce-scatter pattern matchers with unit tests
- Make it easier to exclude certain collectives as async-tp candidates
- Make it easier to match other all-gather and reduce-scatter variants (e.g. fp8 collectives)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131409
Approved by: https://github.com/weifengpy
Resubmit of #129325
Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.
Now we have a single scheduler node, which produces mutiple `MutationOutput`
buffers as its output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204
- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
Add example `NestedTensor`s with inner dimension of size `1` to `_get_example_tensor_lists` with `include_inner_dim_size_1=True`. This diff creates `NestedTensor`s of sizes `(B, *, 1)` and `(B, *, 5, 1)`, ensuring that the current implementations of jagged reductions for `sum` and `mean` hold for tensors of effective shape `(B, *)` and `(B, *, 5)`.
Differential Revision: [D59846023](https://our.internmc.facebook.com/intern/diff/D59846023/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131516
Approved by: https://github.com/davidberard98
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
The issue addressed is that compiled autograd changes the calling convention of the FX graph to only have a single placeholder which contains a list of inputs. In this case, the meta of the tensor input nodes don't contain the `tensor_dict` meta. This adds them.
The context is that `tensor_dict` is used to convey if a tensor is an input with a static address.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131556
Approved by: https://github.com/anijain2305
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this results in the TORCH_API in the headers being ineffective and presents such functions being used outside libtorch_cpu.so
This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh