Summary: When HOPs live out of tree, breaking changes to the HOP API become impossible, yet HOP implementations are deeply entwined with PyTorch internals. Move the HOP into the PyTorch tree so that such changes remain possible.
Test Plan: sandcastle and oss ci
Differential Revision: D60674861
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132526
Approved by: https://github.com/SherlockNoMad
Summary: The historical default here is "1", i.e., no parallel compilation. In order to prepare for rolling out the subprocess-based parallel compile, I had previously modified this code to allow parallelism when worker_start_method="subprocess". I realize this probably isn't the best rollout strategy: rather than opting all internal usages into both a) parallel compile _and_ b) a new implementation of parallel compile, let's put the default back to "1" and then roll out the new parallel-compile implementation only to those usages that have already opted in by explicitly setting compile_threads > 1.
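For reference, a minimal sketch of what such an explicit opt-in looks like (illustrative values; assumes the usual inductor config knobs `compile_threads` and `worker_start_method`):
```python
import torch._inductor.config as inductor_config

# Illustrative opt-in only; the historical default of 1 means no parallel compilation.
inductor_config.compile_threads = 8                  # opt in to parallel compilation
inductor_config.worker_start_method = "subprocess"   # use the subprocess-based implementation
```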
Differential Revision: [D60686105](https://our.internmc.facebook.com/intern/diff/D60686105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132540
Approved by: https://github.com/c00w
Summary:
# Problem
`TORCH_WARN` can cause massive log spam.
I captured the logs before and after adding this change.
*Before:*
* The log file size was ~61.15 MB (61148028 bytes).
*After:*
* The log file size was ~56.44 MB (56444057 bytes).
# Context
Looks like we tried to land this change earlier but it was reverted:
* D59413413
* Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function
# Testing Update
`test_warn_on_invalid_torch_function` would fail because the warning would not be called on the handling of the second torch function class since `TORCH_WARN_ONCE` stops repeats globally.
Updated the test so that it runs separate programs. (I was not able to actually run the test; could someone help me with that?)
Test Plan: Need help with this...
Differential Revision: D60561181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132374
Approved by: https://github.com/ezyang
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.
If there's an autocast context manager, the predispatch (strict) graph can look something like:
```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1); rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast); _enter_autocast = None
        return (mm_1,)
```
But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher-order operator and making a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
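For context, a minimal sketch (assuming a CUDA build; illustrative only) of the kind of module whose predispatch export graph contains these `_enter_autocast`/`_exit_autocast` nodes before this pass converts them into the higher-order op:
```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y):
        # the autocast context manager shows up as _enter_autocast/_exit_autocast nodes
        with torch.autocast("cuda", torch.bfloat16):
            return torch.mm(x, y)

ep = torch.export.export(M(), (torch.randn(8, 8), torch.randn(8, 8)))
print(ep)  # graph containing the autocast region
```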
Some potential follow-up improvements:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`.
2) Check the current autocast status (anything enabled? dtype?) and do not create a submodule if the autocast args match the current autocast status.
Test Plan:
CI
```
parsh --build-flags fbcode//mode/dev-nosan fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```
Reviewed By: angelayi
Differential Revision: D60206382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
Summary: Fixes T192448049. The module calls form an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design to make it work.
Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export"
Reviewed By: zhxchen17
Differential Revision: D60528900
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437
Approved by: https://github.com/Skylion007
Add functional support for torch.addmm with CK backend. See also #125453
# Implementation details
1. It turns out we can use the same template for addmm and matmul; essentially, matmul is addmm with an empty bias (see the sketch after this list)
2. The Python generator in CK was updated to generate the shared cpp template. The pip package can be installed via `pip install git+https://github.com/rocm/composable_kernel@add-addmm` and will be merged into the `develop` branch after this PR lands, to avoid breaking the current matmul
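As a quick illustration of that equivalence (purely numerical, independent of the CK backend):
```python
import torch

a = torch.randn(4, 8)
b = torch.randn(8, 16)
bias = torch.zeros(16)

# addmm(bias, a, b) == a @ b + bias, so with a zero bias it reduces to a plain matmul
torch.testing.assert_close(torch.addmm(bias, a, b), a @ b)
```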
# Testing
`pytest test/inductor/test_ck_backend.py -k addmm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130576
Approved by: https://github.com/chenyang78
Noticed a hang where the stuck thread was blocked on a cudaHostUnregister
call, probably due to an internal CUDA deadlock caused by something
else, but it was holding the GIL at the time and blocked other Python
threads.
As far as I can tell, the cudart APIs do not require the GIL to be held, nor are
they marked as thread-unsafe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132520
Approved by: https://github.com/LucasLLC, https://github.com/kirtiteja
Migrates usages of deprecated APIs in NumPy-2.0 per [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#numpy-2-0-migration-guide).
I did a grep on the old API usages (see list below); these were only referenced in test files under `test/torch_np/numpy_tests/**/*.py`.
Specifically, migrates the usages of the following APIs (a quick before/after sketch follows the list):
1. `np.sctypes` → Access dtypes explicitly instead
2. `np.float_` → `np.float64`
3. `np.complex_` → `np.complex128`
4. `np.longcomplex` → `np.clongdouble`
5. `np.unicode_` → `np.str_`
6. `np.product` → `np.prod`
7. `np.cumproduct` → `np.cumprod`
8. `np.alltrue` → `np.all`
9. `np.sometrue` → `np.any`
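A quick before/after sketch of the kind of mechanical change involved (illustrative values only):
```python
import numpy as np

x = np.arange(4)

# NumPy 1.x spellings (removed in 2.0)      NumPy 2.0 spellings
# np.product(x), np.alltrue(x >= 0)     ->  np.prod(x), np.all(x >= 0)
total = np.prod(x)
all_nonneg = np.all(x >= 0)
f = np.float64(1.5)     # instead of np.float_
c = np.complex128(1j)   # instead of np.complex_
```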
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131909
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
Fixes https://github.com/pytorch/pytorch/issues/132016.
Right now, if you run an op for which DTensor has no sharding prop rule, **and** that op accepts non-trivial pytrees of input tensors as arguments, DTensor can end up infinite-looping before it has the chance to error out about the missing sharding prop rule.
This PR doesn't fix the underlying problem, but adds rules for the culprit ops (the missing foreach ops)
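For illustration, this is the kind of list-of-tensors ("non-trivial pytree") argument structure the foreach ops take; with DTensor inputs and no sharding rule, dispatch could previously loop forever instead of raising:
```python
import torch

xs = [torch.ones(2), torch.ones(3), torch.ones(4)]
ys = [torch.full((2,), 2.0), torch.full((3,), 2.0), torch.full((4,), 2.0)]

# foreach ops take lists of tensors rather than individual tensors as arguments
out = torch._foreach_add(xs, ys)
print([t.tolist() for t in out])
```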
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132066
Approved by: https://github.com/wanchaol
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check
config.profiler_mark_wrapper_call.
Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.
```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s
OK
```
Fixes https://github.com/pytorch/pytorch/issues/131339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)""
The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.
The root cause: the logic for generating the mask in the Triton kernel needed an update after a recent refactoring of triton.py. This diff includes the fix for the root cause.
See D54134695 or #124969 for more details.
Test Plan:
Originally failed tests
f585704630
f585733786
Diff patched:
f586664028
f586663820
Differential Revision: D60458597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
Fixes the Inductor max-autotune mode failures of the below models:
- GPT2ForSequenceClassification
- PegasusForConditionalGeneration
- XGLMForCausalLM
- hf_GPT2
- tnt_s_patch16_224
```log
File "/pytorch/torch/_inductor/index_propagation.py", line 329, in statically_true
evaluated = self.shape_env._maybe_evaluate_static(
File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1499, in wrapper
return fn_cache(self, *args, **kwargs)
File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4539, in _maybe_evaluate_static
vr = var_ranges[k]
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
KeyError: m_start
```
The `_maybe_evaluate_static` call in `IndexPropagation` may fail. This PR adds a try/except following the approach in `torch/_inductor/sizevars.py`, by adding a common utility function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132128
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary:
feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. It's mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.
Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, it is worth upcasting FP16/BF16 inputs to FP32, doing the computation in FP32/TF32, and then downcasting the FP32 output back to FP16/BF16.
Differential Revision: D58710805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
Need to revert due to internal hangs: S437700
This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.
Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"
This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.
Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"
This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.
Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"
This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
#### Issue
ScriptObject was previously treated as a normal attribute by the converter. This PR lifts it to a constant and converts it directly to a GetAttr fx node. A ScriptObject can also trigger `CallMethod`, and this PR adds that support as well.
#### Test Plan
Add test case for ScriptObject.
`pytest test/export/test_converter.py -s -k test_convert_script_object`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130952
Approved by: https://github.com/angelayi
Before setting up the float8 numeric parity test, I have to set up a regular TP numeric parity test, preferably testing 10 iterations.
This PR sets a baseline for TP numerics; I can verify fp8 on top of it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543
Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
Some sympy Functions aren't supported by sympy_interp(); we can't turn them into FX nodes, so currently the runtime asserts CSE pass avoids CSE'ing any expression containing a sympy Function. https://github.com/pytorch/pytorch/pull/132325 started tracking unsupported functions, so we switch the check to that to be more precise. We also check for and skip unsupported functions when adding asserts; previously we only did the check for CSE, and not when adding new expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132457
Approved by: https://github.com/avikchaudhuri
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.
We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent that. This is to unblock internal compilation in some cases. The flag comes with docs explaining why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Differential Revision: D60701839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
python_code(verbose=True) (or print_readable()) generates a string with the code representing the fx graph, with extra annotations indicating the size or stride of the tensor. Currently, it only shows sizes/strides for FakeTensors provided in metadata. For subclass tensors like NestedTensor, the outer class (provided in the node metadata) will be a non-FakeTensor and the inner tensors will be fake. This PR expands the conditional to show sizes/strides for all tensors, not just FakeTensors.
Testing: I ran the test script below with `TORCH_LOGS=+dynamo` and found in the logs the graph shown below; we see that the input nested tensor has sizes and strides associated with it. Also, I stacked a diff on top of this one that forces the readable graph to be generated whenever PT2 is in use in tests, which should hopefully find any issues; https://github.com/pytorch/pytorch/pull/132195 shows no significant failures except for preexisting failures.
test script:
```python
import torch
def fn(x):
    return x.cos()

nt = torch.nested.nested_tensor_from_jagged(
    torch.randn(10, 10),
    torch.tensor([0, 1, 3, 6, 10]),
)
torch.compile(fn)(nt)
```
logs excerpt:
```
[0/0] [__graph_code] TRACED GRAPH
[0/0] [__graph_code] ===== __compiled_fn_1 =====
[0/0] [__graph_code] /data/users/dberard/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.M
[0/0] [__graph_code] def forward(self, L_x_: "f32[4, zf1, 10][10*zf1, 10, 1]cpu", zf1: "Sym(zf1)"):
[0/0] [__graph_code] l_x_ = L_x_
[0/0] [__graph_code]
[0/0] [__graph_code] # File: /data/users/dberard/scripts/nt_print_graph.py:4 in fn, code: return x.c
[0/0] [__graph_code] cos: "f32[4, zf1, 10][10*zf1, 10, 1]cpu" = l_x_.cos(); l_x_ = None
[0/0] [__graph_code] return (cos,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132192
Approved by: https://github.com/Chillee
This is already represented in trunk.yml so it seems a bit redundant to include this level of testing in pull.yml.
I've been observing a large spike in our usage of `g3.4xlarge` which seems to correspond to these builds in particular, so I'm removing them from `pull.yml` since they are already covered in `trunk.yml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132537
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
Summary:
- Moves logging functionality into the `torch/_export/db/logging.py` file.
- Adds a check in `_dynamo/eval_frame.py` for optional inputs and errors out with `UnsupportedError`.
- Changes the case name of `torch_sym_int` to `unsupported_operator`.
- Checks whether the case name is registered in exportdb; if so, we give a link to the case in exportdb.
- TODO: add test
Test Plan:
CI
Running the example in https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input gives the following error logging:
```
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] Parameter y is optional with a default value of tensor([[-0.1633, 1.2414, -0.1071],
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] [-0.1936, -0.9425, -0.0824]])
E0730 10:53:33.688000 4155538 torch/export/_trace.py:1043] See optional_input in exportdb for unsupported case. https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input
......
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/389acaeb40d57230/tutorials/pytorch/nntest/__torchtest__/torchtest#link-tree/torch/_dynamo/eval_frame.py", line 1091, in produce_matching
raise Unsupported(
torch._dynamo.exc.Unsupported: Tracing through optional input is not supported yet
```
It also logs an `export.error.classified` event in Scuba.
Reviewed By: zhxchen17
Differential Revision: D60427208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132420
Approved by: https://github.com/zhxchen17
This PR introduces a new sanity check for the public API tests in `.ci/pytorch/test.sh`.
* Validates two public API tests:
1. Ensures `test_correct_module_names` fails when a new file OR an existing file adds an invalid public API function (e.g. one whose `__module__` is unset).
2. Ensures `test_modules_can_be_imported` fails when a module underneath `torch/` cannot be imported.
* Runs this in CI just before the pre-existing FC / BC checks.
I've verified that re-introducing the bug that #131386 fixed causes the new check to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131390
Approved by: https://github.com/albanD
Summary:
#### Description
Add support for aten::append with a python function that returns a new list with the appended element. We then update the `fx_node` in the `name_to_node` mapping.
aten::append contributed by Jiashen Cao <jiashenc@meta.com>
Fix conversion for csr_ranker_test
```
model_name: csr_ranker_test_4.ptl
has_ts_model: True
has_sample_inputs: True
ops_maybe_missing_meta: set()
script_objects: set()
ts_can_run: True
ts_run_exception: None
can_convert: True
convert_exception: None
ep_result_correct: True
ep_run_exception: None
can_package: True
package_exception: None
sigmoid_can_run: False
sigmoid_run_exception: RuntimeError('not for symbolics')
sigmoid_result_correct: None
```
Test Plan:
test_aten_add_t
test_aten_append_t
test_aten_to_dtype_with_mutating_storage
buck2 run mode/opt sigmoid/inference/ts_migration:main -- --mode test_one --model_name csr_ranker_test
Differential Revision: D60635893
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132529
Approved by: https://github.com/jiashenC
Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]` to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner.
Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466
Approved by: https://github.com/zou3519
ghstack dependencies: #132356
Inserts send/recv ops where needed in a compute-only pipeline schedule.
Any F or B action will require a recv op for its input and a send op
for its output, except for at the ends of the pipeline.
To avoid hangs caused by mixed-up orderings of sends/recvs across ranks,
we pick one compute action at a time and insert both its send op (on
that rank's schedule), and the matching recv op for the recipient stage
(on the schedule for the rank for that stage).
TODO
Currently ignores a couple of edge cases
- ignores batching (which is an optimization)
- ignores cases where a stage sends to another stage on the same rank,
and should skip the send/recv and directly access memory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810
Adds fsdp unshard/reshard ops to a compute-only schedule.
Operates on one pp-rank's schedule at a time, since there is no
cross-pp-rank coordination needed for FSDP. (Unshard/Reshard is across
DP ranks within a PP group).
Uses a heuristic based on examining the next N stages to run compute
operations on this rank, evicting (resharding) and fetching (unsharding)
ahead of time to give unshard operations a chance to overlap with
compute and PP comms.
- this heuristic has not been validated and may not be optimal
Makes the assumption that it's fine to add the UNSHARD/RESHARD actions
to the schedule regardless of whether FSDP will actually be used.
- this way, users do not have to tell us at PP schedule creation time if
they plan to use FSDP or DDP
- it is trivial to implement UNSHARD/RESHARD as no-ops inside the
runtime, if FSDP is not detected on the stage module
TODO
- also add FSDP's reduce-scatter? or is it sufficient to leave this
handled by PipelineStage at 'last backward' time
- validate 'next N stages' heuristic and expose an API if needed
- add an e2e test
Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
Summary:
as title.
torch._higher_order_ops.auto_functionalize.auto_functionalized is a Python FQN which should NOT be used to talk to the backends; we should use the standard FQN torch.ops.higher_order.auto_functionalized instead.
Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_custom_op_auto_functionalize_pre_dispatch
Differential Revision: D60468759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132171
Approved by: https://github.com/SherlockNoMad
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm), and I'm told that there's interest in removing the Python dependency.
So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
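For context, a tiny sketch of the networkx max-flow/min-cut style API the partitioner relies on and that the C++ implementation mirrors (toy graph, illustrative only):
```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("source", "a", capacity=1.0)
G.add_edge("a", "sink", capacity=1.0)
G.add_edge("source", "sink", capacity=2.0)

# minimum_cut returns the cut value and the two node partitions
cut_value, (reachable, non_reachable) = nx.minimum_cut(G, "source", "sink")
print(cut_value)  # 3.0
```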
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
Summary:
It looks like there are several places in AotCodeCompiler that write files in a way that aren't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in the lock path name, but the path for the "consts_path" is computed separately. Therefore, I see things like this:
- The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z`
- The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock`
- The cpp path (which also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp`
- The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin`
So we have different test instances using different lock paths but touching the same consts_path, and therefore stomping on each other's consts_path. To fix, include the key in the consts_path.
Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it.
Differential Revision: D60552021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343
Approved by: https://github.com/desertfire
Summary:
We observed that many nodes introduced during the split-cat and batch-fusion pattern optimizations did not have example-value metadata, which causes problems in our follow-up pattern optimizations, so we add all the missing values.
We also fix bugs in some meta updates and a corner-case bug in the old pattern, which caused problems in the follow-up pattern optimization.
We delete the merge_stack_tahn_unbind_pass pattern, which was designed for the cmf model; it can be replaced by the more advanced pattern we added, so we remove it for easier maintenance.
Test Plan:
# unit test
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15481123762720165
Network: Up: 230KiB Down: 702KiB (reSessionID-756346bf-6da3-4fa0-8d03-1b4fd61e0a7a)
Jobs completed: 30. Time elapsed: 7:23.9s.
Cache hits: 20%. Commands: 5 (cached: 1, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0
```
buck2 test @mode/opt pytorch/diff_train_tests/ads/optimus:local_pt2_runner
```
Network: Up: 1.3GiB Down: 84MiB (reSessionID-ff135cdd-e42c-4ab5-8217-907ada465f01)
Jobs completed: 61. Time elapsed: 21:56.5s.
Cache hits: 0%. Commands: 39 (cached: 0, remote: 0, local: 39)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
# benchmark
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```
Counter({'pattern_matcher_nodes': 752, 'pattern_matcher_count': 732, 'normalization_pass': 328, 'normalization_aten_pass': 12, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1, 'fxgraph_cache_miss': 1})
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132297
Approved by: https://github.com/jackiexu1992
Summary:
Fixes T197371132.
Previously, we called copy.deepcopy to avoid mutating the original signature. However, this causes errors when the signature references a FakeScriptObject that in turn references a real torch.ScriptObject, failing with "The tensor has a non-zero number of elements, but its data is not allocated yet."
We therefore just change it to a shallow copy. This should be good enough for guarding the signature.
Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_ebc_non_strict_export"
Differential Revision: D60476839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132181
Approved by: https://github.com/BoyuanFeng
Define the `TORCH_ONNX_USE_EXPERIMENTAL_LOGIC` flag to allow enabling the new torch.onnx logic while hiding it during migration and testing. The actual logic migration will happen afterwards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132299
Approved by: https://github.com/titaiwangms
Enable exception chaining of the BackendCompilerFailed exception in call_user_compiler. This prevents the original exception and traceback, which are often the most useful for debugging, from being discarded.
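The mechanism is standard Python exception chaining; a minimal sketch of the difference (hypothetical exception type, not the actual dynamo code):
```python
class BackendCompilerFailed(Exception):  # stand-in for the real exception class
    pass

def compile_backend():
    raise RuntimeError("shape error in scatter op")

try:
    compile_backend()
except Exception as e:
    # `raise ... from e` keeps the original exception and traceback as __cause__,
    # so both tracebacks are printed instead of only the wrapper's.
    raise BackendCompilerFailed("backend='inductor' raised") from e
```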
Example output without the patch
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(]
> [Trace back from call_user_compiler to _inplace_generalized_scatter raise RuntimeError]
> torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
Example output with the patch
> Traceback (most recent call last):
> [Traceback from_inplace_generalized_scatter to raise error_type(message_evaluated)]
> RuntimeError: expand: attempting to expand a dimension of length 2!
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from call_user_compiler to _inplace_generalized_scatter raise RuntimeError]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e) with e]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131186
Approved by: https://github.com/jansel
Summary: This makes it so that stress tests in separate processes on the same machine don't clobber each other's directories. InductorTestCase will automatically make a fresh tmpdir for each unit test.
Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results
```
Now passes
Differential Revision: D60604811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432
Approved by: https://github.com/masnesral
Fixes #130087
This patch provides a built-in id() implementation for TensorVariable when id() is called on tensors like module parameters. Calling id() on intermediate tensors is not supported.
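A minimal sketch of the now-supported pattern (illustrative module; not taken from the test suite):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(4))

    def forward(self, x):
        # id() on a module parameter (a tensor with stable identity) is traceable;
        # id() on an intermediate tensor is still unsupported.
        if id(self.w) == id(self.w):
            return x + self.w
        return x - self.w

out = torch.compile(M())(torch.randn(4))
```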
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130100
Approved by: https://github.com/anijain2305
https://github.com/pytorch/pytorch/pull/130422 caused the test `test.inductor.test_aot_inductor.AOTInductorTestABICompatibleCuda.test_fp8_abi_compatible_cuda` to fail (unclear why it was not run in GitHub) with `torch/csrc/inductor/aoti_torch/c/shim.h:390:34: note: candidate function not viable: requires 9 arguments, but 6 were provided`. We suspect that the kernel produced by the lowering function, which is no longer a fallback choice, has a schema issue at codegen. Fp8 is not used through AOTI currently and it is difficult to revert the PR (BE week), so we'll skip the test temporarily while making the new lowering compatible with AOTI.
Testing: the failed test on internal diff is now skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132453
Approved by: https://github.com/henrylhtsang
Summary: Currently suggested fixes pick a map from symbols to user variables. However it is possible that many user variables point to the same symbol, and some may be preferred over others. Thus we dump this info as well.
Test Plan: updated test
Sample error with new format:
```
Could not guard on data-dependent expression u2 >= 0 (unhinted: u2 >= 0). (Size-like symbols: none)
<snip>
The following call raised this error:
File "test/export/test_export.py", line 1950, in forward
return r.view(items[0], items[2])
To fix the error, insert one of the following checks before this call:
1. torch._check(items[2] >= 0)
2. torch._check(items[2] < 0)
(These suggested fixes were derived by replacing `u2` with items[2] in u2 >= 0 and its negation.)
```
Differential Revision: D60574478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132393
Approved by: https://github.com/BoyuanFeng
Context:
We are planning to make a BC breaking change to `torch.load` by flipping the default for `weights_only` from `False` --> `True` in a future release. With `weights_only=True`, a custom unpickler is used that limits what can be loaded to state_dicts containing tensors (there is also a way for the user to allowlist specific things to be loaded). The goal of this is to attempt to prevent remote execution of arbitrary code when using `torch.load`.
To my understanding, in export, `torch.load` is used internally to load arbitrary objects, so we should set `weights_only=False` here to prevent the flip from breaking export.
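For context, a small sketch of the behavior difference being guarded against (illustrative file name):
```python
import torch

torch.save({"w": torch.randn(2, 2)}, "ckpt.pt")

# weights_only=True uses a restricted unpickler limited to tensors/state_dicts
# (plus explicitly allowlisted types); weights_only=False allows arbitrary objects,
# which is what export relies on internally.
safe_state = torch.load("ckpt.pt", weights_only=True)
full_state = torch.load("ckpt.pt", weights_only=False)
```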
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132348
Approved by: https://github.com/angelayi
Summary:
Skip the warning if the fake script object doesn't implement a fake method for:
1. `__obj_flatten__`: for the real script object only.
2. `__set_state__` and `__get_state__`: for serialization; we don't expect them to be used during tracing.
Test Plan: Existing tests.
Reviewed By: angelayi
Differential Revision: D60478460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132306
Approved by: https://github.com/angelayi
mvlgamma backward trips DEBUG=1 asserts when trying to construct an empty tensor with `layout=torch.jagged`. This happens due to passing `self.options()` to `arange()` in `mvlgamma_backward()`. Fix in this PR unconditionally constructs `arange()` with the strided layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132422
Approved by: https://github.com/albanD
# Motivation
This PR intends to support ABI=0 build for XPU backend.
# Additional Context
The major change is adding a compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). Why?
Because we use
- gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL; and
- icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code. Besides,
- `libsycl-preview.so` is an ABI-neutral library but `libsycl.so` is not.
This PR depends on https://github.com/pytorch/pytorch/pull/131643.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Summary:
AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if remote cache is turned on, it will cache miss.
Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here.
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled
passes
Differential Revision: D60588615
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409
Approved by: https://github.com/masnesral
Summary:
Occasionally we run into a partition that looks like this for Add:
```
SourcePartition(nodes=[_constant2, add_2], source=<built-in function add>, input_nodes=[x], output_nodes=[_constant2, add_2], params=[_constant2])
```
In this case we are adding a constant to an input and reusing the constant later down the line. This causes our constant to be an output in our SourcePartition. Then the assumption that:
```
add_node = add_partition.output_nodes[0]
```
will not necessarily hold. As a result, we must check that the output node is indeed a call_function node and not a constant.
Test Plan: buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_ops -- test_qs8_add_constant
Differential Revision: D60413221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132092
Approved by: https://github.com/jerryzh168
Python's set iteration order is non-deterministic. We recently ran into an internal failure that did not fail consistently.
See repro here: P1453035092.
Now, with these changes, it does consistently fail. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
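A small illustration of the underlying issue (hash randomization makes the iteration order of a set of strings vary across interpreter runs, while a dict keeps insertion order):
```python
# Run with different hash seeds and the set order can differ:
#   PYTHONHASHSEED=1 python -c "print(list({'relu', 'add', 'mul'}))"
#   PYTHONHASHSEED=2 python -c "print(list({'relu', 'add', 'mul'}))"

# dict.fromkeys gives a deterministic, insertion-ordered "ordered set"
ops = dict.fromkeys(["relu", "add", "mul"])
print(list(ops))  # always ['relu', 'add', 'mul']
```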
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
Summary:
In D60024830 I attempted to define these overloads, but gated the implementation on the wrong macros. Namely I used `__CUDACC__` instead of `__HIPCC__` (facepalm).
It might be worth merging this with the nvidia case via typedefs (e.g. `typedef __hip_bfloat16 __gpu_bfloat16` and `typedef __nv_bfloat16 __gpu_bfloat16`), but that seems like an entirely new paradigm for torch, so I'll punt that change to the future so we can focus on supporting `BFloat16(__hip_bfloat16)` here
Test Plan: CI
Differential Revision: D60362079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132234
Approved by: https://github.com/houseroad
In pdb, it's pretty common to print `FSDPParamGroup` and `FSDPParam`, so make sure they are human-readable.
Printing `FSDPParam` in pdb:
```
FSDPParam(fqn=layers.6._checkpoint_wrapped_module.attention.wq.weight, orig_size=torch.Size([128, 256]))
```
Printing `FSDPParamGroup` in pdb:
```
FSDPParamGroup(fqn=layers.6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132350
Approved by: https://github.com/awgu
Summary:
A bunch of issues around support for sympy functions like `TruncToInt` and `ToFloat` are uncovered by https://github.com/pytorch/pytorch/issues/131897. This PR addresses only one of them (as the title suggests). Another issue is deserialization, filed as a task: T197567691.
However, the most important issue is that adding runtime assertions is broken right now: specifically, sympy_interp with `PythonReferenceAnalysis` currently doesn't work because the implementations of some of these sympy functions in `PythonReferenceAnalysis` (or falling through to its base class) do not expect proxies. This means things like `math.trunc`, `math.floor`, `round`, etc. don't work, and this can be easily repro'd by using them inside `torch._check`. According to ezyang, these implementations need to point to new torch functions that can expect proxies (see how minimum and maximum are implemented, e.g.).
Test Plan: added test (original repro provided)
Differential Revision: D60540951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132325
Approved by: https://github.com/ezyang
Fixes#132196
Let's say we have:
- op(x, y) that mutates both x and y
- new_x, new_y = functional_op(x, y) is the functional variant
If we are presented with functional_op(x, x), we must not reinplace
this into op(x, x), because then it would be writing to the same Tensor.
Instead, it's OK to reinplace one of them and to clone the other:
```
>>> y = x.clone()
>>> op(x, y)
```
This also applies if we have views: functional_op(x, x[0])
should not reinplace into op(x, x[0]).
The fix is to avoid reinplacing an arg if a view of it already has been
reinplaced.
Test Plan:
- new and existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132238
Approved by: https://github.com/oulgen, https://github.com/eellison
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).
This is how the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
set crit max_depth min_samples_leaf correct wrong unsure total wrong_max_spdup wrong_gman_spdup max_spdup_default gman_spdup_default max_slowdown_default non_default_preds default_better
train entropy 5 0.01 2376 740 323 3439 1.855386 1.063236 11.352318 3.438279 1.022164 3116 2
test entropy 5 0.01 563 183 71 817 1.622222 1.060897 10.084181 3.507741 1.017039 746 2
```
While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.
I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback | heuristic | speedup |
|----------|-------------|------------:|------------:|--------:|
| 1 | 7 | 75.31 tok/s | 148.83 tok/s| 1.97 |
| 1 | 11 | 75.99 tok/s | 148.15 tok/s| 1.94 |
| 4 | 7 | 103.48 tok/s | 472.00 tok/s| 4.56 |
| 4 | 11 | 103.56 tok/s | 371.36 tok/s| 3.58 |
| 8 | 7 | 201.92 tok/s | 813.44 tok/s| 4.02 |
| 8 | 11 | 201.76 tok/s | 699.36 tok/s| 3.46 |
Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
Summary: ET sets the length limit of a string input variable to 8192 characters. However, the node process_group::init has more than 8192 characters for an Ads 128-rank job. This DIFF temporarily removes this limit, so ET can capture the complete information of the process group.
Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTrace
Reviewed By: sanrise
Differential Revision: D60341306
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132169
Approved by: https://github.com/sraikund16, https://github.com/sanrise
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:
- Move pad_mm-related AutoHeuristic files into a subdirectory.
- Introduce an interface, benchmark_runner.py, that can be subclassed to add new scripts that run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).
The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
Summary:
Implement a callback-based dynamic counter with pluggable backends.
The backend API and integration are similar to WaitCounter's. Note that this counter should only be used with C++ callbacks, since making it safe to use with GIL-requiring callbacks would be pretty challenging and may defeat the whole purpose of this counter (since the duration of the callback can no longer be guaranteed).
Test Plan: unit test
Differential Revision: D60464055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132166
Approved by: https://github.com/asiab4
This PR mostly refactors by putting code into utils files so that it can be shared between codecache.py and compile_fx.py. It then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch.
What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.
## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run
To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object.
## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.
Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
**Background:** NJT utilizes a `jagged_unary_pointwise()` fallback that historically has assumed blindly that the first arg is an NJT. This assumption breaks certain ops; for example `pow(scalar, Tensor)` has an NJT as the second arg.
This PR expands `jagged_unary_pointwise()` and the associated schema validation logic to handle an NJT in args other than the first position.
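A small sketch of the kind of call this enables (the NJT appears as the second argument):
```python
import torch

values = torch.randn(10, 3)
offsets = torch.tensor([0, 2, 5, 10])
nt = torch.nested.nested_tensor_from_jagged(values, offsets)

# pow(scalar, Tensor): the nested tensor is not the first argument here
out = torch.pow(2.0, nt)
```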
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131937
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898, #131704
# Motivation
This PR intends to enhance the codegen to allow generate codes for XPU backend.
XPU operators currently need to be registered by hand. Developers have no chance to take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort.
We utilize the backend_whitelist argument in `gen.py` to generate XPU needed headers and source codes.
# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.
We use the following commands to generate XPU operators
` python -m torchgen.gen --source-path path/to/yaml/of/xpu --install-dir build/xpu --per-operator-headers --static-dispatch-backend --backend-whitelist=XPU`
The diff lies at `backend-whitelist=XPU`. The backend-whitelist key is an existent argument in torchgen.
The inputs to `gen.py` are code templates and an operators yaml. We share the same templates as `aten`. A simplified yaml lives in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entries are added; the format is the same as the one in `aten`.
# Result
All operators headers are generated in `build/xpu/ATen/ops` independently, which would not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers in this folder.
# Verification
* In `third-party/torch-xpu-ops`, we migrate all supported kernels to structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
As observed while working on this fix (https://github.com/pytorch/pytorch/pull/130994), 128 threads per block seems quite low. This PR increases the default to improve performance, and also slightly refactors the code to replace the hard-coded 128 for better maintainability.
By increasing the default max threads per block from 128 to 256, I saw the "CUDA total" time for `aten::index_select` drop from 44.820ms to 33.608ms when profiling the embedding script below:
```
import torch
from torch import profiler

input = torch.randint(low=0, high=16032, size=[131072], device="cuda")
w = torch.randn([16032, 16384], device="cuda")
with profiler.profile(record_shapes=True) as prof:
    x = torch.nn.functional.embedding(input, w)
```
I tested with the default from 128 to 256, 512, 1024 on several different types of devices, and observed "CUDA total" time dropping even more and more latency improvement as the number increases. Below is one example of latency improvement ratio:
| max threads per block | latency improvement |
|-----------------------|---------------------|
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |
Using 512 as the new default max for non-mi300x to be conservative, which is 1.44x faster than using 128 with the above profiling script.
Using 1024 for mi300x is 1.61x faster than using 128 with the same profiling script, and using 512 is 1.57x faster.
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131713
Approved by: https://github.com/jeffdaily, https://github.com/syed-ahmed, https://github.com/malfet
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).
Test Plan: unit test
Reviewed By: asiab4
Differential Revision: D60463979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132167
Approved by: https://github.com/asiab4
get_plain_tensors() should result in a DFS of the leaves.
The error was that plain tensors (leaves) on the same level were returned before the subclasses' plain tensors, even if the subclasses come earlier in the "flatten" list.
Original issue from AO: https://github.com/pytorch/ao/issues/515
Test: TBD, need to make an asymmetric subclass with dense tensors and subclasses
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132096
Approved by: https://github.com/bdhirsh
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.
Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132172
Approved by: https://github.com/davidberard98
ghstack dependencies: #132170
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
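For illustration, here is a minimal sketch of the newly supported call (the construction below assumes the jagged layout with `ragged_idx == 1`; the shapes are arbitrary):
```
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 8), torch.randn(5, 8), torch.randn(3, 8)],
    layout=torch.jagged,
)  # shape (B, *, M) with dim 1 ragged

out = torch.softmax(nt, dim=1)  # reduce along the ragged (*) dimension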
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132170
Approved by: https://github.com/davidberard98
Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible.
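A minimal usage sketch of the new type (assuming it is exposed as `torch.nn.Buffer`): assigning a `Buffer` as a module attribute registers it exactly as `register_buffer` would.
```
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.running_mean = nn.Buffer(torch.zeros(4))               # persistent by default
        self.scratch = nn.Buffer(torch.zeros(4), persistent=False)  # excluded from state_dict

m = MyModule()
print([name for name, _ in m.named_buffers()])  # ['running_mean', 'scratch']
print("scratch" in m.state_dict())              # False: non-persistent
```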
Fixes #35735
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
Original issue: https://github.com/pytorch/pytorch/issues/114338
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:
1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad.
Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we take the training scenario (1).
With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under enable_grad()
Changes in the partitioner?
Inference and training graphs differed in their return container (list vs. tuple).
The changes in the partitioner unify this to always return a tuple.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).
Why was it reverted?
There was a regression of the hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
One of the compiled graphs contained outputs which are aliases of inputs that are nn.Parameter(requires_grad=True).
Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically `expand` for hf_Reformer) preserve requires_grad.
As a result we started compiling a training graph instead of an inference graph.
Fix for view ops:
If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
**Summary**
I created functions that reduce repeated code in the console and JSON APIs, which also improves their readability for future developers.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132070
Approved by: https://github.com/XilunWu
`register_sharding` is an experimental API that allows users to register sharding strategies for an operator when the tensor inputs and outputs are :class:`DTensor`s. It can be useful when: (1) there doesn't exist a default sharding strategy for ``op``, e.g. when `op` is a custom operator that is not supported by `DTensor`; (2) when users would like to overwrite default sharding strategies of existing operators.
Here's an example:
```
@register_sharding(aten._softmax.default)
def custom_softmax_sharding(x, dim, half_to_float):
    softmax_dim = dim if dim >= 0 else dim + x.ndim
    acceptable_shardings = []

    all_replicate = ([Replicate()], [Replicate(), None, None])
    acceptable_shardings.append(all_replicate)

    for sharding_dim in range(x.ndim):
        if sharding_dim != softmax_dim:
            all_sharded = (
                [Shard(sharding_dim)],
                [Shard(sharding_dim), None, None],
            )
            acceptable_shardings.append(all_sharded)

    return acceptable_shardings
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131108
Approved by: https://github.com/wanchaol
**Summary**
If a `global buffer` has been replaced by a `local buffer`, we add this `global buffer` to `removed_buffers` to avoid unnecessary allocation. However, a special case is when this `global buffer` can reuse a previous buffer. We didn't handle this case previously, which caused a functional failure in f151f25c0b/torch/_inductor/codegen/wrapper.py (L440)
In this PR, we resolve the issue by not adding this global buffer to `V.kernel.inplace_update_buffers` when the buffer has been marked as `removed`.
**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_local_buffer_with_line_reuse
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132018
Approved by: https://github.com/jgong5, https://github.com/peterbell10
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in https://github.com/pytorch/pytorch/pull/128683.
The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in https://github.com/pytorch/pytorch/pull/125204) and Triton kernel configurations.
The Triton kernel template is based on 3ad9031d02 (D56337896) by @choutim, without using SPLIT_K, and that of mm `torch/_inductor/kernel/mm.py`
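As a hedged sketch of exercising the new lowering (shapes, scales, and an fp8-capable GPU are assumptions), one can compile a call to `torch._scaled_mm` under max-autotune:
```
import torch

def scaled_mm(a, b, scale_a, scale_b):
    return torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)

compiled = torch.compile(scaled_mm, mode="max-autotune")

a = torch.randn(256, 256, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(256, 256, device="cuda").to(torch.float8_e4m3fn).t()  # mat2 must be column-major
scale_a = torch.tensor(1.0, device="cuda")  # tensor-wise scales
scale_b = torch.tensor(1.0, device="cuda")
out = compiled(a, b, scale_a, scale_b)
```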
## Testing:
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast:
- output code Evaluating m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row'
- P1477224245 - 2 kernels
- output code Evaluating m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row'
- P1477227340 - 2 kernels
- UT `python test/inductor/test_fp8.py -- TestFP8Lowering`
## Benchmarking
Eager/compiled tensor-wise/row-wise scaling for various shapes:
https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the “compiled” cases are slightly slower than “eager”. It’s because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.
Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes:
https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446
## Questions for reviewers:
- Should the type of the accumulator `ACC_TYPE` always be in float32? If not, where is this type set (output layout?)?
## Todo:
- Make the Triton template use the improved persistent kernel version (https://github.com/pytorch/FBGEMM/pull/2735 by @htyu)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130422
Approved by: https://github.com/ipiszy
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
* `test_forward()` - compares forward output to an unbind-based reference
* `test_backward()` - compares forward output and grads to an unbind-based reference
* `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
* `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
* `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
* `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing
TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
Summary:
There are some issues with dim order creation. T194410923 has a detailed illustration.
One reason is that the `is_contiguous` function may sometimes produce an ambiguous memory format result (some tensors might be both channels_last and contiguous at the same time), and dim order generation relies on that memory format result underneath as a shortcut.
To mitigate the issue, we make dim order take the shortcut if and only if the tensor belongs to a single memory format. Otherwise, we still recalculate it.
Test Plan: CI
Differential Revision: D60056793
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366
Approved by: https://github.com/ezyang
Try to unblock https://github.com/pytorch/pytorch/issues/131991
- `nn.init.orthogonal_` uses `tensor.new`, which is the legacy factory function. We change this to `tensor.new_empty` (empty is okay since it will be immediately followed by `.normal_()` to fill the tensor) so that it preserves `DTensor`-ness.
- `nn.init.orthogonal_` uses QR decomposition (`aten.linalg_qr.default`) and `torch.diag` (calling into `aten.diagonal_copy.default`). For simplicity, we use naive replicate strategies for now. `aten.diagonal_copy.default` could do something more sophisticated for sharded inputs, but I would rather defer that to later due to the complexity. For `orthogonal_` support specifically, since the result of the QR decomp will be replicated, the input to `aten.diagonal_copy.default` will be replicated.
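A hedged usage sketch of the now-working call (the mesh setup and running under `torchrun` with 2 ranks are assumptions; the imports reflect the `torch.distributed._tensor` namespace used elsewhere in this log):
```
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate

mesh = init_device_mesh("cuda", (2,))
w = distribute_tensor(torch.empty(8, 8), mesh, [Replicate()])
nn.init.orthogonal_(w)  # stays a DTensor; QR/diagonal fall back to replicate strategies
```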
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132104
Approved by: https://github.com/albanD, https://github.com/wanchaol
This was causing some terrible error messages, e.g.:
```
# printing directly: cudaError.???
# casting to int first: 712
Traceback (most recent call last):
File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 15, in <module>
main()
File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 11, in main
_create_cpu_state_dict(sd, share_memory=True, pin_memory=True)
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 436, in _create_cpu_state_dict
ret = _iterate_state_dict(
^^^^^^^^^^^^^^^^^^^^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 143, in _iterate_state_dict
ret = {
^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 144, in <dictcomp>
key: _iterate_state_dict(
^^^^^^^^^^^^^^^^^^^^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 125, in _iterate_state_dict
ret = tensor_func(iter_object, pg, device, companion_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 428, in tensor_func
succ == 0
AssertionError: Pinning shared memory failed with error-code: cudaError.???
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132089
Approved by: https://github.com/Skylion007
Summary: Currently, running explain with TORCH_LOGS enabled will cause duplicate logging because explain uses the exact same code path for conversion. This PR disables logging when running explain, and moves all logging to convert() to prevent logging from __init__ when we are just using explain.
Test Plan: Manual testing with attached outputs.
Reviewed By: SherlockNoMad, angelayi
Differential Revision: D60199007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132082
Approved by: https://github.com/ydwu4
I didn't test this path when creating the orchestrator. This PR fixes
that path to work in the capture_triton path. The problem is that we are
handling a value that is an int (in the capture_triton path) and a
ConstantVariable (in the Dynamo triton path) so we abstract that out in
the orchestrator.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132143
Approved by: https://github.com/oulgen
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.
This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.
Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
Summary:
The previous logic adds skipped files when the file is imported, which happens at a very early stage. However, we could set skip_torchrec at a later stage (e.g., in APS, we set it during trainer execution). In that case, the skip logic would still take effect since the skipped files had already been added.
So in this diff, we revise the logic so that it adapts to changes of skip_torchrec at later stages.
Test Plan:
Tested on APS models:
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher_live -- mode=local_ig_fm_uhm_mini model_name=ig_fm_one_sparse_benchmark features=ig_fm_one_sparse_benchmark model=ig_fm_one_sparse_benchmark training.pipeline_type=pt2
commit: 2fb485d9e
torchrec related paths were not skipped.
Differential Revision: D59779153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130783
Approved by: https://github.com/yanboliang
There are some substantive changes. Instead of recording the *next* instruction in the speculation log, I record the *current* instruction. I think this is more intuitive, we always call speculation at the beginning of executing an instruction, so logically, the entry is associated with the current instruction. (Note that self.instruction_pointer is next instruction, as conventionally we increment IP before calling speculate).
The cosmetic change is to also pass in the Instruction corresponding to the IP and print it, and beef up the error message, including notes about the previous instruction that was run before it failed (this is typically the critical instruction).
At time of submission, this test case triggered the error:
```
diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py
index 5ade17856e1..60ef89be346 100644
--- a/test/distributed/test_dynamo_distributed.py
+++ b/test/distributed/test_dynamo_distributed.py
@@ -844,6 +844,39 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
for r in res[1:]:
self.assertEqual(res[0], r)
+ @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
+ @config.patch(enable_compiler_collectives=True)
+ def test_compiler_collectives_automatic_dynamic_speculation_divergence(self):
+ with _dynamo_dist_per_rank_init(self.rank, self.world_size):
+ torch._dynamo.utils.clear_compilation_metrics()
+
+ # TODO: This should be possible to do inside the function, but
+ device = f"cuda:{self.rank}"
+
+ @torch.compile()
+ def f(x, y):
+ zx = x.shape
+ zy = y.shape
+ return x.sum() + y.sum()
+
+ if self.rank == 0:
+ dataloader = [4, 4]
+ else:
+ dataloader = [3, 4]
+
+ for data in dataloader:
+ f(
+ torch.randn(data, device=self.rank),
+ torch.randn(data, device=self.rank),
+ )
+
+ metrics = torch._dynamo.utils.get_compilation_metrics()
+ # Number of compiles same on all nodes
+ res = [None] * self.world_size
+ torch.distributed.all_gather_object(res, len(metrics))
+ for r in res[1:]:
+ self.assertEqual(res[0], r)
+
@requires_nccl()
```
although I plan to fix this soon.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131982
Approved by: https://github.com/anijain2305, https://github.com/mlazos, https://github.com/jansel
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
In `_create_chunk_sharded_tensor`, `_get_remote_device_str` is used. By default it uses the node count to determine the device instance. For HPU, we need to use the current device to get the device instance.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132120
Approved by: https://github.com/awgu
Summary:
There are two kinds of exceptions:
Case #1:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 140315748992000 to 140315748993536. input stack trace: File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1826, in forward
return self.static_tensor + x + self.goo(x)
File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1816, in forward
return self.linear(x)
input name: primals_3. data pointer changed from 140315748990976 to 140315748993024. input stack trace: File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
self.static_tensor.add_(torch.ones((2, 2), device="cuda"))
```
Case #2:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 139852509086720 to 139852509088256. input stack trace: None
input name: primals_3. data pointer changed from 139852509085696 to 139852509087744. input stack trace: File "/dev/shm/uid-30083/f61ee184-seed-nspid4026560782_cgpid769179-ns-4026560865/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
self.static_tensor.add_(torch.ones((2, 2), device="cuda"))
```
The current implementation only covered case #2.
Test Plan: https://www.internalfb.com/intern/testinfra/testrun/15481123762274476
Differential Revision: D60340212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132043
Approved by: https://github.com/BoyuanFeng
**Summary**
Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](096dc444ce/torch/_inductor/codegen/common.py (L1844))), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction:
- We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`.
- To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary: This code was overly complex and is confusing some guards - basically if a result cached tensor isn't a view there's no reason to be messing with its storage.
Test Plan: unit tests pass
Differential Revision: D60387821
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132050
Approved by: https://github.com/oulgen
These OSS changes are part of a larger MTIA diff. The OSS part is a simple refactor that makes it easier to query max block sizes by the prefix of the grid dimension, e.g. `"X"`, as opposed to having to use separate functions for `get_xmax()`, `get_ymax()`, etc.
Differential Revision: D60195669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131730
Approved by: https://github.com/eellison
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
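A hedged usage sketch (the module and inputs here are illustrative, not from the test below):
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

ep = torch.export.export(M(), (torch.randn(2, 3),))
unflattened = torch.export.unflatten(ep)
unflattened.print_readable()
```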
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam
        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul); mul = None
        bar: "f32[2, 3]" = self.bar(foo); foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul); mul = None
            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer
            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
            return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```
We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this in practice as much; it could be done as a follow-up. Note: this only works with the CUDA backend right now.
This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.
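A hedged sketch of enabling the mode (the config name below is an assumption, not quoted from this PR text):
```
import torch
import torch._inductor.config as inductor_config

# Assumed flag name for the eager-numerics emulation mode described above.
inductor_config.emulate_precision_casts = True

@torch.compile
def f(x, y):
    return (x + y).relu() * y

f(torch.randn(8, device="cuda", dtype=torch.bfloat16),
  torch.randn(8, device="cuda", dtype=torch.bfloat16))
```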
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.
Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals" is
present in a PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
------
As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.
Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361
```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000
@classmethod
def eval(cls, base, divisor):
# python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
# Assert triggered by inequality solver
# assert base.is_integer, base
# assert divisor.is_integer, divisor
# We don't provide the same error message as in Python because SymPy
# makes it difficult to check the types.
if divisor.is_zero:
raise ZeroDivisionError("division by zero")
if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
int_oo,
-int_oo,
sympy.oo,
-sympy.oo,
):
return sympy.nan
if base is sympy.nan or divisor is sympy.nan:
return sympy.nan
if base.is_zero:
return sympy.S.Zero
if base.is_integer and divisor == 1:
return base
if base.is_integer and divisor == -1:
return sympy.Mul(base, -1)
if (
isinstance(base, sympy.Number)
and isinstance(divisor, sympy.Number)
and (
base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
)
):
r = float(base) / float(divisor)
if r == math.inf:
return int_oo
elif r == -math.inf:
return -int_oo
elif math.isnan(r):
return sympy.nan
else:
return sympy.Integer(math.floor(r))
if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
return sympy.Integer(int(base) // int(divisor))
if isinstance(base, FloorDiv):
return FloorDiv(base.args[0], base.args[1] * divisor)
# Expands (x + y) // b into x // b + y // b.
# This only works if floor is an identity, i.e. x / b is an integer.
for term in sympy.Add.make_args(base):
quotient = term / divisor
if quotient.is_integer and isinstance(divisor, sympy.Integer):
# NB: this is correct even if the divisor is not an integer, but it
# creates rational expressions that cause problems with dynamic
# shapes.
return FloorDiv(base - term, divisor) + quotient
try:
gcd = sympy.gcd(base, divisor)
if gcd != 1:
> return FloorDiv(
sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
)
base = -1.00000000000000
cls = FloorDiv
divisor = -1.00000000000000
gcd = 1.00000000000000
quotient = 1.00000000000000
term = -1.00000000000000
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}
@wraps(func)
def wrapper(*args, **kwargs):
try:
> retval = cfunc(*args, **kwargs)
E RecursionError: maximum recursion depth exceeded in comparison
E
E To execute this test, run the following from the base repo dir:
E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
args = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func = <function Function.__new__ at 0x7fc530317280>
kwargs = {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_ops_gradients.py` to make it easier to land.
Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_module.py` to make it easier to land.
Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
torch.library.custom_op (which makes it mandatory because it's easy to
miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema
This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
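A minimal sketch of the updated call (the example function is illustrative):
```
import torch
from torch.library import infer_schema

def mysin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

# mutates_args is now keyword-only and mandatory; op_name is keyword-only.
print(infer_schema(mysin, mutates_args=()))  # e.g. "(Tensor x) -> Tensor"
```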
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
ghstack dependencies: #131777
On Windows, _triton.py creates a confusing error (`RuntimeError: Should never be installed`), as triton is not supported on Windows. This is not caught by the current PyTorch exception handling. This pull request adds new exception handling for the runtime error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132006
Approved by: https://github.com/oulgen
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_torch.py` to make it easier to land.
Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change in individual test files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
Fix the compilation error:
```cpp
/tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16*’ {aka ‘const c10::BFloat16*’}
401 | cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1);
| ^~~~~~
| |
| at::Tensor
```
The generated code after the fix will be:
```cpp
cpp_fused_div_mm_0((bfloat16*)(arg2_1.data_ptr()), (bfloat16*)(constant2.data_ptr()), (bfloat16*)(_frozen_param1.data_ptr()), (bfloat16*)(buf1.data_ptr()));
```
Multiple changes are required for ABI compatible mode. Separate it into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557
Approved by: https://github.com/leslie-fang-intel
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
# Motivation
Structured codegen is beneficial for decoupling tensor meta setting from kernel implementation. At present, XPU operators need to handle tensor metas in a hand-written way.
We plan to leverage the codegen system to auto-generate structured operators. This PR facilitates `DispatchStub` support for Intel GPUs. Based on that, XPU operators will be able to register kernel functors to operator stubs.
This is a prerequisite of PR #130082, where we will modify the codegen system to generate the XPU-needed source files and headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
Previously, using `_MaskPartial` with multiple embeddings had the following issues:
1. Suppose an `nn.Embedding` has shape `[vocab_size, emb_size]`. When there are more than one embeddings, sharing the same `vocab_size` but with different `emb_size`s. Then they would not share `OpStrategy` since each, when involved in computation, would have different `OpSchema`; however, there would be cache hit for redistribute (specifically `_gen_transform_infos` in `torch/distributed/_tensor/_redistribute.py` when doing `Replicate` -> `_MaskPartial`) as the `_MaskPartial` only has `vocab_size` as `logical_dim_size` but not `emb_size` as attribute. This cache hit is undesirable and would cause trouble when doing all-reduce/reduce-scatter on the new `_MaskPartial` in a separate `OpStrategy`. The error was reported in #130725. In this PR, we introduce `offset_shape` to represent the embedding's full shape to avoid cache hit from embeddings of different shapes.
2. The second issue is when we have two `nn.Embedding`s `emb1` and `emb2` with the same shape. There will be cache hit not only in `_gen_transform_infos`, but also in `OpStrategy` generation. Previously, if we sequentially do `Replicate` -> `_MaskPartial` for both `emb1` `emb2` and then sequentially do reduction on the `_MaskPartial` of `emb1`, it would destroy the `MaskBuffer` and `emb2` would hit error. This PR adds a `refcount` for the `MaskBuffer` so that it can be properly shared by multiple `nn.Embedding`s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131264
Approved by: https://github.com/wanchaol
We're currently under-counting mutations from ExternKernel since they use `NoneLayout` which doesn't have an associated shape and dtype. Instead, we can get that information from the buffer being mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131645
Approved by: https://github.com/jansel
Summary:
The `test_model_modified_weights` in `test_aot_inductor.py` has been failing internally for a while. The behavior leading to the test failure was that, after updating the eager model's weights and recompiling the (CPU) model with AOTI, the output of the model was identical to the one before the weights were updated.
The root cause is here in Python:
8927fc209f/test/inductor/test_aot_inductor_utils.py (L69-L71)
which, in turn, instantiates the `Runner` object in C++ relying on `dlopen` for loading the *.so. The problem is that a repeated `dlopen` call does not reload the library from the same path unless `dlclose` is called in between the two `dlopen` calls. There is a `dlclose` in the `Runner`'s destructor, but it's not called, likely due to the way the loaded `runner` gets closed over in Python:
8927fc209f/test/inductor/test_aot_inductor_utils.py (L83-L94)
Here we add copying the *.so file to a unique temporary path right before loading it into a `runner` to avoid the `dlopen` staleness described above. This fixes the `test_model_modified_weights` and, hopefully, will help avoiding similar errors in the future tests.
Test Plan: Tested internally.
Differential Revision: D60348165
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131994
Approved by: https://github.com/chenyang78
When a collective can be hidden through either simple overlapping or micro-pipeline TP, we prefer simple overlapping to avoid the overhead associated with decomposition. If `reorder_for_compute_comm_overlap` is enabled, we identify collectives that can be hidden through simple overlapping and exclude them from micro-pipeline TP candidates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131410
Approved by: https://github.com/weifengpy
This PR enables the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve overlap. Note that the overlap is not maximally optimized yet and the follow-up work will be done in subsequent PRs.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)
This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predicable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").
The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.
Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518
Approved by: https://github.com/davidberard98
## Motivation
This refactor aligns our testing methodology with the Flash Attention upstream repository while addressing several key issues:
1. **Standardized comparison**: We now compare fused kernels against float64 references, using the maximum of a calculated tolerance (based on same-precision math implementation) or standard float32 `atol`.
2. **Reduced redundancy**: Utilizing the same tensors for both same-precision math and fused kernel runs eliminates duplication.
3. **Improved maintainability**: The new approach simplifies tolerance adjustments across all affected tests.
4. **Consistency**: Standardizing tensor comparisons ensures a more uniform and reliable testing suite.
These changes collectively simplify our testing code, improve its maintainability, and provide a more robust framework for validating our attention mechanisms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131743
Approved by: https://github.com/jainapurva, https://github.com/jbschlosser
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, because I can no longer debug it without access to the Meta-internal environment.
3. Add `TODO` comments so that a Meta employee can help continue this work.
4. Due to item 3, only the `fb_code` use of `deprecated_cpp_compile_command` remains to be fixed, so let's remove `validate_new_cpp_commands`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
This synchronizes lf-canary-scale-config and lf-scale-config with the ones in test-infra.
This really needs some automatic validation to prevent it from drifting out of sync over and over again (coming soon...).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131955
Approved by: https://github.com/malfet
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
Summary: CPU CI nodes failed to find valid VecISA because importing torch under the default pytorch directory will fail with the following msg, so switch cwd to a tmp directory.
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
from torch.torch_version import __version__ as __version__
File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
Closes #129507
This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the
mask from that which saves an additional reduction in the sort
kernel's inner loop.
In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
We automatically generate FakeTensor support for them (the FakeTensor
kernel for a triton kernel is "return None"). The same thing should
apply to the meta kernel.
Tests:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131896
Approved by: https://github.com/oulgen
Previously, FlopCounterMode would ignore any custom ops registered
through `register_flop_formula`. The problem was:
- register_flop_formula(target) requires target to be an OpOverloadPacket.
- register_flop_formula used register_decomposition to populate its registry
- register_decomposition decomposes the OpOverloadPacket into OpOverload before
putting it into the registry
- FlopCounterMode ignores OpOverloads in its registry (it assumes the
registry is a dictionary mapping OpOverloadPacket to flop formula).
register_decomposition is too heavy of a hammer, plus this isn't a
decomposition, so I changed the registration mechanism.
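A hedged sketch of the intended usage after this change (the custom op and its flop formula are illustrative):
```
import math
import torch
from torch.utils.flop_counter import FlopCounterMode, register_flop_formula

@torch.library.custom_op("mylib::foo", mutates_args=())
def foo(x: torch.Tensor) -> torch.Tensor:
    return x.clone()

@register_flop_formula(torch.ops.mylib.foo)  # an OpOverloadPacket, as required
def foo_flop(x_shape, *args, out_shape=None, **kwargs) -> int:
    return math.prod(x_shape)  # assume one flop per input element

with FlopCounterMode(display=False) as mode:
    foo(torch.randn(4, 4))
print(mode.get_total_flops())  # 16
```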
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131777
Approved by: https://github.com/Chillee
Implemented by extending `collections.abc.MutableSet` and backing it with a dictionary, which is ordered. From collections.abc.MutableSet:
```
A mutable set is a finite, iterable container.
This class provides concrete generic implementations of all
methods except for __contains__, __iter__, __len__,
add(), and discard().
```
In addition to implementing those methods I also had to define some methods of python's set which were not implemented in MutableSet.
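For illustration, a minimal sketch of the approach (class and method names here are illustrative, not the exact ones added in this PR):
```
from collections.abc import MutableSet

class OrderedSet(MutableSet):
    def __init__(self, iterable=()):
        self._dict = dict.fromkeys(iterable)  # dicts preserve insertion order

    def __contains__(self, item):
        return item in self._dict

    def __iter__(self):
        return iter(self._dict)

    def __len__(self):
        return len(self._dict)

    def add(self, item):
        self._dict[item] = None

    def discard(self, item):
        self._dict.pop(item, None)

s = OrderedSet([3, 1, 2])
s.add(1)
print(list(s))  # [3, 1, 2] -- insertion order preserved
```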
I reused the tests from Python's standard library. A few tests didn't pass because of edge-case behavior that is not necessary to reimplement:
- support self-referencing repr
- erroring when an member's `__eq__` function would modify the set itself
- MutableSet supports Iterables as inputs, but not sequences (pretty rare..)
- Some specifics of exact equivalent type errors being thrown
- [The protocol for automatic conversion to immutable](https://docs.python.org/2/library/sets.html#protocol-for-automatic-conversion-to-immutable)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130003
Approved by: https://github.com/aorenste
Reland https://github.com/pytorch/pytorch/pull/126704
#### Fixes the issue where the type of `nn.Module._state_dict_hooks` was changed in that PR, which was problematic:
Instead of using a `Tuple[Callable, bool]` to track whether the private `_register_state_dict_hook` or the public `register_state_dict_post_hook` API was used to register the hook (and toggling the behavior accordingly), I set an attribute on the Callable in the private API, which is never cleaned up.
If a callable previously registered using the private API is registered via the public API, a RuntimeError will be raised.
#### Copied from previous PR description
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437
- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
- Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`)
~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
- For the issue raised by https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantic of returning a new state_dict only being respected for the root module (private hook only):
- Document this for private `_register_state_dict_hook`
- Remove this for the public `register_state_dict_post_hook`
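A small usage sketch of the new public API, assuming the post-hook keeps the `(module, state_dict, prefix, local_metadata)` signature of the private hook (the buffer name here is just an example):
```python
import torch
from torch import nn

def drop_scratch(module, state_dict, prefix, local_metadata):
    # Mutate the state_dict in place; e.g. drop an illustrative buffer entry.
    state_dict.pop(prefix + "scratch", None)

m = nn.Linear(2, 2)
m.register_buffer("scratch", torch.zeros(1))
m.register_state_dict_post_hook(drop_scratch)
print(m.state_dict().keys())  # 'scratch' no longer appears
```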
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131690
Approved by: https://github.com/albanD
Implements the donated buffer feature and adds unit tests. A donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), or bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store them in `ViewAndMutationMetadata`, so that they can be accessed in inductor.
Fixes #129496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580
Approved by: https://github.com/bdhirsh
BE task T195600898 (internal).
The 3 tests
```
test_non_contiguous_input_mm
test_non_contiguous_input_bmm
test_non_contiguous_input_addmm
```
had the following error in TestX:
```
self.assertTrue(torch.allclose(ref, act, atol=1e-2, rtol=1e-2))
AssertionError: False is not true
```
The tolerance comparing eager and compiled results is too small, perhaps because of a Triton update that changed numerics:
```
Mismatched elements: 25 / 38597376 (0.0%)
Greatest absolute difference: 0.015625 at index (3771, 509) (up to 0.01 allowed)
Greatest relative difference: 9.375 at index (13687, 48) (up to 0.01 allowed)
```
Change the absolute tolerance from 0.01 to 0.02. Also switch to `torch.testing.assert_close`, which prints out the greatest absolute/relative difference (as shown above) when the assertion fails.
`test_non_contiguous_input_mm_plus_mm` has a different problem; there we just switch to `torch.testing.assert_close` to be uniform with the other tests.
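For illustration, the relaxed comparison looks roughly like this (the values here are made up):
```python
import torch

ref = torch.zeros(4)
act = ref + 0.015  # within the new 0.02 absolute tolerance
# assert_close reports the greatest absolute/relative mismatch on failure,
# unlike a bare torch.allclose + assertTrue check.
torch.testing.assert_close(act, ref, atol=2e-2, rtol=2e-2)
```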
Test commands:
```
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_mm
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_addmm
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_bmm
```
Internal stress tests pass now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131822
Approved by: https://github.com/shunting314
Summary:
Fixes https://github.com/pytorch/pytorch/issues/130379.
The original error is that the verifier finds the placeholder nodes' meta["val"] missing in the subgraph of the WrapSetGradEnabled hop.
In this PR, we fix it by re-ordering replace_set_grad_with_hop_pass to run after the lift_constant_tensor pass, because only after lifting constants do all the constant attrs have meta["val"].
Test Plan: buck2 test test:test_export -- -r "test_setgrad_lifted_tensor"
Differential Revision: D60244935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131787
Approved by: https://github.com/yushangdi
This PR enables AutoHeuristic for kernel choice selection, where the feedback cannot be provided immediately when AutoHeuristic is called, but only after autotuning has happened. The steps are the following:
1. When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py.
2. After autotuning in select_algorithm.py has happened, and there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores them.
I enabled AutoHeuristic for mixed_mm to have an example to test it on. We probably want to add more context, and also add an augment_context function; I will add support for this in another PR.
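A generic sketch of this deferred-feedback pattern (names and data are hypothetical, not the actual inductor code):
```python
# The heuristic registers a callback keyed by the decision it was asked about;
# the autotuner invokes that callback once real timings are available.
_registry = {}

def register_feedback(key, callback):
    _registry[key] = callback

def report_autotune_results(key, timings):
    callback = _registry.pop(key, None)
    if callback is not None:
        callback(timings)

register_feedback("mixed_mm/shape_123", lambda t: print("collected", t))
report_autotune_results("mixed_mm/shape_123", {"choice_a": 0.12, "choice_b": 0.09})
```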
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610
Approved by: https://github.com/eellison
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.
To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.
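A simplified FX sketch of the idea (hypothetical helper; the real pass also has to handle the BN op's tuple output and its getitem users):
```python
import torch.fx as fx

def erase_folded_bn(graph: fx.Graph, bn_node: fx.Node, conv_node: fx.Node) -> None:
    # Redirect users of the BN node to the folded conv output, then erase the
    # BN node explicitly instead of waiting for DCE (which skips ops with
    # potential side effects when the model is still in train mode).
    bn_node.replace_all_uses_with(conv_node)
    graph.erase_node(bn_node)
    graph.lint()
```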
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node
Reviewers: jerryzh168, yushangdi
Subscribers: jerryzh168, yushangdi, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi
Summary: This is experimental work. Depending on the performance stability and benchmark coverage on A10g, we may consider using A10g for manually-triggered per-PR performance comparisons instead of exhausting expensive A100 instances.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131816
Approved by: https://github.com/huydhn
Summary: Pretty straightforward. ROCm 6.2.0 changed the `__hip_bfloat16` API (see [this PR](481912a1fd)), so we gate the impl on the `__BF16_HOST_DEVICE__` macro to support both older and newer versions of ROCm.
Test Plan: CI
Differential Revision: D60024830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131359
Approved by: https://github.com/houseroad
https://github.com/pytorch/pytorch/issues/105290
The problem in the original flow is that:
(1) the user calls `torch.mul(complex_tensor, complex_scalar)`
(2) python arg parser wraps the complex scalar in a `scalar_tensor`, and dispatches to `aten.mul.Tensor(self, scalar_other)`
(3) autograd sees `aten.mul.Tensor`, calls `scalar_other.conj()` [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/FunctionsManual.cpp#L597)
(4) during proxy tensor tracing, this gets dispatched to `aten._conj(scalar_tensor)`
(5) when we hit __torch_dispatch__, the scalar_tensor is converted back into a plain python scalar
(6) we error during tracing, because in `FunctionalTensorMode.__torch_dispatch__` we try to redispatch on `aten._conj.default(plain_python_scalar)`, and this overload does not accept python scalars.
My attempted fix in this PR is to update `TensorBase::conj()` to check if the current tensor is a scalar tensor (wrapped number), and if so, manually:
(1) convert the scalar tensor back into a scalar
(2) call scalar.conj() directly
(3) convert the result back into a wrapped tensor
This avoids having to go through python entirely in the tracing case (which is fine, because these scalar tensors are constants that we can const-prop during tracing anyway).
Notably, I did **not** add e.g. a new `aten._conj.Scalar` overload. This would not actually fix the problem, since the bug is that we call `aten._conj.default(python_scalar)` directly. We would also need to muck with all `__torch_dispatch__` call sites to know to convert python scalars back into tensors directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131482
Approved by: https://github.com/zou3519, https://github.com/ezyang
ghstack dependencies: #131403
Fixes https://github.com/pytorch/pytorch/issues/121353
Our handling of `.data` in dynamo today basically just converts `y = x.data` into `y = x.detach()`. The semantics of these two ops are not quite the same (see the sketch after the two points below), because:
(1) any future mutations on `x.data` will be fully ignored by autograd
(2) any mutations on `x.detach()` will bump x's version counter
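A minimal eager-mode sketch of this difference:
```python
import torch

x = torch.randn(3, requires_grad=True)
v = x._version
x.detach().add_(1)      # mutation through .detach() bumps x's version counter
assert x._version == v + 1

y = torch.randn(3, requires_grad=True)
v = y._version
y.data.add_(1)          # mutation through .data is invisible to autograd,
assert y._version == v  # so the version counter is not bumped
```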
The linked model does a `.data` mutation that is hidden from autograd in eager, but ends up erroring during AOTDispatcher tracing.
I updated dynamo's handling so that:
(1) when dynamo sees a call to `getattr(tensor, "data")` and calls `.detach()` we set a flag on the returned `TensorVariable` indicating it came from `.data`
(2) on any tensor method that we call with an input `TensorVariable` with this flag turned on, we proxy autograd's `preserve_version_counter` logic into the graph, to properly reset the VC after the op is run.
One thing to note is that I don't actually do this on every op that we pass the tensor to: I only do it for tensor methods that appear to be mutations (by checking for a trailing underscore). My thought was that:
(1) I didn't want to do this for **every** op that you pass `y` into, since that will e.g. triple the number of nodes in the graph, and could cause compile time regressions if you use .data
(2) this situation is pretty rare in general, and I'm hoping that "tensor method mutations" cover most reasonable mutation cases. If we manage to miss a case, you will get a loud error during tracing anyway, so there is not a safety issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131403
Approved by: https://github.com/anijain2305, https://github.com/zou3519
Looks like in the halide codegen refactor, the range tree codegen was
split out from initialize_range_tree into its own function, but
triton_split_scan.py wasn't updated to reflect this change.
The result was that the codegen got invoked twice, which is benign but makes
the kernel harder to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131669
Approved by: https://github.com/Chillee
Fixes https://github.com/pytorch/pytorch/issues/130750.
Repro of lazy/eager `map` discrepancy without `islice`:
```python
def fn(a, b):
    y = 1

    def f(x):
        nonlocal y
        y += 1
        return x

    l = list(zip([a, b], map(f, [1, 2, 3, 4])))
    return a + y
```
The major change is that we implement `MapVariable` and `ZipVariable` based on `IteratorVariable`. Before, `map` and `zip` were being traced by immediately unpacking the result as a `TupleVariable`, which is wrong in cases such as the example above.
`MapVariable`s are not allowed to be unpacked while `ZipVariable`s can only be unpacked if all of its iterables can also be unpacked.
We also add new `[has_]force_unpack_var_sequence` methods to `VariableTracker` for the case where it is safe to unpack the entire sequence lazily, e.g., when building a list from a map (i.e. `list(map(f, ...))`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131413
Approved by: https://github.com/anijain2305
Brian debugged the difference in output type between the inference and train graphs.
The partitioner sometimes returned a list as the output type.
After this PR it will always return a tuple.
New graphs may land in tests between the time this PR's CI jobs finish and the time it lands.
This could be easily fixed with a fast-forward fix using:
```
EXPECTTEST_ACCEPT=1 python test/test.py
```
Adding ciflows/periodic to minimize this probability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
Summary: Found this "cannot find -ltorch: No such file or directory" issue when collecting AOTI CPU perf for the dashboard. Debugging on the CI machine revealed two problems: 1) no valid VEC_ISA was picked; 2) when 1) happens, the libtorch path is not added to the linker path.
This PR fixes the second problem. A later PR will fix the first problem, but somehow finding the right VEC_ISA causes a performance regression, which needs more investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131791
Approved by: https://github.com/zou3519, https://github.com/chenyang78
Suggests fixes for data-dependent errors in non-strict export.
Any data-dependent error has an unresolved condition on unbacked symints. A mechanizable strategy for fixing such errors, which this PR enables, is to "bash" them using `torch._check()`s. For each error we suggest using `torch._check()` on the condition or its negation. The user selects and copy-pastes the suggested fix and continues.
For example, here's an existing data-dependent error message with the suffix following `<snip>...</snip>` added by this PR:
```
Could not guard on data-dependent expression Eq(u2, u1) (unhinted: Eq(u2, u1)). (Size-like symbols: u1)
<snip>...</snip>
User code:
File "test/export/test_export.py", line 1944, in forward
return r.view(items[0], items[2])
Suggested fixes (please choose one of the following):
1. torch._check(items[2] == r.shape[1])
2. torch._check(items[2] != r.shape[1])"
```
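For illustration, copy-pasting suggested fix 1 into the user code would look roughly like this (sketch only, mirroring the example above):
```python
import torch

def forward(r, items):
    # Suggested fix 1: assert the data-dependent condition so export can
    # resolve the guard instead of erroring.
    torch._check(items[2] == r.shape[1])
    return r.view(items[0], items[2])
```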
Tests in this PR illustrate this workflow, by taking common examples of data-dependent errors and bashing them until success, purely based on suggested fixes. In particular, we test this workflow on the "puzzlers" in https://www.internalfb.com/intern/anp/view/?id=5330476 (thanks @ezyang).
In terms of implementation, we focus on non-strict mode, where we can intercept torch function calls to install a handler that walks up the stack from the error, finding the closest non-torch frame and inspecting its locals for symints appearing in the error. The suggested fixes then access these symints through the local variables so that they can be (a) easily understood by the user (b) directly added to the code.
Implementing this idea in strict mode is follow-up work—we have already investigated what it would take, and decided to separate it out of this PR for reasons described next.
It's not too hard to map symints to locals in Dynamo (although it needs to happen elsewhere, i.e., intercepting torch function calls won't work). However, unfortunately this doesn't seem to be enough; the graph modules created by Dynamo when going through AOTAutograd can raise further data-dependent errors in some cases, and thus we need yet another mechanism to map symints to locals for graph modules, via captured source-level metadata and FX node walking. This latter component will require some care to build properly, or we might conclude it is altogether unnecessary and fix Dynamo instead.
Differential Revision: D56867432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125378
Approved by: https://github.com/ezyang
Add support for transposed, non-contiguous `NestedTensor`s, where `ragged_idx > 1`, for the aten operators `sum` and `mean`. This diff enables reducing along the jagged dimension for non-contiguous `NestedTensor`s, transposed between non-batch dimensions as well as between a ragged and a non-batch dimension. For example, users can now reduce a `NestedTensor` of shape `(B, M, *, N)` along `*` or `(B, N, M, *)` along `*`.
Parametrize existing unit tests and add new unit tests verifying the accuracy of implementations on `NestedTensor`s that transpose between 2 non-batch dimensions as well as between a ragged and a non-batch dimension.
Differential Revision: [D59847927](https://our.internmc.facebook.com/intern/diff/D59847927/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131517
Approved by: https://github.com/davidberard98
Summary:
Dynamo doesn't track whether buffers are `persistent`. This led to some ugly code where we would mark buffers as always persistent when creating signatures, then later check whether the buffers were not in the state dict to infer whether they were non-persistent, and use this to fix up the signature.
This PR instead defines a utility to look up all the non-persistent buffers registered inside a module (this information is recorded in a private `_non_persistent_buffers_set` module attribute), and uses it to (a) correctly set the persistent flag on buffers when creating signatures (b) transfer this information to a Dynamo-traced graph module, which then causes non-persistent buffers to (correctly) not show up in the state dict.
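A hypothetical helper illustrating the lookup described above (not necessarily the exact utility added in this PR):
```python
import torch
from torch import nn

def non_persistent_buffer_names(mod: nn.Module) -> set:
    # Walk all submodules and collect the fully-qualified names recorded in
    # each module's private _non_persistent_buffers_set.
    names = set()
    for prefix, submod in mod.named_modules():
        for name in getattr(submod, "_non_persistent_buffers_set", ()):
            names.add(f"{prefix}.{name}" if prefix else name)
    return names

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = nn.Linear(2, 2)
        self.inner.register_buffer("scratch", torch.zeros(1), persistent=False)

print(non_persistent_buffer_names(M()))  # {'inner.scratch'}
```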
Test Plan: existing tests + new case with non-persistent buffer in nested module
Differential Revision: D60224656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131756
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
After a recent refactoring of inductor, `.users` are now associated with buffers instead of scheduler nodes.
In `debug.py`, one such usage of `.users` is not updated accordingly, and the change here fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131796
Approved by: https://github.com/yf225
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi kernel's codegen more complicated as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.
Instead, this PR updates the persistent kernel's `must_keep_buffers` so that these
buffers aren't dropped during codegen, and both kernels have the same signature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
This makes TCPStore `wait` timeout print actually useful info instead of a generic `Socket Timeout` message on timeout.
Bonus:
* fix weirdness where `connect_timeout` only supported seconds unlike the rest of our timeouts (thus the minimum timeout was 1s)
* Fixed tests that used a 10s timeout (test_store now only takes 20s instead of 40s)
Ex:
```
DistStoreError: wait timeout after 100ms, keys: /the_key
```
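An illustrative sketch (host/port are hypothetical):
```python
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
try:
    store.wait(["/the_key"], timedelta(milliseconds=100))
except Exception as e:  # DistStoreError in recent builds
    # Now reports something like "wait timeout after 100ms, keys: /the_key"
    # instead of a generic "Socket Timeout".
    print(e)
```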
Test plan:
```
python test/distributed/test_store.py
python test/distributed/test_c10d_gloo.py -v -k timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131808
Approved by: https://github.com/kurman
Inductor would like a way to have activations that do not escape the backward graph marked as "donated", so we can re-use their memory during memory planning here: https://github.com/pytorch/pytorch/pull/130580
For this to be safe though, we need to know at runtime that autograd does not plan to retain the current autograd graph (either for another call to .backward() later, or if double backward is being used). In the linked PR, the current plan is to error when we detect this situation, and ask the user to turn off the donated buffer config (although if/once we get to the point of always delaying backward compilation to runtime, we can just wait until we know the runtime value to compile).
There isn't a way to know if the currently running backward is run with `retain_graph=True` from python - @soulitzer helped me figure out where to grab it so I added a python binding for it under `ctx.is_retain_graph()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131038
Approved by: https://github.com/soulitzer
## What is a sympy fn str arg?
It's a string such as `sqrt` that also happens to name a real sympy function (e.g. `sympy.sqrt`).
## Crash
```
torch/_inductor/sizevars.py", line 468, in symbolic_hint
expr = self.simplify(expr) # where expr is 'sqrt'
torch/_inductor/sizevars.py", line 66, in simplify
return sympy.expand(expr).xreplace(self.replacements)
sympy/core/function.py", line 2816, in expand
return sympify(e).expand(deep=deep, modulus=modulus, **hints)
AttributeError: 'function' object has no attribute 'expand'
```
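The underlying sympy behavior can be reproduced outside inductor:
```python
import sympy

# sympify("sqrt") resolves to sympy's sqrt *function* rather than an Expr,
# so .expand() fails exactly as in the traceback above.
sympy.expand("sqrt")  # AttributeError: 'function' object has no attribute 'expand'
```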
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253
Approved by: https://github.com/desertfire