Some tests in `test/dynamo` are marked as "expected failure when testing
with `PYTORCH_TEST_WITH_DYNAMO=1`, i.e., we added files of those test
names in the `dynamo_expected_failures` folder.
However, a lot of those dynamo tests seem to be passing with
`PYTORCH_TEST_WITH_DYNAMO=1`, so this patch removes them from
`dynamo_expected_failures`.
Somehow make setup-env as recomended in CONTRIBUTING.MD is not installing all dependencies require to run tests
This makes it slightly clearer when running tests.
Specific repro on my side was
```
git checkout e7679663070e3149ae7cd6e28d376d86852ce9e4
make setup-env
conda activate pytorch-deps
python test/test_utils_internal.py
```
which is what my reading of the instructions implies should be correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137797
Approved by: https://github.com/albanD
I was discussing with James March how the current fx_codegen_and_compile
counter doesn't actually capture all compile time. This one is more
accurate and corresponds closely to the existing events in dynamo_compile
table.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138072
Approved by: https://github.com/markkm
Summary:
- This diff introduces `dtype` attribute to `TritonCSEVariable` and a dtype propagation helper function to infer dtype from input to output for each op.
- There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.
Test Plan: CI
Differential Revision: D61815079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison, https://github.com/blaine-rister
Summary: while reading through inductor template code I found a few places where else statements were driving me crazy. Fixing them as I read
Test Plan: CI
Differential Revision: D64882385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138789
Approved by: https://github.com/aakhundov
Summary: there is an issue with functionalization V2 in export. This is a quick fix that plumbs `is_export` through to `run_functionalized_fw_and_collect_metadata`.
Test Plan: CI
Differential Revision: D64915263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138836
Approved by: https://github.com/tugsbayasgalan
I was debugging an internal ne divergence for a while that ended up being because of a bad meta. I added an explicit a config option and an explicit backend `aot_eager_decomp_partition_crossref` to enable the FakeCrossRefMode when running the graph. I added an explicit backend bc I suspect it will be useful for internal models but I'm also happy to leave as config option.
It will only test ops that have meta to avoid memory overhead of hitting fallback path and running in eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138651
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.
Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.
For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
Fixes#138509
tl.constexpr globals would be codegen-ed as `constexpr()` instead of `tl.constexpr()` if they were un-annotated. This fixes the issue (and adds a test). The correct handling was already added but the corrected string was not being used in the un-annotated branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138757
Approved by: https://github.com/oulgen
Summary: During compilation, a profiler context gets ignored so we should temporarily turn off tests that are failing due to dynamo. Once profiler integration with dynamo is introduced we can reintroduce these tests
Test Plan: Make sure CI is passing again
Differential Revision: D64867447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138762
Approved by: https://github.com/davidberard98
As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures).
This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead.
Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition.
Differential Revision: D64806175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669
Approved by: https://github.com/tugsbayasgalan
Summary:
We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread too fast before it has time to write out the Flight Recorder logs.
This change:
1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes.
2. Use a promise between watchdog thread and heartbeat monitor thread to delay, at most, one minute to give Flight Recorder time to write out it's log on timeout.
Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives heartbeat thread time to write out the logs.
In the logs we see:
```
[trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish.
```
Differential Revision: D64857928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138828
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
We found that `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused.
Differential Revision: D64613290
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
Each inductor event should have exactly one hit, miss, bypass etc. Add it to the inductor compile event.
Add triton_compile as a compiler phase with `dynamo_timed`. This way, we get PT2 Compile Event Logs for triton as well.
Here's what triton events look like: {F1941513932}
And this on a cache hit(since we still redo this work):
{F1941514350}
Inductor cache info:
{F1941528530}
Differential Revision: [D64703392](https://our.internmc.facebook.com/intern/diff/D64703392/)
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138603
Approved by: https://github.com/oulgen
Summary: We expand the tests to cover serdes_non_strict. Currently failing tests are skipped.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _serdes_non_strict
```
Differential Revision: D64709285
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138662
Approved by: https://github.com/avikchaudhuri
As discussed with @ezyang, this set of diffs are extracting fixes to problems discovered to flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138598
Approved by: https://github.com/ezyang
ghstack dependencies: #138595
When fp8 dtype is involved, Inductor may set min_elem_per_thread to be a positive value. This will force increasing XBLOCK even for a small xnumel (e.g. 1). Inductor will report an error later when sanity check the triton config.
The simple fix here is to just not let XBLOCK to be larger than xnumel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138730
Approved by: https://github.com/Chillee
ghstack dependencies: #136782
unwrap_tensor_subclasses -> get_plain_tensors
Is used at runtime. For small models this overhead is feasible in comparison with small compiled kernel.
1/ Removing asserts from runtime path
2/ Removing list creation with using optional output list to append argument
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138498
Approved by: https://github.com/bdhirsh
Summary:
Currently, whether split should be used depends on the size of subgroup.
It's possible that default PG is not eagerly initialized yet, but split is still
called.
This PR fixes this issue by removing split's dependency on subgroup size
Test Plan:
Modified UT
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138781
Approved by: https://github.com/kwen2501
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.
# Solution
1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170
2. UT in `test/inductor/test_pattern_matcher.py` is introduced by https://github.com/pytorch/pytorch/pull/138089, we will skip this UT due to the unsupported feature `max_autotune_gemm_backends:Triton`.
3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
5. CUDA-bias code is introduced by https://github.com/pytorch/pytorch/issues/138472, we just generalize it to `GPU_TYPE`.
# Additional Context
> Why update torch-xpu-ops commit pin here?
We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364).
> What does the feature of torch-xpu-ops update?
1. Add some foreach ops, like `unary ops` and `foreach_clamp_max` etc;
2. Add some maxpool ops forward and backward, like `averge_pool3d` and `max_pool3d`
3. Add some other ops, like `log_normal_`, `index_copy`, and `mode` etc;
4. fix build failure related to `C10_UNUSED`;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
We did a lot of optimization for PyTorch Windows, and we got good progress of it. But still some models have performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog conclusion, we found the `ResNet50` is typical case of it.
Let's focus on the `ResNet50`, and collect the profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
model_inference 3.91% 682.427ms 100.00% 17.448s 17.448s 1
aten::conv2d 0.18% 30.906ms 64.79% 11.305s 2.133ms 5300
aten::convolution 0.45% 78.031ms 64.62% 11.275s 2.127ms 5300
aten::_convolution 0.30% 51.670ms 64.17% 11.196s 2.113ms 5300
aten::mkldnn_convolution 63.58% 11.093s 63.87% 11.145s 2.103ms 5300
aten::batch_norm 0.13% 23.536ms 20.10% 3.506s 661.580us 5300
aten::_batch_norm_impl_index 0.28% 49.486ms 19.96% 3.483s 657.139us 5300
aten::native_batch_norm 19.26% 3.360s 19.64% 3.427s 646.615us 5300
aten::max_pool2d 0.01% 1.038ms 5.84% 1.018s 10.181ms 100
aten::max_pool2d_with_indices 5.83% 1.017s 5.83% 1.017s 10.171ms 100
aten::add_ 3.38% 588.907ms 3.38% 588.907ms 85.349us 6900
aten::relu_ 0.35% 60.358ms 1.67% 292.155ms 59.624us 4900
aten::clamp_min_ 1.33% 231.797ms 1.33% 231.797ms 47.306us 4900
aten::empty 0.46% 80.195ms 0.46% 80.195ms 1.513us 53000
aten::linear 0.01% 927.300us 0.23% 39.353ms 393.532us 100
aten::addmm 0.20% 35.379ms 0.21% 37.016ms 370.155us 100
aten::empty_like 0.12% 20.455ms 0.17% 29.976ms 5.656us 5300
aten::as_strided_ 0.11% 18.830ms 0.11% 18.830ms 3.553us 5300
aten::adaptive_avg_pool2d 0.00% 419.900us 0.08% 14.265ms 142.647us 100
aten::mean 0.01% 1.737ms 0.08% 13.845ms 138.448us 100
aten::sum 0.05% 8.113ms 0.05% 8.648ms 86.479us 100
aten::resize_ 0.03% 5.182ms 0.03% 5.182ms 0.978us 5300
aten::div_ 0.01% 1.445ms 0.02% 3.460ms 34.600us 100
aten::to 0.00% 337.000us 0.01% 2.015ms 20.154us 100
aten::_to_copy 0.01% 977.500us 0.01% 1.678ms 16.784us 100
aten::copy_ 0.01% 1.474ms 0.01% 1.474ms 7.371us 200
aten::t 0.00% 775.900us 0.01% 1.410ms 14.104us 100
aten::flatten 0.00% 420.900us 0.01% 1.311ms 13.106us 100
aten::view 0.01% 889.700us 0.01% 889.700us 8.897us 100
aten::transpose 0.00% 410.700us 0.00% 634.500us 6.345us 100
aten::expand 0.00% 496.800us 0.00% 566.800us 5.668us 100
aten::fill_ 0.00% 534.800us 0.00% 534.800us 5.348us 100
aten::as_strided 0.00% 293.800us 0.00% 293.800us 1.469us 200
aten::empty_strided 0.00% 241.700us 0.00% 241.700us 2.417us 100
aten::resolve_conj 0.00% 54.800us 0.00% 54.800us 0.274us 200
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 17.448s
Execution time: 20.02380895614624
```
We found the major kernel consume CPU resource is `aten::mkldnn_convolution`. It was dispatched to `MKLDNN`.
Acturally, we had optimized memory allocation via integrated mimalloc to pytorch C10 module. It helps PyTorch Windows boost a lot, but it does not cover `MKL` and `MKLDNN`'s intermediary temporary memory.
We still have potential to improve PyTorch Windows performance via optimize `MKL` and `MKLDNN`'s intermediary temporary memory.
So, I discussed with Intel MKL team, and get a method to register high performance memory allocation API to MKL, and it would help MKL to boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html
This PR is optimize MKL memory alloction performance on Windows, via register mi_malloc to MKL. PR Changes:
1. Add cmake option: `USE_MIMALLOC_ON_MKL`, It is sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in C10, when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register allocation APIs to MKL, when `USE_MIMALLOC_ON_MKL` is `ON`.
For `oneDNN`, it is still tracking in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types:
- 1D tensors
- 2D tensors when reducing along dimension 0
- 2D tensors when reducing along dimension 1
Runtime reduction between 0 and 75% depending on tensor shape.
The patch uses the maximum number of threads per CU and the number of CUs itself to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137737
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell
This split in JKs was never actually used (We just set both JKs to the same values except when we accidentally didn't due to being humans who make mistakes). This simplifies the overall JK structure and eventually, will let us delete the duplicate JK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137501
Approved by: https://github.com/oulgen
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a stuck when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization
The workaround is to update the healthcheck time when calling `get_last_progress_time`
Test Plan:
Logs show that the progress time value is being changed despite client not being set
Behavior when watchdog is enabled with SJD config is left unchanged
Differential Revision: D64733766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
Summary: Currently, calling `torch._logging.set_logs()` resets the log directory leading to multiple tlparse outputs. This prevents the dir from resetting after the first call.
Reviewed By: ezyang
Differential Revision: D64118047
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793
Approved by: https://github.com/ezyang
Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.
It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.
Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate via experimentation from `linux.gcp.a100`. After successfully experiment with those instances we will remove those labels and update the workflows to use `linux.aws.a100` and decomisson the gcp fleet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).
### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast.
2. There is no gain by doing lazy init in non-blocking mode, because the right next CPU call is a collective, and we will block there waiting for comm to be ready, so same effect as blocked init, no "opening" compared to eager mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
Summary:
This DIFF is to fix dead lock issue in execution issue. ExecutionTraceObserver get a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/ouput values are evaluated, which will triger python GIL in some use cases. In this case, the lock order is ET locker -> GIL.
One of the ads application get GIL first, then call all-gather to collect some metrics from all ranks. When ET is on, all-gather is captured by ET observer. In this case, the lock order is: GIL -> ET locker
That is the reason why dead lock happens. To fix it, I changed the ET locker scope, so the input/output evaluation is no longer inside the scope of the ET locker.
Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda
Differential Revision: D63556608
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892
Approved by: https://github.com/aaronenyeshi
As discussed with @ezyang, this set of diffs are extracting fixes to problems discovered to flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
As discussed with @ezyang, this set of diffs are extracting fixes to problems discovered to flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False` and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595
Approved by: https://github.com/ezyang
Previous to this PR, when a model fails to be exported, it falls back to try with the legacy torchscript exporter. However, we didn't stop when it's exported with torchscript exporter, an optimization is applied to the graph.
It's ideal that the optimization can also boost the performance of the model exported with the legacy torchscript exporter, but currently, for benchmarking purpose and what fallback guarantee to the users, we should keep it simple and only return the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265
Approved by: https://github.com/xadupre, https://github.com/justinchuby
Previously we only wait for comm to become ready after its initialization.
That's not enough. There are other NCCL APIs that can cause the comm to be InProgress, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we just ensure comm is ready every "next time" we need to access ncclComm.
The place to add such gate keeper is `getNcclComm`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
Summary:
This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`
Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">
* `torch._cslt_sparse_mm_search` has been modified to return optimal
split-k parameters as well as max alg_id.
* max_id is now available in `torch.backends.cusparselt` via
`torch.backends.cusparselt.get_max_alg_id()`
* fixed meta registrations for float8
Test Plan:
python test/test_sparse_semi_structured.py
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
This diff does a few things:
## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`.
Metadata can only be added to chromium events that have started, but not ended (so, in progress events)
- When you add the data, the metadata is appended to the metadata when you call log_event_end().
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.
## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:
```
__start__
-> dynamo (dynamo compile metrics)
-> entire_frame_compile (compile.inner)
-> backend_compile (i.e. aotdispatch)
-> create_aot_dispatch_function
-> inductor_compile
-> ...
```
BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.
*FAQ: Why can't we use `entire_frame_compile` as the event?*
This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.
## Log metadata as separate columns
(Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477
Approved by: https://github.com/aorenste
This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten.
Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable derived from the name of the function. This causes that function to be overwritten in the global scope. This PR checks that the name is unique in the global scope as well, avoiding the issue of overriding the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431
Approved by: https://github.com/jansel
Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped.
Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability
```
Differential Revision: D64611532
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380
Approved by: https://github.com/angelayi
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.
I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
Was a bit confusing to read when working on #138354
"computer-assisted proof"
```
import random
def argsort(seq):
# preserve original order for equal strides
getter = seq.__getitem__
a_r = range(len(seq))
return list(reversed(sorted(a_r, key=getter, reverse=True))) # noqa: C413
def stride_order2fill_order(order):
"""
Convert stride order to fill order
For channel last format,
stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
"""
lookup = {pos: idx for idx, pos in enumerate(order)}
fill_order = [lookup[i] for i in range(len(order))]
return fill_order
def get_stride_order(seq):
"""
Convert strides to stride order
"""
sorted_idx: List[int] = argsort(seq)
out = [0 for _ in range(len(seq))]
a = sorted_idx.copy()
for i, elem in enumerate(sorted_idx):
out[elem] = i
fillorder = stride_order2fill_order(out)
assert fillorder == sorted_idx
return out
for _ in range(1000):
a = [0, 1, 2, 3]
random.shuffle(a)
get_stride_order(a)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376
Approved by: https://github.com/drisspg
On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel. None of the existing unit tests were exercising the large index case so a new unit test was added.
It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.
Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is init_backend_registration has incorrectly cached CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. Also a code checking is added to regex match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't be unnoticed.
Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.
Cleanups
- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.
For release notes, Not sure exactly how to communicate this, but something like
"ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`, this PR stops that, and uses `Dim()` objects to guide DimHints.
The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.
Current behavior is that:
```
class Foo(torch.nn.Module):
def forward(self, x, y):
return x - y
inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
"x": (Dim.AUTO, Dim("d1", min=3)),
"y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```
The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.
Differential Revision: D64636101
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
Summary:
This implementation does not utilize the benefit that after allgather we can directly perform the SDPA without doing the ring-based SDPA, but we can overlap the communication with the first sharded kv computation. This implementation shows some performance benefit and memory saving compared to the original alltoall implementation in certain cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820
Approved by: https://github.com/XilunWu
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.
The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold a an extra reference as well.
Part of https://github.com/pytorch/pytorch/issues/124807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
Currently when tuples values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside of the tuple will not be properly encoded. In this case, if an enum is contained inside of a tuple, it will cause invalid python code to be generated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
Summary: same as title. Plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff D63714905 which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix the same.
Test Plan: tbd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Co-authored-by: Huy Do <huydhn@gmail.com>
This has the benefit that
1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba
2) We can do analysis (eg. set different two sets of tests across two PRs)
3) We can get results faster at the test-level granularity instead of job-level granularity we see in the HUD/GH.
I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9
I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
508 | int split_color{NCCL_SPLIT_NOCOLOR - 1};
| ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete.
This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
Looking in the code I see
```
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
```
But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_.
Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`.
Therefore, since we are finished deprecating old MSVCs we can deprecate `C10_DEPRECATED`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406
Approved by: https://github.com/cyyever, https://github.com/malfet
Since cuda 12.4 binaries are default binaries on pypi now. The pytorch_extra_install_requirements need to use 12.4.
This would need to be cherry-picked to release 2.5 branch to avoid injecting these versions into metadata during pypi promotion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
## The problem
In a typical debugger, `repr()` is used to display variables and not `str()`.
Several classes in Dynamo have a `__str__()` method that returns useful information and a `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore.
`str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()](https://docs.python.org/3/library/functions.html#repr)".
## The solution
In the Python object model, if there is no `__str__` method, `__repr__` is used instead (but not the other way around).
So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier.
The specific classes changed were all in `torch._dynamo.variables`:
* `builtin.BuiltinVariable`
* `constant.ConstantVariable`
* `constant.EnumVariable`
* `functions.UserMethodVariable`
* `lazy.LazyVariableTracker`
* `lazy.LazySymNodeFormatString`
* `misc.GetAttrVariable`
* `misc.NullVariable`
* `user_defined.UserDefinedObjectVariable`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136316
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
Co-authored-by: Will Feng <yf225@cornell.edu>
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```
Fixes#136044.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.
This also adds metadata for to register_replacement patterns, including pad_mm.
This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
Summary: same as title. Plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff D63714905 which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix the same.
Test Plan: tbd
Differential Revision: D64246495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is init_backend_registration has incorrectly cached CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole.
Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303
Approved by: https://github.com/hl475
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves it from the inner loop of the subgroup creation (each subgroup ranks of each mesh dimension) to the outer loop (each mesh_dimension).
For example, given we have a 2 * 4 DeviceMesh, we are re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, which is 2 times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.
This also adds metadata for to register_replacement patterns, including pad_mm.
This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
Fixes: https://github.com/pytorch/pytorch/issues/138069
I tested this by running `python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu` before and after the change and verifying no more log spew.
I'm uncertain on if it makes sense to add a test for this PR. Question for reviewers: is there a standard paradigm for testing these log spew based fixed? Happy to add a test if someone can point me towards the right direction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138356
Approved by: https://github.com/ezyang
This fixes internal crash due to the invalid bufer size computation if sliced API is used
Not sure what was the purpose of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
baseShape = src._base().sizes();
} else {
baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
flattenedShaped *= baseShape[i];
}
```
As flattenShaped could be much easier computed as `[srcBuf
lengh]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.
When someone allocated buffer to hold say uint8 and that view-casted it
to float16, attempt to compute `baseShape` returned sizes of original
tensor in its data type, rather than size in new dtypes
Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
## Summary
We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.
This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).
## Test Plan
The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
Fixes https://github.com/pytorch/pytorch/issues/137856.
### Issue 1
Today under `ProcessGroupNCCL::Options`, color is declared as:
```
int64_t split_color{0};
```
When passing this variable to `ncclCommSplit` which accepts `int`, the value may overflow and become negative, as in #137856. But NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).
But that's not all.
### Issue 2
`split_color` is pybind'ed to python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because python `int` represents a wider range than C++ `int`. So we cannot pass hash values -- which are potentially big ints -- from python to C++. The PR modulo the hash value with `c_int`'s max value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855
Approved by: https://github.com/wconstab
The rationale behind this PR is to:
1. Move the dump of c++ traces after FR dump because the FR dump is timed meaning that it will not block forever, while the dumping of c++ traces is likely to be blocking. so that we swap the order. Ideally we also want to make cpp stacktrace dump to be a future wait, if we want to go down this path, we can also make it happen in an another PR.
2. Add log Prefix to the logs which have not been added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138368
Approved by: https://github.com/c-p-i-o
Summary:
This change unblocks the CFR AOTI lowering runtime error.
TL;DR:
In this model, one triton kernel expects a scalar input dtype as i64, but getting an i32. The reason is "auto" can infer a smaller data type if the variable it passed in e.g. is i32. thus cause CUDA IMA.
Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`.
This diff manually cast it to i64 for all symbolic arguments in compile time for i64 triton kernel inputs, instead of use `auto var_x = {arg}` in cpp wrapper code.
Test Plan:
Verified in FLB locally:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16 --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"```
Differential Revision: D64490039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138106
Approved by: https://github.com/ColinPeppler
Fixes#138211
`Path.rename` function has Windows OS specific behavior, that will raise `FileExistsError` when the target file existing.
This behavior is not happened on Linux, so I write a small repoduce code to figure out what happened.
After stepping trace the repo code:
```python
import os
import sys
from pathlib import Path
_IS_WINDOWS = sys.platform == "win32"
def test_case():
cwd = os.getcwd()
path1 = os.path.join(cwd, "haha1.txt")
path2 = Path(os.path.join(cwd, "haha2.txt"))
try:
path2.rename(path1)
except FileExistsError as e_file_exist:
if _IS_WINDOWS:
# on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename
shutil.copy2(path2, path1)
os.remove(path2)
else:
raise e_file_exist
except BaseException as e:
raise e
print("run here.")
if __name__ == "__main__":
test_case()
```
We found the code `path2.rename(path1)` can breakdown into:
1. copy file2's content to file1.
2. delete file2.
So, we can implemented equal code on Windows path:
```python
shutil.copy2(src=tmp_path, dst=path)
os.remove(tmp_path)
```
So, we can get current PR.
TODO: need cherry-pick to release/2.5 branch, CC: @atalman .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Enable -Werror on s390x
Example of original issue on s390x:
https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704
Most of warnings are not specific to s390x, but specific to gcc-13 or gcc-14. To test it on s390x an image with gcc-13 is needed. For s390x it's tested for new regressions on every merge due to trunk workflow.
`-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe.
`-Wredundant-move` with new gcc complains about `std::move(...)` disabling copy elision. But removing `std::move(...)` makes used clang versions complain about copying objects when they could be moved. For now also disable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527
Approved by: https://github.com/malfet
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
Summary: I have to keep bypassing issues because of these clang rules. Let's start with all of the bugs instead of the variable name ones because that will introduce a lot of lines of code and can make things hard to read
Test Plan: Format tests pass.
Differential Revision: D64411171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138296
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
This diff is the starting steps of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709
It implements the following changes:
- Only log spans to scuba, so no start events are ever logged
- Log events as the full event name, without "START" or "END"
- Only log to scuba major phases from chromium events. These are:
- entire_frame_compile (dynamo)
- backend_compile (aotdispatch)
- inductor_compile (inductor)
- codegen (inductor codegen)
Tlparse chromium events stay basically the same. But I implemented a few changes to clean that up as well:
- When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace to not have two identical rows. The fn_name is avaliable as metadata on the chromium event, if interested
- Log new events for pre and post grad passes. These do *not* log to scuba.
By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add **much** more metadata and information to each individual event type. Diffs for that will come later.
**IMPLEMENTATION NOTES:**
- The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one.
- There's an interesting space after pre grad passes within AOT autograd logic, that's between create_aot_dispatcher_function and pre grad passes. I'm not sure what we're spending time doing in that time, but I'll find out with a profile later.
Differential Revision: [D64479033](https://our.internmc.facebook.com/intern/diff/D64479033/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138093
Approved by: https://github.com/ezyang
This patch addresses parts of the side-effect refactor proposed in #133027;
specifically, it does 3 things:
1. Change `SideEffects.store_attr_mutations` and `PyCodegen.tempvars`
to index on `VariableTracker` rather than `MutableLocalBase`.
2. Remove the `source` field from `MutableSideEffects` and
`AttributeMutation`, and use `VariableTracker.source` instead.
3. Plumb a `overridden_sources: Dict[Source, Source]` from
`handle_aliases_for_stolen_lists` to `PyCodegen` so that we don't
update `VariableTracker.source` in place, while still preserving what
`handle_aliases_for_stolen_lists` needed (i.e., modifying codegen for
certain `VariableTracker`).
(1) and (2) are merged in 1 patch because of some dependency between
a. `OutputGraph.handle_aliases_for_stolen_lists` which iterates over
`sideSideEffects.store_attr_mutations.keys()`, and potentially update
its source field to be completely different.
b. `SideEffects.codegen_update_mutated`, which happens after the above
and uses `cg(var.mutable_local.source)`.
where if we apply (1) only, (b) breaks, and if we apply (2) only, (a)
breaks.
(3) is needed for correctness, see comments in the PR for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137905
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
Removing just the LF canary scale config for now to test the changes in https://github.com/pytorch/test-infra/pull/5767
Those changes have been deployed to prod and appear to be working, but this will be the final proof that it is in fact reading the test-config version of scale-config and not the pytorch/pytorch copy.
Note: This will break the Scale config validation workflow on test-infra, but it's worth it since this test will be very short lived and that workflow only runs when someone modifies scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138361
Approved by: https://github.com/wdvr
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D. This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
Summary: From experiment, it seems like aten.constant_pad_nd has better QPS compared to torch.cat. The qps gain for ig ctr is ~10%, and ~5% for oc.
Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
Differential Revision: D64271583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
In this PR, we implement lazy dictionary for export decomp behaviour for following reasons:
1. Custom op loading can happen after import time, as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible.
I intentionally seperated out the core_aten_decomp to not have any custom CIA ops in this PR to mitigate the risk of getting reverted but in the future, core_aten_decomp under torch/_decomp will exist as an alias to official export table (torch.export.default_decompositions)
Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
To skip over the command that do not have output file specified
Recently I've noticed that `generate_torch_version.py` started to run on every rebuild, and this results in a failed plan for deb info rebuilds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138290
Approved by: https://github.com/Skylion007
## `VariableTracker::build()` hides the Builders
### The problem
In the current code, creating a `VariableTracker` involves choosing one of two `Builder` classes and either calling a method, or calling a constructor that creates an object that you immediately call, [like this](083c9149b7/torch/_dynamo/variables/functions.py (L761-L768)).
Variations on this code are repeated in many places.
More, the `Builder` classes have a lot of dependencies, so they have to be loaded late in the whole import process to avoid circular imports, so they end up being repeatedly imported at local scope.
### The solution
In this commit, the import from `builder` and the logic of choosing and calling the Builder class are hidden in a single static factory method, `VariableTracker.build()`, easier to reason about and to import.
This commit net lowers the total lines of code by over 150 lines by removing repetitive logic and unnecessary local imports.
**CHANGES:** Originally the name of the static method was `VariableTracker.create()` but a static method on a derived class, `LazyVariableTracker.create()` now exists with a different signature that's irreconcilable, so the new static method was renamed to `VariableTracker.build()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135714
Approved by: https://github.com/jansel
Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places.
Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.
Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)
Differential Revision: D64406945
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- Due to Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kerenl_wrap_` lowering and passed to the `ir.UserDefinedTritonKernel` as additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950
Approved by: https://github.com/eellison
ghstack dependencies: #137677
Fix TODO in code
```python
# TODO: create an internal helper function and extract the duplicate code in FP16_compress and BF16_compress.
```
1. Extract common logic in `fp16_compress_hook` and `bf16_compress_hook` to `_compress_hook` method
2. Let `fp16_compress_hook` and `bf16_compress_hook` invoke `_compress_hook` with difference `dtype`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138182
Approved by: https://github.com/awgu
By guarding the calls to `-[MTLCompileOptions setFastMathEnabled]` with `C10_DIAGNOSTIC_PUSH` and `POP`
and using `-[MTLCompileOptions setMathMode:]` and `-[MTLCompileOptions setMathFloatingPointFunctions:]` on MacOS15
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138238
Approved by: https://github.com/atalman
Summary:
Our watchdog does not differentiate timeout from NCCL errors clearly in terms of both log and code paths.
It's important for c10d to differentiate different reasons of watchdog
failures. E.g, timeout vs nccl errors, and possibly let users to handle the
errors differently depends on the type of errors
Test Plan:
UT
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240
Approved by: https://github.com/Skylion007
Summary: logs if an operator is run with the TorchScript runtime, using a thread_local variable set in `InterpreterState.run()`
Test Plan: buck2 run mode/dev-nosan caffe2/torch/fb/observers:scuba_observer_runner
Reviewed By: zou3519
Differential Revision: D64200781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137986
Approved by: https://github.com/angelayi
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.
In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
This fixes 4 main issues:
- The way the cuda sanitizer handle it's state is weird. In particular, because the lifetime of the Mode is linked to the submodule, then this might outlive the python runtime and other modules loaded. On my current version, this even outlives the "sys" module. Given that I'm not sure the impact of changing this lifetime handling, I'm making the exit handler a no-op when python is already dying and thus no point cleaning up.
- Adds a "disable" method to be able to test after the mode is enabled.
- Fix `Tensor.as_sublass()` to properly disable modes when creating the new Tensor object just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here is just to apply the exact same treatment to it.
- ~Fix `Tensor.as_subclass()` not to propagate autograd as there is no valid backward associated here.~ We have test that check that this behavior happen so I guess this is not an obvious bugfix and expected behavior. Reverted that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218
Approved by: https://github.com/ngimel
Previously the decomposition would upcasts inputs to fp32. This led to a slowdown compared to eager which would run in fp16. We also tried keeping the bmm in fp16, and the upcasting for the epilogue but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.
Fix for https://github.com/pytorch/pytorch/issues/137897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click the "approve all workflows" just for lint to fail again cuz the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a swirl once to start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which causes it to not be easily differentiable from normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph" which makes it easier to tell which tlparse graph block is from Compiled Autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803
Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang
There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist without computing / caching if they don't. This PR introduces private convenience functions to simplify handling this and avoiding redundant checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130
Approved by: https://github.com/soulitzer
Summary:
Support autocast re-tracing by giving it the same treatment as set_grad.
In re-tracing, when dynamo encounters an autocast HOP, we want it to trace through `with torch.autocast()` again, and replace the HOP with the traced subgraph.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast
```
Differential Revision: D63856081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138082
Approved by: https://github.com/ydwu4
Summary:
1) Add sleef back to enable SIMD on AMD
2) adding kpack to triton compute_meta for AMD triton, since there will be user-defined triton kernels using this for k-dim packing
Test Plan:
```
HIP_VISIBLE_DEVICES=0 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" buck run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --skip-flop-estimation --skip-trt --skip-ait --enable-aot-inductor --sync-mode=0 --gpu-trace --sample-input-tile-factor=1 --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge" --lowering-input-str='{"serialized_inference_model_input_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge","serialized_inference_model_output_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/mi300_output.merge","submodule_names_to_lower":["merge"],"inductor_lowering_context":{"aot_inductor_lowering_settings":{"use_scripting":true,"preset_lowerer":"ifu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":3,"output_precision":3, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}},"model_entity_id":925729118,"model_snapshot_id":0,"add_sample_inputs":false,"hardware_type":0,"platform_arch":1,"dense_in_place_format":2}' --precision=bf16 2>&1 | tee local_benchmark_log.txt
```
Differential Revision: D64262924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137939
Approved by: https://github.com/frank-wei
Summary:
Currently, https://fburl.com/code/uka25j1i checks whether the guarded object supports weakref by looking at its `__class__`
```
if hasattr(guarded_object.__class__, "__weakref__") and not isinstance(
guarded_object, enum.Enum
):
obj_ref = weakref.ref(guarded_object)
```
However, we have reason to modify this slightly because we use classes that "pretend" to be some other classes (e.g. nn.Parameter). Example https://fburl.com/code/8bcktgoh :
```
class QuantizedWeights:
# TODO: Ugly trick so torch allows us to replace parameters
# with our custom weights. Do this properly.
property
def __class__(self) -> Type[nn.parameter.Parameter]:
return nn.Parameter
property
def grad_fn(self) -> None:
return None
```
For example, Fp8RowwiseWeights which inherit from the base class above and also from namedtuple, actually does not have `__weakref__` attribute, but its "class" will say it does.
I think the easiest change is to use instance-level checking rather than class-level
```
if hasattr(guarded_object, "__weakref__") ...
```
But I'm wondering if this will harm any of the existing behaviors.
I'd appreciate reviews from the experts
(I just added all recommended reviewers since I'm not sure who is the best person to consult...)
Test Plan: CI?
Reviewed By: YJYJLee
Differential Revision: D64140537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137838
Approved by: https://github.com/williamwen42, https://github.com/jansel
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).
This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan
The suggestions unusably clog up early draft PRs that are not necessarily lint clean yet. Making matters worse, even if I fix them I have to manually click through hundreds of comments to "Resolve" them even though I've fixed it. Disabling it on ghstack helps, but I occasionally do standard PRs via fbcode export mechanism. Opt me out.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138054
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/PaliC
Summary: Float8 is becoming and increasingly popular datatype now that it is well supported on GPUs. This diff enables FP8 to work with `torch.cat`. This is pretty straight forward since memory operations dont vary based on the input dtype, but can be quite helpful for FP8 based models.
Test Plan:
```
buck2 run mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12 //caffe2/test:tensor_creation -- -r test_cat_all_dtypes_and_devices
```
Differential Revision: D64443965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138046
Approved by: https://github.com/eqy, https://github.com/qchip, https://github.com/jianyuh
Current setting command of the `CMAKE_PREFIX_PATH` environment variable will overwrite values if it had already been set with some values. Changing it to `:` appends the conda env search path to its values to avoid library not found issues.
`export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134934
Approved by: https://github.com/malfet, https://github.com/EikanWang
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.
In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
ghstack dependencies: #138105
Our matmul support is abysmal - let's at least get this working and do it performantly later.
Bonus: implements `bmm` as well.
jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
Summary:
Tensor constants can show up through wrapped methods, so that they may not always be found in constant attributes. They need to be fakified and their meta vals need to be found to create graph signatures nevertheless. Otherwise non-strict barfs.
Longer term maybe we should pull this fakification up in non-strict.
Test Plan: added test
Differential Revision: D64480272
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091
Approved by: https://github.com/tugsbayasgalan
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
This pr introduces two changes:
1. Before this pr, the subgraphs output is ([], []), in this pr, we change it to a flattened list for easier codegen and consistency with other control flow operators.
2. Before the PR, the combine_fn of scan takes a sliced input but keep the sliced dimension. For exmaple, suppose xs = torch.randn(3, 4, 5) and we scan over dim 0, the combine_fn looks like:
```
# x.shape = (1, 4, 5) instead of (4, 5)
def combine_fn(carry, x):
...
```
In this PR, we fixed this and also simplify some of the slicing logic.
3. this diff also make sure we always stack ys on fist dimension.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135601
Approved by: https://github.com/zou3519
ghstack dependencies: #135600
Summary:
Record the world size in log and scuba table.
This helps us quickly figure out if there are missing flight recorder files form ranks.
Test Plan: Ran locally and noted that size was logged to scuba
Differential Revision: D64442949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138044
Approved by: https://github.com/Skylion007
Summary:
## Why
random.fork_rng doesn't support meta device:
```
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/aps_models/ads/tools/memory_estimator/estimation_dense.py", line 655, in estimate_dense_memory_size
[rank0]: losses.sum().backward()
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 604, in backward
[rank0]: return handle_torch_function(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/overrides.py", line 1718, in handle_torch_function
[rank0]: result = mode.__torch_function__(public_api, types, args, kwargs)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/_device.py", line 106, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 613, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/__init__.py", line 347, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank0]: frame.recompute_fn(*args)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1507, in recompute_fn
[rank0]: with torch.random.fork_rng(
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/runtime/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]: return next(self.gen)
[rank0]: File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/random.py", line 153, in fork_rng
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: torch has no module of `meta`, you should register a module by `torch._register_device_module`.
```
This blocks us from running backward() on model with checkpoint enabled in meta mode.
## What
This diff handles the case of meta device in random.fork_rng.
Test Plan: Tested with toy model which has checkpoint on its module: P1641201046
Differential Revision: D64161410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137715
Approved by: https://github.com/kit1980
Summary:
add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad node in to HOP, but we should still have the set_grad_enabled/autocast nodes.
add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes don't show up in the export graph when we use `strict=False`.
- In autocast's enter and exit function, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`.
if we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as strict mode, except strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806).
```
- torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function
- torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__
- torch.amp._enter_autocast()'s torch.overrides.handle_torch_function
- PreDispatchTorchFunctionMode.__torch_function__
```
- in `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes.
- to match the strict mode behavior, we let the input node to the `_exist_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode`.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_autocast
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_with_set_grad
```
Differential Revision: D64016023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495
Approved by: https://github.com/bdhirsh
We used a LIFO stack to store the CudaEvent in the cache. ,Somehow we like FIFO deque better so aside from improving the readability of the code, we use a deque instead. As @wconstab pointed out, both methods are equally correct because the moment we put the event into stack/deque, the event is already ready for reuse, this change mostly is a preference change not trying to fix anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138048
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040
**MOTIVATION**
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
**CHANGES**
- Add support for HPU devices within the payload function.
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Expand the supported_activities() function to include checks for torch.profiler.ProfilerActivity.HPU.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133975
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
Summary:
Blocking wait mode is not widely used, probably useful in debugging.
in blockingWait mode, we don't need to enable the watchdog thread to
check the timeout or nccl error because the main thread would throw an
exception if error happens and it is obvious to user which work fails
and its user's responsibility to handle the exception.
Test Plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #137799
Here is why we see using `CUDAEventCache` cause crash and data corruption.
1. The deleter is doing its job and append the job the stack.
2. In create, instead of getting a reference, we are getting a copy of eventsArray_[i] (which is a std::vector). This is bad because we didn't really remove the element from the stack. While we thought we already pop up the last one from the stack, but it turns out the last one is still in the stack; we end up reusing the same event again and again. What's worse, since we keep adding new events to the stack, this will eventually explode the stack and a crash happens.
Fix is easy, just get a reference. Local torchtitan run see a non-Nan loss.
Also we want to use a deque instead of a stack, and refactor the code a bit to make it more readable. (in a separate PR)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138040
Approved by: https://github.com/kwen2501, https://github.com/shuqiangzhang
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`, let me know if theres any concerns.
`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This is refactors the implementation into `Interleaved1F1B` and then removes it since it is confusing to have both schedules with similar names. This also refactors the zero bubble logic to belong in the `ZeroBubble` schedule class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
Summary:
This PR is trying to let users to know what exact collective call from the python thread is failing, and
customize their own error handling function, instead of watchdog thread crashing everything.
This is potentially very useful in fault tolerant training, in which we can have in-process restart.
E.g., when an nccl error is detected, users can potentially abort comms, re-init comms and go back to the previous check pointed step and try again, instead of crashing the whole job.
This is to allow users to check the status of each collective call,
using the ivalue::future libs in PT core. This also allows users to
attach its customized failure handling functions by:
work.get_future_result().then(erro_handling_func)
Note that the above call is also non-blocking for CPU thread
Test Plan:
Added a new test: test_get_future_result to verify the workResutl is
correctly propagated to the users
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799
Approved by: https://github.com/fduwjj, https://github.com/wconstab
```
Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.
Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers
Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.
Ideal scenario 0 - "mv" overlaps with the first shard_consumer:
stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:
stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer:
stream 0: [ shard_consumer ] [ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]
Suboptimal scenario 0 - "b" is scheduled after the first shard_consumer:
stream 0: [ shard_consumer ] [ cp ][ shard_consumer ]
stream 1: [ mv ] [b][ cp ][ shard_consumer ]
We haven't yet figured out a way to ensure "mv" and "b" are either
overlapped with or scheduled before the first shard_consumer. Thus, to
prevent suboptimal scenarios, we are giving up the chance to overlap "mv"
and "b" with the first shard_consumer for now.
```
This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends to not overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. The following is an example of pipelining PyTorch's cutlass-based, row-wise scaling fp8 kernel:
Before this PR:
<img width="298" alt="image" src="https://github.com/user-attachments/assets/81e0a7f4-18ee-47c6-b258-04fdaca7a6a2">
With this PR:
<img width="253" alt="image" src="https://github.com/user-attachments/assets/982de5a8-da1e-4a8f-b67e-c9c869b0a77f">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137850
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805, #137836
```
Parallelization strategy: every rank issues independent compute
-> barrier -> p2p copy sequences on two streams. In addition to
computation/communication overlapping, the strategy allows for
computation/computation overlapping, greatly reducing
quantization inefficiency.
Ideally, stream activities would look like this ("b" for
barriers, "cp" for p2p copies):
[rank 0]
stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
[rank 1]
stream 0: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
stream 1: [ chunk_producer ][b][ cp ][ chunk_producer ][b][ cp ]
Note that the barriers synchronize streams with the same ID
across ranks. They don't synchronize streams on the same rank.
Since the work on both streams is independent, there's no
guarantee that the chunk_producer from stream 0 or stream 1 will
be scheduled first. If there is a scheduling mismatch across
ranks, the barrier forces all ranks to wait for the slowest.
When scheduling mismatches occur among ranks, the stream
activities might look like this (note that p2p copies from
different streams cannot overlap with each other):
[rank 0]
stream 0: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ]
stream 1: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ]
[rank 1]
stream 0: [ chunk_producer ][b] [ cp ][ chunk_producer ][b] [ cp ]
stream 1: [ chunk_producer ][b ][ cp ][ chunk_producer ][b ][ cp ]
To prevent this, we need to ensure that the chunk_producer on
stream 1 gets scheduled first on every rank. Without access to
the underlying kernels, CUDA offers no API to control the
scheduling order of two independent, overlapping kernels. Our
solution is to issue a small sleep kernel in stream 0. The sleep
duration is insignificant, but having an extra task in stream 0
will almost guarantee that the chunk_producer on stream 1 gets
scheduled first. Once the first chunk_producer is scheduled in
the correct order, there's very little room for the scheduling
order of subsequent kernels to be inconsistent across ranks.
```
Currently, we perform stream synchronization to ensure scheduling order. The stream synchronization has no bearing on correctness, but prevents inconsistent scheduling orders across ranks.
Without the stream synchronization, ranks may have inconsistent scheduling order, and the barriers cause all ranks to wait for the slowest rank:
<img width="379" alt="image" src="https://github.com/user-attachments/assets/ffb97e76-7e19-4449-b121-83c32ec3e91d">
With stream synchronization, the inconsistent scheduling order issue is addressed, but we lose compute/compute overlapping (this is the state before this PR):
<img width="378" alt="image" src="https://github.com/user-attachments/assets/4cb76246-625f-4fc1-b49a-823ae46d3f23">
With this PR, we get both consistent scheduling order across ranks and compute/compute overlap:
<img width="327" alt="image" src="https://github.com/user-attachments/assets/51ab1bdc-4f60-46e0-b53c-6d208e2d4888">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137836
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805
This PR add support for `A_scale` to be row-wise scale. The op can automatically detect whether the row-wise scale is sharded or replicated. When the row-wise scale is sharded, the op would all-gather the scale in a pipelined fashion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137805
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to ipv4 for the remaining of the connection timeout.
The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.
Example failure:
```
socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```
From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
```
ETIMEDOUT
Timeout while attempting connection. The server may be
too busy to accept new connections. Note that for IP
sockets the timeout may be very long when syncookies are
enabled on the server.
```
Test plan:
CI for backwards compatibility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
### Fix 1: Throw async error during init wait
Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.
### Fix 2: Add wait after comm split
```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
// Reason 2:
// as of NCCL 2.23, the ptr value of child comm will not be filled until the
// state of parent comm is ncclSuccess. This may change in the future. See:
// https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialized of the descriptor parameters, and a creation via a static method instead of the class constructor.
Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.
Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
Related to #107302
The following tests failed in test_binary_ufuncs.py when testing with NumPy 2.
```
FAILED [0.0050s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_float32 - AssertionError
FAILED [0.0048s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_float32 - AssertionError
FAILED [0.0028s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_shift_limits_cpu_uint8 - OverflowError: Python integer -100 out of bounds for uint8
```
This PR fixes them.
More details:
* `test_shift_limits` failed because `np.left_shift()` and `np.right_shift()` no longer support negative shift values in NumPy 2.
* `test_scalar_support` failed because NumPy 2 changed its dtype promo rules. We special-cased the incompatible cases by changing the expected dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137937
Approved by: https://github.com/albanD
Related to #107302
`TestExport.test_exported_objects` in `test/torch_np/test_basic.py` is failing with NumPy 2.
The test is checking if all methods under `torch._numpy` exist in `numpy`.
However, some of them are removed in NumPy 2.
This PR fixes the issue by not checking the removed methods when running with NumPy 2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137814
Approved by: https://github.com/albanD
Intel GPU aten library(libtorch_xpu) utilizes `torchgen` to generate structure kernels. Currently, the generated structure kernels are decorated by `TORCH_API` to control the visibility, while `TORCH_API` is controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library naively. Because the macro not only serves for the `TORCH_API` semantic. It means that the semantic of `TORCH_API` is symbol `hidden`.
https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99
Therefore, we need to use ` TORCH_XPU_API` to decorate the produced structure kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
MOTIVATION
We recently verified some quantization tests on devices other than cpu (eg. CUDA and Intel Gaudi devices identified as 'hpu'). We noticed a device mismatch error as eps is a tensor created on cpu but other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the targeted _device_.
CHANGES
Move eps to _device_ of other tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Previously the decomposition would upcasts inputs to fp32. This led to a slowdown compared to eager which would run in fp16. We also tried keeping the bmm in fp16, and the upcasting for the epilogue but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.
Fix for https://github.com/pytorch/pytorch/issues/137897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
Fixes an issue where if the context arg is not provided, Dynamo would throw an arg mismatch error.
The skips are there because Dynamo would previously fall back to eager on those tests due to the arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137809
Approved by: https://github.com/drisspg
Summary: Tests Fix Clear On Fork by forking a process after a profile has already been done. Afterwards we check that all the PID/TID are as expected.
Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'
Differential Revision: D63992036
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes.
Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).
With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
This reverts commit e688b78791d01bd91614a61e57726c32beb46ee4.
We're reverting this because:
1) The original PR (#134872) fixed a bug but caused another one. The
assessment is that the bug it caused is worse than the bug it fixed.
2) it was reverted on the release 2.5 branch, so we want to prevent
divergence
3) The original author is out-of-office for a while so we don't want the
divergence to wait until they're back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137891
Approved by: https://github.com/Skylion007
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```
This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer, https://github.com/huydhn
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**
(also mentioned in here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552)
## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict =
{
'state': {
0: {'momentum_buffer': tensor(...), ...},
1: {'momentum_buffer': tensor(...), ...},
},
'param_groups': [
{
'lr': 0.01,
'weight_decay': 0,
...
'params': [0,1]
'param_names' ['layer.weight', 'layer.bias'] (optional)
}
]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.
## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.
## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.
#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.
## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.
## Solution Example:
A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict :
```python
def adapt_state_dict_ids(optimizer, state_dict):
# assuming a single param group.
current_state_group = optimizer.state_dict()['param_groups'][0]
loaded_state_group = state_dict['param_groups'][0]
# same number of params, same names, only different ordering
current_state_name_to_id_mapping = {} # mapping -- param_name: id
for i, name in enumerate(current_state_group['param_names']):
current_state_name_to_id_mapping[name] = current_state_group['params'][i]
# changing the ids of the loaded state dict to match the order of the given state dict.
for i, name in enumerate(current_state_group['param_names']):
loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]
return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initiated with `named_parameters()` to have the 'param_names' key in the dict.
### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:
```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```
For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097
The test tried to catch and ignore this, but this is Windows. So, the fix is to:
1. Ignore if these files couldn't be removed
2. Write them to a temp directory instead, otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
This PR adds a Selective Activation Checkpointing (SAC) Estimator, built on top of the `Runtime Estimator`, for estimating memory and recomputation time trade-offs.
It provides a `TorchDispatchMode` based context manager that estimates the memory and runtime trade-offs of functions or `torch.nn.Modules` for SAC, using the `Runtime Estimator` #134243 under the hood to support two estimation modes: 'operator-level-benchmark' and 'operator-level-cost-model' (roofline model). The SAC Estimator provides detailed statistics and metadata information for operators of each module, including greedy order for selecting operators to be recomputed/checkpointed and per-module trade-off graphs. This estimator is designed to be used under FakeTensorMode and currently supports estimation of compute time and memory usage."
It's inspired from: [XFormers SAC](https://github.com/facebookresearch/xformers/blob/main/xformers/checkpoint.py) by @fmassa
End-to-end example:
```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
ModelArgs,
Transformer,
)
if __name__ == "__main__":
dev = torch.cuda.current_device()
vocab_size = 8192
bsz, seq_len = 8, 1024
model_args = ModelArgs(
n_layers=4,
n_heads=12,
vocab_size=vocab_size,
max_seq_len=seq_len,
dim=768,
dropout_p=0.1,
)
with FakeTensorMode():
with torch.device(dev):
model = Transformer(model_args)
inp = torch.randint(
0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev
)
sace = SACEstimator()
with sace(estimate_mode_type='operator-level-cost-model'):
loss = model(inp).sum()
loss.backward()
sace.pwlf_sac_tradeoff_curve(n_segments=2, save_tradeoff_graphs=True)
sace.display_modulewise_sac_stats(depth=4, print_tabular=True)
```
Example AC Stats for one of the transformer layers:

Example AC Trade-off for one of the transformer layers:

Example AC Trade-Off graph one of the transformer layers:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135208
Approved by: https://github.com/weifengpy
With LTO(Link Time Optimization) enabled in CFLAGS, some compiler will optimize and strip the unwind_c function, which is caused by compiler that couldn’t resolve reference correctly, thus breaking the build with undefined reference in unwind_entry. Add an attribute to avoid this bad situation.
Fixes#121282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137862
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Builds upon #76951.
Benchmarking code is the same as in #76950.
AMD Ryzen Threadripper PRO 3995WX:
```
batch_size drop_last origin new speedup
------------ ----------- -------- ------ ---------
4 True 0.94 0.5706 64.74%
4 False 0.9745 0.9468 2.93%
8 True 0.7423 0.3715 99.82%
8 False 0.7974 0.5666 40.73%
64 True 0.5394 0.2085 158.76%
64 False 0.6083 0.2697 125.51%
640 True 0.5448 0.1985 174.41%
640 False 0.7085 0.2308 206.91%
6400 True 0.5554 0.2028 173.88%
6400 False 0.7711 0.2109 265.60%
64000 True 0.556 0.2091 165.82%
64000 False 0.7803 0.2078 275.58%
```
When `drop_last == True`, it uses `zip` to speed things up.
When `drop_last == False`, it uses `itertools` to speed things up.
`itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now.
Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423
Approved by: https://github.com/Skylion007
PyStructSequence is the C API equivalent for `collections.namedtuple` in Python. But they have different constructors:
```python
tuple = NamedTupleType(*args)
tuple = NamedTupleType._make(args)
tuple = StructSequenceType(args)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137776
Approved by: https://github.com/jansel
We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.
Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)
Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.
Test Plan: Like D64014936
Differential Revision: D64136277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).
These functions only had refs implementations, which was being the root cause of a
significant overhead ([issue][1]) when running `AdamW` optimizer step on PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:
- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)
[1]: https://github.com/pytorch/xla/issues/7923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
Summary:
# Latest Update
This diff is no longer needed because we did need the check to exist, to make meta behave the same as other devices, see D54526190.
---------------------------------
# Background
T176105639
| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good| failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |
Currently we are in case A. Users need to add `use_fp32_embedding` in training to force embedding bag dtype to be fp32. However, users actually hope for case B to use fp16 as the embedding bag weight. When deleting `use_fp32_embedding`, they would fail the [check](https://fburl.com/code/k3n3h031) that forces `weight dtype == per_sample_weights dtype ` in meta_registration.
The check is actually not necessary. Is it because the backend fbgemm does support case B. Additionally, later on in the `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to be in the same dtype (https://fburl.com/code/q0tho05h, weight is src, per_sample_weights is scale) for `is_fast_path_index_select`.
# This diff
Therefore, this diff remove the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in meta forward. With such, users are able to use fp16 to be the emb bag dtype without the need to force per_sample_weights the same dtype in meta forward (see Test Plan).
# Reference diffs to resolve this issue
Diff 1: D52591217
This passes embedding bag dtype to feature_processor to make per_sample_weights same dtype as emb bag weight. However, `is_meta` also needs to be passed because of case C. fbgemm still does not support per_sample_weights = fp16 (see the above table). Therefore users are forced to only make per_sample_weights fp16 when it is on meta. The solution requires too many hacks.
Diff 2: D53232739
Basically doing the same thing in diff 1 D52591217, except that the hack is added in TorchRec library. This adds an if in EBC and PEA for: when emb bag weight is fp16, it forces per_sample_weight fp16 too. However, it would then result in fbgemm issue too and has broken a bunch of prod models.
Test Plan:
# APS
The following command will run icvr_launcher which triggers ads_launcher and run forward in meta device:
```
buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false
```
Result:
{F1461463993}
Reviewed By: ezyang
Differential Revision: D54175438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774
Approved by: https://github.com/ezyang
This test is currently failing in trunk when memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970. When testing locally, calling `backward` on a masked tensor always causes memory leak until I clean up the data and the mask manually. This is probably related to this warning from masked tensor `UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()`, but I don't know much about the internal details here to go further. So, let's just fix the test first/
### Testing
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda
```
passes and doesn't warn about memory leak anymore.
The test itself came from https://github.com/pytorch/pytorch/pull/125262#issuecomment-2344068012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137815
Approved by: https://github.com/kit1980
Related to #107302
The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2.
This PR fixes all the corresponding test failures in `test_torch.py`.
1. The dtype of the return value `np.percentile` when passed a `torch.float32` tensor.
NumPy 1: Return value of `np.float64`.
NumPy 2: Return value of `np.float32`.
Solution: Enforce it with `.astype(np.float64)`.
2. The type of `np.gradient()` when returning multiple arrays.
NumPy1: A list of arrays.
NumPy2: A tuple of arrays.
Solution: Cast the tuple to a list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740
Approved by: https://github.com/ezyang
Updates all references to runner determinator workflow (`_runner-determinator.yml`) from current cloned version to main version.
This enables the team to push updates to this workflow, like fixing bugs or pushing improvements, and have it immediately be reflected on all open PRs. So avoiding potentially breaking situations, empowering moving fast and fast and simple recover in case of bugs.
From:
```
jobs:
get-label-type:
uses: ./.github/workflows/_runner-determinator.yml
```
To:
```
jobs:
get-label-type:
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes which is difficult and error prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates existing inference helper in PipelineStage constructor for several reasons: its problematic to have to reason about the stage submod being on the right device for shape inference
The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously
Currently, does not add a barrier after shape-inference, which essentially pipelines shape inference with the subsequent schedule action for that stage. If this complicates debugging, we could add in a barrier (it comes at a cost, but only during the first step).
Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.
This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.
Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output as opposed to it being closely coupled to the debug string codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
Follow up to https://github.com/pytorch/pytorch/pull/131936. In the original bisector you'd have to test inline if we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root causing issue. I've added `emulate_precision_casts` and aot_eager_decomp_partition cse as initial ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137346
Approved by: https://github.com/zou3519
This function computes a topological sort using a non-recursive implementation of DFS. Upon first reading, I thought it was using Kahn’s algorithm because it uses a variable called `queue`, but upon closer reading, I noticed this variable is actually used as a stack.
This pull request improves readability by renaming the stack and changing it from `std::vector` to `std::stack`.
Note: this also changes the backing store from an `std::vector` to an `std::deque`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130526
Approved by: https://github.com/alanwaketan, https://github.com/malfet
adds a `default` tag to experiment configurations, allowing to remove some experiments by default on the random draw:
```
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
default: false
---
```
and includes the configuration to filter what experiments are of interest for a particular workflow (comma separated):
```
get-test-label-type:
name: get-test-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
...
check_experiments: "awsa100"
```
The end goal, is to enable us to run multiple experiments, that are independent from one another. For example, while we still runs the LF infra experiment, we want to migrate other runners leveraging the current solution. A immediate UC is for the A100 instances, where we want to migrate to AWS.
Those new instances will during the migration period be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first confusing one.
```
jobs:
get-build-label-type:
name: get-build-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
...
get-test-label-type:
name: get-test-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
...
check_experiments: "awsa100"
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs:
- get-build-label-type
- get-test-label-type
with:
runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}"
...
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
...
]}
...
```
```
experiments:
lf:
rollout_perc: 50
awsa100:
rollout_perc: 50
default: false
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614
Approved by: https://github.com/malfet
Earlier the subgraphs were getting inlined into the output code. This PR lifts the subgraphs into a function, and then we just call the function in the output code.
This is the output code for test `test_cond_reintepret_view_inputs_outputs`
Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/
A relevant snippet from the above paste is
~~~
def false_graph_0(args):
false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
args.clear()
s0 = s0
with torch.cuda._DeviceGuard(0):
torch.cuda.set_device(0)
false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
# Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
stream0 = get_raw_stream(0)
triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
del false_graph_0_arg0_1
del false_graph_0_arg1_1
return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )
async_compile.wait(globals())
del async_compile
def call(args):
arg0_1, arg1_1, arg2_1, arg3_1 = args
args.clear()
s0 = arg0_1
assert_size_stride(arg1_1, (s0, 20), (20, 1))
assert_size_stride(arg2_1, (s0, 20), (20, 1))
assert_size_stride(arg3_1, (), ())
with torch.cuda._DeviceGuard(0):
torch.cuda.set_device(0)
buf0 = [None] * 2
buf0 = [None] * 2
if arg3_1.item():
# subgraph: true_graph_0
true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
(true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
buf0[0] = true_graph_0_buf0
buf0[1] = true_graph_0_buf1
else:
# subgraph: false_graph_0
false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
(false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
buf0[0] = false_graph_0_buf0
buf0[1] = false_graph_0_buf1
del arg1_1
del arg2_1
del arg3_1
buf1 = buf0[0]
buf2 = buf0[1]
del buf0
return (buf1, buf2, )
~~~
The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.
Note that this PR only works for python wrapper. For cpp wrapper, we need a lot of refactor to ensure that we don't duplicate the global variables in the outpute_code. So, for now, I fallback to the old way of inlining for cpp wrapper. I am hoping someone with more familiarity with cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).
This work will unblock hierarchical compilation (or cold start compile time work).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
This PR reduces the overhead on the CPU side by eliminating the use of c10::str in creating signatures. Instead we use fmt library. TunableOp overhead on the CPU are reduced by around ~40%. The improvement is most noticeable on small GEMMs. This PR does not contain any bug fixes or new features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135371
Approved by: https://github.com/jeffdaily
# Summary
I started to explore the performance of _scaled_mm against a triton-based persistent TMA kernel for RowWise scaling.
There are more details here: https://github.com/drisspg/transformer_nuggets/pull/36
It clearly showed that where was some room for improvement on larger problem sizes compared to triton's performance. Note that the triton kernel only has a 128x128x128 Tile shape, where scaled_mm has a 64, 128, 128 tile shape which we use for smaller problem sizes which may explain some of the perf delta for at smaller shapes.
This led to seeing if we can improve our triton codegen lowering for _scaled_mm (I think we should still do this: https://github.com/pytorch/pytorch/pull/137517).
In the meantime @Chillee suggested I make sure swizziling is set for the large matmul shapes
This PR makes sure that we increase the max_swizzle_size for the large matmuls.
## Performance
Note* Red means triton based tma beats _scaled_mm blue means _scaled_mm is faster
On Nighlty W/ Triton at (2ef33c6c4c3)

You can see that as M,K,N increase there is a clear win for the Triton Persistent TMA.
After this PR:

For example w/ this change(power limited gpu)
M=16384 K=16384 N=16384
TFlops Before :`985.49`
TFlops After: `1304.69`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137681
Approved by: https://github.com/eqy
Fixes#136722Fixes#136718
By default, it goes to onednn. So this PR adds a check to ensure stride > 0. Now program will quit with an error message if stride is 0.
FBGEMM and QNNPACK can create modules with stride=0 without error but program crashes when calling forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136739
Approved by: https://github.com/jgong5
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797
This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such).
One can find an example from: https://github.com/NVIDIA/nccl/issues/1013
## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.
## What's new in `_abort_process_group`?
It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). `Group abort` semantic is added in NCCL 2.22.
## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs).
In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
Enable FSDP to deal with channels_last memory formatted tensors. Preserving channels_last memory format makes FSDP compatible with the best kernels CUDNN offers.
Summary of changes:
1) Store strides information along with shapes
2) Replace calls to flatten() with as_strided(size=(param.numel(),), stride=(1,)) for flattening
3) Replace calls to view() with as_strided with the stored sizes and strides for unflattening
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
Resolve#137540
Summary:
We might get different state_dict and named_parameters result when the module has registered custom state_dict_hooks.
For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn-off the hooks when saving model's state_dict().
Previously, ExportedProgram uses nn.Module's state_dict() method to populate its own state_dict, but it doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically.
nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.
One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". In runtime, we have mod.layers[3].layer.sa_norm.scale.
But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.
Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly.
In internal Ads and RecSys models deployment, weight swapping is where they have one model that is currently being being deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it’s previously a requirement that the FQNs are matching between the exported and the eager model’s state dict.
The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs are matching between the exported’s state dict and the state_dict obtained from `_disabled_load_state_dict_hooks(M)` context manager. One benefit of having this new API is that we are now in full control within export of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so it's BC.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_export_for_training_with_state_dict_hooks
```
Differential Revision: D64080561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609
Approved by: https://github.com/angelayi, https://github.com/pianpwk
Move optimization from the export call to the `optimize()` method in ONNXProgram.
Users can call `optimize()` before calling `save()` to save the model. Right now if users set `optimize=True` in `torch.onnx.export` it will have the same effect as calling `optimize()`, but in the future we can evolve the method to be more flexible (e.g. target aware etc.)
Example
```python
onnx_program = torch.onnx.export(..., dynamo=True)
onnx_program.optimize()
onnx_program.save("model.onnx")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137667
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137666
Related to #107302
The following test fails with NumPy 2.
```
_________ TestNumPyInteropCPU.test_numpy_array_interface_cpu __________
Traceback (most recent call last):
File "/usr/local/google/home/haifengj/git/pytorch_np2/test/test_numpy_interop.py", line 415, in test_numpy_array_interface
wrapped_x = np.array([1, -2, 3, -4], dtype=dtype)
OverflowError: Python integer -2 out of bounds for uint8
To execute this test, run the following from the base repo dir:
python test/test_numpy_interop.py TestNumPyInteropCPU.test_numpy_array_interface_cpu
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
According to the official warning from NumPy 1, the assigning negative value to a `uint8` is deprecated.
The recommended way is to `np.array([1, -2, 3, -4]).astype(np.uint8)`
See the following for details.
```
>>> np.array([1, -2, 3, -4], dtype=np.uint8)
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -2 to uint8 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of -4 to uint8 will fail in the future.
For the old behavior, usually:
np.array(value).astype(dtype)
will give the desired result (the cast overflows).
array([ 1, 254, 3, 252], dtype=uint8)
```
This PR fixes the test failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137532
Approved by: https://github.com/soulitzer
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
### Context
Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.
In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))
buf812 = generate_example_value(
(64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```
The fix is to apply the fallback on each symbol.
### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda
========= Invalid __global__ write of size 2 bytes
```
Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
Currently, there are compilation warnings as below, which are resolved after the fix
```
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp: In function ‘ihipModuleSymbol_t* loadKernel(std::string, const string&, uint32_t, const std::optional<std::__cxx11::basic_string<char> >&)’:
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:482:25: warning: ignoring returned value of type ‘hipError_t’, declared with attribute nodiscard [-Wunused-result]
482 | hipDrvGetErrorString(code, &msg); \
| ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:519:5: note: in expansion of macro ‘CUDA_DRIVER_CHECK’
519 | CUDA_DRIVER_CHECK(hipModuleLoad(&mod, filePath.c_str()));
| ^~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:2369:12: note: in call to ‘hipError_t hipDrvGetErrorString(hipError_t, const char**)’, declared here
2369 | hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
| ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:399:3: note: ‘hipError_t’ declared here
399 | } hipError_t;
| ^~~~~~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137626
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571
* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths
I've done some comparisons against existing NEON implementation with SVE vectorization enabled route for `torch.compile`
Test results are for 8 cores on ARM Neoverse_V1
<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">
It's worth mentioning, for standalone `SiLU op` there's a `~1.8x` speedup with `torch.compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners however we saw a sudden large spike in cost once that happened last week that would have caused us to over use our available AWS credits. This change hardlocks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the format of the rocm rpm suite (the header is in include/rccl/rccl.h), however the source code is just src/rccl.h. Using the rccl/rccl.h will make us find the rpm's header but not the src code's header.
Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom data_loader.dataset.table_ds=[2024-09-04] data_loader.dataset.batch_size=512 max_ind_range=10
w/o this diff, it'll show 2.18 nccl version
Differential Revision: D62371434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.
Test Plan: fixed test
Differential Revision: D64081786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
It seems that there's a bug in `TensorMaker` - it would treat `storage_offset` as bytes when calculating the storage size, but as numel when setting the tensor `storage_offset`. This seems to be causing tensors returned by get_buffer() with non-0 offset to report wrong storage size.
Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
during auto_functionalize_v2 if we encounter a view such that size() stride() and storage_offset() matches the base
we create a view that is regenerated by calling aten.alias instead of as_strided for better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
Remove all the keyword static for constants of vec registers in exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR 56 threads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
Fixes#115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.
The PR improves the documentation and CMake's behavior for ROCM builds.
- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
The test also add singpost log for the benchmarks that pass.
to test run I ran python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv
results
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.
PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.
You can use the new reference expected result stored at path: out.csv.
a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` help us do that (thanks to a wait function inside), thus is preferred than directly using `ncclComm_`.
To prevent `ncclComm_` from being directly used outside, e.g. in `ProcessGroupNCCL`, we also move it as a private member of `NCCLComm` class -- the external-facing wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
synchronization
Summary:
Barrier is essentially intended to block CPU thread (instead of GPU
streams). Before we used 2 stream synchronizations (1. current stream
blocked by nccl stream end event, 2. CPU thread blocked on current
stream). This is unnecessary as we already have CPU thread blocking
logic in wait(). Also, adding barrier specific code block in the general
GPU synchronize() API is intrusive and confusing.
This PR cleans this.
Test Plan:
CI
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
- Previously the detection would fail before user calling APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. In this PR, we added explicit nvml initialization (which idempotent).
- Previously any nvml issue occurred in the detection logic would result in fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superceded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
Before this changes, tests for operators like `eye` or `triu_indices` were essentially a test that respective CPU operators are stable, as cpu_sample and mps_sample were the same
Moved the logic to `transform_opinfo_sample_to_mps` whicih in addition to copying tensors is also tweaks `kwargs`
Discovered that:
- `torch.randn` and `torch.randint` fall into the same undefined category
- `torch.logspace` is not implemented for MPS
- Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
- `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
- `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
This is a utility to aid the torch.compile debugging. You provide a function that returns True on success, False on failure, or do something out of process and run bisect_helper `good | bad`.
The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.
An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :
```
from torch._inductor.bisect_helper import BisectionManager
if op in CURRENT_DECOMPOSITION_TABLE:
if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
return NotImplemented
```
Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.
We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.
Fix for https://github.com/pytorch/pytorch/issues/126546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263
Follow ups: Do the same for maia and mtia
## Motivation
With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.
However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.
## Why is this ok?
The following comment on why numpy is necessary for serialization is legacy
c87c9f0a01/torch/_tensor.py (L303-L312)
We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content
**Instead what now happens is that CPU storage is constructed with data from the file **and then** moved onto backend device.**
Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.
```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
largest_dim = largest_dim_size = -1
for dim, dim_size in enumerate(param.shape):
if dim_size > largest_dim_size:
largest_dim = dim
largest_dim_size = dim_size
return Shard(largest_dim)
fully_shard(module, shard_placement_fn=shard_placement_fn)
```
## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.
The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.
Test Plan:
```
tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
# Create a list with same number of elements as world size (aka no. of ranks)
# During allgather this list is going to be populated with tensors from all ranks (aka all gather)
gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
# get collective from all ranks
if i <= 10 or RANK != 0:
dist.all_gather(gathered_tensors, tensor)
```
My script triggers flight recoder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283
Differential Revision: D61929401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.
In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as variable of the cell's actual value, rather than a `NewCellVariable`.
Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.
This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.
This fixes#136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into the `NestedUserFunctionVariable::closure` upon the function creation, at which point Dynamo always has the cell value in `symbolic_locals` for look up.
Fixes#136814.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0). (Size-like symbols: u22)
ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
if M <= 0 or N <= 0:
```
```
N = prod(inner_dims) # type: ignore[arg-type]
M = prod(outer_dims) # type: ignore[arg-type]
if M <= 0 or N <= 0:
return (
input.new_zeros(input_shape) if output_mask[0] else None,
input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
)
```
# changes
* use guard_size_oblivious since the new_zeros return is kind of optimization, shouldn't impact the correctness of the follow up code logic.
* the size `ret[i][j]` could be zero, so the change in V1 isn't valid
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```
# past
* found `u22` was introduced at
```
def _wait_impl(self) -> List[List[int]]:
# Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
if isinstance(self._splits_awaitable, dist.Work):
self._splits_awaitable.wait()
ret = self._output_tensor.view(self.num_workers, -1).T.tolist() # <------ u22 introduced here
if not torch.jit.is_scripting() and is_torchdynamo_compiling():
for i in range(len(ret)):
for j in range(len(ret[i])):
torch._check_is_size(ret[i][j]) # <---------- my question: why the _check_is_size isn't enough??
torch._check(ret[i][j] > 0) # <------ added by diff V1
```
Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```
# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
Differential Revision: D63424929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
Summary: Previously triton_reshape will generate code with `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand `Min` keyword to `<`.
Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```
Differential Revision: D63850158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
Fixes#137422
Add parameter type definition in API docs to clarify allowed value type, eliminate users pass `None` as `dim` value directly.
```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
The test runs all its combination (512) sequentially, so it takes more than 30 minutes to finish or timeout on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and aren't need to be marked as slow anymore.
Also, the test seems to run OOM on a 2xlarge with `std::bad_alloc` memory error. Maybe, this would also fix the issue (pending CI testing)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
Refine the corss-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default)
Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)
Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, there is the Inductor IR of weight
```
ReinterpretView(
StorageBox(
ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
),
FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the used weight was before permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.
Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.
(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)
For example:
```
class FeedForward(nn.Module):
def __init__(self, config: TransformerArgs) -> None:
super().__init__()
w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
self.w1 = parallelize_module(w1, Colwise)
self.w2 = parallelize_module(w2, Rowwise)
self.w3 = parallelize_module(w3, Colwise)
def forward(self, x: Tensor) -> Tensor:
y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
# y is a DTensor with Partial placement; we can return it as is.
return y
# Or we can convert it to Replicate -- there is modeling flexibility here.
return y.redistribute(Replicate())
with device_mesh:
model = FeedForward(config)
# Now model is a model parallelized onto device_mesh
y = model(x)
```
The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.
Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While is this correct, it might lead to passing non performant output strides to cublas.. I guess this is better than a copy ? Not sure. We could also introduce a Layout that denotes a Fixed shape and stride which we control allocation
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```
This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)
I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
Instead, callback to a missing handler when needed. This greatly speeds things up with the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.
```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
Partially addresses https://github.com/pytorch/pytorch/issues/128150
When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation. Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments. Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
update_hint_regression benchmark, before and after:
```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when profiler is exiting.
Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully
Differential Revision: D64010911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
Works around #136543.
This fix solves the issue only in the context of the ONNX exporter but this issue happens in other context.
The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
The test runs all its combination (512) sequentially, so it takes more than 30 minutes to finish or timeout on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and aren't need to be marked as slow anymore.
Also, the test seems to run OOM on a 2xlarge with `std::bad_alloc` memory error. Maybe, this would also fix the issue (pending CI testing)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.
Also, remove the need of a private synchronization function.
In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.
Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
Summary: It seems like the import path is different from FBCode & OSS. Wondering how to consolidate them.
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```
Differential Revision: D63991961
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While is this correct, it might lead to passing non performant output strides to cublas.. I guess this is better than a copy ? Not sure. We could also introduce a Layout that denotes a Fixed shape and stride which we control allocation
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch
ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```
## Details
In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1 <- start_idx, mid_idx
1 : 1 <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.
Example:
```python
import torch
@torch.compile
def foo():
x = torch.ones([10])
def bar():
y = x + x
torch._dynamo.graph_break()
z = y * x
return z
return bar(), bar
foo()
foo()
```
Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```
Fixes#118262.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts. (We don't trace through torch.* modules by default)
Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)
Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly not propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
Summary:
- Further added more types for debug value dumping.
- Add a test case for symint inputs for debug printer. in real prod model use cases, "unbacked symints" (those 'u0', 's0', etc.) can be helpful if we can examine their value.
Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```
Differential Revision: D63864708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.
Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency
See that it cache hits even with local cache removed.
Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj
New unit tests
Reviewed By: oulgen
Differential Revision: D63323958
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi
Test Plan: followup diff (D63837739)
Differential Revision: D63837740
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
This PR fixes an issue where when running `python setup.py develop`, the `open_registration_extension` self contained example will not build due to the following:
```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
When one process fails, others are immediately killed. This prevents other processes to do necessary cleanups, or dump debug information (in particular, the NCCL flight recorder).
This PR adds a grace period. Default behavior is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).
Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.
Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
currently FSDP2 support only CUDA, for other backends that need to use FSDP2 it won’t work as stream and events are based on CUDA. To support other backends, use
_get_device_handle by device type to get the class and use this
for stream and events.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
Summary: In unflatten, when we generate module calls when their signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.
Test Plan: added a test
Reviewed By: angelayi
Differential Revision: D63913750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
Fixes#136720
the result in this case says:
```
Traceback (most recent call last):
File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module>
result = torch.bincount(input)
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745
Approved by: https://github.com/Skylion007
Thanks @awgu for raising this issue and the small repro
From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push based on whichever input buffer's we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
This change modifies the `hipify_python.py` script to properly detect all directories, `include` and `ignore` paths during hipification process on Windows, by changing the path syntax convention to a UNIX-like one.
Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate for it by converting Windows paths to UNIX-like ones. By doing it so, the number of changes to the file is limited. Moreover this early-on unification allows for the rest of the code to have a battle-tested linux-like behaviour.
Another option would be to use `Path` object from `pathlib` to represent all paths in the script, however, it would impact a broader share of a code and would hence require a more meticulous evaluation in terms of non-altered logic and edge cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720
For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time.
To disable operator profiling users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```
Test Plan: Added test case.
Differential Revision: D61883224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
if the function is
```func(a, b, c) ```
and is called as
```func(a=1, b=.., c=..)```
before this change we do not iterate on the a, b, c, since those appear in kwargs this diff fix that issue.
This function is used in _inductor/ir.py to iterate over custom op arguments and when a custom pass does changes
and pass arguments as kwargs, we do not process them.
```
for info, arg in torch._library.utils.zip_schema(schema, args, kwargs):
handle_aliasing_and_mutation(info, arg)
```
Fix https://github.com/pytorch/pytorch/issues/137057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311
Approved by: https://github.com/zou3519
Fixes https://github.com/pytorch/pytorch/issues/136358
The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.
Before this PR, they have the following gc methods:
Tensor:
- tp_clear subtype_clear
- tp_traverse THPVariable_subclass_traverse
- tp_dealloc THPVariable_subclass_dealloc
TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc
The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
This PR relaxes the even sharding requirement for the all-gather extensions.
The `fsdp_pre_all_gather` now expects signature:
```diff
def fsdp_pre_all_gather(
self,
mesh: DeviceMesh,
+ outer_size: torch.Size,
+ outer_stride: Tuple[int, ...],
module: nn.Module,
mp_policy: MixedPrecisionPolicy,
) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible to return a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as a NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.
Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`
Differential Revision: D63802576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
Adds lowering for `aten.searchsorted`. This entails:
1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors.
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.
Closes#135873
Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
Summary:
as title
also change it in `prepare_pt2e()` docstring
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```
Differential Revision: D63345059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
Summary:
Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](762a7d197c/python/triton/runtime/autotuner.py (L106)) in the Trtion code includes those special configs (names and values) into the potential arguments to the kernel. [Here](762a7d197c/python/triton/runtime/jit.py (L613)) some of those may be included in actual kenrel arguments, given that their names are present among the kernel parameters.
This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves#136550.
Test Plan:
```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236
Approved by: https://github.com/zou3519
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.
I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.
We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.
Test Plan: test_export
Differential Revision: D63773534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.
Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
We didn't support multiple levels of vmap. The main problem is, during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched) like how our C++ batching rules do it.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).
These functions only had refs implementations, which was being the root cause of a
significant overhead ([issue][1]) when running `AdamW` optimizer step on PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:
- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)
[1]: https://github.com/pytorch/xla/issues/7923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.
Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257
Approved by: https://github.com/Chillee
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:
## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.
## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5) <-- no additional context yet
del work <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)
When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.
## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.
`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
This should be a no-op change, i.e. it runs the same code, but replaces verbose ObjectiveC invocation with helper function from OperationUtils.h, which this example already depends on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313
Approved by: https://github.com/atalman
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.
This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.
Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
As title, when testing on an internal case, we found that we have very similar output for the error when certain ranks does not join one collective. This is because we didn't put all ranks into `candidate_ranks` so that they didn't get wiped out from entries and gets checked again.
Ideally for the given case, we should report this is an out of order case, because rank 0, 1 calls all-to-all while all the rest ranks call all-gather-base. But when we select entries to compare, we don't have global view of the entries.
In the specific case, on rank 0 and 1, it has collective of PG 7 on entry 1130 with seq ID = 1130. However, on other ranks, they have collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use entry idx to do the match because if we later consider p2p, this assumption will collapse, so we now still defer it for users or further down debugging stream to figure it out. To make the message clearer, I also include both seqID and record_id (aka, entry index) in the message. (That does not mean this is not possible to implement in the code, for example, we can let all record_id to minus the maximum p2p seq id before it; but users will easily see the wrong order, so we don't think it's necessary to have that logic now)
P1626755348
Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs. It looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`"pytorch/remote_cache:autotune_memcache_version`), but many are read by the triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork which installs some functionality to destroy some singletons before forking and re-enable them after. This apporach works for the failing workload.
Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.
Differential Revision: D63736829
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison
When we populate unlifted graph module, we actually only "unlift" constant tensor inputs which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two constants. This PR fixes that by querying ep.constants table instead of ep.graph_signature.lifted_tensor_constants.
Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
Summary: This just adds a config option and JK for turning on remote AOTAutogradCache. It does not implement anything with the new options being passed in. That will come next diff.
This PR also changes the command for turning on the local AOTAutogradCache to be more consistent to that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE
Test Plan: Existing tests should pass and should build
Reviewed By: oulgen
Differential Revision: D63321965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612
Approved by: https://github.com/desertfire
ghstack dependencies: #135342
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.
A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`
The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.
```
class N(torch.nn.Module):
def forward(self, x, b):
if b:
return x + 1
else:
return x + 2
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.n = N()
def forward(self, x):
x0 = x + 3
x1 = self.n(x0, True)
x2 = x1 + 4
x3 = self.n(x2, False)
return x3 + 5
m = M()
inp = (torch.ones(1),)
print(m(*inp)) # tensor([16.])
ep = torch.export.export(m, inp)
print(ep.module()(*inp)) # tensor([16.])
unflattened = torch.export.unflatten(ep)
print(unflattened(*inp)) # tensor([15.])
```
However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.
The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.
However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.
A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.
Differential Revision: D63642479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
* Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo
* Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier
* Removes the dynamo_key field since it is no longer needed
* Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore
* Moves s3 resource initialization behind a function so import is faster
---
Tested by emitting a metric during run_test and seeing that documents got added to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799
Approved by: https://github.com/ZainRizvi
Fixes#129366
Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.
This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring working operation for both contiguous and non-contiguous NJTs,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
Summary:
Given an op, with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion, and skips inlining if it is so.
Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.
Fixes#120217
Here's a diagram of the issue:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
Fixes#130154
This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.
The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.
Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
this PR unblocks unit test with single Float8Linear module. It fixes following error
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish. Splitting them into smaller tests, using `parametrize`.
I guess this means this test file has some refactoring opportunities as well. Next time would be the parametrize the add functions.
Differential Revision: D63723925
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
For the reusable action checkout-pytorch, skips cleaning workspace when running from a container environment.
The motivation for this change is twofold:
* There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated;
* In the specific case of GITHUB_WORKSPACE, to enable sharing this between multiple containers, it need to be mounted with `-v`. This prevents the possibility of running `rm -r` and deleting this mount path;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168
Approved by: https://github.com/huydhn
Previously if we had a graph like:
```
triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```
The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:
(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)
(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)
Reviewed By: embg
Differential Revision: D63551393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
Summary:
# Context
Goal: Enable CK for Inductor in FBCode
We split this stack into three diffs to help with review & in case we need to revert anything.
# This Diff
* Gets us to have CK kernels as an option for GEMM autotuning in Inductor.
Reviewed By: zjing14
Differential Revision: D62662705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
'set_requires_grad' dict appears to be always full of "False" values,
and we always set requires_grad based on the value of 'has_backward'
setting of required_grad field was being repeatedly done during
get_fwd_recv_ops, but it should be done just once, so move it to the
function that creates recv buffers in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804
Approved by: https://github.com/kwen2501
Summary:
We skip the save_gpu_kernel if kernel is being saved already.
This would give us a more accurate Triton profiling result. The
following trace shows before/after the change for a benchmarking of a
trivial addmm:
Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">
After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">
We can see that before the change, the benchmarking includes two parts,
(1) The overhead of our triton_heuristic call, which includes the
save/get, and the (expensive) hash computation.
(2) The exact computation of Triton kernel.
We see that (1) accounts >50% of time, which makes kernel selection
for profiling choosing aten kernels over Triton kernels.
Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073
Approved by: https://github.com/desertfire
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).
With `torch._dynamo.compiled_autograd=True`, previously we are not able to set different `fullgraph` config value for forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.
With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
Fixes#127920
This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.
**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
431 | __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
| ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally
By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.
**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
`json.dumps(float("inf"))` returns `Infinity`, which is technically invalid json
This is fine if you json.load, but ClickHouse cannot handle it
Solution here: cast inf and nan to string (which ClickHouse is able to cast back to float)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877
Approved by: https://github.com/huydhn
Summary:
# Why
We want this to run internally
# What
- fix python path issue on the test
- reenable the test
# Background
(copied from similar issue resolved earlier)
It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark
Differential Revision: D63498897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876
Approved by: https://github.com/henrylhtsang
By even further reducing precisions of imprecise FP16 ops, introducing new BF16_LOW_PRECISION_OPS category and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the nature of low-precision results, is that MPSGraph, unlike the rest of the PyTorch does not do accumulation over fp32 for reduction operations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
Related to #107302.
When built and tested with NumPy 2 the following unit tests failed.
```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```
This PR fixes them. The test is now compatible with both NumPy 1 & 2.
Some more details:
1. The `np.linalg.solve` has changed its behavior. So I added an adapt function in the unit test to keep its behavior the same no matter it is NumPy 1 or Numpy 2.
2. The cause of the failure is when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2.
3. NumPy 2 does not allow `np.array(obj, copy=False)`, but recommended to use `np.asarray(obj)` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
# Motivation
This PR intends to make device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This make it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event.
And the next PR would like to remove previous useless base class `_StreamBase` and `_EventBase` to avoid multiple Inheritance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
inductor mutates the aot backward graph. a solution could be to copy the graph, but since we don't know if compiled autograd is applied or not, it would be expensive to always clone it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741
Approved by: https://github.com/jansel
ghstack dependencies: #135663
Summary: Problem is, when gpu is not big, we will omit the test cases in the test class. We expect the test to be skipped, but due to fbcode ci it can throw an error. This causes the test to be flaky.
Test Plan: ci
Differential Revision: D62037908
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055
Approved by: https://github.com/masnesral
Fixes#136440
**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.
**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.
The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0
This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.
**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.
We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
## The problem.
[A commit from three weeks ago](82d00acfee) appears to have broken five tests but was not caught by CI.
[A later commit](https://github.com/pytorch/pytorch/commit/e05ea2b1797) which added a decomposition of `transpose_copy` added another broken test, also seemingly not detected, making six total (listed below).
They came to my attention when I updated some pending decomposition pull requests which passed CI, and started getting failures like [this](https://hud.pytorch.org/pr/134319) for a test unrelated to any of these pull requests, `TestCommonCPU.test_out__refs_transpose_copy_cpu_float32`
Running `python test/test_ops.py -k _copy` on `viable/strict` found failures for six `_refs` ops: `copysign`, `expand_copy`, `index_copy`, `t_copy`, `transpose_copy`, `view_copy`
## The solution
The original commit did actually cause breakage by slightly changing user-visible behavior (in a special case involving scalar tensors being copied between different devices).
This pull request fixes that breakage in a reasonable way, but I don't understand why this error didn't appear in CI until I made later changes in the same area.
## To reproduce
To reproduce the six cases in your own client:
```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=5 python test/test_ops.py TestCommonCPU.test_out__refs_view_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=2 python test/test_ops.py TestCommonCPU.test_out__refs_t_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_index_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/test_ops.py TestCommonCPU.test_out__refs_expand_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_copysign_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=4 python test/test_ops.py TestCommonCPU.test_out__refs_transpose_copy_cpu_float32
```
@amjames
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136653
Approved by: https://github.com/zou3519
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.
Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.
(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)
For example:
```
class FeedForward(nn.Module):
def __init__(self, config: TransformerArgs) -> None:
super().__init__()
w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
self.w1 = parallelize_module(w1, Colwise)
self.w2 = parallelize_module(w2, Rowwise)
self.w3 = parallelize_module(w3, Colwise)
def forward(self, x: Tensor) -> Tensor:
y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
# y is a DTensor with Partial placement; we can return it as is.
return y
# Or we can convert it to Replicate -- there is modeling flexibility here.
return y.redistribute(Replicate())
with device_mesh:
model = FeedForward(config)
# Now model is a model parallelized onto device_mesh
y = model(x)
```
The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.
Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:
(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:
(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)
I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.
In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
To see the payoff, look at test/dynamo/test_logging.py
The general idea is to refactor produce_guards into produce_guards_verbose which also returns verbose code parts, which have our annotations.
The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917
Approved by: https://github.com/anijain2305
1. example of failing diff
https://github.com/pytorch/pytorch/pull/136740
2. test this by running
python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv
results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side. Still better set the CMake variable properly.
* Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivatin of the PR before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazels.
* Adds a bit to compile time and code size, but well worth it considering it speeds up our internal flash attention significantly on H100s at the cost of some minor additional compile time.
* These kernels and settings will be needed for Flash Attention 3 whenever we add that too.
Fixes#133695
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
Summary: In D60803317, we added CompileContext (trace_id) information to Kineto traces using caching when a CompileContext exits. As pointed out by some users, this gives innaccurate IDs because we are not getting the context that we is being looked up within the eval_frame. For this reason, we decided to revert that change, and go with an approach that involves getting the trace_id associated with a given CacheEntry. To do this, we add a trace_id to the GuardedCode so that it can be passed onto a CacheEntry. Then, we change the lookup function to return said trace_id alongside the code so that we can pass both into our eval function. Once we get to a Torch-Compiled Region, we can just append the context information to the name of the annotation thus bypassing any need for kwargs.
Test Plan: Added more comprehensive unit test. Saw that all the trace_ids appeared within the graph.
Differential Revision: D63138786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136460
Approved by: https://github.com/ezyang
This is to avoid cache confusion between normal vs pydebug vs nogil builds in cpp extensions which can lead to catastrophic ABI issues.
This is rare today for people to run both normal and pydebug on the same machine, but we expect quite a few people will run normal and nogil on the same machine going forward.
This is tested locally by running each version alternatively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136890
Approved by: https://github.com/colesbury
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.
However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).
This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
int8_t = DeviceIndex is interpreted by cout as a char, which then shows up as a control character in logs (eg. ^A) etc.
Explicitly casting to int to have the numbers printed out correctly.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135405
Approved by: https://github.com/wconstab
Fixes#134714 (or attempts to, idk how to test yet)
For posterity, how one can test:
1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary
2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python
3. import torch
4. torch.set_num_threads(1), make sure this does not trigger any additional threads getting created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793
Approved by: https://github.com/albanD
Summary:
This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.
This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.
```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()
A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)
dense_result = torch._scaled_mm(
A_fp8, B_fp8,
scale_a=A_scale, scale_b=B_scale,
out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
A_fp8_sparse, B_fp8,
scale_a=A_scale, scale_b=B_scale,
out_dtype=out_dtype
)
```
Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.
I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner
Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397
Approved by: https://github.com/drisspg
Summary: Previously is_fbcode just checked whether the checkout was git or not. This is extremely error prone. Lets make it fool-proof.
Test Plan: unit tests
Reviewed By: masnesral
Differential Revision: D63545169
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
Before that attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
Resulted in hard error
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a nice type error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
Fixes https://github.com/pytorch/pytorch/issues/136494
Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fallback gracefully.
- In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device.
- Before initializing multicast for a block, ensure all ranks/devices have multicast support.
- This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block.
- Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577
Approved by: https://github.com/Chillee, https://github.com/eqy
- also makes scales and zp dtype reconcile with meta impl as well as other
quantized ops representation of scales and zero point
- make sure qunatize_per_token's output_dtype is respected
There are a few places where we need to reconcile on scale and zero point dtype
but that will come later. This fixes are mainly being done to enable quantized
kv cache though ET stack
Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807
Approved by: https://github.com/jerryzh168
Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620.
This reverts back to `assume_static_by_default=True` with the use of dynamo decorators (e.g. `maybe_mark_dynamic, mark_static`, instead) for handling Dim.AUTO & Dim.STATIC instead. This is easier to maintain, as it doesn't requiring reasoning about "inverting" the dynamic_shapes specs, and also opens up usage of other decorators (`mark_dynamic, mark_unbacked`).
On the user side this change has no effect, but internally this means dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators following https://github.com/pytorch/pytorch/pull/135536), but transferring this information for _DimHints via decorators, for Dynamo/non-strict to create symbolic_contexts accordingly, e.g. 7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)
One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked.
Differential Revision: D63358628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591
Approved by: https://github.com/avikchaudhuri
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.
This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events.
Test Plan: Run benchmark, see that the times are in the chromium events.
Reviewed By: aorenste
Differential Revision: D62590077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.
However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).
This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
I think this could help many teams, especially compile/export teams (/cc @ezyang), to let end user/bug reporters to quickly test WIP PR when reporting a related bug.
This could quickly run in an official nightly Docker container or in a nightly venv/coda env.
Let me know what do you think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136535
Approved by: https://github.com/ezyang
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing.
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858
Approved by: https://github.com/atalman
Original Issue: https://github.com/pytorch/pytorch/issues/134644
We assume trace_tangents to have the same memory_format as inputs, outputs, intermediate during first tracing.
=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format
Runtime:
- Coerce tangents to tracing memory format from metadata
Subclasses logic:
- Previously coercing tangents logic did not handle nested subclasses case, fixing this.
For Subclasses we deduce memory format for subclass_tensor first, then for each element of subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]
If subclass element (__tensor_flatten__[0] tensors) is also subclass => on its place we will have a nested list of the same structure.
The recursive traversal of subclass tree is expensive. So we do memory format deduction and coercing at the same time, to keep only one traverse for this. With this approach there is no regression in comparison with previous logic which also does one traversal. (`coerce_tangent_and_suggest_memory_format` method).
Other small change:
Remove duplicated not-related comment.
Testing
```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```
Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```
No siginificant changes in processing time.
(We do single traverse of subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
Fixes the max-autotune failure of `soft_actor_critic` of Torchbench in FP32 single thread dynamic shape case:
```log
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
C_ptr = f"&({kernel.index(C, [0, 0])})"
File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
else self.args.input(node.get_name())
File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```
The 1st and 2nd linear does not need to use local buffer while the 3rd linear needs to use local buffer.
The 3rd linear which uses local buffer will add its global buffer (named as `buf_GemmOut`) into `V.graph.removed_buffers`.
When scheduling the nodes, the 1st linear (won't use local buffer) will get its output buffer (also named as `buf_GemmOut`) from the input and found that it's in the `V.graph.removed_buffers` and raise AssertionError. The issue is that the output buffer of all these linears are all names with `buf_GemmOut`, which have a conflict.
Rename these buffers by adding the name of the `template_buffer` as the prefix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
Cleanup:
1/ We do not need to unwrap_subclasses() in freezing wrapper, as it will be wrapped by AOTD wrappers which inclused SubclassesWrapper
2/ No need to use weakreferences for unwrapped list, dynamo optimizers need to clean unwrapped list along with original params_flat.
Verfified fbcode tests compiled_optimizers
Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
Fixes#135439
This PR adds support for the `is_inference` method on torch tensors which successfully compiles the following example fn without graph breaks:
```python
def fn_simple(x):
if x.is_inference():
return x.sum()
else:
return x.min()
```
I've also tried to add guards on the tensor to guard against `is_inference`. I wasn't 100% sure where these should go so please don't hesitate to correct me.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136450
Approved by: https://github.com/ezyang
if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null;then
echo""
echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"
echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"
echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions. To apply suggested patches automatically, use the -a flag. Before pushing another commit,\e[0m"
echo -e "\e[1m\e[36mplease verify locally and ensure everything passes.\e[0m"
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.