pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-25 08:11:06 +08:00

Author	SHA1	Message	Date
andrewor14	24e35f0c37	Switch batch norm stack to consolidated ops Summary: This commit switches `aten.batch_norm` to call the new `batch_norm_with_update` and `batch_norm_no_update` ops, instead of the old `_batch_norm_impl_index` op. The new stack is "consolidated" in the sense that there is a single backend agnostic op that will internally pick the right kernel based on the backend, but this detail will be hidden away from the user and from the model graph. ghstack-source-id: 518baff49c66aeccd4d28a50522fe104bc323d1b Pull Request resolved: https://github.com/pytorch/pytorch/pull/119496	2024-07-24 15:03:37 -07:00
Andrew Gu	23ae6e2eb3	[FSDP2] Removed state dict error for HSDP (#131320 ) Fixes https://github.com/pytorch/torchtitan/issues/441#issuecomment-2241288906. This PR avoids raising the 2D state dict error for HSDP, which does not depend on strided sharding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131320 Approved by: https://github.com/wanchaol, https://github.com/weifengpy	2024-07-22 19:23:17 +00:00
Mikayla Gawarecki	d3556786b8	Blocklist certain modules for weights_only load (#131259 ) Also bold certain text in the error message as suggested <img width="3000" alt="Screenshot 2024-07-19 at 5 56 48 PM" src="https://github.com/user-attachments/assets/378f20c5-c6b2-4e53-8eaf-0bd26c3a6b60"> With a GLOBAL like `os.execv` the error message is now as such ```python File "/data/users/mg1998/pytorch/torch/serialization.py", line 1256, in load raise pickle.UnpicklingError(_get_wo_message(str(e))) from None _pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. Trying to load unsupported GLOBAL posix.execv whose module posix is blocked. Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131259 Approved by: https://github.com/malfet, https://github.com/albanD	2024-07-22 18:23:21 +00:00
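The blocklist idea in the commit above can be illustrated with the standard library alone. This is a minimal sketch of the general technique (overriding `pickle.Unpickler.find_class`), not the code in `torch/serialization.py`; the names `BLOCKED_MODULES`, `BlocklistUnpickler`, and `restricted_loads` are hypothetical, and the module list is an assumption chosen for illustration.

```python
import io
import os
import pickle

# Hypothetical blocklist for illustration; not the list PyTorch uses.
BLOCKED_MODULES = {"os", "posix", "nt", "subprocess", "sys"}


class BlocklistUnpickler(pickle.Unpickler):
    """Refuse to resolve any GLOBAL whose top-level module is blocked."""

    def find_class(self, module, name):
        root = module.split(".")[0]
        if root in BLOCKED_MODULES:
            raise pickle.UnpicklingError(
                f"GLOBAL {module}.{name} rejected: module {root} is blocked"
            )
        return super().find_class(module, name)


def restricted_loads(data: bytes):
    """Unpickle `data`, rejecting references into blocked modules."""
    return BlocklistUnpickler(io.BytesIO(data)).load()


# Plain containers never reach find_class, so they load normally.
safe = restricted_loads(pickle.dumps({"step": 3, "lr": 0.1}))

# A function from os is pickled by reference (module posix or nt) and is refused.
try:
    restricted_loads(pickle.dumps(os.getcwd))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

A blocklist on its own only rejects modules someone thought to list, so it is usually paired with an allowlist of accepted types.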
William Wen	93ef2e53f8	[3.13, dynamo] support FORMAT_SIMPLE/FORMAT_SPEC (#130751 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130751 Approved by: https://github.com/Skylion007 ghstack dependencies: #130566, #130567, #130568, #130569	2024-07-22 18:07:40 +00:00
William Wen	375a4d7e9e	[3.13, dynamo] decompose fused load/store instructions (#130569 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130569 Approved by: https://github.com/jansel ghstack dependencies: #130566, #130567, #130568	2024-07-22 18:07:40 +00:00
William Wen	157f38bc4d	[3.13, dynamo] support STORE_FAST_LOAD_FAST (#130568 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130568 Approved by: https://github.com/jansel ghstack dependencies: #130566, #130567	2024-07-22 18:07:35 +00:00
William Wen	1e116c7a1e	[3.13, dynamo] fix END_FOR (#130567 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130567 Approved by: https://github.com/jansel ghstack dependencies: #130566	2024-07-22 18:07:32 +00:00
William Wen	4319147ca9	[3.13, dynamo] fix closures, MAKE_FUNCTION, LOAD_CLOSURE; support SET_FUNCTION_ATTRIBUTE (#130566 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130566 Approved by: https://github.com/jansel	2024-07-22 18:07:28 +00:00
PyTorch MergeBot	44e689d947	Revert "[TD] More synonyms, new heuristic for test_public_bindings (#130397 )" This reverts commit d8a35d57220cdd5ed2fe52c02bb1f78cc0b3c75b. Reverted https://github.com/pytorch/pytorch/pull/130397 on behalf of https://github.com/clee2000 due to broke lint, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130397#issuecomment-2243518651))	2024-07-22 18:03:22 +00:00
Xiaodong Wang	56bb047449	[pt2] Increase dynamo/inductor default log level to info (#131311 ) Summary: Keep the logs from being too verbose. Test Plan: CI Differential Revision: D60028647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131311 Approved by: https://github.com/oulgen	2024-07-22 17:33:29 +00:00
Catherine Lee	d8a35d5722	[TD] More synonyms, new heuristic for test_public_bindings (#130397 ) test_public_bindings should be run on anything that changes the public API - need to figure out in the future what is part of the public api, currently I'm using anything in torch/ flex_attention should be run on anything involving autograd Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397 Approved by: https://github.com/malfet	2024-07-22 17:06:00 +00:00
PyTorch MergeBot	b9912f31ef	Revert "[export] fix zero arg export in training_ir (#130990 )" This reverts commit 50436d5bdb5d2e29307a0c0bcfcce8d7e2da82c0. Reverted https://github.com/pytorch/pytorch/pull/130990 on behalf of https://github.com/clee2000 due to failing some executorch and torchrec tests internally D60006710 ([comment](https://github.com/pytorch/pytorch/pull/130990#issuecomment-2243395316))	2024-07-22 16:49:25 +00:00
zdevito	32c2f84e34	Support IPC for Expandable Segments (#130890 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890 Approved by: https://github.com/dsjohns2 ghstack dependencies: #130888, #130889	2024-07-22 16:15:01 +00:00
Henry Tsang	0246b28510	[aoti] refactor aoti_torch__scaled_mm and skip aoti fp8 test for some cases (#130868 ) Continuing https://github.com/pytorch/pytorch/pull/128683 and https://github.com/pytorch/pytorch/pull/130582. The api of _scaled_mm has changed. For example, there is only one return now. So change the aoti api as well. Also, tested the fp8 tests offline. The test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface would fail with `error: use of undeclared identifier 'float8_e4m3fn'` and `error: use of undeclared identifier 'half'`, so skipping them for now. The reason this wasn't known earlier is probably because the CI doesn't use H100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130868 Approved by: https://github.com/drisspg, https://github.com/chenyang78, https://github.com/desertfire	2024-07-22 15:24:20 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
Miguel Perez	5c78581fc9	Fix documentation for tensor.repeat. (#131195 ) Fixes #130930. Adjusts the documentation which used `sizes` instead of `repeats`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131195 Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer	2024-07-22 14:48:18 +00:00
PyTorch MergeBot	26383a6cc0	Revert "Added and_masks and or_masks utilities (#131073 )" This reverts commit 92bb323d36adca097c44a2fc8d9f0d574214d801. Reverted https://github.com/pytorch/pytorch/pull/131073 on behalf of https://github.com/albanD due to The docs build fails here and in trunk ([comment](https://github.com/pytorch/pytorch/pull/131073#issuecomment-2242997958))	2024-07-22 13:44:55 +00:00
Thanh Ha	3eb9fa5d58	Add support for using LF Canary runners (#131188 ) The script is updated such that if a canary build is detected and the label_type is LF runner it will run on an LF Canary runner. Closes pytorch/ci-infra#245. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131188 Approved by: https://github.com/ZainRizvi	2024-07-22 13:26:46 +00:00
eqy	69e2590490	Fix MKLDNN check in `test_aot_inductor.py` (#130982 ) `torch.ops.mkldnn._is_mkldnn_bf16_supported()` assumes MKLDNN is on the system which isn't the case for e.g., some ARM system configurations CC @tinglvv @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/130982 Approved by: https://github.com/malfet	2024-07-22 11:58:18 +00:00
chilli	92bb323d36	Added and_masks and or_masks utilities (#131073 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131073 Approved by: https://github.com/drisspg ghstack dependencies: #130871, #130904	2024-07-22 11:48:03 +00:00
PyTorch UpdateBot	68df24f9b6	[xla hash update] update the pinned xla hash (#126672 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126672 Approved by: https://github.com/pytorchbot	2024-07-22 11:35:36 +00:00
Wang, Eikan	6d65a2c3f4	[3/N] Non-Tensor: Support string parameter for aten operations (#125831 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-07-22 09:42:35 +00:00
xinan.lin	8da19fec60	[Inductor] Support store SPIR-V binary file output from Intel Triton. (#130849 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130849 Approved by: https://github.com/peterbell10, https://github.com/EikanWang	2024-07-22 05:59:03 +00:00
albanD	2820e1d9f8	Update CPython support policy (#130989 ) Update as specified in the RFC that was accepted: https://github.com/pytorch/rfcs/blob/master/RFC-0038-cpython-support.md Pull Request resolved: https://github.com/pytorch/pytorch/pull/130989 Approved by: https://github.com/seemethere	2024-07-22 05:29:07 +00:00
Florian	1614891946	[Profiler] exclude gpu_user_annotation when accumulating cuda time total (#130733 ) Fixes #[130730](https://github.com/pytorch/pytorch/issues/130730) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130733 Approved by: https://github.com/aaronenyeshi	2024-07-22 04:35:21 +00:00
Nikita Shulga	c2425a3b57	[BE] Use `_linux-build.yml` instead of `-linux-build-label.yml` flavor (#130762 ) It was also introduced during the ARC experiment and supposed to be a temporary thing. Fix `use_split_build` option handling in `_linux_build.yml` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130762 Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/jeanschmidt	2024-07-21 23:17:17 +00:00
Tom Ritchford	500cbb5b90	Add decomposition for view_copy (#130938 ) * Extracted from #128416 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938 Approved by: https://github.com/peterbell10 ghstack dependencies: #130937	2024-07-21 20:39:24 +00:00
Tom Ritchford	f628813066	Fix out_wrapper, _make_copy_from_view to handle all signatures (#130937 ) * See #128416 and #129476 * Simplify xskip lists in test/functorch/test_ops.py * Add supports_out=True to OpInfos for copy ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/130937 Approved by: https://github.com/peterbell10	2024-07-21 20:39:24 +00:00
Aaron Orenstein	b193894b94	FakeTensor cache SymInt support (#127596 ) Adds support for SymInts in the FakeTensor cache. A couple notes: 1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain. 2. We need to be careful that when we cache an output SymInt that is a direct copy from the input that when we have a cache-hit we copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph so constructing a same-content-but-different-id SymNode will fail. 3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt do not). 4. In the cache entry we store SymInts as a _SymIntOutputStub. 
Perf example: ``` python benchmarks/dynamo/timm_models.py --ci --accuracy --timing --explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda --training --amp --total-partitions 2 --partition-id 0 --output /tmp/training_timm_models.csv --filter crossvit_9_240 ``` fake tensor cache before: ``` INFO: FakeTensor cache stats: INFO: cache_hits: 68137 INFO: cache_misses: 837 INFO: cache_bypasses: INFO: symbolic shape: 48224 INFO: CompositeImplicitAutograd: 917 INFO: non-fake tensor: 70 INFO: non-FakeTensor output: 62 INFO: non-builtin: 8 INFO: dynamic output shape: 1 ``` and after: ``` INFO: FakeTensor cache stats: INFO: cache_hits: 88187 INFO: cache_misses: 14233 INFO: cache_bypasses: INFO: CompositeImplicitAutograd: 1037 INFO: non-FakeTensor output: 602 INFO: non-fake tensor: 70 INFO: unsafe view: 36 INFO: non-builtin: 8 INFO: dynamic output shape: 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596 Approved by: https://github.com/eellison ghstack dependencies: #131014, #129780	2024-07-21 19:26:38 +00:00
Aaron Orenstein	ebce85172e	FakeTensor cache SymInt support: flatten cache key (#129780 ) This is part of #127596, pulled out to make reviewing a little easier. Flatten the FakeTensor cache key - so it's a list of singular elements and pointing at one requires a single index rather than a PyTree path. This is used in the next PR to allow us to have the cache entry refer to an input SymInt that it needs to copy directly into the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129780 Approved by: https://github.com/oulgen, https://github.com/eellison ghstack dependencies: #131014	2024-07-21 19:26:38 +00:00
Aaron Orenstein	f3562e2cdc	backport dataclass(slots=True) (#131014 ) Python 3.10 adds `@dataclass(slots=True)` to auto-build the `__slots__` for a dataclass. This is really useful but we can't use it until 3.10 becomes our minimum version. Copied the code for that functionality from python into a new decorator and ported it to use 3.8 syntax (removed use of `match`). Usage: ``` @dataclass_slots @dataclass class X: pass ``` is the same as (in py3.10): ``` @dataclass(slots=True) class X: pass ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131014 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-07-21 19:26:31 +00:00
Xuehai Pan	1439bd3c9c	[Easy][pytree] enable CXX pytree under `torch::deploy` (#130144 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130144 Approved by: https://github.com/zou3519 ghstack dependencies: #130895, #130139	2024-07-21 07:36:22 +00:00
Animesh Jain	ddde9dd25c	[dynamo][automatic_dynamic] Trigger dynamism on stride changes (#130232 ) Fixes https://github.com/pytorch/pytorch/issues/129798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130232 Approved by: https://github.com/ezyang	2024-07-21 03:45:54 +00:00
Chuanhao Zhuge	e506dfa640	[dynamo] Add a JK kill switch for disabling compile (#131258 ) Summary: The JK disables dynamo by passing None to set_eval_frame. Test Plan: Ran buck test mode/opt caffe2/test/dynamo:test_dynamo Buck UI: https://www.internalfb.com/buck2/1fec33b4-c95a-4bdf-b47b-7c0b8ab9e24a Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814750010105363 Network: Up: 0B Down: 0B Jobs completed: 9596. Time elapsed: 28:54.5s. Tests finished: Pass 4796. Fail 0. Fatal 0. Skip 17. Build failure 0 Also manually write a small local test with torch.compile and toggles the code to see if PT2 can be disabled. Validated with running the test and observing the log. PT2 enabled: P1486847242. Can see dynamo log about graph breaks. PT2 disabled: P1486847727. No dynamo log. The newly added warning printed. Reviewed By: ezyang Differential Revision: D59968925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131258 Approved by: https://github.com/c00w	2024-07-21 01:22:31 +00:00
cyy	1d1d074072	[3/N] Fix Wunused-parameter warnings (#131271 ) Follows #131170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131271 Approved by: https://github.com/ezyang	2024-07-20 23:31:03 +00:00
Shan19900305	d57af32e63	Fix undefined tensor error in _copy_from_and_resize when fallback to cpu. (#130237 ) 1) Skip undefined tensors in the CPU fallback when calling _copy_from_and_resize; 2) Modify the to_cpu function to support optional tensors; 3) Copy back to the original optional tensor when alias_info isWrite is true. @ezyang @bdhirsh Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/130237 Approved by: https://github.com/ezyang	2024-07-20 23:12:17 +00:00
Tristan Rice	13283fb4bc	[distributed] test_store: remove flaky bind test (#131262 ) Fixes https://github.com/pytorch/pytorch/issues/131084 There's no good way to fix this since some tests environments can bind the protected range. Removing test since the value is relatively low since it's just testing error messages. Test plan: ``` python test/distributed/test_store.py -v -k address ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131262 Approved by: https://github.com/mori360, https://github.com/XilunWu	2024-07-20 23:04:31 +00:00
Anshul Sinha	407c87a32c	[debug][dtensor] fixed updating current module (#130995 ) Summary Fixed issue with updating the current module when transitioning between child module to parent module and in the backward pass. The first issue is caused because the prehook is not called again when we go back to the parent module and that the hook being used was a register_module_forward_hook, which runs before the register_module_hook used in redistribute, causing the collective call to be assigned to the incorrect module. In order to do this, I updated the current module to be the parent module in a register_forward_hook in the module tracker. The second issue was caused by the parent set in the module tracker I inherit from being incorrect. I fixed this issue by saving the parents of each module and using them in collective counter instead of the incorrect set. I have updated the example in module_operation_tracing to reflect the correct output. In addition, I changed the test cases that used the incompatible old CommDebugMode. Test Case 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 2. pytest test/distributed/_tensor/debug/test_comm_mode_features.py -s -k test_transformer_module_tracing 3. python test/distributed/_composable/fsdp/test_fully_shard_training.py -k TestFullyShardGradientAccumulation.test_gradient_accumulation 4. python test/distributed/_tensor/test_math_ops.py -k DistMathOpsTest.test_layer_norm_bwd Pull Request resolved: https://github.com/pytorch/pytorch/pull/130995 Approved by: https://github.com/XilunWu ghstack dependencies: #130410	2024-07-20 20:57:29 +00:00
Peter Bell	33f036a6f7	[inductor] Kill mark_node_as_mutating (#130834 ) Resubmit of #129346 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834 Approved by: https://github.com/lezcano ghstack dependencies: #130831, #130832, #130833	2024-07-20 18:53:33 +00:00
Nikita Shulga	fccbe85475	[BE] Improve CUDA UpSample error message (#131252 ) `Expected grad_output.numel() <= std::numeric_limits<int32_t>::max() to be true` is not very helpful, it's better to mention method name as well as actual tensor size This error was reported in https://github.com/pytorch/pytorch/issues/131185 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131252 Approved by: https://github.com/albanD	2024-07-20 16:49:34 +00:00
PyTorch UpdateBot	a7a951a4ae	[executorch hash update] update the pinned executorch hash (#130001 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Co-authored-by: Huy Do <huydhn@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001 Approved by: https://github.com/pytorchbot	2024-07-20 16:44:07 +00:00
Xuehai Pan	b6d477fd56	[BE][Easy][16/19] enforce style for empty lines in import segments in `torch/_i*/` (#129768 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768 Approved by: https://github.com/jansel	2024-07-20 16:20:58 +00:00
Soumith Chintala	8e478d4fb1	Add Alban and Piotr into Core Maintainers (#130903 ) See official announcement here: https://dev-discuss.pytorch.org/t/alban-desmaison-and-piotr-bialecki-are-now-pytorch-core-maintainers/2280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130903 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-07-20 16:02:42 +00:00
hongxyan	637ab85e7f	fix for launching kernel invalid config error when calling embedding … (#130994 ) …with large index Fixes #130806 When an output size of 2147483648 (=131072*16384) is expected in the above issue, it threw the following error: RuntimeError: HIP error: invalid configuration argument What happened was that the second parameter passed to hipLaunchKernel was the out-of-range value {2147483648,1,1}. Found two issues in Indexing.cu: 1: ptrdiff_t was used but it is signed int, so outTotalSize >= 2147483648 can cause overflow when doing [this](`39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)`): 2: On ROCm, std::min -> ::min did not work as expected when outTotalSize>=2147483648 As a result, 2147483648 was sent to hipLaunchKernel; the GPU does not support such a huge number, since it specifies the number of threads per block. The original code intended to set 128 threads per block. This is debatable, as the perf would not be good on the latest powerful GPUs (a TODO item to update for perf, maybe?), but at least it would not cause the `invalid configuration argument` error. [Test] Run the same code snippet in the [issue](https://github.com/pytorch/pytorch/issues/130806), and print the output, its dim and numel(), which looks like below now: ``` output=tensor([[ 0.4044, -0.0244, -0.6865, ..., -0.7800, 0.1175, 1.6726], [-1.0866, -0.1609, 0.3538, ..., 1.9105, 0.7882, 1.1583], [-2.2079, 0.3736, 0.3610, ..., -0.2658, -0.0459, 1.3077], ..., [ 0.8753, -0.7482, -0.1978, ..., 0.9016, 1.1501, -0.5178], [-1.5845, -0.6277, 1.4520, ..., 0.5733, -2.1198, -0.0915], [-0.6310, -1.0239, -0.1910, ..., 0.4309, 0.1630, 0.3239]], device='cuda:0'), dim=2, numel=2147483648 ``` Added a large tensor unit test too. 
``` /pytorch# pytest test/nn/test_embedding.py -k test_large_tensors ================================================================================== test session starts =================================================================================== platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0 rootdir: /dockerx/development/pytorch configfile: pytest.ini plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1 collected 288 items / 287 deselected / 1 selected Running 1 items in this shard test/nn/test_embedding.py . [100%] =========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994 Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell	2024-07-20 08:33:29 +00:00
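The arithmetic behind the overflow in the commit above can be checked without a GPU. This is a small sketch in plain Python, assuming only the numbers quoted in the message; `threads_per_block` is a hypothetical helper and `ctypes.c_int32` stands in for a 32-bit signed integer, it is not the code in `Indexing.cu`.

```python
import ctypes

INT32_MAX = 2**31 - 1

# Output size from the issue: 131072 * 16384 = 2**31, one past INT32_MAX.
out_total_size = 131072 * 16384

# Storing that value in a 32-bit signed integer wraps around to a negative number.
wrapped = ctypes.c_int32(out_total_size).value


def threads_per_block(total: int, cap: int = 128) -> int:
    """Clamp the launch size in wide (here: Python) integers before narrowing."""
    return min(total, cap)


block = threads_per_block(out_total_size)
```

Doing the clamp before any narrowing to 32 bits is the point of the sketch: the 128-thread cap mentioned in the message only helps if the comparison itself does not overflow.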
Wu, Chunyuan	a8319698b3	[inductor] [cpp] improve cache blocking with CPU info (#129348 ) ## Description For single thread case, this PR improves the cache blocking in CPP GEMM template with the CPU info (the L1 and L2 cache size). `Mc_blocks` and `Kc_blocks` are calculated based on the below condition: - size_of_B < L1 - size_of_A < 0.5 * L2 For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations. ## Performance No regressions. Models with > 3% performance speedup are listed below: ### BF16 single thread (measured on CPU with AMX support) - static shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| detectron2_fasterrcnn_r_101_dc5\| 4% - dynamic shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| detectron2_fasterrcnn_r_101_dc5\| 4% ### FP32 single thread (measured on Ice Lake) - static shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| basic_gnn_edgecnn\| 10% - dynamic shape \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| torchbench \| basic_gnn_edgecnn\| 10% ### Next step The E2E level improvement is limited due to the below reasons: - For several HF models, we can observe performance improvement at kernel level for the gemm template kernel but since the performance is either still worse than ATen kernel (thus won't be selected during autotune) or improved from worse than ATen to similar to ATen, so we don't see E2E level performance change. - There're models where the gemm template kernel could get > 10% performance improvement with this PR but since the kernel time is only about 3% of the E2E time, we don't observe significant E2E level improvement. 
We will continue to find possible optimizations in the gemm template kernel in follow-up PRs. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #130675, #130690	2024-07-20 06:53:31 +00:00
Jiong Gong	0b44e1a74c	[inductor][cpp][gemm] optimize arbitrary N in packed gemm template (#130690 ) Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32, etc., and we fall back to the reference micro gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to a multiple of `register_block_n`, which is 8, 16, 32, etc., for packed weight. Therefore, the micro-gemm can work as is on the padded `n`. When the weight is padded, we use the local accumulation buffer to get the result from the micro-gemm and then unpad (slice) it before storing back to the output buffer. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16. Before AUTOTUNE linear_unary(512x768, 3073x768, 3073) _linear_pointwise 2.3563 ms 100.0% cpp_packed_gemm_0 710.5902 ms 0.3% After AUTOTUNE linear_unary(512x768, 3073x768, 3073) cpp_packed_gemm_0 1.8909 ms 100.0% _linear_pointwise 2.1016 ms 90.0% Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel ghstack dependencies: #130675	2024-07-20 06:30:15 +00:00
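The pad-compute-slice pattern described in the commit above can be sketched with plain Python lists. This illustrates the idea only and is not the C++ GEMM template; `pad_to_multiple` and `padded_matmul` are hypothetical names, and the block sizes in the usage lines are assumptions.

```python
def pad_to_multiple(n: int, block: int) -> int:
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block


def padded_matmul(a, b, block_n):
    """Multiply a (m x k) by b (k x n) after padding n to a multiple of block_n."""
    n = len(b[0])
    n_pad = pad_to_multiple(n, block_n)
    # Pad each row of the weight with zeros so every column block is full.
    b_pad = [row + [0] * (n_pad - n) for row in b]
    # Accumulate into a local buffer that has the padded width.
    acc = [
        [sum(a_row[k] * b_pad[k][j] for k in range(len(b_pad))) for j in range(n_pad)]
        for a_row in a
    ]
    # Slice the padding off before writing the result back.
    return [row[:n] for row in acc]


# n = 3073 from the benchmark shape, padded for two assumed block sizes.
padded_32 = pad_to_multiple(3073, 32)
padded_16 = pad_to_multiple(3073, 16)

result = padded_matmul([[1, 2], [3, 4]], [[1, 0, 2], [0, 1, 3]], block_n=2)
```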
ankurneog	ebc012ace6	Add hooks for execution on intel gaudi devices - 1 (#128584 ) ## Motivation This is follow up to PR:https://github.com/pytorch/pytorch/pull/126970 to support Gaudi devices for Pytorch UT execution. ## Changes We are adding additional hooks to: 1. Add dtype exceptions for Gaudi/HPU 2. Extend onlyNativeDevices decorator functionality to add additional devices Pull Request resolved: https://github.com/pytorch/pytorch/pull/128584 Approved by: https://github.com/albanD	2024-07-20 05:03:36 +00:00
Michael Lazos	d31f2ae904	Ensure invariant that all inputs have tensor dict (#131249 ) There was a path with freezing enabled that violated the invariant that all inputs have the "tensor_dict" meta. This ensures that `register_attr_or_module` also sets tensor_dict meta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131249 Approved by: https://github.com/anijain2305	2024-07-20 04:40:58 +00:00
drisspg	37337ef5c3	add some description on create_block_mask and mask mods (#131209 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131209 Approved by: https://github.com/joydddd	2024-07-20 04:40:48 +00:00
Yifu Wang	168c0e24a5	[IntraNodeComm] Fix some issues in two-shot all-reduce (#131244 ) Two issues: - Similar to https://github.com/pytorch/pytorch/pull/129501, two-shot all-reduce's reduction order was different across ranks. This PR fixes it. - When migrated to use SymmetricMemory, I accidentally used `get_buffer_ptrs_dev` instead of `get_buffer_ptrs` (the former is an on-device array). This PR fixes it (for https://github.com/pytorch/pytorch/issues/131215). The failing snippet provided by https://github.com/pytorch/pytorch/issues/131215 now works. ```python import os import torch import torch.distributed as dist def _get_global_rank() -> int: return int(os.environ.get("LOCAL_RANK", "0")) def is_local(): return _get_global_rank() == 0 def _get_world_size() -> int: return int(os.environ.get("LOCAL_WORLD_SIZE", "1")) global_rank = _get_global_rank() world_size = _get_world_size() torch.cuda.set_device(global_rank) dist.init_process_group(backend="nccl") global_group = dist.group.WORLD draft_group = dist.new_group([0,1]) inp = torch.full((128, 1, 4096), global_rank, dtype=torch.bfloat16, device="cuda") dist.all_reduce(inp, group=global_group) expect = sum(range(world_size)) assert inp.eq(expect).all() if 0 <= global_rank < 2: inp = torch.full((128, 1, 2048), global_rank, dtype=torch.bfloat16, device="cuda") dist.all_reduce(inp, group=draft_group) expect = sum(range(2)) assert inp.eq(expect).all() torch.cuda.synchronize() print("success") dist.destroy_process_group() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131244 Approved by: https://github.com/weifengpy, https://github.com/Chillee	2024-07-20 02:51:45 +00:00
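Why the reduction order matters across ranks: floating-point addition is not associative, so ranks that sum the same contributions in different orders can end up with results that differ in the last bits. The sketch below shows this with plain Python floats; `reduce_in_order` and the sample values are hypothetical and have nothing to do with the IntraNodeComm kernels themselves.

```python
def reduce_in_order(values, order):
    """Sum values left to right in the given index order."""
    total = values[order[0]]
    for idx in order[1:]:
        total = total + values[idx]
    return total


# One contribution per rank (illustrative values).
contributions = [0.1, 0.2, 0.3]

# Two ranks reducing the same data in different orders.
rank0_result = reduce_in_order(contributions, [0, 1, 2])  # (0.1 + 0.2) + 0.3
rank1_result = reduce_in_order(contributions, [1, 2, 0])  # (0.2 + 0.3) + 0.1

orders_disagree = rank0_result != rank1_result

# Using one fixed order everywhere makes every rank agree bit for bit.
fixed = [reduce_in_order(contributions, [0, 1, 2]) for _rank in range(2)]
```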
Xuehai Pan	d2bd9acabd	[BE] bump `optree` version to 0.12.1 (#130139 ) 0.12.0 Major Updates: - Add context manager to temporarily set the dictionary sorting mode - Add accessor APIs - Use `stable` tag for `pybind11` for Python 3.13 support - Fix potential segmentation fault for pickling support 0.12.1 Updates: - Fix warning regression during import when launch with strict warning filters Closes #130155 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139 Approved by: https://github.com/zou3519 ghstack dependencies: #130895	2024-07-20 02:41:10 +00:00
Yidi Wu	50436d5bdb	[export] fix zero arg export in training_ir (#130990 ) Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few retraceability failures because run_decomposition does a retracing. edit: also remove the eliminate_dead_code() in _unlift because of one onnx test failure: a constant tensor attr was lifted as constant_tensor input but it's not used in the graph after aot_autograd due to a shortcut in its decomposition. This causes the setattr to be removed by eliminate_dead_code but the graph signature still contains the name of that buffer, which causes an inconsistency between the transformed graph and ep's original signature after _unlift. And it seems that this has happened a few times where some nodes are accidentally removed and we're in an inconsistent state. The alternative to removing it would be: every time we call eliminate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the metadata, but I think this deserves a complete design. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990 Approved by: https://github.com/pianpwk	2024-07-20 02:35:13 +00:00
Sam Larsen	3c43fe068f	[inductor] parallel compile: Create new pipes for subproc communication (#131194 ) Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. https://github.com/pytorch/pytorch/issues/131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC. Test Plan: I was able to repro the MemoryError in https://github.com/pytorch/pytorch/issues/131070 by installing deepspeed and onnxruntime-training. Verified that this PR fixes it. Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131194 Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman	2024-07-20 02:23:01 +00:00
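The approach described above, a dedicated pipe whose descriptor is handed to the child on the command line, can be sketched with the standard library. This is an illustration under assumptions (POSIX only, since it relies on `pass_fds`); `CHILD_SOURCE` and `run_child` are hypothetical names and this is not inductor's subprocess pool code.

```python
import os
import subprocess
import sys

# Child program: writes its result to the fd named in argv, and also emits
# unrelated text on stdout, as a noisy third-party library might.
CHILD_SOURCE = r"""
import os, sys
fd = int(sys.argv[1])
print("stray library output")
with os.fdopen(fd, "wb") as chan:
    chan.write(b"result:42")
"""


def run_child():
    """Run the child with a private pipe; return (ipc_payload, stdout_noise)."""
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(write_fd)],
        pass_fds=(write_fd,),      # keep this descriptor open in the child
        stdout=subprocess.PIPE,
    )
    os.close(write_fd)             # parent keeps only the read end
    with os.fdopen(read_fd, "rb") as chan:
        payload = chan.read()      # returns once the child closes its end
    noise, _ = proc.communicate()
    return payload, noise


payload, noise = run_child()
```

Because the result travels on its own descriptor, whatever ends up on stdout cannot corrupt it.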
Peter Bell	9df8ea1cf2	[inductor] Use multiple outputs for flex-attention (#130833 ) Resubmit of #129344 This fixes the DCE issue for attention output Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833 Approved by: https://github.com/lezcano ghstack dependencies: #130831, #130832	2024-07-20 02:05:10 +00:00
Peter Bell	deacc543f1	[inductor] Make UserDefinedTritonKernel a multi-output operation (#130832 ) Resubmit of #129325 Previously each mutation was represented by a `MutationOutput` operation which was a new scheduler node that must be scheduled immediately afterwards. Now we have a single scheduler node, which produces multiple `MutationOutput` buffers as its output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832 Approved by: https://github.com/lezcano ghstack dependencies: #130831	2024-07-20 02:05:10 +00:00
Peter Bell	27c2a0d63b	[inductor] Separate Buffer and Operation into two concepts (#130831 ) Resubmit of #128893 Currently a buffer represents both a tensor with physical storage and a computation that produces the tensor as a result. This PR attempts to split these into two different concepts in the scheduler. This should allow us to have multiple outputs from a single operation. Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831 Approved by: https://github.com/lezcano	2024-07-20 02:05:07 +00:00
Isuru Fernando	bb4251213b	Add decomposition for channel_shuffle (#118775 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775 Approved by: https://github.com/peterbell10	2024-07-20 01:24:41 +00:00
Xuehai Pan	f0075c179b	Pin `sympy >= 1.13.0` (#130895 ) ------ The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8. - #130836 See the PR description of #130836 for more details. `sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically: - Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations) > BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x*2.0 != x*2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr)) `sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689. - #130689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895 Approved by: https://github.com/ezyang	2024-07-20 00:59:24 +00:00
PyTorch MergeBot	30d1826b2b	Revert "[executorch hash update] update the pinned executorch hash (#130001 )" This reverts commit 4821f72457afd7b1b5c61c1c8c3c49105c1bd22d. Reverted https://github.com/pytorch/pytorch/pull/130001 on behalf of https://github.com/clee2000 due to the test_sympy_utils failure is real, Dr. CI is wrong https://github.com/pytorch/pytorch/actions/runs/10015433275/job/27687163560 `4821f72457` ([comment](https://github.com/pytorch/pytorch/pull/130001#issuecomment-2240807631))	2024-07-20 00:56:14 +00:00
cyy	cd8bbdc71a	[2/N] Fix Wunused-parameter warnings (#131170 ) Follows #130924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131170 Approved by: https://github.com/mikaylagawarecki	2024-07-19 23:58:56 +00:00
rzou	207fb96155	[functorch] saved tensor hooks error should only apply to grad, vjp transforms. (#131191 ) There's no reason to ban them for vmap or jvp, because without the {grad, vjp} transforms those just act above PyTorch autograd, which will end up saving regular Tensors. Test Plan: - some tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191 Approved by: https://github.com/drisspg	2024-07-19 23:16:27 +00:00
PyTorch UpdateBot	4821f72457	[executorch hash update] update the pinned executorch hash (#130001 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001 Approved by: https://github.com/pytorchbot	2024-07-19 23:10:20 +00:00
PyTorch MergeBot	7c299b46ca	Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 )" This reverts commit 8390843eba6271dcdbec7d048e9fa4e56d4479d8. Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202))	2024-07-19 22:58:51 +00:00
Shuo Ding	35bf05561c	[Inductor] B2B-GEMM performance tuning with test (#130778 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130778 Approved by: https://github.com/eellison	2024-07-19 22:53:57 +00:00
peaceorwell	6657b14a64	[inductor] Fix the method for checking the variable type of entry.numel (#131026 ) The data type of numel in the IterationRangesEntry class is sympy.Expr. To determine if it's an integer, we need to use sympy.Integer. Co-authored-by: peterbell10 <peterbell10@live.co.uk> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131026 Approved by: https://github.com/peterbell10	2024-07-19 22:51:11 +00:00
PyTorch MergeBot	0e72baddf0	Revert "[easy][pytorch][counters] Move WaitCounter in c10/util (#131021 )" This reverts commit 0ca7b6ddd91192ebffd3c88bf314d07ba6cddf50. Reverted https://github.com/pytorch/pytorch/pull/131021 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/131021#issuecomment-2240280827))	2024-07-19 21:56:09 +00:00
Shuqiang Zhang	4aef5a1134	[c10] add an option to pg_config split share (#130877 ) Summary: context is: #129865 We want to give users an option to not share comms resources so that comm opts can overlap Test Plan: Augmented UT Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877 Approved by: https://github.com/fduwjj	2024-07-19 21:11:26 +00:00
Andrii Grynenko	0ca7b6ddd9	[easy][pytorch][counters] Move WaitCounter in c10/util (#131021 ) Summary: Since WaitCounter frontend itself has minimal dependencies it's fine to be moved into c10. Specific backends can be registered/linked separately. Test Plan: unit test Reviewed By: jamesperng, asiab4, c-p-i-o Differential Revision: D59842868 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021 Approved by: https://github.com/asiab4	2024-07-19 20:58:32 +00:00
Zain Rizvi	c64ad2403c	LF runners: Add new runner types for Amazon2023 AMIs (#131246 ) Add new LF runner types with the Amazon2023 ami, matching the change done in https://github.com/pytorch/test-infra/pull/5487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131246 Approved by: https://github.com/malfet	2024-07-19 20:30:41 +00:00
lessw2020	85ca88a2bb	[Distributed][PP export] update tracing to handle autocast inclusion (#130998 ) Fixes https://github.com/pytorch/pytorch/issues/128394 This updates PP export tracing to use no_grad() context along with avoid predispatch. This enables tracing for HF llama models that currently fail due to not handling the use of autocast in the Rope embeddings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130998 Approved by: https://github.com/fduwjj	2024-07-19 20:08:00 +00:00
Yidi Wu	ceee87df2e	[export] modify export code owners (#130894 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130894 Approved by: https://github.com/zhxchen17	2024-07-19 19:49:34 +00:00
PyTorch MergeBot	5f981388ec	Revert "[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663 )" This reverts commit d7a78ec8b938a61297221912464f5afef288b823. Reverted https://github.com/pytorch/pytorch/pull/129663 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/129663#issuecomment-2240011143))	2024-07-19 19:46:26 +00:00
Li-Huai (Allan) Lin	125be005eb	[Docs] Fix fake tensor doc (#131205 ) Fix this: `# AttributeError: 'FakeTensorMode' object has no attribute 'from_real_tensor'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131205 Approved by: https://github.com/eellison	2024-07-19 17:59:45 +00:00
Animesh Jain	e49c0acc39	[dynamo] Revert https://github.com/pytorch/pytorch/pull/130416 (#131058 ) All the changes brought by the original PR have been addressed in alternative ways in the stack. Why the original PR has to be reverted requires more effort because there is some bad interaction with export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131058 Approved by: https://github.com/williamwen42	2024-07-19 17:26:24 +00:00
henrylhtsang	042be441ba	[aoti] Unskip some aot inductor tests (#130973 ) Trying to unskip some tests, and if they are still broken, add reasons. ## example testing command ``` pytest -v test/inductor/test_aot_inductor.py -k test_add_complex ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130973 Approved by: https://github.com/ColinPeppler	2024-07-19 17:19:35 +00:00
Jiashen Cao	9b5c70878b	[Fix] Missing parameter happens when retracing an already jit.scripted module (#129787 ) #### Issue Model parameters sometimes do not appear in the `named_parameters()` function. For example, when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters (`requires_grad=True`) and buffers. #### Test Plan * `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787 Approved by: https://github.com/angelayi	2024-07-19 16:58:48 +00:00
Zhengxu Chen	abb3f2822c	[aotinductor] Support additional lifted constants supplied to const folding. (#130743 ) Summary: In export workflow, we always have a lifted graph which doesn't fetch constants through get_attr nodes. This causes some compatibility issues when we're trying to use inductor's split_const_gm function with a lifted graph. This diff makes an additive change to split_const_gm's interface, such that, when the pass sees a placeholder node is present in the lifted_constants table, it will also use that as the source of constness. This change won't break the existing code and the lifted_constants table can be used orthogonally to the existing const folding mechanisms. Also, as requested by the MTIA team, we want to introduce a small callback function used to skip certain nodes during const folding. For the internal followup counterpart, see D59685145 Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm Differential Revision: D59692790 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743 Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad	2024-07-19 16:48:56 +00:00
Catherine Lee	31e79aae6a	Another follow up to #130260 (#130993 ) Another followup to https://github.com/pytorch/pytorch/pull/130260 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130993 Approved by: https://github.com/huydhn	2024-07-19 16:43:54 +00:00
xingyunjohn1	d4a79d4a7c	Fix an example: Resolve broadcasting error in attn_bias and attn_mask… (#130209 ) … addition, fix device assignment for newly created variables in method Fix an example: Resolve broadcasting error in attn_bias and attn_mask addition, fix device assignment for newly created variables in method 1. `attn_bias += attn_mask` would cause a broadcasting error. Because the shape of `attn_bias` is (L, S), the shape of the output would be expected to be (L, S) too. When the shape of the input is (N, num_heads, L, S), broadcasting should be triggered. Then, the shape of the output would be (N, num_heads, L, S), which is unexpected. 2. `attn_bias` is a newly created variable in the method, which is not assigned a device. This is my retry of #130200 . I used a wrong account in that PR. Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130209 Approved by: https://github.com/mikaylagawarecki	2024-07-19 15:23:22 +00:00
sradc	451fc029fe	docs: note transposed weight initialisations (#130122 ) Fixes #129834 Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130122 Approved by: https://github.com/mikaylagawarecki	2024-07-19 15:23:03 +00:00
PyTorch MergeBot	5f3d8b8788	Revert "[c10] add an option to pg_config split share (#130877 )" This reverts commit 367213a608528ee74e67e03bf11f775e263ef480. Reverted https://github.com/pytorch/pytorch/pull/130877 on behalf of https://github.com/atalman due to breaks internal build ([comment](https://github.com/pytorch/pytorch/pull/130877#issuecomment-2239298810))	2024-07-19 14:24:50 +00:00
Andres Suarez	25d8a0480b	[lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions Differential Revision: D59935630 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131187	2024-07-19 07:19:11 -07:00
Edward Z. Yang	a6a2cd6257	Typo fix (#131037 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131037 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-07-19 13:17:54 +00:00
Michael Lazos	1b72cf0b09	Add hasattr for tensor variable (#131008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131008 Approved by: https://github.com/anijain2305 ghstack dependencies: #131007	2024-07-19 12:43:27 +00:00
Syed Tousif Ahmed	1f961ad495	Runs aten cuda cpp tests in CI (#131061 ) It seems like these tests are never run because https://github.com/pytorch/pytorch/pull/99956 got rid of the `pushd $1` which would make the if conditions true in CUDA builds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131061 Approved by: https://github.com/malfet, https://github.com/eqy	2024-07-19 12:35:33 +00:00
Jack Taylor	d7a78ec8b9	[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663 ) As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm. ``` >>> torch.cuda.get_device_properties(0) _CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104) >>> torch.cuda.get_device_properties(0).regs_per_multiprocessor 65536 ``` With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with AMD backend allowing us to enable inductor's dynamic_rblock_scaling on ROCm initially implemented in https://github.com/pytorch/pytorch/pull/115094 Leaving this in draft until following PRs have landed: - https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin - https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663 Approved by: https://github.com/jansel, https://github.com/shunting314	2024-07-19 09:45:03 +00:00
cyy	feef057691	[1/N] Fix Wunused-parameter warnings (#130924 ) Before we can turn Wunused-parameter into an error Pull Request resolved: https://github.com/pytorch/pytorch/pull/130924 Approved by: https://github.com/ezyang	2024-07-19 06:14:51 +00:00
Oguz Ulgen	eee76c86a8	Write trace_structured events to scuba (#130955 ) Summary: https://fb.workplace.com/groups/1286739428954016/posts/1287192258908733 Test Plan: Run test with tlparse and inspect https://www.internalfb.com/intern/scuba/query/?dataset=pt2_trace_structured_events Differential Revision: D59866096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130955 Approved by: https://github.com/ezyang	2024-07-19 06:02:47 +00:00
Chirag Pandya	982309b501	Initial commit of flight recorder trace (#130764 ) Summary: `fr_trace.py` is used to analyze flight recorder dump files. This script was taken from @wconstab and @zdevito. Only minor changes made were to make the linter happy and add a few odd new fields that I added in version `2.2` of the collector portions. Test Plan: Tested manually on some flight recorder data and it seems to run. TODO: Address 15 odd `#type: ignore` that I put in there to make the linter happy for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130764 Approved by: https://github.com/fduwjj	2024-07-19 06:00:54 +00:00
Justin Chu	fd4899bc58	[ONNX] Run ruff pyupgrade to update type annotations (#130657 ) Use the newest syntax for type annotations Pull Request resolved: https://github.com/pytorch/pytorch/pull/130657 Approved by: https://github.com/titaiwangms	2024-07-19 05:09:44 +00:00
kausik	4f60a2e39c	Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953 ) Earlier the signature of dequantize ops for decomposed quantized Tensor was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization. Please refer: https://github.com/pytorch/pytorch/pull/121450 However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow. This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-07-19 04:58:02 +00:00
chilli	d59803fb67	Refactored flexattention kernel (#130904 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130904 Approved by: https://github.com/drisspg ghstack dependencies: #130871	2024-07-19 04:56:32 +00:00
Animesh Jain	ac76dd606f	[dynamo] Alternative way to skip empty hooks guards on inbuilt nn modules (#131057 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131057 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #131056	2024-07-19 04:42:38 +00:00
Animesh Jain	00e54e74ff	[dynamo][cpp-guards] Fix bug in dict tags (#131056 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131056 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-07-19 04:42:38 +00:00
Peter Bell	3c622fbcd3	[inductor] Fix var_to_range in IndexPropagation (#130984 ) The current code assumes that indirect variables will be created by the same `IndexPropagation` instance, however that isn't true in the case of masked sub-blocks where we take in variables from the parent block. This fixes the issue by moving the var range information up to the `LoopBody` object where it can be shared by all sub-blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130984 Approved by: https://github.com/lezcano	2024-07-19 03:08:00 +00:00
Feng Yuan	b556d31586	Update torch-xpu-ops pin (ATen XPU implementation) (#131015 ) Regular update. 1. 90 new ATen operators and their variants are supported for XPU. 2. Bugfixing: a. Fixing out-of-bound memory access in index_put kernel b. Fixing debug build error 3. Binary change. Split device AOT code of SYCL kernel into multiple libraries to avoid linkage failure. 4. torch-xpu-ops test case enhancement: a. Hook PyTorch testing op_db to align opInfo configuration with CUDA b. Hook _check_arg_device2 and freeze_rng_state to make XPU happy Pull Request resolved: https://github.com/pytorch/pytorch/pull/131015 Approved by: https://github.com/EikanWang	2024-07-19 02:18:55 +00:00
Ma, Jing1	52cb9abb1d	Add deterministic support in nn.functional.interpolate for XPU (#129864 ) For both CUDA and XPU, there is no native deterministic implementation of `aten::upsample_bilinear` and `aten::replication_pad`. CUDA leverages the operator decomposition path in the frontend hook `nn.functional.interpolate` as its deterministic implementation. The XPU backend uses the same solution in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129864 Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/EikanWang	2024-07-19 02:15:42 +00:00
Jiong Gong	39493aa934	[inductor][cpp][gemm] move bias add to epilogue (#130675 ) Speed up bias-add compute by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16. Before AUTOTUNE linear_unary(512x768, 3072x768, 3072) cpp_packed_gemm_0 1.9200 ms 100.0% _linear_pointwise 1.9345 ms 99.3% After AUTOTUNE linear_unary(512x768, 3072x768, 3072) cpp_packed_gemm_0 1.8321 ms 100.0% _linear_pointwise 1.9246 ms 95.2% Pull Request resolved: https://github.com/pytorch/pytorch/pull/130675 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-07-19 01:16:34 +00:00
xinan.lin	5a6a806b19	[Inductor UT] Generalize device-bias code in case TestFxGraphCache.test_inductor_counters. (#131006 ) [Inductor UT] Generalize device-bias code in case `TestFxGraphCache.test_inductor_counters`. Fix #131005 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131006 Approved by: https://github.com/masnesral	2024-07-19 01:14:22 +00:00
Will Feng	208dffa702	[Compiled DDP] DDP + AC unit test (#130981 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130981 Approved by: https://github.com/fegin	2024-07-19 01:07:41 +00:00
cyy	3cc6183ce1	Fix getAugOp error (#131033 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131033 Approved by: https://github.com/ezyang	2024-07-19 01:07:24 +00:00
Xu Han	6e7b9ee8a0	[inductor] adapte windows file path (#130713 ) This PR depends on https://github.com/pytorch/pytorch/pull/130132 being landed successfully. The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758 After the file path was adapted for Windows, the first Windows inductor case ran successfully. ```python import torch def foo(x, y): a = torch.sin(x) b = torch.cos(x) return a + b opt_foo1 = torch.compile(foo) print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10))) ``` Result: ![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41) Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2024-07-18 23:19:38 +00:00
Justin Chu	e880cb2fe0	[ONNX] Remove beartype usage (#130484 ) beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following: 1. When beartype improved support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users that have beartype in their environment from using torch.onnx 2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback. 3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484 Approved by: https://github.com/titaiwangms	2024-07-18 22:07:40 +00:00
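Point 3 of the commit above can be illustrated without beartype. The sketch below assumes only the standard library: a runtime checker has to resolve annotations (here via `typing.get_type_hints`), and evaluating `int | None` fails before Python 3.10 even when the annotation is written as a string. `scale`, `can_resolve_hints`, and `PIPE_SYNTAX_SUPPORTED` are hypothetical names used for illustration.

```python
import sys
import typing

# The `X | Y` union syntax is only evaluable at runtime on Python 3.10+.
PIPE_SYNTAX_SUPPORTED = sys.version_info >= (3, 10)


def scale(x: "int | None" = None) -> "int | None":
    """String annotations keep this definition importable on older Pythons."""
    return None if x is None else 2 * x


def can_resolve_hints(fn) -> bool:
    """Mimic a runtime type checker, which must evaluate the annotations."""
    try:
        typing.get_type_hints(fn)
        return True
    except TypeError:
        # On Python < 3.10, evaluating `int | None` raises TypeError.
        return False
```

A static checker never evaluates the annotation, so it has no such restriction; only tools that resolve hints at runtime are affected.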
PyTorch MergeBot	fb3674b1f4	Revert "[Autograd] Cond Higher-Order Operation (#126911 )" This reverts commit f7058b735e52a1d876912f8c96a594673a495007. Reverted https://github.com/pytorch/pytorch/pull/126911 on behalf of https://github.com/clee2000 due to broke lint and functorch/test_aotdispatch `f7058b735e` Probably a landrace since both the test and lint passed on PR ([comment](https://github.com/pytorch/pytorch/pull/126911#issuecomment-2237703182))	2024-07-18 22:06:40 +00:00
Jiashen Cao	686b7f046a	[Fix]: TSConverter handles call ops with multiple outputs (#129294 ) #### Issue * The current call-op handling does not support IR with multiple outputs. If an op has multiple outputs, we add an implicit unpack to map output. E.g., ``` %5 : Tensor, %6 : Tensor = aten::max(%x.1, %3, %4), scope: export.test_converter.M:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:774:20 ``` * There are some cases where `prim::If` sub-blocks do not return any outputs. E.g., ``` %9 : bool = aten::gt(%8, %3), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:9 = prim::If(%9), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2 block0(): -> () block1(): = prim::RaiseException(%5, %4), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2 -> () ``` #### Test Plan We did an exhaustive search of all torch APIs that can return multiple outputs. We sampled some of the common ones and added new test cases based on those. * `pytest test/export/test_converter.py -s -k test_ts2ep_multi_outputs_on_call_ops` #### Appendix * aten ops that return multiple outputs.
``` aten._batch_norm_impl_index aten._batch_norm_no_update aten._batch_norm_with_update aten._batch_norm_with_update_functional aten._cudnn_rnn aten._efficient_attention_backward aten._efficient_attention_forward aten._embedding_bag aten._embedding_bag_forward_only aten._flash_attention_backward aten._flash_attention_forward aten._fused_adam aten._fused_dropout aten._fused_moving_avg_obs_fq_helper aten._linalg_det aten._linalg_eigh aten._linalg_slogdet aten._linalg_solve_ex aten._linalg_svd aten._native_batch_norm_legit aten._native_batch_norm_legit_functional aten._native_batch_norm_legit_no_training aten._pack_padded_sequence aten._prelu_kernel_backward aten._scaled_dot_product_efficient_attention aten._scaled_dot_product_efficient_attention_backward aten._scaled_dot_product_flash_attention aten._scaled_dot_product_flash_attention_backward aten._scaled_dot_product_flash_attention_for_cpu aten._scaled_dot_product_flash_attention_for_cpu_backward aten._thnn_fused_lstm_cell aten._thnn_fused_lstm_cell_backward_impl aten._unique2 aten._weight_norm_interface aten.adaptive_max_pool2d aten.adaptive_max_pool3d aten.aminmax aten.batch_norm_backward aten.convolution_backward aten.cudnn_batch_norm aten.cudnn_batch_norm_backward aten.cummax aten.cummin aten.fractional_max_pool2d aten.frexp aten.grid_sampler_2d_backward aten.grid_sampler_3d_backward aten.gru aten.linalg_cholesky_ex aten.linalg_eig aten.linalg_inv_ex aten.linalg_ldl_factor_ex aten.linalg_lu aten.linalg_lu_factor_ex aten.linalg_qr aten.linear_backward aten.log_sigmoid_forward aten.lstm aten.lu_unpack aten.max aten.max_pool2d_with_indices aten.max_pool3d_with_indices aten.median aten.min aten.miopen_batch_norm aten.miopen_batch_norm_backward aten.mkldnn_rnn_layer aten.mkldnn_rnn_layer_backward aten.mode aten.multilabel_margin_loss_forward aten.nanmedian aten.native_batch_norm aten.native_batch_norm_backward aten.native_dropout aten.native_group_norm aten.native_group_norm_backward aten.native_layer_norm 
aten.native_layer_norm_backward aten.nll_loss2d_forward aten.nll_loss_forward aten.quantized_gru aten.quantized_lstm aten.rnn_relu aten.rnn_tanh aten.sort aten.std_mean aten.topk aten.triangular_solve aten.unique_dim aten.var_mean ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129294 Approved by: https://github.com/angelayi	2024-07-18 21:55:18 +00:00
Alnis Murtovi	7f1cda1533	Autoheuristic: Do not store choices as metadata (#130304 ) While for optimizations like pad_mm, there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can instead take a look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`). In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304 Approved by: https://github.com/eellison	2024-07-18 21:39:42 +00:00
zdevito	4d9f2a6d56	Small expandable segments refactor. (#130889 ) Makes next PRs that will export/import segment handles easier to write. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130889 Approved by: https://github.com/dsjohns2 ghstack dependencies: #130888	2024-07-18 21:34:38 +00:00
zdevito	d8fed480ef	Move handle-creation logic into cudacaching allocator. (#130888 ) A later PR will then make the handle abstract and able to use either cudaMalloc or expandable segments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130888 Approved by: https://github.com/dsjohns2	2024-07-18 21:34:38 +00:00
Richard Zou	3e9cf1cc80	Fix potential segfault during deletion (#131036 ) Summary: See comment in code Test Plan: code reading Reviewed By: albanD Differential Revision: D59872819 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131036 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-07-18 21:18:31 +00:00
Thomas Bohnstingl	f7058b735e	[Autograd] Cond Higher-Order Operation (#126911 ) This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007) @ydwu4 I tried to incorporate your requests already. Currently there are two problems that I struggle with solving: 1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](`8a704035c9/torch/__init__.py (L1914-L1916)`). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond. 2. I am not entirely sure how to deal with the opinfo test in `hop_db.py` Co-authored-by: Yidi Wu <yidi@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911 Approved by: https://github.com/ydwu4	2024-07-18 21:09:09 +00:00
JackCaoG	24467ba2ec	Update pin (#130896 ) Test the XLA pin update Pull Request resolved: https://github.com/pytorch/pytorch/pull/130896 Approved by: https://github.com/anijain2305	2024-07-18 21:04:30 +00:00
Jerry Zhang	793b17ebcb	Add numeric_debugger top level APIs (#130643 ) Summary: Add three top level APIs for numeric debugger in pt2e flow that can log intermediate output in the model and calculate summary for metric comparisons between nodes in two graphs * `prepare_for_propagation_comparison` * `extract_results_from_loggers` * `compare_results` Test Plan: python test/test_quantization.py -k test_prepare_for_propagation_comparison python test/test_quantization.py -k test_extract_results_from_loggers Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-18 20:54:18 +00:00
PyTorch MergeBot	726b9268d2	Revert "Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 )" This reverts commit c986aeea2d7d9403be702119e3dd4dcb18134fc2. Reverted https://github.com/pytorch/pytorch/pull/126376 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/126376#issuecomment-2237496633))	2024-07-18 20:25:20 +00:00
Peter Bell	e7f7c5c3f8	[inductor] Avoid fallback case for custom scan op lowering (#130936 ) We currently can't generate split scans when there are multiple scan values, so we normally fall back to ATen. However, for the higher order scan op, we can't fallback so it makes sense to just generate the slower kernel anyway. This avoids having special shapes where we fail to codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936 Approved by: https://github.com/lezcano	2024-07-18 19:53:47 +00:00
Shuqiang Zhang	367213a608	[c10] add an option to pg_config split share (#130877 ) Summary: context is: #129865 We want to give users an option to not share comms resources so that comm opts can overlap Test Plan: Augmented UT Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877 Approved by: https://github.com/fduwjj	2024-07-18 19:03:00 +00:00
drisspg	c015e5b9e3	Make sure that TransformGetItemToIndex for all graph replay (#131003 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131003 Approved by: https://github.com/Chillee ghstack dependencies: #130871	2024-07-18 18:32:21 +00:00
redwrasse	82242a258a	rm duplicate index_dtype arg (#130803 ) - Remove duplicate `index_dtype` argument for `_test_meta_sparse_compressed` operation. - Also remove unused `y_v_numel` variable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130803 Approved by: https://github.com/soulitzer	2024-07-18 18:30:13 +00:00
joydddd	6d9f74f0af	Add flex decoding benchmark (#130850 ) ghstack-source-id: b4f26fb66ed47907b11580c8c853737959c58811 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130788 Add benchmark for flex decoding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130850 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-07-18 18:09:25 +00:00
PyTorch MergeBot	fff92d4f18	Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494 )" This reverts commit 9f392f8294e928aec49599ad649aa899e1356102. Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))	2024-07-18 17:42:05 +00:00
Pian Pawakapan	745324e487	[export] turn on hybrid symints by default (#130775 ) Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes. For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8)  # [s0, s1]
y = torch.randn(32)  # [s2]
out = x.reshape(-1) + y  # this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```
However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization. We also remove forced specializations for export and kill the `_disable_forced_specializations` flag - now any guards we can't express with Dims/DerivedDims are either handled with Hybrid SymInts, or should be resolved with rewriting or deferring. Follow up: Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775 Approved by: https://github.com/avikchaudhuri	2024-07-18 17:40:58 +00:00
Michael Lazos	22388ffe03	Graph break on tostring for numpy remapping (#131007 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131007 Approved by: https://github.com/williamwen42	2024-07-18 17:23:41 +00:00
Boyuan Feng	8bf0be7c78	[CUDAGraph] Add operator.mul to skip list for find_input_mutations (#130986 ) The #130912 error happens since `operator.mul` does not have `_schema`. So why do we have `operator.mul` and why is it not dispatched to `torch.ops.aten.mul`? This op comes from %mul_3. %mul_3 : [num_users=50] = call_function[target=operator.mul](args = (%arg689_1, 4096), kwargs = {}) `%arg689_1` is a placeholder with `meta['val'] = s0`. It comes from dynamic shapes and represents the batch size since it’s also used in many other nodes such as: %view_1 : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%mm, [%arg689_1, 4096, 320]), kwargs = {}) and %native_group_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_group_norm.default](args = (%div_1, %arg16_1, %arg17_1, %arg689_1, 320, 4096, 32, 1e-06), kwargs = {}) To fix the issue, we can add `operator.mul` to the skip list. Fixes #130912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130986 Approved by: https://github.com/eellison	2024-07-18 17:11:39 +00:00
mori360	5979014059	DSD for TorchTune LoRA (#129635 ) Fixes #128745 Resolves the conflicts that arise when users use full_state_dict while the model is FSDP. This currently fixes the `full_state_dict=True` case, which failed with the error `'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).` TODO: for `broadcast_from_rank0=True, full_state_dict=True`, the error is `NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635 Approved by: https://github.com/fegin	2024-07-18 17:00:35 +00:00
Zhengxu Chen	5484c86021	[export] Fully support extension op in serialization/deserialization. (#130851 ) Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be firstly used by executorch. Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization Differential Revision: D59825148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851 Approved by: https://github.com/angelayi	2024-07-18 16:47:53 +00:00
Iris Z	85451b2cde	[DTensor] Fix shard_dim_alltoall fake tensor return (#129945 ) shard_dim_alltoall op has a return type as a Tensor in its schemas (here: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L628), but its FakeTensor implementation returns a list of tensors (see the chunk() call here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/_collective_utils.py#L33). So it would error out when device="meta". This PR fixes the fake tensor mode return type for 1d mesh and adds a test to compare shape with non-meta tensor case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129945 Approved by: https://github.com/wanchaol	2024-07-18 16:43:40 +00:00
eellison	16aaff7783	Fix mm pad regression - more conservative estimation of plannable inputs (#128909 ) - More conservative estimation of plannable inputs - Consider constant_pad_nd as pointwise node in concat lowering - Use aten.cat instead of constant_pad_nd when padding just a single dimension because it can be memory-planned away Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909 Approved by: https://github.com/Chillee	2024-07-18 16:42:30 +00:00
Shangdi Yu	27ded03545	[FX][export] DCE pass, check schema for node impurity (#130395 ) Change the default DCE pass to check node schema for impure nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395 Approved by: https://github.com/angelayi, https://github.com/jgong5	2024-07-18 16:31:40 +00:00
Anshul Sinha	32ff04d30a	[dtensor][debug] adding functionality to control noisiness of the debug output (#130410 ) Summary Currently, the output of CommDebugMode contains a lot of noise, such as operations that usually won’t tell the user much information such as aten.detach.default. I have created a set of these trivial operations and added a user argument noise_level for users to choose how much information they would want to receive. noise_level = 1 prints module-level collective counts noise_level = 2 prints operations not included in trivial operations and module information noise_level = 3 prints all operations In addition, I have removed the generate_module_tracing_table since noise_level = 1 essentially replaces it. Finally, I have updated the examples and test cases. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing 5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/130410 Approved by: https://github.com/XilunWu	2024-07-18 16:12:59 +00:00
Li-Huai (Allan) Lin	8ea03372a1	[MPS] Store philox counter as part of the RNG state (#130662 ) Fixes #130613 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130662 Approved by: https://github.com/malfet	2024-07-18 15:57:28 +00:00
cyy	7c90a82970	[Reland] [5/N] Change static functions in headers to inline (#131010 ) Reland of #130673 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131010 Approved by: https://github.com/Skylion007	2024-07-18 15:53:48 +00:00
PyTorch MergeBot	d6ae8bbf16	Revert "[export] Add print_readable to unflattener (#128617 )" This reverts commit 9fee87e4cd9efb55ee5427a8e6b3c57de7c599f9. Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9984688318/job/27595182606 `433ef4e444` Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2236867975))	2024-07-18 15:31:51 +00:00
PyTorch MergeBot	120fdf7ee2	Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 )" This reverts commit e98135d1ad2f999fec649ecd21b35f3d5676be43. Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/zou3519 due to broke trunk tests, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2236790805))	2024-07-18 14:58:25 +00:00
rzou	5a90ed3523	Reinplacing should ignore copy_ nodes where the mutated arg is not read (#130866 ) Might fix #127660, need to test some more cases. We update the reinplacing pass. If we have something like the following, where "sin" is a custom op (this situation should also apply to triton kernels)
```py
def graph(x):
    y = sin(x)
    z = sin(y)
    x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    z = sin(x_clone)
    x.copy_(z)

"""step 2: reinplaces the second sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    sin_out(x_clone, out=x_clone)
    x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace the first sin into the following:
```py
def graph(x):
    sin_out(x, out=x)
    z = sin(x)
    x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node doesn't actually use the original value of x!) This PR updates the reinplacing pass to ignore copy_ in its computation of if the original value of the mutated argument is still needed. NB: this also applies to triton kernels, but it was easier for me to reason about custom ops (and my repros were all for custom ops). Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866 Approved by: https://github.com/oulgen	2024-07-18 13:47:54 +00:00
drisspg	dd39dca034	Removing some cruft and updating signatures for consistency (#130871 ) # Summary - This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little - Fixes a bug with indexing on block mask - Adds some doc strings to helper funcs and fixes some misc typing things - Forces functions passed to `create_block_mask` to be mask_mods and updates test files Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871 Approved by: https://github.com/joydddd, https://github.com/Chillee	2024-07-18 13:32:11 +00:00
PyTorch MergeBot	9f6db5d0e2	Revert "Ensure staticmethods can be allowed in graph (#130882 )" This reverts commit b0387449db41c90fb4226baea97a8d889a0951c4. Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/atalman due to failing torchrec tests internally, please fix and reland ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2236528473))	2024-07-18 13:31:30 +00:00
redwrasse	63a0a65df9	Define 'zero-preserving unary functions' in docs (#130804 ) Make explicit the definition of 'zero-preserving unary functions' in the sparse tensors documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130804 Approved by: https://github.com/soulitzer	2024-07-18 13:30:29 +00:00
eqy	1b07d42171	Add @syed-ahmed to CUDA `CODEOWNERS` paths (#130971 ) CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/130971 Approved by: https://github.com/soulitzer	2024-07-18 11:55:10 +00:00
wizzniu	c986aeea2d	Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 ) This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get the appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other backend-specific implementations. Note: New backends that want to implement and use pin memory just need to inherit AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods. Additional context: To avoid breaking BC, this PR preserves the `device` arg of the related APIs and emits a deprecation warning if the `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg. In the future, the `device` arg will actually be removed. Relates #124908 Relates #14560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376 Approved by: https://github.com/albanD	2024-07-18 11:54:14 +00:00
Syed Tousif Ahmed	38b7d89aa4	Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (#130472 ) We should be able to create multiple CUDAPluggableAllocators in the same pytorch program (see https://github.com/pytorch/pytorch/issues/124807, https://github.com/pytorch/pytorch/pull/125722 for context). When mixing CUDAPluggableAllocators in the same pytorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persist until program exit (when it's called to free the memory). Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator, however for multiple allocator use case, DataPtr would be using the deleter of whatever is in the `current_custom_allocator`. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 was done with `ncclMemAlloc`, and if `current_custom_allocator` is currently pointing to the CUDAPluggableAllocator with `ncclMemAlloc` - when cleaning up the allocation 1, we'd be using `ncclMemFree` instead of `cudaFree`. In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context. Hence, there is no need to go through an allocator object to find the deleter. CC: @zdevito @ptrblck @eqy Pull Request resolved: https://github.com/pytorch/pytorch/pull/130472 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-18 11:33:21 +00:00
jananisriram	28a74b9fa4	[NestedTensor] Integrate sum along the jagged dimension into NestedTensor (#130425 ) Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`. Functions modified in `caffe2/torch/nested/_internal/ops.py`: - `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, , M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `` dimension (`dim == 1`) to a `(B, M)` output tensor. - `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible. Functions modified in `caffe2/test/test_nestedtensor.py`: - `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. 
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor. Notes: - This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former. - The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this! - I modified some of the comments in the `sum` function as well as the unit tests for more clarity. Test Plan: Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command: ``` buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList ``` Differential Revision: D59571209 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425 Approved by: https://github.com/davidberard98	2024-07-18 10:48:18 +00:00
IvanKobzarev	e98135d1ad	[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 ) Reland of: https://github.com/pytorch/pytorch/pull/128016 Summary from previous PR: We assume only two possible, mutually exclusive scenarios: 1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad. 2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad. Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with the training scenario (1). With the current state that means: 1/ needs_autograd should not check torch.is_grad_enabled(), only that any of the inputs requires_grad 2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `with torch.enable_grad()` Changes in partitioner? Inference and training graphs differed in their return container, list/tuple. The changes in the partitioner unify this to always return a tuple. As a result, there are some changes in test_aotdispatch.py for graph contents, list -> tuple. Why was it reverted? There was a regression of the hf_Reformer model on inference. ``` TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode ``` This happened because one of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True). Even though the torchbench inference benchmarks run inside `with torch.no_grad()`, alias ops (for hf_Reformer specifically, expand) preserve requires_grad. As a result we started compiling a training graph instead of an inference graph. Fix for view ops: If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph. This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890 Approved by: https://github.com/bdhirsh	2024-07-18 08:27:53 +00:00
Michael Lazos	cf3f4285a8	Add recursive metadata guard test (#131002 ) Ensures that nested tensor subclasses are guarded properly. It turns out this case is already handled [here](`d77af49380/torch/_dynamo/variables/builder.py (L1496)`), which will recursively wrap inner tensors, adding metadata guards for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131002 Approved by: https://github.com/bdhirsh	2024-07-18 08:24:43 +00:00
Xuehai Pan	134bc4fc34	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 07:49:19 +00:00
Andrii Grynenko	dfc3347c4a	[pytorch][counters] Make WaitCounter backend pluggable (#130934 ) Summary: This diff introduces a much more flexible model for WaitCounter backend: 1. Backend can be installed dynamically (even if not linked with pytorch) instead of relying on macros and swapping implementation at compile time 2. Multiple backends are supported at the same time. Test Plan: unit test Reviewed By: jamesperng Differential Revision: D59795863 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130934 Approved by: https://github.com/asiab4	2024-07-18 07:23:55 +00:00
PyTorch MergeBot	b732b52f1e	Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 )" This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d. Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))	2024-07-18 06:39:58 +00:00
angelayi	6c2c8ee15b	[export] Remove preserved ops from decomp list (#130970 ) Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1466016147369925/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/130970 Approved by: https://github.com/bdhirsh	2024-07-18 05:15:22 +00:00
Xuehai Pan	aecc746fcc	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 05:13:41 +00:00
Xuehai Pan	740fb22966	[BE][Easy][4/19] enforce style for empty lines in import segments in `functorch/` (#129755 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129755 Approved by: https://github.com/zou3519 ghstack dependencies: #129752	2024-07-18 05:08:03 +00:00
Animesh Jain	a085acd7d6	[dynamo] Revert back changes to UnspecializedBuiltinNNModuleVariable (#130991 ) xref - https://fb.workplace.com/groups/1075192433118967/permalink/1466525440652329/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/130991 Approved by: https://github.com/williamwen42, https://github.com/mlazos	2024-07-18 05:01:46 +00:00
Sam Larsen	9f392f8294	Use inductor TestCase for test_replicate_with_compiler.py (#129494 ) Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494 Approved by: https://github.com/eellison	2024-07-18 03:08:32 +00:00
PyTorch MergeBot	433ef4e444	Revert "[FX][export] DCE pass, check schema for node impurity (#130395 )" This reverts commit e22b0acc766db4a853fe8fd73e919b4adf0e3148. Reverted https://github.com/pytorch/pytorch/pull/130395 on behalf of https://github.com/yushangdi due to breaking tests, need to rebase and fix ([comment](https://github.com/pytorch/pytorch/pull/130395#issuecomment-2235192986))	2024-07-18 02:46:03 +00:00
Aidyn-A	bd56bcf0ab	[TEST] Fix _scaled_mm tests (#130897 ) This PR resolves several sets of `_scaled_mm` test failures: - `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them - `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897 Approved by: https://github.com/drisspg	2024-07-18 02:15:00 +00:00
angelayi	9fee87e4cd	[export] Add print_readable to unflattener (#128617 ) Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](`17b45e905a/torch/fx/graph_module.py (L824)`)), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module. Example print from `python test/export/test_unflatten.py -k test_unflatten_nested` ``` class UnflattenedModule(torch.nn.Module): def forward(self, x: "f32[2, 3]"): # No stacktrace found for following nodes rootparam: "f32[2, 3]" = self.rootparam # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None # No stacktrace found for following nodes foo: "f32[2, 3]" = self.foo(mul); mul = None bar: "f32[2, 3]" = self.bar(foo); foo = None return (bar,) class foo(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # No stacktrace found for following nodes child1param: "f32[2, 3]" = self.child1param nested: "f32[2, 3]" = self.nested(mul); mul = None # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None return add class nested(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None return div class bar(torch.nn.Module): def forward(self, add: "f32[2, 3]"): # No stacktrace found for following nodes child2buffer: "f32[2, 3]" = self.child2buffer # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None return sub ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/128617 Approved by: https://github.com/zhxchen17, https://github.com/pianpwk	2024-07-18 01:36:01 +00:00
cyy	a0ae77b25b	Simplify cub::unique_by_key code (#130907 ) Removes an unused parameter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130907 Approved by: https://github.com/ezyang	2024-07-18 01:12:00 +00:00
Alnis Murtovi	d818c3319f	Autoheuristic: add config options for specifying optimizations to collect data for and use heuristics (#130245 ) Previously, it was only possible to collect data or use a heuristic regardless of where autoheuristic is used. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245 Approved by: https://github.com/shunting314	2024-07-18 01:04:36 +00:00
Edward Z. Yang	051971ab32	Reorder MIOpen conditions so getCUDAHooks only called when CUDA input (#130867 ) See post for more details: [fb.workplace.com/groups/1405155842844877/permalink/8719141948112860](https://fb.workplace.com/groups/1405155842844877/permalink/8719141948112860/) Function getCUDAHooks() returns a reference to an object without checking if the object is null. In the AutoMOS QE, which runs a ML model in Messenger Android, we are getting native crashes because of this reason: [internalfb.com/code/fbsource/[b7f8e18320f9d5d8347c3428c67301f20c3c81d2]/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504](https://www.internalfb.com/code/fbsource/%5Bb7f8e18320f9d5d8347c3428c67301f20c3c81d2%5D/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504), crash [fburl.com/logview/xi4w7jk4](https://fburl.com/logview/xi4w7jk4) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130867 Approved by: https://github.com/albanD	2024-07-18 00:59:33 +00:00
Shangdi Yu	e22b0acc76	[FX][export] DCE pass, check schema for node impurity (#130395 ) Change the default DCE pass to check node schema for impure nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395 Approved by: https://github.com/angelayi, https://github.com/jgong5	2024-07-18 00:55:20 +00:00
cyy	73d0f484b3	[structural binding][11/N] Replace std::tie with structural binding (#130830 ) Follows #130784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130830 Approved by: https://github.com/janeyx99	2024-07-18 00:45:06 +00:00
eellison	e14d1d10ef	Unwrap Identity in prepare indexing (#130967 ) We wrap indexing calculation in the concat kernel in `Identity` so that we do not expand int32 intermediates to int64. This was causing an issue where the index simplified to an integer and would not hit an intended [path](`752c817898/torch/_inductor/codegen/triton.py (L1554)`) which would do wrapping with tl.full. I couldn't generate a minimal repro to add as a test, but I have a repro you can check here: P1483831261 There is already a test that we don't expand the int32 intermediates to int64. Differential Revision: [D59871850](https://our.internmc.facebook.com/intern/diff/D59871850) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130967 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-07-18 00:43:53 +00:00
Will Feng	d77af49380	[Traceable FSDP2] Preserve fsdp.set_ op through lowering; Add unit test for multiple .set_ into same primal; Add unit test for FSDP2 module layer reuse (#130786 ) Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_fullgraph_backend_inductor` - `pytest -rA test/functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_fsdp_set__into_same_input` - `PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py -k TestAOTAutogradWithCache.test_input_mutation_fsdp_set__into_same_input` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130786 Approved by: https://github.com/bdhirsh ghstack dependencies: #129773	2024-07-17 23:25:42 +00:00
Will Feng	fc3dbcd1c3	[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773 ) FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead. This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op). One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes. --- Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor` --- Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773 Approved by: https://github.com/eellison	2024-07-17 22:51:20 +00:00
Oguz Ulgen	442bfa7fc4	Fix mypy error (#130992 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130992 Approved by: https://github.com/izaitsevfb	2024-07-17 22:49:23 +00:00
Oguz Ulgen	a0da1265c5	Define key in codecache (#130979 ) Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules' ``` Differential Revision: D59875657 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979 Approved by: https://github.com/jamesjwu	2024-07-17 22:44:50 +00:00
Andrew Gu	31e3330040	[Reland][FSDP2] Allowed `List[nn.Module]` as arg (#130949 ) This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication. Approach: At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node. To implement the runtime schedule, we define new forward hooks that run based on the following semantics: - If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is a no-op. - If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op. - First and last are determined by scoreboarding against a set of the modules. - This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward. Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`. Examples: This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382. If at least one of the modules in the list does not run forward before backward, then there will be a warning message like: ``` 1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together.
Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)] ``` --- Changes for reland: none since breakage was from PR below Pull Request resolved: https://github.com/pytorch/pytorch/pull/130949 Approved by: https://github.com/weifengpy ghstack dependencies: #130947	2024-07-17 22:40:14 +00:00
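The first/last hook semantics described in the commit above can be sketched in plain Python. This is a minimal illustration only, not FSDP2's actual implementation: the class `GroupHookScoreboard` and its method names are hypothetical, modules are represented by plain hashable keys, and it assumes each module in the group runs at most once per forward.

```python
from typing import Callable, Hashable, Iterable, Set


class GroupHookScoreboard:
    """Hypothetical sketch: run a shared pre-hook once and a shared post-hook once
    for a group of modules, tracked by scoreboarding against the module set."""

    def __init__(
        self,
        modules: Iterable[Hashable],
        pre_hook: Callable[[], None],
        post_hook: Callable[[], None],
    ) -> None:
        self.modules = frozenset(modules)
        self.pre_hook = pre_hook
        self.post_hook = post_hook
        self.pre_seen: Set[Hashable] = set()
        self.post_seen: Set[Hashable] = set()

    def on_pre_forward(self, module: Hashable) -> None:
        # Only the first module of the group to arrive runs the real pre-hook.
        if not self.pre_seen:
            self.pre_hook()
        self.pre_seen.add(module)

    def on_post_forward(self, module: Hashable) -> None:
        # Only the last module of the group to finish runs the real post-hook.
        self.post_seen.add(module)
        if self.post_seen == self.modules:
            self.post_hook()
            self.reset()

    def reset(self) -> None:
        # Also meant to be called at the end of backward, so that a module that
        # never ran forward does not block the hooks in the next forward.
        self.pre_seen.clear()
        self.post_seen.clear()
```

If one module in the group is skipped, the post-hook never fires and the scoreboard stays partially filled, which is why the explicit `reset()` at the end of backward matters in this sketch.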
Andrew Gu	ff7e021e94	[Reland][PT-D] Relaxed `contract` to allow `Sequence[nn.Module]` (#127773 ) (#130947 ) This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`. --- Changes for reland: - The previous PR assumed that any `func` decorated with `@contract` would return the same input `module` as output (which is true for PT-D composable APIs). - However, TorchRec `shard` returns a different module as output (though that module _does_ satisfy the `@contract` FQN check). - This PR removes the assumption and instead only enforces the FQN check following the input module order. In other words, if calling `func([x1, ..., xN])` for `N` modules `x1, ..., xN` that returns `[y1, ..., yM]` for `M` modules, we require that `N = M` and that FQNs are preserved coordinate-wise: `xi` and `yi` have same FQNs for all `i = 1, ..., N`. Differential Revision: [D59863438](https://our.internmc.facebook.com/intern/diff/D59863438) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130947 Approved by: https://github.com/weifengpy, https://github.com/atalman	2024-07-17 22:40:13 +00:00
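The coordinate-wise FQN requirement from the commit above (`N = M`, and `xi` and `yi` share the same FQNs) can be illustrated with a small standalone check. This is a sketch under assumptions, not the `@contract` implementation: the function name `check_fqns_preserved` is hypothetical, and each module is represented simply by the list of its fully qualified names.

```python
from typing import List, Sequence


def check_fqns_preserved(
    input_fqns: Sequence[List[str]],
    output_fqns: Sequence[List[str]],
) -> None:
    """Hypothetical check: inputs x1..xN and outputs y1..yM must satisfy N == M,
    and xi and yi must carry the same FQNs for every i."""
    if len(input_fqns) != len(output_fqns):
        raise ValueError(
            f"expected {len(input_fqns)} output modules, got {len(output_fqns)}"
        )
    for i, (x_fqns, y_fqns) in enumerate(zip(input_fqns, output_fqns)):
        # Assumption: FQNs are compared in order; a set comparison would also
        # be a reasonable reading of "same FQNs".
        if x_fqns != y_fqns:
            raise ValueError(f"FQNs changed for module at index {i}")
```

Note that the check does not require the output objects to be the same objects as the inputs, which is the relaxation the reland describes for the TorchRec `shard` case.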
Boyuan Feng	90105a4f3e	[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 ) - Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709). - Support prim::Unitialized, prim::Enter, and prim::Exit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416 Approved by: https://github.com/angelayi	2024-07-17 21:59:52 +00:00
PyTorch MergeBot	874bbc53c9	Revert "Define key in codecache (#130979 )" This reverts commit 4112f687831fb6f3554ff659a0be45909a1b4639. Reverted https://github.com/pytorch/pytorch/pull/130979 on behalf of https://github.com/clee2000 due to broke lint on torch/_inductor/codecache.py https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 `f0faecd291` ([comment](https://github.com/pytorch/pytorch/pull/130979#issuecomment-2234392332))	2024-07-17 21:59:19 +00:00
Isuru Fernando	43a6d20883	Add decomposition for reflection_pad{1,2,3}d_backward (#130299 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130299 Approved by: https://github.com/lezcano ghstack dependencies: #130130	2024-07-17 21:56:00 +00:00
PyTorch MergeBot	0eb43ed189	Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 )" This reverts commit f0faecd2915d73e56917922cc995237cef064e50. Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint, but for torch/_inductor/codecache.py this time https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 `f0faecd291` ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234387254))	2024-07-17 21:55:48 +00:00
Nikita Shulga	ebdfc7e37d	[BE] Rename `ISORT_WHITELIST` to `ISORT_SKIPLIST` (#130987 ) To better represent what this list is doing Pull Request resolved: https://github.com/pytorch/pytorch/pull/130987 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi	2024-07-17 21:52:56 +00:00
Jeff Daily	df5919393c	[ROCm] std::clamp work-around for hip-clang compiler (#127812 ) Fixes #127666. Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work around this by replacing std::clamp with min and max. Using #ifndef USE_ROCM to differentiate between CUDA using std::clamp and the ROCm replacement broke Windows builds. The replacement generates the same PTX as std::clamp, so the replacement is used unconditionally. See https://godbolt.org/z/Wde9KW3v4 for a sample. Original patch comes from @lamikr. Modified to improve efficiency. https://github.com/lamikr/rocm_sdk_builder/pull/37 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-17 21:31:17 +00:00
Boyuan Feng	f0faecd291	[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 ) - Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709). - Support prim::Unitialized, prim::Enter, and prim::Exit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416 Approved by: https://github.com/angelayi	2024-07-17 21:27:45 +00:00
Oguz Ulgen	4112f68783	Define key in codecache (#130979 ) Test Plan: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules' ``` Differential Revision: D59875657 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979 Approved by: https://github.com/jamesjwu	2024-07-17 21:19:13 +00:00
PyTorch MergeBot	0b134c15cd	Revert "Relax constraints for creating a `GenericContextWrappingVariable` (#129091 )" This reverts commit 882fd9186924b4632fba65033717d97d15ad3339. Reverted https://github.com/pytorch/pytorch/pull/129091 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 `a8bd2933d9` ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))	2024-07-17 20:59:40 +00:00
PyTorch MergeBot	c49f909aab	Revert "wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490 )" This reverts commit a8bd2933d9eaf24ec9582001efa844de499d9e93. Reverted https://github.com/pytorch/pytorch/pull/130490 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 `a8bd2933d9` ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))	2024-07-17 20:59:40 +00:00
Animesh Jain	65b4163bd2	[dynamo][nn-module] Make slice getitem on nn module container sourceless (#130852 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130852 Approved by: https://github.com/mlazos ghstack dependencies: #130773	2024-07-17 20:17:08 +00:00
Guilherme Leobas	a8bd2933d9	wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490 Approved by: https://github.com/williamwen42, https://github.com/zou3519 ghstack dependencies: #129091	2024-07-17 20:07:06 +00:00
Guilherme Leobas	882fd91869	Relax constraints for creating a `GenericContextWrappingVariable` (#129091 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091 Approved by: https://github.com/yanboliang, https://github.com/zou3519	2024-07-17 20:07:06 +00:00
PyTorch MergeBot	41f5d5dcaf	Revert "[inductor] adapte windows file path (#130713 )" This reverts commit e51e971a8675826e517a78bf2a97f8e2df5f4abd. Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to sorry but I think it's still failing, this time on windows CUDA https://github.com/pytorch/pytorch/actions/runs/9971126834/job/27552761451 `bb62e9d7c3`. It was not run on PR due to being on the periodic workflow, which isn't usually run on PRs due to capacity issues for windows CUDA machines. I will add ciflow/periodic to the PR to ensure the test gets run ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2234092078))	2024-07-17 19:37:16 +00:00
PyTorch MergeBot	1bf4a44b33	Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 )" This reverts commit ef0511245a92bae7057c195dcae2efc237b96f16. Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint for test/export/test_converter.py https://github.com/pytorch/pytorch/actions/runs/9979009143/job/27577181982 `ef0511245a`. Probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234067407))	2024-07-17 19:21:52 +00:00
Michael Lazos	b0387449db	Ensure staticmethods can be allowed in graph (#130882 ) Fixes https://github.com/pytorch/pytorch/issues/124735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882 Approved by: https://github.com/anijain2305, https://github.com/williamwen42	2024-07-17 19:18:30 +00:00
Michael Lazos	e4f9d01cd9	Add test for dataclass field accesses (#130848 ) Fixes https://github.com/pytorch/pytorch/issues/120108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130848 Approved by: https://github.com/williamwen42, https://github.com/anijain2305	2024-07-17 19:14:23 +00:00
Michael Lazos	470f07c840	Add guard override capability for tensor subclass metadata (#130780 ) Fixes https://github.com/pytorch/pytorch/issues/114405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130780 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh ghstack dependencies: #130779	2024-07-17 19:13:53 +00:00
Michael Lazos	bea6762c01	Add guards on subclass metadata (#130779 ) This PR adds guards in dynamo which verify the equality of tensor subclass metadata along with tests verifying the expected recompile behavior. The next PR adds the capability to override the guard behavior to possibly perform the check in a less expensive manner. Toward fixing https://github.com/pytorch/pytorch/issues/114405 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130779 Approved by: https://github.com/anijain2305, https://github.com/bdhirsh	2024-07-17 19:13:52 +00:00
Bin Bao	752c817898	[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#130796 ) Summary: Unify the argument codegen logic between python wrapper and cpp wrapper. Differential Revision: [D59809273](https://our.internmc.facebook.com/intern/diff/D59809273) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130796 Approved by: https://github.com/oulgen	2024-07-17 18:37:23 +00:00
chilli	efefea52e0	renamed inductor kernel args in flexattention properly (#130869 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130869 Approved by: https://github.com/drisspg, https://github.com/joydddd ghstack dependencies: #130809, #130818	2024-07-17 18:36:03 +00:00
chilli	480a5bd881	Renamed mask_fn to mask_mod (#130818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818 Approved by: https://github.com/drisspg ghstack dependencies: #130809	2024-07-17 18:36:03 +00:00
Pian Pawakapan	d96c80649f	[export] constants & non-persistent buffers for training IR (#130864 ) Summary: Uses original ExportedProgram constants and graph signature to inform decompositions, so that constant tensors and non-persistent buffers are respected for training IR. Removes 7 test failures for training IR. Test Plan: test_export Differential Revision: D59820909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130864 Approved by: https://github.com/angelayi	2024-07-17 18:27:53 +00:00
Boyuan Feng	ef0511245a	[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416 ) - Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709). - Support prim::Unitialized, prim::Enter, and prim::Exit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416 Approved by: https://github.com/angelayi	2024-07-17 17:48:36 +00:00
Catherine Lee	d552e5c3d5	Fix ciflow/nightly triggering commit hash update workflow (#130570 ) Move the if statement to be higher so people don't get the below ![image](https://github.com/user-attachments/assets/e9be7d7c-6400-4f80-880f-d58dcb4b5495) like https://togithub.com/pytorch/pytorch/pull/130465 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130570 Approved by: https://github.com/ZainRizvi	2024-07-17 17:13:50 +00:00
Xuehai Pan	db3290846e	[BE][Easy][10/19] enforce style for empty lines in import segments in `test/d*/` (#129761 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761 Approved by: https://github.com/fegin	2024-07-17 16:57:39 +00:00
Oguz Ulgen	1e13cb2f28	Log cache state to structured logs (#130845 ) https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpRm4MaD/0_0_0/fx_graph_cache_hash_4.json Differential Revision: D59795574 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130845 Approved by: https://github.com/jamesjwu	2024-07-17 16:45:45 +00:00
lezcano	af0b5ee924	Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199 ) We don't need to generate so many samples for these very expensive ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199 Approved by: https://github.com/peterbell10, https://github.com/zou3519	2024-07-17 16:29:36 +00:00
Sam Larsen	6e916f112f	[inductor] skip fx remote cache for 2 tests in test_metrics.py (#130853 ) Summary: `collect_defined_kernels()` is essentially patching deep inside to see if a specific codegen is happening. We could also patch somewhere in the cache path to make sure it's called, but I'm not sure that's really testing anything interesting. I suggest it's better to just disable the remote cache here. Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:metrics -- --exact 'caffe2/test/inductor:metrics - test_kernel_args_num_gb (caffe2.test.inductor.test_metrics.TestMetrics)' --run-disabled --stress-runs 10` Differential Revision: D59825899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130853 Approved by: https://github.com/oulgen	2024-07-17 16:17:43 +00:00
fduwjj	1fb572289b	[BE][c10d] Add a warning message in the comment about cuda hang (#130844 ) Add comments to warn users of a potential hang for the cuda event query in NCCLPG. Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130844 Approved by: https://github.com/wconstab	2024-07-17 15:51:19 +00:00
Isuru Fernando	b7d2abd766	Fix vectorized ops.masked (#130130 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130130 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-07-17 14:55:11 +00:00
Xuehai Pan	b29b23137c	[Easy] Fix argument name collision in dispatched functions (#129562 ) Use positional-only argument to avoid naming collision with aten ops arguments that are named "self". ```python In [1]: def foo(self, *args, **kwargs): ...: print(self, args, kwargs) ...: In [2]: def bar(self, /, *args, **kwargs): ...: print(self, args, kwargs) ...: In [3]: foo(1, 2, self=3) TypeError: foo() got multiple values for argument 'self' In [4]: bar(1, 2, self=3) 1 (2,) {'self': 3} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129562 Approved by: https://github.com/zou3519, https://github.com/fegin	2024-07-17 14:39:56 +00:00
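The name collision fixed by the commit above can be reproduced without PyTorch. The sketch below is illustrative only: `LegacyDispatcher`, `Dispatcher`, and `add` are hypothetical stand-ins, with `add` playing the role of an op whose first argument is named `self`.

```python
def add(self, other):
    """Stand-in for an op whose first argument is named ``self``."""
    return self + other


class LegacyDispatcher:
    def call(self, op, *args, **kwargs):
        # A caller passing self=... as a keyword collides with this method's
        # own ``self`` parameter and raises TypeError before the body runs.
        return op(*args, **kwargs)


class Dispatcher:
    def call(self, op, /, *args, **kwargs):
        # ``self`` and ``op`` are positional-only, so a keyword argument named
        # "self" lands in kwargs and is forwarded to the op untouched.
        return op(*args, **kwargs)
```

With these definitions, `Dispatcher().call(add, self=1, other=2)` forwards both keywords to `add`, while the same call on `LegacyDispatcher` fails with a `TypeError` about multiple values for `self`.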
Xuehai Pan	c0ed38e644	[BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754 Approved by: https://github.com/ezyang	2024-07-17 14:34:42 +00:00
Yutao Xu	32995dec28	Add support for XPU accumulate type (#128579 ) Provide an accumulate type interface specifically for XPU, similar to what was done for MPS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128579 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-07-17 14:33:53 +00:00
Xuehai Pan	76169cf691	[BE][Easy][9/19] enforce style for empty lines in import segments in `test/[e-h]*/` (#129760 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129760 Approved by: https://github.com/ezyang	2024-07-17 14:25:29 +00:00
angelayi	cbf274d4a7	[aoti] Add packaging solution (#129895 ) In this PR, I added support for packaging the AOTI generated files into a zipfile, and loading it in python. `compile_so` takes the path to the package, a device, and a desired so_path location, and compiles package into a .so, and saves to the specified location. `load_package` takes a path to the package and device, calls _extract_so, and then creates a callable to run the compiled model. The zipfile generated looks like the following: ``` |- version |- archive_format |- data |- aotinductor |- cbtnafqaqrhvwztv7xudlal4xs6sofxa5oxccyuaqtrt6aozaklx.cubin # AOTI cuda generated cubin files |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe.cpp # AOTI generated cpp file |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_compile_flags # Flags for compiling the .o |- c6qqtnpgwfi3dv5nb76ai773kt45ezoxfwdmd7q37lvq6fs2tnoi.o # AOTI saved const.o |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_linker_flags # Flags for linking the files to form the .so |- constants |- constants.pt # Constants saved using torch.save, can be loaded using mmap ``` The workflow is something like: ``` with torch.no_grad(): ep = torch.export.export( model, example_inputs, dynamic_shapes=dynamic_shapes, strict=False, ) gm = ep.module() package_path = torch._inductor.aot_compile( gm, example_inputs, options= { "aot_inductor.output_path": "my_path.pt2", # or a directory "aot_inductor.package": True, } ) compiled_model = torch._inductor.package.load_package(package_path, device) return compiled_model ``` I tried turning on loading the weights using mmap by default, but had some trouble with it, so that is just left as a todo Pull Request resolved: https://github.com/pytorch/pytorch/pull/129895 Approved by: https://github.com/malfet	2024-07-17 13:56:58 +00:00
PyTorch MergeBot	94a910b43b	Revert "Renamed mask_fn to mask_mod (#130818 )" This reverts commit 1a97bcf93b2ac98505ef6ff011ccb3565e456596. Reverted https://github.com/pytorch/pytorch/pull/130818 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/130818#issuecomment-2233367318))	2024-07-17 13:47:08 +00:00
PyTorch MergeBot	d027aef8f8	Revert "Removed q_num_blocks from constructor (#130819 )" This reverts commit 03c660468eb57772e82c1034613f5ff8781c775a. Reverted https://github.com/pytorch/pytorch/pull/130819 on behalf of https://github.com/atalman due to Internal problem with previous PR in stack https://github.com/pytorch/pytorch/pull/130818 ([comment](https://github.com/pytorch/pytorch/pull/130819#issuecomment-2233359569))	2024-07-17 13:43:35 +00:00
Alnis Murtovi	4b7ff35622	Fix flex_attention import in score_mod (#130906 ) torch.nn.attention._flex_attention has been renamed to torch.nn.attention.flex_attention, so the import does not work currently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130906 Approved by: https://github.com/Chillee	2024-07-17 13:37:08 +00:00
PyTorch MergeBot	e1b2d8f975	Revert "[cuDNN][SDPA] Support `attn_bias` in cuDNN (#130482 )" This reverts commit de177b50f89e45a57ac056ee64a64d7775b450ff. Reverted https://github.com/pytorch/pytorch/pull/130482 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/130482#issuecomment-2233309217))	2024-07-17 13:21:50 +00:00
xinan.lin	d3a11a0198	[Inductor] Handle device_put op in constant folding. (#130824 ) Fix #130823 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130824 Approved by: https://github.com/eellison, https://github.com/EikanWang ghstack dependencies: #130817	2024-07-17 10:13:36 +00:00
xinan.lin	2af2d26562	[Inductor UT] Generalize device-bias code in test_triton_kernels.py and test_torchinductor.py (#130817 ) [Inductor UT] Generalize newly introduced device-bias code in test_triton_kernels.py::test_add_kernel and test_torchinductor.py::test_ctr_not_moved_to_cuda_when_used_in_index_put Fix #130814 , #130838 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130817 Approved by: https://github.com/zou3519	2024-07-17 10:13:36 +00:00
William Wen	2300bb2a88	[3.13, dynamo] support TO_BOOL (#130565 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130565 Approved by: https://github.com/jansel ghstack dependencies: #130459, #130460, #130461, #130564	2024-07-17 09:47:58 +00:00
William Wen	539acf7656	[3.13, dynamo] support CALL_KW (#130564 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130564 Approved by: https://github.com/jansel ghstack dependencies: #130459, #130460, #130461	2024-07-17 09:47:58 +00:00
William Wen	e2365c05d7	[3.13, dynamo] fix instruction line numbers (#130461 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130461 Approved by: https://github.com/jansel ghstack dependencies: #130459, #130460	2024-07-17 09:47:58 +00:00
William Wen	82b2e7a253	[3.13, dynamo] fix CALL_FUNCTION_EX in symbolic_convert (#130460 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130460 Approved by: https://github.com/jansel ghstack dependencies: #130459	2024-07-17 09:47:58 +00:00
William Wen	8c9a996091	[3.13, dynamo] support LOAD_FAST_LOAD_FAST and STORE_FAST_STORE_FAST (#130459 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130459 Approved by: https://github.com/jansel	2024-07-17 09:47:58 +00:00
Adrian Wälchli	bb62e9d7c3	Avoid autocast deprecation warning in DataParallel (#130660 ) Fixes #130659 Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130660 Approved by: https://github.com/guangyey, https://github.com/fegin, https://github.com/albanD	2024-07-17 08:32:19 +00:00
Xuehai Pan	f6838d521a	[BE][Easy][5/19] enforce style for empty lines in import segments in `tools/` and `torchgen/` (#129756 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129756 Approved by: https://github.com/ezyang	2024-07-17 06:44:35 +00:00
Xuehai Pan	ba48cf6535	[BE][Easy][6/19] enforce style for empty lines in import segments in `test/` (#129757 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757 Approved by: https://github.com/ezyang	2024-07-17 06:42:37 +00:00
Xu Han	e51e971a86	[inductor] adapte windows file path (#130713 ) This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully. The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758 After the file path was adapted for Windows, the first Windows inductor case ran successfully. ```python import torch def foo(x, y): a = torch.sin(x) b = torch.cos(x) return a + b opt_foo1 = torch.compile(foo) print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10))) ``` Result: ![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41) Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2024-07-17 06:36:11 +00:00
Andrii Grynenko	7c45476d38	[pytorch][counters] WaitCounter cleanup (#130664 ) Summary: This diff does a minor cleanup of WaitCounters: 1. Fixes some singleton use to ensure one instance of WaitCounterImpl per counter per process 2. Updates API to enable measuring duration of individual wait operations Test Plan: unit test Differential Revision: D59709324 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130664 Approved by: https://github.com/c-p-i-o, https://github.com/asiab4	2024-07-17 04:42:35 +00:00
Colin Peppler	419b8df0b6	[inductor][easy] add debug logs for inlining constants (#130799 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130799 Approved by: https://github.com/chenyang78	2024-07-17 04:21:08 +00:00
Yu, Guangye	f2552dcc3d	refactor cached tensor more generic (#129359 ) # Motivation Solves https://github.com/pytorch/pytorch/issues/129027 by refactoring cached tensor to be generic. # Additional Context No API name change. It only decouples cached tensor from the CUDA build option. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129359 Approved by: https://github.com/eqy, https://github.com/EikanWang, https://github.com/albanD	2024-07-17 03:00:08 +00:00
Yu, Guangye	c6aa03bd4e	Add allow_xpu to enable XPU UTs (#130312 ) # Motivation enable UTs under folder test/xpu/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/130312 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD	2024-07-17 02:40:28 +00:00
Wang, Eikan	fc238db62a	Separate AOTI Eager utils as a single file (#125819 ) The key change is code movement. We just moved aoti eager related code from `torch._inductor.utils` to `torch._inductor.aoti_eager` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819 Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire	2024-07-17 02:27:11 +00:00
Aaron Gokaslan	d1c4e6b55f	[BE]: Enable a few additional ruff rules (#130700 ) Enables a few extra ruff rules, most of which do not have any violations as I already cleaned them up in earlier PRs; this just turns them on to enforce them. Adds 1 noqa as we want the suboptimal lambda generation + call kept as a test. Also enables the test in flake8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700 Approved by: https://github.com/justinchuby, https://github.com/ezyang	2024-07-17 02:06:04 +00:00
Yu, Guangye	c24c50da92	fix tensor print behavior for XPU (#130523 ) # Motivation Some XPU devices don't support the `double` data type, so we have to use `tensor.to(torch.float)` if it is an XPU tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130523 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD	2024-07-17 02:03:32 +00:00
Edward Z. Yang	aa95fb99af	On advice of James March, log pid instead of tid (#130679 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130679 Approved by: https://github.com/jmarchfb	2024-07-17 02:00:10 +00:00
Jack Taylor	e9023d57b0	[ROCm] Return correct AMDSMI socket_power metric (#130331 ) Extending on the change in https://github.com/pytorch/pytorch/pull/127729 Depending on gcnArch the API to return socket power will change based on underlying gpu_metrics. This PR will handle both cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130331 Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/malfet	2024-07-17 01:58:58 +00:00
chilli	03c660468e	Removed q_num_blocks from constructor (#130819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130819 Approved by: https://github.com/drisspg ghstack dependencies: #130809, #130818	2024-07-17 01:41:20 +00:00
chilli	1a97bcf93b	Renamed mask_fn to mask_mod (#130818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818 Approved by: https://github.com/drisspg ghstack dependencies: #130809	2024-07-17 01:41:20 +00:00
chilli	6024fea0f8	Compute q_num_blocks from kv_num_blocks if q_num_blocks is not passed in (#130809 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130809 Approved by: https://github.com/drisspg	2024-07-17 01:41:15 +00:00
Tristan Rice	ef9d9be236	TCPStoreLibUvBackend: log port on error (#130797 ) Adds better error messages when a socket fails to bind in libuv. New format: ``` The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied ``` Old format: ``` The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use ``` Test plan: Added test in `test_store.py` ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797 Approved by: https://github.com/kurman	2024-07-17 01:34:15 +00:00
Sam Larsen	25cb4426d3	[inductor] Add num_matches_for_scatter_upon_const_tensor to list of cached metrics (#130843 ) Summary: test/inductor:scatter_optimization is using this counter and fails with remote caching enabled Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:scatter_optimization -- --exact 'caffe2/test/inductor:scatter_optimization - test_cross_entropy_loss (caffe2.test.inductor.test_scatter_optimization.TestScatterOpt)' --run-disabled --stress-runs 10` Differential Revision: D59817406 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130843 Approved by: https://github.com/oulgen	2024-07-17 00:41:22 +00:00
PyTorch MergeBot	8458dc8966	Revert "Use inductor TestCase for distributed tests (#129494 )" This reverts commit 3cd2ae331a5ed6839456bb0025c729a1ee50bc84. Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))	2024-07-17 00:32:48 +00:00
PyTorch MergeBot	d7a8e8f7c5	Revert "[PT-D] Relaxed `contract` to allow `Sequence[nn.Module]` (#127773 )" This reverts commit b27695791e9cc4eedb1b713b1be20398bfeb911b. Reverted https://github.com/pytorch/pytorch/pull/127773 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/127773#issuecomment-2232004006))	2024-07-16 23:48:09 +00:00
Lei Wang (Server LLVM)	9a6d81b178	Fix pytorch JIT build for LLVM 18+ (#130661 ) Summary: LLVM upstream (https://github.com/llvm/llvm-project/pull/97824) changed `getHostCPUFeatures` to return a StringMap. Fix this to unblock T195389358 Test Plan: ``` buck2 build mode/opt-clang-thinlto --upload-all-actions -c unicorn.hfsort="1" -c cxx.extra_cxxflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference -ferror-limit=0" -c cxx.extra_cflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference" -c cxx.profile="fbcode//fdo/autofdo/unicorn/topaggr/top_aggregator_server:autofdo" unicorn/topaggr:top_aggregator_server ``` Differential Revision: D59708722 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130661 Approved by: https://github.com/Skylion007	2024-07-16 23:47:48 +00:00
eqy	de177b50f8	[cuDNN][SDPA] Support `attn_bias` in cuDNN (#130482 ) CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482 Approved by: https://github.com/drisspg	2024-07-16 23:45:21 +00:00
PyTorch MergeBot	4f40a7078e	Revert "[FSDP2] Allowed `List[nn.Module]` as arg (#127786 )" This reverts commit d3ab8cecedd7843b8caed5946404704a18479811. Reverted https://github.com/pytorch/pytorch/pull/127786 on behalf of https://github.com/atalman due to bottom pr from the stack is failing on internal error ([comment](https://github.com/pytorch/pytorch/pull/127786#issuecomment-2231999178))	2024-07-16 23:45:17 +00:00
Michael Lazos	7919f0b952	Add buffer static input tests to cudagraph trees (#130402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402 Approved by: https://github.com/eellison ghstack dependencies: #130393	2024-07-16 22:12:38 +00:00
Michael Lazos	415d5e53ae	Propagate buffer and parameter indices through AOT (#130393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393 Approved by: https://github.com/bdhirsh	2024-07-16 22:12:38 +00:00
PyTorch MergeBot	5f3c356a56	Revert "[inductor] adapte windows file path (#130713 )" This reverts commit 69e99172450e40536bf2e6c110183d34a0e283e2. Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to broke functorch\test_eager_transforms.py on windows https://github.com/pytorch/pytorch/actions/runs/9958208725/job/27530132704 `69e9917245`. Test failure on PR is real, possibly force merged to get around lint error? ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2231901793))	2024-07-16 22:07:55 +00:00
soulitzer	2eec02523b	[autograd] Support GradientEdge as output for torch.autograd.grad (#127766 ) This is useful for splitting grad to run in two parts while preserving intermediates: <details> <summary> Click to see code </summary> ```python import collections import weakref from typing import Deque import torch from torch.autograd.graph import GradientEdge def _get_grad_fn_or_grad_acc(t): if t.requires_grad and t.grad_fn is None: return t.view_as(t).grad_fn.next_functions[0][0] else: return t.grad_fn def reverse_closure(roots, target_nodes): # Recurse until we reach a target node closure = set() actual_target_nodes = set() q: Deque = collections.deque() for node in roots: if node is not None and node not in closure: closure.add(node) q.append(node) while q: node = q.popleft() reverse_edges = node.metadata.get("reverse_edges", []) for holder_ref, idx in reverse_edges: ref = holder_ref() if ref is None: raise RuntimeError("Reverse graph is no longer alive") fn = ref.node if fn in closure or fn is None: continue if fn in target_nodes: actual_target_nodes.add(fn) continue closure.add(fn) q.append(fn) return closure, actual_target_nodes # Enable weak pointer class Holder(): def __init__(self, node): self.node = node # TODO: use weak references to avoid reference cycle def construct_reverse_graph(roots): q: Deque = collections.deque() root_seen = set() reverse_graph_refs = [] for node in roots: if node is not None and node not in root_seen: q.append(node) root_seen.add(node) while q: node = q.popleft() for fn, idx in node.next_functions: if fn is not None: # Don't necessarily need to store on the graph reverse_edges = fn.metadata.get("reverse_edges", []) if len(reverse_edges) == 0: q.append(fn) holder = Holder(node) holder_ref = weakref.ref(holder) reverse_graph_refs.append(holder) reverse_edges.append((holder_ref, idx)) fn.metadata["reverse_edges"] = reverse_edges return reverse_graph_refs def get_param_groups(inputs, params): inputs_closure, _ = reverse_closure(inputs, set()) param_groups = dict() # 
keyed on intermediates for i, param in enumerate(params): closure, intersected = reverse_closure([param], inputs_closure) param_group = { "params": set([param]), "intermediates": set(intersected), } for input_node in intersected: existing = param_groups.get(input_node, None) if existing is not None: existing["params"] = existing["params"].union(param_group["params"]) existing["intermediates"] = existing["intermediates"].union(param_group["intermediates"]) param_group = existing else: param_groups[input_node] = param_group # Sanity check: union of all param_groups params should be equal to all params union_params = set() seen_ids = set() unique_param_groups = [] for param_group in param_groups.values(): if id(param_group) not in seen_ids: seen_ids.add(id(param_group)) unique_param_groups.append(param_group) union_params = union_params.union(param_group["params"]) assert union_params == set(params) return unique_param_groups def compute_grads_only_inputs2(roots, inps, weights): root_grad_fns = list(map(_get_grad_fn_or_grad_acc, roots)) inp_grad_fns = list(map(_get_grad_fn_or_grad_acc, inps)) weight_grad_fns = list(map(_get_grad_fn_or_grad_acc, weights)) reverse_graph_refs = construct_reverse_graph(root_grad_fns) param_groups = get_param_groups(inp_grad_fns, weight_grad_fns) del reverse_graph_refs for param_group in param_groups: for i, intermediate in enumerate(param_group["intermediates"]): def get_hook(param_group, i): def hook(grad_inputs): if param_group.get("grads", None) is None: param_group["grads"] = [None] * len(param_group["intermediates"]) param_group["grads"][i] = grad_inputs return hook # These are always "split" nodes that we need to recompute, so # save their inputs. 
intermediate.register_prehook(get_hook(param_group, i)) dinputs = torch.autograd.grad((out,), inputs=tuple(inps), grad_outputs=(torch.ones_like(out),), retain_graph=True) return dinputs, param_groups def compute_grads_only_weights2(user_weights, param_groups): all_dweights = dict() for param_group in param_groups: # TODO: Handle case where intermediate can have multiple outputs intermediate_edges = tuple(GradientEdge(i, 0) for i in param_group["intermediates"]) weights_edges = tuple(GradientEdge(w, 0) for w in param_group["params"]) assert all(len(g) == 1 for g in param_group["grads"]) # [NEW!] Able to pass a GradientEdge to autograd.grad as output # We do not need to retain_graph because... guarantee no overlap? print("trying to execute: ", intermediate_edges, weights_edges) dweights = torch.autograd.grad(intermediate_edges, weights_edges, grad_outputs=sum(param_group["grads"], tuple())) for w, dw in zip(param_group["params"], dweights): all_dweights[w] = dw # return grads in the original order weights were provided in out = [] for w in user_weights: grad_acc = _get_grad_fn_or_grad_acc(w) out.append(all_dweights[grad_acc]) return tuple(out) ``` </details> ```python import torch import torch.nn as nn # Setup mod1 = nn.Linear(10, 10) mod2 = nn.Linear(10, 10) a = torch.rand(10, requires_grad=True) weights = tuple(mod1.parameters()) + tuple(mod2.parameters()) inps = (a,) out = mod2(mod1(a)) class LoggingTensorMode(torch.utils._python_dispatch.TorchDispatchMode): def __torch_dispatch__(self, func, types, args=(), kwargs=None): if kwargs is None: kwargs = {} rs = func(*args, **kwargs) print(f"{func.__module__}.{func.__name__}") return rs print(" -- SPLIT -- ") # Compute gradients in two parts with LoggingTensorMode(): print("PART 1") dinputs, state = compute_grads_only_inputs2((out,), inps, weights) print("PART 2") dweights = compute_grads_only_weights2(weights, state) out = mod2(mod1(a)) print(" -- REF -- ") # Compare with reference with LoggingTensorMode(): ref_all_gradients = 
torch.autograd.grad(out, inputs=tuple(inps) + weights, grad_outputs=(torch.ones_like(out),)) for actual, ref in zip(dinputs + dweights, ref_all_gradients): print(torch.allclose(actual, ref)) ``` <img width="598" alt="image" src="https://github.com/pytorch/pytorch/assets/13428986/3681b8a7-3ab4-4d1d-a836-abef6913e671"> ``` PART 1 torch._ops.aten.view.default torch._ops.aten.view.default torch._ops.aten.view.default torch._ops.aten.view.default torch._ops.aten.view.default torch._ops.aten.ones_like.default V0603 10:17:21.590878 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1ee160> with grad_outputs: [f32[10]] torch._ops.aten.view.default V0603 10:17:21.591204 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]] torch._ops.aten.t.default torch._ops.aten.mm.default V0603 10:17:21.591578 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x100d7ae50> with grad_outputs: [f32[1, 10]] torch._ops.aten.view.default V0603 10:17:21.591747 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a60> with grad_outputs: [f32[10]] torch._ops.aten.view.default V0603 10:17:21.591834 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]] torch._ops.aten.t.default torch._ops.aten.mm.default V0603 10:17:21.591922 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a90> with grad_outputs: [f32[1, 10]] torch._ops.aten.view.default PART 2 trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1e4bb0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a21b130>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b7c0>, output_nr=0)) V0603 10:17:21.592223 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]] torch._ops.aten.t.default 
torch._ops.aten.mm.default torch._ops.aten.t.default torch._ops.aten.sum.dim_IntList torch._ops.aten.view.default V0603 10:17:21.592421 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a1cad60> with grad_outputs: [f32[10, 10]] torch._ops.aten.t.default trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1ee0d0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a1e41c0>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b670>, output_nr=0)) V0603 10:17:21.593481 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]] torch._ops.aten.t.default torch._ops.aten.mm.default torch._ops.aten.t.default torch._ops.aten.sum.dim_IntList torch._ops.aten.view.default V0603 10:17:21.593750 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a21b2b0> with grad_outputs: [f32[10, 10]] torch._ops.aten.t.default torch._ops.aten.view.default torch._ops.aten.view.default torch._ops.aten.view.default torch._ops.aten.view.default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127766 Approved by: https://github.com/albanD	2024-07-16 21:46:19 +00:00
PyTorch MergeBot	c1e7e40f24	Revert "[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773 )" This reverts commit f2f31027ce8dc3985663bf6eaa66f3c5559b724a. Reverted https://github.com/pytorch/pytorch/pull/129773 on behalf of https://github.com/clee2000 due to failed inductor/test_torchinductor_dynamic_shapes.py on mac https://github.com/pytorch/pytorch/actions/runs/9963396991/job/27530249256 `f2f31027ce`. The build failed on PR so test jobs didn't run ([comment](https://github.com/pytorch/pytorch/pull/129773#issuecomment-2231808437))	2024-07-16 20:54:14 +00:00
Atul Jangra	4e479568df	[PT2] Log compile ID in the signpost event (#130801 ) Summary: We should log the compile ID as well for easier comparison. Having gone through some of this data, I think we should make a few more changes as well. Reland for D59725870 Test Plan: Sandcastle and Pytorch Differential Revision: D59789110 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130801 Approved by: https://github.com/oulgen	2024-07-16 20:47:36 +00:00
Yifu Wang	2ceade37c5	[SymmetricMemory] put socket files in /tmp (#130757 ) Currently the socket files are put in the current directory, which may not be writable in all environments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130757 Approved by: https://github.com/Chillee ghstack dependencies: #130756	2024-07-16 20:21:05 +00:00
Yifu Wang	0468f2616a	[SymmetricMemory] make sure different subgroups with the same name use different store prefixes (#130756 ) This fixes a race condition in which different subgroups with the same name on the same host would use the same store. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130756 Approved by: https://github.com/Chillee	2024-07-16 20:21:05 +00:00
Will Feng	f2f31027ce	[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773 ) FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead. This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op). One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes. --- Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor` --- Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773 Approved by: https://github.com/eellison	2024-07-16 20:07:41 +00:00
Sam Larsen	156b99cfb1	[inductor] Handle inductor counters in fx graph cache (#130635 ) Summary: Similar to the handling of metrics, save inductor counter deltas in the FX graph cache entry and increment the counters appropriately on a cache hit Test Plan: new unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/130635 Approved by: https://github.com/eellison	2024-07-16 20:07:16 +00:00
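Note on the commit above: the "save counter deltas, replay them on a cache hit" idea can be sketched in plain Python. This is a toy model, not Inductor code: `counters`, `cache`, `compile_graph` and the `fake_pass_hits` counter name are all hypothetical stand-ins.

```python
from collections import Counter

counters = Counter()  # hypothetical stand-in for a global counter table
cache = {}            # hypothetical stand-in for the graph cache


def compile_graph(key):
    """Return a fake artifact, keeping counters consistent across cache hits."""
    if key in cache:
        artifact, deltas = cache[key]
        # Cache hit: replay what a real compilation would have counted.
        counters.update(deltas)
        return artifact

    before = Counter(counters)
    # Pretend that compiling bumps a counter three times.
    counters["fake_pass_hits"] += 3
    artifact = f"compiled:{key}"
    # Store only what changed during this compilation.
    deltas = counters - before
    cache[key] = (artifact, deltas)
    return artifact


compile_graph("g")  # cache miss: counts 3
compile_graph("g")  # cache hit: replays the saved delta, total is now 6
```

Storing deltas rather than absolute values keeps the replay independent of whatever the counters held before the first compilation.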
David Berard	d548417d95	[NJT] throw an exception if nested_tensor_from_jagged is fx-traced without being fx.wrapped (#130702 ) The NJT constructor can't be fx-traced safely due to the dummy nt used: `774ca93fd2/torch/nested/_internal/nested_tensor.py (L501-L508)` The error doesn't appear immediately, but appears if you try to move a module with an fx-traced NJT constructor onto a different device, or try to serialize it. Let's throw an error if we try to fx-trace the NJT constructor so users know to wrap the call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130702 Approved by: https://github.com/jbschlosser, https://github.com/soulitzer	2024-07-16 19:21:10 +00:00
PyTorch MergeBot	0851de5b16	Revert "[ONNX] Remove beartype usage (#130484 )" This reverts commit 1794c35912025aa44b0d70f67ff664b4f7bd1014. Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/clee2000 due to test_sympy_utils failure is real https://github.com/pytorch/pytorch/actions/runs/9961499559/job/27523758780 `1794c35912`. Dr CI is matching with commits in current commit? ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2231575577))	2024-07-16 18:41:51 +00:00
Joel Schlosser	09b1b113f5	Cache min / max seq len for torch.nested.as_nested_tensor(t) (#130766 ) For the `torch.nested.as_nested_tensor(t)` constructor, computing min / max seq len is trivial since the sequence lengths are all the same. Might as well cache them during construction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130766 Approved by: https://github.com/YuqingJ, https://github.com/soulitzer	2024-07-16 18:32:47 +00:00
Edward Z. Yang	408c921d96	Make hashing a SymInt raise an error again (#130548 ) See https://github.com/pytorch/pytorch/issues/130547 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/lezcano	2024-07-16 18:30:30 +00:00
Xu Zhao	1d8baa4df2	[torchbench][servicelab] Fix servicelab test failures (#130781 ) Fix servicelab test failures Pull Request resolved: https://github.com/pytorch/pytorch/pull/130781 Approved by: https://github.com/desertfire	2024-07-16 17:35:13 +00:00
Justin Chu	1794c35912	[ONNX] Remove beartype usage (#130484 ) beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following: 1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it prevents users who have beartype in their environment from using torch.onnx 2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback. 3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484 Approved by: https://github.com/titaiwangms	2024-07-16 17:34:36 +00:00
Jiashen Cao	67e22d6c61	[Fix]: Convert operator that does specialization to its symbolic counterpart (#129578 ) #### Issue During conversion, use the symbolic operator when one exists. #### Test Plan `pytest test/export/test_converter.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129578 Approved by: https://github.com/angelayi	2024-07-16 17:19:57 +00:00
Pian Pawakapan	e8998d68c8	[export] add non-strict training IR (#130062 ) Summary: Adds non-strict implementation of training IR export. Any expected non-strict training IR failures are also either existing strict training IR or non-strict failures (no new failures added). 4 strict training IR failures also resolved. Refraining from unifying export/export_for_training, per @ydwu4's feedback :) Test Plan: added test_export_training_ir_to_run_decomp_non_strict.py for non-strict training IR Differential Revision: D59349454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130062 Approved by: https://github.com/ydwu4, https://github.com/zhxchen17	2024-07-16 17:08:00 +00:00
Sidney Tsang	d2f44eabe7	[Export] Support aten.full.default and aten.full_like.default (#130639 ) Summary: Add operator tests for full & full_like operators Test Plan: Rerun kernel test using ``` buck2 run //glow/fba/tests:run_kernel mode/dev -- --kernel splat --config "input=1;dtype=fp32;fill_value=42.0" -tl_time ``` {F1752274071} Operator tests ``` buck2 run mode/{opt,inplace} //caffe2/torch/fb/test_library:afg_operator_test -- -k __full__ ``` {F1752340913} Differential Revision: D59593849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130639 Approved by: https://github.com/StellarrZ	2024-07-16 16:50:04 +00:00
Colin Peppler	f272e0ab4a	[inductor] support unbacked symint divisors in vars_and_sizes (#130595 ) Scenario: ``` >>> nodes IterationRangesEntry( x2, divisor=192u0 + 192576, length=s1, (xindex//(192u0 + 192576)), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192s1u0 + 192576s1, x4: 192u0 + 192576}) IterationRangesEntry( x1, divisor=192, length=u0 + 1003, ModularIndexing(xindex, 192, u0 + 1003), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192s1u0 + 192576s1, x4: 192u0 + 192576}) IterationRangesEntry( x0, divisor=1, length=192, ModularIndexing(xindex, 1, 192), {x0: 192, x1: u0 + 1003, x2: s1, x3: 192s1u0 + 192576s1, x4: 192u0 + 192576}) ``` Think about whether using fallback is safe here. I think it's safe because the divisor of one IterationRangesEntry should be the product of the lengths of the preceding IterationRangesEntry? Unless, one of the lengths divides by an unbacked symint? Pull Request resolved: https://github.com/pytorch/pytorch/pull/130595 Approved by: https://github.com/aakhundov, https://github.com/ezyang	2024-07-16 16:21:38 +00:00
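Note on the commit above: the claim that each divisor is the product of the preceding lengths can be checked numerically. The sketch below plugs arbitrary concrete values into `u0` and `s1` and reads `ModularIndexing(x, d, n)` as `(x // d) % n`, which is an assumption about its meaning; `modular_indexing` and `divisors_from_lengths` are hypothetical helpers, not Inductor functions.

```python
def modular_indexing(x, divisor, length):
    # Assumed reading of ModularIndexing(x, divisor, length).
    return (x // divisor) % length


def divisors_from_lengths(lengths):
    """Innermost-first lengths -> divisors as running products."""
    divisors, running = [], 1
    for length in lengths:
        divisors.append(running)
        running *= length
    return divisors


u0, s1 = 5, 4  # arbitrary concrete stand-ins for the symbols
lengths = [192, u0 + 1003, s1]  # x0, x1, x2 from the scenario
divisors = divisors_from_lengths(lengths)

# 192 * (u0 + 1003) == 192*u0 + 192576, matching the x2 divisor in the log.
assert divisors == [1, 192, 192 * u0 + 192576]

xindex = 123456  # any flat index below 192 * (u0 + 1003) * s1
x0 = modular_indexing(xindex, divisors[0], lengths[0])
x1 = modular_indexing(xindex, divisors[1], lengths[1])
x2 = xindex // divisors[2]  # outermost entry needs no modulo

# The three coordinates recombine into the flat index.
reconstructed = x0 * divisors[0] + x1 * divisors[1] + x2 * divisors[2]
```

This only exercises the backed case; it says nothing about whether the fallback is safe when a length itself divides by an unbacked symint, which is the open question in the message.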
drisspg	2b43d339fe	Make FlexAttention API public (#130755 ) # Summary Makes the prototype API flex_attention public Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755 Approved by: https://github.com/Chillee	2024-07-16 16:21:25 +00:00
PyTorch MergeBot	cbda8be537	Revert "Propagate buffer and parameter indices through AOT (#130393 )" This reverts commit 69a77389e2c4052834c89a25757cdbf5f83b6208. Reverted https://github.com/pytorch/pytorch/pull/130393 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 `80236dca90` lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))	2024-07-16 15:43:34 +00:00
PyTorch MergeBot	9cb23ba85b	Revert "Add buffer static input tests to cudagraph trees (#130402 )" This reverts commit 80236dca90b0874cb2b6f9c9fa5f159c55726401. Reverted https://github.com/pytorch/pytorch/pull/130402 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 `80236dca90` lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))	2024-07-16 15:43:34 +00:00
Sam Larsen	c509319210	[inductor] Disable remote fx graph cache in test_snode_runtime (#130655 ) Summary: Unfortunately we can't save / restore metrics.metrics.node_runtimes in the cache entries because these contain objects that don't pickle: `TypeError: cannot pickle 'PyCapsule' object`. Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:snode_runtime -- --exact 'caffe2/test/inductor:snode_runtime - test_mm (caffe2.test.inductor.test_snode_runtime.ComputeBoundedTests)' --run-disabled --jobs 18 --stress-runs 10` Differential Revision: D59705654 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130655 Approved by: https://github.com/oulgen	2024-07-16 15:11:17 +00:00
Aaron Enye Shi	aa4ad711ef	[CCA][Memory Snapshot] Create TraceEntryRingBuffer class for alloc_trace logic (#130741 ) Summary: Move the alloc_trace logic into a separate class, to reduce risk of deadlocks when mixing with CCA's lock. Switch to an std::mutex instead of std::recursive_mutex. Lets us re-use the logic in TraceEntryRingBuffer class for later diffs. Test Plan: CI, resnet run, and FBR model. Differential Revision: D59690408 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130741 Approved by: https://github.com/davidberard98	2024-07-16 15:01:48 +00:00
eellison	e11c41035c	Directly use empty strided in cudagraph copy (#130777 ) We had an issue with the `-1` somehow resulting in a negative number of required elements. Not sure why the original didn't work; we should land if CI is green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130777 Approved by: https://github.com/BoyuanFeng	2024-07-16 14:37:30 +00:00
Aaron Orenstein	4c3348932c	typing: convert_frame (#130670 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130670 Approved by: https://github.com/Skylion007 ghstack dependencies: #130669	2024-07-16 14:31:35 +00:00
Aaron Orenstein	ea25febfab	typing: storage (#130669 ) This isn't a full typing of the file - it just fixes some uses of unbound 'T' (if you use a TypeVar as an output it also needs to be an input). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130669 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-07-16 14:31:35 +00:00
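Note on the commit above: the rule "if you use a TypeVar as an output it also needs to be an input" can be illustrated without any PyTorch code. Both functions below are hypothetical examples, not taken from `torch/storage.py`; the comments describe how a static type checker would typically treat them, while at runtime both simply execute.

```python
from typing import List, TypeVar

T = TypeVar("T")


# Unbound use: T appears only in the return annotation, so a type checker
# has no argument to infer it from and the annotation carries no information.
def cast_anything(value: object) -> T:
    return value  # type: ignore[return-value]


# Bound use: T appears in a parameter, so the return type follows the
# element type of the list that is passed in.
def first(items: List[T]) -> T:
    return items[0]


number = first([3, 4])        # inferred as int by a type checker
word = first(["a", "b"])      # inferred as str by a type checker
opaque = cast_anything(3)     # nothing to infer T from
```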
Isuru Fernando	8390843eba	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang	2024-07-16 14:29:29 +00:00
David Berard	1fbfb3202d	[docs][TorchScript] document c10::AliasAnalysisKind::CONSERVATIVE (#130765 ) I spent a while trying to search this to remember what this was called. Adding it to the OVERVIEW.md docs so it's easier to search Pull Request resolved: https://github.com/pytorch/pytorch/pull/130765 Approved by: https://github.com/nmacchioni, https://github.com/eellison, https://github.com/aaronenyeshi	2024-07-16 14:20:31 +00:00
Xu Han	69e9917245	[inductor] adapte windows file path (#130713 ) This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully. Detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758 After the file path was adapted for Windows, the first Windows inductor case ran successfully. ```python import torch def foo(x, y): a = torch.sin(x) b = torch.cos(x) return a + b opt_foo1 = torch.compile(foo) print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10))) ``` Result: ![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41) Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire	2024-07-16 13:53:39 +00:00
Aaron Gokaslan	53e5b8ac5b	[BE]: Update flake8-comprehensions and enable C420 (#130699 ) Uses `dict.fromkeys` whenever possible as covered by flake8-comprehensions rule C420. While the ruff rule RUF025 is still in preview, flake8-comprehensions has added a new rule which covers this. Using `dict.fromkeys` is faster when the value being added to the dictionary is the same at every iteration and is immutable; it also removes an unnecessary dict comprehension. This rule will be enabled with our current ruleset in RUF in 0.6 as C420. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130699 Approved by: https://github.com/lezcano, https://github.com/ezyang	2024-07-16 13:47:49 +00:00
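Note on the commit above: a minimal illustration of the rewrite that C420 asks for, and of why the message restricts it to immutable values. The `keys` list is made up for the example and nothing here depends on PyTorch.

```python
keys = ["weight", "bias", "running_mean"]

# The pattern C420 flags: a comprehension mapping every key to one constant.
flagged = {k: None for k in keys}

# The suggested form: same mapping, no comprehension. The value defaults to None.
preferred = dict.fromkeys(keys)

# Caveat: fromkeys stores the SAME value object under every key, so it is
# only a drop-in replacement when that value is immutable.
shared = dict.fromkeys(keys, [])
shared["weight"].append(1)  # also visible through "bias" and "running_mean"

# A comprehension builds a fresh list per key, so it stays the right tool here.
separate = {k: [] for k in keys}
separate["weight"].append(1)
```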
Xu Zhao	213685ba97	[torchao][pt2 benchmark runner] Run performance test non-alternately (#130136 ) Summary: By default, performance tests (speedup experiments) will run the baseline and test backend alternately. However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized. Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend). Test Plan: ``` buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 ``` ``` buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune ``` Differential Revision: D59332736 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136 Approved by: https://github.com/jerryzh168	2024-07-16 13:38:17 +00:00
eellison	67c6941b4e	Update torch.cat decomp for 0-dim (#130763 ) Fix for https://github.com/pytorch/pytorch/issues/130615 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130763 Approved by: https://github.com/Skylion007, https://github.com/mlazos	2024-07-16 13:34:01 +00:00
Jiong Gong	705da70f2c	[inductor][cpp] align dtype convert cache between vec and scalar kernels (#130677 ) The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store" which might be removed and ignored. This would lead to inconsistent code generated between vec and scalar kernels since we generate scalar kernel first followed by the vector kernel and the store buffer might be removed by the scalar and impacts the vector kernel codegen. This PR move the caching from "store" to the "to_dtype" calls which won't be impacted by the removed buffers. `pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py` before ```c++ extern "C" void kernel(const bfloat16* in_ptr0, bfloat16* out_ptr1) { { for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16); auto tmp1 = at::vec::convert<float>(tmp0); auto tmp2 = tmp1 + tmp1; auto tmp3 = at::vec::convert<bfloat16>(tmp2); auto tmp4 = at::vec::convert<float>(tmp3); auto tmp5 = tmp1 + tmp4; auto tmp6 = at::vec::convert<bfloat16>(tmp5); tmp6.store(out_ptr1 + static_cast<long>(x0), 16); } #pragma omp simd simdlen(8) for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; auto tmp1 = c10::convert<float>(tmp0); auto tmp2 = decltype(tmp1)(tmp1 + tmp1); auto tmp3 = c10::convert<bfloat16>(tmp2); auto tmp4 = decltype(tmp1)(tmp1 + tmp2); auto tmp5 = c10::convert<bfloat16>(tmp4); out_ptr1[static_cast<long>(x0)] = tmp5; } } } ``` after ```c++ extern "C" void kernel(const bfloat16* in_ptr0, bfloat16* out_ptr1) { { for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L)) { auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16); auto tmp1 = at::vec::convert<float>(tmp0); auto tmp2 = tmp1 + tmp1; auto tmp3 = 
at::vec::convert<bfloat16>(tmp2); auto tmp4 = tmp1 + tmp2; auto tmp5 = at::vec::convert<bfloat16>(tmp4); tmp5.store(out_ptr1 + static_cast<long>(x0), 16); } #pragma omp simd simdlen(8) for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(x0)]; auto tmp1 = c10::convert<float>(tmp0); auto tmp2 = decltype(tmp1)(tmp1 + tmp1); auto tmp3 = c10::convert<bfloat16>(tmp2); auto tmp4 = decltype(tmp1)(tmp1 + tmp2); auto tmp5 = c10::convert<bfloat16>(tmp4); out_ptr1[static_cast<long>(x0)] = tmp5; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677 Approved by: https://github.com/leslie-fang-intel	2024-07-16 13:25:05 +00:00
PyTorch MergeBot	68a4f2a3df	Revert "Tighten torch.library.infer_schema input types (#130705 )" This reverts commit ca2d424c6e5358f9fee8dc9ee7477de76b50f848. Reverted https://github.com/pytorch/pytorch/pull/130705 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/130705#issuecomment-2230821876))	2024-07-16 12:57:11 +00:00
Andrea Frittoli	dee0f43fde	Add a CI job to check runner det sync (#129746 ) Add a new CI job that runs only when the runner determinator files are modified. The job checks that the `runner_determinator.py` script is in sync with the version embedded in `_runner-determinator.yaml`. Fixes TBD Pull Request resolved: https://github.com/pytorch/pytorch/pull/129746 Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi, https://github.com/jeanschmidt	2024-07-16 11:44:55 +00:00
Jovian Anthony Jaison	e57101d927	Add testing regarding SparseAdam state_dicts (#130645 ) Summary: - Updated SparseAdam to run test_state_dict_deterministic unit test. - Made gradients sparse while keeping weights dense in the above test. Test Plan: - Ran test_optim.py locally. Fixes #116507 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645 Approved by: https://github.com/janeyx99	2024-07-16 11:29:22 +00:00
cyy	168e41009b	[structural binding][10/N] Replace std::tie with structural binding (#130784 ) Follows #130404 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130784 Approved by: https://github.com/malfet	2024-07-16 10:28:14 +00:00
Xuehai Pan	747b38c131	[BE][Easy][2/19] enforce style for empty lines in import segments in `.ci/` and `.github/` (#129753 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129753 Approved by: https://github.com/malfet ghstack dependencies: #129752	2024-07-16 09:40:00 +00:00
Yu, Guangye	096dc444ce	Keep zero check be compatible with different sympy versions (#130729 ) # Motivation I found a difference between sympy 1.12 and 1.13. ```python # for 1.12 >>> import sympy >>> a = sympy.Number(0.0) >>> a == 0 True ``` ```python # for 1.13 >>> import sympy >>> a = sympy.Number(0.0) >>> a == 0 False ``` This difference affects the result of [safe_mul](`6beec34b1c/torch/utils/_sympy/value_ranges.py (L521-L528)`): when `a = sympy.Number(0.0)` and `b = inf`, the result is incorrectly `nan` with sympy 1.13 (the expected result is 0). ```python def safe_mul(a, b): # Make unknown() * wrap(0.0) == wrap(0.0) if a == 0.0: return a elif b == 0.0: return b else: return a * b ``` Across sympy versions, `sympy.Number(0)` behaves the same and always equals 0.0. ```python >>> import sympy >>> a = sympy.Number(0) >>> a == 0.0 True # for different sympy versions ``` So, use 0.0 when checking for zero in safe_mul to stay compatible with different sympy versions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130729 Approved by: https://github.com/lezcano, https://github.com/EikanWang	2024-07-16 08:39:00 +00:00
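To show why the zero check in `safe_mul` matters at all, here is a small sketch that uses plain Python floats as a stand-in for sympy numbers (an assumption made so the example runs without sympy; it does not reproduce the sympy 1.12 vs 1.13 difference itself).

```python
import math


def safe_mul(a, b):
    # Same shape as the safe_mul quoted in the commit above: short-circuit on
    # zero so that 0 * inf yields 0 instead of nan.
    if a == 0.0:
        return a
    elif b == 0.0:
        return b
    else:
        return a * b


# Without the guard, IEEE 754 arithmetic turns 0 * inf into nan.
plain = 0.0 * math.inf

# With the guard, the zero operand is returned unchanged.
guarded = safe_mul(0.0, math.inf)
```

If the equality test in the guard ever evaluates to false for a zero operand, as the commit reports for one sympy version, the function falls through to `a * b` and produces the `nan` shown in `plain`.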
Animesh Jain	fedae41c57	[dynamo] Do not mark nn.module containers as BuiltinNNModuleVariable (#130773 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130773 Approved by: https://github.com/williamwen42, https://github.com/mlazos	2024-07-16 06:55:46 +00:00
Aaron Gokaslan	83eedf66b9	Update libfmt submodule to 11.0.1 (#130628 ) Update libfmt to 11.0.1 reopen of https://github.com/pytorch/pytorch/pull/129962. Requires a kineto update and moves fmt::join into a separate include so added it where necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130628 Approved by: https://github.com/aaronenyeshi	2024-07-16 06:12:11 +00:00
chuanqiw	c549629696	[CD] Fix xpu nightly wheel test failure (#130742 ) The xpu nightly wheel test hit a permission issue on the `linux.idc.xpu` runner: those runners were onboarded with the `jenkins` user, but the binary test runs in the docker container directly as `root`, so the temp files can't be deleted. See https://github.com/pytorch/pytorch/actions/runs/9935452320/job/27448053625#step:8:91 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130742 Approved by: https://github.com/atalman	2024-07-16 05:31:20 +00:00
cyy	95dbbf713e	[Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109 ) Follows #125102 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109 Approved by: https://github.com/ezyang	2024-07-16 04:23:42 +00:00
Wanchao Liang	7b2e802f31	[dtensor] add a few dunder methods to pointwise ops (#130754 ) fixes https://github.com/pytorch/pytorch/issues/130671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130754 Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/msaroufim ghstack dependencies: #130753	2024-07-16 02:53:35 +00:00
Wanchao Liang	2b2671a7b1	[dtensor] fix foreach_norm when ord is 2 (#130753 ) As titled: fixes a case where, when `ord` is passed as 2 (the default value), the op dispatch does not receive the default value. We simply check whether the args schema received an `ord` field or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130753 Approved by: https://github.com/awgu	2024-07-16 02:53:35 +00:00
Aaron Gokaslan	a29052a0bf	[BE][Ez]: Update ruff to 0.5.2 (#130698 ) Update ruff to 0.5.2, which includes bug fixes and performance improvements Pull Request resolved: https://github.com/pytorch/pytorch/pull/130698 Approved by: https://github.com/ezyang	2024-07-16 01:31:30 +00:00
Adrian Wälchli	ad314a2f05	Pass `torch.load(weights_only=)` internally to avoid FutureWarning (#130663 ) Fixes #130658 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130663 Approved by: https://github.com/malfet, https://github.com/LucasLLC	2024-07-16 01:24:38 +00:00
Sam Larsen	3cd2ae331a	Use inductor TestCase for distributed tests (#129494 ) Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494 Approved by: https://github.com/eellison	2024-07-16 01:24:35 +00:00
Brian Hirsh	39eeaac4e5	inductor: avoiding moving constructor to cuda when it would cause h2d sync in index_put_ fallback (#130338 ) My attempt at a fix for https://github.com/pytorch/pytorch/issues/130335, see issue for more details / internal xref. Any feedback from inductor folks is appreciated. I attempted to make the move-constructors-to-cuda pass a bit less aggressive by detecting when the movement would incur a H2D sync for `aten.index_put_`. I'm not sure if there are any other ops that inductor falls back to eager on, that may-or-may-not incur a H2D sync if we change any of their inputs from cpu to cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130338 Approved by: https://github.com/eellison	2024-07-16 00:48:58 +00:00
Jiang, Yanbing	93a03edcf9	Update error message in meta__convert_weight_to_int4pack (#130707 ) This PR is to fix error message in https://github.com/pytorch/pytorch/pull/129940. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130707 Approved by: https://github.com/lezcano, https://github.com/malfet	2024-07-16 00:44:35 +00:00
Xuehai Pan	a3abfa5cb5	[BE][Easy][1/19] enforce style for empty lines in import segments (#129752 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129752 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-07-16 00:42:56 +00:00
eqy	5e617d7ef5	[CUDA] Actually bump tolerances for `test_grad_pca_lowrank` (#130770 ) Fixes change in #129902 to actually bump pca rather than svd, thanks @ptrblck for the catch Pull Request resolved: https://github.com/pytorch/pytorch/pull/130770 Approved by: https://github.com/Skylion007	2024-07-16 00:41:10 +00:00
Michael Lazos	80236dca90	Add buffer static input tests to cudagraph trees (#130402 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402 Approved by: https://github.com/eellison ghstack dependencies: #130391, #130392, #130503, #130393	2024-07-16 00:25:38 +00:00
Michael Lazos	69a77389e2	Propagate buffer and parameter indices through AOT (#130393 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393 Approved by: https://github.com/bdhirsh ghstack dependencies: #130391, #130392, #130503	2024-07-16 00:25:38 +00:00
Michael Lazos	200d3d0a89	Remove static param counting if inlining NN modules (#130503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130503 Approved by: https://github.com/bdhirsh ghstack dependencies: #130391, #130392	2024-07-16 00:25:34 +00:00
Michael Lazos	0d0c09702a	Update mark_static_address for inlining NN modules (#130392 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130392 Approved by: https://github.com/anijain2305 ghstack dependencies: #130391	2024-07-16 00:25:29 +00:00
Michael Lazos	d8616eb66a	Mark nn_module params and buffers as static in dynamo (#130391 ) This PR marks all buffers and parameters of an NNModule as static using the `mark_static_address` API. As a result, when tensors are passed to AOT, the `tensor_dict` metadata of placeholder nodes will contain the `static_address_type` key, indicating which graph argument positions are static for cudagraphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130391 Approved by: https://github.com/anijain2305	2024-07-16 00:25:23 +00:00
eellison	9ab8d47f9d	Constant folding for dynamic shape node (#129686 ) Extend constant folding to dynamic shape nodes; only pointwise ops and some restricted ops are supported. We support dynamic shapes by limiting constant folding to ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead from constant folding. Taken over from https://github.com/pytorch/pytorch/pull/128937; joint work with @imzhuhl Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686 Approved by: https://github.com/Chillee ghstack dependencies: #130367	2024-07-16 00:17:11 +00:00
yuqingj	ea4f310ff1	[Nested Tensor][easy] Add softmax backward support (#130602 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130602 Approved by: https://github.com/davidberard98, https://github.com/jbschlosser	2024-07-16 00:07:42 +00:00
Andrew Gu	d3ab8ceced	[FSDP2] Allowed `List[nn.Module]` as arg (#127786 ) This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication. Approach At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node. To implement the runtime schedule, we define new forward hooks that run based on the following semantics: - If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op. - If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op. - First and last are determined by scoreboarding against a set of the modules. - This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward. Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`. Examples This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382. If at least one of the modules in the list does not run forward before backward, then there will be a warning message like: ``` 1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. 
Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127786 Approved by: https://github.com/yf225, https://github.com/weifengpy ghstack dependencies: #127773	2024-07-15 23:54:10 +00:00
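The hook semantics described in the commit above (first module runs the pre-hook, last module runs the post-hook, scoreboard cleared at the end of backward) can be sketched in plain Python. This is a toy model, not the FSDP2 implementation: `GroupHooks`, its method names, and the string stand-ins for modules are all hypothetical.

```python
class GroupHooks:
    """Toy scoreboard for a group of modules that share one pre/post hook."""

    def __init__(self, modules, pre_hook, post_hook):
        self.modules = set(modules)
        self.pre_hook = pre_hook
        self.post_hook = post_hook
        self.seen_pre = set()   # modules whose pre-forward has fired
        self.seen_post = set()  # modules whose post-forward has fired

    def on_pre_forward(self, module):
        # Only the first module of the group actually runs the pre-hook.
        if not self.seen_pre:
            self.pre_hook()
        self.seen_pre.add(module)

    def on_post_forward(self, module):
        # Only the last module of the group actually runs the post-hook.
        self.seen_post.add(module)
        if self.seen_post == self.modules:
            self.post_hook()
            self.reset()

    def reset(self):
        # Also meant to be called at the end of backward, so that a module
        # that never ran forward does not block the next iteration's hooks.
        self.seen_pre.clear()
        self.seen_post.clear()


events = []
hooks = GroupHooks(
    ["norm", "output"],
    pre_hook=lambda: events.append("pre"),
    post_hook=lambda: events.append("post"),
)
for name in ["norm", "output"]:
    hooks.on_pre_forward(name)
    hooks.on_post_forward(name)
# events now holds one "pre" followed by one "post" for the whole group
```

The explicit `reset()` mirrors the requirement in the commit message that the set be cleared after backward when at least one module in the list is never used.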
Andrew Gu	b27695791e	[PT-D] Relaxed `contract` to allow `Sequence[nn.Module]` (#127773 ) This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127773 Approved by: https://github.com/weifengpy	2024-07-15 23:54:10 +00:00
Bilal Khan	54a932b0ac	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. 
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. 
For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/eqy, https://github.com/eellison	2024-07-15 23:23:23 +00:00
Sijia Chen	006020ff6e	Fix the cudagraph capture of SDPA (#130712 ) Summary: The scalar tensor by default is on CPU, which failed the cuda graph capture. To fix the issue, we put the scalar tensor on GPU Test Plan: buck2 test 'fbcode//mode/opt' fbcode//gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator -- --exact 'gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator - gen_ai.llm_inference.fb.tests.test_llama2_multimodal_generator.TestGenerator: test_multimodal_decode_gen2' Differential Revision: D59740639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130712 Approved by: https://github.com/Skylion007, https://github.com/chenyang78	2024-07-15 23:05:48 +00:00
Alnis Murtovi	50ef099ad0	Learn a heuristic to decide whether to pad before mm (#128643 ) This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code. The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code. The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set. The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen. On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often. The results for randomly generated data and huggingface look as follows: `max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`. The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set. 
A100 ``` max_depth min_samples_leaf dataset correct wrong unsure total max_wrong_speedup gman_wrong_speedup threshold 15 5.0 10 train 2730 4 3023 5757 1.372220 1.193873 1.702530 16 5.0 10 val 878 0 1042 1920 NaN NaN 1.702530 17 5.0 10 test 925 2 993 1920 1.741708 1.354954 1.702530 18 5.0 10 hf-train 14 0 22 36 NaN NaN 1.702530 19 5.0 10 hf-inf 7 0 1 8 NaN NaN 1.702530 ``` The numbers for huggingface only include tensors where each dim is >=512. If all tensors would have been included there would have been the following number of matmuls, where at least one dimension is unaligned: A100 hf-train: 60 A100 hf-inf: 10 ## Results on running huggingface locally This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds. #pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure"). I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency. 
Results on huggingface training ``` name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision BartForCausalLM 1.19 (+/- 0.00) 1.19 (+/- 0.00) -0.00 40.33 (+/- 1.13) 40.95 (+/- 0.78) -0.62 1.52 3 2 BartForConditionalGeneration 1.53 (+/- 0.06) 1.47 (+/- 0.05) 0.06 81.93 (+/- 5.20) 82.23 (+/- 1.92) -0.30 0.36 3 1 BlenderbotSmallForCausalLM 1.86 (+/- 0.04) 1.86 (+/- 0.00) 0.00 36.76 (+/- 0.49) 37.62 (+/- 1.33) -0.87 2.31 3 2 CamemBert 2.36 (+/- 0.01) 2.35 (+/- 0.01) 0.01 97.60 (+/- 1.91) 98.69 (+/- 1.35) -1.09 1.11 2 1 DistillGPT2 2.57 (+/- 0.01) 2.57 (+/- 0.01) 0.00 57.33 (+/- 0.77) 58.26 (+/- 1.41) -0.93 1.59 3 2 PLBartForCausalLM 2.07 (+/- 0.01) 2.06 (+/- 0.01) 0.01 32.54 (+/- 0.83) 34.65 (+/- 0.71) -2.11 6.10 3 2 PLBartForConditionalGeneration 1.87 (+/- 0.00) 1.88 (+/- 0.00) -0.01 58.45 (+/- 1.24) 58.95 (+/- 1.92) -0.50 0.85 3 1 RobertaForCausalLM 2.39 (+/- 0.01) 2.40 (+/- 0.01) -0.01 97.38 (+/- 1.52) 97.69 (+/- 1.18) -0.31 0.32 2 1 TrOCRForCausalLM 1.70 (+/- 0.00) 1.70 (+/- 0.00) -0.00 44.79 (+/- 1.33) 45.25 (+/- 1.08) -0.46 1.01 3 2 Mean difference in speedup: 0.01 Mean compilation latency saved: -0.80s Mean compilation latency reduction: 1.68% ``` Results on huggingface inference ``` name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision BartForCausalLM 1.11 (+/- 0.00) 1.11 (+/- 0.00) 0.00 19.02 (+/- 0.28) 19.40 (+/- 0.35) -0.38 1.95 3 2 BartForConditionalGeneration 1.26 (+/- 0.01) 1.23 (+/- 0.03) 0.03 36.84 (+/- 0.40) 36.55 (+/- 0.75) 0.30 -0.81 3 1 BlenderbotSmallForCausalLM 1.87 (+/- 0.02) 1.87 (+/- 0.01) 0.00 17.53 (+/- 0.31) 18.03 (+/- 0.43) -0.49 2.74 3 2 DistillGPT2 2.50 (+/- 0.02) 2.50 (+/- 0.01) 0.00 16.16 (+/- 0.29) 16.40 (+/- 0.18) -0.24 1.46 3 2 
PLBartForCausalLM 1.93 (+/- 0.01) 1.94 (+/- 0.01) -0.00 15.30 (+/- 0.22) 16.01 (+/- 0.71) -0.71 4.43 3 2 PLBartForConditionalGeneration 1.98 (+/- 0.01) 1.98 (+/- 0.01) 0.00 25.90 (+/- 0.32) 26.58 (+/- 0.62) -0.67 2.53 3 1 TrOCRForCausalLM 1.61 (+/- 0.00) 1.62 (+/- 0.00) -0.01 21.38 (+/- 0.37) 21.85 (+/- 0.16) -0.47 2.16 3 2 Mean difference in speedup: 0.00 Mean compilation latency saved: -0.38s Mean compilation latency reduction: 2.07% ``` For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-07-15 23:04:06 +00:00
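The threshold rule described in the commit above can be sketched as follows. This assumes the regression tree yields one positive score per choice and that "x times better" is a ratio of those scores; the function name `decide` and the choice labels are hypothetical, and the real AutoHeuristic code may compare scores differently.

```python
def decide(scores, threshold=1.702530):
    """Return the best choice, or "unsure" if it does not win by `threshold`x.

    `scores` maps each choice to a predicted score (higher is better).
    Scores are assumed to be positive.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    # The winner must beat the runner-up by at least the threshold factor;
    # otherwise the caller falls back to autotuning.
    if best_score < threshold * second_score:
        return "unsure"
    return best


confident = decide({"pad": 2.0, "no_pad": 1.0})
fallback = decide({"pad": 1.5, "no_pad": 1.0})
```

A higher threshold trades fewer wrong decisions for more "unsure" answers, which matches the large `unsure` counts in the A100 table above.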
Sam Larsen	9a5204dc2d	[inductor] Remove "spawn" as an option for parallel compile method (#130746 ) Summary: Looks like "spawn" is broken. Since we have "subprocess", I don't think we need it any more, so just remove as an option. Test Plan: Verified that we get: `AssertionError: Invalid start method: spawn` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130746 Approved by: https://github.com/Skylion007	2024-07-15 22:55:54 +00:00
Jiashen Cao	3f031b96c6	[Fix] Correctly identifying arguments for sub-blocks with renaming logic during TorchScript to ExportedProgram conversion (#128386 ) #### Issue Fix two issues related to input lifting when there are sub-blocks. * Some inputs may appear in the nested sub-blocks, which need a recursive search to identify which arguments need to be lifted / passed in the top-level block. * Some inputs to the sub-block are intermediate results, meaning their names are only numbers. This causes issues during code generation (i.e., invalid argument names). We rename those to valid names. #### Test Plan * `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param` * `test/export/test_converter.py -s -k test_hidden_input_name` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128386 Approved by: https://github.com/angelayi	2024-07-15 22:48:13 +00:00
Jerry Zhang	b893aa71ca	Rename generate_numeric_debug_handle to numeric_debugger (#130590 ) Summary: att Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130590 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-15 22:42:27 +00:00
WeiChunyu-star	535016967a	Enable UFMT on all of torch/sparse (#130545 ) Partially addresses #123062 Ran lintrunner on: - torch/sparse Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/130545 Approved by: https://github.com/ezyang	2024-07-15 22:35:52 +00:00
Alex Dennis	7d4f50de19	dynamo add support for `defaultdict(set)` (#130745 ) Fixes #130554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130745 Approved by: https://github.com/Skylion007	2024-07-15 22:23:33 +00:00
William Wen	3928ca2ab6	[dynamo] update call map to allow multiple input parameters (#130748 ) Fixes https://github.com/pytorch/pytorch/issues/128072. Commandeering https://github.com/pytorch/pytorch/pull/128282 since the issue is now hi pri. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130748 Approved by: https://github.com/Skylion007, https://github.com/anijain2305	2024-07-15 22:16:49 +00:00
eqy	6f32dc0c7b	Don't pass error message as `places` in `assertGreaterAlmostEqual` (#130648 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648 Approved by: https://github.com/awgu	2024-07-15 22:14:49 +00:00
PyTorch MergeBot	dff9d68f18	Revert "Fix names conflict when lifting (#129817 )" This reverts commit 53cf46b8c602f8512d49a5c30bca7fcf5411e25c. Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to Failing inductor/test_flex_attention.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27478084137 `74da2a467f` Sorry for the churn, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2229519886))	2024-07-15 22:08:45 +00:00
PyTorch MergeBot	78799e82b0	Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 )" This reverts commit 1bc390c5f5ac065c156f55f4eceed267ecc67b41. Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 `1bc390c5f5`. Test was introduced by `fa5f572748` which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737))	2024-07-15 21:59:46 +00:00
Yifu Wang	db3a641b71	Implement operator for micro-pipelined all-gather -> _scaled_mm (#129289 ) This PR implements `torch.ops.symm_mem.fused_all_gather_scaled_matmul`. It's similar to `torch.ops.symm_mem.fused_all_gather_matmul`, except that it takes scales and calls ` _scaled_mm`. [Profiling Trace vs. Baseline](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0gmg1f2_) (FB internal only) Co-authored-by: Will Feng <yf225@cornell.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129289 Approved by: https://github.com/Chillee, https://github.com/weifengpy, https://github.com/drisspg	2024-07-15 21:48:35 +00:00
Shuqiang Zhang	77fb5b0e23	[c10d] a new Pytorch API (split_group) to create a process group (#130507 ) This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407 ncclCommSplit Summary: In current Pytorch/c10d, the new_group API is used to create a new process group from the default pg. When device_id is specified in init_process_group and nccl is used as the backend, the new_group call will use ncclCommSplit to create the nccl communicators to save communicator resources. It has a few drawbacks: 1. Redundant calls: suppose the default group has 256 ranks and we need 32 child PGs of 8 ranks each. In this case, each rank needs to call new_group and ncclCommSplit 32 times because of how the new_group API is implemented and the collective requirement of ncclCommSplit. For a specific global rank, 31 of the ncclCommSplit calls would be no_color splits, and only 1 of them is a colored split. With the proposed split_group API, we expect only 1 call of split_group/ncclCommSplit per rank in the above example. 2. new_group can only split from default_pg: ideally, a new pg should be able to be split from any pg. With the new split_group API, users can create new PGs using ncclCommSplit with fewer calls and initialize the PG eagerly. This is also useful when creating many P2P communicators. Test Plan: New UTs: e.g., python test/distributed/test_c10d_nccl.py -k test_comm_split_group_larger_scale Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507 Approved by: https://github.com/wconstab	2024-07-15 21:26:43 +00:00
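The call counts quoted in this message (32 calls per rank, 31 of them no_color, versus a single call) can be checked with a small pure-Python model. This is only a counting sketch under the assumptions stated in the message; `partition`, `count_new_group_calls`, and `count_split_group_calls` are hypothetical helpers and do not call c10d or NCCL.

```python
from typing import Dict, List


def partition(world_size: int, group_size: int) -> List[List[int]]:
    """Split ranks 0..world_size-1 into consecutive groups of group_size."""
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


def count_new_group_calls(rank: int, groups: List[List[int]]) -> Dict[str, int]:
    """Model new_group: every rank takes part in the call for every group."""
    colored = sum(1 for group in groups if rank in group)
    return {
        "total": len(groups),
        "colored": colored,
        "no_color": len(groups) - colored,
    }


def count_split_group_calls(rank: int, groups: List[List[int]]) -> Dict[str, int]:
    """Model split_group: each rank makes one call for the group it joins."""
    colored = sum(1 for group in groups if rank in group)
    return {"total": 1, "colored": colored, "no_color": 0}


groups = partition(world_size=256, group_size=8)
old_counts = count_new_group_calls(rank=17, groups=groups)
new_counts = count_split_group_calls(rank=17, groups=groups)
```

In this model the per-rank cost of `new_group` grows with the number of child groups, while the `split_group` cost stays constant.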
Nikita Shulga	ac3e2cb64a	[BE] Delete unused -rg.yml workflow (#130759 ) As well as `_linux-test-label.yml` as ARC experiment is dead Pull Request resolved: https://github.com/pytorch/pytorch/pull/130759 Approved by: https://github.com/ZainRizvi	2024-07-15 20:41:59 +00:00
Iris Zhang (PyTorch)	ee6f0ab190	[DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495 ) (#130685 ) Summary: As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor due to thread id being different. This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases. As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash. ``` test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32 ``` Adding an additional is_initialized() check since APF has a test mocking the backend without pg initialized. Therefore, we need to add the is_initialized() check to avoid test failure. In real use case, we should have a pg initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but temporarily adding it to unblock APF conveyor runs. Test Plan: ``` [irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends' ``` Reviewed By: gag1jain Differential Revision: D59725924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685 Approved by: https://github.com/gag1jain	2024-07-15 20:05:26 +00:00
chilli	27322355de	Added some more documentation to block mask creation (#130649 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130649 Approved by: https://github.com/drisspg ghstack dependencies: #130626	2024-07-15 19:48:42 +00:00
yuqingj	0e79e1f958	[NJT+SDPA]Fix flash_attention output when batch_size=1 and seq_len=1 (#130652 ) fix issue #130196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130652 Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jbschlosser	2024-07-15 19:44:04 +00:00
PyTorch MergeBot	074a5c0c9b	Revert "[BE] bump `optree` version to 0.12.1 (#130139 )" This reverts commit 8fcb156e8b5697a8f292db6db2a1803c5f4ce2d7. Reverted https://github.com/pytorch/pytorch/pull/130139 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_codegen_dynamic_shapes.py and test_sympy_utils.py `8fcb156e8b` ([comment](https://github.com/pytorch/pytorch/pull/130139#issuecomment-2229248447))	2024-07-15 19:42:11 +00:00
Xu Han	f1456c74a0	Fix mkl-static issue for Windows. (#130697 ) Background: We found a performance regression in the PyTorch Windows release/2.4: https://github.com/pytorch/pytorch/issues/130619 After some debugging, I found that the PyTorch Windows static MKL build options are wrong: <img width="1049" alt="image" src="https://github.com/user-attachments/assets/38692142-bfca-4c98-8092-6e105c82bb13"> 1. The thread lib is wrong. 2. The `openmp` lib and config are missing. > Debug history: https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226782504 and https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226418611 This PR fixes the `mkl-static` build options. <img width="863" alt="image" src="https://github.com/user-attachments/assets/834f6cee-7e6d-4d74-b2bc-8a270f05e429"> Reference: <img width="482" alt="image" src="https://github.com/user-attachments/assets/8184dadb-f230-4062-a49f-51df1d7285f5"> https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.c6izlg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130697 Approved by: https://github.com/jgong5, https://github.com/atalman	2024-07-15 19:28:11 +00:00
Wanchao Liang	a7cfe40c9b	[dtensor] Improve from_local API with run_check (#130289 ) as titled, this PR: 1. switch `run_check` to be by default False and add extra doc/comments about the correctness guarantee. Since I observed so many calls forget to use run_check=False, we should simply switch to not perform metadata check and make our documentation explicit 2. Implement metadata check by picking up the changes from https://github.com/pytorch/pytorch/pull/115229 3. Improve the from_local documentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/130289 Approved by: https://github.com/awgu, https://github.com/wz337 ghstack dependencies: #130286, #130287, #130288	2024-07-15 18:52:55 +00:00
Wanchao Liang	3342f3aa4e	[dtensor] simplify sdpa strategies (#130288 ) as titled, this PR simplifies both flash and efficient attention op strategy generation paths Pull Request resolved: https://github.com/pytorch/pytorch/pull/130288 Approved by: https://github.com/tianyu-l ghstack dependencies: #130286, #130287	2024-07-15 18:52:55 +00:00
Wanchao Liang	7d82dc2c23	[dtensor] slice_backward to use op strategy (#130287 ) as titled. slice_backward currently forwards the sharding unconditionally, which is mathematically wrong. This PR switches it to an op strategy and only allows replication. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130287 Approved by: https://github.com/awgu ghstack dependencies: #130286	2024-07-15 18:52:49 +00:00
Zhanghan Wang	53cf46b8c6	Fix names conflict when lifting (#129817 ) ## Bug description When pending args that are potentially to be lift [here](`58f346c874/torch/_dynamo/output_graph.py (L1866)`) having same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](`58f346c874/torch/_dynamo/output_graph.py (L2081)`) can finally create a name ([here](`58f346c874/torch/fx/graph.py (L1008)`)) that overwrite args to lift. And thus causing a wrong output of graph. ## Reproducing Below is an reproduceable example, ```python import logging from typing import List import torch from functorch.compile import aot_module_simplified, make_boxed_func @torch.library.custom_op("mylib::somefunc_forward", mutates_args=()) def somefunc_forward( input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: return torch.ones_like(input_) @somefunc_forward.register_fake def _(input_, shape, weight): return torch.empty_like(input_) @torch.library.custom_op("mylib::somefunc_backward", mutates_args=()) def somefunc_backward( grad_output: torch.Tensor, input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: print(f"backward.{grad_output.shape=}") print(f"backward.{input_.shape=}") print(f"backward.{weight.shape=}") print(f"backward.{shape=}") assert list(weight.shape) == shape return torch.ones_like(weight) @somefunc_backward.register_fake def _(grad_output, input_, weight, shape): return torch.empty_like(weight) def a_func(grad_output, input_, weight_, shape): return torch.ones_like(input_.sum() * weight_) class SomeFunc(torch.autograd.Function): @staticmethod def forward(ctx, input, weight, normalized_shape): ctx.normalized_shape = normalized_shape input_ = input.contiguous() weight_ = weight.contiguous() output = somefunc_forward(input_, weight_, ctx.normalized_shape) ctx.save_for_backward(input_, weight_) return output @staticmethod def backward(ctx, grad_output): input_, weight_ = ctx.saved_tensors # grad_weight = 
a_func(grad_output, input_, weight_, ctx.normalized_shape) grad_weight = somefunc_backward( grad_output.contiguous(), input_, weight_, ctx.normalized_shape, ) return None, grad_weight, None class MyModel(torch.nn.Module): def __init__(self): super().__init__() self.weight = torch.nn.Parameter(torch.ones(7)) def forward(self, x): return SomeFunc.apply(x, self.weight, [7]) model = MyModel() torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True) def aot_print_backend(gm, sample_inputs): # Forward compiler capture def fw(gm, sample_inputs): print(f"----- fw") gm.print_readable() return make_boxed_func(gm.forward) # Backward compiler capture def bw(gm, sample_inputs): print(f"----- bw") gm.print_readable() return make_boxed_func(gm.forward) # Call AOTAutograd gm_forward = aot_module_simplified( gm, sample_inputs, fw_compiler=fw, bw_compiler=bw ) return gm_forward model = torch.compile( model, backend=aot_print_backend, dynamic=False, ) out = model(torch.rand((128, 4, 7))) out.mean().backward() ``` I can see log that showing calling into create_graph_input like ```log V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none) V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none) ``` And the backward graph generate will be like ```log class GraphModule(torch.nn.Module): def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"): contiguous_1 = contiguous contiguous_2 = contiguous_1 # No stacktrace found for following nodes _set_grad_enabled = torch._C._set_grad_enabled(False) # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(), contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous(); somefunc_forward_default = None # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return 
self._opoverload(*args, **kwargs) somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]); contiguous = contiguous_1 = contiguous_2 = None # No stacktrace found for following nodes _set_grad_enabled_1 = torch._C._set_grad_enabled(True) return (None, somefunc_backward_default) ``` The original code of `somefunc_backward` takes an input list of `grad_output`, `input_`, `weight` and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which triggers the assertion I added in `somefunc_backward`. ## Environment ```log Collecting environment information... PyTorch version: 2.5.0a0+git0b7e8df Is debug build: False CUDA used to build PyTorch: None ROCM used to build PyTorch: N/A OS: macOS 14.5 (arm64) GCC version: Could not collect Clang version: 15.0.0 (clang-1500.3.9.4) CMake version: version 3.26.4 Libc version: N/A Python version: 3.9.19 (main, May 6 2024, 14:39:30) [Clang 14.0.6 ] (64-bit runtime) Python platform: macOS-14.5-arm64-arm-64bit Is CUDA available: False CUDA runtime version: No CUDA CUDA_MODULE_LOADING set to: N/A GPU models and configuration: No CUDA Nvidia driver version: No CUDA cuDNN version: No CUDA HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Apple M3 Pro Versions of relevant libraries: [pip3] numpy==2.0.0 [pip3] optree==0.11.0 [pip3] torch==2.5.0a0+git0b7e8df [pip3] torchgraph==0.0.1 [conda] numpy 2.0.0 pypi_0 pypi [conda] optree 0.11.0 pypi_0 pypi [conda] torch 2.5.0a0+git0b7e8df dev_0 <develop> [conda] torchgraph 0.0.1 dev_0 <develop> ``` ## How to fix? I put in a naive fix that adds the potential args to lift into `used_names`. This touches private variables; I will fix that if this issue makes sense to you. 
@zou3519 @oulgen Co-authored-by: rzou <zou3519@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817 Approved by: https://github.com/zou3519	2024-07-15 18:49:12 +00:00
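The naming collision described in this message, and the "add the args to lift into used_names" idea, can be illustrated with a toy allocator. This is a deliberately simplified model and an assumption about the mechanism, not the Dynamo or `torch.fx` code: `NameAllocator` and `build_names` are hypothetical, and the model assumes lifted args keep their own names while local nodes go through the allocator.

```python
from typing import Iterable, List, Set


class NameAllocator:
    """Toy unique-name allocator (hypothetical; not the torch.fx namespace)."""

    def __init__(self) -> None:
        self.used: Set[str] = set()

    def reserve(self, names: Iterable[str]) -> None:
        """Mark names as taken before anything asks for them."""
        self.used.update(names)

    def create_name(self, candidate: str) -> str:
        """Return candidate, or candidate_N for the first free N."""
        name, suffix = candidate, 0
        while name in self.used:
            suffix += 1
            name = f"{candidate}_{suffix}"
        self.used.add(name)
        return name


def build_names(pending: List[str], local: str, reserve_pending: bool) -> List[str]:
    """Lift pending[0], name one local node, then lift the remaining args."""
    alloc = NameAllocator()
    if reserve_pending:
        alloc.reserve(pending)
    # Assumption of this model: lifted args are inserted under their own names.
    alloc.used.add(pending[0])
    names = [pending[0]]
    # A local node (e.g. the result of grad_output.contiguous()) gets a fresh name.
    names.append(alloc.create_name(local))
    for arg in pending[1:]:
        alloc.used.add(arg)
        names.append(arg)
    return names


pending_args = ["contiguous", "contiguous_1"]
buggy = build_names(pending_args, "contiguous", reserve_pending=False)
fixed = build_names(pending_args, "contiguous", reserve_pending=True)
```

Without the reservation, the local node is handed `contiguous_1`, which a later lifted arg then reuses; with it, the local node is pushed to `contiguous_2` and every name stays distinct.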
Guilherme Leobas	b4b64f76e5	Ensure tensors devices match on `torch.index_put` batch rule impl (#130479 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130479 Approved by: https://github.com/zou3519	2024-07-15 18:16:31 +00:00
Joel Schlosser	00d71b3e86	Tweak tolerances for test_vjp_linalg_tensorsolve_cuda_float32 to pass in Windows / debug builds (#130449 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130449 Approved by: https://github.com/zou3519, https://github.com/malfet ghstack dependencies: #128238, #130360	2024-07-15 17:35:34 +00:00
PyTorch MergeBot	9e161af179	Revert "Increase tolerance for tensorsolve tests (#130620 )" This reverts commit 103b6ccab2bd025dfacc8c8a91f71f3d68e50426. Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))	2024-07-15 17:35:04 +00:00
Xuehai Pan	8fcb156e8b	[BE] bump `optree` version to 0.12.1 (#130139 ) 0.12.0 Major Updates: - Add context manager to temporarily set the dictionary sorting mode - Add accessor APIs - Use `stable` tag for `pybind11` for Python 3.13 support - Fix potential segmentation fault for pickling support 0.12.1 Updates: - Fix warning regression during import when launch with strict warning filters Closes #130155 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139 Approved by: https://github.com/zou3519	2024-07-15 17:27:07 +00:00
PyTorch MergeBot	1e897a0ca4	Revert "Fix names conflict when lifting (#129817 )" This reverts commit 74da2a467f166e00316aee82ba24835ca563ed87. Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to broke dynamo/test_inline_inbuilt_nn_modules.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27461141919 `74da2a467f`. Test passed on PR, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2228993570))	2024-07-15 17:09:52 +00:00
Edward Z. Yang	0099e15b47	Also put unbacked symbols in symbol_to_node in split_module pass (#130535 ) This is not a complete fix but it is a simple one, full fix tracked in https://github.com/pytorch/pytorch/issues/130534 Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7510238679103969/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130535 Approved by: https://github.com/malfet	2024-07-15 16:56:01 +00:00
rzou	ca2d424c6e	Tighten torch.library.infer_schema input types (#130705 ) Made the following changes: - mutates_args is now keyword-only and mandatory. This is to align with torch.library.custom_op (which makes it mandatory because it's easy to miss) - op_name is now keyword-only. This helps the readability of the API - updated all usages of infer_schema This change is not BC-breaking because we introduced torch.library.infer_schema a couple of days ago. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705 Approved by: https://github.com/yushangdi	2024-07-15 16:43:57 +00:00
PyTorch MergeBot	9df4bc6a0d	Revert "Constant folding for dynamic shape node (#129686 )" This reverts commit b7d287fbec0a05a3d4c9524006e6bfd1de6a71a0. Reverted https://github.com/pytorch/pytorch/pull/129686 on behalf of https://github.com/atalman due to Failing internally. Test: https://github.com/pytorch/ao/blob/main/test/prototype/mx_formats/test_mx_linear.py ([comment](https://github.com/pytorch/pytorch/pull/129686#issuecomment-2228755295))	2024-07-15 15:19:24 +00:00
Yu, Guangye	7cd48df2da	Refine the logic of device construction when only device index is given (#129119 ) # Motivation Before this PR, device construction was `cuda` type when only a device index was given. It also returns the `PrivateUser1` type if a `PrivateUser1` type is registered. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) >>> b tensor([1, 2], device='cuda:0') ``` It works well on CUDA GPU. But it will raise unexpected information and error running on XPU. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled ``` With this PR, refine the logic to use the currently available device type instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119 Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang ghstack dependencies: #129463, #129205, #129363	2024-07-15 14:34:29 +00:00
Yu, Guangye	9cae2160f5	Introduce the concept of Accelerators to PyTorch doc (#129363 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129363 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #129463, #129205	2024-07-15 14:24:46 +00:00
Zhanghan Wang	74da2a467f	Fix names conflict when lifting (#129817 ) ## Bug description When pending args that are potentially to be lift [here](`58f346c874/torch/_dynamo/output_graph.py (L1866)`) having same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](`58f346c874/torch/_dynamo/output_graph.py (L2081)`) can finally create a name ([here](`58f346c874/torch/fx/graph.py (L1008)`)) that overwrite args to lift. And thus causing a wrong output of graph. ## Reproducing Below is an reproduceable example, ```python import logging from typing import List import torch from functorch.compile import aot_module_simplified, make_boxed_func @torch.library.custom_op("mylib::somefunc_forward", mutates_args=()) def somefunc_forward( input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: return torch.ones_like(input_) @somefunc_forward.register_fake def _(input_, shape, weight): return torch.empty_like(input_) @torch.library.custom_op("mylib::somefunc_backward", mutates_args=()) def somefunc_backward( grad_output: torch.Tensor, input_: torch.Tensor, weight: torch.Tensor, shape: List[int], ) -> torch.Tensor: print(f"backward.{grad_output.shape=}") print(f"backward.{input_.shape=}") print(f"backward.{weight.shape=}") print(f"backward.{shape=}") assert list(weight.shape) == shape return torch.ones_like(weight) @somefunc_backward.register_fake def _(grad_output, input_, weight, shape): return torch.empty_like(weight) def a_func(grad_output, input_, weight_, shape): return torch.ones_like(input_.sum() * weight_) class SomeFunc(torch.autograd.Function): @staticmethod def forward(ctx, input, weight, normalized_shape): ctx.normalized_shape = normalized_shape input_ = input.contiguous() weight_ = weight.contiguous() output = somefunc_forward(input_, weight_, ctx.normalized_shape) ctx.save_for_backward(input_, weight_) return output @staticmethod def backward(ctx, grad_output): input_, weight_ = ctx.saved_tensors # grad_weight = 
a_func(grad_output, input_, weight_, ctx.normalized_shape) grad_weight = somefunc_backward( grad_output.contiguous(), input_, weight_, ctx.normalized_shape, ) return None, grad_weight, None class MyModel(torch.nn.Module): def __init__(self): super().__init__() self.weight = torch.nn.Parameter(torch.ones(7)) def forward(self, x): return SomeFunc.apply(x, self.weight, [7]) model = MyModel() torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True) def aot_print_backend(gm, sample_inputs): # Forward compiler capture def fw(gm, sample_inputs): print(f"----- fw") gm.print_readable() return make_boxed_func(gm.forward) # Backward compiler capture def bw(gm, sample_inputs): print(f"----- bw") gm.print_readable() return make_boxed_func(gm.forward) # Call AOTAutograd gm_forward = aot_module_simplified( gm, sample_inputs, fw_compiler=fw, bw_compiler=bw ) return gm_forward model = torch.compile( model, backend=aot_print_backend, dynamic=False, ) out = model(torch.rand((128, 4, 7))) out.mean().backward() ``` I can see log that showing calling into create_graph_input like ```log V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none) V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none) ``` And the backward graph generate will be like ```log class GraphModule(torch.nn.Module): def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"): contiguous_1 = contiguous contiguous_2 = contiguous_1 # No stacktrace found for following nodes _set_grad_enabled = torch._C._set_grad_enabled(False) # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(), contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous(); somefunc_forward_default = None # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return 
self._opoverload(*args, **kwargs) somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]); contiguous = contiguous_1 = contiguous_2 = None # No stacktrace found for following nodes _set_grad_enabled_1 = torch._C._set_grad_enabled(True) return (None, somefunc_backward_default) ``` The original code of `somefunc_backward` takes an input list of `grad_output`, `input_`, `weight` and `shape`, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are assigned from `contiguous`, which triggers the assertion I added in `somefunc_backward`. ## Environment ```log Collecting environment information... PyTorch version: 2.5.0a0+git0b7e8df Is debug build: False CUDA used to build PyTorch: None ROCM used to build PyTorch: N/A OS: macOS 14.5 (arm64) GCC version: Could not collect Clang version: 15.0.0 (clang-1500.3.9.4) CMake version: version 3.26.4 Libc version: N/A Python version: 3.9.19 (main, May 6 2024, 14:39:30) [Clang 14.0.6 ] (64-bit runtime) Python platform: macOS-14.5-arm64-arm-64bit Is CUDA available: False CUDA runtime version: No CUDA CUDA_MODULE_LOADING set to: N/A GPU models and configuration: No CUDA Nvidia driver version: No CUDA cuDNN version: No CUDA HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Apple M3 Pro Versions of relevant libraries: [pip3] numpy==2.0.0 [pip3] optree==0.11.0 [pip3] torch==2.5.0a0+git0b7e8df [pip3] torchgraph==0.0.1 [conda] numpy 2.0.0 pypi_0 pypi [conda] optree 0.11.0 pypi_0 pypi [conda] torch 2.5.0a0+git0b7e8df dev_0 <develop> [conda] torchgraph 0.0.1 dev_0 <develop> ``` ## How to fix? I put in a naive fix that adds the potential args to lift into `used_names`. This touches private variables; I will fix that if this issue makes sense to you. 
@zou3519 @oulgen Co-authored-by: rzou <zou3519@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817 Approved by: https://github.com/zou3519	2024-07-15 13:41:46 +00:00
rzou	ee039c0614	[custom_op] triton_op API V0 (#130637 ) This is the initial version of an API to create custom operators whose implementations are backed by triton kernels. While user-defined triton kernels work out-of-the-box with triton kernels, you may wish to construct a custom operator if you need to compose with other PyTorch subsystems, like Tensor subclasses or vmap. I'm hoping to get design feedback on this and ship it so that we can begin experimenting with customers. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637 Approved by: https://github.com/albanD	2024-07-15 13:00:54 +00:00
cyy	6beec34b1c	[structural binding][9/N] Replace std::tie with structural binding (#130404 ) Follows #130544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130404 Approved by: https://github.com/janeyx99	2024-07-15 10:14:52 +00:00
Aaron Gokaslan	ac28ae18dc	[BE][Ez]: Update pybind11 submodule to v2.13.1 (#129827 ) Updates pybind11 submodule to v2.13.1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129827 Approved by: https://github.com/XuehaiPan, https://github.com/atalman, https://github.com/albanD	2024-07-15 08:58:56 +00:00
Animesh Jain	1d983bbb28	[easy][inline-inbuilt-nn-module] Update test output (#130681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130681 Approved by: https://github.com/zou3519, https://github.com/jansel ghstack dependencies: #130654, #130420	2024-07-15 06:19:53 +00:00
Animesh Jain	1a266def4f	[dynamo][unsoundness but very controlled] Skip guards on inbuilt nn module hooks (#130420 ) Reduces the guard overhead from 2.1k units to 1k units. Compared to no-inlining (0.4k units), this reduces the slowdown from 5x to 2.5x. This introduces unsoundness, but only for hooks for inbuilt nn modules (user defined nn module hooks are fine). Each builtin nn module adds 4 empty ordered dict checks in the check_fn. This blows up for models with large numbers of builtin nn modules. With this PR, we skip those guards. There is no other easy way I can think of right now to control the guard overhead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130420 Approved by: https://github.com/jansel ghstack dependencies: #130654	2024-07-15 06:19:53 +00:00
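The slowdown factors quoted in this message follow from the guard-overhead figures it gives. A quick arithmetic check using only those numbers; the `slowdown` helper and variable names are illustrative.

```python
def slowdown(guard_units: float, baseline_units: float) -> float:
    """Guard overhead relative to the no-inlining baseline."""
    return guard_units / baseline_units


BASELINE_K_UNITS = 0.4  # no inlining, as stated in the commit message

before = slowdown(2.1, BASELINE_K_UNITS)  # inlining, before this PR
after = slowdown(1.0, BASELINE_K_UNITS)   # inlining, with hook guards skipped
```

The first ratio comes out slightly above 5, which the message rounds to 5x; the second is the quoted 2.5x.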
Li-Huai (Allan) Lin	dc7725cc16	[halide-backend] Random number generation (#130211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130211 Approved by: https://github.com/jansel	2024-07-15 05:03:24 +00:00
Isuru Fernando	1bc390c5f5	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang	2024-07-15 04:16:17 +00:00
Wu, Chunyuan	a3c0bab502	[inductor] [cpp] use non-temporal tile load for A (#129455 ) Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1. Verified AMP static shapes and dynamic shapes on CPU with AMX support and no obvious performance boost (no regression either) at end-to-end level. We're expecting to get performance gain when adding https://github.com/pytorch/pytorch/pull/129348 (also in this ghstack) on top of this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129455 Approved by: https://github.com/jgong5	2024-07-15 04:07:29 +00:00
Nikita Shulga	c547b2e871	Fix python detection in cuda.cmake (#130651 ) If Python package has not been detected previously, call it here This fixes regression introduced by https://github.com/pytorch/pytorch/pull/128801 that results in annoying, but harmless warning reported in https://github.com/pytorch/pytorch/issues/129777 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130651 Approved by: https://github.com/Skylion007	2024-07-15 03:45:31 +00:00
PyTorch MergeBot	c0897919da	Revert " [5/N] Change static functions in headers to inline (#130673 )" This reverts commit 4410c44ae6fd8eb36f2358ac76f7d988ca7537c5. Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes CUDA build 12.1/12.4 to timeout in trunk, I am not sure what I am looking at yet, so attempt to revert to see if it fixes trunk. Plz keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))	2024-07-15 03:27:11 +00:00
cyy	28f6ae2718	[9/N] Replace c10::optional with std::optional (#130674 ) Follows #130509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674 Approved by: https://github.com/Skylion007	2024-07-15 00:48:43 +00:00
Haoci Zhang	774ca93fd2	Added zb1p schedule (#130210 ) Adds the ZB1P schedule in https://arxiv.org/pdf/2401.10241. The ZB2P schedule might not be zero bubble when pp_group_size > 4. Proof: ![image](https://github.com/pytorch/pytorch/assets/13212964/fac4a738-c323-47c7-bcaa-c6cdd1cf20d7) Since ZB2P generates longer schedules in some cases, and we might need a collective for a fault-tolerance all-reduce at the end of every iteration for llama 4, we are holding off on implementing a fancier ZBV schedule for now unless it would be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130210 Approved by: https://github.com/H-Huang	2024-07-14 17:32:59 +00:00
cyy	5fe9515d35	[structural binding][8/N] Replace std::tie with structural binding (#130544 ) Follows #130216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130544 Approved by: https://github.com/ezyang	2024-07-14 13:23:20 +00:00
leslie-fang-intel	81322aee74	[Inductor][CPP] Support more than one LocalBuffer (#129121 ) Summary Support more than one Local Buffer in an outer-loop fused node, as well as the case where multiple global buffers share the same local buffer. TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion ``` Next Step - [✓] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126967	2024-07-14 11:31:14 +00:00
leslie-fang-intel	adaa0fea5a	[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967 ) Summary Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)`). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)`). In the current implementation, the Inductor uses the output buffer of Kernel Group Args to store and load temporary result (such as `exp`), since this buffer is corresponding to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (take this `exp` as example) is only utilized internally within decomposed `softmax`, this buffer can be replaced with a thread-local buffer similar to ATen's approach. In this PR, we have introduced the optimizations of `LocalBuffer`. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP Backend and the ATen implementation. 
TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion ``` Next Step - [ ] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-07-14 11:28:10 +00:00
awayzjj	dcaa111dc8	support intersection by polyfill (#130672 ) Fixes https://github.com/pytorch/pytorch/issues/130557 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130672 Approved by: https://github.com/anijain2305	2024-07-14 10:44:26 +00:00
Xuehai Pan	4d7bf72d93	[BE][Easy] fix ruff rule needless-bool (SIM103) (#130206 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206 Approved by: https://github.com/malfet	2024-07-14 08:17:52 +00:00
Boyuan Feng	fa5f572748	[cudagraph] fallback to eager if re-record too many times (#129349 ) Summary: CUDAGraph Trees previously relied on the assumption that static inputs (parameters and buffers) do not change tensor addresses across function invocations. This assumption reduces the number of tensor copies and so improves performance. We also use `check_static_inputs_are_stable()` to check at runtime whether the assumption holds. While the assumption holds in most cases, we recently observed a few cases where it does not: - [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (an nn module) is used in multiple places, and different parameters and buffers with different tensor addresses are passed to it - Some user code changes tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION) - Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs. Previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes at runtime and re-recording a cudagraph when that happens. However, if the same function is re-recorded too many times, it may introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (=5) to fall back to eager if a function has been recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees. A summary of how static tensor address changes are handled now: - For each child node, check the assumption via `check_invariants`. If it holds, execute the node under the assumption. - If the assumption holds for none of the child nodes, re-record, provided the function_id has not been recorded too many times for the current_node. 
- If the function_id has been re-recorded too many times, fall back to the eager function and emit a warning. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349 Approved by: https://github.com/eellison	2024-07-14 04:17:24 +00:00
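The three-step handling described in this commit can be modelled in a few lines of plain Python. This is only a sketch of the decision logic under stated assumptions: `RerecordPolicy`, `decide`, and the string results are hypothetical names that do not appear in PyTorch, and the exact comparison against the limit is a guess based on the wording "recorded more than `cudagraph_max_recording` times".

```python
from collections import defaultdict


class RerecordPolicy:
    """Toy model of "replay, re-record, or fall back to eager" (hypothetical names)."""

    def __init__(self, max_recording=5):
        # Plays the role of cudagraph_max_recording from the commit message.
        self.max_recording = max_recording
        # (node_id, function_id) -> number of recordings made so far.
        self.num_recordings = defaultdict(int)

    def decide(self, node_id, function_id, children_ok):
        """Return "replay", "record", or "eager".

        children_ok holds one bool per child node: True when that child's
        invariants (stable static input addresses) still hold.
        """
        if any(children_ok):
            return "replay"  # some child graph is still valid, reuse it
        key = (node_id, function_id)
        if self.num_recordings[key] >= self.max_recording:
            return "eager"  # recorded too often for this node, stop recording
        self.num_recordings[key] += 1
        return "record"


policy = RerecordPolicy(max_recording=2)
print(policy.decide("root", 0, [False]))  # record
print(policy.decide("root", 0, [False]))  # record
print(policy.decide("root", 0, [False]))  # eager
```

The counter is keyed by both node and function id, which mirrors the "for a specific node" wording: the same function may still be recorded under a different node.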
cyy	4410c44ae6	[5/N] Change static functions in headers to inline (#130673 ) Follows #128286 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673 Approved by: https://github.com/ezyang	2024-07-14 03:15:28 +00:00
Shivam Raikundalia	6f275ae4d0	Add kwinputs to Kineto Traces (#130373 ) Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces. Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since test already had kwargs being passed in but not tested. Differential Revision: D59472345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373 Approved by: https://github.com/davidberard98	2024-07-14 00:40:59 +00:00
chilli	f9f85bfc0b	[Inductor] FlexAttention supports partial masking (#130415 ) (#130626 ) This is the new version of https://github.com/pytorch/pytorch/pull/130415 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers: ``` (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py fwd speedup: 0.7166695598192317 bwd speedup: 0.7142133867805904 (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask fwd speedup: 0.8428246087169973 bwd speedup: 0.8486261278030254 ``` Approved by: https://github.com/Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626 Approved by: https://github.com/drisspg, https://github.com/yanboliang	2024-07-14 00:37:26 +00:00
William Wen	cbb7e26acd	[3.13, dynamo] fix jump target offset calculation (#130458 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130458 Approved by: https://github.com/jansel ghstack dependencies: #130383, #130384, #130385	2024-07-13 23:32:06 +00:00
William Wen	0b5792c0ae	[3.13, dynamo] fix NULL ordering in symbolic_convert CALL (#130385 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130385 Approved by: https://github.com/jansel ghstack dependencies: #130383, #130384	2024-07-13 23:32:05 +00:00
William Wen	87b406d7e5	[3.13, dynamo] codegen TO_BOOL before conditional jump (#130384 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130384 Approved by: https://github.com/jansel ghstack dependencies: #130383	2024-07-13 23:32:02 +00:00
William Wen	92ac9ee83c	[3.13, dynamo] swap null and pop_null in codegen (#130383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130383 Approved by: https://github.com/jansel	2024-07-13 23:31:57 +00:00
Gagan Jain	97cfc65dbc	Back out "[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495 )" (#130676 ) Summary: Original commit changeset: 80c2ca639146 Original Phabricator Diff: D59612200 Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends' Differential Revision: D59719562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130676 Approved by: https://github.com/xunnanxu	2024-07-13 23:19:22 +00:00
Tobias Ringwald	e5de25896f	Fixed CUDA randint generation for large ranges. (#126066 ) Fixes #125224 For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224). This also affects multiple other random functions, such as `torch.rand` and `torch.randn`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-07-13 21:42:27 +00:00
PyTorch MergeBot	1f162a5fce	Revert "[Inductor][CPP] Support vectorization of remainder (#129849 )" This reverts commit 5bc18ec0a181fac0994522fefaf664f917d64b86. Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to fails the compilation of executorch benchmark internally ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2227054413))	2024-07-13 19:28:34 +00:00
Animesh Jain	8714b7fc69	[dynamo][cpp-guards] Use dict tags to skip guards on immutable dict getitems (#130654 ) Reduces the guard overhead from 3.7k units to 2.1k units. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130654 Approved by: https://github.com/jansel	2024-07-13 15:31:10 +00:00
cyy	7c83f5f7d5	[8/N] Replace c10::optional with std::optional (#130509 ) Follows #130510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130509 Approved by: https://github.com/ezyang	2024-07-13 13:05:36 +00:00
PyTorch MergeBot	0effcb70ef	Revert "[ONNX] Remove beartype usage (#130484 )" This reverts commit f44739cf42e22a569bd1bdb0c113f8a069c17a41. Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/huydhn due to Sorry for reverting your change but those failures show up in trunk after the commit landed `f44739cf42`, I am reverting it to see if it fix trunk ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2226812311))	2024-07-13 07:52:59 +00:00
Aaron Orenstein	567482973d	typing fake_tensor.py (#128041 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041 Approved by: https://github.com/eellison ghstack dependencies: #129182	2024-07-13 06:07:40 +00:00
drisspg	1ad0f38a37	Fix IMAs in FlexAttention + autotuning (#130352 ) # Summary Makes error message better for non divisible sequence lengths. Updates: this PR was blocked due to two IMAs. - The first is that when the kv indices end up being an 'arange', i.e. there are non sparse blocks, we end up loading off of kv_indices + 1. - The second I don't really have a clear answer for. We were hitting an IMA here: `9f401187c7/torch/_inductor/kernel/flex_attention.py (L846)` I noticed that for our inputs 2048 and q_blocksize = 128 we were again exactly at 16. Something felt fishy. I suspect we launch one extra sparse_q block, but why only during autotuning... ### Repro: https://gist.github.com/drisspg/f312a66426f3440b7756c6c0cc037f4c ### After this change: ``` ========= COMPUTE-SANITIZER AUTOTUNE flex_attention(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x16, 1x1x16x16) triton_flex_attention_0 2.1118 ms 100.0% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_3 2.4306 ms 86.9% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_1 2.5729 ms 82.1% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_4 2.8035 ms 75.3% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_2 2.8837 ms 73.2% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, 
ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 SingleProcess AUTOTUNE benchmarking takes 0.7225 seconds and 1.5218 seconds precompiling AUTOTUNE flex_attention_backward(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x2048, 1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x16, 1x1x16x16, 1x1x16, 1x1x16x16) triton_flex_attention_backward_30 2.7763 ms 100.0% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 triton_flex_attention_backward_15 3.1404 ms 88.4% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_backward_14 3.2604 ms 85.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 triton_flex_attention_backward_7 3.4176 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4 triton_flex_attention_backward_8 3.4182 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=4, num_warps=4 triton_flex_attention_backward_34 3.4939 ms 79.5% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8 triton_flex_attention_backward_6 3.6517 ms 76.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 
triton_flex_attention_backward_26 3.7000 ms 75.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8 triton_flex_attention_backward_22 4.0120 ms 69.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4 triton_flex_attention_backward_18 4.5052 ms 61.6% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8 SingleProcess AUTOTUNE benchmarking takes 6.6558 seconds and 6.3567 seconds precompiling torch.Size([1, 1, 2048, 64]) Test completed successfully! ========= ERROR SUMMARY: 0 errors ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130352 Approved by: https://github.com/Skylion007, https://github.com/Chillee	2024-07-13 05:27:39 +00:00
Will Feng	c03e667276	[Inductor][PatternMatcher] Always prevent match across mutations (#130584 ) Preventing match across mutations should always be the safe thing to do. This will be especially important for Traceable FSDP2 because in that case we do have mutation ops (`.set_` and `.resize_(0)`) in the middle of the graph for both joint-graph and post-grad graph, so making sure the pattern matcher passes work well with middle-of-graph mutation ops is important. Q: Why can't we move these mutation ops to the end of graph, to make pass writing easier? A: We attempted to do that in https://github.com/pytorch/pytorch/pull/129852, but the custom FX passes (in `torch/_functorch/_aot_autograd/fx_passes.py`) for the re-functionalization are complicated to maintain, and the changes to the partitioner (in `torch/_functorch/partitioners.py`) also feel hacky. Hence we want to preserve these mutation ops in the middle of graph to avoid the complexity. Test commands: - `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_uint4x2_mixed_mm` - `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_serialized_patterns_up_to_date` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130584 Approved by: https://github.com/jansel	2024-07-13 03:39:21 +00:00
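One way to picture "prevent match across mutations" is to split the graph into regions at each mutation op and reject any candidate match whose nodes fall into more than one region. The sketch below is a simplified model on a list of op names, not the Inductor pattern matcher itself; `mutation_region_ids` and `match_allowed` are hypothetical helper names introduced only for illustration.

```python
def mutation_region_ids(ops, mutating_ops):
    """Give each op (in graph order) a region id that increases at every mutation op."""
    region = 0
    ids = []
    for op in ops:
        if op in mutating_ops:
            region += 1  # a mutation op starts a new region
        ids.append(region)
    return ids


def match_allowed(matched_indices, region_ids):
    """Allow a candidate match only if all of its nodes share one region."""
    return len({region_ids[i] for i in matched_indices}) == 1


ops = ["mm", "relu", "set_", "add", "resize_", "mul"]
regions = mutation_region_ids(ops, mutating_ops={"set_", "resize_"})
print(regions)                         # [0, 0, 1, 1, 2, 2]
print(match_allowed([0, 1], regions))  # True: mm and relu sit in the same region
print(match_allowed([1, 3], regions))  # False: set_ lies between relu and add
```

With mutation ops such as `.set_` and `.resize_(0)` kept in the middle of the graph, a check of this kind keeps a rewrite from fusing a read before the mutation with a read after it.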
joydddd	3710a79622	Flex Attention HOP: Add support for flex decoding (#129415 ) # Flex Decoding tl;dr This PR adds the `flex_decoding` kernel to the higher-order op `flex_attention` as the backend for multi-head attention decoding. The higher-order op `flex_attention` was introduced in (https://github.com/pytorch/pytorch/pull/121845) to accept a user-defined score modification callable (`score_mod`) and, through `torch.compile`, to create an efficient fused flash attention kernel instantiation. The `flex_attention` kernel is efficient for attention with long queries (>512 tokens). This PR introduces the `flex_decoding` kernel as an alternative backend for the `flex_attention` HOP to handle LLM inference, where short queries (<32 tokens) attend to long key/value sequences. ### Details LLM decoding iteratively attends each newly generated token (query length = 1) to a long key/value context (up to 132k). The `flex_attention` kernel only parallelizes attention along the query length (M), batch size (B) and number of heads (H) dimensions. LLM decoding lacks enough parallelism in the M dimension to fill all SMs on modern GPUs. `flex_decoding` adds parallelization along the key/value sequence length (N). The key/value cache of a single head is split into multiple blocks and the query tokens attend to them in parallel. The results for the same head are then reduced across KV blocks to generate a global output. ## Examples Consider a Group Query Attention (GQA) decoding case, where a query token with 16 query heads (Hq) attends to 2 kv heads (Hkv). Assume a batch size of 2 (B=2) and a kv cache length of 4096 (N=4096). The attention kernel iteratively attends to the newly generated query token (Mq = 1). We transform this problem into a Multiheaded Attention (MHA) problem by assuming a query length equal to the number of query heads per kv head, i.e. M=Hq//Hkv. 
The inputs to the `flex_attention` HOP are thus a query of shape (B=2, H=Hkv=2, M=Hq//Hkv=8, D=64) and key, value of shape (B=2, H=Hkv=2, N=4096, D=64), which lead to an intermediate attention score matrix of shape (2, 2, 8, 4096) and an output of shape (2, 2, 8, 64).
```Python
import torch
from torch.nn.attention._flex_attention import _flex_attention as flex_attention

torch.manual_seed(0)

# Let's create some input tensors
# query of shape (B, Hkv, Hq//Hkv, D)
# key/value of shape (B, Hkv, N, D)
query = torch.randn(2, 2, 8, 64, device="cuda", dtype=torch.float32)
key = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
value = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)

# Let's create a new score_modification: checkerboard.
def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Let's call flex_attention with this new score modification for decoding.
# The flex_attention HOP will choose flex_decoding as its backend since our query length (M) is only 8.
output = flex_attention(query, key, value, score_mod=checkerboard)

compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```
## Future Plans - This PR does not implement load mask for the score_mod function. This means that if the score_mod function takes a captured buffer along the M dimension, it must be padded to a q length of 16, or to the next 2^n of the query length if q_len > 16, i.e. 
```python
import torch

Hq, Hkv = 16, 2  # head counts from the GQA example above
q_scale = torch.randn(Hq // Hkv, device="cuda")
q_scale = torch.nn.functional.pad(q_scale, (0, 16 - Hq // Hkv))  # Pad captured buffer to length 16

def bias_mod(score, batch, head, q, kv):
    # Index the captured buffer with the query position argument `q`
    score = score + q_scale[q]
    return score
```
- Backward path for short queries (<128 tokens) currently does not work because the `flex_attention_backward` kernel lacks mask support and only takes query lengths that are a multiple of 128.
- Dynamic shapes and max_autotuning are currently not working
- Add block sparse mask support (#129216 is a draft for flex_attention kernel)
- Add explicit GQA support. (#130076 is a draft for GQA support on flex_attention kernel)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129415 Approved by: https://github.com/Chillee	2024-07-13 00:41:48 +00:00
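The "split the KV cache into blocks, attend in parallel, then reduce" step described in this commit relies on a standard log-sum-exp identity. The sketch below checks that identity in plain Python for a single query vector; it is not the Triton kernel, it omits `score_mod` and the softmax scale, and `attend`, `combine`, and `split_kv_attend` are hypothetical names chosen for illustration.

```python
import math


def attend(q, keys, values):
    """Attention of one query over one KV block.

    Returns (out, lse): the softmax-weighted value vector and the
    log-sum-exp of the raw scores for this block.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def combine(partials):
    """Reduce per-block (out, lse) pairs into the output over the full KV sequence."""
    global_max = max(lse for _, lse in partials)
    scales = [math.exp(lse - global_max) for _, lse in partials]
    denom = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / denom
            for d in range(dim)]


def split_kv_attend(q, keys, values, num_splits):
    """Attend to each KV block independently, then merge the partial results."""
    block = math.ceil(len(keys) / num_splits)
    partials = [attend(q, keys[i:i + block], values[i:i + block])
                for i in range(0, len(keys), block)]
    return combine(partials)
```

Each block's output is rescaled by its share of the global softmax denominator, so the merged result equals attention over the unsplit sequence up to floating-point error. In a real kernel the per-block calls would run on different thread blocks, which is where the extra parallelism along N comes from.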
Justin Chu	f44739cf42	[ONNX] Remove beartype usage (#130484 ) beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following: 1. When beartype improved its Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype where it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when using torch.onnx. 2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback. 3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484 Approved by: https://github.com/titaiwangms	2024-07-13 00:08:25 +00:00
Colin Peppler	a7f54c7f8a	[dynamo] add meta fn for aten.kthvalue.default (#130562 ) I saw ``` torch._dynamo.exc.Unsupported: unsupported operator: aten.kthvalue.default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130562 Approved by: https://github.com/jingsh, https://github.com/zou3519	2024-07-12 23:48:31 +00:00
Aaron Orenstein	634b62f111	typing proxy_tensor.py (#129182 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129182 Approved by: https://github.com/Chillee	2024-07-12 23:17:09 +00:00
PyTorch MergeBot	ea78b0c177	Revert "Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store` (#130341 )" This reverts commit a17d1e5322229a31f868d98987996a04736933a6. Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/izaitsevfb due to internal needs pybind update ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2226499397))	2024-07-12 23:07:37 +00:00
inkcherry	f422027fce	fix torch.linalg.lstsq input check (#130612 ) Fixes [#117236 ](https://github.com/pytorch/pytorch/issues/117236) The current case does not meet the vector scenario requirements, and it lacks sufficient checks (relying solely on ```dim_diff``` is insufficient). Consequently, it triggers an internal assertion error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130612 Approved by: https://github.com/lezcano	2024-07-12 23:06:52 +00:00
Yifu Wang	06ebf87a1e	Fix and improve reorder_compute_for_overlap (#130573 ) Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes. - The unit tests are now fixed and re-enabled. - Verified that the pass produces good scheduling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573 Approved by: https://github.com/Chillee ghstack dependencies: #129980	2024-07-12 22:25:49 +00:00
Mikayla Gawarecki	619029e892	[easy] Small rendering fix in Tensor.module_load doc (#130489 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130489 Approved by: https://github.com/janeyx99	2024-07-12 22:12:53 +00:00
rzou	95046c86e3	[HOP] add HOP x torch_dispatch interaction (#130606 ) This involved beefing up the Python dispatcher to handle torch_dispatch. Given a HOP and a torch_dispatch Tensor subclass: - the HOP will show up in the subclass's `__torch_dispatch__` - you can also use HOP.py_impl to register a rule for the HOP x subclass interaction - (coming soon) we'll offer a way to open register HOP x subclass interaction without needing to touch the subclass's `__torch_dispatch__` or the HOP's .py_impl. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606 Approved by: https://github.com/ydwu4	2024-07-12 21:51:36 +00:00
rzou	f093cd4086	Fix custom ops warning during export (#130623 ) Fixes https://github.com/pytorch/pytorch/issues/130588 The problem was we were warning on all custom ops, not just ones marked as CompositeImplicitAutograd. This PR changes the warning to just warn on CompositeImplicitAutograd ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130623 Approved by: https://github.com/williamwen42	2024-07-12 21:34:29 +00:00
Mikayla Gawarecki	7c289c2a5c	Add torch.serialization.safe_globals context manager (#127939 ) Add context manager mentioned in https://github.com/pytorch/pytorch/pull/127808#pullrequestreview-2096298486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127939 Approved by: https://github.com/albanD	2024-07-12 20:38:43 +00:00
PyTorch MergeBot	f0d7164cb9	Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127 )" This reverts commit 2abc7cc21b8a215f000ac037c316ca178e9ade81. Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to breaks meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2226313943))	2024-07-12 20:36:00 +00:00
albanD	103b6ccab2	Increase tolerance for tensorsolve tests (#130620 ) Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D Since it appeared with https://github.com/pytorch/pytorch/pull/128238 that only updates random seed for the test, I expect this is just bad luck of the draw. Thus increasing tolerance like we do for other tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620 Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet	2024-07-12 20:08:18 +00:00
Scott Wolchok	af4da0799c	[PyTorch] Half: don't disable direct conversion to/from float on mobile (#130465 ) As far as I can tell, `FCVT` (https://developer.arm.com/documentation/ddi0602/2024-06/SIMD-FP-Instructions/FCVT--Floating-point-convert-precision--scalar--?lang=en) is part of the base aarch64 instruction set, so it should work fine on mobile. Differential Revision: [D59589733](https://our.internmc.facebook.com/intern/diff/D59589733/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130465 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-07-12 19:46:30 +00:00
dshi7	d727e2f2d1	add total wall time in calculate_time_spent (#130611 ) Fixes #ISSUE_NUMBER Actual wall time is fwd_entire_frame_time + bwd_inductor_compile. `calculate_time_spent` is accessed internally for monitoring use https://fburl.com/code/iiurj5m6. However, summing the values up loses the fwd/bwd information. This PR adds a new key, `total_wall_time`, without affecting dynamo counters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130611 Approved by: https://github.com/oulgen, https://github.com/Yuzhen11	2024-07-12 19:32:44 +00:00
eqy	60fc01d0ab	[CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401 ) Repro from @eellison Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it... Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-07-12 18:57:07 +00:00
Bertrand Thia	43b98fa521	Add debug repr to SymNode (#129925 ) Fixes #129403 Create a separate printing function to debug SymNode, since we can't easily change `__repr__` that is used by GraphModule.recompile() to create a pythonic version of a graph This is my first contribution, please let me know if there is anything that I should look into in further detail Thank you for your guidance! 🙏 I hope to contribute more in the future! @aorenste Pull Request resolved: https://github.com/pytorch/pytorch/pull/129925 Approved by: https://github.com/aorenste	2024-07-12 18:31:23 +00:00
Jack Taylor	2c4303c1d1	[ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters (#130617 ) Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic. The original code was: ``` if torch.version.hip is not None: ``` Which was incorrectly replaced by: ``` if self.device_props.type != "hip": ``` Perhaps we need to write some unit tests here in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130617 Approved by: https://github.com/masnesral	2024-07-12 18:29:59 +00:00
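The bug described in this commit is an inverted condition introduced while porting a check. The sketch below reproduces it with a stand-in object; `DeviceProps` here is a hypothetical namedtuple rather than the real Inductor type, and the `== "hip"` form of the fix is an assumption inferred from the original `torch.version.hip is not None` check, since the commit message does not quote the corrected line.

```python
from collections import namedtuple

# Hypothetical stand-in for the device properties object mentioned in the commit.
DeviceProps = namedtuple("DeviceProps", ["type"])


def uses_rocm_tuning_buggy(props):
    # Ported condition quoted in the commit: true on every device EXCEPT hip.
    return props.type != "hip"


def uses_rocm_tuning_fixed(props):
    # Assumed fix: restore the original intent, true only on ROCm/HIP devices.
    return props.type == "hip"


for device_type in ("hip", "cuda"):
    props = DeviceProps(type=device_type)
    print(device_type, uses_rocm_tuning_buggy(props), uses_rocm_tuning_fixed(props))
```

A unit test of the kind the commit message asks for could assert exactly these two cases, which would have caught the inversion at review time.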
Yidi Wu	741c1710e8	[cond] inlining into one of the branches when pred is a python constant (#130493 ) Reland https://github.com/pytorch/pytorch/pull/128709. When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was to bake True/False into the cond operator, which can be confusing. In this PR, we instead specialize into one of the branches when the inputs are constants. We additionally change the naming of the cond operator to the default one, without overriding its name. This allows better testing on de-serialized graphs. Test Plan: The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized into one of the branches. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493 Approved by: https://github.com/BoyuanFeng	2024-07-12 18:02:09 +00:00
Yidi Wu	0bf9a091ec	[torchbind] add tracing_mode support (#129586 ) Sometimes, it could be difficult to write a fake class e.g. when the original implementation is using some third-party libraries or users are certain that the class is safe to trace with the real object. This PR allows user to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class. Test Plan: `pytest test/export/test_torchbind.py -k safe` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586 Approved by: https://github.com/zou3519	2024-07-12 18:01:47 +00:00
William Wen	c3e77d144e	[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185 ) Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython. This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame. We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12. This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185 Approved by: https://github.com/jansel	2024-07-12 17:56:38 +00:00
Tom Ritchford	b0a597fcb4	Fix #121334 : graph break on constant method call (#130158 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130158 Approved by: https://github.com/lezcano	2024-07-12 17:34:46 +00:00
Chirag Pandya	4865c6425c	Add new control plane handler (#129712 ) Summary: Add a new control plane handler to retrieve flight recorder data as JSON. Test Plan: Unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129712 Approved by: https://github.com/wconstab	2024-07-12 17:32:01 +00:00
Nikita Shulga	55dc82bef9	[EZ] Make test_pytree_inputs actually run tests on CUDA (#130593 ) Right now it's only running it on CPU even when `self.device` is set to CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/130593 Approved by: https://github.com/angelayi	2024-07-12 17:17:28 +00:00
Pian Pawakapan	988ed4d5db	[export] clean up allow_complex_guards_as_runtime_asserts flag (#130596 ) Summary: removes underscore, cleans up dead code in DimConstraints Test Plan: existing export tests Reviewed By: angelayi Differential Revision: D59612746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596 Approved by: https://github.com/angelayi	2024-07-12 17:17:11 +00:00
Chien-Chin Huang	dafef3ff35	[CP] Make CP loss curve on par with TP (#129515 ) Summary: This PR changes two implementations to make the CP (CP8) loss curve on par with TP (TP8). 1. Making key and value contiguous before doing ring attention. It is unclear why this is a requirement as SDPA does not have this requirement. 2. Use the out, grad_out, softmax_lse passed by autograd to do the backward. This implementation is similar to the implementation in transformer engine. The original implementation reruns the SDPA to get the output and logsumexp and uses those recalculated results to infer the corrected softmax_lse. But that implementation does not give better accuracy or a better loss curve. Instead, that implementation converges slower. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129515 Approved by: https://github.com/d4l3k, https://github.com/wanchaol ghstack dependencies: #129512, #129514	2024-07-12 16:55:28 +00:00
Nikita Shulga	c35f12c67c	[EZ] Add formatting changes to .git-blame-ignore-revs (#130627 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130627 Approved by: https://github.com/izaitsevfb, https://github.com/clee2000	2024-07-12 16:37:46 +00:00
Aidyn-A	22fd89c904	[TEST][Inductor] Fix scaled_mm call (#130582 ) `_scaled_mm` no longer returns `amax` (see #128683) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130582 Approved by: https://github.com/drisspg	2024-07-12 16:25:18 +00:00
Edward Z. Yang	34e57025e1	Add unsigned int types to torch/types.h (#130616 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130616 Approved by: https://github.com/NicolasHug, https://github.com/albanD	2024-07-12 16:24:29 +00:00
PyTorch MergeBot	2b1df24877	Revert "Make hashing a SymInt raise an error again (#130548 )" This reverts commit 3100455b8eeebdfbc3428ff9454579ac50666faf. Reverted https://github.com/pytorch/pytorch/pull/130548 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels.py https://github.com/pytorch/pytorch/actions/runs/9908970127/job/27377960411 `3100455b8e`. Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130548#issuecomment-2225912018))	2024-07-12 16:20:12 +00:00
leslie-fang-intel	2a1f22e57f	Change BN to eval before QAT Convert phase (#130598 ) Summary In the QAT convert phase, we fold bn into conv and do DCE to this BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default` for a safe DCE. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598 Approved by: https://github.com/jgong5, https://github.com/yushangdi	2024-07-12 16:03:56 +00:00
titaiwangms	18418a7dbb	[ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586 ) The ONNX related compilers have another route of accuracy check, and this PR brings torch_onnx compiler to the right measurement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586 Approved by: https://github.com/justinchuby	2024-07-12 15:47:59 +00:00
Mayank Mishra	e5657024b5	Fix loss_parallel with BF16 logits (#130550 ) Fixes #130549 This PR uses the specific dtype for the `grad_input` buffer and fixes the error Pull Request resolved: https://github.com/pytorch/pytorch/pull/130550 Approved by: https://github.com/tianyu-l	2024-07-12 15:47:38 +00:00
Shangdi Yu	ea4b80e6d6	[FX][export] strict DCE pass, check schema for node impurity (#130552 ) Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py ` caused by dead code elimination removing node with side effects. For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully. A call_function node is impure if it has at least one mutable argument. Fixed the tests below: test_to_module_with_mutated_buffer_multiple_update_sub_later test_export_input_mutation_static_shape test_buffer_util Another attempt modifying the original DCE pass is made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552 Approved by: https://github.com/pianpwk	2024-07-12 15:43:27 +00:00
Nikita Shulga	febadda107	[MPS] Fix `torch.[all\|any]` for 5+D tensors (#130542 ) Work around a bug in `reductionAndWithTensor:` that kills the app with the following assert when given a 5+D tensor as input ``` Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76. ``` by reshaping the tensor to a 2D/3D one before running the reduction. Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue. Enabled `test_reduction_ops_5D` and added a regression test to it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #130541	2024-07-12 15:06:22 +00:00
Bert Maher	d443fbc025	[inductor] Cache precompilation functions based on configs (#130350 ) Summary: If we attempt to precompile sets of different choices (e.g. Triton vs Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs. Add the config's hashes to the key to avoid this problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350 Approved by: https://github.com/eellison	2024-07-12 14:21:49 +00:00
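The key collision described in #130350 can be sketched in plain Python. This is only an illustration under stated assumptions: `config_hash`, `make_cache`, and the `"mm"` key are hypothetical names, not Inductor's actual code, and the cached value stands in for the pool of precompilation futures.

```python
import hashlib


def config_hash(configs):
    """Stable digest of a set of choice configs (hypothetical helper)."""
    joined = "|".join(sorted(configs))
    return hashlib.sha256(joined.encode()).hexdigest()[:12]


def make_cache(include_configs):
    """Build a lookup function over a private cache (hypothetical)."""
    pool = {}

    def lookup(name, configs):
        # Old behavior: key on the name alone. New behavior: add the config digest.
        key = (name, config_hash(configs)) if include_configs else (name,)
        # setdefault keeps whatever was stored first under this key.
        return pool.setdefault(key, tuple(configs))

    return lookup


triton = ["triton_cfg_0", "triton_cfg_1"]
cutlass = ["cutlass_cfg_0"]

old_lookup = make_cache(include_configs=False)
old_lookup("mm", triton)
stale = old_lookup("mm", cutlass)  # same key, so the Triton entry comes back

new_lookup = make_cache(include_configs=True)
new_lookup("mm", triton)
fresh = new_lookup("mm", cutlass)  # distinct key, so the Cutlass entry is stored
```

With the name-only key, the second request silently reuses the first set of configs; adding the digest to the key gives each set its own entry.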
rzou	9c69684af8	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-12 14:13:01 +00:00
rzou	ba941769b5	Add API for open registration between operators and subclasses (and modes) (#130064 ) We add torch.library.Library._register_torch_dispatch_rule. Here, a user can provide us a specific rule to run for a specific (torch_dispatch_class, operator) pair. The motivation is that a user might want to extend a subclass/mode but may not have access to the source code of the subclass/mode. I'll make this public in a follow-up PR if we think the approach and API is good. Keep in mind that many subclasses will likely deliver their own open registration solution (DTensor has register_sharding_prop_rule and NJT has register_jagged_op); _register_torch_dispatch_rule is meant as a catch-all open registration mechanism for when the subclass hasn't provided anything more specific. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064 Approved by: https://github.com/albanD	2024-07-12 14:13:01 +00:00
Edward Z. Yang	ae3ac9cb64	Only test _is_param if doing instance check on Parameter base (#130578 ) Fixes https://github.com/pytorch/pytorch/issues/111348 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130578 Approved by: https://github.com/Skylion007	2024-07-12 13:55:13 +00:00
Edward Z. Yang	6f54e961ea	Add trace_shape_events artifact tracing for ShapeEnv events (#130473 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130473 Approved by: https://github.com/lezcano	2024-07-12 13:50:25 +00:00
Edward Z. Yang	3100455b8e	Make hashing a SymInt raise an error again (#130548 ) See https://github.com/pytorch/pytorch/issues/130547 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-07-12 13:49:56 +00:00
Will Constable	b75cc70875	[Pipelining] add looped schedules to fsdp/ddp test (#130563 ) It feels like an oversight that these were not tested, especially since the test case already handles multi schedules specially but no multi-schedules were being tested Pull Request resolved: https://github.com/pytorch/pytorch/pull/130563 Approved by: https://github.com/H-Huang	2024-07-12 13:39:47 +00:00
PyTorch MergeBot	da030e7add	Revert "[Inductor] FlexAttention supports partial masking (#130415 )" This reverts commit 207564bab1c4fe42750931765734ee604032fb69. Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))	2024-07-12 13:20:18 +00:00
Yanbo Liang	207564bab1	[Inductor] FlexAttention supports partial masking (#130415 ) This is the new version of #130235 Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc Updated perf numbers: ``` (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py fwd speedup: 0.7166695598192317 bwd speedup: 0.7142133867805904 (pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask fwd speedup: 0.8428246087169973 bwd speedup: 0.8486261278030254 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415 Approved by: https://github.com/Chillee	2024-07-12 07:19:28 +00:00
Chien-Chin Huang	e568c91a7b	[CP] Fix the incorrect ring schedule in the fwd and bwd (#129514 ) Summary: 1. The argument order for all_to_all_single is "block, output_split_size, input_split_sizes, pg". 2. Uses the correct ring order for the grad_kv. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129514 Approved by: https://github.com/d4l3k, https://github.com/drisspg, https://github.com/wanchaol ghstack dependencies: #129512	2024-07-12 07:05:36 +00:00
Chien-Chin Huang	0d8dedb01b	[dtensor] Add dtensor to TORCH_LOGS (#129512 ) Summary: Add the basic log for dispatcher of dtensor Pull Request resolved: https://github.com/pytorch/pytorch/pull/129512 Approved by: https://github.com/wanchaol, https://github.com/XilunWu	2024-07-12 06:50:53 +00:00
Harshavardhan Reddy Bommireddy	b6215f44ef	DCP checkpoint_dist_client integration (#130452 ) Summary: Integrate scope tracking with `checkpoint/fb/logging_handlers.py`. Add a map of uuid -> tracker context manager. when logging handler has following events: * `start`: create scope_tracker object, call `__enter__`, add to map with uuid * `end`: retrieve scope_tracker object by uuid, call `__exit__`. * `exception`: retrieve scope_tracker object by uuid, call `__exit__` with current exception info. Test Plan: Test with bento notebook (attached). with a runtime_error in finish_checkpoint method. scuba records: https://fburl.com/scuba/workflow_signpost/ddttgmv2 Differential Revision: D56654417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130452 Approved by: https://github.com/LucasLLC	2024-07-12 06:01:56 +00:00
Tarun Karuturi	ff25dfca5a	Save quantization_tag in export graph serialization (#127473 ) Summary: `quantization_tag` is a first class citizen metadata in quantization flows that is preserved by it. As we'll want to store the quantized exported graphs we also need to preserve this metadata as it's used in later flows. Only json supported metadata will be allowed to be serialized. Test Plan: Added test case Differential Revision: D57939282 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127473 Approved by: https://github.com/angelayi	2024-07-12 05:06:40 +00:00
eellison	b7d287fbec	Constant folding for dynamic shape node (#129686 ) Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding. Taken over from https://github.com/pytorch/pytorch/pull/128937 joint work with @imzhuhl Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686 Approved by: https://github.com/Chillee ghstack dependencies: #130367	2024-07-12 03:44:29 +00:00
Sijia Chen	ae0edadea0	[SDPA] Replace `masked_fill_` with `aten::where` (#130281 ) Summary: full context in D59385876 Based on the offline discussion with PT2 folks, we switched to change the SDPA impl to mitigate the AOTI lowering issue Test Plan: PYTORCH_TEST_FBCODE=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true caffe2/test/inductor:test_inductor -- -r test_sdpa_inference_mode_aot_compile Differential Revision: D59495634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130281 Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Skylion007, https://github.com/justinchuby	2024-07-12 03:04:31 +00:00
yan-yhy	c16e90fe06	The device_suffix in a test_name is "privateuse1" sometimes. (#130091 ) When running some test cases on the privateuse1 device, the device_suffix in a test_name is sometimes 'privateuse1'. For example, a test_name is normally 'test_Dropout1d_npu', but it is sometimes 'test_Dropout1d_privateuse1'. When setUpClass() doesn't set it, the device_suffix is "privateuse1". Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091 Approved by: https://github.com/zou3519	2024-07-12 02:51:40 +00:00
Yifu Wang	9ae40c6bc0	Fix and improve raise_comms and sink_waits (#129980 ) The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes. Correctness issues: - The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput). Effectiveness issues: - The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all. - The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node. This PR: - Make the passes handle mutation correctly. - Instead of moving individual comm/wait nodes, schedule all nodes using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues. - The horizontal fusion of prologues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also in most cases, this clone is eliminated via in-place reuse. Therefore, we tell the scheduler to not fuse it. - Enable the tests in CI. Co-authored-by: Will Feng <yf225@cornell.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980 Approved by: https://github.com/yf225	2024-07-12 01:55:47 +00:00
Will Feng	c6a676add4	[Traceable FSDP2][Inductor] Add GroupedSchedulerNode to contain nodes that must be scheduled together (#128568 ) As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them). This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements: - At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise all of AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage GroupedScheduleNode which provides simple connection points to build the "chaining" on. i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake dep between the GroupedScheduleNode, which is a very clean and easy-to-understand way to express this schedule. - The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`. From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them). ---- Q: Can we leverage `ir.Subgraph`? A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that: 1. `ir.Subgraph` requires defining the subgraph in FX IR. 2. 
There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node. 3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR). ---- Test commands: - `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568 Approved by: https://github.com/eellison, https://github.com/mlazos	2024-07-12 01:42:38 +00:00
Michael Lazos	c101c4517a	Add python type for list iterators (#130511 ) Fixes https://github.com/pytorch/pytorch/issues/117026 Also not sure why this was missing Pull Request resolved: https://github.com/pytorch/pytorch/pull/130511 Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/anijain2305	2024-07-12 01:14:18 +00:00
PyTorch MergeBot	536b5b19b5	Revert "Simplify c10::string_view (#130009 )" This reverts commit 10c7f037fe3271cb3865816c216007ba403f5347. Reverted https://github.com/pytorch/pytorch/pull/130009 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/130009#issuecomment-2224223526))	2024-07-12 00:46:49 +00:00
Feny Patel	7f2436014e	add MTIA as valid device type for prof averages (#130340 ) Summary: Add MTIA as valid device option for getting profile averages Test Plan: Tested with auto-trace on MTIA Differential Revision: D59486392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130340 Approved by: https://github.com/aaronenyeshi	2024-07-12 00:39:01 +00:00
PyTorch MergeBot	7ce5b5767c	Revert "Make c10::string_view an alias of std::string_view (#130417 )" This reverts commit c9551a3f50efc8163d8508a3c2189536528577ac. Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/izaitsevfb due to depends on #130009 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2224212227))	2024-07-12 00:37:04 +00:00
Shivam Raikundalia	b5b91b418d	[Easy] Update record_function Comment (#130561 ) Summary: Users have been confused why user annotations on GPU tracks do not show when doing GPU only tracing. This comment should help users understand that to use this function they need to have CPU activities enabled. Test Plan: N/A it is just updating a comment Differential Revision: D59649390 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130561 Approved by: https://github.com/aaronenyeshi	2024-07-11 23:51:25 +00:00
Pian Pawakapan	18b7633bfb	[export] fix kwargs in run_decompositions() for training IR (#130553 ) Re-exporting GraphModule expects all inputs to be in args, though not in pytree-flattened format. This avoids failing when we run with a fx.Interpreter subclass in [AOTAutograd tracing](`973037be6a/torch/_functorch/_aot_autograd/traced_function_transforms.py (L760-L762)`). Removes 7 test failures for training IR export. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130553 Approved by: https://github.com/zhxchen17, https://github.com/ydwu4	2024-07-11 22:53:18 +00:00
Yidi Wu	26c2b92525	[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680 ) Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in remove_effect_token pass. The reason why we want to remove_effect_token is because we don't want the token to be part of input. However, this causes DCE calls in remove_effect_token itself and the dce calls in unlift to remove the custom op in the graph causing an error in the exported graph. This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident. Test Plan: Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op Differential Revision: [D59498728](https://our.internmc.facebook.com/intern/diff/D59498728) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680 Approved by: https://github.com/angelayi	2024-07-11 22:46:21 +00:00
Edward Z. Yang	9c6c0deadc	Add eager_compile_backwards_failure to tlparse (#130434 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130434 Approved by: https://github.com/albanD	2024-07-11 22:35:33 +00:00
PyTorch MergeBot	d97d962082	Revert "Add decompositions for copy variants of view ops (#128416 )" This reverts commit 68751799b85aa7f659420801bdbb8451f01ab09a. Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))	2024-07-11 22:09:23 +00:00
PyTorch MergeBot	a2f630a9a4	Revert "Decompose expand_copy and permute_copy (#129476 )" This reverts commit 7d4cb2109823f1c4001dff62b461bb9eda07ca17. Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))	2024-07-11 22:06:15 +00:00
eellison	fc872e98f3	Infer prim tags from equivalent aten ones (#130367 ) Take intersection of all the tags for corresponding aten op overloads. Previously, some of the rng ops not having tags caused issues with constant folding (they should get decomposed but that's a separate issue). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367 Approved by: https://github.com/ezyang	2024-07-11 20:53:52 +00:00
Zhengxu Chen	726a287271	[export] Expand verifier to be multiple on ExportedProgram (#130364 ) Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field. Test Plan: CI Differential Revision: D59408546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364 Approved by: https://github.com/angelayi, https://github.com/ydwu4	2024-07-11 20:34:49 +00:00
mengph	5c6edd29ec	Turn on splitShare=1 to make the optimization of comm_split effective. (#129929 ) Fixes #129865 Currently, new_group will call ncclCommSplit in some cases. In theory, ncclCommSplit will bring performance and memory benefits. However, the config parameter of the ncclCommSplit function in pytorch does not set "splitShare=1", which results in the optimization of ncclCommSplit being turned off and the benefits being lost. This PR turns on splitShare=1 to make the optimization of comm_split effective. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129929 Approved by: https://github.com/shuqiangzhang	2024-07-11 20:14:58 +00:00
Nikita Shulga	c50b189280	Move trunk windows builds to CUDA-12.1 (#130446 ) That should catch build regressions that were previously only detectable during the nightly builds Win + CUDA-11.8 builds and tests are still run as part of periodic workflow Pull Request resolved: https://github.com/pytorch/pytorch/pull/130446 Approved by: https://github.com/atalman	2024-07-11 19:50:57 +00:00
Tijmen Blankevoort	bc18863713	Corner-case fix for upscale_histogram in the new HistogramObserver (#130316 ) Summary: Small fix to the bucketize function that caused a run-time error in some corner cases. Test Plan: Unit tests Differential Revision: D59508432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130316 Approved by: https://github.com/jerryzh168	2024-07-11 19:49:21 +00:00
Yidi Wu	cd9bae30de	Allow kwargs in _remove_effect_tokens_pass (#130491 ) Summary: Previously, remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fixes it and adds a test for it. Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs Reviewed By: angelayi Differential Revision: D59603147 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491 Approved by: https://github.com/angelayi	2024-07-11 19:03:19 +00:00
PyTorch MergeBot	578388bed8	Revert "Support for expandable segments with cuda graph trees (#128068 )" This reverts commit fdc83610f272610ce50d1a6f5b6354f2df1baabb. Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))	2024-07-11 18:58:13 +00:00
Yidi Wu	1cae60a87e	Caching attr_proxy for nn_module attribute to fix guard check failure (#130280 ) Fixes https://github.com/pytorch/pytorch/issues/129939 Differential Revision: [D59594605](https://our.internmc.facebook.com/intern/diff/D59594605) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130280 Approved by: https://github.com/anijain2305	2024-07-11 18:21:35 +00:00
Chien-Chin Huang	0a4fe2ff86	[DSD] Use no_grad() to make some operations faster and avoid possible memory leakage (#130355 ) Use no_grad() to make some operations faster and avoid possible memory leakage Pull Request resolved: https://github.com/pytorch/pytorch/pull/130355 Approved by: https://github.com/wz337	2024-07-11 18:18:08 +00:00
Xuehai Pan	973037be6a	[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): `list()` / `tuple()` / `dict()` (#130199 ) This PR changes the empty collection factory call to Python literals: - `list()` -> `[]` - `tuple()` -> `()` - `dict()` -> `{}` The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary: ```bash $ python3 -m dis - <<EOS import collections d1 = {} d2 = dict() dict = collections.OrderedDict d3 = dict() EOS ``` ```text 0 0 RESUME 0 1 2 LOAD_CONST 0 (0) 4 LOAD_CONST 1 (None) 6 IMPORT_NAME 0 (collections) 8 STORE_NAME 0 (collections) 3 10 BUILD_MAP 0 12 STORE_NAME 1 (d1) 4 14 PUSH_NULL 16 LOAD_NAME 2 (dict) 18 CALL 0 26 STORE_NAME 3 (d2) 6 28 LOAD_NAME 0 (collections) 30 LOAD_ATTR 8 (OrderedDict) 50 STORE_NAME 2 (dict) 7 52 PUSH_NULL 54 LOAD_NAME 2 (dict) 56 CALL 0 64 STORE_NAME 5 (d3) 66 RETURN_CONST 1 (None) ``` The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199 Approved by: https://github.com/malfet	2024-07-11 17:30:28 +00:00
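The two claims in #130199 (fewer bytecodes for the literal, and the shadowing hazard of the factory call) can be checked with the standard `dis` module. A small sketch; the helper names `opnames` and `make_with_shadowed_name` are made up for illustration, and exact opcode names vary between CPython versions, so the comments only describe the general shape.

```python
import collections
import dis


def opnames(source):
    """Return the opcode names CPython emits for one expression."""
    code = compile(source, "<example>", "eval")
    return [ins.opname for ins in dis.get_instructions(code)]


literal_ops = opnames("{}")     # contains BUILD_MAP, no name lookup, no call
call_ops = opnames("dict()")    # loads the name `dict`, then calls it


def make_with_shadowed_name():
    # Rebinding `dict` changes what `dict()` builds; the literal is unaffected.
    dict = collections.OrderedDict
    return dict(), {}


shadowed, literal = make_with_shadowed_name()
```

Here `shadowed` is an `OrderedDict` while `literal` is still a plain built-in `dict`, which is the safety argument the commit message makes.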
PyTorch MergeBot	492de213e2	Revert "Change deprecated warning on dispatch_on_subclass to warn once (#130047 )" This reverts commit f21a21828ac6e16d903ee88f726fdb2278c04782. Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/albanD due to The failure on the PR are valid, they should not have been ignored ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2223488933))	2024-07-11 17:24:02 +00:00
Iris Zhang (PyTorch)	f21a21828a	Change deprecated warning on dispatch_on_subclass to warn once (#130047 ) Summary: Right now the deprecated warning fires on every operator that calls into torch_function. Changing it to TORCH_WARN_ONCE instead. More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/ Test Plan: Sandcastle Differential Revision: D59338775 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047 Approved by: https://github.com/XilunWu	2024-07-11 17:02:26 +00:00
wz337	3896ba3260	[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495 ) As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved and loaded DTensor. This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases. As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash. ``` test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495 Approved by: https://github.com/awgu, https://github.com/wanchaol	2024-07-11 17:02:18 +00:00
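The hashing rule in #130495 can be illustrated with a toy class. This is a sketch under assumptions, not the real `DeviceMesh`: the `Mesh` class, its fields, and the `threaded_backend` flag are hypothetical stand-ins for the behavior the message describes (real thread id under the threaded backend, `None` otherwise).

```python
import threading


class Mesh:
    """Toy mesh whose identity optionally depends on the creating thread."""

    def __init__(self, device_type, ranks, threaded_backend=False):
        self.device_type = device_type
        self.ranks = tuple(ranks)
        # Only the threaded backend records a real thread id; otherwise None.
        self.thread_id = threading.get_ident() if threaded_backend else None

    def _key(self):
        return (self.device_type, self.ranks, self.thread_id)

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        return isinstance(other, Mesh) and self._key() == other._key()


def build_in_worker(**kwargs):
    """Construct an otherwise identical mesh on a different thread."""
    box = []
    worker = threading.Thread(
        target=lambda: box.append(Mesh("cpu", range(4), **kwargs))
    )
    worker.start()
    worker.join()
    return box[0]


saved = Mesh("cpu", range(4))
loaded = build_in_worker()  # e.g. a mesh rebuilt while loading a checkpoint

threaded_a = Mesh("cpu", range(4), threaded_backend=True)
threaded_b = build_in_worker(threaded_backend=True)
```

`saved` and `loaded` compare equal even though they were built on different threads, while the two threaded-backend meshes stay distinct.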
Dmitry Nikolaev	72d9135679	increase tensor size to force out of memory exception on the latest generations of GPUs (#130334 ) This PR fixes profiler/test_profiler.py::.TestProfiler::test_oom_tracing Test expects OOM by allocating huge tensor. But MI300X has enough memory to allocate such a tensor. This PR increases tensor size with a large margin to force OutOfMemory exception on MI300X and future GPU generations Pull Request resolved: https://github.com/pytorch/pytorch/pull/130334 Approved by: https://github.com/jeffdaily, https://github.com/janeyx99	2024-07-11 16:59:40 +00:00
Nikita Shulga	9c1ba5ac10	[BE] Cleanup unused vars in MPS (#130541 ) And move `using namespace mps` outside of every function as there is no need to repeat it. Use `getTensorsStringKey` instead of explicit `getMPSShapeString(getMPSShape(t)) + getMPSDataTypeString(t)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130541 Approved by: https://github.com/Skylion007	2024-07-11 16:48:03 +00:00
Edward Z. Yang	68ad3eb722	Do not set hints for mark_unbacked quantities (#130483 ) Fixes https://github.com/pytorch/pytorch/issues/130456 When we mark_unbacked a size, we actually DO have a hint for it (because we have a real input tensor), and previously, we were accidentally putting it into the hint field of SymNode. If the marked unbacked size is zero or one, this can lead to inconsistency between hint compute and static evaluation compute under guard size oblivious, since that's the whole point of size oblivious. Answer is to scrub out hints on mark unbacked ints. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130483 Approved by: https://github.com/lezcano	2024-07-11 15:51:00 +00:00
chuanqiw	ca023f77bc	[CD] Add pytorch xpu wheel build in nightly (#129560 ) Add pytorch xpu wheel build in nightly after the xpu build image enabling PR https://github.com/pytorch/builder/pull/1879 merged Pull Request resolved: https://github.com/pytorch/pytorch/pull/129560 Approved by: https://github.com/atalman	2024-07-11 15:49:04 +00:00
Shangdi Yu	fb9bc6d74a	[custom op] add doc for CustomOpDef.set_kernel_enabled (#130406 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130406 Approved by: https://github.com/zou3519	2024-07-11 15:47:35 +00:00
James Wu	5ed72ff5f5	Reduce all tensors to their metadata in AOTAutogradCache; add tests (#128583 ) This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FXgraph directly, there's no risk of a constant tensor whose values are semantically important being lost here. AOTAutograd itself may take a constant tensor and set it as an attribute on an FXGraph for inductor, but Dynamo never does this. One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior. Technically `fast` is a "deprecated" feature according to python docs. But it's still supported in py3.8-3.12, and if it ever is removed, the only downside will just be a few more cache misses, so I think it's worth just adding here (and removing later as needed) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583 Approved by: https://github.com/oulgen ghstack dependencies: #128335	2024-07-11 15:39:09 +00:00
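The memoization effect described in the commit above can be reproduced with the standard library alone. The following is a minimal sketch, not the AOTAutogradCache code: the `dumps` helper and the sample strings are made up for illustration, and it assumes CPython, where the joined string is a distinct object from the literal.

```python
import io
import pickle


def dumps(obj, fast):
    """Pickle obj with an explicit Pickler so the `fast` flag can be set."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol=4)
    pickler.fast = fast  # fast=True disables the memo table
    pickler.dump(obj)
    return buf.getvalue()


a = "shared-key"
b = "".join(["shared", "-key"])  # equal to `a`, but a separate object

same_object = [a, a]    # second element is a memo hit when memoization is on
equal_objects = [a, b]  # second element is written out in full

# With the memo enabled, equal values can serialize to different bytes.
memo_differs = dumps(same_object, fast=False) != dumps(equal_objects, fast=False)
# With fast mode, both lists serialize to identical bytes.
fast_matches = dumps(same_object, fast=True) == dumps(equal_objects, fast=True)
```

A cache key derived from the first pair of byte strings would miss even though the two lists compare equal, which is the kind of false miss the commit message describes.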
Oguz Ulgen	be7bf20234	Add JK to enable fx graph cache for amd (#130463 ) Test Plan: ad hoc testing Differential Revision: D59593961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130463 Approved by: https://github.com/nmacchioni, https://github.com/mxz297	2024-07-11 15:28:38 +00:00
Jiang, Yanbing	6f662e9575	update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 ) This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across test machines without being regenerated on one particular platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`. Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a generated weight on another ISA or platform, because the compute format differs when the weight is loaded onto the device. ![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd) Now we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout. ![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7) ### Performance Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) There is no obvious regression from this PR. ![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima	2024-07-11 15:26:48 +00:00
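To make the `[n][k] int32` to `[n][k / 2] uint8` size claim concrete, here is a small pure-Python sketch of packing two 4-bit values into one byte. The helper name `pack_int4_row` and the nibble order (even-indexed value in the high nibble) are assumptions for illustration only; the commit message does not state which order PyTorch uses.

```python
def pack_int4_row(values):
    """Pack an even-length row of ints in [0, 15] into len(values) // 2 bytes.

    Assumption: the even-indexed value goes into the high nibble.
    """
    if len(values) % 2:
        raise ValueError("row length k must be even")
    packed = bytearray()
    for hi, lo in zip(values[0::2], values[1::2]):
        if not (0 <= hi <= 15 and 0 <= lo <= 15):
            raise ValueError("int4 values must be in [0, 15]")
        packed.append((hi << 4) | lo)
    return bytes(packed)


row = [1, 2, 3, 4, 15, 0]        # k = 6 int4 values
packed = pack_int4_row(row)      # k / 2 = 3 bytes
int32_bytes = 4 * len(row)       # the same row stored as int32
ratio = len(packed) / int32_bytes
```

Each value drops from four bytes to half a byte, which is where the `1 / 8` figure comes from.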
cyy	c4a2b6a943	[2/N] Fix NVCC warnings (#130214 ) Follows #130191 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130214 Approved by: https://github.com/ezyang	2024-07-11 14:46:53 +00:00
Animesh Jain	a833582dbb	[dynamo][tuple] Optimize guard for small tuples - helps conv2d guards (#130400 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130400 Approved by: https://github.com/yanboliang, https://github.com/jansel ghstack dependencies: #130285, #130368, #130416	2024-07-11 14:13:24 +00:00
Animesh Jain	f7d7b94017	[dynamo][unspecialized-nn-module] Distinguish between user-defined and builtin nn module (#130416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130416 Approved by: https://github.com/jansel ghstack dependencies: #130285, #130368	2024-07-11 14:13:24 +00:00
Animesh Jain	fed8b0055f	[dynamo][bugfix] Fix the value for key manager (#130368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130368 Approved by: https://github.com/jansel ghstack dependencies: #130285	2024-07-11 14:13:19 +00:00
Animesh Jain	9c612df504	[dynamo][cpp-guards][QOL] Print NO_TENSOR_ALIASING guard once (#130285 ) NO_TENSOR_ALIASING guard lists all tensors. Printing it on every occurrence is ugly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130285 Approved by: https://github.com/jansel	2024-07-11 14:13:14 +00:00
cdzhan	bac10cdd6f	[DCP] Fix duplicated logging messages when enabling both c10d and dcp logger (#130423 ) Fixes #129951 . Would you take a moment to review it? @LucasLLC Pull Request resolved: https://github.com/pytorch/pytorch/pull/130423 Approved by: https://github.com/Skylion007	2024-07-11 13:43:39 +00:00
Yifu Wang	0d66ccaf23	[IntraNodeComm] fix an issue where input check fails when running all-reduce on sub groups (#130492 ) Tested against the following snippet with `ENABLE_INTRA_NODE_COMM=1`. ```python import os import torch import torch.distributed as dist def main(): rank = int(os.environ["RANK"]) local_rank = int(os.environ["LOCAL_RANK"]) world_size = int(os.environ["WORLD_SIZE"]) torch.cuda.set_device(f"cuda:{local_rank}") dist.init_process_group("nccl") draft_group = dist.new_group([0, 1, 2, 3]) target_group = dist.new_group([4, 5, 6, 7]) inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda") dist.all_reduce(inp) expect = sum(range(world_size)) assert inp.eq(expect).all() if 0 <= rank < 4: inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda") dist.all_reduce(inp, group=draft_group) expect = sum(range(4)) assert inp.eq(expect).all() else: inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda") dist.all_reduce(inp, group=target_group) expect = sum(range(4, 8)) assert inp.eq(expect).all() torch.cuda.synchronize() dist.destroy_process_group() if __name__ == "__main__": main() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130492 Approved by: https://github.com/Chillee	2024-07-11 13:39:14 +00:00
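The expected values asserted in the snippet above can be checked without GPUs. This is a plain-Python simulation, not a use of `torch.distributed`: the hypothetical `simulate_all_reduce` helper models a sum all-reduce in which every rank contributes its own rank id.

```python
def simulate_all_reduce(group_ranks):
    """Model all-reduce(sum): every member ends up holding the group total."""
    members = list(group_ranks)
    total = sum(members)  # each rank contributes a tensor filled with its rank id
    return {rank: total for rank in members}


world = simulate_all_reduce(range(8))        # default group, 8 ranks
draft = simulate_all_reduce([0, 1, 2, 3])    # first sub group
target = simulate_all_reduce([4, 5, 6, 7])   # second sub group
```

The totals match `sum(range(world_size))`, `sum(range(4))` and `sum(range(4, 8))` from the test snippet, assuming a world size of 8.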
PyTorch MergeBot	f261c6ebe8	Revert "[halide-backend] Update CI pin (#130258 )" This reverts commit 4fcfd475bea24b832da32a0c4d464dd87c73a2a9. Reverted https://github.com/pytorch/pytorch/pull/130258 on behalf of https://github.com/albanD due to Seems to have broken trunk pretty bad `4fcfd475be` ([comment](https://github.com/pytorch/pytorch/pull/130258#issuecomment-2222935064))	2024-07-11 13:26:01 +00:00
albanD	354edb232a	Make public binding test only consider files that are packaged in the wheels (#130497 ) In particular, when creating the PyTorch wheel, we use setuptools find_packages `551b3c6dca/setup.py (L1055)` which explicitly skips packages without `__init__.py` files (namespace packages) https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages. So this PR is reverting the change to stop skipping these namespace packages as, even though they are in the codebase, they are not in the published binaries and so we're ok relaxing the public API and importability rules for them. A manual diff of the two traversal methods: ``` torch._inductor.kernel.bmm torch._inductor.kernel.conv torch._inductor.kernel.flex_attention torch._inductor.kernel.mm torch._inductor.kernel.mm_common torch._inductor.kernel.mm_plus_mm torch._inductor.kernel.unpack_mixed_mm torch._strobelight.examples.cli_function_profiler_example torch._strobelight.examples.compile_time_profile_example torch.ao.pruning._experimental.data_sparsifier.benchmarks.dlrm_utils torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_disk_savings torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_forward_time torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_model_metrics torch.ao.pruning._experimental.data_sparsifier.lightning.tests.test_callbacks torch.ao.quantization.experimental.APoT_tensor torch.ao.quantization.experimental.adaround_fake_quantize torch.ao.quantization.experimental.adaround_loss torch.ao.quantization.experimental.adaround_optimization torch.ao.quantization.experimental.apot_utils torch.ao.quantization.experimental.fake_quantize torch.ao.quantization.experimental.fake_quantize_function torch.ao.quantization.experimental.linear torch.ao.quantization.experimental.observer torch.ao.quantization.experimental.qconfig torch.ao.quantization.experimental.quantizer torch.csrc.jit.tensorexpr.codegen_external 
torch.csrc.jit.tensorexpr.scripts.bisect torch.csrc.lazy.test_mnist torch.distributed._tensor.examples.checkpoint_example torch.distributed._tensor.examples.comm_mode_features_example torch.distributed._tensor.examples.comm_mode_features_example_argparser torch.distributed._tensor.examples.convnext_example torch.distributed._tensor.examples.torchrec_sharding_example torch.distributed._tensor.examples.visualize_sharding_example torch.distributed.benchmarks.benchmark_ddp_rpc torch.distributed.checkpoint.examples.async_checkpointing_example torch.distributed.checkpoint.examples.fsdp_checkpoint_example torch.distributed.checkpoint.examples.stateful_example torch.distributed.examples.memory_tracker_example torch.fx.experimental.shape_inference.infer_shape torch.fx.experimental.shape_inference.infer_symbol_values torch.include.fp16.avx torch.include.fp16.avx2 torch.onnx._internal.fx.analysis.unsupported_nodes torch.onnx._internal.fx.passes._utils torch.onnx._internal.fx.passes.decomp torch.onnx._internal.fx.passes.functionalization torch.onnx._internal.fx.passes.modularization torch.onnx._internal.fx.passes.readability torch.onnx._internal.fx.passes.type_promotion torch.onnx._internal.fx.passes.virtualization torch.utils._strobelight.examples.cli_function_profiler_example torch.utils.benchmark.examples.sparse.compare torch.utils.benchmark.examples.sparse.fuzzer torch.utils.benchmark.examples.sparse.op_benchmark torch.utils.tensorboard._convert_np torch.utils.tensorboard._embedding torch.utils.tensorboard._onnx_graph torch.utils.tensorboard._proto_graph torch.utils.tensorboard._pytorch_graph torch.utils.tensorboard._utils torch.utils.tensorboard.summary torch.utils.tensorboard.writer ``` These are all either namespace packages (which we want to remove) or package that are not importable (and tagged as such in the test). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130497 Approved by: https://github.com/aorenste	2024-07-11 13:22:04 +00:00
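The difference between the two traversal methods can be sketched with the standard library. The `regular_packages` helper below is a hand-rolled approximation of the rule "only descend into directories that contain `__init__.py`"; it is not setuptools' `find_packages`, and the directory names are made up for illustration.

```python
import os
import tempfile


def regular_packages(root):
    """Return dotted names of directories that contain __init__.py.

    Directories without __init__.py (namespace packages) are skipped,
    together with everything below them.
    """
    found = []

    def visit(path, prefix):
        for entry in sorted(os.listdir(path)):
            full = os.path.join(path, entry)
            if os.path.isdir(full) and os.path.isfile(os.path.join(full, "__init__.py")):
                name = f"{prefix}.{entry}" if prefix else entry
                found.append(name)
                visit(full, name)

    visit(root, "")
    return found


with tempfile.TemporaryDirectory() as root:
    # pkg and pkg.core are regular packages; pkg/examples has no __init__.py.
    for rel in ("pkg/__init__.py", "pkg/core/__init__.py", "pkg/examples/demo.py"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    packages = regular_packages(root)
```

In this sketch `pkg/examples/demo.py` exists in the source tree but is never reported, mirroring the modules in the list above that exist in the codebase but are not shipped in the wheel.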
Eddie Yan	215013daad	[cuDNN][SDPA] Limit cuDNN SDPA head-dim to 128 (#130494 ) Limit cuDNN SDPA to head-dim 128 globally. Apparently the support for 256 is only for the forward on sm90+, which would be clunky to maintain as it would mean dispatching differently for forward/backward. CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/130494 Approved by: https://github.com/drisspg, https://github.com/Skylion007	2024-07-11 13:21:18 +00:00
cyy	9822fdc354	[7/N] Replace c10::optional with std::optional (#130510 ) Follows #130438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130510 Approved by: https://github.com/janeyx99	2024-07-11 13:21:05 +00:00
Wang, Eikan	f52b2ee90f	Modularize aten parameter parser and checker (#125308 ) In this PR, we abstract the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`. ```C++ using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>; ``` With this PR, we can add support for other parameter types, such as `string`, `int`, and `double`, in a more modular way. Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman	2024-07-11 13:17:25 +00:00
Edward Z. Yang	2a51ccc77e	When translation validation is enabled, assert that hint is consistent (#130478 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130478 Approved by: https://github.com/lezcano	2024-07-11 13:02:31 +00:00
cyy	c9551a3f50	Make c10::string_view an alias of std::string_view (#130417 ) Follows #130009 to further facilitate the migration from c10::string_view to std::string_view. The old c10::string_view was renamed to c10::string_view_ext. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417 Approved by: https://github.com/ezyang	2024-07-11 12:31:06 +00:00
cyy	c5b66c3fe1	Enable -Werror=pedantic on torch targets (#130319 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130319 Approved by: https://github.com/ezyang	2024-07-11 12:27:32 +00:00
Isuru Fernando	5db9bd467e	Skip test_nnc_correctness for new op _unsafe_masked_index (#130375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375 Approved by: https://github.com/lezcano	2024-07-11 08:17:16 +00:00
Benson Ma	b1942a1af4	[fbgemm_gpu] Break up `fbgemm_cuda_utils.cuh`, pt 10 (#130468 ) Summary: X-link: https://github.com/pytorch/FBGEMM/pull/2814 X-link: https://github.com/facebookresearch/FBGEMM/pull/19 - Break up `fbgemm_cuda_utils.cuh`, pt 10 Test Plan: ``` buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/jagged/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255' buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/tbe/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255' buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/sparse/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255' buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//smart/inference_platform_sp/llm_predictor_amd:service buck2 build --flagfile fbcode//mode/amd-gpu fbcode//hpc/ops:sparse_ops buck2 build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2/benchmarks/operator_benchmark/pt:add_test ``` Reviewed By: spcyppt Differential Revision: D59545097 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130468 Approved by: https://github.com/ezyang	2024-07-11 07:10:27 +00:00
Xu Han	79c41bb58a	[inductor] switch CppCodeCache to new cpp_builder. (#130132 ) Changes: 1. switch CppCodeCache to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-11 07:03:43 +00:00
Wanchao Liang	75ab027fbb	[dtensor] move bernoulli to op strategy (#130286 ) as titled Pull Request resolved: https://github.com/pytorch/pytorch/pull/130286 Approved by: https://github.com/awgu, https://github.com/yifuwang	2024-07-11 06:43:11 +00:00
Bilal Khan	fdc83610f2	Support for expandable segments with cuda graph trees (#128068 ) This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks. The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work. Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together. The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda. 
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones. As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs. One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is not part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays. Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint. Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. 
For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068 Approved by: https://github.com/zdevito, https://github.com/eqy	2024-07-11 05:33:09 +00:00
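The split, map, free and merge behaviour described in the commit above can be illustrated with a toy model. This is a sketch under heavy simplification, not the CUDA caching allocator: `ToySegment` and `Block` are hypothetical names, sizes are plain integers, and "mapping" is only a state change on a block.

```python
from dataclasses import dataclass


@dataclass
class Block:
    start: int
    size: int
    state: str  # "unmapped", "free" or "allocated"


class ToySegment:
    """One contiguous virtual range that starts as a single unmapped block."""

    def __init__(self, reserved):
        self.blocks = [Block(0, reserved, "unmapped")]

    def _split(self, index, size):
        # Carve `size` units off the front of a block; the rest keeps its state.
        blk = self.blocks[index]
        rest = Block(blk.start + size, blk.size - size, blk.state)
        blk.size, blk.state = size, "allocated"
        if rest.size:
            self.blocks.insert(index + 1, rest)
        return blk.start

    def malloc(self, size):
        # Prefer a cached free block; otherwise map from the unmapped tail.
        for index, blk in enumerate(self.blocks):
            if blk.state == "free" and blk.size >= size:
                return self._split(index, size)
        tail = self.blocks[-1]
        if tail.state != "unmapped" or tail.size < size:
            raise MemoryError("toy segment exhausted")
        return self._split(len(self.blocks) - 1, size)

    def free(self, start):
        for index, blk in enumerate(self.blocks):
            if blk.start == start and blk.state == "allocated":
                break
        else:
            raise ValueError("no allocated block at this address")
        blk.state = "free"
        # Merge with free neighbours so the cached block can serve larger requests.
        if index + 1 < len(self.blocks) and self.blocks[index + 1].state == "free":
            blk.size += self.blocks.pop(index + 1).size
        if index > 0 and self.blocks[index - 1].state == "free":
            self.blocks[index - 1].size += self.blocks.pop(index).size


seg = ToySegment(reserved=100)
a = seg.malloc(10)
b = seg.malloc(20)
seg.free(a)
seg.free(b)          # merges with the free block left by `a`
c = seg.malloc(25)   # served from the merged 30-unit free block
layout = [(blk.start, blk.size, blk.state) for blk in seg.blocks]
```

In this model the unmapped remainder always sits at the end of the segment, which mirrors the commit's note that the last block of every expandable segment is an unallocated unmapped block.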
fduwjj	da24823e06	[BE][EZ] Migrate to new dcp save and load APIs (#130475 ) While experimenting with DCP for distributed inference, I found that we are still using deprecated DCP APIs, even in unit tests. This PR switches to the new API, which uses the unified lowercase "dcp". Pull Request resolved: https://github.com/pytorch/pytorch/pull/130475 Approved by: https://github.com/wz337	2024-07-11 04:13:39 +00:00
Will Feng	5835ff1ed5	[Easy][Inductor] Add comment for .min_order and .max_order (#130390 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130390 Approved by: https://github.com/anijain2305	2024-07-11 03:58:03 +00:00
Shangdi Yu	a4576dad34	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-11 03:39:07 +00:00
Will Constable	9f401187c7	[pipelining] Refactor test_schedule to fix "-k" (#130294 ) This is kind of a short-sighted workaround and we should actually come up with a way to fix this in general, but I got annoyed that I can't use -k to filter tests in test_schedule, and realized it's because we jam tests using the new MultiProcContinuousTest fixture together with old-style tests. For now I separate the two types of tests so -k works again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294 Approved by: https://github.com/H-Huang	2024-07-11 03:18:02 +00:00
Mikayla Gawarecki	dfd1d1971e	Fix warning when pickle.load torch.Storage (#130246 ) Fixes https://github.com/pytorch/pytorch/issues/130242 Since `torch.save` does not use pickle for storages, the `torch.load` in `_load_from_bytes` should not ever be called when `torch.load`-ing a checkpoint. Setting weights_only=False explicitly in `_load_from_bytes` to avoid the weights_only warning when using the pickle module Pull Request resolved: https://github.com/pytorch/pytorch/pull/130246 Approved by: https://github.com/albanD	2024-07-11 02:40:29 +00:00
Jason Ansel	4fcfd475be	[halide-backend] Update CI pin (#130258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258 Approved by: https://github.com/eellison	2024-07-11 02:26:16 +00:00
Jerry Zhang	df9d1b44e7	Preserve _numeric_debug_handle through deepcopy and re-export (#129287 ) Summary: * Added support for preserving it during deepcopy, need to remap the args since _numeric_debug_handle refers to the nodes in the graph TODO: need to fully support re-export, currently the metadata for output node is not preserved Test Plan: python test/test_quantization.py -k test_deepcopy_preserve_handle python test/test_quantization.py -k test_copy_preserve_handle all related tests: python test/test_quantization.py -k TestGenerateNumericDebugHandle Pull Request resolved: https://github.com/pytorch/pytorch/pull/129287 Approved by: https://github.com/zhxchen17	2024-07-11 02:19:41 +00:00
Edward Z. Yang	a205a53c50	Make sym_node log more useful (#130436 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130436 Approved by: https://github.com/Skylion007	2024-07-11 01:42:53 +00:00
Edward Z. Yang	79e34800c3	Suppress guards generated by empty_strided in ir_node_to_tensor (#130431 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130431 Approved by: https://github.com/IvanKobzarev	2024-07-11 01:19:11 +00:00
cyy	798b9652f7	[6/N] Replace c10::optional with std::optional (#130438 ) Follows #130408 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130438 Approved by: https://github.com/janeyx99	2024-07-11 01:15:37 +00:00
leslie-fang-intel	5bc18ec0a1	[Inductor][CPP] Support vectorization of remainder (#129849 ) Summary When checking the vectorization status across 3 test suites, we found that some operators had vectorization disabled with the message `Disabled vectorization: op: remainder`. This PR adds vectorization support for this op. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849 Approved by: https://github.com/jgong5, https://github.com/lezcano ghstack dependencies: #130405	2024-07-11 00:50:50 +00:00
Vladimir Fokow	6adc725157	doc - fix the `max_norm` value in a note (#129687 ) `max_norm=True` is currently written in the note, but `max_norm` can be a `float`, NOT a `bool` (as the [docstring](`ec284d3a74/torch/nn/modules/sparse.py (L30)`) says). That note was created in #45595 The current pull request cleans it up. The value `True` in the note can confuse the users to think it can be a boolean. In fact, a counter-intuitive behavior will happen if users try to set it to `False`: it will be interpreted as 0, so the values of the embedding will become 0 - not what the users were expecting by setting it to `False`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129687 Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet	2024-07-11 00:01:17 +00:00
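The counter-intuitive `max_norm=False` behaviour follows from `bool` being a subclass of `int` in Python. The sketch below is a pure-Python approximation of row renormalization, not the `torch.nn.Embedding` implementation; the `renorm_row` helper and its `eps` value are assumptions for illustration.

```python
import math


def renorm_row(row, max_norm, eps=1e-7):
    """Scale `row` down so that its L2 norm does not exceed `max_norm`."""
    max_norm = float(max_norm)  # False -> 0.0, True -> 1.0
    norm = math.sqrt(sum(x * x for x in row))
    if norm > max_norm:
        scale = max_norm / (norm + eps)
        return [x * scale for x in row]
    return list(row)


row = [3.0, 4.0]                  # L2 norm is 5.0
clipped = renorm_row(row, 2.5)    # scaled down to a norm of roughly 2.5
zeroed = renorm_row(row, False)   # treated as max_norm=0.0, so the row collapses
clipped_norm = math.sqrt(sum(x * x for x in clipped))
```

Passing `False` does not disable renormalization in this sketch; it zeroes the row, which matches the behaviour the commit message warns about.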
Sam Larsen	358da54be5	[inductor] Better messaging when triton version is too old (#130403 ) Summary: If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior: 1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem. 2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..." Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403 Approved by: https://github.com/eellison	2024-07-10 23:45:50 +00:00
Andrew Gu	ceedee23ec	[DTensor] Included meshes in cross-mesh error msg (#130454 ) The current error message is not actionable since we do not know which meshes are involved. Including the `__repr__` of each mesh in the error helps but is not always sufficient. `7d4cb21098/torch/distributed/device_mesh.py (L395-L408)` The problem is that `DeviceMesh.__eq__` is actually pretty involved, and we cannot see all parts of the `__eq__` criteria just from the `__repr__` (e.g. the thread ID). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130454 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-07-10 22:40:57 +00:00
Xu Han	2abc7cc21b	[inductor] switch AotCodeCompiler to new cpp_builder (#130127 ) Changes: 1. Switch `AotCodeCompiler` to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-10 22:28:29 +00:00
Ivan Zaitsev	551b3c6dca	Use irange to avoid -Wsign-compare errors (#130388 ) Fixes meta-internal errors after importing #128753 (see [D59498679](https://www.internalfb.com/diff/D59498679)) ``` fbcode/caffe2/aten/src/ATen/Context.cpp:286:34: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare] for (auto index = 0; index < at::getNumGPUs(); index++) { ~~~~~ ^ ~~~~~~~~~~~~~~~~ 1 error generated. ``` Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130388 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-07-10 22:07:51 +00:00
PyTorch MergeBot	ce499eee0c	Revert "Add API for open registration between operators and subclasses (and modes) (#130064 )" This reverts commit c23d103afae65588772cb30037ea4110f01f6f41. Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/izaitsevfb due to fails internal builds, see [D59553526](https://www.internalfb.com/diff/D59553526) ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2221587575))	2024-07-10 21:50:32 +00:00
Chirag Pandya	83c95c48f7	Flight recorder data as JSON (#129505 ) Summary: Provide a new API to retrieve flight recorder data as JSON. The one minor difference between flight recorder data as Pickle vs. JSON is that the JSON API does not retrieve stack traces at the moment, as this ends up being far too much data. Test Plan: unit test Differential Revision: [D59536460](https://our.internmc.facebook.com/intern/diff/D59536460) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129505 Approved by: https://github.com/wconstab, https://github.com/d4l3k	2024-07-10 21:50:27 +00:00
PyTorch MergeBot	86bca69c5f	Revert "[custom_ops] expose torch.library.register_torch_dispatch (#130261 )" This reverts commit bb9a73f767526e0d23c60360db5212b6bed0e8bc. Reverted https://github.com/pytorch/pytorch/pull/130261 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130261#issuecomment-2221569707))	2024-07-10 21:43:28 +00:00
PyTorch MergeBot	e14a0f45ed	Revert "[reland][custom ops] infer schema (#130079 )" This reverts commit bef085bdfa62cc14589c70279de17108b2c2089f. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2221561483))	2024-07-10 21:40:16 +00:00
Jon Janzen	46c52661bc	Use a better cherry-pick strategy for stable pytorch w/ distribute changes (#129987 ) 1. Update the branch name from internal feedback 2. Only cherry-pick in the changes to these folders Pull Request resolved: https://github.com/pytorch/pytorch/pull/129987 Approved by: https://github.com/seemethere	2024-07-10 20:55:36 +00:00
Catherine Lee	80a421a54d	[TD] Pin numpy to 1.26.0 in indexer (#130442 ) Temporarily pin 1.26.0 to get the workflow working while I go sort out which dependencies need to be updated Succeeding run: https://github.com/pytorch/pytorch/actions/runs/9877733366/job/27280052419?pr=130442 Tested by adding my branch to the trust relationship for the policy and removing the environment Pull Request resolved: https://github.com/pytorch/pytorch/pull/130442 Approved by: https://github.com/atalman, https://github.com/malfet	2024-07-10 20:52:24 +00:00
PyTorch MergeBot	cd2638be09	Revert "[pipelining] Refactor test_schedule to fix "-k" (#130294 )" This reverts commit 1352f13f7827cd1862a6e0507fb17dccddf73dc2. Reverted https://github.com/pytorch/pytorch/pull/130294 on behalf of https://github.com/clee2000 due to broke lint https://github.com/pytorch/pytorch/actions/runs/9879591538/job/27286156803 ([comment](https://github.com/pytorch/pytorch/pull/130294#issuecomment-2221376073))	2024-07-10 20:26:58 +00:00
PyTorch MergeBot	b81767161e	Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 )" This reverts commit 08d5423d339ac4b302f8ae6b63b334e032104753. Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9879109008/job/27286339304 `08d5423d33` test was not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2221368245))	2024-07-10 20:22:24 +00:00
Pian Pawakapan	1b3b4c2fb9	[runtime asserts] deduplicate runtime asserts & CSE (#128599 ) (#130380 ) original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train) Summary: This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example: ``` z = torch.cat([x, x], dim=0) # 2*s0 w = z.repeat(y.shape[0]) # 2*s0*s1 _w = w.shape[0] s0 = x.shape[0] s1 = y.shape[0] _w0 = 2 * s0 _w = _w0 * s1 ``` Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example: ``` torch.sym_constrain_range_for_size(n, min=2, max=16) torch.sym_constrain_range(n, min=4, max=20) torch._check(n >= 0) torch._check(n >= 3) torch._check(n <= 14) torch.sym_constrain_range_for_size(n) torch._check(n >= 4) torch._check(n <= 14) ``` Test Plan: contbuild & OSS CI, see `940e4477ab` Original Phabricator Test Plan: Imported from GitHub, without a `Test Plan:` line. Differential Revision: D59543603 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380 Approved by: https://github.com/izaitsevfb	2024-07-10 19:23:37 +00:00
Will Constable	1352f13f78	[pipelining] Refactor test_schedule to fix "-k" (#130294 ) This is kind of a short-sighted workaround and we should actually come up with a way to fix this in general, but I got annoyed that I can't use -k to filter tests in test_schedule, and realized it's because we jam tests using the new MultiProcContinuousTest fixture together with old-style tests. For now I separate the two types of tests so -k works again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294 Approved by: https://github.com/H-Huang	2024-07-10 18:32:51 +00:00
Feng Yuan	cf090e222e	Update torch-xpu-ops pin (ATen XPU implementation) (#130333 ) 1. Fixing compilation error due to PyTorch update. The helper function prototype changes, `checkIndexTensorTypes`. 2. Fixing compilation error due to PyTorch update. PyTorch forced -Werror=unused-function. 3. Fixing inductor case failure due to CUDA bias implementation in the case. https://github.com/pytorch/pytorch/issues/130426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130333 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-07-10 18:10:53 +00:00
Nikita Shulga	4b7ee51260	[BE][MPS] Cleanup optimizers code (#130453 ) - Fix C++20 forward compatibility warnings, namely ``` warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions] multi_tensor_apply_for_fused_optimizer<2, 512>(kernel_name, ``` - Use nested namespaces - Do not explicitly specify `at::` namespace for functions already implemented inside of that namespace - Use more convenience methods (rather than call by hand) - Use C++14 `return f();` for void functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/130453 Approved by: https://github.com/Skylion007	2024-07-10 18:00:05 +00:00
IvanKobzarev	08d5423d33	[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890 ) Reland of: https://github.com/pytorch/pytorch/pull/128016 Summary from previous PR: We assume only two possible mutually exclusive scenarios: 1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad. 2. Running the compiled region for inference (none of the inputs has requires_grad): no output has requires_grad. Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with the training scenario (1). With the current state that means: 1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad 2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under with.enable_grad() Changes in partitioner? Inference and training graphs differed in their return container (list vs. tuple). The partitioner changes unify this to always return a tuple. As a result, there are some changes in test_aotdispatch.py for graph contents, list -> tuple. Why was it reverted? There was a regression of the hf_Reformer model on inference. ``` TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode ``` This happened because one of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True). Even though the torchbench inference benchmark runs inside `with torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad. As a result we started compiling a training graph instead of an inference graph. Fix for view ops: If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph. This is handled in aot_autograd.py, where output_and_mutation_safe are calculated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890 Approved by: https://github.com/bdhirsh	2024-07-10 17:56:32 +00:00
PyTorch MergeBot	0beeac35fa	Revert "[cond] inlining into one of the branches when pred is a python constant (#128709 )" This reverts commit fe3e6878c4bb2a6001045c179fd7fa9838242558. Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/ydwu4 due to causing error on trunk due to a land race: `fe3e6878c4` ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2221104043))	2024-07-10 17:47:19 +00:00
Shivam Raikundalia	b4b7477d3f	Fix CPU Annotation Overlapping with Python Events (#129599 ) Summary: Currently we have an issue where CPU User annotations can overlap with python events in the event that a python event calls step() within the function itself. To combat this, we can move the left side of the user annotation to the beginning of the parent python function. We do this because when instantiating the profiler we already start on step 0. To implement this, we start by collecting all instances of ProfilerStep during post processing. Since TorchOps and Python events are sorted already, we can easily check if the current python event partially overlaps with the current ProfilerStep and, if so, alter the start time of the current ProfilerStep. We then move to the next ProfilerStep and continue iterating through all the python events. This keeps the time complexity of adding events to 'out' at O(s + n) -> O(n) post sorting, where "s" is the number of ProfilerSteps and "n" is the length of all events. Test Plan: Added unit test in which step() is called midway through a function. Afterwards, we print out a trace and then load the json to check that there are no overlaps. Also make sure that there is no regression in performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129599 Approved by: https://github.com/aaronenyeshi	2024-07-10 17:33:56 +00:00
Ivan Zaitsev	6b3460ae0d	fix discrepancy from the export of #126601 (#130296 ) #126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130296 Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet	2024-07-10 17:26:44 +00:00
Tom Ritchford	7d4cb21098	Decompose expand_copy and permute_copy (#129476 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-07-10 17:12:01 +00:00
AIM \| Nara	a7aa066b09	Fix link to dynamo in torch/fx readme (#130233 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130233 Approved by: https://github.com/janeyx99	2024-07-10 17:00:49 +00:00
Laith Sakka	a09910d3a9	add strobelight profile links to tlparse (#129703 ) Summary: title. Test Plan: buck2TORCH_TRACE=~/my_trace_log_dir buck2 run @//mode/inplace @//mode/opt //caffe2/fb/strobelight:compile_time_profiler_example tlparse ~/my_trace_log_dir result https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpBrQJcL/index.html {F1726980413} Differential Revision: D59130581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129703 Approved by: https://github.com/aorenste	2024-07-10 16:53:21 +00:00
Yidi Wu	fe3e6878c4	[cond] inlining into one of the branches when pred is a python constant (#128709 ) When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was that we baked True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the inputs are constants. We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on the de-serialized graph. Test Plan: The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized as one of the branches. Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709 Approved by: https://github.com/zou3519	2024-07-10 16:44:27 +00:00
atalman	9d94b122f0	Fix usage of USE_ROCM when calling cudaFuncGetAttributes (#130441 ) This fixes MSVC build regression introduced by https://github.com/pytorch/pytorch/pull/129710 as VC++ fails to unroll nested defines in the specific order and fails with ``` C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\int4mm.cu(984): error: "#" not expected here do { const cudaError_t __err = cudaFuncGetAttributes( &funcAttr, #if defined(USE_ROCM) (void *)func #else func #endif ); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\cuda\\int4mm.cu", __func__, static_cast<uint32_t>(991), true); } while (0); ``` Fixes https://github.com/pytorch/pytorch/issues/130437 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130441 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-07-10 16:30:43 +00:00
Richard Barnes	ae73489b7d	[codemod] Use C++17 [[fallthrough]] in 1 file inc caffe2/aten/src/ATen/native/cuda/DistributionTemplates.h (#130433 ) Test Plan: Sandcastle Reviewed By: meyering Differential Revision: D59528276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130433 Approved by: https://github.com/malfet	2024-07-10 16:30:37 +00:00
Shangdi Yu	bef085bdfa	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-10 16:18:36 +00:00
chilli	ce4d95143f	Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250 ) After this PR, our numerical error is within 3% of FA2 for forward and gradients. Prior, for `dq` our numerical error was 30% higher. I also added a `PRESCALE_QK` kernel option that increases perf by about 3-4% but incurs about 20-30% more numerical error. ![image](https://github.com/pytorch/pytorch/assets/6355099/7b5ff44e-219b-4a05-8a1b-2a0182c01ab2) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250 Approved by: https://github.com/drisspg ghstack dependencies: #130227	2024-07-10 16:14:45 +00:00
chilli	a7715e36de	Add block mask utility support for batches and heads > 1 (#130227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227 Approved by: https://github.com/yanboliang	2024-07-10 16:14:45 +00:00
Shangdi Yu	c83b941141	[export] add dynamic shapes argument and infer from graph nodes (#129928 ) Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`. On a high level, the issue is caused by not detecting fake_mode when there's no input. Change plan: 1) we add a `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`. 2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg. 3) If the input is a graph module, then we can traverse the graph and check. 4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type. Fixes #129927 Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there’s no input to the graph. So in `_strict_export_lower_to_aten_ir`, we create another fake_mode. `dynamo_fake_mode` is not the same as the fake_mode used by dynamo. Change plan: check `gm_torch_level` graph's node meta "example_value" for fake mode in addition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928 Approved by: https://github.com/angelayi	2024-07-10 15:51:05 +00:00
cyy	d31f866b33	[BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409 ) AT_CORE_STATIC_WINDOWS was inherited from torch and is not used anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130409 Approved by: https://github.com/malfet	2024-07-10 15:50:47 +00:00
Chien-Chin Huang	81ea298600	Wrap the test func with try/except to always call destroy_process_group (#124961 ) This can avoid the PG warning about not calling destroy_pg Pull Request resolved: https://github.com/pytorch/pytorch/pull/124961 Approved by: https://github.com/wanchaol, https://github.com/wz337	2024-07-10 15:36:38 +00:00
Michael Eisel	81df076bfd	Fix Apple crash when running PyTorch with Metal API validation turned on (#130377 ) Fixes #130376 (at least, for my usage) There may be other places in the code base where `-setBytes:length:` is called with a length of 0 besides this, but this is the case that has triggered for me. Please let me know if there are any specific tests I should run. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130377 Approved by: https://github.com/malfet	2024-07-10 15:07:47 +00:00
Andres Lugo-Reyes	417c83e7cf	[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966 ) Needle has moved quite a bit on the ROCm backend front. This PR is intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560 This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069 unskipping the next batch of tests referenced by the aforementioned issue. No explicit changes needed for source as they worked immediately after unskipping. The tests previously marked with xfail have now been modified to not expect a failure iff running on ROCm as they now pass. Behavior is unchanged for them on other architectures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966 Approved by: https://github.com/malfet	2024-07-10 14:53:41 +00:00
rzou	b38de2f9e2	[decomps] Fix aten._to_copy decomp (#130381 ) `aten._to_copy` can receive a python number as input. This occurs in torch.compile support for vmap (see #130188). Previously, this would raise an assertion error. This PR changes it so that if we see a python number, we call torch.scalar_tensor on it first (h/t @bdhirsh). Fixes #130362 Fixes #130188 Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130381 Approved by: https://github.com/Chillee	2024-07-10 14:34:28 +00:00
cyy	bd3452f431	[5/N] Change #include <c10/util/Optional.h> to #include <optional> (#130408 ) Follows #130329 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130408 Approved by: https://github.com/malfet	2024-07-10 14:29:43 +00:00
Li-Huai (Allan) Lin	99967e1119	[MPS][TYPE_PROMOTION] Fix Clamp (#130226 ) Summary: 1. Fixed #130201 by adding type promotion. 2. Added proper tests. 3. Found torch's type promotion is different from numpy as follows: ```python import torch import numpy as np np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype # dtype('float64') torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype # torch.float32 ``` ~Not sure the proper way to handle it, it causes numpy ref tests to fail.~ Reason here, so think I'm gonna xfail it: `3c1cf03fde/test/test_ops.py (L260-L264)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226 Approved by: https://github.com/malfet	2024-07-10 14:27:39 +00:00
rzou	6ce0bd7d3b	[HOP] Use user directed names for variables where possible (#130271 ) Afaict the previous check was too strict. Removing it passes all the mutation tests (mutation checks happen via the TensorVariable's mutable_local). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271 Approved by: https://github.com/Chillee, https://github.com/ydwu4	2024-07-10 13:59:20 +00:00
PyTorch MergeBot	637cc8d27f	Revert "update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 )" This reverts commit 6367f02a0e136ced05c665301bcdaa4d76690457. Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main `6367f02a0e` ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))	2024-07-10 13:48:32 +00:00
atalman	a1590e16df	Add single Python 3.10, single Cuda 12.1 build with dependencies included (#130349 ) Build large wheel for Python 3.10, CUDA 12.1 that will be used in Colab. Build name: ``manywheel-py3_11-cuda12_1-full-build`` We still have all code to support the full build in builder repo, here: https://github.com/pytorch/builder/blob/main/manywheel/build_cuda.sh#L151 Test: ``` import sys import torch sys.version_info print(torch.__version__) sys.version_info 2.3.0+cu121 sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130349 Approved by: https://github.com/malfet	2024-07-10 12:57:39 +00:00
Li-Huai (Allan) Lin	cb2bce98de	[MPS][BE] Reduce the number of parameters encoded for no momentum fused SGD (#130131 ) Summary: 1. Reduce the number of parameters encoded for no momentum fused SGD 2. Use convenience functions `mtl_setBuffer` and `mtl_setBytes`. Just a BE, no significant performance difference is observed. Test plan: Relying on CI signals Pull Request resolved: https://github.com/pytorch/pytorch/pull/130131 Approved by: https://github.com/janeyx99, https://github.com/malfet	2024-07-10 07:58:38 +00:00
Jiang, Yanbing	6367f02a0e	update the input `weight` of `_convert_weight_to_int4pack` to `[n][k / 2] uint8` (#129940 ) This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple the int4 model checkpoint from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being regenerated on each platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`. Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a generated weight on a different ISA or platform, because the compute format differs when loading the weight onto devices. ![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd) Now, we use a common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout. ![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7) ### Performance Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) There is no obvious regression from this PR. ![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940 Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima	2024-07-10 07:38:42 +00:00
leslie-fang-intel	e29657efb6	[Inductor][CPP] Fix typo in merge rules (#130405 ) Summary There is a typo of the `CPU Inductor` group in `merge_rules.yaml` which should be `test/inductor/test_cpu_repro.py` instead of `test/inductor/test_cpu_repo.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130405 Approved by: https://github.com/jgong5, https://github.com/lezcano	2024-07-10 07:13:03 +00:00
cyy	10c7f037fe	Simplify c10::string_view (#130009 ) Make c10::basic_string_view a subclass of std::basic_string_view for easier replacement in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130009 Approved by: https://github.com/ezyang	2024-07-10 05:02:16 +00:00
Xuehai Pan	a17d1e5322	Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store` (#130341 ) Fix static `py::object`s with `py::gil_safe_call_once_and_store`. The following code will leak a `py::object` whose destructor is called when the program shuts down. The destructor will call `Py_DECREF(obj.m_ptr)`, which may raise a segmentation fault. ```c++ void func() { static py::object obj = py::module_::import("foo").attr("bar"); ... } ``` The correct code is to use raw pointers rather than the instance. ```c++ void func() { static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")}; py::object obj = *obj_ptr; ... } ``` This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code exactly once, thread-safely, under the Python GIL. ```c++ void func() { PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage; py::object obj = storage .call_once_and_store_result( []() -> py::object { return py::module_::import("foo").attr("bar"); } ) .get_stored(); ... } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341 Approved by: https://github.com/ezyang	2024-07-10 04:23:37 +00:00
rzou	5abe7ebd41	Add new (private) capture_triton API (#130178 ) When applied to a triton kernel, capture_triton allows the triton kernel to be captured when tracing with make_fx. It does this by transforming the call to the triton kernel into a call to the triton_kernel_wrapper_mutation HOP, which can actually be traced into a graph via make_fx. We have two main use cases for this: - non-strict export doesn't use Dynamo, but people want to use non-strict export to export programs with triton kernels. non-strict export uses make_fx tracing, so this is a necessary step in that direction. - People want to write inductor passes that replace a sequence of operators with a call to a function that may contain a triton kernel. The way these passes work today is that we have an FX graph and want to replace a subgraph of it with a new subgraph. We obtain said subgraph from calling make_fx on the function; this won't work on raw triton kernels but will work if one uses capture_triton. Test Plan: - I wrote some manual tests to run make_fx over two of the triton kernels in test_triton_kernels. It would be nice to be able to run make_fx through all of the tests in the file but I'm not sure how to do that refactor right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130178 Approved by: https://github.com/oulgen ghstack dependencies: #130177	2024-07-10 03:09:29 +00:00
rzou	99c68f7bea	Refactor TritonKernelVariable's logic so it can be shared (#130177 ) TritonKernelVariable's logic tells us how to go from a user-defined triton kernel and a grid to a call to the triton_kernel_wrapper_mutation HOP. We want to re-use this in a setting without Dynamo; in the next PR up, we create a new decorator (capture_triton) that, when applied to a triton kernel, transforms a call to the triton kernel into a call to the triton_kernel_wrapper_mutation HOP. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177 Approved by: https://github.com/oulgen, https://github.com/ydwu4	2024-07-10 03:09:29 +00:00
Valentine233	868d9a4f12	[cpu][flash attention] fix nan issue (#130014 ) Fixes #127055. NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in lazy softmax. We fix the issue by avoiding the related calculation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130014 Approved by: https://github.com/jgong5, https://github.com/drisspg	2024-07-10 02:33:26 +00:00
Tom Ritchford	68751799b8	Add decompositions for copy variants of view ops (#128416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416 Approved by: https://github.com/amjames, https://github.com/lezcano	2024-07-10 01:39:09 +00:00
cyy	007e75958f	[4/N] Change #include <c10/util/Optional.h> to #include <optional> (#130329 ) Follows #130300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130329 Approved by: https://github.com/ezyang	2024-07-10 01:26:50 +00:00
awayzjj	9912209743	check if the input fx graph of aot_compile return tuple (#129824 ) Fixes https://github.com/pytorch/pytorch/issues/129719 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129824 Approved by: https://github.com/angelayi, https://github.com/yushangdi	2024-07-10 01:18:55 +00:00
cyy	85b8503621	[Caffe2] Remove Caffe2 documentation (#130089 ) Due to the removal of Caffe2 code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130089 Approved by: https://github.com/r-barnes, https://github.com/albanD	2024-07-10 00:52:16 +00:00
cyy	7a3ab1fe79	[structural binding][7/N] Replace std::tie with structural binding (#130216 ) Follows #120353 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130216 Approved by: https://github.com/albanD	2024-07-10 00:52:04 +00:00
PyTorch MergeBot	fb696bf264	Revert "Add block mask utility support for batches and heads > 1 (#130227 )" This reverts commit 64139987c0588f2eef198a0b9fd6904783b37b2c. Reverted https://github.com/pytorch/pytorch/pull/130227 on behalf of https://github.com/izaitsevfb due to breaks internal builds, please see D59498662 ([comment](https://github.com/pytorch/pytorch/pull/130227#issuecomment-2218842579))	2024-07-09 22:34:39 +00:00
PyTorch MergeBot	44815ed67e	Revert "Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250 )" This reverts commit 3e48d927332915e1ecbd3c7f2c6b9680428f181e. Reverted https://github.com/pytorch/pytorch/pull/130250 on behalf of https://github.com/izaitsevfb due to depends on #130227 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130250#issuecomment-2218840674))	2024-07-09 22:32:54 +00:00
Catherine Lee	5b5a1f5202	Add on to Mark some test_decomp tests as slow on win #130260 (#130337 ) An add on to https://github.com/pytorch/pytorch/pull/130260 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130337 Approved by: https://github.com/malfet	2024-07-09 22:30:53 +00:00
Joel Schlosser	fd43a2ba27	Forward fix for test_compare_cpu_cuda_float32 (#130360 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130360 Approved by: https://github.com/malfet ghstack dependencies: #128238	2024-07-09 22:28:39 +00:00
PyTorch MergeBot	3be4922a9d	Revert "[HOP] Use user directed names for variables where possible (#130271 )" This reverts commit adb65682affdfc37f724c02ea8c8930d3925fc07. Reverted https://github.com/pytorch/pytorch/pull/130271 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9863205414/job/27236960046 `adb65682af` Test not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130271#issuecomment-2218832643))	2024-07-09 22:24:39 +00:00
Zhengxu Chen	37d4d04309	[torchscript] Add logging for model id. (#130118 ) Summary: as title. Test Plan: CI Reviewed By: angelayi Differential Revision: D59348256 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130118 Approved by: https://github.com/BoyuanFeng	2024-07-09 22:24:16 +00:00
Riley Dulin	fb5cb17fbe	[torch][fx] Add normalize_args constructor argument to FxGraphDrawer (#130348 ) Summary: When writing out Graphviz files for graphs, sometimes the arguments are all in a row and it's unclear which is which. Like for `aten.conv2d`, someone might not remember the stride, padding, dilation order. Add an option `normalize_args` (defaults to False) to normalize all args into kwargs. This should help the readability of a graph. Differential Revision: D59529417 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130348 Approved by: https://github.com/mcremon-meta	2024-07-09 22:16:54 +00:00
Aaron Enye Shi	df83142131	[CCA][Memory Snapshot] Stop duplicating annotations to all device_traces (#130315 ) Summary: This diff fixes a bug, where all record_annotations will save a TraceEntry to each of the device_traces. Instead, we should only save annotations to the current device_trace that is being called by the thread calling the native allocator's recordAnnotation. Test Plan: CI and ran workloads on MVAI WPR FBR. Reviewed By: zdevito Differential Revision: D59477339 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130315 Approved by: https://github.com/zdevito	2024-07-09 21:38:47 +00:00
rzou	bb9a73f767	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-09 21:11:27 +00:00
rzou	c23d103afa	Add API for open registration between operators and subclasses (and modes) (#130064 ) We add torch.library.Library._register_torch_dispatch_rule. Here, a user can provide us a specific rule to run for a specific (torch_dispatch_class, operator) pair. The motivation is that a user might want to extend a subclass/mode but may not have access to the source code of the subclass/mode. I'll make this public in a follow-up PR if we think the approach and API is good. Keep in mind that many subclasses will likely deliver their own open registration solution (DTensor has register_sharding_prop_rule and NJT has register_jagged_op); _register_torch_dispatch_rule is meant as a catch-all open registration mechanism for when the subclass hasn't provided anything more specific. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064 Approved by: https://github.com/albanD	2024-07-09 21:11:27 +00:00
PyTorch MergeBot	9c9744c3ac	Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599 )" This reverts commit 940e4477ab0b81eea25051447cf5f599080c903f. Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/izaitsevfb due to breaking internal APS tests, see D59498864 ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2218724762))	2024-07-09 21:03:49 +00:00
Tristan Rice	f85bda8bdd	c10d/Handlers: expose running handlers from Python (#130149 ) This adds a `_run_handler` method that will invoke a specific handler. Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130149 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-07-09 20:20:59 +00:00
Tianyi Tao	1d93367cfa	Fix typo (#130305 ) Fixes #130241 that is a reopen pr of #130244, for possibly fixing the failed job Pull Request resolved: https://github.com/pytorch/pytorch/pull/130305 Approved by: https://github.com/Skylion007	2024-07-09 20:02:00 +00:00
Chen Lai	721a798886	add bits16 to graph dtype_abbrs (#130339 ) As title, patch the dtype in torch.fx.graph Pull Request resolved: https://github.com/pytorch/pytorch/pull/130339 Approved by: https://github.com/angelayi	2024-07-09 19:58:51 +00:00
Jerry Mannil	42f647219a	[ROCm] Add int4 support (#129710 ) - Add AMD support for int4 kernel - Only supports CDNA2 and CDNA3 gpus for now - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types - Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus - Fix torchscript issues due to hipify for `__nv_bfloat16` type - TorchScript has its own implementation for bfloat16 type - Implemented in `__nv_bloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h) - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify Fixes #124699 Fixes pytorch-labs/gpt-fast/issues/154 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710 Approved by: https://github.com/malfet	2024-07-09 19:49:12 +00:00
rzou	adb65682af	[HOP] Use user directed names for variables where possible (#130271 ) Afaict the previous check was too strict. Removing it passes all the mutation tests (mutation checks happen via the TensorVariable's mutable_local). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271 Approved by: https://github.com/Chillee, https://github.com/ydwu4 ghstack dependencies: #130255, #130268	2024-07-09 19:42:52 +00:00
cyy	a6345d3477	[CMake] [3/N] Remove unused code (#130322 ) Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322 Approved by: https://github.com/r-barnes	2024-07-09 19:33:33 +00:00
Tianyi Tao	3477ee38e4	fix the use of initial learning rate in the OneCycleLR example (#130306 ) Fixes #127649 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130306 Approved by: https://github.com/janeyx99	2024-07-09 18:58:07 +00:00
Peter Bell	3689471ea4	[inductor] Add FileCheck to flex attention epilogue test (#129343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343 Approved by: https://github.com/lezcano	2024-07-09 18:15:55 +00:00
Yifu Wang	c6cce976b2	Fix an issue where ENABLE_INTRA_NODE_COMM=1 + multiple process groups leads to failure (#130269 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130269 Approved by: https://github.com/Chillee	2024-07-09 17:42:09 +00:00
Yidi Wu	cb4bec311a	Fix nodes has more than one output users after replace_set_grad_with_hop pass (#129716 ) Summary: Previously, when we inlined the subgraphs that don't have a different require_grad environment, we didn't clean up the nodes' users in the subgraph and directly used them to replace the output of the call_modules. This recorded dead dependencies in node.users. This PR fixes this. Test Plan: Added a new test. Also see the torchrec tests: Step 1: buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 934687114 --output /tmp/934687114.zip --use-torchrec-eager-mp --use-manifold Step 2: buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true aimp/cli:cli -- --platform=aps --template=disagg_gpu_aps_pt2 --pt2 --model-entity-id=934687114 non-request-only-tagging torchrec-shard-and-quantize gpu-disagg-split assign-device materialize-weights script-and-save Differential Revision: D59132214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129716 Approved by: https://github.com/angelayi	2024-07-09 17:04:03 +00:00
Eddie Yan	e4c51d22c5	[cuDNN] Cleanup < 8.5 #ifdefs (#130283 ) We've said cuDNN 8.5 is the minimum supported version for a bit now Pull Request resolved: https://github.com/pytorch/pytorch/pull/130283 Approved by: https://github.com/Skylion007	2024-07-09 16:35:39 +00:00
Shangdi Yu	cab90b0049	[custom ops] disable kernel temporarily (#130190 ) Fixes #128621 Sometimes we want to disable the backend implementation for testing/benchmarking purposes. For example: ```python @custom_op("mylib::f", mutates_args=()) def f(x: Tensor) -> Tensor: return torch.zeros(1) print(f(torch.randn(1))) # tensor([0.]) @f.register_kernel("cpu") def _(x): return torch.ones(1) print(f(torch.randn(1))) # tensor([1.]) with f.set_kernel_enabled("cpu", enabled = False): print(f(0)) # tensor([0.]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130190 Approved by: https://github.com/williamwen42, https://github.com/zou3519	2024-07-09 16:13:50 +00:00
Richard Zou	edf273edf4	Revert some PRs (#130303 ) Summary: Revert https://github.com/pytorch/pytorch/pull/129346 thru https://github.com/pytorch/pytorch/pull/128893 For S430832 Test Plan: Tests Differential Revision: D59503843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303 Approved by: https://github.com/bdhirsh	2024-07-09 14:46:00 +00:00
cyy	71efbf701d	[3/N] Change #include <c10/util/Optional.h> to #include <optional> (#130300 ) Follows #130236 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130300 Approved by: https://github.com/ezyang	2024-07-09 13:32:57 +00:00
milesial	a5f816df18	Add more dtypes to __cuda_array_interface__ (#129621 ) `__cuda_array_interface__` was missing some unsigned integer dtypes as well as BF16. numba doesn't support BF16 so I skip tests for that one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129621 Approved by: https://github.com/lezcano	2024-07-09 10:47:19 +00:00
chilli	3e48d92733	Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250 Approved by: https://github.com/drisspg ghstack dependencies: #130160, #130106, #130224, #130227	2024-07-09 09:24:06 +00:00
eqy	86fb76e871	[SDPA] Clean up `print` in `test/test_transformers.py` (#130302 ) Left this in #125343, oops... Pull Request resolved: https://github.com/pytorch/pytorch/pull/130302 Approved by: https://github.com/awgu	2024-07-09 09:20:52 +00:00
Yichen Yan	953c6476bd	[CMAKE] Look for `Development.Module` instead of `Development` (#129669 ) Based on the [cmake issue](https://gitlab.kitware.com/cmake/cmake/-/issues/23716) and [manylinux issue](https://github.com/pypa/manylinux/issues/1347), when building a python module, it should find the `Development.Module` module, not `Development`, which includes `Development.Module` and `Development.Embed`, and will expect the shared python library only. After this PR and before #124613, pytorch could be built with a static libpython (e.g. in manylinux). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129669 Approved by: https://github.com/malfet	2024-07-09 09:16:43 +00:00
Valentin Andrei	b139b5090f	[pytorch] Name threads in thread pools for better debugging (#130270 ) Threads inside the thread pools are not named, so they inherit the main process name or the name of the first thread. In our case if we set `pt_main_thread` as the thread name when a thread does `import torch`, this name will be inherited by all the threads in the created pools. This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones and we need to follow-up on those. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270 Approved by: https://github.com/d4l3k, https://github.com/albanD	2024-07-09 08:03:47 +00:00
Yuanhao Ji	312652c325	[RFC] Add support for device extension autoloading (#127074 ) Fixes #122468 - Load device extensions at the end of `torch/__init__.py` - Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0` run test: ```python python test/run_test.py -i test_autoload_enable python test/run_test.py -i test_autoload_disable ``` doc: https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html co-author: @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding Co-authored-by: albanD <desmaison.alban@gmail.com> Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074 Approved by: https://github.com/albanD, https://github.com/jgong5	2024-07-09 06:14:13 +00:00
Aaron Enye Shi	6c4efd4e95	[Memory Snapshot][BE] Clean up record function callback scope (#130265 ) Summary: We can directly set the scope to at::RecordScope::USER_SCOPE for the at::RecordFunctionCallback object, rather than performing a check inside of the callback. Test Plan: Ran locally, works fine. https://www.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-aaronshi-20240704-1709-7a80b83b/0/rank-0_itrn-1503.Jul_04_17_24_02.3577.snapshot.pickle Differential Revision: D59477046 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130265 Approved by: https://github.com/davidberard98	2024-07-09 05:23:48 +00:00
Sam Larsen	ded469cfbd	[issue scrubbing] Fix imports in test_memory_planning.py to work with pytest (#130275 ) Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;) Test Plan: Ran both ways: * `python test/inductor/test_memory_planning.py` * `pytest test/inductor/test_memory_planning.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275 Approved by: https://github.com/zou3519	2024-07-09 05:20:56 +00:00
Xu Han	e235db98c9	[Inductor] Add aot_mode UT to new cpp_builder. (#130105 ) Changes: 1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT. 2. Switch AotCodeCompiler vec isa command gen to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-09 04:08:35 +00:00
Sheng Fu	31df1d235e	Support tensor stride (#129297 ) Summary: X-link: https://github.com/facebookresearch/param/pull/126 Support tensor stride for execution trace. Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda profiler.test_execution_trace.TestExecutionTrace Differential Revision: D58900476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129297 Approved by: https://github.com/sanrise, https://github.com/izaitsevfb	2024-07-09 03:55:46 +00:00
Edward Z. Yang	e836ee1955	Enhancements to recompiles logs (#130043 ) ---- - We now record on CacheEntry what the compile id that populated it was, so now we can say why a specific frame was rejected - Add structured log for recompiles under name artifact "recompile_reasons". As it stands, it's not terribly structured, but this was the easiest thing I could do to start - Slightly reformat multi-reason printing; since we only report one guard failure seems better to have it as a single line Example output: ``` V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] Recompiling function f in /data/users/ezyang/a/pytorch/b.py:3 V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] triggered by the following guard failure(s): V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] - 0/0: tensor 'L['x']' size mismatch at index 0. expected 4, actual 5 ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130043 Approved by: https://github.com/anijain2305	2024-07-09 03:40:56 +00:00
cyy	29861779ce	[2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236 ) Follows #128301. The changes were made by grep and sed Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236 Approved by: https://github.com/ezyang	2024-07-09 03:17:24 +00:00
rzou	d1e0653fad	[fx][easy] print_readable should recursively apply options (#130268 ) For example, print_readable(colored=True) should also print submodules with colors. Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/130268 Approved by: https://github.com/Chillee ghstack dependencies: #130255	2024-07-09 02:50:20 +00:00
rzou	f2c9f0c0db	[HOP] improve naming for subgraph inputs (#130255 ) Previously, subgraph input names were whatever the input proxies were, which were confusing. This PR changes those names to be whatever the names of the arguments the functions being speculate_subgraph'ed are. This is best-effort: if we can't figure it out then we go back to the previous strategy. Test Plan: - existing expecttests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130255 Approved by: https://github.com/ydwu4	2024-07-09 02:46:40 +00:00
Jane Xu	abe81d5d05	Fix the rest of foreach flakers (#130277 ) Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect. Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to https://github.com/pytorch/pytorch/pull/129003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277 Approved by: https://github.com/soulitzer	2024-07-09 02:08:21 +00:00
PyTorch MergeBot	d44c30e2f9	Revert "Add API for open registration between operators and subclasses (and modes) (#130064 )" This reverts commit 922d2737d5e0ad22ee1dcf91c48ab09d641de840. Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_profiler_tree is failing in trunk after this lands `922d2737d5`, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2216135497))	2024-07-09 01:48:38 +00:00
Catherine Lee	75fa10066d	Mark some test_decomp tests as slow on win (#130260 ) Auto slow test detection is marking and then un marking these as slow, so permanently mark them as slow on windows. These tests take >500s on windows. This is part of the reason why test_decomp keeps failing on windows (ex `da66e50e6e`) The other part is something to do with reruns + thresholds that I am still investigating Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-07-09 00:16:31 +00:00
Will Constable	7f08d3d9a0	[C10D] Fix corrupt log due to uint_8 printing as char (#130184 ) Previously, jobs would log lines like this due to interpreting an int8 value as a signed char when streaming out. "ProcessGroupNCCL created ncclComm_ 0x94960120 on CUDA device: ^@" We need a better solution for avoiding this systematically, but at least for now fix the spot we know about. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130184 Approved by: https://github.com/eeggl, https://github.com/Skylion007	2024-07-08 23:37:50 +00:00
Jerry Zhang	4c19623800	Change numeric_debug_handle to store per-node id (#129811 ) Summary: Previously we store edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack, but according to feedback from customers, people prefer the simpler per-node id, and they are fine with not having the additional support for numerical debugging for inputs and willing to hack around to achieve this. This PR changes the structure of numeric_debug_handle to store unique_id for each node instead. e.g. graph: ``` node = op(input_node, weight_node) ``` Before: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3} ``` After: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1 ``` Test Plan: python test/test_quantization.py -k TestGenerateNumericDebugHandle Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811 Approved by: https://github.com/tarun292	2024-07-08 23:36:19 +00:00
Will Constable	a28bb3268d	[Pipelining] Reorder _Action from F1_1 to 1F1 (#129786 ) Also steers away from accessing _Action via positional unpacking since that is error-prone Co-authored-by: Howard Huang <howardhuang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129786 Approved by: https://github.com/H-Huang	2024-07-08 23:07:51 +00:00
Huy Do	60d9f3f7d9	Set the epoch timestamp when uploading data to dynamoDB (#130273 ) This moves away from the `_event_time` field from Rockset, which we cannot use when reimporting the data Pull Request resolved: https://github.com/pytorch/pytorch/pull/130273 Approved by: https://github.com/clee2000	2024-07-08 22:58:32 +00:00
Yueming Hao	b4cc25f126	[custom_op]Fix self in mutation_args (#130179 ) Fixes #124933 ## Issue Summary If users define `self` as a mutated arg, an error occurs: `TypeError: AutoFunctionalized.__call__() got multiple values for argument 'self'`. For the following example, the schema for mutates_args is parsed as {"self": FakeTensor}. `6df963a2c8/torch/_higher_order_ops/auto_functionalize.py (L234)` In the above line, it is unwrapped as `self=FakeTensor` and leads to wrong argument passing because `self` is the default keyword for functions of a class, such as https://github.com/pytorch/pytorch/compare/main...findhao/fix-self-custom-ops#diff-9453b6b52a54783beec3dd1c60248620f61c3a524d404a188af17bbdf6be3d9eR292 . ```python import torch @torch.library.custom_op("mylib::foo", mutates_args={"self"}) def foo(self: torch.Tensor) -> None: self.sin_() x = torch.randn(3) @torch.compile(backend="inductor", fullgraph=True) def f(x): foo(x) f(x) ``` ## Fix This PR changes all related default argument `self` to `self_` following the existing way in `6fc771d19b/torch/_ops.py (L667)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130179 Approved by: https://github.com/zou3519	2024-07-08 22:55:50 +00:00
Andrey Talman	17ca0d0edf	Add linux manywheel python 3.13 binary workflows (#130030 ) Test with passing linux manywheel workflows is here: https://github.com/pytorch/pytorch/pull/121979 Builder PR already merged: https://github.com/pytorch/builder/pull/1910 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130030 Approved by: https://github.com/albanD	2024-07-08 22:50:15 +00:00
Joel Schlosser	00335a27b4	Accept min / max sequence length in nested_tensor_from_jagged() constructor (#130175 ) This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on. NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity). API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction: 1. Min / max seq len has already been obtained from somewhere (manual calculation, static values, etc.) and they should be used in the cache 2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually) 3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175 Approved by: https://github.com/davidberard98, https://github.com/soulitzer	2024-07-08 22:14:52 +00:00
rzou	922d2737d5	Add API for open registration between operators and subclasses (and modes) (#130064 ) We add torch.library.Library._register_torch_dispatch_rule. Here, a user can provide us a specific rule to run for a specific (torch_dispatch_class, operator) pair. The motivation is that a user might want to extend a subclass/mode but may not have access to the source code of the subclass/mode. I'll make this public in a follow-up PR if we think the approach and API is good. Keep in mind that many subclasses will likely deliver their own open registration solution (DTensor has register_sharding_prop_rule and NJT has register_jagged_op); _register_torch_dispatch_rule is meant as a catch-all open registration mechanism for when the subclass hasn't provided anything more specific. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064 Approved by: https://github.com/albanD	2024-07-08 22:13:05 +00:00
PyTorch MergeBot	44a773c121	Revert "[custom ops] infer schema (#130079 )" This reverts commit 3fe324ffb612c8712f6af7639c1e7bcec5f3b4fd. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/huydhn due to The test_public_bindings failure looks legit `3fe324ffb6` ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2215420957))	2024-07-08 22:02:29 +00:00
PyTorch MergeBot	f9bb258892	Revert "[Inductor] Add aot_mode UT to new cpp_builder. (#130105 )" This reverts commit 21eeedb4554edab22b42bcb2f75f19e85652b72e. Reverted https://github.com/pytorch/pytorch/pull/130105 on behalf of https://github.com/izaitsevfb due to Breaks 46 tests internally at meta with: OSError: CUDA_HOME environment variable is not set ([comment](https://github.com/pytorch/pytorch/pull/130105#issuecomment-2215392198))	2024-07-08 21:40:03 +00:00
PyTorch MergeBot	5e467604c3	Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127 )" This reverts commit dc5f37193f8d144d3de8525bf64eb1775d91e932. Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to Depends on #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2215355259))	2024-07-08 21:25:28 +00:00
PyTorch MergeBot	09d57f577b	Revert "[inductor] switch CppCodeCache to new cpp_builder. (#130132 )" This reverts commit 3957b3b34976896e0b13e1d09cf19e1da5b8292e. Reverted https://github.com/pytorch/pytorch/pull/130132 on behalf of https://github.com/izaitsevfb due to Depends on #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130132#issuecomment-2215352180))	2024-07-08 21:22:39 +00:00
Yang Chen	856fe230c7	[AOTI] better approach to generating runtime checks for symbolic dimensions (#130220 ) Previously, we only handled cases where the symbolic dimension is of Symbol. We should use bound_sympy which handles more general cases for us. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130220 Approved by: https://github.com/aakhundov	2024-07-08 20:46:38 +00:00
Shangdi Yu	3fe324ffb6	[custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-08 20:46:23 +00:00
PyTorch MergeBot	1e61cb8c87	Revert "[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185 )" This reverts commit b428f1ad77aedfd150e920c8b0d23b7e6393ad6f. Reverted https://github.com/pytorch/pytorch/pull/129185 on behalf of https://github.com/huydhn due to dr ci categorization is wrong, the test_linalg xsuccess is real, theres also a test_jit failure https://github.com/pytorch/pytorch/actions/runs/9844339391/job/27178009798 `b428f1ad77` ([comment](https://github.com/pytorch/pytorch/pull/129185#issuecomment-2215230345))	2024-07-08 20:37:07 +00:00
Anshul Sinha	f059201e0d	[dtensor][debug] added deviceMesh for relevant operations and module parameter sharding and module fqn (#130072 ) Summary In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, and module parameter sharding and FQN. These changes have only been placed in operation tracing log. In the future, I plan to just have one logging function with an argument to show how detailed a user wants the log to be, and will get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file as the module_depth dictionary has been replaced with module_helper_dict and have edited the example output for the MLP operation tracing which can be seen below: Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing 5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/130072 Approved by: https://github.com/XilunWu ghstack dependencies: #129994	2024-07-08 20:12:52 +00:00
atalman	3e53cae0fc	Release 2.4 matrix update. Future releases dates (#130267 ) Added Release Compatibility Matrix for release 2.4 Updated future release dates for 2.6-2.9 Updated possible patch release date for 2.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130267 Approved by: https://github.com/malfet, https://github.com/albanD	2024-07-08 20:09:17 +00:00
Xia, Weiwen	36e2608783	[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667 ) Description Add fusion path for dynamic quant and for QAT. The following patterns can be matched for static quant with QAT cases: `qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant` The following patterns can be matched for dynamic quant cases: `qx -> qlinear -> add -> optional relu` Test plan python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear python test/test_quantization.py -k test_linear_unary python test/test_quantization.py -k test_linear_binary Differential Revision: [D57655830](https://our.internmc.facebook.com/intern/diff/D57655830) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel	2024-07-08 20:04:39 +00:00
Tristan Rice	a8985a97f9	elastic/store: use wait instead of get for barrier (#130148 ) Summary: We call `.get` in the elastic store barrier operation but we don't need the result. This switches it to use `.wait` instead which eliminates one network round trip as `get` internally does a wait first. Test Plan: CI + existing tests -- no behavior change Differential Revision: D59396199 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148 Approved by: https://github.com/kurman, https://github.com/wconstab	2024-07-08 19:53:42 +00:00
Jeeja	22c809aa73	[FSDP] Runtime Error on Checkpoint Loading for optimizer state (#129110 ) When loading an optimizer checkpoint, tensors are created on CUDA even when other backends are used. This is because, by default, a torch.device() constructed via a single device ordinal is treated as a CUDA device. In _alloc_tensor, empty tensors are created using device = cast(torch.device, _get_device_module(device_type).current_device()). This returns only the index, so the empty tensor is created on CUDA by default. So, change it to use torch.device(device_type,device_module(device_type).current_device()) to get the device with the index. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110 Approved by: https://github.com/fegin	2024-07-08 18:52:13 +00:00
James Wu	9158bb7837	Ignore functional tensor wrapper when caching (#128335 ) This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work. To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335 Approved by: https://github.com/bdhirsh	2024-07-08 18:39:20 +00:00
Michael Lazos	6dc64026cb	Restrict fusions in foreach if there are dependencies on multiple subkernels (#130046 ) In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies; `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373` because in the usual fused case, this is completely fine if all dependencies are met in the upstream fused buffer. However in the foreach case and this case specifically it is possible for foreach ops to be partitioned if there are many arguments in order to stay under CUDA driver arg limits. As a result, this large foreach op was split into two, and the latter had `buf394` in its node schedule for allocation, while the earlier split did not, even though `buf486_buf526` uses the `buf394`, as a result we would hit the unbound local error. @eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092) To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning. Co-authored-by: David Berard <dberard@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046 Approved by: https://github.com/eellison	2024-07-08 18:25:16 +00:00
chilli	64139987c0	Add block mask utility support for batches and heads > 1 (#130227 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227 Approved by: https://github.com/yanboliang ghstack dependencies: #130160, #130106, #130224	2024-07-08 18:15:35 +00:00
chilli	cd683212a2	Fix indexing twice with score_mod (#130224 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130224 Approved by: https://github.com/yanboliang ghstack dependencies: #130160, #130106	2024-07-08 18:15:35 +00:00
Jithun Nair	e16276b9bf	[ROCm] Check supported archs before setting preferred blas backend to hipblasLT (#128753 ) This PR is needed to resolve usability issues with PyTorch ROCm nightly wheels on non-gfx90a/gfx94x architectures as a result of https://github.com/pytorch/pytorch/pull/127944. Addresses https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992 ### With this PR's changes, I get the following on a gfx908 (unsupported by hipblasLT) architecture: _Using setter function:_ ``` >>> torch.backends.cuda.preferred_blas_library(backend="cublaslt") [W617 19:58:58.286088851 Context.cpp:280] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator()) [W617 19:59:02.125161985 Context.cpp:291] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator()) <_BlasBackend.Cublas: 0> ``` _Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_ ``` root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_CUBLASLT=1 python >>> import torch >>> torch.backends.cuda.preferred_blas_library() [W619 06:14:11.627715807 Context.cpp:274] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator()) <_BlasBackend.Cublas: 0> ``` ### and the following on a gfx90a (supported by hipblasLT) architecture: _Using setter function:_ ``` >>> import torch >>> torch.backends.cuda.preferred_blas_library() <_BlasBackend.Cublaslt: 1> >>> torch.backends.cuda.preferred_blas_library(backend="cublas") <_BlasBackend.Cublas: 0> >>> torch.backends.cuda.preferred_blas_library(backend="cublaslt") [W620 18:38:29.404265518 Context.cpp:293] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator()) <_BlasBackend.Cublaslt: 1> ``` _Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_ ``` root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_HIPBLASLT=1 python >>> import torch >>> torch.backends.cuda.preferred_blas_library() <_BlasBackend.Cublaslt: 1> ``` (Same result for _Using `TORCH_BLAS_PREFER_CUBLASLT` env var:_) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128753 Approved by: https://github.com/malfet	2024-07-08 17:43:41 +00:00
William Wen	b428f1ad77	[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185 ) Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython. This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame. We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12. This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185 Approved by: https://github.com/jansel	2024-07-08 17:39:05 +00:00
Jason Ansel	d325aaef39	[halide-backend] Use get_reduction_combine_fn for reduction ops (#130212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130212 Approved by: https://github.com/eellison	2024-07-08 17:23:32 +00:00
Anshul Sinha	a18568f293	[dtensor][debug] Added functionality to convert log into a json file (#129994 ) Summary Currently, users have 2 options to view the tracing data. The first is through console where colored text is used to help users read the information. The second is they can log the information to a text file to view the log, which is useful in instances where the log is too long to fit in the console. However, depending on the model complexity, these logs could go on for thousands of lines making it difficult for the user to find specific information. In order to fix this, I have added the functionality to convert the log into a JSON file, which will be used to create a tree view in a browser, allowing the user to collapse parts of the log that will not be useful to them. I have given the user the option to pass their own file path, but have a default one in the event that none is provided. The expected output of the beginning json file and the browser view for the MLP model are shown below: <img width="542" alt="Screenshot 2024-07-02 at 3 40 41 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b9570540-e1d2-4777-b643-db4801b60ed8"> <img width="777" alt="Screenshot 2024-07-02 at 3 41 43 PM" src="https://github.com/pytorch/pytorch/assets/50644008/9296e255-c3ae-48a4-8be7-4273f69ee178"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump Pull Request resolved: https://github.com/pytorch/pytorch/pull/129994 Approved by: https://github.com/XilunWu	2024-07-08 17:15:34 +00:00
Abhinav Podili	61017eb77b	Add missing mapping between DLDevice and ATenDevice for MAIA (#129615 ) This PR adds the missing mapping between `DLDevice` and `ATenDevice` for the MAIA device. These changes are necessary for `dlpack` support for `maia` tensors. [MAIA is added to the DldeviceType enum in the dlpack repo](`bbd2f4d324/include/dlpack/dlpack.h (L120)`) already. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129615 Approved by: https://github.com/albanD	2024-07-08 17:08:39 +00:00
Edan Tessel Sneh	63743b223c	[AO] catch qparam mismatch for cat (#123769 ) Summary: Use &= instead of |=, since |= ignores an incorrect scale/zp. Change scale to use float comparison instead of int comparison. Issue a warning instead of an error for backward compatibility: ex: P1204628034 Test Plan: see warning in: P1204628034 Reviewed By: jerryzh168 Differential Revision: D55699212 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123769 Approved by: https://github.com/jerryzh168	2024-07-08 16:47:14 +00:00
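The reasoning in the commit above (#123769) can be illustrated with a minimal sketch. This is not the code from the PR: the function names and the `(scale, zero_point)` tuple layout are hypothetical, chosen only to show why OR-accumulation hides a mismatch and why an int comparison of scales is too coarse.

```python
import math

def qparams_match_and(pairs, rel_tol=1e-6):
    """Return True only if every (scale, zero_point) pair matches the first."""
    ref_scale, ref_zp = pairs[0]
    ok = True
    for scale, zp in pairs[1:]:
        # AND-accumulate: one mismatch flips ok to False and it stays False.
        ok &= math.isclose(scale, ref_scale, rel_tol=rel_tol) and zp == ref_zp
    return ok

def qparams_match_or(pairs, rel_tol=1e-6):
    """Buggy variant: OR-accumulation reports success if any one pair matches."""
    ref_scale, ref_zp = pairs[0]
    ok = False
    for scale, zp in pairs[1:]:
        ok |= math.isclose(scale, ref_scale, rel_tol=rel_tol) and zp == ref_zp
    return ok

# Hypothetical inputs to a quantized cat: the third scale differs.
inputs = [(0.1, 0), (0.1, 0), (0.2, 0)]
print(qparams_match_and(inputs))  # mismatch is caught
print(qparams_match_or(inputs))   # mismatch is silently ignored
```

Truncating scales to integers would also mask the difference in this example, since `int(0.1)` and `int(0.2)` are both `0`, which is why a float comparison is needed.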
Catherine Lee	f4774d64bf	Skip test_profile_memory on windows (#130037 ) The test was introduced in https://github.com/pytorch/pytorch/pull/128743 It is failing on windows cuda `a9a744e442/1` (it is skipped on cpu jobs) After talking with the author and Aaron, I have been advised to skip it on windows, as windows support for kineto is not a high priority Pull Request resolved: https://github.com/pytorch/pytorch/pull/130037 Approved by: https://github.com/huydhn, https://github.com/aaronenyeshi	2024-07-08 16:11:51 +00:00
PyTorch MergeBot	d7b7f8b79f	Revert "[ROCm] Add int4 support (#129710 )" This reverts commit d0ad13fa42fc2e9935bd3bda2937a3491276d274. Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))	2024-07-08 16:07:53 +00:00
Joel Schlosser	c8ab2e8b63	Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238 ) This PR: * Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed). * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`: before (no seed setting): ``` real 0m21.306s user 0m19.053s sys 0m5.192s ``` after (with seed setting): ``` real 0m21.905s user 0m19.578s sys 0m5.390s ``` * Uncovered a bunch of test issues: * Test breakdown (>100 total) * A lot of tolerance issues (tweaked tolerance values to fix) * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype) * 3 actually broken semantics (for masked tensor; added xfails) * 4 Jacobian mismatches (added xfails) * 2 nan results (skip for now, need fixing) * 3 results too far from reference result (add xfails) * Skips MPS tests for now (there are so many failures!). Those will default to the old behavior. * Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command. 
``` ====================================================================== ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper return test(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar self.assertFalse(True) AssertionError: True is not false The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper method(*args, **kwargs) File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper method(*args, **kwargs) File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test result = test(self, **param_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper fn(*args, **kwargs) File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper raise new_e from e Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='') To execute this test, run the following from the base repo dir: PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- Ran 1 test in 0.037s FAILED (errors=1) 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238 Approved by: https://github.com/janeyx99, https://github.com/justinchuby	2024-07-08 16:06:38 +00:00
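The idea behind the commit above (#128238) can be sketched with the standard library alone. This is an illustration, not the `TrackedInputIter` implementation: `seeded_samples`, `make_sample`, and the `only_index` parameter (standing in for the `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` env var) are hypothetical names, and mixing the sample index into the seed is an assumption made here so that the toy samples differ from one another.

```python
import random

def seeded_samples(make_sample, count, test_name, only_index=None):
    """Yield (index, sample) pairs, reseeding the RNG before each sample.

    Because the seed depends only on the test name and the index, sample i
    is the same whether or not earlier samples were generated.
    """
    for index in range(count):
        if only_index is not None and index != only_index:
            continue  # restrict iteration to a single sample input
        # Assumption: string seeds are used for simplicity; random.seed()
        # hashes them deterministically across runs.
        random.seed(f"{test_name}:{index}")
        yield index, make_sample()

def make_sample():
    # Stand-in for an OpInfo sample-input generator.
    return [random.random() for _ in range(3)]

full_run = dict(seeded_samples(make_sample, 5, "test_bar_add"))
repro_run = dict(seeded_samples(make_sample, 5, "test_bar_add", only_index=3))
print(repro_run[3] == full_run[3])
```

Reseeding per sample is what makes the single-index repro meaningful: without it, the values drawn for sample 3 would depend on how much randomness samples 0 through 2 consumed.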
Feny Patel	acf9e31cf8	adding MTIA to supported activities (#130052 ) Summary: Put the hasMTIA block in the if condition as well to let MTIA activities be added to supported activities Test Plan: Tested with auto-trace Differential Revision: D59280848 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130052 Approved by: https://github.com/aaronenyeshi	2024-07-08 15:20:05 +00:00
Alnis Murtovi	16d53cb7d5	Only run mixed_mm heuristic if shapes are static (#130081 ) If we have dynamic shapes, the heuristic in mixed_mm will cause a crash, because it cannot compare m, k and n to integer values. This PR makes it so that the heuristic only runs if we have static shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130081 Approved by: https://github.com/Chillee	2024-07-08 14:20:55 +00:00
Simon Fan	010009e642	[compiled autograd] c++ autograd function saved_data: lift tensors (#130057 ) Avoid recompiles when custom C++ autograd functions use ctx->saved_data to save tensors. iv.toTensor can return a reference for `after(iv.toTensor())`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130057 Approved by: https://github.com/jansel	2024-07-08 07:42:07 +00:00
cyy	f4dcf2ae93	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-07-08 07:03:53 +00:00
Animesh Jain	f053be2a97	[dynamo] Graph break on random_ op (#130222 ) Fixes https://github.com/pytorch/pytorch/issues/121621 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130222 Approved by: https://github.com/jansel	2024-07-08 06:10:24 +00:00
Sijia Chen	31bb65de19	[Inductor] Fix conditional codegen (#129492 ) Summary: We have the cache to guarantee the `sym` is codegen only once, see the following code ``` def ensure_size_computed(self, sym: sympy.Symbol): if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE): if sym in self.computed_sizes: return self.computed_sizes.add(sym) expr = V.graph.sizevars.inv_precomputed_replacements[sym] self.writeline( f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}" ) ``` However, we don't consider the case when same `sym`s need to be codegen in both conditions (true branch and false branch), which caused the issue of `undefined symbols`: P1441378833 To fix the issue, we use a stack to capture the state before doing the condition codegen and restore the state after doing the codegen Test Plan: TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size Differential Revision: D58973730 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492 Approved by: https://github.com/aakhundov	2024-07-08 05:33:47 +00:00
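The save/restore approach described in the commit above (#129492) can be sketched as follows. This is a simplified stand-in, not Inductor's wrapper codegen: the `SizeCodegen` class, its `push_state`/`pop_state` helpers, and the emitted line format are hypothetical.

```python
class SizeCodegen:
    """Toy code generator that declares each precomputed size at most once."""

    def __init__(self):
        self.computed_sizes = set()
        self.lines = []
        self._saved_states = []  # stack of snapshots of computed_sizes

    def ensure_size_computed(self, sym, expr):
        if sym in self.computed_sizes:
            return  # already declared in the current scope
        self.computed_sizes.add(sym)
        self.lines.append(f"    {sym} = {expr}")

    def push_state(self):
        # Snapshot the cache before entering a branch.
        self._saved_states.append(set(self.computed_sizes))

    def pop_state(self):
        # Forget symbols declared inside the branch once it is closed.
        self.computed_sizes = self._saved_states.pop()

    def codegen_cond(self, true_syms, false_syms):
        for header, syms in (("if pred:", true_syms), ("else:", false_syms)):
            self.lines.append(header)
            self.push_state()
            for sym, expr in syms:
                self.ensure_size_computed(sym, expr)
            self.pop_state()

gen = SizeCodegen()
gen.codegen_cond([("ps0", "s0 * 2")], [("ps0", "s0 * 2")])
print("\n".join(gen.lines))
```

Without the snapshot, the declaration of `ps0` would be emitted only in the true branch, and the false branch would refer to an undefined symbol.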
Animesh Jain	c5c9dbece1	[dynamo][user-defined] Simplify and improve scope of UserDefinedObject var_getattr (#130169 ) Fixes https://github.com/pytorch/pytorch/issues/122649 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130169 Approved by: https://github.com/jansel ghstack dependencies: #118448, #130159	2024-07-08 04:10:56 +00:00
Jerry Mannil	d0ad13fa42	[ROCm] Add int4 support (#129710 ) Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction. Only supports CDNA2 and CDNA3 gpus for now. Fixes #124699 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710 Approved by: https://github.com/malfet	2024-07-07 23:54:22 +00:00
Animesh Jain	d1b832e739	[inductor][mkl][inline-inbuilt-nn-modules] Change assertion (#130219 ) Fixes the test in the next PR - `python test/inductor/test_mkldnn_pattern_matcher.py -k TestDynamicPatternMatcher.test_conv3d_unary_dynamic_shapes` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130219 Approved by: https://github.com/leslie-fang-intel	2024-07-07 21:32:07 +00:00
Pian Pawakapan	940e4477ab	[runtime asserts] deduplicate runtime asserts & CSE (#128599 ) This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example: ``` z = torch.cat([x, x], dim=0) # 2*s0 w = z.repeat(y.shape[0]) # 2*s0*s1 _w = w.shape[0] # something with _w ... # turns into -> s0 = x.shape[0] s1 = y.shape[0] _w0 = 2 * s0 _w = _w0 * s1 ``` Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example: ``` torch.sym_constrain_range_for_size(n, min=2, max=16) torch.sym_constrain_range(n, min=4, max=20) torch._check(n >= 0) torch._check(n >= 3) torch._check(n <= 14) # turns into torch.sym_constrain_range_for_size(n) torch._check(n >= 4) torch._check(n <= 14) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599 Approved by: https://github.com/ezyang	2024-07-07 20:10:14 +00:00
Simon Mahns	0c44684901	[Typo] Fix typo in DispatchKeyExtractor.h (#130221 ) Summary: typo_helper Test Plan: ci Differential Revision: D59424671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130221 Approved by: https://github.com/Skylion007	2024-07-07 19:43:31 +00:00
PyTorch MergeBot	e423224546	Revert "[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967 )" This reverts commit 98929ceae3873f18f4747b88cdff708fde107aa7. Reverted https://github.com/pytorch/pytorch/pull/126967 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/126967#issuecomment-2212337926))	2024-07-07 06:16:32 +00:00
PyTorch MergeBot	1b57dce35f	Revert "[Inductor][CPP] Support more than one LocalBuffer (#129121 )" This reverts commit f794cf59bd0891ff4a4337e0d919ee68ba1f0472. Reverted https://github.com/pytorch/pytorch/pull/129121 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/129121#issuecomment-2212337590))	2024-07-07 06:13:40 +00:00
leslie-fang-intel	f794cf59bd	[Inductor][CPP] Support more than one LocalBuffer (#129121 ) Summary Support more than 1 Local Buffer in an outer loop fused node and also the case when multi global buffers sharing usage of same local buffer. TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion ``` Next Step - [✓] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126967	2024-07-07 05:43:08 +00:00
leslie-fang-intel	98929ceae3	[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967 ) Summary Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)`). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](`9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)`). In the current implementation, the Inductor uses the output buffer of Kernel Group Args to store and load temporary result (such as `exp`), since this buffer is corresponding to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (take this `exp` as example) is only utilized internally within decomposed `softmax`, this buffer can be replaced with a thread-local buffer similar to ATen's approach. In this PR, we have introduced the optimizations of `LocalBuffer`. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP Backend and the ATen implementation. 
TestPlan ``` python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion ``` Next Step - [ ] Support more than one Local Buffer/Global Buffer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-07-07 05:34:57 +00:00
Xuehai Pan	a3ce9eddd6	[BE][Easy] apply autofix for ruff rule unnecessary-literal-set (C405) and unnecessary-map (C417) (#130198 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130198 Approved by: https://github.com/Skylion007	2024-07-07 00:58:22 +00:00
peaceorwell	9983242c8e	[inductor] support adding a new inductor backend using PrivateUse1 (#129953 ) Add handling custom device registered by PrivateUse1 in init_backend_registration() func Fixes #129952 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129953 Approved by: https://github.com/jansel	2024-07-06 21:15:40 +00:00
Shuo Ding	3d138af943	[Inductor] First implementation of the B2B-GEMM pass with tests (#129995 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129995 Approved by: https://github.com/eellison	2024-07-06 19:10:22 +00:00
Xu Han	3957b3b349	[inductor] switch CppCodeCache to new cpp_builder. (#130132 ) Changes: 1. switch CppCodeCache to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-06 18:57:44 +00:00
Xu Han	dc5f37193f	[inductor] switch AotCodeCompiler to new cpp_builder (#130127 ) Changes: 1. Switch `AotCodeCompiler` to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-06 18:44:13 +00:00
cyy	dfe3534134	[1/N] Fix NVCC warnings (#130191 ) Fixes NVCC warnings, as the required steps to enable Werror on CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130191 Approved by: https://github.com/Skylion007	2024-07-06 18:25:04 +00:00
Xuehai Pan	3f50e197c4	[BE] annotate `torch.autograd.graph` (#129558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129558 Approved by: https://github.com/soulitzer	2024-07-06 18:14:16 +00:00
Xu Han	01ec03bac6	[inductor] switch HalideCodeCache to new cpp_builder. (#130146 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130146 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-06 17:35:17 +00:00
cyy	2f219f7d79	Enforce unused-{variable/function} checks to all torch targets (#130189 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130189 Approved by: https://github.com/ezyang	2024-07-06 16:03:01 +00:00
cyy	096eca2f9a	[2/N] Replace exceptions with static_assert(false) in some templates (#130116 ) Follows #127371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130116 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-06 13:23:05 +00:00
Nikita Shulga	520a4642bf	[CI] Enable build with asserts (#129924 ) Not a standard CMake config, as far as I can tell, but it introduces an important concept of optimized build without `NDEBUG`. Test by running `python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)"`, which is a no-op unless debug_assert_fail is enabled. Add recently added `_unsafe_masked_index`/`_unsafe_masked_index_put_accumulate` to DONT_ENFORCE_SAME_TENSOR_IMPL_OR_STORAGE to keep all tests involving those ops from failing with an internal assert. Suppress a number of internal asserts to make CI green, see https://github.com/pytorch/pytorch/issues/130073 Fixes https://github.com/pytorch/pytorch/issues/102105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129924 Approved by: https://github.com/atalman, https://github.com/albanD	2024-07-06 13:14:32 +00:00
chilli	da66e50e6e	Added compile option to create_block_mask (#130106 ) Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 trillion element tensor if fully materialized. ``` print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True))) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106 Approved by: https://github.com/yanboliang ghstack dependencies: #130160	2024-07-06 08:09:56 +00:00
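The "1 trillion element" figure in the commit above (#130106) can be checked with a little arithmetic. The sketch assumes the two sequence-length arguments in the benchmark call are each 2**20, which is an interpretation that matches the stated size, and it assumes one byte per mask element purely to give a sense of scale.

```python
# Assumed query and key/value lengths for the benchmarked mask.
q_len = 2**20
kv_len = 2**20

# A fully materialized dense mask has one entry per (query, key) pair.
elements = q_len * kv_len

# Assumption: one byte per element (for example a boolean mask).
size_tib = elements / 2**40

print(f"{elements:,} elements, about {size_tib:.0f} TiB at 1 byte each")
```

A dense tensor of that size would not fit in device memory, which is why compiling the mask construction into a block-level representation matters.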
PyTorch MergeBot	963f430d13	Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599 )" This reverts commit 0267b2ddcb58aa66b2b62336216da7df4f9939d8. Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a landrace and fails inductor/test_cudagraph_trees in trunk `0267b2ddcb` ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2211690518))	2024-07-06 07:20:05 +00:00
Aaron Enye Shi	aa4899eee9	[CCA][Memory Snapshot] Fix race on alloc_trace vector - S430480 (#130180 ) Summary: Multiple threads can be calling the alloc_trace std::vector, which will result in SIGSEGVs when objects are double freed, accessed after free, or two inserts at the same time. We need to lock when inserting, accessing or removing TraceEntry in alloc_trace. Test Plan: This is a rare crash, which was exposed when we introduced recordAnnotations, which saves record_function annotations into the snapshot files. Saving a lot of annotations can trigger this bug. Here are a few jobs that crashed before, and this diff fixes. Differential Revision: D59380507 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/130180 Approved by: https://github.com/eqy, https://github.com/kit1980	2024-07-06 06:14:54 +00:00
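The locking discipline described in the commit above (#130180) can be illustrated in Python. The real fix guards a C++ `std::vector`; this sketch only shows the pattern of taking one lock around every insert, read, and removal, and the `AllocTrace` class and its method names are hypothetical.

```python
import threading

class AllocTrace:
    """Shared trace buffer whose every access goes through one lock."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def record(self, entry):
        with self._lock:  # guard inserts
            self._entries.append(entry)

    def snapshot(self):
        with self._lock:  # guard reads; return a copy, not the live list
            return list(self._entries)

    def clear(self):
        with self._lock:  # guard removals
            self._entries.clear()

trace = AllocTrace()

def worker(thread_id):
    for i in range(1000):
        trace.record((thread_id, i))

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(trace.snapshot()))
```

Returning a copy from `snapshot` keeps readers from iterating over a container that another thread is mutating, which is the same class of bug as the use-after-free the commit describes.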
PyTorch MergeBot	e019540c9e	Revert "Fix the SDPA AOT export issue (#130164 )" This reverts commit 1927c406844affbfe3496d5cbc31d4ebe11c8bfb. Reverted https://github.com/pytorch/pytorch/pull/130164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking ExecuTorch tests in trunk `1927c40684` ([comment](https://github.com/pytorch/pytorch/pull/130164#issuecomment-2211667777))	2024-07-06 05:59:49 +00:00
chilli	bf609630ae	Fix a bunch of stride issues with FlexAttention (#130160 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130160 Approved by: https://github.com/yanboliang	2024-07-06 03:58:14 +00:00
Edward Z. Yang	10c831567b	Make sympify'ing SymInt/etc produce their sympy expression (#130166 ) There is one huge problem this fixes: today, sympify(symint) produces a float(!!) because Sympy attempts to see if you can coerce the symint to float in sympify and of course this works on SymInt. However, this also has another nontrivial effect: anywhere in Inductor where sympy expressions are passed around, it is also valid to pass around a SymInt now. I'm ambivalent about this: it's currently a mistake to be passing around a SymInt when a sympy expression is expected. But maybe this is fine? Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166 Approved by: https://github.com/yf225	2024-07-06 03:56:45 +00:00
Jason Ansel	acd03ca2d9	[halide-backend] Support scan kernels (#129035 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129035 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #130129	2024-07-06 03:49:50 +00:00
Jason Ansel	c5110f6388	[halide-backend] Use 0D scalar inputs/outputs (#130129 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130129 Approved by: https://github.com/shunting314	2024-07-06 03:49:50 +00:00
Pian Pawakapan	0267b2ddcb	[runtime asserts] deduplicate runtime asserts & CSE (#128599 ) This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example: ``` z = torch.cat([x, x], dim=0) # 2*s0 w = z.repeat(y.shape[0]) # 2*s0*s1 _w = w.shape[0] # something with _w ... # turns into -> s0 = x.shape[0] s1 = y.shape[0] _w0 = 2 * s0 _w = _w0 * s1 ``` Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example: ``` torch.sym_constrain_range_for_size(n, min=2, max=16) torch.sym_constrain_range(n, min=4, max=20) torch._check(n >= 0) torch._check(n >= 3) torch._check(n <= 14) # turns into torch.sym_constrain_range_for_size(n) torch._check(n >= 4) torch._check(n <= 14) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599 Approved by: https://github.com/ezyang	2024-07-06 03:44:49 +00:00
PyTorch UpdateBot	7c43f59a45	[audio hash update] update the pinned audio hash (#129429 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429 Approved by: https://github.com/pytorchbot	2024-07-06 03:34:12 +00:00
Animesh Jain	bd0252fb98	[dynamo][user-defined] Support method descriptors (#130159 ) Fixes https://github.com/pytorch/pytorch/issues/120650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130159 Approved by: https://github.com/jansel ghstack dependencies: #118448	2024-07-06 02:03:09 +00:00
Daulet Askarov	a1a2023eb8	Back out "Pass device to is_pinned call inside TensorProperties.create_from_tensor" (#129972 ) Summary: It turns out that the device passed as a param to is_pinned is meant to be the accelerator device with respect to which pinning is expected. Passing 'cpu' always makes the return value false, regardless of whether the actual tensor is a cpu tensor pinned to Cuda. Besides, there is a PR https://github.com/pytorch/pytorch/pull/126376 about to be merged which automatically uses the correct accelerator device, which obviates the need for users to pass any kind of explicit device and doesn't create a Cuda context for pure cpu tensors. Note, https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0 test is expected to be broken by this diff, but it should be fixed forward by https://github.com/pytorch/pytorch/pull/126376 Test Plan: Sandcastle. Differential Revision: D59283190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129972 Approved by: https://github.com/LucasLLC	2024-07-06 01:07:32 +00:00
Sijia Chen	1927c40684	Fix the SDPA AOT export issue (#130164 ) Summary: ## Context TL;DR: aot_export failed for SDPA memory efficient backend when using `inference_mode` The CMF AOTI lowering started to fail on the trunk. We have the script (https://fburl.com/code/kfk64i5s) to reproduce the issue quickly (log: P1469307638). By bisecting the stack, we found the issue starting from D58701607 ## Root Cause In `inference_mode()`, `aten::scaled_dot_product_attention` was not decomposed before `functionalization`, and the op itself is an out-of-place op, so `functionalization` made no change. The op was then decomposed into `masked_fill_`, which was in turn decomposed to `copy_`. So it's `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (decompose) ---> `copy_` ---> failure In `torch.no_grad()`, `aten::sdpa` was decomposed before `functionalization`, so the story is `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` --- (decompose) ---> `out-place ops` ---> good ## How to fix Long-term: The issue is tracked in the ticket (https://github.com/pytorch/pytorch/issues/129418). The long-term fix could be to do one more round of `functionalization` after the `decompose`, like `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` ---> good Short-term: That would be a big change, I guess. To unblock the production use-case, I marked `aten::sdpa` to be decomposed in this diff Test Plan: local repro works now buck run mode/opt scripts/sijiac/prototypes:sdpa_aoti Differential Revision: D59385876 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130164 Approved by: https://github.com/zou3519	2024-07-06 00:57:47 +00:00
Shunting Zhang	c5ede865c4	[pt2-bench] raise tolerance for squeezenet1_1 (#130165 ) The training accuracy for this model starts to regress. It does not show up on the weekly run yet but 1. it shows up in my MA runs [here](https://hud.pytorch.org/benchmark/torchbench/inductor_max_autotune?dashboard=torchinductor&startTime=Fri,%2028%20Jun%202024%2006:53:45%20GMT&stopTime=Fri,%2005%20Jul%202024%2006:53:45%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/shunting314/162/head&lCommit=cb236e8c198b54901e4fb19698f91be786f72e25&rBranch=main&rCommit=4ee1cb9b955fcc5d75a421b19393998122136f2c) 2. I can repro it locally Command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend inductor --device cuda --only squeezenet1_1 ``` Raise the tolerance to fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130165 Approved by: https://github.com/jansel ghstack dependencies: #129996, #129941, #130005, #130163	2024-07-06 00:49:15 +00:00
Shunting Zhang	0fcbca9adb	[pt2-bench] use eval mode for vision_maskrcnn (#130163 ) Try to fix https://github.com/pytorch/pytorch/issues/130161 The reason that `--accuracy` works is that we use eval mode, while `--training` does not work since we use training mode but TorchBench does not return targets tensors. In training mode, vision_maskrcnn requires targets tensors. I fix that to always use eval mode for vision_maskrcnn training. With the fix, I start to see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f I'm not sure if that's due to my local setup, but I think the fix in this PR is something we need anyway. We can check the dashboard after the PR is in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130163 Approved by: https://github.com/jansel ghstack dependencies: #129996, #129941, #130005	2024-07-06 00:49:15 +00:00
cyy	e5841bb8d5	[3/N] Enforce unused-function and unused-variable checks (#130084 ) Follows #129878. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130084 Approved by: https://github.com/ezyang	2024-07-05 23:56:00 +00:00
Shuqiang Zhang	126796d239	[c10d] fixing an UT after a change in eager mode new group (#130167 ) Summary: after https://github.com/pytorch/pytorch/pull/129284, new_group is eager now if device_id is specified, one UT was broken This PR fixes it. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130167 Approved by: https://github.com/wconstab	2024-07-05 23:18:30 +00:00
Xuehai Pan	d1d0a7080f	[torchgen] reference generated comment to actual location of the generator and template (#130020 ) As per title. ```diff # torch/_VF.pyi - # @generated from torch/_C/_VariableFunctions.pyi.in + # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in ``` ```diff # torch/return_types.pyi - # @generated from torch/_C/return_types.pyi + # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in ``` ```diff # torch/_C/__init__.pyi - # @generated from torch/_C/__init__.pyi.in + # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in ``` ```diff # torch/_C/_nn.pyi + # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in ``` ```diff # torch/_C/_VariableFunctions.pyi - # @generated from torch/_C/_VariableFunctions.pyi.in + # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in ``` ```diff # torch/nn/functional.pyi + # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020 Approved by: https://github.com/ezyang	2024-07-05 21:47:14 +00:00
PyTorch MergeBot	6fc771d19b	Revert "Change depreacate warning on dispatch_on_subclass to warn once (#130047 )" This reverts commit 8ff243bcf190bab62348310693f0ad2f90061c89. Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function on multiple jobs `8ff243bcf1` https://github.com/pytorch/pytorch/actions/runs/9812489165/job/27097342443. Dr CI is doing something weird about the unstable failures ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2211409090))	2024-07-05 21:03:36 +00:00
Catherine Lee	df50452279	Pin optree==0.11.0 on windows CI (#130155 ) Fixes #ISSUE_NUMBER doctests test_testing Failing run has 0.12.0 https://github.com/pytorch/pytorch/actions/runs/9804335516/job/27072891998 Succeeding run has 0.11.0 https://github.com/pytorch/pytorch/actions/runs/9798330845/job/27057359554 It is already pinned for mac and linux Pull Request resolved: https://github.com/pytorch/pytorch/pull/130155 Approved by: https://github.com/huydhn, https://github.com/atalman	2024-07-05 20:28:58 +00:00
Lucas Pasqualin	18e75c098b	[DCP] Adds Checkpointing Team (dcp) to merge rules (#129582 ) [DCP] Adds Checkpointing Team (dcp) to merge rules. Please comment to this PR if you think you should be added as well! Pull Request resolved: https://github.com/pytorch/pytorch/pull/129582 Approved by: https://github.com/fegin	2024-07-05 20:09:31 +00:00
Eddie Yan	739fc01ac9	[NCCL] Make sure current device is correct in `torch.distributed.barrier()`'s `streamSynchronize` (#129908 ) The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally: `a21d4363d2/c10/cuda/CUDAStream.h (L132)` OUTDATED below: The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following: ``` import logging import os import time import torch import torch.distributed as dist def main(): logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s") backend = 'nccl' group = torch.distributed.init_process_group(backend=backend) rank = torch.distributed.get_rank(group=group) for i in range(4): time.sleep(rank) logging.info(f"Rank {rank}: enter barrier {i}") dist.barrier() logging.info(f"Rank {rank}: exit barrier {i}") dist.destroy_process_group() if __name__ == "__main__": main() ``` appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead. 
The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization. This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device. CC @wujingyue @Aidyn-A @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908 Approved by: https://github.com/kwen2501	2024-07-05 19:53:54 +00:00
Huy Do	faebaef089	[EZ] Fix typo in upload stats OIDC rolename (#130168 ) My mistake from https://github.com/pytorch/pytorch/pull/129544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130168 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman	2024-07-05 19:38:24 +00:00
PaliC	3d56673b24	[Split Build][BE] remove extraneous .py, .a, and .so files (#130053 ) Removes extraneous .a, .so, and .py files from the split build. From here we can also clean up the builder script which produces the binary to do this. That pr is https://github.com/pytorch/builder/pull/1912 Verification: The built wheel with BUILD_LIBTORCH_WHL=1 has the following files only (with .a, .so, and .py extensions) ``` sahanp@devgpu086 ~/p/dist (viable/strict)> pwd (pytorch-3.10) /home/sahanp/pytorch/dist sahanp@devgpu086 ~/p/dist (viable/strict)> find . -type f $ -name ".py" -o -name ".a" -o -name "*.so" $ (pytorch-3.10) ./torch/__init__.py ./torch/lib/libbackend_with_compiler.so ./torch/lib/libc10.so ./torch/lib/libjitbackend_test.so ./torch/lib/libtorch.so ./torch/lib/libtorch_cpu.so ./torch/lib/libtorch_global_deps.so ./torch/lib/libtorchbind_test.so sahanp@devgpu086 ~/p/dist (viable/strict)> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130053 Approved by: https://github.com/atalman	2024-07-05 19:05:32 +00:00
Iris Zhang (PyTorch)	8ff243bcf1	Change depreacate warning on dispatch_on_subclass to warn once (#130047 ) Summary: Right now the deprecated warning fires on every operator that calls into torch_function. Changing it to TORCH_WARN_ONCE instead. More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/ Test Plan: Sandcastle Differential Revision: D59338775 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047 Approved by: https://github.com/XilunWu	2024-07-05 18:52:49 +00:00
PyTorch MergeBot	784e3b4123	Revert "Change numeric_debug_handle to store per-node id (#129811 )" This reverts commit a9a744e442975cfbc6f4b26a532e5c1b3d9d5692. Reverted https://github.com/pytorch/pytorch/pull/129811 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129811#issuecomment-2211245852))	2024-07-05 18:14:02 +00:00
Huy Do	889ed48a22	Fix missing id-token write in upload stats (#130153 ) Fix the mistake from https://github.com/pytorch/pytorch/pull/129544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130153 Approved by: https://github.com/clee2000	2024-07-05 18:05:46 +00:00
Jiashen Cao	7c5f3cd049	Add explain function to TSConverter. (#129968 ) Summary: The explain function does a conversion dry run to provide feedback on which operators are not supported / fail the conversion to the users. Test Plan: * `pytest test/export/test_converter.py` Differential Revision: D59251934 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129968 Approved by: https://github.com/angelayi	2024-07-05 18:04:29 +00:00
Animesh Jain	7ea8a3c9b8	[dynamo] Validate check_fn (#118448 ) Fixes - https://github.com/pytorch/pytorch/issues/128090 Tracker issue here - https://github.com/pytorch/pytorch/issues/129937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118448 Approved by: https://github.com/jansel, https://github.com/ezyang	2024-07-05 18:04:12 +00:00
Joel Schlosser	7192ee0735	Default to input tensor device for as_nested_tensor(t) (#130050 ) Fixes #129647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130050 Approved by: https://github.com/YuqingJ	2024-07-05 17:50:08 +00:00
Huy Do	a33ee73a28	Upload perf stats to both Rockset and dynamoDB (#129544 ) To avoid outage on HUD, I plan to migrate perf stats to dynamoDB as follows: 1. Upload perf stats to both Rockset and dynamoDB 2. Copy all the existing content from Rockset to dynamoDB 3. Create new Rockset tables to map to dynamoDB 4. Switch HUD to use the new Rockset tables (temporarily) 5. Delete the existing tables This depends on https://github.com/pytorch-labs/pytorch-gha-infra/pull/422 ### Testing ``` python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9770217910 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "gh/shunting314/162/head" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_" ... Writing 1607 documents to DynamoDB torchci-dynamo-perf-stats ``` And confirm the same number of documents is on the table ![Screenshot 2024-07-03 at 18 10 35](https://github.com/pytorch/pytorch/assets/475357/6c055c96-00ca-4cb3-bbe5-fe4914f9da9b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129544 Approved by: https://github.com/clee2000	2024-07-05 16:31:49 +00:00
James Wu	e7ab7b83bc	Have torch_key hash entire torch directory (#129250 ) Summary: Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files. There's an argument to be made to only hash .py and .cpp files. Maybe we can fix the glob to do that. We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources. Test Plan: Unit tests still pass on OSS For torch_key: ``` buck2 build caffe2:src_hash.txt -v 2 --show-output ``` See the output, then make any change to any torch file. See that the hash changes. Reviewed By: oulgen Differential Revision: D58875785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250 Approved by: https://github.com/oulgen	2024-07-05 15:37:16 +00:00
PyTorch MergeBot	eea4ece256	Revert "[audio hash update] update the pinned audio hash (#129429 )" This reverts commit 30fc4b06f55c7c4a915f938d7d5d6abbbc23bf61. Reverted https://github.com/pytorch/pytorch/pull/129429 on behalf of https://github.com/jeanschmidt due to pytorch bot should not have allowed this merge, as there are failing jobs ([comment](https://github.com/pytorch/pytorch/pull/129429#issuecomment-2210894639))	2024-07-05 13:38:44 +00:00
PyTorch MergeBot	4b05d9d233	Revert "[NCCL] Make sure current device is correct in `torch.distributed.barrier()`'s `streamSynchronize` (#129908 )" This reverts commit c9f1db265e317829b3a4d3af5be5c9266874dcd4. Reverted https://github.com/pytorch/pytorch/pull/129908 on behalf of https://github.com/jeanschmidt due to Seems to have introduced windows errors on main ([comment](https://github.com/pytorch/pytorch/pull/129908#issuecomment-2210888890))	2024-07-05 13:34:59 +00:00
Shunting Zhang	8f6765f7a7	[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005 ) This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful. Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/ ``` Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Roughly looking thru the PRs, I feel ``` ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change. Since this is not a real issue, I'll raise the tolerance to make it pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #129996, #129941	2024-07-05 10:26:39 +00:00
Shunting Zhang	c0735a3dd3	[pt2-bench] fix accuracy failure for a few models (#129941 ) This PR batch the fix for a few accuracy failures issues during training by raising tolerance. I do that only for models that I think it fails not due to real issue. ## sebotnet33ts_256 The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256). I can not repro locally, but from the log from the dashboard: ``` RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` raising the tolerance should fix it. ## DebertaForQuestionAnswering This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering ``` From error message on the dashboard: ``` RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 ``` 0.02 tolerance should suppress this error. ## gluon_inception_v3 This model fail on the dashboard in max-autotune mode. I can not repro locally by command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3 ``` From error message on the dashboard ``` RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). 
res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var ``` Raising the tolerance should suppress this error. # mobilenetv3_large_100 Fail in MA model. I can not repro locally by command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only ``` The error message on the dashboard is ``` RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same. # yolov3 Fail on dashboard with error ``` Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` Fix it by using a larger multiplier for smaller tensors and raising the tolerance. # timm_efficientdet Fail on the dashboard with error ``` E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` But I can not repro locally with command ``` time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training ``` Raising the tolerance should fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941 Approved by: https://github.com/jansel ghstack dependencies: #129996	2024-07-05 10:26:39 +00:00
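The failure lines quoted in the commit above each report a result RMSE, a reference RMSE, a multiplier and a tolerance. The sketch below shows one way these four numbers could be combined into a pass/fail decision. The formula `res_error <= multiplier * ref_error + tol / 10` is an assumption that is merely consistent with the logged numbers; it is not quoted from `torch._dynamo.utils.same`, and `passes_rmse_check` is a hypothetical name.

```python
def passes_rmse_check(res_error: float, ref_error: float,
                      multiplier: float, tol: float) -> bool:
    """Hypothetical helper: accept the compiled result when its RMSE against
    the fp64 baseline is within `multiplier` times the eager RMSE, plus a
    small slack derived from `tol` (assumed here to be tol / 10)."""
    return res_error <= multiplier * ref_error + tol / 10.0


# DebertaForQuestionAnswering numbers from the log above.
logged_tol = passes_rmse_check(0.01803, 0.00537, multiplier=3.0, tol=0.01)
raised_tol = passes_rmse_check(0.01803, 0.00537, multiplier=3.0, tol=0.02)
print("tol=0.01:", logged_tol)
print("tol=0.02:", raised_tol)
```

Under this assumed formula the logged tolerance of 0.01 rejects the DebertaForQuestionAnswering result and a tolerance of 0.02 accepts it, which matches the commit's remark that 0.02 should suppress the error.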
Shunting Zhang	8f1c2e1e28	[pt2-bench] pass acc test if ref is NaN (#129996 ) I'm debugging the accuracy failure for training vision_maskrcnn. Unfortunately I could not run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error: ``` eager run fail: AssertionError: targets should not be none when in training mode ``` (Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn ) But looking at the log from the dashboard ``` E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` We can see both the reference number and the pt2 number are NaN. I change torch._dynamo.utils.same to return true if both RMSE values are NaN. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996 Approved by: https://github.com/jansel	2024-07-05 10:26:39 +00:00
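The rule described in the commit above (treat the comparison as passing when both RMSE values are NaN) can be illustrated with a small pure-Python sketch. This is not the code in `torch._dynamo.utils.same`: the names `rmse` and `same_with_nan_rule` are hypothetical, plain lists stand in for tensors, and the non-NaN comparison formula is an assumption.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def same_with_nan_rule(ref, res, fp64_ref, multiplier=3.0, tol=1e-3):
    """Hypothetical accuracy check: compare eager (`ref`) and compiled (`res`)
    outputs against an fp64 baseline (`fp64_ref`)."""
    ref_error = rmse(fp64_ref, ref)
    res_error = rmse(fp64_ref, res)
    # Rule from the commit: if eager is already NaN and the compiled result
    # is NaN too, the compiled result is no worse, so accept it.
    if math.isnan(ref_error) and math.isnan(res_error):
        return True
    # Assumed tolerance formula for the ordinary case; a NaN on only one
    # side makes this comparison False.
    return res_error <= multiplier * ref_error + tol / 10.0


nan = float("nan")
print(same_with_nan_rule([nan, 1.0], [nan, 1.0], [nan, 1.0]))
print(same_with_nan_rule([0.5, 1.0], [nan, 1.0], [0.5, 1.0]))
```

The second call shows why the rule is restricted to the case where both errors are NaN: a compiled result that is NaN while eager is finite should still fail.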
Yu, Guangye	78a0b010eb	Refine XPU UTs (#130138 ) # Motivation 1. enable all test cases related to `TestXpu` running in XPU CI. 2. make `test_lazy_init` stable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130138 Approved by: https://github.com/EikanWang	2024-07-05 09:56:22 +00:00
Jason Ansel	3240bff56a	[benchmarking] Add join_results.py (#129202 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129202 Approved by: https://github.com/yanboliang, https://github.com/shunting314	2024-07-05 06:55:30 +00:00
PyTorch UpdateBot	30fc4b06f5	[audio hash update] update the pinned audio hash (#129429 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429 Approved by: https://github.com/pytorchbot	2024-07-05 03:32:29 +00:00
Eddie Yan	c9f1db265e	[NCCL] Make sure current device is correct in `torch.distributed.barrier()`'s `streamSynchronize` (#129908 ) The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally: `a21d4363d2/c10/cuda/CUDAStream.h (L132)` OUTDATED below: The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following: ``` import logging import os import time import torch import torch.distributed as dist def main(): logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s") backend = 'nccl' group = torch.distributed.init_process_group(backend=backend) rank = torch.distributed.get_rank(group=group) for i in range(4): time.sleep(rank) logging.info(f"Rank {rank}: enter barrier {i}") dist.barrier() logging.info(f"Rank {rank}: exit barrier {i}") dist.destroy_process_group() if __name__ == "__main__": main() ``` appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead. 
The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization. This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device. CC @wujingyue @Aidyn-A @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908 Approved by: https://github.com/kwen2501	2024-07-04 20:36:58 +00:00
Lei Zhang	7128504424	[inductor] Add Triton template for Conv3D (#129518 ) This commit adds a Triton template for Conv3D ops, following the same logic as Conv2D. Conv3D ops aren't used as frequently as Conv2D, so they might enjoy fewer optimizations in various libraries. Having a Triton-based Inductor implementation can therefore improve performance in some cases. Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129518 Approved by: https://github.com/jansel, https://github.com/jataylo	2024-07-04 20:30:50 +00:00
Kurt Mohler	e590168865	Enable sharing meta tensors between processes (#129520 ) Fixes #129436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129520 Approved by: https://github.com/ezyang	2024-07-04 20:29:48 +00:00
Xu Han	21eeedb455	[Inductor] Add aot_mode UT to new cpp_builder. (#130105 ) Changes: 1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT. 2. Switch AotCodeCompiler vec isa command gen to new cpp_builder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-04 19:08:56 +00:00
chuanqiw	d496145534	[CD] Add triton xpu wheel build (#129730 ) Enable triton xpu wheel build firstly, then add pytorch xpu nightly wheel build Pull Request resolved: https://github.com/pytorch/pytorch/pull/129730 Approved by: https://github.com/atalman	2024-07-04 17:55:20 +00:00
Huy Do	f78b79daaa	Forward fix the missing torch.nn.Module.set_submodule from D59140215 (#130075 ) Summary: This is to forward fix D59140215 from a PyTorch open source contributor T194074371. On PyTorch side, we need to use isinstance instead of type when checking for nn.Module. This is the same way get_submodule is currently implemented. Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/core/tests:module_test` Differential Revision: D59254638 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130075 Approved by: https://github.com/mikaylagawarecki	2024-07-04 17:46:56 +00:00
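The distinction the commit above relies on (checking with `isinstance` rather than comparing `type`) can be shown without PyTorch. In this sketch `Module` and `Linear` are stand-in classes, not `torch.nn` types, and both helper functions are hypothetical.

```python
class Module:
    """Stand-in for a base module class (not torch.nn.Module)."""


class Linear(Module):
    """Stand-in for a subclass such as a concrete layer."""


def is_module_by_type(obj) -> bool:
    # Exact type comparison: rejects every subclass instance.
    return type(obj) is Module


def is_module_by_isinstance(obj) -> bool:
    # isinstance follows the inheritance chain, so subclasses are accepted.
    return isinstance(obj, Module)


layer = Linear()
print(is_module_by_type(layer), is_module_by_isinstance(layer))
```

Since nearly every real module is a subclass of the base class, an exact type comparison would reject almost all valid submodules, which is presumably why the commit aligns the check with the `isinstance`-based one it says `get_submodule` uses.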
Howard Huang	5b5f4b02c2	[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369 ) # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369 Approved by: https://github.com/wconstab ghstack dependencies: #129368	2024-07-04 16:38:30 +00:00
PyTorch MergeBot	6dfa53ca76	Revert "[pt2-bench] pass acc test if ref is NaN (#129996 )" This reverts commit 51fa0bd436cf627bd0c8ccf3a3a8b9c07d260622. Reverted https://github.com/pytorch/pytorch/pull/129996 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
PyTorch MergeBot	fa3953a2e1	Revert "[pt2-bench] fix accuracy failure for a few models (#129941 )" This reverts commit dafbd603ee6672d9592ec72b59300a2631f431d2. Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
PyTorch MergeBot	54da35a2e0	Revert "[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005 )" This reverts commit 0af8c8a981e79b05767089e57e81262dbbf2b1b4. Reverted https://github.com/pytorch/pytorch/pull/130005 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))	2024-07-04 14:55:38 +00:00
Yu, Guangye	57d05f2616	[RELAND] Add xpu to getAccelerator (#129205 ) # Motivation Add `xpu` support to `getAccelerator`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205 Approved by: https://github.com/albanD, https://github.com/gujinghui ghstack dependencies: #129463	2024-07-04 10:26:52 +00:00
Yanbo Liang	551f3b92b2	[Dynamo] Add assertion for tensor unpack shape mismatch (#130077 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130077 Approved by: https://github.com/Chillee	2024-07-04 09:25:08 +00:00
Yu, Guangye	f3962cfd9c	[RELAND] XPUHooksInterface inherits from AcceleratorHooksInterface (#129463 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463 Approved by: https://github.com/gujinghui, https://github.com/albanD	2024-07-04 08:46:34 +00:00
Animesh Jain	fa4e489d70	[dynamo][dynamic-shapes] Graph break if out shape changes on out= variants (#130074 ) Fixes https://github.com/pytorch/pytorch/issues/130068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130074 Approved by: https://github.com/ezyang ghstack dependencies: #129913, #129914	2024-07-04 08:36:12 +00:00
Yan Zhiwei	e98587c58d	Update torch-xpu-ops pin (ATen XPU implementation) (#129353 ) 188 new ATen operators/variants are added in the pin update, involving eager and torch.compile usage on HuggingFace, TIMM and TorchBench models. 16 new unit tests ported to enhance functionality coverage. Aligned source file directory structure with ATen native. Fixed corner case failures in aten::resize, aten::index_add and aten::index_put. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129353 Approved by: https://github.com/EikanWang	2024-07-04 07:36:17 +00:00
titaiwangms	bffb278700	[ONNX] Add `artifacts_dir` to torch-onnx-patch in benchmark (#130069 ) Add `artifacts_dir` to torch-onnx-patch to save error report for debugging. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069 Approved by: https://github.com/justinchuby	2024-07-04 07:11:02 +00:00
Li-Huai (Allan) Lin	d62d351107	[Optim][BE] Change str(device) to _get_device_type(device) (#129984 ) Prevent using vague expressions like `"cuda" in str(device)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129984 Approved by: https://github.com/janeyx99 ghstack dependencies: #129451, #129552	2024-07-04 06:44:48 +00:00
Li-Huai (Allan) Lin	42f3d7e948	[MPS] Add mps profiler env vars to docs (#129552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129552 Approved by: https://github.com/malfet ghstack dependencies: #129451	2024-07-04 06:44:48 +00:00
cyy	07b06f0f0a	[2/N] Remove outdated CMake code (#130006 ) Follows #129851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130006 Approved by: https://github.com/drisspg	2024-07-04 06:24:22 +00:00
Jithun Nair	26be691e6b	Unify shard logic for inductor and dynamo test_config (#129508 ) Addresses https://github.com/pytorch/pytorch/pull/129480#issuecomment-2189954552 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129508 Approved by: https://github.com/clee2000, https://github.com/huydhn	2024-07-04 06:04:29 +00:00
Anshul Sinha	9c9ac670a0	[dtensor][be] Reduced redundant LOC by creating functions to set up models used in example (#129613 ) Summary As the CommModeFeature example file grew, too many LOC were repeated for setting up the models used. I created two functions, one to handle MLP and MLPStacked models and the other for transformer models. The output of the examples has not changed. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing 5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/129613 Approved by: https://github.com/XilunWu ghstack dependencies: #129602	2024-07-04 06:00:58 +00:00
Anshul Sinha	0b9995c1ce	[dtensor][debug] Added forward and backward differentiation for module level tracing (#129602 ) Summary Currently, comm_mode only allowed users to differentiate between forward and backward passes at the operational level. I modified the code so that users can now see the collective counts for the passes at a module level. I decided to slightly change how the output was formatted making it easier to differentiate between a collective count and an operation. I have designed the operational trace table function so that in the future, a user can use command line arguments in order to determine the level of information they want to display instead of having two similar functions. Finally, I have updated the new output and test cases for comm_mode example and test files. The expected output for the first 3 examples are shown below: <img width="320" alt="Screenshot 2024-06-26 at 2 30 25 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b8e88075-a07f-4e84-b728-a08959df3661"> <img width="497" alt="Screenshot 2024-06-26 at 2 29 15 PM" src="https://github.com/pytorch/pytorch/assets/50644008/5ef4bea7-1355-4089-bfb0-c7e3f588ac77"> <img width="615" alt="Screenshot 2024-06-26 at 2 31 05 PM" src="https://github.com/pytorch/pytorch/assets/50644008/feacae51-76f7-403b-b6cd-dd15e981770e"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing 5. 
pytest test/distributed/_tensor/debug/test_comm_mode_features.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129602 Approved by: https://github.com/XilunWu, https://github.com/wz337	2024-07-04 06:00:58 +00:00
Peter Bell	e2e624a02f	[AOTAutograd] Micro-optimize runtime_wrapper (#128188 ) This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188 Approved by: https://github.com/bdhirsh	2024-07-04 03:53:06 +00:00
Animesh Jain	a7a7363be0	[dynamo] Skip side effect tracking for c wrappers/descriptors (#129914 ) Fixes PYTORCH_TEST_WITH_DYNAMO=1 pytest -vs test/test_python_dispatch.py::TestPythonDispatch::test_deepcopy_wrapper_subclass Pull Request resolved: https://github.com/pytorch/pytorch/pull/129914 Approved by: https://github.com/jansel ghstack dependencies: #129913	2024-07-04 03:14:45 +00:00
Animesh Jain	da8af685ac	[dynamo] Skip ID_MATCH guard on GetSetDescriptorType (#129913 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129913 Approved by: https://github.com/jansel	2024-07-04 03:14:45 +00:00
Jiong Gong	8405ba21c1	[inductor][cpp] fix the vec conversion between float and int64 on AVX2 (#130013 ) Fix https://github.com/pytorch/pytorch/issues/129863 AVX2 has no single instruction for converting between fp and int64, so the conversion has to be emulated. The original fast implementation (see https://stackoverflow.com/questions/41144668) assumes the data range is within [-2^51, 2^51]. The issue reported in https://github.com/pytorch/pytorch/issues/129863 has input data outside this range, so the test failed. This PR supports the full range of the conversion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130013 Approved by: https://github.com/lezcano	2024-07-04 03:01:49 +00:00
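The range limit in the commit above comes from the magic-number trick described in the linked Stack Overflow question. The following is a minimal scalar sketch of that idea in Python, not the PR's AVX2 code; the helper names `_bits` and `fast_double_to_int64` are made up for illustration. It shows that the shortcut gives correct results only while the input stays within [-2^51, 2^51].

```python
import struct

# 2**52 + 2**51: adding this to a double in [-2**51, 2**51] moves the
# integer value into the low mantissa bits without changing the exponent.
MAGIC = float(2**52 + 2**51)


def _bits(x: float) -> int:
    """Reinterpret the 8 bytes of a double as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def fast_double_to_int64(x: float) -> int:
    """Hypothetical scalar model of the fast path (illustration only)."""
    return _bits(x + MAGIC) - _bits(MAGIC)


in_range = [fast_double_to_int64(5.0), fast_double_to_int64(-7.0)]
# Outside the assumed range the exponent of (x + MAGIC) changes, so the
# bit difference no longer equals the integer value.
out_of_range_ok = fast_double_to_int64(2.0**53) == 2**53
print(in_range, out_of_range_ok)
```

A full-range conversion therefore needs a different (slower) emulation for large magnitudes, which is the gap the PR description says it closes.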
cyy	99ec7bbee7	Force inconsistent-missing-override for torch targets (#130010 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130010 Approved by: https://github.com/ezyang	2024-07-04 02:37:57 +00:00
Shunting Zhang	0af8c8a981	[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005 ) This model's accuracy test recently regressed. The debugging process to find the cause went quite smoothly, so I'd like to write it down in case it is helpful. Clicking the model name beit_base_patch16_224 on the dashboard shows the pass status of the model over, e.g., the past month. For this model, we can see that it started to fail on June 08: <img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e"> What's nice is that the dashboard shows the nightly commits for each run. Running ``` git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/ ``` gives us the list of Inductor PRs between the good and bad commits: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df Looking roughly through the PRs, I feel that ``` ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451) ``` can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . The accuracy test then passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 ) Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in the joint graph. I think the graph may have changed, causing the partitioner to make different recomputation decisions, which can cause some numerics change. Since this is not a real issue, I'll raise the tolerance to make it pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #129996, #129941	2024-07-04 01:14:29 +00:00
Shunting Zhang	dafbd603ee	[pt2-bench] fix accuracy failure for a few models (#129941 ) This PR batches the fixes for a few accuracy failures during training by raising the tolerance. I do that only for models that I think fail for reasons other than a real issue. ## sebotnet33ts_256 The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256). I cannot repro locally, but judging from the dashboard log: ``` RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` raising the tolerance should fix it. ## DebertaForQuestionAnswering This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering ``` From the error message on the dashboard: ``` RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 ``` a 0.02 tolerance should suppress this error. ## gluon_inception_v3 This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3 ``` From the error message on the dashboard ``` RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). 
res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000 Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var ``` raising the tolerance should suppress this error. ## mobilenetv3_large_100 Fails in MA (max-autotune) mode. I cannot repro locally with the command ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only ``` The error message on the dashboard is ``` RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000 ``` The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same. ## yolov3 Fails on the dashboard with the error ``` Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` Fixed by using a larger multiplier for smaller tensors and raising the tolerance. ## timm_efficientdet Fails on the dashboard with the error ``` E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` But I cannot repro locally with the command ``` time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training ``` Raising the tolerance should fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941 Approved by: https://github.com/jansel ghstack dependencies: #129996	2024-07-04 01:14:29 +00:00
Shunting Zhang	51fa0bd436	[pt2-bench] pass acc test if ref is NaN (#129996 ) I'm debugging the accuracy failure for training vision_maskrcnn. Unfortunately, I could not run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error: ``` eager run fail: AssertionError: targets should not be none when in training mode ``` (Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn ) But looking at the log from the dashboard ``` E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000 ``` we can see that both the reference number and the pt2 number are NaN. I changed torch._dynamo.utils.same to return true if both RMSE values are NaN. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996 Approved by: https://github.com/jansel	2024-07-04 01:14:29 +00:00
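The rule described in the commit above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in torch._dynamo.utils.same: the function name `passes_accuracy` is hypothetical, and the threshold form `multiplier * ref_error + tol / 10` is an assumption chosen because it agrees with the failing log lines quoted in these commit messages.

```python
import math


def passes_accuracy(res_error: float, ref_error: float,
                    multiplier: float = 3.0, tol: float = 1e-3) -> bool:
    """Hypothetical stand-in for the RMSE comparison (illustration only)."""
    # Rule described in the commit: if both RMSE values are NaN, treat the
    # comparison as passing, since eager and compiled runs agree.
    if math.isnan(res_error) and math.isnan(ref_error):
        return True
    # Assumed threshold form. Any comparison involving a single NaN is
    # False, so a NaN on only one side still fails.
    return res_error <= multiplier * ref_error + tol / 10.0


# Values taken from the dashboard log quoted above.
print(passes_accuracy(float("nan"), float("nan"), multiplier=3.0, tol=0.001))
```

Keeping the NaN check separate from the threshold test means a run where only the compiled result is NaN is still reported as a failure.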
drisspg	9108b74bbc	Updates to scaled_mm for rowwise scaling (#130059 ) # Summary This updates _scaled_mm's API to enforce that input scales are always 2 dimensional. This resolves ambiguity around scaling scheme Pull Request resolved: https://github.com/pytorch/pytorch/pull/130059 Approved by: https://github.com/vkuzo	2024-07-04 00:53:17 +00:00
Tristan Rice	cd70ac884f	c10d/Utils: better error message on 0 bytes (#130056 ) This improves the error messages on 0 bytes sent/received. We currently log it as a connection reset when it's caused by other reasons. Test plan: ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/130056 Approved by: https://github.com/kurman, https://github.com/rsdcastro	2024-07-04 00:48:20 +00:00
cyy	efb73eda51	[2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878 ) Follows #128670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129878 Approved by: https://github.com/ezyang	2024-07-04 00:39:28 +00:00
Shangdi Yu	d95a019704	[export] construct empty graph when there's no tensor computation (#129541 ) Fixes [#127110](https://github.com/pytorch/pytorch/issues/127110). When input module does not contain any tensor computation, we would create a graph with inputs and outputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129541 Approved by: https://github.com/angelayi	2024-07-04 00:26:17 +00:00
Shangdi Yu	2fe7c1fe04	[custom ops] Support factory function (#129978 ) Fixes #129389 If a user registers a device-specific implementation for an operator that accepts no Tensors, then we require the operator to have a "device: torch.device" argument. We switch on the device argument to select the correct backend to dispatch to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129978 Approved by: https://github.com/zou3519	2024-07-04 00:10:52 +00:00
PyTorch MergeBot	779fc8119e	Revert "XPUHooksInterface inherits from AcceleratorHooksInterface (#129463 )" This reverts commit 6353a12e6a80f06217645b10fb69cffeac08a8d0. Reverted https://github.com/pytorch/pytorch/pull/129463 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129463#issuecomment-2207529072))	2024-07-03 23:43:15 +00:00
PyTorch MergeBot	8a9725bedb	Revert "Add xpu to getAccelerator (#129205 )" This reverts commit 3e2df3ca9d0a593e09bc94c14bbf2b213413cbf3. Reverted https://github.com/pytorch/pytorch/pull/129205 on behalf of https://github.com/kit1980 due to Need to revert https://github.com/pytorch/pytorch/pull/129463 which breaks Meta builds ([comment](https://github.com/pytorch/pytorch/pull/129205#issuecomment-2207514346))	2024-07-03 23:37:24 +00:00
Jerry Zhang	a9a744e442	Change numeric_debug_handle to store per-node id (#129811 ) Summary: Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack, but according to customer feedback, people prefer the simpler per-node id; they are fine with not having the additional numerical-debugging support for inputs and are willing to hack around this. This PR changes the structure of numeric_debug_handle to store a unique_id for each node instead. E.g., for the graph: ``` node = op(input_node, weight_node) ``` Before: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3} ``` After: ``` node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1 ``` Test Plan: python test/test_quantization.py -k TestGenerateNumericDebugHandle Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811 Approved by: https://github.com/tarun292	2024-07-03 22:03:31 +00:00
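The per-node scheme in the commit above can be illustrated with a toy graph. This is a minimal sketch that does not use torch.fx: `ToyNode` and `assign_per_node_handles` are hypothetical names, and the string value given to `NUMERIC_DEBUG_HANDLE_KEY` is an assumption made for the example.

```python
from dataclasses import dataclass, field
from itertools import count

# Assumed key string, used here only for illustration.
NUMERIC_DEBUG_HANDLE_KEY = "numeric_debug_handle"


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node carrying a meta dict."""
    name: str
    meta: dict = field(default_factory=dict)


def assign_per_node_handles(nodes):
    """Store one unique integer id per node instead of a per-edge dict."""
    ids = count()
    for node in nodes:
        node.meta[NUMERIC_DEBUG_HANDLE_KEY] = next(ids)
    return nodes


graph = assign_per_node_handles(
    [ToyNode("input_node"), ToyNode("weight_node"), ToyNode("node")]
)
handles = {n.name: n.meta[NUMERIC_DEBUG_HANDLE_KEY] for n in graph}
print(handles)
```

With a single id per node, looking up a handle no longer depends on which input edge is being debugged, which matches the simplification the commit describes.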
Zain Rizvi	b0d0114f5b	Enable automigration for windows jobs (#129977 ) Enable Windows jobs to automatically use LF runners when the author is opted-in Pull Request resolved: https://github.com/pytorch/pytorch/pull/129977 Approved by: https://github.com/clee2000	2024-07-03 22:02:56 +00:00
Yukio Siraichi	a79bb8db91	Make `_embedding_bag_backward` explicitly dispatch to CPU and CUDA. (#129691 ) This PR modifies the `_embedding_bag_backward` item inside _native_functions.yaml_, so that it dispatches to CPU and CUDA directly, instead of `CompositeImplicitAutograd`. Context: PyTorch operations that have the `CompositeImplicitAutograd` dispatch do not allow third-party backends (e.g. XLA) to modify their implementation, since this dispatch key has higher priority. When calling the `_embedding_bag_backward` operation using XLA, a dispatch error will be thrown, since PyTorch/XLA doesn't support sparse tensors. Problem: `_embedding_bag_backward` has a `sparse` parameter that controls whether the operation should return a sparse or dense tensor. However, at the moment, PyTorch/XLA does not support sparse tensors. To fall back to dense execution, i.e. change the flag at runtime, we need to be able to modify its implementation. Solution: we have changed the dispatch of `_embedding_bag_backward` to CPU and CUDA, which allowed us to introduce our own kernel for it. Additionally, this PR refactored the representation of its mode from constant integers into an enum class. It also introduces two additional operators: `int == EmbeddingBagMode` and `int != EmbeddingBagMode`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129691 Approved by: https://github.com/lezcano	2024-07-03 21:54:49 +00:00
rzou	7bbd6cf931	[custom_ops] Mark older custom ops prototypes as deprecated (#130032 ) I've had at least one person try to call APIs from here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130032 Approved by: https://github.com/yushangdi, https://github.com/williamwen42	2024-07-03 21:11:05 +00:00
Shivam Raikundalia	a21d4363d2	[Profiler] Remove all instances of TMP_USE_TSC_AS_TIMESTAMP (#129973 ) Summary: Now that D56584521 is in, we can remove all instances of TMP_USE_TSC_AS_TIMESTAMP Test Plan: Ran resnet. Trace looks good https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jun_27_14_46_01.1967733.pt.trace.json.gz&bucket=gpu_traces Reviewed By: aaronenyeshi, swolchok Differential Revision: D59132793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129973 Approved by: https://github.com/aaronenyeshi	2024-07-03 19:28:52 +00:00
Zhengxu Chen	042d764872	[export] Update example inputs format for DB. (#129982 ) Summary: To give users simpler example code, we are getting rid of ExportArgs in favor of example_args and example_kwargs. Test Plan: CI Differential Revision: D59288920 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129982 Approved by: https://github.com/angelayi	2024-07-03 17:53:15 +00:00
Brian Hirsh	9b902b3ee3	AOTI: dont treat views of buffers as constants (#129688 ) More context [here](https://github.com/pytorch/pytorch/issues/129682#issuecomment-2195463838), but this change was enough to get this AOTI + float8 repro running for me (below). Previously, it would fail an assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_meta_registrations.py#L5387) at inductor lowering time. It looks like during lowering, we were supposed to pass `param.transpose(1, 0)` as the second argument to the scaled_mm kernel. But in the inductor IR, this object is a `ReinterpretView` with `get_name()` equal to one of the param constants, so we would end up passing the constant directly into the kernel, instead of performing the view first. I'm not totally sure if this is the right place to make the change, so interested in any thoughts from inductor folks (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @eellison ) ``` import torch from torch.export import export from torch.export._trace import _export # Copyright (c) Meta Platforms, Inc. and affiliates. # All rights reserved. # # This source code is licensed under the BSD 3-Clause license found in the # LICENSE file in the root directory of this source tree. 
import copy import io import random import unittest import pytest import torch import torch.nn as nn import torch.nn.functional as F from float8_experimental.float8_dynamic_linear import Float8DynamicLinear from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear from float8_experimental.float8_tensor import Float8Tensor from float8_experimental.float8_utils import compute_error random.seed(0) torch.manual_seed(0) is_H100 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0) import torch.nn.utils.parametrize as parametrize # NOTE: we should upstream this directly into export and make it more automatic! class UnwrapTensorSubclass(torch.nn.Module): def forward(self, tensors): todo = list(tensors) for tp, meta, inner_tensors in reversed(self.rebuild_stack): nb_tensor = len(inner_tensors) inner_tensors = {a: b for a, b in zip(inner_tensors, todo[-nb_tensor:])} todo = todo[nb_tensor:] rebuilt = tp.__tensor_unflatten__(inner_tensors, meta, None, None) todo.append(rebuilt) assert len(todo) == 1 return todo[0] def right_inverse(self, tensor): assert type(tensor) is not torch.Tensor rebuild_stack = [] plain_tensors = [] todo = [tensor] while todo: obj = todo.pop() inner_tensors, metadata = obj.__tensor_flatten__() rebuild_stack.append((type(obj), metadata, inner_tensors)) for attr_name in inner_tensors: val = getattr(obj, attr_name) if type(val) is torch.Tensor: plain_tensors.append(val) else: assert isinstance(val, torch.Tensor) todo.append(val) self.rebuild_stack = rebuild_stack return plain_tensors def unwrap_tensor_subclass(model, filter_fn=None): for name, child in model.named_children(): if ( isinstance(child, Float8DynamicLinear) and hasattr(child, "weight") and type(child.weight) is not torch.Tensor and isinstance(child.weight, torch.Tensor) ): parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass()) unwrap_tensor_subclass(child) return model class FeedForward(nn.Module): def __init__(self) -> 
None: super().__init__() self.w1 = nn.Linear(4096, 14336, bias=False) self.w3 = nn.Linear(4096, 14336, bias=False) self.w2 = nn.Linear(14336, 4096, bias=False) def forward(self, x: torch.Tensor) -> torch.Tensor: return self.w2(F.silu(self.w1(x)) * self.w3(x)) def reset_parameters(self): for m in self.modules(): if isinstance(m, nn.Linear): m.reset_parameters() export_model = FeedForward().to("cuda") swap_linear_with_float8_linear( export_model, Float8DynamicLinear, from_float_kwargs={"pre_quantize_weight": True}, ) export_model = unwrap_tensor_subclass(export_model) batch_size = 4 num_tokens = 1024 embedding_dim = 4096 input_tensor = torch.randn( batch_size, num_tokens, embedding_dim, device="cuda", dtype=torch.float32 ) example_args = (input_tensor,) # NOTE: this breaks unless we use strict=False, pre_dispatch=False! exported_program: torch.export.ExportedProgram = _export( export_model, example_args, strict=False, pre_dispatch=False, ) with torch.no_grad(): so_path = torch._inductor.aot_compile(exported_program.module(), example_args) print(so_path) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129688 Approved by: https://github.com/eellison	2024-07-03 17:24:08 +00:00
Edward Z. Yang	35600bcaad	Print float with full precision, don't truncate (#130027 ) Fixes https://github.com/pytorch/pytorch/issues/119338 Exercised in https://github.com/pytorch/pytorch/pull/118448 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130027 Approved by: https://github.com/lezcano, https://github.com/Skylion007	2024-07-03 17:20:19 +00:00
chilli	01e41f1814	Modified autotuning for flex_attention to pass in (proper) fake inputs for the block sparse entries (#129915 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129915 Approved by: https://github.com/yanboliang, https://github.com/eellison ghstack dependencies: #129846, #129950	2024-07-03 17:08:45 +00:00
chilli	e2eb33b089	Added methods to blockmask to visualize them (#129950 ) <img width="319" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/319b10f4-f6fe-4ff8-9529-d366ff411b95"> <img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/27a8953a-3c50-4922-b5d0-4ea5630a133a"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129950 Approved by: https://github.com/yanboliang, https://github.com/drisspg ghstack dependencies: #129846	2024-07-03 17:08:45 +00:00
Edward Z. Yang	29c68df600	Stop immediately specializing common constants 0/1 for plain int (#128327 ) Fixes https://github.com/pytorch/pytorch/issues/128319 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327 Approved by: https://github.com/lezcano ghstack dependencies: #129983	2024-07-03 16:41:51 +00:00
James Wu	9e1e58e052	Support allowlisted modules and op overloads in AOTAutogradCache (#128329 ) Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329 Approved by: https://github.com/bdhirsh	2024-07-03 14:59:24 +00:00
Edward Z. Yang	64a04d2225	Make sparse empty constructors specialize instead of fail on symbolic inputs (#129983 ) Exercised in https://github.com/pytorch/pytorch/pull/128327 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129983 Approved by: https://github.com/anijain2305	2024-07-03 13:27:19 +00:00
Xuehai Pan	735044191f	[Easy] Add whitespace after comma when re-rendering tuple default value in schema (#129884 ) The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`. There should be no space after `,` in `[0,1]`. `5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)` The default value is then formatted as `(0,1)` in `pyi` files. This PR manually adds an extra whitespace when re-rendering the default value to a string. ```python ", ".join(string.split(",")) ``` ```python # before def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ... # after def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884 Approved by: https://github.com/ezyang	2024-07-03 11:45:24 +00:00
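As a small worked example of the re-rendering step in the commit above, the split-and-join idiom can be wrapped in a helper. The name `rerender_default` is hypothetical, and the extra `.strip()` is an addition for this sketch (so that input which already contains spaces is not double-spaced); the commit message itself shows only the plain join.

```python
def rerender_default(default: str) -> str:
    """Re-render a comma-separated default with one space after each comma.

    Hypothetical helper for illustration; stripping each part is an
    assumption added here, not something shown in the commit message.
    """
    return ", ".join(part.strip() for part in default.split(","))


before = "(0,1)"
after = rerender_default(before)
print(before, "->", after)
```

A value without commas, such as a scalar default, passes through unchanged because `split(",")` returns it as a single part.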
Huy Do	8f70bf7a94	Skip TestSDPAPrivateUse1Only on FBCODE (#129997 ) Summary: The test is from D59181111, but I couldn't figure out a way to make it pass on FBCODE because loading PyTorch C++ extension requires Ninja which is not going to work with BUCK Test Plan: `buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:transformers` Differential Revision: D59304327 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129997 Approved by: https://github.com/drisspg	2024-07-03 06:48:51 +00:00
Valentine233	62b710782d	change LayoutLMForSequenceClassification inference accuracy tolerance (#129728 ) Fixes #128510. https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit SDPA pattern 1 and then encounter the accuracy issue. The issue only happens with single-threaded BF16 inference. This PR increases the model tolerance so that the check passes. Note that even the math-version SDPA could have the issue because of some small implementation differences. The test log: Single thread ``` correct_result: SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None) new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None) E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000 E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits fail_accuracy ``` Multiple threads ``` correct_result: SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None) new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None) pass ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-03 06:28:27 +00:00
Jason Ansel	4fc9157e90	[halide-backend] Disable split reductions for Halide (#129320 ) In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #129321	2024-07-03 05:56:40 +00:00
Jason Ansel	0abcca85b7	[halide-backend] Support manual schedules (#129321 ) Currently using this for some by-hand hacking, but might need to implement our own scheduler later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321 Approved by: https://github.com/shunting314	2024-07-03 05:56:40 +00:00
Edward Z. Yang	8af58f66bb	Fix typo in floordiv solver code that affects flipped relation (#129888 ) Fixes https://github.com/pytorch/pytorch/issues/123535 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888 Approved by: https://github.com/lezcano	2024-07-03 04:47:32 +00:00
Edward Z. Yang	424cd1e1df	Enable TORCH_TRACE by default on Conda on Mast (#129988 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129988 Approved by: https://github.com/kunalb	2024-07-03 03:35:45 +00:00
Catherine Lee	1026b0f687	Use setup-miniconda step from test-infra for llm retrieval workflow (#129720 ) Undo https://github.com/pytorch/pytorch/pull/129722 Use the setup-miniconda step written in test-infra to install miniconda in the llm retrieval workflow. It comes with a cache so we don't have to worry about hitting cache limits. The llm retrieval job was failing due to too many requests https://github.com/pytorch/pytorch/issues/129718#issue-2379260544 `2aba8f107a/.github/actions/setup-miniconda/action.yml (L1)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129720 Approved by: https://github.com/PaliC, https://github.com/malfet, https://github.com/huydhn	2024-07-03 03:02:23 +00:00
chilli	31fc5b8966	Add support for inline_asm_elementwise in Inductor lowerings (#129846 ) This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it. <img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846 Approved by: https://github.com/shunting314	2024-07-03 02:34:03 +00:00
Tristan Rice	9ee8c18309	TCPStore: add ping to verify network connectivity on connect (#129985 ) This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent. This adds support for PING to both the libuv and legacy backend. Example error: ``` [trainer85612\|12]:W0701 13:41:43.421574 4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer [trainer85612\|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first): ... [trainer85612\|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637 [trainer85612\|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868 [trainer85612\|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775 ``` Test plan: ``` python test/distributed/test_store.py -v ``` ``` tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py starting pool started 90000 started 30000 started 70000 started 20000 started 80000 started 60000 started 0 [W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it. 
init 20000 set 20000 init 80000 set 80000 init 70000 set 70000 init 60000 set 60000 init 30000 set 30000 init 90000 set 90000 started 40000 init 40000 set 40000 started 50000 init 50000 set 50000 started 10000 init 10000 set 10000 init 0 set 0 run finished 617.2992351055145 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985 Approved by: https://github.com/rsdcastro, https://github.com/kurman	2024-07-03 02:09:44 +00:00
Catherine Lee	91a8376d47	run_test: Unset cpp stacktraces after reruns (#129004 ) Rerun the failing test singly with the env var set. If it succeeds, start a new process without the cpp stack traces env var. We don't want to waste time generating these if we don't have to. They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these. Adds new --rs (run single) to be used the same way --scs and --sc are. It will only run the single test in the step current file. https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs. In the above: * test_checkpoint_valid failed, then passed in another subprocess. The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free) * test_format_traceback_short failed consistently, but it continued to run because keep-going was set Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004 Approved by: https://github.com/PaliC	2024-07-03 01:50:15 +00:00
xinan.lin	c77c139878	[Intel Triton] Update Intel Triton to resolve installation issue on manylinux. (#129847 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129847 Approved by: https://github.com/Skylion007, https://github.com/gujinghui, https://github.com/atalman ghstack dependencies: #129782	2024-07-03 01:46:32 +00:00
dilililiwhy	c686304277	Enable UFMT on test/test_public_bindings.py (#128389 ) Part of: https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: > test/test_public_bindings.py Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Edward Z. Yang <ezyang@fb.com> Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389 Approved by: https://github.com/malfet	2024-07-03 01:43:41 +00:00
xinan.lin	3b77b122c5	[Inductor UT] update rtol for convoluton on XPU. (#129782 ) [Inductor UT] update rtol for convoluton on XPU. Fix https://github.com/pytorch/pytorch/issues/129974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129782 Approved by: https://github.com/atalman	2024-07-03 01:37:16 +00:00
Shiyan Deng	1e27af335e	[easy] enhance local model loading (#129897 ) Summary: 1. add one more model lib dep. 2. add error message when torchscript failed to find a class in python compilation unit. Test Plan: CI Reviewed By: jingsh Differential Revision: D59243250 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129897 Approved by: https://github.com/jingsh	2024-07-03 00:29:02 +00:00
Simon Fan	be2d79a16b	[dynamic] config to disable duck sizing (#129804 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129804 Approved by: https://github.com/ezyang	2024-07-03 00:20:54 +00:00
Yanbo Liang	111f9b5d44	[Dynamo] Add config to skip/inline torchrec (#129912 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129912 Approved by: https://github.com/anijain2305	2024-07-03 00:14:51 +00:00
PyTorch MergeBot	89646ebb11	Revert "[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680 )" This reverts commit 4b8a5e03745924c8f987dc072fa4d41f4cb6f103. Reverted https://github.com/pytorch/pytorch/pull/129680 on behalf of https://github.com/kit1980 due to breaking internal builds, see D59181183 ([comment](https://github.com/pytorch/pytorch/pull/129680#issuecomment-2204737227))	2024-07-03 00:03:50 +00:00
Peter Bell	921c116089	[inductor] Kill mark_node_as_mutating (#129346 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129346 Approved by: https://github.com/lezcano ghstack dependencies: #128893, #129325, #129343, #129344	2024-07-02 23:50:07 +00:00
Peter Bell	b2ac8d2af3	[inductor] Use multiple outputs for flex-attention (#129344 ) This fixes the DCE issue for attention output Pull Request resolved: https://github.com/pytorch/pytorch/pull/129344 Approved by: https://github.com/lezcano ghstack dependencies: #128893, #129325, #129343	2024-07-02 23:50:07 +00:00
Peter Bell	45844e0d4e	[inductor] Add FileCheck to flex attention epilogue test (#129343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343 Approved by: https://github.com/lezcano ghstack dependencies: #128893, #129325	2024-07-02 23:50:04 +00:00
Peter Bell	7955cd3e83	[inductor] Make UserDefinedTritonKernel a multi-output operation (#129325 ) Previously each mutation was represented by a `MutationOutput` operation which was a new scheduler node that must be scheduled immediately afterwards. Now we have a single scheduler node, which produces multiple `MutationOutput` buffers as its output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129325 Approved by: https://github.com/lezcano ghstack dependencies: #128893	2024-07-02 23:50:00 +00:00
Peter Bell	fb078c20c1	[inductor] Separate Buffer and Operation into two concepts (#128893 ) Currently a buffer represents both a tensor with physical storage and a computation that produces the tensor as a result. This PR attempts to split these into two different concepts in the scheduler. This should allow us to have multiple outputs from a single operation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893 Approved by: https://github.com/lezcano	2024-07-02 23:49:57 +00:00
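The split described in #128893 can be pictured with a toy model. This is only a sketch under stated assumptions: the `Buffer` and `Operation` classes below are hypothetical stand-ins written for illustration, not the inductor IR classes, and `add_output` is an invented helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Buffer:
    """Hypothetical: a tensor with physical storage, nothing more."""
    name: str


@dataclass
class Operation:
    """Hypothetical: a computation that may produce several buffers."""
    name: str
    outputs: List[Buffer] = field(default_factory=list)

    def add_output(self, buffer_name: str) -> Buffer:
        # Register one more buffer produced by this operation.
        buf = Buffer(buffer_name)
        self.outputs.append(buf)
        return buf


# One operation, two output buffers: possible only once the concepts are separate.
op = Operation("op0")
op.add_output("buf0")
op.add_output("buf1")
```

When a buffer is also the computation, one node can only ever stand for one result; keeping them apart is what lets a single scheduler node own several outputs.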
rzou	872d972e41	[custom_op] better error message on no returns (#129896 ) I run into this a lot. I can imagine that it would look opaque to users, so made it more friendly Old error message: "ValueError: infer_schema(func): Return has unsupported type <class 'inspect._empty'>." Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129896 Approved by: https://github.com/yushangdi	2024-07-02 23:34:23 +00:00
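The old message in #129896 mentions `inspect._empty`, which is what `inspect.signature` reports when a function has no return annotation. A minimal sketch of detecting that case up front, assuming a hypothetical helper `check_return_annotation` and an invented error wording (neither is taken from `torch.library`):

```python
import inspect


def check_return_annotation(func):
    """Return func's return annotation, or raise a readable error if absent.

    Hypothetical helper for illustration; not the torch.library code.
    """
    annotation = inspect.signature(func).return_annotation
    if annotation is inspect.Signature.empty:
        # Invented wording: the real message in the PR may differ.
        raise ValueError(
            f"infer_schema({func.__name__}): no return type annotation found. "
            "Please add one, for example '-> None'."
        )
    return annotation


def missing(x):
    return x


def annotated(x) -> None:
    return None
```

Calling `check_return_annotation(missing)` raises `ValueError`, while `check_return_annotation(annotated)` returns `None`.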
Shangdi Yu	aa0352ca38	[custom ops] add default value support for device types (#129792 ) Fixes #129371 I think the first case in Issue #129371 is already supported in the current code? Since it takes care of string default values. This PR adds support for device type default values. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129792 Approved by: https://github.com/zou3519	2024-07-02 23:31:29 +00:00
Edward Z. Yang	d7680a564b	Bug fixes for disabling 0/1 specialization on plain int (#129961 ) These bug fixes will be exercised in https://github.com/pytorch/pytorch/pull/128327 but I separate them from the actual policy change (which is more risky) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129961 Approved by: https://github.com/lezcano	2024-07-02 23:19:48 +00:00
eqy	29ffa20bb1	[CUDA] Bump tolerances for `test_grad_pca_lowrank` (#129902 ) The revert of #127199 seems to surface an additional failure on A100---small tolerance bump to account for this. I did find what appears to be a race condition in one of the kernels used in this workload but I'm not sure it's related here... CC @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/129902 Approved by: https://github.com/ezyang	2024-07-02 23:17:02 +00:00
PyTorch MergeBot	b5fdbc1a9f	Revert "[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369 )" This reverts commit ec789a3c9ddd4e550b3dea6934ce2d41deb98784. Reverted https://github.com/pytorch/pytorch/pull/129369 on behalf of https://github.com/clee2000 due to broke test/distributed/pipelining/test_schedule.py::ScheduleTest::test_non_symmetric_stage_ids_ScheduleClass0 on distributed cuda https://github.com/pytorch/pytorch/actions/runs/9766039400/job/26959115773 `ec789a3c9d`. You can see the error on the PR, but Dr. CI classified it wrong ([comment](https://github.com/pytorch/pytorch/pull/129369#issuecomment-2204568418))	2024-07-02 22:30:53 +00:00
Sheng Fu	b6f781e433	Bug fix for captuing execution trace grid function (#129832 ) Summary: The grid function takes a varying number of arguments: one, two, or three numbers. The current implementation captured them as a single tuple, for example "grid((16,))". The fix is to capture them as a varying number of elements, so the previous example becomes "grid(16,)". PARAM et-replay code will be modified to reflect this change in a follow-up diff. Test Plan: buck2 test mode/dev-nosan caffe2/test:profiler -- -- test_execution_trace_with_pt2 Differential Revision: D59195933 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129832 Approved by: https://github.com/Skylion007, https://github.com/davidberard98	2024-07-02 22:23:57 +00:00
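The difference between the two captured strings in #129832 can be reproduced with plain string formatting. This is a sketch only: `format_grid_old` and `format_grid_new` are hypothetical names, and the profiler's real capture code is not shown in the commit message.

```python
def format_grid_old(grid):
    # Old style: the whole tuple is recorded as one argument.
    return f"grid({grid!r})"


def format_grid_new(grid):
    # New style: the tuple's elements become the call's arguments.
    # Relies on Python's tuple repr, which keeps the trailing comma
    # for one-element tuples.
    return f"grid{grid!r}"
```

With `grid = (16,)` the old form yields `grid((16,))` and the new form yields `grid(16,)`, matching the two strings quoted in the summary. For longer tuples the output here is simply whatever the tuple repr gives, which is an assumption about the captured format.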
Colin Peppler	39357ba06f	[dynamo] don't constrain range on the replacement for a symbol (#129907 ) # Error ``` File "/data/users/colinpeppler/pytorch/torch/_meta_registrations.py", line 704, in sym_constrain_range constrain_range(size, min=min, max=max) File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 898, in constrain_range a.node.shape_env._constrain_range(a.node.expr, min, max) File "/data/users/colinpeppler/pytorch/torch/fx/experimental/recording.py", line 245, in wrapper return fn(*args, **kwargs) File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 2813, in _constrain_range assert isinstance(a, sympy.Symbol), f"constraining non-Symbols NYI, {a} is {type(a)}" torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: AssertionError: constraining non-Symbols NYI, s1 + s2 is <class 'sympy.core.add.Add'> ``` # Context I ran into the following scenario: ``` getitem = ... sym_size_int = torch.ops.aten.sym_size.int(getitem, 0) # this is u0 = s0 + s1 _check_is_size = torch._check_is_size(sym_size_int) # we fail at this guy sym_constrain_range_default = torch.ops.aten.sym_constrain_range.default(sym_size_int, min = 4, max = 1234) # runtime assertion add = sym_size_int + sym_size_int_1 eq = add == sym_size_int _assert_scalar_default = torch.ops.aten._assert_scalar(eq, "Runtime assertion failed for expression Eq(s0 + s1, u0) on node 'eq'") ``` everything but getitem was asserted into the FX graph by insert_deferred_runtime_asserts() `7e4329c258/torch/fx/passes/runtime_assert.py (L38-L52)` In the above scenario, we fail trying to constrain the range on `s0 + s1` which is not a `sympy.Symbol`. And why exactly are we constraining the range on `s0 + s1`? Because it's the replacement for `u0`. # Approach Whenever we try to constrain the range on the replacement of ~~an unbacked symint~~ a non-symbol, just ignore it.
In the scenario above, we'll be okay to ignore it because whenever there's a replacement on an unbacked symint, we will update its range. Hence, no need to constrain the range on `s1 + s1`. We can confirm this with `TORCH_LOGS="+dynamic"`. ``` torch/fx/experimental/symbolic_shapes.py:4737: _update_var_to_range u0 = VR[4, 198] (update) torch/fx/experimental/symbolic_shapes.py:4856: set_replacement u0 = s1 + s2 (trivial_lhs) VR[4, 198] ``` `600bf978ba/torch/fx/experimental/symbolic_shapes.py (L4759-L4764)` Differential Revision: [D59257079](https://our.internmc.facebook.com/intern/diff/D59257079) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129907 Approved by: https://github.com/jingsh	2024-07-02 21:46:40 +00:00
PyTorch MergeBot	c22e66896f	Revert "Fix typo in floordiv solver code that affects flipped relation (#129888 )" This reverts commit 3c6c3b94486d49614bae5e76e7bd6b9579f643d4. Reverted https://github.com/pytorch/pytorch/pull/129888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the updated test starts to fail flakily in trunk somehow, so I am reverting the change to see if it helps ([comment](https://github.com/pytorch/pytorch/pull/129888#issuecomment-2204442653))	2024-07-02 21:16:59 +00:00
wz337	1ddb100318	[FSDP1][Easy] Remove Spammy Log Lin in _runtime_utils.py (#129967 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129967 Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/Skylion007	2024-07-02 21:08:57 +00:00
PyTorch UpdateBot	deefc10dd3	[executorch hash update] update the pinned executorch hash (#129428 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129428 Approved by: https://github.com/pytorchbot	2024-07-02 20:39:39 +00:00
cyy	26de2c2487	[3/N] Enable clang-tidy on torch/csrc/jit/serialization/* (#129850 ) Follows #129300. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129850 Approved by: https://github.com/ezyang	2024-07-02 20:08:48 +00:00
Li-Huai (Allan) Lin	8ec5ba960f	[MPS] Add tensor_lr overloads to fused adam & adamw (#129451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129451 Approved by: https://github.com/janeyx99	2024-07-02 19:46:30 +00:00
Edward Z. Yang	2631a96f2a	Stop updating hints (#129893 ) Some profiling suggests that the repeated maybe evaluate static calls are expensive. Ref: https://github.com/pytorch/pytorch/issues/123964 With test script: ``` import torch import torch._dynamo.config torch._dynamo.config.capture_scalar_outputs = True @torch.compile(fullgraph=True) def f(a, b): xs = b.tolist() for x in xs: torch._check_is_size(x) torch._check(x <= 20) return a.split(xs) N = 20 splits = torch.randint(10, (N,)) sz = splits.sum().item() f(torch.randn(sz), splits) ``` Before: ``` real 0m18.526s user 0m16.555s sys 0m11.031s ``` After: ``` real 0m13.831s user 0m12.152s sys 0m10.941s ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129893 Approved by: https://github.com/lezcano	2024-07-02 19:24:33 +00:00
Anshul Sinha	1f6c1fcd36	[dtensor][debug] add operation tracing to comm_mode (#129017 ) Summary I have added an even more detailed module tracker that now includes the collective counts and operations that happen in each submodule making it easier for users to debug. The tracing now includes the operation's DTensor arguments' input shape and sharding. Like the module collective tracing, the user also has the option to log the tracing table to an output.txt file. I have decided not to include the example output for transformer as it is too many lines. The expected output for the MLP_operation_tracing is shown below: <img width="574" alt="Screenshot 2024-06-25 at 3 33 16 PM" src="https://github.com/pytorch/pytorch/assets/50644008/a09e2504-19d5-4c69-96e8-f84e852d7786"> <img width="467" alt="Screenshot 2024-06-25 at 3 33 45 PM" src="https://github.com/pytorch/pytorch/assets/50644008/55c07d2d-6cb6-410f-82ac-2849bb7bfbbb"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/129017 Approved by: https://github.com/XilunWu	2024-07-02 19:05:05 +00:00
Huy Do	bf05ea2bab	Re-generate Linux build workflows after #124014 (#129976 ) This looks like a landrace as lint passed on #124014 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129976 Approved by: https://github.com/kit1980	2024-07-02 18:57:20 +00:00
Yanbo Liang	080149cb38	[Inductor][FlexAttention] Add helper functions of converting score_mod to block_mask (#129909 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129909 Approved by: https://github.com/Chillee, https://github.com/drisspg ghstack dependencies: #129831, #129859	2024-07-02 18:48:16 +00:00
Yanbo Liang	1f3e2d7877	[Inductor] Rename TemplatedAttention to FlexAttention (#129859 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129859 Approved by: https://github.com/Chillee, https://github.com/drisspg ghstack dependencies: #129831	2024-07-02 18:48:16 +00:00
Michael Lazos	aa7ea6b45c	Add wraps back (#129933 ) Fixes https://github.com/pytorch/pytorch/issues/129922 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129933 Approved by: https://github.com/eqy, https://github.com/janeyx99	2024-07-02 18:24:02 +00:00
Howard Huang	ec789a3c9d	[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369 ) # Changes * small fix in stage error message * Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`. * Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369 Approved by: https://github.com/wconstab ghstack dependencies: #129368	2024-07-02 18:19:28 +00:00
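The third change in #129369 (wrap the runtime in a try-except that logs the timestep and schedule plan, then re-raises) follows a common pattern. A minimal sketch, assuming a hypothetical `run_schedule` function and a flat list of action strings; the real code in `schedules.py` is not reproduced here:

```python
import logging

logger = logging.getLogger(__name__)


def run_schedule(actions, execute):
    """Run each action in order; on failure, log context and re-raise.

    Hypothetical stand-in for the schedule runtime, for illustration only.
    """
    for timestep, action in enumerate(actions):
        try:
            execute(action)
        except Exception:
            # Record where the schedule was when it failed, then let the
            # original exception propagate unchanged.
            logger.error(
                "Schedule failed at timestep %d (action %r); plan: %r",
                timestep,
                action,
                actions,
            )
            raise
```

A bare `raise` keeps the original traceback, so the log line adds context without hiding the underlying error.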
Howard Huang	4eb449f7dc	[pipelining] add small logging section to docs (#129368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129368 Approved by: https://github.com/wconstab	2024-07-02 18:19:28 +00:00
Yanbo Liang	34e94c507a	[Inductor] Make FlexAttention block_mask argument as tuple (#129831 ) Re-organize ```block_mask``` related arguments a tuple to reduce the individual argument number. I was trying to use named tuple, but aot autograd doesn't work well with named tuple. The only downside of using tuple rather than named tuple is we need to use index to access its element. But we only need this at one place, it should be fine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129831 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-07-02 17:18:33 +00:00
Animesh Jain	9105d54c6b	[dynamo][sparse] Graph break on sparse tensors (#129883 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129883 Approved by: https://github.com/ezyang ghstack dependencies: #129830, #129858, #129857, #129881	2024-07-02 16:51:56 +00:00
Animesh Jain	75443d3daf	[dynamic-shapes] Dont create symbol if .item() is a nan (#129881 ) Passes ` PYTORCH_TEST_WITH_DYNAMO=1 pytest test/torch_np/numpy_tests/lib/test_function_base.py::TestInterp::test_scalar_interpolation_point` in the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129881 Approved by: https://github.com/ezyang, https://github.com/zou3519 ghstack dependencies: #129830, #129858, #129857	2024-07-02 16:51:56 +00:00
Nikita Shulga	d146a62e77	[MPS][BE] Introduce `mtl_setBytes` (#129910 ) Which for primitive types calls `[encoder setBytes:&val length:sizeof(val) index:idx];` and for container types passes number of elements equal to the size of the container Pull Request resolved: https://github.com/pytorch/pytorch/pull/129910 Approved by: https://github.com/Skylion007	2024-07-02 16:36:57 +00:00
Shangdi Yu	9fb2dec7a6	[custom ops] Add unknown arg (#129614 ) Fixes #129372 Add a mutated_args="unknown" that pessimistically assumes that all inputs to the operator are being mutated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129614 Approved by: https://github.com/zou3519	2024-07-02 16:10:14 +00:00
Tijmen Blankevoort	e3b3431c42	Fix for HistogramObserver (#129387 ) Summary: There were two problems with the HistogramObserver: 1. It does not work when someone passes a batch_size 1, tensor_size 1 data-point. 2. The Histogram doesn't seem to actually update if the range of the new x falls within the old one These issues were both fixed. On top of this, I greatly simplified the logic for the histogram updating. Now, it doesn't do the downsampling anymore, which saves a ton of memory and code. The accuracy can still be controlled with the upsampling ratio. This ratio was also too high for the accuracy we generally need here, I reduced the default for this. Also the code is cleaner now, much easier to follow what's happening. test_histogram_observer_same_inputs was likely wrong - If I pass 0s and 1s to my histogramobserver, I want them to actually count! The current test now thinks it's good to discard and ignore these values. Test Plan: You can run the included tests. Differential Revision: D58931336 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129387 Approved by: https://github.com/jerryzh168	2024-07-02 15:41:44 +00:00
PyTorch MergeBot	03440a1c13	Revert "Add support for inline_asm_elementwise in Inductor lowerings (#129846 )" This reverts commit badc638eb68c0b07ae3b857e885e6d0137b218aa. Reverted https://github.com/pytorch/pytorch/pull/129846 on behalf of https://github.com/jeffdaily due to introduced ROCm breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/129846#issuecomment-2203519554))	2024-07-02 15:25:34 +00:00
Aart Bik	3fd128361e	[traced-graph][sparse] add relay override for layout_impl (#129930 ) In the "layout()" method of "TensorImpl" defined in the file core/TensorImpl.h, the following code and documentation can be found: ``` Layout layout() const { ... if .. { ... } else if (is_sparse_compressed()) { // Typically, the tensor dispatch keys define the tensor layout // uniquely. This allows using non-virtual layout method for // better performance. However, when tensor's layout depends, // say, on tensor attributes, one must use this execution path // where the corresponding tensor impl class overwrites virtual // layout_impl() method. return layout_impl(); } else { ... } } ``` However, this override was never implemented. This PR put the override in place, to prepare for sparsity propagation in another PR. https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129930 Approved by: https://github.com/ezyang	2024-07-02 15:24:34 +00:00
Edward Z. Yang	dacc33d2fa	Make sym_min/sym_max handle Numpy scalars (#129917 ) Internal xref: https://fb.workplace.com/groups/1069285536500339/posts/7773876449374514/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129917 Approved by: https://github.com/Skylion007	2024-07-02 14:59:20 +00:00
Xuehai Pan	f1df13f023	[BE][Easy] Fix `PYI001`: unprefixed-type-param in `torch/utils/data/datapipes` (#129885 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129885 Approved by: https://github.com/ezyang	2024-07-02 14:56:27 +00:00
Joel Schlosser	257b9c7936	Fix layout for _like() factories on NJTs (#129879 ) Background: this bug was triggering DEBUG=1 asserts in the backward for `unbind()`, which calls `empty_like()`. I found that the NJT implementation of `empty_like()` was redispatching on `values` while blindly passing along all kwargs. This resulted in `empty_like(values, ..., layout=torch.jagged)`, which is incorrect since `values` is strided, tripping the debug assert here: `433b691f98/aten/src/ATen/EmptyTensor.cpp (L305)` This PR explicitly sets `layout=torch.strided` when redispatching `_like()` factories on `values`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129879 Approved by: https://github.com/soulitzer	2024-07-02 14:51:23 +00:00
Aaron Gokaslan	6c2a8b6b38	[Ez][BE]: Enable new stable ruff rules (#129825 ) Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825 Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet	2024-07-02 14:47:10 +00:00
Xu Han	2926655761	[inductor] optimize cpp builder configuration code (#129577 ) Changes: 1. Combine choose isa condition dispatch code. 2. Unify MacOS openmp configuration code. 3. Clean up useless code. Co-authored-by: Jason Ansel <jansel@jansel.net> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-02 14:41:59 +00:00
Aaron Gokaslan	6cb0ad3375	[BE]: Update NCCL submodule to 2.21.5 (#124014 ) Update NCCL to the latest version. This release is mostly bugfixes with a few new minor features. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124014 Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman	2024-07-02 14:39:33 +00:00
Peter Bell	dc75ec252a	[inductor] Fix can_merge check for expr=q0q1 (#129806 ) Fixes #111884 In the minimised reproducer, we have a loop with the index expression `-q0q1` for which in the merge tester we get: ``` expr1 = - 0 * (_merge_tester * 16) = 0 expr2 = - _merge_tester * 0 = 0 ``` so it decides we can merge the dimensions and `q0` is set to `0`, meaning `-q0q1` is always zero! Here I change the test so we have at least one case where no zeros are substituted so we can catch this situation. In the normal strided case we get e.g. ``` expr = 16 q0 + q1 expr1 = 16 * _merge_tester2 + (16 * _merge_tester1) expr2 = 16 * (_merge_tester2 + _merge_tester1) ``` which are still equivalent expressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129806 Approved by: https://github.com/lezcano	2024-07-02 14:30:02 +00:00
leslie-fang-intel	37e3c60897	[Inductor][CPP] Remove redundant INT8-specific logic in the INT8 GEMM template (#129470 ) Summary Remove redundant INT8-specific logic in the INT8 GEMM template to unify the code structure with FP32/BF16/FP16 GEMM Template. Test Plan ``` numactl -C 56-111 -m 1 python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129470 Approved by: https://github.com/jgong5 ghstack dependencies: #128825, #129048, #129049, #129103, #129220, #129221	2024-07-02 13:15:15 +00:00
leslie-fang-intel	b6379591a9	[Inductor][CPP] Pass weight dtype explicitly for cpp gemm template (#129221 ) Summary This PR mainly refactors two things: 1. Passing in the weight's data type explicitly in `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we will reuse `input.dtype` if `input2.dtype` is not explicitly registered. 2. Add a util function to get the output data type and compute data type from the input data type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129221 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825, #129048, #129049, #129103, #129220	2024-07-02 13:06:32 +00:00
leslie-fang-intel	72fa864098	[Inductor][CPP] Enable Quantized Linear with AMX MicroGEMM (#129220 ) Summary Add the AMX micro gemm kernel with int8 data type. Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_amx ``` Next Step - [✓] Unary post op fusion - [✓] Int8 output - [✓] Binary Fusion - [✓] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/129220 Approved by: https://github.com/jgong5 ghstack dependencies: #128825, #129048, #129049, #129103	2024-07-02 12:53:35 +00:00
leslie-fang-intel	a796358330	[Inductor][CPP] Enable Quantized Linear GEMM Template with Binary Fusion (#129103 ) Summary Based on previous PR, add the config to support quantized linear binary - optional(unary) post op fusion. - Activation dtype: uint8 - Weight dtype: int8 - Output dtype: float32/bfloat16/uint8 - Post Op Fusion: with binary and optional[Unary] post operator fusion Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise_binary ``` Next Step - [✓] Unary post op fusion - [✓] Int8 output - [✓] Binary Fusion - [ ] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/129103 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825, #129048, #129049	2024-07-02 12:45:10 +00:00
leslie-fang-intel	86e2d16ba0	[Inductor][Quant] Change the schema of QLinear Binary (#129049 ) Summary We change the schema of QLinear Binary, so it will be easier to enable the corresponding gemm template. - Extra input of binary post-op is a tensor which needs to be an input node of autotuning, we need to move it at front of `output_scale` which is a scalar. - We also move it at front of `bias`, since `bias` is optional tensor for this fusion, but `other` is a must to have for linear binary fusion. Test Plan ``` python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825, #129048	2024-07-02 12:36:38 +00:00
PyTorch MergeBot	07450e9713	Revert "[MPS] Add support for autocast in MPS (#99272 )" This reverts commit 6240cfd5c751bea6ca91dc765085e1d871b22345. Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/jeanschmidt due to introduced breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2203033719))	2024-07-02 12:29:51 +00:00
Fuzzkatt	0441173ab2	Add slowTest marker to test_linalg_solve_triangular_large (#129903 ) In nvidia internal testing, for slower devices such as Orin NX, on large dtypes like complex128, test_linalg_solve_triangular_large is taking multiple hours to complete and timing out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903 Approved by: https://github.com/lezcano	2024-07-02 12:27:12 +00:00
Jack Taylor	95a5958db4	[ROCm] Update nightly triton-rocm pin to release branch (#129361 ) Update pin to tip of https://github.com/triton-lang/triton/commits/release/3.0.x/ following upstream strategy here https://github.com/pytorch/pytorch/pull/126098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129361 Approved by: https://github.com/peterbell10	2024-07-02 11:49:52 +00:00
Edward Z. Yang	3c6c3b9448	Fix typo in floordiv solver code that affects flipped relation (#129888 ) Fixes https://github.com/pytorch/pytorch/issues/123535 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888 Approved by: https://github.com/lezcano	2024-07-02 11:15:03 +00:00
Edward Z. Yang	8ef8240172	Don't mark conversion to float as is_integer = False (#129890 ) Zero is an integer, so if you say is_integer = False, you are also saying the result cannot be zero, which is undesirable. This is exercised by next PR in the stack. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129890 Approved by: https://github.com/lezcano	2024-07-02 11:08:09 +00:00
Edward Z. Yang	eb1ff76f23	Make are_strides_like_channels_last size oblivious (#129677 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129677 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #129869	2024-07-02 11:05:20 +00:00
Edward Z. Yang	ebeeb22669	Correctly put mark_unbacked symbols in shape_env_to_source_to_symbol_cache (#129869 ) Internal xref: https://www.internalfb.com/intern/anp/view/?source=version_selector&id=5534845 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129869 Approved by: https://github.com/albanD	2024-07-02 11:05:20 +00:00
Xu Han	567dd1a3ca	[inductor] unificate toolchain code. (#129816 ) This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 2, and it is a follow-up PR to https://github.com/pytorch/pytorch/pull/129789 Changes: 1. Unify cpp builder's toolchain code. 2. Move all build related code to `cpp_builder.py`. 3. Optimize the import logic of `codecache.py`, `cpp_builder.py` and `cpu_vec_isa.py`, following: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816 Approved by: https://github.com/jansel	2024-07-02 09:52:06 +00:00
chilli	badc638eb6	Add support for inline_asm_elementwise in Inductor lowerings (#129846 ) This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it. <img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846 Approved by: https://github.com/shunting314	2024-07-02 09:31:38 +00:00
awayzjj	ccc4ee7793	check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839 ) Fixes https://github.com/pytorch/pytorch/issues/127043 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129839 Approved by: https://github.com/lezcano	2024-07-02 09:20:49 +00:00
Jeff Willette	5c9d5272e4	fixes #124582 (#128483 ) added check for existence of outputs requiring grad to make_graphed_callables. added new test case, updated existing test case to include parameterless modules. Fixes #124582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128483 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-07-02 08:45:59 +00:00
Haoci Zhang	1ad683033b	Implemented flexible PP schedule (#129597 ) Enabled some cases to work where num_microbatches % pp_size != 0. Using the flex_pp schedule, we will have num_rounds = max(1, n_microbatches // pp_group_size) and it works as long as n_microbatches % num_rounds is 0. As a few examples, support pp_group_size = 4, n_microbatches = 10. We will have num_rounds = 2 and n_microbatches % 2 is 0. pp_group_size = 4, n_microbatches = 3. We will have num_rounds = 1 and n_microbatches % 1 is 0. Moved over from PiPPy (https://github.com/pytorch/PiPPy/pull/1129) Tested using the config in (1), schedule looks like the following graph: ``` =========== ALL_RANK_ACTIONS =========== Rank 0 Rank 1 Rank 2 Rank 3 Step 00: F0_s0 None None None Step 01: F1_s0 F0_s1 None None Step 02: F2_s0 F1_s1 F0_s2 None Step 03: F3_s0 F2_s1 F1_s2 F0_s3 Step 04: F4_s0 F3_s1 F2_s2 F1_s3 Step 05: F0_s4 F4_s1 F3_s2 F2_s3 Step 06: F1_s4 F0_s5 F4_s2 F3_s3 Step 07: F2_s4 F1_s5 F0_s6 F4_s3 Step 08: F3_s4 F2_s5 F1_s6 F0_s7 Step 09: F4_s4 F3_s5 None B0_s7 Step 10: F5_s0 None F2_s6 F1_s7 Step 11: None None B0_s6 B1_s7 Step 12: None F4_s5 F3_s6 F2_s7 Step 13: None B0_s5 B1_s6 B2_s7 Step 14: F6_s0 F5_s1 F4_s6 F3_s7 Step 15: B0_s4 B1_s5 B2_s6 B3_s7 Step 16: F7_s0 F6_s1 F5_s2 F4_s7 Step 17: B1_s4 B2_s5 B3_s6 B4_s7 Step 18: F8_s0 F7_s1 F6_s2 F5_s3 Step 19: B2_s4 B3_s5 B4_s6 B0_s3 Step 20: F9_s0 F8_s1 F7_s2 F6_s3 Step 21: B3_s4 B4_s5 B0_s2 B1_s3 Step 22: F5_s4 F9_s1 F8_s2 F7_s3 Step 23: B4_s4 B0_s1 B1_s2 B2_s3 Step 24: F6_s4 F5_s5 F9_s2 F8_s3 Step 25: B0_s0 B1_s1 B2_s2 B3_s3 Step 26: F7_s4 F6_s5 F5_s6 F9_s3 Step 27: B1_s0 B2_s1 B3_s2 B4_s3 Step 28: F8_s4 F7_s5 F6_s6 F5_s7 Step 29: B2_s0 B3_s1 B4_s2 B5_s7 Step 30: F9_s4 F8_s5 F7_s6 F6_s7 Step 31: B3_s0 B4_s1 B5_s6 B6_s7 Step 32: None F9_s5 F8_s6 F7_s7 Step 33: B4_s0 B5_s5 B6_s6 B7_s7 Step 34: None None F9_s6 F8_s7 Step 35: B5_s4 B6_s5 B7_s6 B8_s7 Step 36: None None None F9_s7 Step 37: B6_s4 B7_s5 B8_s6 B9_s7 Step 38: None None None None Step 39: B7_s4 
B8_s5 B9_s6 B5_s3 Step 40: None None None None Step 41: B8_s4 B9_s5 B5_s2 B6_s3 Step 42: None None None None Step 43: B9_s4 B5_s1 B6_s2 B7_s3 Step 44: None None None None Step 45: B5_s0 B6_s1 B7_s2 B8_s3 Step 46: None None None None Step 47: B6_s0 B7_s1 B8_s2 B9_s3 Step 48: None None None Step 49: B7_s0 B8_s1 B9_s2 Step 50: None None Step 51: B8_s0 B9_s1 Step 52: None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129597 Approved by: https://github.com/H-Huang	2024-07-02 07:54:38 +00:00
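The round arithmetic described in the flexible PP schedule commit above can be sketched in a few lines. This is only an illustration of the stated formula, not the code from the PR; the function names `flex_pp_num_rounds` and `is_supported` are hypothetical.

```python
def flex_pp_num_rounds(n_microbatches: int, pp_group_size: int) -> int:
    """Number of rounds per the commit message: max(1, n_microbatches // pp_group_size)."""
    if n_microbatches <= 0 or pp_group_size <= 0:
        raise ValueError("n_microbatches and pp_group_size must be positive")
    return max(1, n_microbatches // pp_group_size)


def is_supported(n_microbatches: int, pp_group_size: int) -> bool:
    """The commit message says the schedule works when n_microbatches % num_rounds == 0."""
    num_rounds = flex_pp_num_rounds(n_microbatches, pp_group_size)
    return n_microbatches % num_rounds == 0


# The two examples given in the commit message.
print(flex_pp_num_rounds(10, 4), is_supported(10, 4))
print(flex_pp_num_rounds(3, 4), is_supported(3, 4))
```

Under this rule a configuration such as 11 microbatches on 4 ranks would be rejected, since 11 // 4 gives 2 rounds and 11 is not divisible by 2.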
Yu, Guangye	3e2df3ca9d	Add xpu to getAccelerator (#129205 ) # Motivation Add `xpu` support to `getAccelerator`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205 Approved by: https://github.com/albanD, https://github.com/gujinghui ghstack dependencies: #129463	2024-07-02 06:48:24 +00:00
Yu, Guangye	6353a12e6a	XPUHooksInterface inherits from AcceleratorHooksInterface (#129463 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463 Approved by: https://github.com/gujinghui, https://github.com/albanD	2024-07-02 06:48:24 +00:00
Xu Han	76259ebfdd	[inductor] split cpu vec isa to dedicate file (keep git history) (#129789 ) This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1 Changes: 1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`. 2. Make `cpu_vec_isa.py` the dedicated file for CPU vec isa. This also makes it easier to extend to more architectures and vec isa. 3. Update code for above changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-07-02 05:29:05 +00:00
Jovian Anthony Jaison	f6edd1f7c9	[BE] Make ActivationWrapper an abstract class (#129808 ) Fixes #95481 Test Plan: Unit tested checkpoint_wrapper.py by instantiating ActivationWrapper and got TypeError as expected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129808 Approved by: https://github.com/Skylion007	2024-07-02 04:29:43 +00:00
PyTorch MergeBot	c2d0b7b96d	Revert "[ROCm] std::clamp work-around for hip-clang compiler (#127812 )" This reverts commit 8c2c3a03fb87c3568a22362d83b00d82b9fb3db2. Reverted https://github.com/pytorch/pytorch/pull/127812 on behalf of https://github.com/ezyang due to windows trunk job failing ([comment](https://github.com/pytorch/pytorch/pull/127812#issuecomment-2201653245))	2024-07-02 01:52:31 +00:00
Kulin Seth	6240cfd5c7	[MPS] Add support for autocast in MPS (#99272 ) Fixes https://github.com/pytorch/pytorch/issues/88415 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272 Approved by: https://github.com/malfet	2024-07-02 01:49:52 +00:00
Howard Huang	600bf978ba	[Pipelining] Add to/from CSV format and improved __repr__ (#129264 ) _Action.__repr__ gets rearranged so it doesn't require an underscore or a 's' prefix, but still keeps multi-digit stage and microbatch indices separated by an alpha character indicating the action type. to/from CSV methods allow dumping a generated schedule to CSV format for offline visualization or manual editing in a spreadsheet and reloading to use at runtime. Co-authored-by: Howard Huang <howardhuang@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129264 Approved by: https://github.com/H-Huang	2024-07-02 01:26:23 +00:00
wz337	83e6ec2ccd	[FSDP2+TP] Disable 2D state_dict (#129519 ) Gonna fill in the RFC but just want to run CI to see if anything else breaks. Test: ``` python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_raise_not_implemented_state_dict_if_2d ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129519 Approved by: https://github.com/awgu	2024-07-02 01:26:14 +00:00
cyy	46366888d7	Remove outdated CMake code (#129851 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129851 Approved by: https://github.com/ezyang	2024-07-02 00:40:37 +00:00
Nikita Shulga	7e4329c258	[EZ][BE] Bump min cmake version to 3.18 (#129906 ) As this is a min CMake version supported by top level PyTorch Hides ``` CMake Deprecation Warning at aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt:7 (cmake_minimum_required): Compatibility with CMake < 3.5 will be removed from a future version of CMake. Update the VERSION argument <min> value or use a ...<max> suffix to tell CMake that the project does not need compatibility with older versions. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129906 Approved by: https://github.com/kit1980	2024-07-01 23:06:49 +00:00
Zain Rizvi	9645eaaaec	[BE] Improve logging for runner-determinator (#129679 ) This lets us be more flexible about what data we output and throwing exceptions. It's also less likely to break when others make changes (e.g. any print statement would have broken this code before since the printed output was expected to only be a json) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129679 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt, https://github.com/Skylion007	2024-07-01 22:31:35 +00:00
soulitzer	eeef68671d	[autograd] Do not detach when unpacking tensors that do not require grad (#127959 ) In this PR: - Ensure that if a tensor not requiring grad is saved for backward unpacking does not trigger a detach (unless the user installs a saved tensor pack hook that returns a tensor requiring grad). - Update non-reentrant checkpoint to also no longer detach for this case. Alternatives: - For custom autograd Function, you could directly save on ctx to work around this, but that would not work for when we switch to using custom ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127959 Approved by: https://github.com/YuqingJ ghstack dependencies: #125795, #128545, #129262	2024-07-01 21:57:36 +00:00
Jithun Nair	87693b534c	[ROCm] Use AOTriton as a dynamic library (#129094 ) This PR enables using AOTriton as a shared library dependency instead of a static one. Resolves the issue of linker errors when trying to build PyTorch for a lot of (>7 or so) gfx archs due to huge size of aotriton static library. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094 Approved by: https://github.com/malfet	2024-07-01 21:39:27 +00:00
Jeff Daily	8c2c3a03fb	[ROCm] std::clamp work-around for hip-clang compiler (#127812 ) Fixes #127666. Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work-around this by replacing std::clamp with min and max for USE_ROCM builds. Patch comes from @lamikr. Modified to use #ifndef USE_ROCM. https://github.com/lamikr/rocm_sdk_builder/pull/37 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-01 21:00:33 +00:00
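The work-around in the commit above replaces `std::clamp` with a composition of min and max. The actual patch is C++ guarded by `USE_ROCM`; the Python sketch below only illustrates the equivalence it relies on, and the helper name `clamp` is hypothetical.

```python
def clamp(value, low, high):
    """Clamp value into [low, high] using only min and max.

    Mirrors the shape of the work-around: min(max(value, low), high).
    The result is only meaningful when low <= high, so that is checked here.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return min(max(value, low), high)


print(clamp(5, 0, 3))   # above the range
print(clamp(-1, 0, 3))  # below the range
print(clamp(2, 0, 3))   # inside the range
```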
Andres Lugo-Reyes	750c701e49	[ROCm] Update xlogy comment detailing issue (#128151 ) update skip reason comment with more accurate descriptor Pull Request resolved: https://github.com/pytorch/pytorch/pull/128151 Approved by: https://github.com/zou3519	2024-07-01 20:58:58 +00:00
Animesh Jain	78cda9a810	[symbolic-shapes] Add FloatPow in the symbolic shape guard closure (#129857 ) Fixes test failure raised in the next diff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129857 Approved by: https://github.com/ezyang ghstack dependencies: #129830, #129858	2024-07-01 20:44:59 +00:00
Animesh Jain	53d67165c0	[dynamo] Skip FUNCTION_MATCH guards for descriptors (#129858 ) Hard to write tests. This PR makes many test pass in the stack such as `PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_ao_sparsity.py::TestComposability::test_convert_without_squash_mask` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129858 Approved by: https://github.com/mlazos ghstack dependencies: #129830	2024-07-01 20:44:59 +00:00
Jithun Nair	f86dbae247	Fix typo in lxml requirement (#129695 ) Extra period at the end throws off pip: ``` root@f04177cab5af:/data/pytorch# pip install -r .ci/docker/requirements-ci.txt ERROR: Invalid requirement: 'lxml==5.0.0.': Expected end or semicolon (after version specifier) lxml==5.0.0. ~~~~~~~^ (from line 309 of .ci/docker/requirements-ci.txt) ``` Not sure why CI docker builds do not have an issue with this period. Typo comes from `f73b1b9388` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129695 Approved by: https://github.com/huydhn	2024-07-01 19:43:37 +00:00
Huy Do	fdd0a7f9b4	Run test_mps_allocator_module serially (#129340 ) Not sure why this test starts to fail (maybe runner update) `8a2fed7e6a/1` or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340 Approved by: https://github.com/kit1980, https://github.com/malfet	2024-07-01 18:44:48 +00:00
PyTorch MergeBot	b02186ffc1	Revert "Allow get attributes on DDP similar to FSDP (#128620 )" This reverts commit 065c386990dce444db17eff7b254bf79e82450ef. Reverted https://github.com/pytorch/pytorch/pull/128620 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if the trunk error on inductor is fixed ([comment](https://github.com/pytorch/pytorch/pull/128620#issuecomment-2200717876))	2024-07-01 17:57:00 +00:00
Hao Dong	bb0f3df562	Fix index issues in torch.fx.interpreter (#129527 ) Summary: Fix index issues in torch.fx.interpreter by changing range from `[:i]` to `[:i+1]`. Because if there are `n` elements, the last index `i` of the `for` loop is `n-1` and `[:i]` can only get access to elements from index `0` to index `n-2` and miss the last element. `[:i+1]` can get access to all elements correctly. Test Plan: Test with Node API Differential Revision: D59028395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129527 Approved by: https://github.com/dulinriley	2024-07-01 17:46:13 +00:00
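The off-by-one described in the commit above is easy to reproduce with a plain list. This sketch does not use `torch.fx`; the helper names are hypothetical and only show why `[:i]` misses the last element while `[:i+1]` does not.

```python
def prefixes_exclusive(seq):
    """Slices seq[:i] for each loop index i; the final slice stops at index n-2."""
    return [seq[:i] for i in range(len(seq))]


def prefixes_inclusive(seq):
    """Slices seq[:i + 1] for each loop index i; the final slice covers every element."""
    return [seq[: i + 1] for i in range(len(seq))]


items = ["a", "b", "c"]
print(prefixes_exclusive(items)[-1])  # last element "c" is missing
print(prefixes_inclusive(items)[-1])  # all elements present
```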
zhangfeiv0	1956d87c1f	Increase riscv implementation in DepthwiseConvKernel (#127867 ) Summary: Increase riscv implementation in DepthwiseConvKernel. Compile: export USE_CUDA=0 export USE_DISTRIBUTED=0 export USE_MKLDNN=0 export MAX_JOBS=4 export CMAKE_CXX_COMPILER=clang++ export CMAKE_C_COMPILER=clang export CMAKE_C_FLAGS=-march=rv64gcv export CMAKE_CXX_FLAGS=-march=rv64gcv python3 setup.py develop --cmake Test Plan: Correctness - Check the results of the run before and after test_convolution.py python3 test/run_test.py --include nn/test_convolution --keep-going Before: ===== 9 passed, 13 skipped, 564 deselected in 46.55s ===== The following tests failed consistently: test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64 test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32 After: ===== 9 passed, 13 skipped, 564 deselected in 48.13s ===== The following tests failed consistently: test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d 
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64 test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32 Performance - Compare the results before and after mobilenet_v2 python3 run.py mobilenet_v2 -d cpu -t eval Before: Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32. CPU Wall Time per batch: 19590.647 milliseconds CPU Wall Time: 19590.647 milliseconds Time to first batch: 5271.3518 ms CPU Peak Memory: 0.3809 GB After: Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32. CPU Wall Time per batch: 13523.530 milliseconds CPU Wall Time: 13523.530 milliseconds Time to first batch: 2696.0304 ms CPU Peak Memory: 0.3408 GB Versions: Clang version: 17.0.2 Platform: CanMV-K230 Architecture: riscv64 OS: Ubuntu 23.10 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127867 Approved by: https://github.com/malfet	2024-07-01 17:11:21 +00:00
PyTorch MergeBot	c9dc9887db	Revert "Enable UFMT on test/test_public_bindings.py (#128389 )" This reverts commit fe5424d0f8604f6e66d827ae9f94b05cb7119d55. Reverted https://github.com/pytorch/pytorch/pull/128389 on behalf of https://github.com/clee2000 due to broke test_mps.py::TestMPS::test_mps_allocator_module? https://github.com/pytorch/pytorch/actions/runs/9730750763/job/26854426294 `fe5424d0f8` Not sure how this change can do that. Build failed on PR so test didn't run ([comment](https://github.com/pytorch/pytorch/pull/128389#issuecomment-2200589719))	2024-07-01 16:34:04 +00:00
PyTorch MergeBot	433b691f98	Revert "[inductor] optimize cpp builder configuration code (#129577 )" This reverts commit 2e3ff394bf94d3b9cbab0fe8a93a9ea7c9cb4267. Reverted https://github.com/pytorch/pytorch/pull/129577 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D59181128 ([comment](https://github.com/pytorch/pytorch/pull/129577#issuecomment-2200554824))	2024-07-01 16:14:06 +00:00
PyTorch MergeBot	19e17216a2	Revert "[inductor] split cpu vec isa to dedicate file (keep git history) (#129789 )" This reverts commit 58f346c874a8a982679b4d4f3876602cc05d66d4. Reverted https://github.com/pytorch/pytorch/pull/129789 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/129577 ([comment](https://github.com/pytorch/pytorch/pull/129789#issuecomment-2200545144))	2024-07-01 16:08:44 +00:00
PyTorch MergeBot	b6dc37bb4e	Revert "[inductor] unificate toolchain code. (#129816 )" This reverts commit 67c9ec2b6d12ffd0e83861dcc16c1cd1a9b74d35. Reverted https://github.com/pytorch/pytorch/pull/129816 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #129577 ([comment](https://github.com/pytorch/pytorch/pull/129816#issuecomment-2200539687))	2024-07-01 16:06:22 +00:00
cyy	ca5d13c672	[1/N] Enable unused variable warnings on torch_cpu and fix some violations (#128670 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128670 Approved by: https://github.com/ezyang	2024-07-01 14:56:46 +00:00
PyTorch MergeBot	e385bf8ef8	Revert "[halide-backend] Disable split reductions for Halide (#129320 )" This reverts commit a18eb651d352e45860a96869abaf9fb7b215eac6. Reverted https://github.com/pytorch/pytorch/pull/129320 on behalf of https://github.com/jeanschmidt due to This PR is breaking internal builds, please check comments on it D59204360 ([comment](https://github.com/pytorch/pytorch/pull/129320#issuecomment-2200351678))	2024-07-01 14:44:35 +00:00
PyTorch MergeBot	a83eaf1c3a	Revert "[halide-backend] Support manual schedules (#129321 )" This reverts commit 9ae78a578caff195821ad535a9e8d8ef59552142. Reverted https://github.com/pytorch/pytorch/pull/129321 on behalf of https://github.com/jeanschmidt due to Reverting, as it is required to do so in order to revert #129320 ([comment](https://github.com/pytorch/pytorch/pull/129321#issuecomment-2200345664))	2024-07-01 14:42:33 +00:00
Xu Zhao	cc9b005bf2	Enable torchao nightly workflow (#129779 ) Summary: Make the following improvements: * Schedule the torchao benchmark nightly * Enable torchbench, timm, and huggingface models * Refactor the benchmarking script to better arrange the benchmarking groups Test workflow: https://github.com/pytorch/benchmark/actions/runs/9705589352 X-link: https://github.com/pytorch/benchmark/pull/2336 Differential Revision: D59074571 Pulled By: xuzhao9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129779 Approved by: https://github.com/jerryzh168	2024-07-01 14:28:38 +00:00
Xuehai Pan	75f64e1203	Fix test `test_type_hints.py::TestTypeHints::test_doc_examples` (#129829 ) As per the title, this test was broken for months. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129829 Approved by: https://github.com/ezyang	2024-07-01 13:28:37 +00:00
Jack Taylor	e1b426b345	[ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650 ) Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650 Approved by: https://github.com/hongxiayang, https://github.com/malfet	2024-07-01 11:40:09 +00:00
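The fallback described in the commit above can be sketched as a small lookup. This is not the body of `_parse_visible_devices`; the function `visible_devices_value` and its `is_hip` flag are assumptions made for illustration, and the environment is passed in as a mapping so the behavior is easy to check.

```python
from typing import Mapping, Optional


def visible_devices_value(env: Mapping[str, str], is_hip: bool) -> Optional[str]:
    """Pick the device-visibility string, assuming the fallback order in the commit message.

    On a HIP (ROCm) build, HIP_VISIBLE_DEVICES wins when it is set;
    otherwise CUDA_VISIBLE_DEVICES is used. Returns None if neither is set.
    """
    if is_hip:
        hip_value = env.get("HIP_VISIBLE_DEVICES")
        if hip_value is not None:
            return hip_value
    return env.get("CUDA_VISIBLE_DEVICES")


# A workload that only sets CUDA_VISIBLE_DEVICES keeps working on a HIP build.
print(visible_devices_value({"CUDA_VISIBLE_DEVICES": "0,1"}, is_hip=True))
```

In real use the mapping would be `os.environ`; taking it as a parameter here avoids mutating process state in the example.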
cyy	313eec02cc	Add hash function of std::string_view to torch/csrc/lazy/core/hash.h (#128800 ) For easier moving of c10::string_view to std::string_view in PyTorch/XLA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128800 Approved by: https://github.com/ezyang	2024-07-01 09:53:34 +00:00
Ramana Cherukuri	f6a0be5023	Add warpSize to Device properties (#128449 ) Adding warp_size to CudaDeviceProperties. >>> import torch >>> prop = torch.cuda.get_device_properties(torch.cuda.current_device()) >>> prop.warp_size 64 >>> @jeffdaily @pruthvistony @jithunnair-amd @ROCmSupport Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128449 Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/jithunnair-amd, https://github.com/malfet	2024-07-01 09:13:32 +00:00
Nikita Shulga	04a0d85620	[BE] Print all pip packages installed on the system after TorchChat (#129809 ) To ease debugging of regressions like the ones that happened last Wed, when the release of a new version of torchao resulted in TorchBench downgrading the pytorch version to 2.3.1. Test plan: Look at the log output, for example https://github.com/pytorch/pytorch/actions/runs/9720408234/job/26832794157?pr=129809#step:20:1158 contains ``` + echo 'Print all dependencies after TorchBench is installed' Print all dependencies after TorchBench is installed + python -mpip freeze absl-py==2.1.0 accelerate==0.31.0 aiohttp==3.9.5 aiosignal==1.3.1 astunparse==1.6.3 async-timeout==4.0.3 attrs==23.2.0 audioread==3.0.1 beautifulsoup4==4.12.3 boto3==1.19.12 botocore==1.22.12 bs4==0.0.2 cachetools==5.3.3 certifi==2024.6.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129809 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-07-01 04:51:53 +00:00
cyy	eb1583dbc1	[2/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129300 ) Follows #129055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129300 Approved by: https://github.com/ezyang	2024-07-01 01:09:00 +00:00
Animesh Jain	e62073d799	[dynamo] Skip FUNCTION_MATCH on method-wrapper objects (#129830 ) Fixes https://github.com/pytorch/pytorch/issues/118563 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129830 Approved by: https://github.com/jansel	2024-06-30 20:21:18 +00:00
eqy	24b6c5a41f	[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587 ) Fix for #129579 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2024-06-30 19:37:44 +00:00
eqy	f845a7a91a	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-30 19:22:16 +00:00
eqy	7b0e9a27ba	Restore `allowed_info` in OOM message when applicable (#129546 ) Seems to be removed following #99699? Pull Request resolved: https://github.com/pytorch/pytorch/pull/129546 Approved by: https://github.com/Skylion007	2024-06-30 17:22:32 +00:00
Eddie Yan	8755e035d2	[CUDA][Pooling] Fix 64-bit indexing in `avg_pool_2d` backward attempt 2 (#129818 ) Somehow the original PR was missing the `CUDA_KERNEL_LOOP_TYPE` change??? Thanks @johnc-keen @Chillee for the great repro! (#129785) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129818 Approved by: https://github.com/Chillee, https://github.com/Skylion007	2024-06-30 16:52:33 +00:00
eqy	4dd3cff234	[CUDA] Fix more `DeviceIndex` printing (#128540 ) Same `char` dtype causing device index `0` to be interpreted as a null-terminator, see also #123984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128540 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2024-06-30 16:44:14 +00:00
eqy	68484621fe	[cuDNN][functorch] Bump tolerances for `nn.functional.conv2d` in `test_vmap_autograd_grad` (#129796 ) Newer versions of cuDNN can dispatch to a winograd kernel here on A100 which affects numerics a bit Pull Request resolved: https://github.com/pytorch/pytorch/pull/129796 Approved by: https://github.com/Skylion007	2024-06-30 16:36:12 +00:00
Weizhuo Zhang	fff633f087	[CI] Enable AOT inductor FP32 accuracy test for CPU (#129040 ) This PR enabled AOT inductor backend FP32 accuracy check for CPU in CI workflow, which could catch AOT inductor issue at early stage. Test Time cost: \| Suite \| Precision \| Time cost \| \|------------- \|----------- \|----------- \| \| Huggingface \| FP32 \| 1h12m \| \| Timm models \| FP32 \| 1h32m \| \| Torchbench \| FP32 \| 1h40m \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/129040 Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/malfet	2024-06-30 14:00:09 +00:00
Randolf Scholz	8a5fda0377	added type hints for __contains__ (#129653 ) - Fixes #129646 - Added test in test/typing/reveal/tensor_constructors.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129653 Approved by: https://github.com/ezyang	2024-06-30 11:49:11 +00:00
leslie-fang-intel	1a689ea38c	[Inductor][CPP] Enable Quantized Linear GEMM Template with INT8 output and Unary Post Op (#129048 ) Summary Based on the previous PR, add the config to support int8 output and unary post op fusion with `ReLU` and `GeLU` - Activation dtype: uint8 - Weight dtype: int8 - Output dtype: float32/bfloat16/uint8 - Post Op Fusion: with unary post operator fusion Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise ``` Next Step - [✓] Unary post op fusion - [✓] Int8 output - [ ] Binary Fusion - [ ] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/129048 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #128825	2024-06-30 09:53:55 +00:00
leslie-fang-intel	35a197defa	[Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output (#128825 ) Summary Support the int8 GEMM Template with a reference MicroInt8GEMM kernel for the case: - Activation dtype: uint8 - Weight dtype: int8 - Output dtype: float32/bfloat16 - Post Op Fusion: without unary post operator fusion Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise ``` Next Step - [ ] Unary post op fusion - [ ] Int8 output - [ ] Binary Fusion - [ ] AMX int8 MicroGEMM Kernel Pull Request resolved: https://github.com/pytorch/pytorch/pull/128825 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-30 09:45:43 +00:00
dilililiwhy	fe5424d0f8	Enable UFMT on test/test_public_bindings.py (#128389 ) Part of: https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: > test/test_public_bindings.py Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Co-authored-by: Edward Z. Yang <ezyang@fb.com> Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389 Approved by: https://github.com/ezyang	2024-06-30 08:49:51 +00:00
Xuehai Pan	4ee1cb9b95	[BE][Easy] replace `import pathlib` with `from pathlib import Path` (#129426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426 Approved by: https://github.com/malfet	2024-06-30 01:36:07 +00:00
PyTorch MergeBot	2effbcfcd8	Revert "[BE][Easy] replace `import pathlib` with `from pathlib import Path` (#129426 )" This reverts commit 6d75604ef135925e8c85363c2f4a5e0b6f7fef28. Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))	2024-06-29 23:24:06 +00:00
Xu Han	67c9ec2b6d	[inductor] unificate toolchain code. (#129816 ) This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 2, and it is a follow-up PR to https://github.com/pytorch/pytorch/pull/129789 Changes: 1. Unify the cpp builder's toolchain code. 2. Move all build related code to `cpp_builder.py`. 3. Optimize the import logic of `codecache.py`, `cpp_builder.py` and `cpu_vec_isa.py`, following: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816 Approved by: https://github.com/jansel	2024-06-29 23:21:13 +00:00
leslie-fang-intel	3fec0efd34	[Inductor][CPP] Support vectorization of bitwise fn (#129733 ) Summary When checking the vectorization status among 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: bitwise_and`. In this PR, we add vectorization support for 6 bitwise functions. We also remove `bitwise_xor` from the `ops_to_bool` list, which sets the output data type to bool in data type propagation. This seems wrong since according to this doc https://pytorch.org/docs/stable/generated/torch.bitwise_xor.html, it should return the same integral data type as the input, and the testcase `test_bitwise3` failed due to this issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_bitwise python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_bitwise3 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129733 Approved by: https://github.com/jgong5, https://github.com/Skylion007	2024-06-29 17:25:27 +00:00
Xuehai Pan	6d75604ef1	[BE][Easy] replace `import pathlib` with `from pathlib import Path` (#129426 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426 Approved by: https://github.com/malfet	2024-06-29 15:42:09 +00:00
Xuehai Pan	7837a12474	[BE] enforce style for empty lines in import segments (#129751 ) This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet: > Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one) `usort` allows empty lines within import segments. For example, `usort` do not change the following code: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` This PR first sort imports via `isort`, then re-sort the file using `ufmt` (`usort` + `black`). This enforces the following import style: 1. no empty lines within segments. 2. single empty line between segments. 3. two spaces after import statements. All the code snippets above will be formatted to: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` which produces a consistent code style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751 Approved by: https://github.com/malfet	2024-06-29 14:15:24 +00:00
Jason Ansel	9ae78a578c	[halide-backend] Support manual schedules (#129321 ) Currently using this for some by-hand hacking, but might need to implement our own scheduler later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321 Approved by: https://github.com/shunting314 ghstack dependencies: #126417, #129025, #129026, #127506, #129036, #129320	2024-06-29 14:06:28 +00:00
Jason Ansel	a18eb651d3	[halide-backend] Disable split reductions for Halide (#129320 ) In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025, #129026, #127506, #129036	2024-06-29 14:06:28 +00:00
Jason Ansel	4cb8cb04a7	[halide-backend] Enable bfloat16 support (#129036 ) Requires https://github.com/halide/Halide/pull/8255 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129036 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025, #129026, #127506	2024-06-29 14:06:25 +00:00
Jason Ansel	b93bf55b6a	[halide-backend] Add GPU support (#127506 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025, #129026	2024-06-29 14:06:21 +00:00
Jason Ansel	86cadc6385	[halide-backend] Dimension-based indexing (#129026 ) Prior to this the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs. Halide prefers dimensions, and this 1D index triggers a lot of bugs and perf issues. This PR infers dimensions and changes the indexing in the generated code. Before ```py @hl.generator(name="kernel") class Kernel: in_ptr0 = hl.InputBuffer(hl.Float(32), 1) out_ptr3 = hl.OutputBuffer(hl.Float(32), 2) def generate(g): in_ptr0 = g.in_ptr0 out_ptr3 = g.out_ptr3 xindex = hl.Var('xindex') rindex = hl.Var('rindex') r1 = rindex x0 = xindex idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)]) odom = hl.RDom([hl.Range(0, 16)]) rdom = hl.RDom([hl.Range(0, 32)]) xindex_idom = idom.x xindex_odom = odom.x rindex_idom = idom.y r1_idom = rindex_idom x0_idom = xindex_idom x0_odom = xindex_odom tmp0 = hl.Func('tmp0') tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)] tmp1 = hl.Func('tmp1') tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex]) tmp2 = hl.Func('tmp2') tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex] tmp3 = hl.Func('tmp3') tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex]) tmp4 = hl.Func('tmp4') tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex]) tmp5 = hl.Func('tmp5') tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex] out_ptr3_i0 = hl.Var('out_ptr3_i0') out_ptr3_i1 = hl.Var('out_ptr3_i1') out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1]) assert g.using_autoscheduler() in_ptr0.set_estimates([hl.Range(0, 512)]) out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)]) ``` After ```py @hl.generator(name="kernel") class Kernel: in_ptr0 = hl.InputBuffer(hl.Float(32), 2) out_ptr3 = hl.OutputBuffer(hl.Float(32), 2) def generate(g): in_ptr0 = g.in_ptr0 out_ptr3 = g.out_ptr3 h0 = hl.Var('h0') h1 = hl.Var('h1') rdom = 
hl.RDom([hl.Range(0, 32)]) hr1 = rdom[0] tmp0 = hl.Func('tmp0') tmp0[h0, h1] = in_ptr0[h0, h1,] tmp1 = hl.Func('tmp1') tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1]) tmp2 = hl.Func('tmp2') tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1] tmp3 = hl.Func('tmp3') tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1]) tmp4 = hl.Func('tmp4') tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1]) tmp5 = hl.Func('tmp5') tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1] out_ptr3[h0, h1,] = hl.cast(hl.Float(32), tmp5[h0, h1]) assert g.using_autoscheduler() in_ptr0.dim(0).set_min(0) in_ptr0.dim(0).set_stride(1) in_ptr0.dim(0).set_extent(32) in_ptr0.dim(1).set_min(0) in_ptr0.dim(1).set_stride(32) in_ptr0.dim(1).set_extent(16) in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)]) out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)]) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129026 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417, #129025	2024-06-29 14:06:16 +00:00
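The indexing change described in #129026 can be illustrated without Halide. The sketch below is plain Python, not generated Inductor code. It assumes the 16x32 float buffer from the example above (stride 1 on dimension 0, stride 32 on dimension 1) and checks that the old flat index `r1 + (32*x0)` and the new per-dimension index `[h0, h1]` address the same element. All helper names are hypothetical.

```python
# Illustrative only: compares the flat (1D) index used before #129026
# with the dimension-based (2D) index used after it.
EXTENT_0 = 32      # extent of dim 0 (the reduction axis), stride 1
EXTENT_1 = 16      # extent of dim 1 (the row axis), stride 32
STRIDES = (1, 32)  # matches the set_stride() calls in the "After" kernel


def flat_index(r1: int, x0: int) -> int:
    """Old style: a single 1D offset into a 512-element buffer."""
    return r1 + 32 * x0


def strided_offset(h0: int, h1: int) -> int:
    """New style: per-dimension coordinates combined with strides."""
    return h0 * STRIDES[0] + h1 * STRIDES[1]


def all_offsets_match() -> bool:
    """True if both schemes agree for every coordinate in the buffer."""
    return all(
        flat_index(h0, h1) == strided_offset(h0, h1)
        for h1 in range(EXTENT_1)
        for h0 in range(EXTENT_0)
    )


# Every element of the 512-element buffer is reachable exactly once.
covered = {
    strided_offset(h0, h1)
    for h1 in range(EXTENT_1)
    for h0 in range(EXTENT_0)
}
```

The two schemes address memory identically; the difference is that the dimension-based form exposes the extents and strides to Halide instead of hiding them inside index arithmetic.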
Jason Ansel	da5f37515e	[halide-backend] Generate standalone runtime (#129025 ) This puts the halide runtime in a global shared object, rather than copying it to each kernel. Having many copies of the runtime causes many issues with cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417	2024-06-29 14:06:12 +00:00
Jason Ansel	e34b7e6af3	[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-06-29 14:06:08 +00:00
Howard Huang	13d4be1dc7	[pipelining] Support W action for schedules (#129233 ) Add support for the `W` action in `_step_microbatches`. ## TODO: - Clean up the tests; there's a lot of copy-pasted, repeated code there Co-authored-by: Will Constable <whc@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129233 Approved by: https://github.com/wconstab ghstack dependencies: #128983, #128976	2024-06-29 11:51:40 +00:00
Howard Huang	a6da01bd01	[pipelining] Support arbitrary stage ordering on ranks (#128976 ) Fixes based on discussion in https://github.com/pytorch/pytorch/issues/128665 Our previous assumption was that for looped schedules `stage_ids = range(rank, total_stages, num_local_stages)`. This is not true for all schedules. This change relaxes that assumption and allows arbitrary ordering of stages. For example, in the added test we use rank 0: [stage0, stage3], rank 1: [stage1, stage2]. The test also adds a schedule registry (for testing) which runs 1 microbatch through this schedule ``` F0_0 None None F0_3 B0_3 None None B0_0 None F0_1 F0_2 None None B0_2 B0_1 None ``` Co-authored-by: Will Constable <whc@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128976 Approved by: https://github.com/wconstab ghstack dependencies: #128983	2024-06-29 11:51:39 +00:00
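To make the relaxed assumption in #128976 concrete, here is a minimal sketch in plain Python. It does not use the `torch.distributed.pipelining` API. The `looped_stage_ids` formula is copied from the commit message; `is_valid_assignment` and the 2-rank, 4-stage setup are assumptions made for illustration.

```python
from typing import Dict, List


def looped_stage_ids(rank: int, total_stages: int, num_local_stages: int) -> List[int]:
    """Stage ids under the old looped-schedule assumption (formula from the commit message)."""
    return list(range(rank, total_stages, num_local_stages))


def is_valid_assignment(stages_per_rank: Dict[int, List[int]], total_stages: int) -> bool:
    """Hypothetical check: every stage is placed on exactly one rank."""
    placed = [stage for stages in stages_per_rank.values() for stage in stages]
    return sorted(placed) == list(range(total_stages))


# Old assumption with 2 ranks, 4 stages, 2 stages per rank.
old_layout = {rank: looped_stage_ids(rank, 4, 2) for rank in range(2)}

# Ordering used in the added test: rank 0 holds stages 0 and 3, rank 1 holds 1 and 2.
test_layout = {0: [0, 3], 1: [1, 2]}
```

Both layouts place each of the four stages exactly once, but only the first one can be produced by the `range(...)` formula, which is why the formula could not stay as a hard requirement.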
Will Constable	18ae3bab2f	[Pipelining] Support separate dw_runner for PipelineStage (#128983 ) Fixes #128974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128983 Approved by: https://github.com/H-Huang	2024-06-29 11:51:34 +00:00
谭九鼎	b0e5c9514d	use shutil.which in check_compiler_ok_for_platform (#129069 ) the same as https://github.com/pytorch/pytorch/pull/126060 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129069 Approved by: https://github.com/ezyang	2024-06-29 11:38:51 +00:00
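For readers unfamiliar with `shutil.which`, a minimal sketch of the lookup follows. This is not the body of `check_compiler_ok_for_platform`; `compiler_on_path` is a hypothetical helper that only shows what the standard-library call returns.

```python
import shutil


def compiler_on_path(compiler: str) -> bool:
    """Return True if `compiler` resolves to an executable on PATH.

    shutil.which returns the resolved path as a string, or None when
    nothing executable with that name is found.
    """
    return shutil.which(compiler) is not None
```

Compared with spawning a `which` subprocess, this stays in-process and behaves the same way on platforms that do not ship a `which` binary.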
Xuehai Pan	56935684c3	Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 ) ------ - [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`. - [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X \| Y`, `Optional[X] -> X \| None`, `Optional[Union[X, Y]] -> X \| Y \| None`. Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449: - #117449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419 Approved by: https://github.com/ezyang ghstack dependencies: #129375, #129376	2024-06-29 09:23:39 +00:00
Xuehai Pan	9120992c72	[BE][Easy] enable postponed annotations in `torchgen` (#129376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376 Approved by: https://github.com/ezyang ghstack dependencies: #129375	2024-06-29 09:23:39 +00:00
Xuehai Pan	8a67daf283	[BE][Easy] enable postponed annotations in `tools` (#129375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375 Approved by: https://github.com/malfet	2024-06-29 09:23:35 +00:00
Xu Han	58f346c874	[inductor] split cpu vec isa to dedicate file (keep git history) (#129789 ) This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1 Changes: 1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`. 2. Make `cpu_vec_isa.py` a dedicated file for CPU vec ISA. This also makes it easier to extend to more architectures and vec ISAs. 3. Update code for the above changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-29 07:19:54 +00:00
Animesh Jain	a676b7c5f3	Add XGLMForCausalLM to the flaky model list (#129776 ) Not failing on devGPU. Went to CI machine ... flaky. So adding to the flaky list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129776 Approved by: https://github.com/mlazos ghstack dependencies: #129583, #129610, #129775	2024-06-29 05:47:28 +00:00
Animesh Jain	5d1763d159	Add lcnet to the inline_inbuilt_nn_module list (#129775 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129775 Approved by: https://github.com/mlazos ghstack dependencies: #129583, #129610	2024-06-29 05:47:28 +00:00
Wenlei He	89696db4b0	Revert "[LLVM/TensorExpr] Update for an API change in LLVM 18." (#129797 ) This reverts commit 20f394f10a389bcf13485929be8862f98ad4b322 (https://github.com/pytorch/pytorch/pull/117086) LLVM upstream changed the pass builder API again, so registerPassBuilderCallbacks no longer takes extra boolean for PopulateClassToPassNames. Update accordingly. Relevant LLVM upstream change: https://github.com/llvm/llvm-project/pull/96321 https://github.com/llvm/llvm-project/pull/96462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129797 Approved by: https://github.com/dcci	2024-06-29 05:17:20 +00:00
Boyuan Feng	3ef44df667	[ts-migration] support prim::SetAttr and fix prim::GetAttr (#129440 ) - Lifting Tensor Constant attributes to buffers: TorchScript does not automatically lift tensor constant attributes to buffers. So previous converter cannot access tensor constant attributes. This PR fixed the issue. - Add SetAttr support for tensor attributes by copy_. - Add SetAttr support for non-tensor attributes. In particular, we maintain the current value of non-tensor attributes in `name_to_non_tensor_attribute_node`, similar to an interpreter pass on non-tensor attributes. So we can support the following use case: ```python def forward(self, x): c1 = self.count self.count += 1 c2 = self.count return x + c1 + c2 ``` - Fixed a bug in GetAttr to support the following use case: ```python def forward(self, inp): x = self.buffer self.buffer += 1 y = self.buffer return x + y + inp ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129440 Approved by: https://github.com/angelayi	2024-06-29 05:08:13 +00:00
Yanbo Liang	ec47d4d9a8	[Inductor] FlexAttention supports block sparse mask (#129216 ) Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286 Initial perf number: * fwd speedup: 0.44 -> 0.72 * bwd speedup: 0.38 -> 0.71 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216 Approved by: https://github.com/Chillee	2024-06-29 04:44:38 +00:00
Yanbo Liang	7b5a8424a1	[GPT-fast] Update micro benchmark numbers as A100-50G (#129799 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129799 Approved by: https://github.com/Chillee	2024-06-29 04:36:07 +00:00
Mayank Mishra	065c386990	Allow get attributes on DDP similar to FSDP (#128620 ) FSDP implements the following logic, but it's missing from DDP. This PR adds an equivalent function. ```python def __getattr__(self, name: str) -> Any: """Forward missing attributes to the wrapped module.""" try: return super().__getattr__(name) # defer to nn.Module's logic except AttributeError: return getattr(self._fsdp_wrapped_module, name) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128620 Approved by: https://github.com/awgu	2024-06-29 01:57:22 +00:00
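The forwarding pattern quoted in #128620 can be tried out without PyTorch. The sketch below uses stand-in classes rather than `nn.Module` or `DistributedDataParallel`; the class names and the `module` attribute are assumptions for illustration, and the commit message does not show the DDP implementation itself.

```python
from typing import Any


class InnerModel:
    """Stand-in for the wrapped user module."""

    def __init__(self) -> None:
        self.hidden_size = 128


class ForwardingWrapper:
    """Stand-in for a wrapper that forwards missing attributes."""

    def __init__(self, module: InnerModel) -> None:
        self.module = module

    def __getattr__(self, name: str) -> Any:
        # __getattr__ runs only after normal attribute lookup has failed.
        if name == "module":
            # Guard against infinite recursion if `module` is not set yet.
            raise AttributeError(name)
        return getattr(self.module, name)


wrapper = ForwardingWrapper(InnerModel())
forwarded_value = wrapper.hidden_size  # resolved on the inner model
```

Attributes defined on the wrapper itself still win, because Python consults `__getattr__` only as a fallback.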
Nikita Shulga	2bc6f329b2	Make PyTorch argparser understand complex (#129580 ) It understands float and int, so why not `complex`. Test plan: `python -c "import torch;print(torch.rand(3, dtype=complex))"` Fixes https://github.com/pytorch/pytorch/issues/126837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129580 Approved by: https://github.com/albanD	2024-06-29 01:21:12 +00:00
PyTorch MergeBot	dfd55d1714	Revert "[cond] inlining into one of the branches when pred is a python constant (#128709 )" This reverts commit 23adf166e166bd56e3446284939af7e46a181079. Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking one ExecuTorch test ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2197806850))	2024-06-29 01:03:55 +00:00
PyTorch MergeBot	3d96217891	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit 9e1f3ecaa710785a1ab03c6ad5093a5566d6c5e5. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))	2024-06-29 00:47:15 +00:00
Haiping Zhao	c0782e7c81	Kineto profiler: collecting observer traces from C++ child threads (#128743 ) Summary: In a C++ program, if we have child threads doing GPU work, it would be nice to get traces of those threads as well. The problem is, pushProfilingCallbacks() is not called on child threads, therefore, no observer traces are collected on these threads, entirely missing in the final output. This diff provides a new API that a child thread may elect to call to register itself onto the profiler that was started in main thread (or whatever the Python thread that manages the profiler). Test Plan: ``` buck2 test @mode/opt //caffe2/test:profiler_test_cpp_thread ``` Reviewed By: aaronenyeshi Differential Revision: D56669942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128743 Approved by: https://github.com/aaronenyeshi	2024-06-29 00:44:30 +00:00
PyTorch MergeBot	a32ce5ce34	Revert "[BE][Easy] enable postponed annotations in `tools` (#129375 )" This reverts commit 59eb2897f1745f513edb6c63065ffad481c4c8d0. Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:25 +00:00
PyTorch MergeBot	6063bb9d45	Revert "[BE][Easy] enable postponed annotations in `torchgen` (#129376 )" This reverts commit 494057d6d4e9b40daf81a6a4d7a8c839b7424b14. Reverted https://github.com/pytorch/pytorch/pull/129376 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:25 +00:00
PyTorch MergeBot	83caf4960f	Revert "Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 )" This reverts commit e40f50cb87bcd176a380b729af5dda13dbe9c399. Reverted https://github.com/pytorch/pytorch/pull/129419 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:24 +00:00
PyTorch MergeBot	00d7bba2fa	Revert "[BE] enforce style for empty lines in import segments (#129751 )" This reverts commit f5ff1a3ab9ef279655308266029faf6543a8a1ca. Reverted https://github.com/pytorch/pytorch/pull/129751 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129751#issuecomment-2197799814))	2024-06-29 00:41:41 +00:00
PyTorch MergeBot	fa6c0fe3e4	Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749 )" This reverts commit 9450e198aa0bdf6f81ccb8ad2f74c06e81d1af6e. Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2197790226))	2024-06-29 00:16:47 +00:00
Andrew Gu	24f69eef6a	[FSDP2] Ran reduce-scatter copy-in in default stream (#129721 ) This PR runs the reduce-scatter copy-in in the default stream, allowing the reduce-scatter input (large allocation proportional to unsharded gradients) to be allocated in the default stream to avoid fragmenting that memory across stream memory pools. - In general, minimizing memory usage spikes in non-default-stream memory pools helps because otherwise, that memory cannot be reused by the default stream outside of that spike. This reduce-scatter input allocation represents one such spike. The reduce-scatter outputs are still allocated in the separate `reduce_scatter` stream since they are small and have a non-spiky allocation/free pattern (we iteratively allocate them through backward and free them altogether after optimizer). - This PR should not have any impact on overlap (I sanity checked Llama3-8B traces from torchtitan; plus we have the `test_fully_shard_overlap.py` unit tests). Experiment (Before) Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1: ``` [rank0]:2024-06-27 16:38:56,620 - root - INFO - step: 1 loss: 12.2764 memory: 71.99GiB(75.75%) wps: 1,436 mfu: 8.41% [rank0]:2024-06-27 16:38:56,620 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:2024-06-27 16:38:57,943 - root - INFO - step: 2 loss: 12.1001 memory: 79.82GiB(83.98%) wps: 6,195 mfu: 36.28% [rank0]:2024-06-27 16:38:59,266 - root - INFO - step: 3 loss: 11.7697 memory: 79.82GiB(83.98%) wps: 6,193 mfu: 36.27% [rank0]:2024-06-27 16:39:00,587 - root - INFO - step: 4 loss: 11.2807 memory: 79.82GiB(83.98%) wps: 6,203 mfu: 36.32% [rank0]:2024-06-27 16:39:01,910 - root - INFO - step: 5 loss: 10.9494 memory: 79.82GiB(83.98%) wps: 6,198 mfu: 36.30% ``` (After) Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1: ``` [rank0]:2024-06-27 16:41:12,106 - root - INFO - step: 1 loss: 12.2560 memory: 69.46GiB(73.08%) wps: 1,158 
mfu: 6.78% [rank0]:2024-06-27 16:41:12,106 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:2024-06-27 16:41:13,502 - root - INFO - step: 2 loss: 12.0949 memory: 77.29GiB(81.32%) wps: 5,870 mfu: 34.37% [rank0]:2024-06-27 16:41:14,839 - root - INFO - step: 3 loss: 11.7770 memory: 77.29GiB(81.32%) wps: 6,130 mfu: 35.90% [rank0]:2024-06-27 16:41:16,154 - root - INFO - step: 4 loss: 11.3188 memory: 77.29GiB(81.32%) wps: 6,230 mfu: 36.48% [rank0]:2024-06-27 16:41:17,474 - root - INFO - step: 5 loss: 10.9443 memory: 77.29GiB(81.32%) wps: 6,211 mfu: 36.37% ``` 2.53 GiB reduction in peak reserved memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129721 Approved by: https://github.com/weifengpy, https://github.com/yifuwang	2024-06-28 23:55:12 +00:00
Sahan Paliskara	f06e3a1569	[Split Build] Make script not crash if split build is not set (#129774 ) Fixes issue causing https://github.com/pytorch/pytorch/actions/runs/9704484834/job/26801889463 to crash Pull Request resolved: https://github.com/pytorch/pytorch/pull/129774 Approved by: https://github.com/atalman	2024-06-28 23:50:18 +00:00
Aaron Gokaslan	7bda23ef84	[BE]: Update ruff to 0.5.0 (#129744 ) Update ruff to 0.5.0 so we can enable some of the new checks I've been wanting to add to the codebase. For now, this only updates the code to comply with some rule changes and a couple of minor API changes / deprecations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129744 Approved by: https://github.com/ezyang	2024-06-28 21:49:56 +00:00
Mohamed Yassine Kabouri	0a337613f8	Fix typo in stack_module_state doc (#129126 ) I think there is a typo in the first example of the `torch.func.stack_module_state` documentation. The first parameter in the function call in the `wrapper` return is missing an 's'. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129126 Approved by: https://github.com/zou3519	2024-06-28 21:36:40 +00:00
Xuehai Pan	f5ff1a3ab9	[BE] enforce style for empty lines in import segments (#129751 ) This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet: > Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one) `usort` allows empty lines within import segments. For example, `usort` does not change the following code: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` This PR first sorts imports via `isort`, then re-sorts the file using `ufmt` (`usort` + `black`). This enforces the following import style: 1. no empty lines within segments. 2. single empty line between segments. 3. two spaces after import statements. All the code snippets above will be formatted to: ```python import torch.aaa import torch.bbb import torch.ccc x = ... # some code ``` which produces a consistent code style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751 Approved by: https://github.com/malfet	2024-06-28 21:02:59 +00:00
Joona Havukainen	5b96a552df	Add a check and error message for no support on MPS for conv with output_channels > 2^16 (#129484 ) Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value. The fix for the missing support is coming in later as that requires work on the kernel side so it'll take some more time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484 Approved by: https://github.com/kulinseth	2024-06-28 20:57:40 +00:00
Zaida Zhou	bc8883a7c4	fix the error msg in device_mesh (#129747 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129747 Approved by: https://github.com/awgu, https://github.com/wconstab	2024-06-28 20:12:09 +00:00
Mikayla Gawarecki	45f3e20527	Improve error message for weights_only load (#129705 ) As @vmoens pointed out, the current error message does not make the "either/or" between setting `weights_only=False` and using `add_safe_globals` clear enough, and should print the code for the user to call `add_safe_globals` New formatting looks like such In the case that `add_safe_globals` can be used ```python >>> import torch >>> from torch.testing._internal.two_tensor import TwoTensor >>> torch.save(TwoTensor(torch.randn(2), torch.randn(2)), "two_tensor.pt") >>> torch.load("two_tensor.pt", weights_only=True) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load raise pickle.UnpicklingError(_get_wo_message(str(e))) from None _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message. WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TwoTensor])` to allowlist this global if you trust this class/function. Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. 
``` For other issues (unsupported bytecode) ```python >>> import torch >>> t = torch.randn(2, 3) >>> torch.save(t, "protocol_5.pt", pickle_protocol=5) >>> torch.load("protocol_5.pt", weights_only=True) /data/users/mg1998/pytorch/torch/_weights_only_unpickler.py:359: UserWarning: Detected pickle protocol 5 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this. warnings.warn( Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load raise pickle.UnpicklingError(_get_wo_message(str(e))) from None _pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Unsupported operand 149 Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html. ``` Old formatting would have been like: ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/data/users/mg1998/pytorch/torch/serialization.py", line 1203, in load raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None _pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you get the file from a trusted source. Alternatively, to load with `weights_only` please check the recommended steps in the following error message. 
WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals` to allowlist this global if you trust this class/function. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129705 Approved by: https://github.com/albanD, https://github.com/vmoens ghstack dependencies: #129239, #129396, #129509	2024-06-28 19:36:31 +00:00
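The allowlisting idea behind `weights_only=True` and `add_safe_globals` can be sketched with the standard library alone. The example below is not PyTorch's `_weights_only_unpickler`; it follows the "restricting globals" pattern from the Python `pickle` documentation, and `AllowlistUnpickler` and `safe_loads` are hypothetical names.

```python
import io
import pickle
from typing import Any, Iterable, Tuple


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not explicitly allowlisted."""

    def __init__(self, file: io.BytesIO, allowed: Iterable[Tuple[str, str]]) -> None:
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module: str, name: str) -> Any:
        # Called whenever the pickle stream references a global (class/function).
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"Unsupported global: {module}.{name} is not allowlisted"
        )


def safe_loads(data: bytes, allowed: Iterable[Tuple[str, str]] = ()) -> Any:
    """Load `data`, permitting only the (module, name) pairs in `allowed`."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()
```

Plain containers such as dicts and lists load without any allowlist entry because they never reference a global; anything else must be opted into by the caller, which mirrors the "either/or" choice the new error message spells out.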
Rachel Guo	99456a612b	[AOTI] Properly indent launchKernel calls in AOTInductor (#129616 ) Summary: There is a small cosmetic issue in the C++ wrapper file generated by AOTInductor: the launchKernel() call isn't properly indented. Added indentation for the launchKernel() call block when it sits under an "if" condition, i.e. when `grid_uses_symbolic_shapes` is `True`. Test Plan: Test cmd ran (in pytorch oss): `TORCH_LOGS="output_code" TORCH_COMPILE_DEBUG=1 python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols_abi_compatible_cuda` And then manually verified the output code generated in a path like `/tmp/torchinductor_guorachel/coraisesuchpl3qabrazn7ydydszcit6lwpn7ckd3b4wej4rep5l/cba5g5ajeh5sym3tp5iqn7kkokimj7qqd4krs2rruhupbfqgppge.cpp` Similarly, also verified for test case: `test_zero_grid_with_unbacked_symbols_abi_compatible_cuda` Differential Revision: D58897157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129616 Approved by: https://github.com/ColinPeppler	2024-06-28 19:16:18 +00:00
Animesh Jain	6120aa3718	[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164 ) TorchDynamo guard mechanism guards on the key order on the dictionaries if the user iterates over the dictionary. For standard dict, we can write a fast C++ implementation by using PyDict_Next. But with OrderedDict, we have to rely on `keys` Python API to get the key ordering. This makes guard evaluation slow. With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model. Functionality impact - The only difference between dict and OrderedDict is the `move_to_end` method for OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks but this PR keeps the OrderedDict for hooks untouched (we should still follow up with hooks but in a separate PR). Perf impact - I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). Typing impact - I don't anticipate any. For all the user visible methods for nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #129163	2024-06-28 18:30:13 +00:00
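The two properties that #129164 relies on, insertion ordering in plain `dict` and the absence of `move_to_end`, can be checked with a short standard-library sketch. The key names below are placeholders and nothing here touches `nn.Module`.

```python
from collections import OrderedDict

plain: dict = {}
ordered: OrderedDict = OrderedDict()

# Insert the same keys in the same order into both containers.
for key in ("weight", "bias", "running_mean"):
    plain[key] = None
    ordered[key] = None

# Plain dicts preserve insertion order (guaranteed since Python 3.7).
same_order = list(plain) == list(ordered)

# move_to_end exists only on OrderedDict.
has_move_to_end = (hasattr(plain, "move_to_end"), hasattr(ordered, "move_to_end"))
```

Any code that calls `move_to_end` on these containers would need to keep `OrderedDict`, which is why the commit leaves the hook dictionaries unchanged.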
zrr1999	db4c7bb7fc	Refine typing annotation for compile (#129136 ) before ![image](https://github.com/pytorch/pytorch/assets/46243324/91372d0f-ad0e-4abe-9582-7fe892f99ec8) after ![image](https://github.com/pytorch/pytorch/assets/46243324/175066ff-78f9-44a1-a3bb-5df809f7e86d) Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129136 Approved by: https://github.com/ezyang	2024-06-28 17:57:44 +00:00
FEI	59e4e92556	sdp::SDPBackend::flash_attention support PrivateUse1 (#126392 ) Fixes https://github.com/pytorch/pytorch/issues/124271 cc @cpuhrsch @drisspg @albanD @soulitzer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126392 Approved by: https://github.com/drisspg	2024-06-28 17:48:40 +00:00
Chien-Chin Huang	26d633b721	[BE] Correctly catch skip signals emitting from sys.exit in Sandcastle (#129731 ) https://github.com/pytorch/pytorch/pull/129581 does not work correctly with Sandcastle environment. This PR fixes the issue. Differential Revision: [D59144062](https://our.internmc.facebook.com/intern/diff/D59144062/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129731 Approved by: https://github.com/wz337	2024-06-28 17:24:12 +00:00
Isuru Fernando	c12a4f2e65	Add decomposition for slice_scatter (#123744 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123744 Approved by: https://github.com/peterbell10	2024-06-28 17:02:10 +00:00
Joel Schlosser	6897631ceb	Guard on inner tensor names for traceable wrapper subclasses (#129618 ) Fixes #129601 Background: it's possible that a traceable wrapper subclass will have an optional inner tensor constituent (e.g. NJT's cached min / max sequence lengths). To specify this, the subclass's `__tensor_flatten__()` impl should leave out any unspecified optional inner tensors in the returned list of `attrs`. This PR guards on the list of inner tensor `attrs` returned in `subclass.__tensor_flatten__()[0]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129618 Approved by: https://github.com/anijain2305	2024-06-28 16:30:25 +00:00
Ying Zhao	b84036e3fb	[AOTI] Fix test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation (#129173 ) Fixes #122978 ## Summary To fix compilation error for test test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation - Error 1 ``` error: no matching function for call to ‘torch::aot_inductor::ArrayRefTensor<float>::ArrayRefTensor(float [1], const int64_t [0], const int64_t [0], int&, int32_t&)’ 613 \| ArrayRefTensor<float> buf3(buf3_storage, int_array_6, int_array_6, cached_torch_device_type_cpu, this->device_idx_); \| ^ ... torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:188:35: note: no known conversion for argument 2 from ‘const int64_t [0]’ {aka ‘const long int [0]’} to ‘torch::aot_inductor::MiniArrayRef<const long int>’ 188 \| MiniArrayRef<const int64_t> sizes, \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~ ``` Fix: added constructor for empty array in arrayref_tensor.h - Error 2 ``` error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’} 625 \| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw)); \| ^~~~ \| \| \| torch::aot_inductor::ArrayRefTensor<float> ``` Fix: in cpp_wrapper_cpu.py, added codegen to call convert ArrayRefTensor to AtenTensorHandle first. ## Test Plan ``` python test/inductor/test_aot_inductor.py -k AOTInductorTestABICompatibleCpuWithStackAllocation.test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation ``` Before the fix, detailed in #122978: ``` \| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw)); \| ^~~~ \| \| \| torch::aot_inductor::ArrayRefTensor<float> /home/yingzhaoseattle/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/utils.h:34:8: note: in definition of macro ‘AOTI_TORCH_ERROR_CODE_CHECK’ Ran 1 test in 4.377s FAILED (errors=1) ``` After the fix ``` /home/yingzhaoseattle/pytorch/torch/backends/cudnn/__init__.py:107: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. 
To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( stats [('calls_captured', 3), ('unique_graphs', 1)] inductor [('extern_calls', 1)] . ---------------------------------------------------------------------- Ran 1 test in 9.633s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129173 Approved by: https://github.com/chenyang78	2024-06-28 16:27:42 +00:00
Oguz Ulgen	04264efab6	Add structured logging on FXGraphCache hit (#129588 ) We'll also want to do this for AOTAutogradCache once that's ready Differential Revision: [D59144226](https://our.internmc.facebook.com/intern/diff/D59144226) Co-authored-by: Oguz Ulgen <oulgen@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129588 Approved by: https://github.com/oulgen, https://github.com/xmfan	2024-06-28 16:06:22 +00:00
Xuehai Pan	e40f50cb87	Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 ) ------ - [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`. - [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X \| Y`, `Optional[X] -> X \| None`, `Optional[Union[X, Y]] -> X \| Y \| None`. Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449: - #117449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419 Approved by: https://github.com/ezyang ghstack dependencies: #129375, #129376	2024-06-28 15:37:57 +00:00
Xuehai Pan	494057d6d4	[BE][Easy] enable postponed annotations in `torchgen` (#129376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376 Approved by: https://github.com/ezyang ghstack dependencies: #129375	2024-06-28 15:37:57 +00:00
Xuehai Pan	59eb2897f1	[BE][Easy] enable postponed annotations in `tools` (#129375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375 Approved by: https://github.com/malfet	2024-06-28 15:37:54 +00:00
Xu Han	2e3ff394bf	[inductor] optimize cpp builder configuration code (#129577 ) Changes: 1. Combine the condition-dispatch code that chooses the ISA. 2. Unify the macOS OpenMP configuration code. 3. Clean up unused code. Co-authored-by: Jason Ansel <jansel@jansel.net> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-28 15:08:54 +00:00
Manuel Candales	eabe6574c0	[metal] Parameterize group_size in int4_mm test, fix int4mm shader for group_size > 128 (#129628 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129628 Approved by: https://github.com/kimishpatel	2024-06-28 15:01:30 +00:00
Andrew Gu	635d6c9d66	[FSDP2] Ran post-acc-grad hooks manually (#129450 ) FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually. Discussion Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity. Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not. Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually. Caveats - Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass). - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be. - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers. - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`. - The post-accumulate-grad hook runs in the reduce-scatter stream. 
Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream. - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a 3% decrease in MFU when running the optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues. - This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope. Experiments (torchtitan) - Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision: - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped) - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450 Approved by: https://github.com/weifengpy, https://github.com/yf225	2024-06-28 14:50:09 +00:00
Nikita Shulga	fe4032fe20	[BE][CMake] Do not use `EXEC_PROGRAM` (#129714 ) It has been deprecated since CMake 3.0 in favor of `execute_process`, see https://cmake.org/cmake/help/v3.18/command/exec_program.html This makes the following warning disappear: ``` CMake Warning (dev) at cmake/Modules/FindARM.cmake:5 (EXEC_PROGRAM): Policy CMP0153 is not set: The exec_program command should not be called. Run "cmake --help-policy CMP0153" for policy details. Use the cmake_policy command to set the policy and suppress this warning. Use execute_process() instead. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129714 Approved by: https://github.com/kit1980	2024-06-28 13:29:52 +00:00
Yu, Guangye	98d34d849d	Add a XPU UT to ensure lazy init (#129638 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129638 Approved by: https://github.com/gujinghui	2024-06-28 13:22:17 +00:00
Randolf Scholz	22a06869f2	include jit/*.pyi (#129654 ) Fixes #108781, see https://github.com/pytorch/pytorch/pull/108782#issuecomment-1927321532 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129654 Approved by: https://github.com/ezyang	2024-06-28 12:40:11 +00:00
Xu Han	424068d0d2	[Windows] remove mkl shared library dependency. (#129493 ) # Background I have fixed the missing MKL shared library dependency issue in PyTorch on Windows: https://github.com/pytorch/pytorch/issues/124009 The solution is to statically link the MKL library into the torch_cpu module: 1. PR that statically links MKL in pytorch: https://github.com/pytorch/pytorch/pull/124925 2. PR that installs the MKL static library in builder: https://github.com/pytorch/builder/pull/1790 I double-checked that the current build statically links MKL: https://github.com/pytorch/pytorch/issues/124009#issuecomment-2160941802 # Goal Remove the setup.py `install_requires` entry that installs the MKL shared library for PyTorch on Windows. It is no longer required because MKL is now statically linked. This reduces the network traffic of a PyTorch install and avoids installing an unneeded MKL shared library package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129493 Approved by: https://github.com/malfet	2024-06-28 11:42:21 +00:00
Shan19900305	a0dac3de31	Noise tensor using same size/stride with input to promote performance when channel last situation. (#129467 ) All ops in the _dropout_impl function are point-wise ops. When the input and output tensors have the same size and stride, those operators perform better. So I have removed the memory format argument from at::empty_like when making the noise tensor. @ezyang Test code:
```
import torch

input1 = torch.randn((50, 20, 50, 30)).cuda()
input2 = torch.randn((50, 20, 50, 30)).cuda().to(memory_format=torch.channels_last)
input3 = torch.randn((50, 20, 50, 50)).cuda()[..., 10:40]
dropout = torch.nn.Dropout(p=0.5, inplace=True)

# warmup:
for i in range(20):
    output = dropout(input1)

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
num = 10000

# contiguous input
start_event.record()
for i in range(num):
    output = dropout(input1)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input1 each time: {0}.".format(time * 1.0 / num), flush=True)

# channels_last input
start_event.record()
for i in range(num):
    output = dropout(input2)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input2 each time: {0}.".format(time * 1.0 / num), flush=True)

# sliced (non-contiguous) input
start_event.record()
for i in range(num):
    output = dropout(input3)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input3 each time: {0}.".format(time * 1.0 / num), flush=True)
```
Test result:

| # | Operator | Input size / stride | Memory format argument passed to empty | Time (ms) | Notes |
| -- | -- | -- | -- | -- | -- |
| 1 | dropout | (50, 20, 50, 30) / (30000, 1500, 30, 1) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0426735 | |
| 2 | dropout | (50, 20, 50, 30) / (30000, 1, 600, 20) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0461689 | |
| 3 | dropout | (50, 20, 50, 30) / (50000, 2500, 50, 1) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0512882 | |
| 4 | dropout | (50, 20, 50, 30) / (30000, 1500, 30, 1) | None; size/stride follow the input | 0.0426598 | Compared with 1: essentially the same |
| 5 | dropout | (50, 20, 50, 30) / (30000, 1, 600, 20) | None; size/stride follow the input | 0.0422751 | Compared with 2: about 8.4% faster |
| 6 | dropout | (50, 20, 50, 30) / (50000, 2500, 50, 1) | None; size/stride follow the input | 0.0509037 | Compared with 3: essentially the same |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129467 Approved by: https://github.com/ezyang	2024-06-28 10:06:13 +00:00
PyTorch MergeBot	999eec8dea	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit b7e7a4cb01de394af7686ab6feb216a8a5c716bb. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))	2024-06-28 06:03:54 +00:00
PyTorch MergeBot	d21993bbb8	Revert "[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587 )" This reverts commit 7854d84acbfb7a4e3e807951188535a0316b585e. Reverted https://github.com/pytorch/pytorch/pull/129587 on behalf of https://github.com/huydhn due to Sorry for revert yet another of your change but I need to revert this to cleanly revert https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196187332 ([comment](https://github.com/pytorch/pytorch/pull/129587#issuecomment-2196198756))	2024-06-28 06:01:07 +00:00
PyTorch MergeBot	c43923a116	Revert "[Inductor] FlexAttention supports block sparse mask (#129216 )" This reverts commit b9d3cedd648d4ed9d0bf5b918893341e5f95289a. Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is still failing in trunk `b9d3cedd64`, maybe a landrace given that TD has been turned off ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2196182882))	2024-06-28 05:44:46 +00:00
hippocookie	73eb4503cc	Enable UFMT for numpy_test files, test_xnnpack_integration.py (#129023 ) Fixes #123062 Run lintrunner on files: test/test_xnnpack_integration.py ```bash $ lintrunner FLAKE8 success! CLANGFORMAT success! MYPY success! MYPYSTRICT success! CLANGTIDY success! TYPEIGNORE success! TYPENOSKIP success! NOQA success! NATIVEFUNCTIONS success! NEWLINE success! CONSTEXPR success! SPACES success! TABS success! INCLUDE success! PYBIND11_INCLUDE success! ERROR_PRONE_ISINSTANCE success! PYBIND11_SPECIALIZATION success! PYPIDEP success! EXEC success! CUBINCLUDE success! RAWCUDADEVICE success! RAWCUDA success! ROOT_LOGGING success! DEPLOY_DETECTION success! CMAKE success! SHELLCHECK success! ACTIONLINT success! TESTOWNERS success! TEST_HAS_MAIN success! CALL_ONCE success! ONCE_FLAG success! WORKFLOWSYNC success! UFMT success! COPYRIGHT success! BAZEL_LINTER success! LINTRUNNER_VERSION success! ATEN_CPU_GPU_AGNOSTIC success! MERGE_CONFLICTLESS_CSV success! RUFF success! ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129023 Approved by: https://github.com/ezyang	2024-06-28 05:40:31 +00:00
Peter Bell	b019f38fdd	[inductor] Fix pattern replacements with multiple users (#129689 ) Fixes #129685 After matching a pattern, we currently try to remove all the nodes of that pattern, which doesn't work if any intermediate node has users outside of the pattern. In which case we can't delete those particular nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129689 Approved by: https://github.com/shunting314	2024-06-28 05:16:17 +00:00
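The rule described in this commit message (only delete matched nodes that have no users left outside the pattern) can be sketched on a toy graph. This is not Inductor's pattern matcher: the `removable_nodes` helper, the `users` mapping, and the node names are all hypothetical, and the sketch assumes the pattern's output node has already had its users redirected to the replacement.

```python
def removable_nodes(pattern_nodes, users):
    """Return the matched nodes that are safe to erase.

    pattern_nodes: matched nodes in topological order; the last one is the
        pattern output, whose users were already redirected (assumption).
    users: dict mapping each node to the set of nodes that consume it.
    """
    removed = set()
    # Walk from the output backwards so a node's in-pattern users are
    # decided before the node itself.
    for node in reversed(pattern_nodes):
        remaining = users.get(node, set()) - removed
        if not remaining:
            removed.add(node)
    return removed


# "mul" feeds "add" inside the pattern, but also "sum" outside of it.
users = {"mul": {"add", "sum"}, "add": {"relu"}, "relu": set()}
safe = removable_nodes(["mul", "add", "relu"], users)
```

Here `safe` contains `"add"` and `"relu"` but not `"mul"`, because `"sum"` still depends on it.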
eqy	7854d84acb	[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587 ) Fix for #129579 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587 Approved by: https://github.com/Skylion007, https://github.com/drisspg	2024-06-28 04:42:45 +00:00
Daniel Richard G.	8d4216af8c	Fix compile error with Intel oneAPI compiler (#129589 ) I am building PyTorch with the Intel oneAPI 2024.0.0 compiler, and encountered this compile error: ``` [ 85%] Building CXX object caffe2/CMakeFiles/cpu_rng_test.dir/__/aten/src/ATen/test/cpu_rng_test.cpp.o In file included from /home/src/pytorch/aten/src/ATen/test/cpu_rng_test.cpp:2: /home/src/pytorch/aten/src/ATen/test/rng_test.h:119:41: error: loop variable 'to' creates a copy from type 'const ::std::optional<int64_t>' (aka 'const optional<long>') [-Werror,-Wrange-loop-construct] 119 \| for (const ::std::optional<int64_t> to : tos) { \| ^ /home/src/pytorch/aten/src/ATen/test/rng_test.h:119:10: note: use reference type 'const ::std::optional<int64_t> &' (aka 'const optional<long> &') to prevent copying 119 \| for (const ::std::optional<int64_t> to : tos) { \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| & 1 error generated. ``` This change makes the compiler happy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129589 Approved by: https://github.com/colesbury	2024-06-28 02:35:10 +00:00
Yidi Wu	4b8a5e0374	[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680 ) Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in remove_effect_token pass. The reason why we want to remove_effect_token is because we don't want the token to be part of input. However, this causes DCE calls in remove_effect_token itself and the dce calls in unlift to remove the custom op in the graph causing an error in the exported graph. This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident. Test Plan: Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680 Approved by: https://github.com/angelayi	2024-06-28 02:22:30 +00:00
Nikita Shulga	4b598d87d3	Fix FindBLAS.cmake (#129713 ) Fixes regression introduced by https://github.com/pytorch/pytorch/pull/125227 by adding `INCLUDE(CheckFunctionExists)` that fixes ``` CMake Error at cmake/Modules/FindBLAS.cmake:413 (check_function_exists): Unknown CMake command "check_function_exists". ``` Fixes https://github.com/pytorch/pytorch/issues/129693 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129713 Approved by: https://github.com/kit1980	2024-06-28 02:15:16 +00:00
Yanbo Liang	b9d3cedd64	[Inductor] FlexAttention supports block sparse mask (#129216 ) Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286 Initial perf number: * fwd speedup: 0.44 -> 0.72 * bwd speedup: 0.38 -> 0.71 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216 Approved by: https://github.com/Chillee	2024-06-28 01:32:54 +00:00
Will Feng	c07a799ed5	[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247 ) Test command: `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247 Approved by: https://github.com/bdhirsh ghstack dependencies: #129502	2024-06-28 01:04:49 +00:00
xinan.lin	36b9d9cfcd	[Inductor UT] Generalize device-bias code in newly added UT `test_scatter_optimization.py` (#129622 ) [Inductor UT] Generalize device-bias code in newly added UT test_scatter_optimization.py and test_torchinductor_dynamic_shapes.py Fix issue #129624 , #129642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129622 Approved by: https://github.com/EikanWang, https://github.com/peterbell10	2024-06-28 01:04:21 +00:00
Shangdi Yu	deaab33f3f	[custom op] add error message (#129417 ) Fixes [#129370](https://github.com/pytorch/pytorch/issues/129370) Suggest correct a List type annotation when input is in Tuple type. To avoid confusion, we only suggest a type if the type is supported. Example: Tuple[int, int] -> List[int] Tuple[Tensor, Tensor, Optional[Tensor]] -> List[Optional[Tensor]] Tuple[int, ...] -> List[int] ValueError: infer_schema(func): Parameter y has unsupported type typing.Tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]. Tuple type annotation is not supported. Please try to use a List instead. For example, typing.List[typing.Optional[torch.Tensor]]. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129417 Approved by: https://github.com/zou3519	2024-06-28 01:03:14 +00:00
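The Tuple-to-List suggestions listed in this commit message can be reproduced with a short sketch built on the standard `typing` helpers. This is an illustration, not the actual `infer_schema` code: `suggest_list_type` and `_split_optional` are hypothetical names, `Tensor` is a stand-in class so the example runs without PyTorch, and the sketch skips the check for whether the suggested type is actually supported.

```python
import typing
from typing import List, Optional, Tuple


class Tensor:
    """Stand-in for torch.Tensor (assumption, keeps the sketch dependency-free)."""


def _split_optional(tp):
    """Return (base_type, is_optional) for an annotation."""
    if typing.get_origin(tp) is typing.Union:
        args = typing.get_args(tp)
        non_none = [a for a in args if a is not type(None)]
        if len(non_none) == 1 and len(non_none) < len(args):
            return non_none[0], True
    return tp, False


def suggest_list_type(annotation):
    """Suggest a List[...] annotation for a Tuple[...] one, or None if unsure."""
    if typing.get_origin(annotation) is not tuple:
        return None
    # Drop the `...` of variable-length tuples such as Tuple[int, ...].
    elems = [a for a in typing.get_args(annotation) if a is not Ellipsis]
    if not elems:
        return None
    split = [_split_optional(e) for e in elems]
    bases = {base for base, _ in split}
    if len(bases) != 1:
        return None  # mixed element types: no single List type fits
    base = bases.pop()
    if any(is_optional for _, is_optional in split):
        return List[Optional[base]]
    return List[base]
```

For the three examples in the message, the sketch maps `Tuple[int, int]` and `Tuple[int, ...]` to `List[int]`, and `Tuple[Tensor, Tensor, Optional[Tensor]]` to `List[Optional[Tensor]]`.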
PyTorch MergeBot	8ba0f6c7c2	Revert "[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164 )" This reverts commit f2840bb22079a6952c61446a3d0dfc12f6452852. Reverted https://github.com/pytorch/pytorch/pull/129164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some internal dper3 tests ([comment](https://github.com/pytorch/pytorch/pull/129164#issuecomment-2195888838))	2024-06-28 00:49:39 +00:00
Xuehai Pan	9e1f3ecaa7	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-28 00:35:15 +00:00
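The path rewrites listed in this commit can be checked with a small sketch. The helper names and the example path are hypothetical, and the string comparison assumes a POSIX-style filesystem:

```python
import os.path
from pathlib import Path


def ancestor_dirname(path: str, levels: int) -> str:
    """Old style: apply os.path.dirname `levels` times."""
    for _ in range(levels):
        path = os.path.dirname(path)
    return path


def ancestor_pathlib(path: str, levels: int) -> Path:
    """New style: make the path absolute first, then index `.parents`.

    `.parents[levels - 1]` is the same as chaining `.parent` `levels` times
    (assumes levels >= 1).
    """
    return Path(path).absolute().parents[levels - 1]


file = "/repo/tools/linter/adapters/check.py"  # hypothetical absolute path
old = ancestor_dirname(file, 4)
new = ancestor_pathlib(file, 4)

# Why step 3 puts `.absolute()` first: on a bare relative path,
# `.parent` bottoms out at "." and never climbs above it.
stuck = Path("check.py").parent.parent
```

Both helpers land on the same directory for an absolute input, while `stuck` stays at `Path(".")` no matter how many `.parent` calls are chained.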
Nikita Shulga	d4b6ff6fbe	Disable llm-td step (#129722 ) As it often fails during conda install step with `Unexpected HTTP response: 429` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129722 Approved by: https://github.com/kit1980, https://github.com/clee2000	2024-06-28 00:12:32 +00:00
Will Feng	0ffb17547e	[Simple FSDP] Add unit test for torch.compile + reparameterization + SAC (#129641 ) This can reproduce the error in https://github.com/pytorch/pytorch/issues/129684. Adding a unit test so that we hold the line for torch.compile + reparameterization + SAC to always be working, to pave the path for Tianyu's intern's project. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129641 Approved by: https://github.com/tianyu-l	2024-06-28 00:00:36 +00:00
Jeff Daily	169b4ca07e	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet	2024-06-27 23:53:13 +00:00
cyy	fb5888c719	Remove unused type traits in torch/csrc/utils (#128799 ) Follows #127852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128799 Approved by: https://github.com/ezyang	2024-06-27 23:51:18 +00:00
Peter Bell	3fc279633b	[ATen] Make argsort.stable CompositeImplicitAutograd (#129529 ) It literally just calls `at::sort` and returns the indices, so is composite compliant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129529 Approved by: https://github.com/lezcano	2024-06-27 23:49:16 +00:00
Xuehai Pan	7cf0b90e49	[BE] enable UFMT in `torch.utils.data` (#127705 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127705 Approved by: https://github.com/ezyang ghstack dependencies: #127706, #127704	2024-06-27 23:16:24 +00:00
Xuehai Pan	f911957573	[BE] sort imports in `torch.utils.data` (#127704 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127704 Approved by: https://github.com/ezyang ghstack dependencies: #127706	2024-06-27 23:16:24 +00:00
Xuehai Pan	d80939e5e9	[BE] enable UFMT for `torch/storage.py` (#127706 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127706 Approved by: https://github.com/ezyang	2024-06-27 23:16:24 +00:00
Yifu Wang	67416a2996	[c10d] Introduce a util for detecting DMA connectivity among devices (#129510 ) This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices. The "DMA connectivity" in this context is more stringent than the ability to perform memory copy without CPU involvement. We define it as the ability for a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. The ability translates to the ability to run most aten GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users to determine whether certain DMA-based optimizations are possible. `_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced. Example: ```python3 >>> from torch._C._autograd import DeviceType >>> from torch._C._distributed_c10d import _detect_dma_connectivity >>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink") >>> for row in connectivity.matrix: ... print(row) ... [0, 18, 18, 18, 18, 18, 18, 18] [18, 0, 18, 18, 18, 18, 18, 18] [18, 18, 0, 18, 18, 18, 18, 18] [18, 18, 18, 0, 18, 18, 18, 18] [18, 18, 18, 18, 0, 18, 18, 18] [18, 18, 18, 18, 18, 0, 18, 18] [18, 18, 18, 18, 18, 18, 0, 18] [18, 18, 18, 18, 18, 18, 18, 0] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510 Approved by: https://github.com/weifengpy	2024-06-27 23:02:07 +00:00
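A caller might consume a connectivity matrix like the one printed in this commit message along the following lines. The helpers are hypothetical and work on a plain list of lists rather than on the object returned by `_detect_dma_connectivity`. The sketch also assumes that a nonzero entry means the two devices are connected, which the message does not state explicitly.

```python
def connected_pairs(matrix):
    """Return (i, j) pairs with i < j that are connected in both directions.

    Assumption: a nonzero entry means a connection exists.
    """
    n = len(matrix)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] and matrix[j][i]
    ]


def is_fully_connected(matrix):
    """True if every pair of distinct devices is connected."""
    n = len(matrix)
    return len(connected_pairs(matrix)) == n * (n - 1) // 2


# Same values as the 8-device example above: 0 on the diagonal, 18 elsewhere.
example = [[0 if i == j else 18 for j in range(8)] for i in range(8)]
all_to_all = is_fully_connected(example)
```

A check like this could gate an optimization that requires every device to reach every other device directly.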
yousufmo	305ba62906	Add support to `GradScaler` for respecting an already set `grad_scale` value (#123429 ) Fixes #123428 Co-authored-by: Yousuf Mohamed-Ahmed <youmed.tech@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123429 Approved by: https://github.com/ezyang	2024-06-27 22:40:54 +00:00
Will Constable	83a4a8b510	[C10D] clean up pointless 'or None' clause (#129522 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129522 Approved by: https://github.com/awgu	2024-06-27 22:40:11 +00:00
Chien-Lin Chen	5e7ac69a67	[Dynamic Shapes] fixed dynamic shape inference (#128807 ) Made dynamic dimension indirectly bound to an integer constrained. After each ShapeEnv._refine_ranges, check if the new ValueRange is singleton, if it is, replace the symbol. Fixes #122307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128807 Approved by: https://github.com/ezyang	2024-06-27 22:33:32 +00:00
Catherine Lee	b8398b771c	Upload test stats when workflow regardless of conclusion (#129694 ) Always upload test stats when a workflow completes, so that we can get status for cancelled workflows (especially ones that were cancelled manually). There aren't that many workflow conclusions, so we might as well always run it and see what happens. Undoes [this old PR](https://togithub.com/pytorch/pytorch/pull/79180). Notable pitfalls from the above: this might cause noise if things can't be downloaded, but since this workflow doesn't show up on PRs, I think it's OK to deal slowly with what comes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129694 Approved by: https://github.com/huydhn	2024-06-27 21:12:21 +00:00
Shivam Raikundalia	1d0efedc85	[Profiler] Add TSC Clock Callback to CUPTI (#125036 ) Summary: Right now we use the default clock for CUPTI which is not monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885 This diff only adds the compile flags such that the TSC format is used and sets the converter using a libkineto call in the profiler Test Plan: Obtained following trace using resnet test: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces TBD: Add benchmarks Differential Revision: D56584521 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036 Approved by: https://github.com/aaronenyeshi	2024-06-27 21:07:43 +00:00
Xu Han	602b5cb218	[inductor] switch HalideCodeCache to new cpp_builder. (#129441 ) The original PRs were damaged by conflicts and rebases: https://github.com/pytorch/pytorch/pull/128303, https://github.com/pytorch/pytorch/pull/129144 This PR only switches `HalideCodeCache` to the new cpp_builder and is not `fb_code` related, so it can merge without the `fb_code` test. Let's land this change first. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129441 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-27 20:50:13 +00:00
Tugsbayasgalan Manlaibaatar	39427288f4	Taskify training IR + run_decomp flow failures (#129547 ) Differential Revision: [D59069088](https://our.internmc.facebook.com/intern/diff/D59069088) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129547 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #128077, #129092, #129249	2024-06-27 20:43:22 +00:00
Yidi Wu	23adf166e1	[cond] inlining into one of the branches when pred is a python constant (#128709 ) When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior was to bake True/False into the cond operator, which can be confusing. In this PR, we change it to specialize into one of the branches when the inputs are constants. We additionally change the cond operator to use its default name instead of overriding it. This allows better testing on de-serialized graphs. Test Plan: The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized into one of the branches. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709 Approved by: https://github.com/zou3519	2024-06-27 20:28:50 +00:00
Sanket Jayant Purandare	71f5ecd1ee	Fixed Memory Leaks in tests (#129640 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129640 Approved by: https://github.com/clee2000 ghstack dependencies: #129400	2024-06-27 20:26:21 +00:00
Tugsbayasgalan Manlaibaatar	dabaebd339	Make run_decomp work (#129249 ) In this PR, we implement the first version of training_ir.run_decomp functionality. Since we don't return the modified buffers as extra outputs in training IR, our previous strategy of reusing the graph signature won't work. In fact, this run_decomp is more similar to retracing, so I reuse some of the export steps here. After this PR: export_for_training().run_decomp({}, _preserve_ops=[all 183 ops]) == export_for_predispatch() - autograd_manipulating_ops. Differential Revision: [D59069090](https://our.internmc.facebook.com/intern/diff/D59069090) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129249 Approved by: https://github.com/zhxchen17 ghstack dependencies: #128077, #129092	2024-06-27 19:16:07 +00:00
Tugsbayasgalan Manlaibaatar	ec284d3a74	Prototype for export_for_training (#129092 ) This PR implements export_for_training where the IR is not-functional, pre-dispatch aten IR. The general strategy: 1. Call dynamo to get torch IR 2. Lift param/buffer 3. call make_fx TODO: 1. run_decomp doesn't work 2. not-strict is not supported Differential Revision: [D59069087](https://our.internmc.facebook.com/intern/diff/D59069087) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129092 Approved by: https://github.com/zhxchen17 ghstack dependencies: #128077	2024-06-27 18:27:11 +00:00
Angela Yi	4dcc1ceff3	[dynamo] Fakify result of delegate (#128752 ) Summary: Somehow the delegate returns a real tensor result even though we pass in fake tensors. So here we need to convert the result to fake. Test Plan: `buck2 run @//mode/dev-nosan //on_device_ai/helios/multi_zion:multi_zion_test -- -r test_single_delegate_dsp_only` Differential Revision: D58617091 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128752 Approved by: https://github.com/ydwu4	2024-06-27 17:59:52 +00:00
Zain Rizvi	389492e264	Fix runner determinator bug (#129612 ) Currently the runner determinator is buggy and doesn't let anyone's workflows run against the LF runners (it prefixes a "@" to the user names in the issue instead of either stripping it or prefixing it to the incoming names) This PR fixes the bug so that people opted in to using LF runners can actually use them. It also puts the python code back into the repo. Even though the code isn't directly invoked, having it there makes testing and linting easier/possible Also includes lint fixes Note: if you just review the .yml file you'll see all the relevant diffs ### Testing: #### Before ``` python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo {"label_type": "", "message": "LF Workflows are disabled for ZainRizvi, ZainRizvi. Using meta runners."} ``` #### After ``` python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo {"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi, ZainRizvi. Using LF runners."} ``` Aside: updated test case after rebase: ``` python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi2 --github-branch foo --github-repo python/pythonss --github-ref-type branch {"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129612 Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt	2024-06-27 17:51:09 +00:00
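The class of bug described in this commit message (names stored with a leading "@" on one side of a comparison but not on the other) can be avoided by normalizing both sides. The sketch below is not the code in `runner_determinator.py`: the helper names are hypothetical, and it assumes the opt-in issue lists one `@name` entry per line.

```python
def parse_opted_in_users(issue_body: str) -> set:
    """Collect user names from an issue body, stripping any leading '@'.

    Assumption: one user per non-empty line, e.g. "@ZainRizvi".
    """
    return {
        line.strip().lstrip("@")
        for line in issue_body.splitlines()
        if line.strip()
    }


def is_enabled(actor: str, issue_body: str) -> bool:
    """Compare names with the '@' removed on both sides."""
    return actor.lstrip("@") in parse_opted_in_users(issue_body)


issue_body = "@ZainRizvi\n@someone-else\n"  # hypothetical issue contents
enabled = is_enabled("ZainRizvi", issue_body)
```

Normalizing at parse time keeps the membership test independent of whether the incoming actor name carries the prefix.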
Brian Hirsh	a4d7aa498b	[Traceable FSDP2] Add auto-functionalize support for mutable list[Tensor] (copy from Brian's PR #127347 ); enable E2E inductor unit test for transformer model (#129502 ) Copy of Brian's PR: https://github.com/pytorch/pytorch/pull/127347 with additional changes to support mutable `List[Tensor]` in Inductor. Also enable E2E inductor unit test for Traceable FSDP2 + transformer model. Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_set_` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_inductor` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_aot_eager` - `pytest -rA test/dynamo/test_misc.py::MiscTests::test_auto_functionalize_tensorlist` - `pytest -rA test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_list_cuda` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129502 Approved by: https://github.com/zou3519	2024-06-27 17:50:57 +00:00
Aleksei Nikiforov	9174d14551	Don't install remaining caffe2 python files (#129067 ) It is assumed that they are no longer needed. And keeping their installation as is breaks "python setup.py develop --user" workflow when non-root user is used. This change is follow up for 3d617333e700 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129067 Approved by: https://github.com/cyyever, https://github.com/r-barnes	2024-06-27 17:25:59 +00:00
Richard Barnes	e0bba37d66	[codemod] Add `[[noreturn]]` to 2 files inc caffe2/c10/util/TypeCast.cpp (#129575 ) Summary: LLVM-15 has a warning `-Wno-return` which can be used to identify functions that do not return. Qualifying these functions with `[[noreturn]]` is a perf optimization. Test Plan: Sandcastle Reviewed By: dmm-fb Differential Revision: D59003594 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129575 Approved by: https://github.com/Skylion007	2024-06-27 17:23:22 +00:00
Dmitry Rogozhkin	321bdcb372	Fix device propagation for checkpointing (#128671 ) Fixes: #128478 In the backward() implementation, the checkpointing code was querying the device type from the rng_state tensors saved on forward(). These tensors are CPU-only tensors and don't carry device information with them. As a result, the CUDA device was assumed as the default, which is not correct if the user runs on some other device, for example XPU. This patch saves full device information on forward() and uses it on backward() to get the device type. Previously forward() saved only the device index. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671 Approved by: https://github.com/guangyey, https://github.com/soulitzer	2024-06-27 17:14:13 +00:00
Jeff Daily	04206d1898	TunableOp hotfix, unit test follow-up (#129606 ) PR #129281 was landed to fix critical issues but did not contain unit tests to exercise those issues. This is a follow-up set of unit tests that would exercise the problems seen previously. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129606 Approved by: https://github.com/atalman	2024-06-27 17:01:04 +00:00
Peter Bell	5c6af2b583	[cpu] Fix div with rounding_mode="floor" when division overflows (#129536 ) Fixes #77742 `Sleef_fmod` returns NaN when the division overflows, where `libm` returns 0. In this narrow case we can drop the `fmod` from the calculation entirely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129536 Approved by: https://github.com/lezcano	2024-06-27 16:50:47 +00:00
PyTorch MergeBot	5ceba6a3cb	Revert "[Inductor] FlexAttention supports block sparse mask (#129216 )" This reverts commit 4082759925a712b7cb340164d3da3a1dab372d9f. Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/clee2000 due to broke functorch/aot_dispatch and test_proxy_tensor on windows https://github.com/pytorch/pytorch/actions/runs/9691331440/job/26743164471 `4082759925` missed on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2195087274))	2024-06-27 15:57:52 +00:00
Adnan Akhundov	82c8fc3a2b	[inductor] Add size_hint to conv dilation (#129631 ) Summary: [Here](`ea588d7fd3/torch/_inductor/kernel/conv.py (L252)`) in the `conv` lowering `dilation` is not `size_hint`-ed. This breaks if `dilation` is a symbolic expression (which we see in some internal models). The PR fixes it by adding a `size_hints`. Test Plan: ``` $ python test/inductor/test_torchinductor.py -k test_convolution5 ... ---------------------------------------------------------------------- Ran 2 tests in 7.329s OK ``` Differential Revision: D59097019 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129631 Approved by: https://github.com/chenyang78	2024-06-27 15:27:57 +00:00
Chien-Chin Huang	483dbfcf2a	[BE] Correctly catch skip signals emitting from sys.exit (#129581 ) Some tests in test_c10d_nccl.py override `_join_process()` and `_check_return_codes()`, which causes the skip signals not to be caught appropriately. This PR fixes the issue. Differential Revision: [D59067457](https://our.internmc.facebook.com/intern/diff/D59067457/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129581 Approved by: https://github.com/fduwjj	2024-06-27 15:12:51 +00:00
Huy Do	2d9012ad25	Forward fix internal pyre failure from D58983461 (#129525 ) Summary: Somehow, using underscore alias of some builtin types breaks pyre Test Plan: All failed tests from D58983461 are passing: ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/utils/tests:gpu_memory_utils_test-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:device_util-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:thompson_samplers_gpu-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:combined_sampling_diversifier_test-type-checking buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:submodular_opt_test-type-checking ``` Differential Revision: D59029768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129525 Approved by: https://github.com/XuehaiPan, https://github.com/clee2000, https://github.com/malfet	2024-06-27 14:41:20 +00:00
Aaron Enye Shi	0680e6cd1c	[Profiler] Add sraikund16 to profiler paths in CODEOWNERS (#129591 ) Summary: Add Shivam to the list of code owners for the profiler code paths, so that Shivam gets added to reviewers for PRs too. Test Plan: CI Differential Revision: D59072152 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/129591 Approved by: https://github.com/sraikund16	2024-06-27 14:22:09 +00:00
Animesh Jain	ad607b91f4	[dynamo][onnx] Skip some dynamic=True test with inlining in built nn modules (#129610 ) These tests fail with dynamic=True when inlining in built nn modules. There are a few more recompilations. Since `dynamic=True` is not a recommended usage, I am skipping these tests for now. This is the tracking issue to come back later and fix/update these tests - https://github.com/pytorch/pytorch/issues/129456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129610 Approved by: https://github.com/yanboliang ghstack dependencies: #129583	2024-06-27 10:56:24 +00:00
Chen, Zejun	a028e5862d	[profiler] Directly use end_ns to create the FunctionEvent instead of using start_ns + duration_ns in pytorch profiler post processing for checking parent-child precisely (#129554 ) Use the raw end_ns directly, instead of the sum of start_ns and duration_ns, in order to avoid negative CPU time in profiler. Fix https://github.com/pytorch/pytorch/issues/101861 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129554 Approved by: https://github.com/gujinghui, https://github.com/aaronenyeshi	2024-06-27 10:46:05 +00:00
y-sq	ff026f3d0a	Fix an issue in meta_scaled_mm (#129521 ) Summary: To fix the following failure cases: For example, when `M, K, N = 245760, 656, 6560`, fp8 with compile fails due to `RuntimeError: mat2 must be col_major`. --------- From the inductor generated code (https://fburl.com/everpaste/epcagkrd) ``` V0625 01:38:55.551000 140329914449920 torch/_inductor/scheduler.py:1623] [0/0] scheduling ComputedBuffer(name='buf12', layout=FixedLayout('cuda', torch.float8_e4m3fn, size=[656, 6560], stride=[6656, 1]), ... ... V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] buf12 = empty_strided_cuda((656, 6560), (6656, 1), torch.float8_e4m3fn) ... ... V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] return (buf10, buf2, buf5, buf6, reinterpret_tensor(buf11, (245760, 656), (1, 245760), 0), reinterpret_tensor(buf12, (6560, 656), (1, 6656), 0), ) ... ... V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] assert_size_stride(permute_10, (6560, 656), (1, 6656)) ... ... V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] buf8 = aten._scaled_mm.default(buf6, permute_10, buf7, reciprocal_3, None, None, torch.bfloat16) ``` Inductor gives the mat2 (`permute_10`) a different stride (`6656`) instead of using its shape[0] (`(6560, 656)`). Therefore, the `stride[1] == shape[0]` condition fails. To fix the issue, simply modify the `is_col_major` check to exclude this condition as it doesn't hold for all valid cases. Test Plan: Run the failed case again. It works with the fix. ----- Sandcastle / GitHub CI will make sure the existing tests could still pass. Reviewed By: vkuzo Differential Revision: D58994704 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129521 Approved by: https://github.com/drisspg	2024-06-27 07:03:34 +00:00
Yang Cao	9f29a2291c	Feat: Updated torch.nn.Modules.set_submodules() (#127714 ) modified: torch/nn/modules/module.py Implemented feature request by #127712. Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127714 Approved by: https://github.com/mikaylagawarecki	2024-06-27 06:38:54 +00:00
Animesh Jain	c9798d123b	[dynamo][compile-time] Manually trace torch.nn.Module.parameters (#129583 ) With this PR, we are not worse than no-inlining for Dynamo-only compilation time (there is a little bit of noise, so the outlier of 0.89 is probably ok here). For most of the models, we see positive numbers because of better caching in `UserDefinedObjectVariable`. ![image](https://github.com/pytorch/pytorch/assets/13822661/719d34fd-3e7f-4886-b7e0-1dbfc7141aa5) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129583 Approved by: https://github.com/jansel	2024-06-27 06:06:04 +00:00
Valentin Andrei	cf392d8a89	[pytorch][cuda] Generate kernels for 5x5 filters on depth wise convolution backward (#129609 ) In #125362 we improved the default implementation of depth wise convolution 2D forward pass by precomputing boundaries of accessed slices instead of doing expensive edge checks in the inner loops. We also generated kernels for 5x5 filters as this is a common problem size. In this PR we tried to applied the same strategy for the backward kernel but we only saw good gains just by generating code for 5x5 filters. We could also write a fallback implementation that precomputes access boundaries when filter size and stride are not known at compile time may bring some speedup but that kernel would very rarely be called. This PR also hints the thread count at compile time and leaves only the unroll directive that seems to help performance. Before: ``` B C iH iW kH kW conv2d-backward (cuda) conv2d-fp16-backward (cuda) 0 8.0 64.0 1024.0 1008.0 5.0 5.0 89.002686 26.400480 1 8.0 64.0 1008.0 1008.0 5.0 5.0 88.885025 25.995296 2 4.0 48.0 720.0 539.0 6.0 1.0 9.488832 9.091136 3 4.0 120.0 379.0 283.0 6.0 1.0 4.194640 3.844432 4 4.0 32.0 713.0 532.0 6.0 1.0 8.027296 7.700064 5 4.0 3.0 712.0 542.0 31.0 31.0 15.618095 15.097760 6 4.0 120.0 379.0 288.0 1.0 6.0 3.788224 3.499648 7 1024.0 384.0 1.0 928.0 1.0 3.0 18.988289 14.152768 8 4.0 24.0 687.0 512.0 6.0 1.0 6.902704 6.685056 9 96.0 96.0 112.0 112.0 5.0 5.0 15.672400 4.953984 10 96.0 80.0 56.0 56.0 5.0 5.0 3.261152 1.250320 11 64.0 128.0 64.0 84.0 3.0 3.0 3.172192 1.515648 12 16.0 960.0 7.0 7.0 5.0 5.0 0.197024 0.072736 13 16.0 64.0 112.0 112.0 3.0 3.0 1.126240 0.650304 ``` After ``` conv2d-performance: B C iH iW kH kW conv2d-backward (cuda) conv2d-fp16-backward (cuda) 0 8.0 64.0 1024.0 1008.0 5.0 5.0 76.278656 26.418720 1 8.0 64.0 1008.0 1008.0 5.0 5.0 73.211617 26.018433 2 4.0 48.0 720.0 539.0 6.0 1.0 8.901312 9.322912 3 4.0 120.0 379.0 283.0 6.0 1.0 3.815616 3.992208 4 4.0 32.0 713.0 532.0 6.0 1.0 7.753024 
8.032433 5 4.0 3.0 712.0 542.0 31.0 31.0 15.244144 15.277296 6 4.0 120.0 379.0 288.0 1.0 6.0 3.503264 3.552976 7 1024.0 384.0 1.0 928.0 1.0 3.0 16.682976 14.167969 8 4.0 24.0 687.0 512.0 6.0 1.0 6.802576 7.019040 9 96.0 96.0 112.0 112.0 5.0 5.0 12.713024 4.958656 10 96.0 80.0 56.0 56.0 5.0 5.0 2.648352 1.254752 11 64.0 128.0 64.0 84.0 3.0 3.0 3.213568 1.517952 12 16.0 960.0 7.0 7.0 5.0 5.0 0.182208 0.076256 13 16.0 64.0 112.0 112.0 3.0 3.0 1.139952 0.652432 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129609 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-06-27 06:01:47 +00:00
Yanbo Liang	4082759925	[Inductor] FlexAttention supports block sparse mask (#129216 ) Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286 Initial perf number: * fwd speedup: 0.44 -> 0.72 * bwd speedup: 0.38 -> 0.71 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216 Approved by: https://github.com/Chillee	2024-06-27 05:44:27 +00:00
Jiang, Yanbing	5ee893a84a	Add inductor support for conv3d transpose (#129458 ) This PR adds Conv3d Transpose support in inductor. It basically reuses and expands the Conv2d Transpose support and unit tests to cover Conv3d Transpose. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129458 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-27 05:27:10 +00:00
Wei Wang	9b5b93c58f	[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423 ) Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first. This PR also gives a current read on cu121 and cu124 parity. Essentially reverting #127150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423 Approved by: https://github.com/atalman, https://github.com/eqy	2024-06-27 05:22:18 +00:00
Yifu Wang	ea588d7fd3	[SymmetricMemory] use SCM_RIGHTS socket control message to share exported cumem handle (#129412 ) `SymmetricMemory` currently uses the `pidfd_getfd` syscall to share the exported cumem fd among devices. The syscall is introduced in linux kernel 5.6 which is relatively new and not available everywhere. This PR replaces the use of the `pidfd_getfd` syscall with socket + SCM_RIGHTS control message. The approach is demonstrated in [memMapIPCDrv](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/memMapIPCDrv) in [cuda-samples](https://github.com/NVIDIA/cuda-samples/tree/master/Samples) (relevant code: https://github.com/NVIDIA/cuda-samples/blob/master/Common/helper_multiprocess.cpp). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129412 Approved by: https://github.com/Chillee	2024-06-27 04:38:13 +00:00
Li-Huai (Allan) Lin	84ad5452f6	[MPS] Fused SGD optimizer (#129350 ) ``` [-------------------------------------- Fused SGD --------------------------------------] \| Fused: True \| Fused: False 1 threads: ------------------------------------------------------------------------------ numel: 1024, num_tensors: 100, momentum: True \| 2 \| 15 numel: 1024, num_tensors: 100, momentum: False \| 2 \| 5 numel: 65536, num_tensors: 100, momentum: True \| 3 \| 16 numel: 65536, num_tensors: 100, momentum: False \| 2 \| 5 numel: 1048576, num_tensors: 100, momentum: True \| 11 \| 16 numel: 1048576, num_tensors: 100, momentum: False \| 8 \| 6 numel: 1024, num_tensors: 500, momentum: True \| 29 \| 70 numel: 1024, num_tensors: 500, momentum: False \| 20 \| 24 numel: 65536, num_tensors: 500, momentum: True \| 33 \| 76 numel: 65536, num_tensors: 500, momentum: False \| 22 \| 26 numel: 1048576, num_tensors: 500, momentum: True \| 70 \| 80 numel: 1048576, num_tensors: 500, momentum: False \| 43 \| 40 numel: 1024, num_tensors: 1000, momentum: True \| 108 \| 139 numel: 1024, num_tensors: 1000, momentum: False \| 72 \| 48 numel: 65536, num_tensors: 1000, momentum: True \| 116 \| 150 numel: 65536, num_tensors: 1000, momentum: False \| 77 \| 52 numel: 1048576, num_tensors: 1000, momentum: True \| 190 \| 170 numel: 1048576, num_tensors: 1000, momentum: False \| 120 \| 50 ``` ```python def profile_fused_sgd(): from torch.optim.sgd import sgd import torch.utils.benchmark as benchmark import itertools def profile(fn, params, grads, momentum_buffer_list, fused): fn( params, grads, momentum_buffer_list, momentum=True if len(momentum_buffer_list) > 0 else False, dampening=0.0, nesterov=False, foreach=False, fused=fused, lr=1e-3, weight_decay=.0, maximize=False, grad_scale=None, found_inf=None, ) torch.mps.synchronize() device = "mps" results = [] for num_tensors, numel, momentum in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False]): sublabel = f"numel: {numel}, num_tensors: 
{num_tensors}, momentum: {momentum}" print(sublabel) params, grads = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(2)] momentum_buffer_list = [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] if momentum else [] fn = sgd for fused in [True, False]: t = benchmark.Timer( stmt='profile(fn, params, grads, momentum_buffer_list, fused)', label='Fused SGD', sub_label=sublabel, globals=locals(), description= f"Fused: {fused}", ).blocked_autorange(min_run_time=5) results.append(t) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.colorize(rowwise=True) compare.print() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129350 Approved by: https://github.com/janeyx99 ghstack dependencies: #129006, #129008, #129007, #129105	2024-06-27 04:37:14 +00:00
Eddie Yan	e19042481b	[cuDNN][cuDNN Frontend] Bump cuDNN FE submodule to 1.5.2 (#129592 ) Some relevant fixes include stride-0 support 👀 CC @drisspg @Skylion007 @vedaanta Pull Request resolved: https://github.com/pytorch/pytorch/pull/129592 Approved by: https://github.com/Skylion007	2024-06-27 04:01:23 +00:00
Antoni Viros	9450e198aa	Conversions between strided and jagged layouts for Nested Tensors (#115749 ) This PR does 3 things: 1. Adds a copy-free strided->jagged layout conversion for NT 2. Adds a copy-free jagged->strided layout conversion for NT 3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749 Approved by: https://github.com/jbschlosser	2024-06-27 03:41:28 +00:00
Kurman Karabukaev	c9ceae3fac	Use JK for mast rdzv handler tcpstore handling and additional logging (#129603 ) Summary: Use JK to control the release instead of using env variable to toggle the feature. Note: sharing the store reduces shutdown races as the TCPStore lifecycle is managed outside of trainer rank execution time. Test Plan: CI Differential Revision: D59071544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603 Approved by: https://github.com/d4l3k	2024-06-27 03:34:52 +00:00
Yidi Wu	b9697eacd3	[torchbind] support tensor ops inside of __obj_flatten__ (#129605 ) As titled. Previously, __obj_flatten__ could run in a fake tensor mode, e.g. in process_input of aot_autograd, which is surrounded by a fake tensor mode. This causes the tensor ops inside __obj_flatten__ to run under fake tensor mode. However, the tensors inside of a script object are real tensors, which causes the fake tensor mode to error out saying that we need to first fakify all the tensors (because allow_non_fake_inputs is set to True). In this PR, we disable all the dispatch modes when running to_fake_obj. Note that the output of `__obj_flatten__` will be fakified and filled inside of the corresponding FakeScriptObject. So during tracing, we'll be using a FakeScriptObject that has fake tensor contents. Test Plan: Add a new test: pytest test/export/test_torchbind.py -k test_compile_tensor_op_in_tensor_flatten Pull Request resolved: https://github.com/pytorch/pytorch/pull/129605 Approved by: https://github.com/angelayi	2024-06-27 03:07:31 +00:00
Nikita Shulga	cdbd6542d0	Fix inductor benchmarks (#129620 ) By installing torchao explicitly, as torchao-0.3.0, which was recently released to pypi, introduced a hard dependency on torch-2.3.1, which results in the following cryptic error: `RuntimeError: operator torchvision::nms does not exist` TODOs: - Figure out what installs torchao from pypi rather than builds from source - Add proper CI pin for torchao Pull Request resolved: https://github.com/pytorch/pytorch/pull/129620 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-06-27 02:59:08 +00:00
garfield1997	27a14405d3	enable device index check for all device types (#126767 ) enable device index check for all device types for grad setter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126767 Approved by: https://github.com/albanD	2024-06-27 01:09:53 +00:00
Boyuan Feng	0b7e8df7d8	[CUDAGraph Trees] Enable input mutation support in OSS (#129184 ) Summary: Enable input mutation support for cudagraph trees in OSS. Test Plan: CI Differential Revision: D58847850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129184 Approved by: https://github.com/eellison	2024-06-27 00:49:45 +00:00
yuqingj	7bb558fd6e	add _flash_attention_forward and _efficient_attention_forward to compute intensive ops in partitioner (#129533 ) Avoid recompute of SDPA during the backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129533 Approved by: https://github.com/drisspg	2024-06-27 00:49:00 +00:00
Jiashen Cao	b6689e0fb8	[ts migration] add logging as part of torch logging system (#129405 ) #### Description Add more verbose logging of conversion process. Output which IR is being converted, which function is used to do conversion, and whether it succeeds. #### Example `TORCH_LOGS="+export,ts2ep_conversion" pytest test/export/test_converter.py -s -k test_prim_tolist` ``` test/export/test_converter.py I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] TorchScript graph I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] graph(%x.1 : Long(3, strides=[1], requires_grad=0, device=cpu)): I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject() I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %2 : int = prim::Constant[value=1](), scope: export.test_converter.Module:: I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %3 : int = prim::Constant[value=0](), scope: export.test_converter.Module:: I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] %4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module:: I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] return (%4) I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject()] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_CreateObject] succeeds V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%2 : int = prim::Constant[value=1](), scope: export.test_converter.Module::] V0624 13:19:26.417000 140608224474112 
torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%3 : int = prim::Constant[value=0](), scope: export.test_converter.Module::] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module::] V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_tolist] succeeds I0624 13:19:26.427000 140608224474112 torch/_export/converter.py:760] TS2EPConverter IR-to-IR conversion succeeds ``` #### Test Plan `pytest test/export/test_converter` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129405 Approved by: https://github.com/angelayi	2024-06-27 00:20:20 +00:00
Tugsbayasgalan Manlaibaatar	90f6043368	Don't decompose functional composite ops in export inference IR (#128077 ) Recently we decided to split export IR into two different IRs (training vs inference). In the inference IR, one major change we decided to introduce was we wanted to keep the composite ops that user specified in the IR. This PR does that by overriding the CompositeImplicitAutograd decomp in export inference path. Differential Revision: [D58701607](https://our.internmc.facebook.com/intern/diff/D58701607) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128077 Approved by: https://github.com/bdhirsh	2024-06-26 23:07:55 +00:00
Chirag Pandya	64f1111d38	Expose nlohmann json to torch (#129570 ) Summary: Expose nlohmann json library so that it can be used from inside PyTorch. The library already exists in the `third_party` directory. This PR is making `nlohmann/json.hpp` header available to be used from `torch.distributed`. The next PR makes actual use of this header. imported-using-ghimport Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D59035246 Pulled By: c-p-i-o Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570 Approved by: https://github.com/d4l3k, https://github.com/malfet	2024-06-26 21:59:26 +00:00
HOOLoLo	5ad2ad5921	Update start_, end_ and retired only for the right entry when retire a work (#128948 ) Fixes #128805 If the buffer size of NCCLTraceBuffer is 10 and the pg has recorded 11 works, the entry of work 0 will have been overwritten by work 10, so when the watchdog retires work 0, the start_ and end_ of entry 0 shouldn't be set to nullptr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128948 Approved by: https://github.com/wconstab, https://github.com/c-p-i-o	2024-06-26 21:58:00 +00:00
Elias Ellison	b8e5678ad2	Delete lazy ddp optimizer (#120727 ) This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides. Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727 Approved by: https://github.com/jansel, https://github.com/yf225	2024-06-26 21:53:54 +00:00
Shivam Raikundalia	13316a8d46	[Profiler] Add Rank to NCCL Debug Info (#129528 ) Summary: We need to add the Rank information to the NCCL debug data so that kineto can infer all the necessary process group info such that on-demand can create distributedInfo metadata. Kineto portion will be added in a follow up diff Test Plan: Tested in D58736045, this diff just splits the kineto and profiler instances Differential Revision: D59028819 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129528 Approved by: https://github.com/aaronenyeshi	2024-06-26 21:24:05 +00:00
Catherine Lee	7b1988f922	[ez] Give trymerge id token write permissions after #129503 (#129594 ) Forgot to do this in #129503 Also fix minor typo Pull Request resolved: https://github.com/pytorch/pytorch/pull/129594 Approved by: https://github.com/huydhn	2024-06-26 20:33:14 +00:00
Catherine Lee	795db80975	Upload release tag source code to s3 (#128842 ) Upload tarball containing source code to s3 for release tags Can be found here https://us-east-1.console.aws.amazon.com/s3/buckets/pytorch?region=us-east-1&bucketType=general&prefix=source_code/test/&showversions=false D58695048 for adding permissions to allow uploading to the s3 folder Pull Request resolved: https://github.com/pytorch/pytorch/pull/128842 Approved by: https://github.com/atalman, https://github.com/malfet	2024-06-26 20:32:40 +00:00
Andrea Frittoli	28480dd7dc	[CI] Fix runner determinator for ciflow (#129500 ) In case of ciflow, runs are triggered by a tag which is created by @pytorchbot, which breaks the logic of the runner determinator. In case of tag triggers, extract the pr number from the tag name, fetch the pr and extract the user login from it. Both the inline and standalone python scripts have been updated for consistency. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129500 Approved by: https://github.com/malfet, https://github.com/zxiiro	2024-06-26 20:27:06 +00:00
James Perng	d3d6764082	[pytorch][logging] add fb internal ODS implementation of wait counter (#128605 ) * created fb internal implementation in `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` * uses `facebook::data_preproc::WaitCounterUs` under the hood by having `WaitCounterImpl` trivially subclass it. * this makes `WaitCounterHandle` a glorified pointer to `facebook::data_preproc::WaitCounterUs` which is statically defined in the `STATIC_WAIT_COUNTER` macro making these pointers Meyer's singletons. * `facebook::data_preproc::WaitCounterUs` uses 3 singletons: 1. `std::unique_ptr<DynamicCounter::State>` map — leaky singleton 2. `std::weak_ptr<WaitCounterUs::State>` map — leaky singleton 3. publisherSingleton — normal singleton since it manages resources (threads) * `facebook::data_preproc::WaitCounterUs` actually owns shared pointers to the state and its destructor will remove it from the `std::weak_ptr<WaitCounterUs::State>` map when the reference count for the state hits 0. * linked `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` and added `//data_preproc/common:counters` (dpp dependency) to `caffe2/fb/fbcode/target_definitions.bzl` * wrapped OSS null implementation in `#ifndef FBCODE_CAFFE2` so that internally we use the fb internal implementation. as a follow-up I might move the counter implementation out of the data_preproc/counters library to a more common ai infra library? Differential Revision: [D58458751](https://our.internmc.facebook.com/intern/diff/D58458751/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128605 Approved by: https://github.com/c-p-i-o ghstack dependencies: #128466	2024-06-26 19:11:21 +00:00
Catherine Lee	90f82426b9	RS migration - trymerge to upload merge records to s3 (#129503 ) Uploads merge records to the ossci-raw-job-status (public) bucket instead of directly to rockset. The runner used by trymerge is a GH runner, so it doesn't have access to s3. Instead, I save the record as a json and upload the json to s3 in a different step that runs after the aws credentials are configured. The role is defined [here](https://togithub.com/pytorch-labs/pytorch-gha-infra/pull/421) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129503 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet	2024-06-26 19:06:52 +00:00
PyTorch MergeBot	895316119d	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit 0314c4c101c44d5d89b4fad9d37a012dc6f31128. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))	2024-06-26 19:03:57 +00:00
PyTorch MergeBot	e9aefad641	Revert "[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423 )" This reverts commit 551e4127185195ae8a5331dc8bbfdffd5d4dd1b8. Reverted https://github.com/pytorch/pytorch/pull/128423 on behalf of https://github.com/nWEIdia due to Sorry for reverting your change but I need to revert it to cleanly revert https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/128423#issuecomment-2192423840))	2024-06-26 18:54:41 +00:00
Shangdi Yu	cca85c96cd	[export] minor typo fix (#129543 ) Fixes a typo in torch.export doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129543 Approved by: https://github.com/angelayi	2024-06-26 18:35:31 +00:00
Sam Larsen	87d14ad419	[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257 ) Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR: * Fix the with_fresh_cache_if_config() decorator * Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257 Approved by: https://github.com/oulgen	2024-06-26 18:34:48 +00:00
Huy Do	61bf1452a3	Add one more shard for CPU jobs (#129299 ) The first shard is very close to 3.5h and sometimes times out now `1c75ddff35 (26540310592)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129299 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi	2024-06-26 18:32:10 +00:00
Andres Lugo	b9a1c2c991	[ROCm] Enable F8 Inductor Unit tests (#128353 ) First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353 Approved by: https://github.com/jansel, https://github.com/eellison	2024-06-26 18:30:43 +00:00
Saurabh Mishra	8e4f7f742f	[DCP] Capture reader, writer and planner components in the DCP API logger (#129548 ) Summary: Capture reader, writer and planner components in the DCP API logger Test Plan: logs can be found in scuba pytorch_dcp_logging https://fburl.com/scuba/pytorch_dcp_logging/ruqez1ki Differential Revision: D59040866 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129548 Approved by: https://github.com/wz337, https://github.com/fegin	2024-06-26 18:11:16 +00:00
Isuru Fernando	7373492c9b	Use _unsafe_masked_index in masked_scatter decomposition (#123667 ) and remove masked_scatter_with_index inductor prims Pull Request resolved: https://github.com/pytorch/pytorch/pull/123667 Approved by: https://github.com/peterbell10	2024-06-26 17:18:24 +00:00
Jack Taylor	1b1fd0f4fe	[ROCm] Use additional shard for inductor workflow to resolve timeouts (#129480 ) This will help with timeouts on the inductor workflow. The cuda equivalent job also moved to 2 shards since `e0aa992d73` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129480 Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet	2024-06-26 17:18:20 +00:00
Nikita Shulga	bc68907caa	[EZ][BE] Replace `assertTrue` with more appropriate checks (#129569 ) Based on this https://github.com/pytorch/pytorch/pull/129340#issuecomment-2191228046 I.e. - `assertTrue(x == y)` -> `assertEqual(x, y)` - `assertTrue(not x)` -> `assertFalse(x)` - `assertTrue(x > y)` -> `assertGreater(x, y)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129569 Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007	2024-06-26 16:29:59 +00:00
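A minimal sketch of why the more specific assertions listed in the commit above tend to be preferable: when they fail, the message reports the operand values instead of only `False is not true`. The test class, helper, and values below are hypothetical and chosen purely for illustration; they are not taken from the PR.

```python
import unittest


class AssertionStyleExample(unittest.TestCase):
    """Hypothetical test case showing the replacements named in the commit."""

    def test_specific_assertions(self):
        x, y = 3, 2
        self.assertEqual(x, x)       # instead of assertTrue(x == x)
        self.assertFalse(x < y)      # instead of assertTrue(not x < y)
        self.assertGreater(x, y)     # instead of assertTrue(x > y)


def failure_message(assert_call):
    """Run an assertion expected to fail and return its message text."""
    try:
        assert_call()
    except AssertionError as exc:
        return str(exc)
    return ""


case = AssertionStyleExample("test_specific_assertions")
# The generic form only reports that the boolean was false.
generic_msg = failure_message(lambda: case.assertTrue(1 == 2))
# The specific form reports both operands.
specific_msg = failure_message(lambda: case.assertEqual(1, 2))
```

Here `generic_msg` carries no information about the compared values, while `specific_msg` names both of them, which is usually what a reader of a CI log needs.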
Piotr Kluska	9cf8e5dd32	chore(quantization): Enable PT2E symmetric dynamic quantization (#124615 ) In the `_find_choose_qparams_node` function, check whether the current node is affine or symmetric Pull Request resolved: https://github.com/pytorch/pytorch/pull/124615 Approved by: https://github.com/kimishpatel, https://github.com/malfet	2024-06-26 16:14:58 +00:00
PyTorch MergeBot	f7708ffebb	Revert "[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378 )" This reverts commit 52009068bc39ebc846bd37b44f5f9c5f62257778. Reverted https://github.com/pytorch/pytorch/pull/129378 on behalf of https://github.com/clee2000 due to broke inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_sympy_expr_arg_abi_compatible_cuda and a few other tests https://github.com/pytorch/pytorch/actions/runs/9680978494/job/26713689249 `52009068bc`. The tests were added in https://github.com/pytorch/pytorch/pull/129301 which is before your base ([comment](https://github.com/pytorch/pytorch/pull/129378#issuecomment-2192032697))	2024-06-26 15:46:17 +00:00
Xu Zhao	474d743dba	[torchao][benchmark] Skip all accuracy tests by returning `pass_due_to_skip` (#129545 ) Summary: As the title says. Test Plan: ``` buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --quantization noquant --inference --bfloat16 --accuracy ``` Differential Revision: D59040593 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129545 Approved by: https://github.com/HDCharles	2024-06-26 14:21:53 +00:00
Mikayla Gawarecki	25cec43678	Remove dependency on private _compat_pickle in CPython (#129509 ) Use the IMPORT_MAPPING and NAME_MAPPING from here https://github.com/python/cpython/blob/main/Lib/_compat_pickle.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129509 Approved by: https://github.com/malfet ghstack dependencies: #129239, #129396	2024-06-26 14:20:27 +00:00
Mikayla Gawarecki	3b531eace7	Add example for torch.serialization.add_safe_globals (#129396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396 Approved by: https://github.com/albanD, https://github.com/malfet ghstack dependencies: #129239	2024-06-26 14:20:27 +00:00
Mikayla Gawarecki	303ad8d7f5	Add warning for weights_only (#129239 ) Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239 Approved by: https://github.com/albanD, https://github.com/malfet	2024-06-26 14:20:19 +00:00
Bin Bao	52009068bc	[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378 ) Summary: Unify the UserDefinedTritonKernel argument codegen logic between python wrapper and cpp wrapper. This prepares for later PRs that will simplify AOTI codegen. Differential Revision: [D59002226](https://our.internmc.facebook.com/intern/diff/D59002226) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129378 Approved by: https://github.com/oulgen, https://github.com/chenyang78 ghstack dependencies: #129267	2024-06-26 13:53:27 +00:00
Bin Bao	42d490d41d	[AOTI][refactor] Move generate_user_defined_triton_kernel (#129267 ) Summary: Move generate_user_defined_triton_kernel from cpp_wrapper_cpu to cpp_wrapper_cuda as it's for CUDA only Differential Revision: [D58953005](https://our.internmc.facebook.com/intern/diff/D58953005) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129267 Approved by: https://github.com/chenyang78	2024-06-26 13:50:39 +00:00
Jean Schmidt	53fafdd0c3	[BE] Runner determinator: more resilient user matching (#129462 ) Small improvements on runner determinator script: * Don't do splitting of the issue comment, unless necessary; * Match username against a set over a list; * Match both triggering_actor and issue owner over only actor (to avoid edge cases, where we get `pytorch-bot[bot]`) * Add stripping, to remove potential breaking and not visible whitespaces; * Don't use linux.4xlarge as a runner: it should not depend on meta runners, for reliability; Pull Request resolved: https://github.com/pytorch/pytorch/pull/129462 Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi	2024-06-26 13:47:52 +00:00
PyTorch MergeBot	211f38e742	Revert "[ALI] [Reland] Use LF runners for Lint (#129071 )" This reverts commit 1b92bdd0ea326cd30bc3945602701ffe28c85fd5. Reverted https://github.com/pytorch/pytorch/pull/129071 on behalf of https://github.com/malfet due to All LF jobs are backlogged, so revert this one ([comment](https://github.com/pytorch/pytorch/pull/129071#issuecomment-2191676677))	2024-06-26 13:19:00 +00:00
Yifu Wang	92be3403ea	Fix an issue in oneShotAllReduce where different ranks perform reduction in different order (#129501 ) In `oneShotAllReduce`, ranks read data from peers in a round-robin fashion to load-balance NVLinks. However, the subsequent reduction is also performed in this order, which differs across ranks. This can result in slight numerical differences across ranks, which can lead to a hang in data-dependent applications like speculative decoding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129501 Approved by: https://github.com/Chillee	2024-06-26 08:43:10 +00:00
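The numerical effect described in the commit above comes from floating-point addition not being associative. The following is a small sketch using plain Python floats, not the actual CUDA kernel; the function name, the per-peer values, and the per-rank orders are hypothetical.

```python
def reduce_in_order(values, order):
    """Sum `values` by visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# One contribution per peer (illustrative values only).
peer_values = [0.1, 0.2, 0.3]

# Each rank starts its round-robin read at a different peer.
rank0_sum = reduce_in_order(peer_values, [0, 1, 2])
rank1_sum = reduce_in_order(peer_values, [1, 2, 0])

# Reducing in one fixed order on every rank gives identical results.
fixed_order = [0, 1, 2]
rank0_fixed = reduce_in_order(peer_values, fixed_order)
rank1_fixed = reduce_in_order(peer_values, fixed_order)
```

The two round-robin sums differ in the last bits even though the inputs are the same, which is enough to make ranks diverge when a later decision depends on exact equality. Reading in any order but reducing in a fixed order removes the discrepancy.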
Animesh Jain	f2840bb220	[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164 ) TorchDynamo guard mechanism guards on the key order of the dictionaries if the user iterates over the dictionary. For standard dict, we can write a fast C++ implementation by using PyDict_Next. But with OrderedDict, we have to rely on the `keys` Python API to get the key ordering. This makes guard evaluation slow. With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason not to use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model. Functionality impact - The only difference between dict and OrderedDict is the `move_to_end` method for OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks but this PR keeps the OrderedDict for hooks untouched (we should still follow up with hooks but in a separate PR). Perf impact - I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). Typing impact - I don't anticipate any. For all the user visible methods for nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes required. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #129163	2024-06-26 07:59:42 +00:00
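A short sketch of the two facts the commit above relies on: a plain `dict` preserves insertion order, and `move_to_end` exists only on `OrderedDict`. The keys and values are placeholders, not real module state.

```python
from collections import OrderedDict

# A plain dict keeps keys in insertion order (guaranteed since Python 3.7).
plain = {}
plain["weight"] = 1
plain["bias"] = 2

ordered = OrderedDict()
ordered["weight"] = 1
ordered["bias"] = 2

# Both containers iterate in the same order.
same_order = list(plain) == list(ordered)

# move_to_end is specific to OrderedDict; plain dict has no such method.
ordered.move_to_end("weight")
plain_has_move_to_end = hasattr(plain, "move_to_end")
```

Code that only inserts, looks up, and iterates behaves the same with either container, which is the case the commit describes for `_parameters`, `_modules` and `_buffers`.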
Will Feng	ead97ee486	[Compile+SAC] Only warn for in-place ops once (#129397 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129397 Approved by: https://github.com/tianyu-l	2024-06-26 07:25:02 +00:00
cdzhan	c422a9549d	[easy][DCP] Fix test_fsdp_ep.py for _MeshEnv.create_child_mesh API ch… (#129445 ) …ange Update test/distributed/checkpoint/e2e/test_fsdp_ep.py for #127465 change. Failure info: ```bash [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Caught exception: [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Traceback (most recent call last): [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 657, in run_test [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] getattr(self, test_name)() [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 539, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] fn() [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_utils.py", line 2744, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] method(args, kwargs) [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 369, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] func(self, args, *kwargs) # type: ignore[misc] [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 
180, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] return func(args, *kwargs) [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/checkpoint_utils.py", line 44, in wrapper [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] func(self, args, **kwargs) [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] File "/projs/framework/fooooo/code/pytorch_new/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 76, in test_e2e [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, 0, "dp") [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] TypeError: _MeshEnv.create_child_mesh() takes 3 positional arguments but 4 were given [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] To execute this test, run the following from the base repo dir: [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] python test/distributed/checkpoint/e2e/test_fsdp_ep.py -k TestFSDPWithEP.test_e2e [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] [rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129445 Approved by: https://github.com/fegin, https://github.com/wz337	2024-06-26 06:43:30 +00:00
wz337	8b8e2fcdda	[DCP] Fix Optimizer Learning Rate not being loaded correctly (#129398 ) Fixes #129079 Currently, tensor objects are loaded correctly in place, but non-tensor objects such as the learning rate are not loaded correctly after `f518cf811d`, which is a regression introduced in 2.3. This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression of non-tensor loading. Test: ``` # test to make sure lr is loading correctly python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict # test to make sure load on meta device model still works python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129398 Approved by: https://github.com/fegin	2024-06-26 06:41:47 +00:00
Sheng Fu	000f2d637b	Refactoring the code to make it lint clean (#129424 ) Summary: Refactoring the code to make it lint clean Test Plan: buck2 build mode/dev-tsan caffe2/test:test_profiler_cuda Differential Revision: D58971175 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129424 Approved by: https://github.com/aaronenyeshi	2024-06-26 06:12:01 +00:00
Li-Huai (Allan) Lin	610894e978	[MPS][BE] Generalize Fused optimizers (#129105 ) This PR generalizes the multi_tensor_apply function for other fused optimizers Pull Request resolved: https://github.com/pytorch/pytorch/pull/129105 Approved by: https://github.com/malfet ghstack dependencies: #129006, #129008, #129007	2024-06-26 06:00:41 +00:00
Pian Pawakapan	d02bba519c	[export] match fake mode for _decompose_exported_program() (#129421 ) Summary: _decompose_exported_program() ran into an issue with trace_joint, where trace_joint() produces values with mismatching FakeModes. Adding fake mode context to aot_export_module() so this doesn't happen. #thanks to tugsbayasgalan for the fix! Test Plan: test_experimental Differential Revision: D58977694 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129421 Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17	2024-06-26 05:52:31 +00:00
Chien-Chin Huang	7420bad74c	[BE] Do not assert if the barrier is not created (#129497 ) The folder will be created as long as TEMP_DIR is set and the program has write permission. This will ensure some test environments can run the spawn tests. Differential Revision: [D59020736](https://our.internmc.facebook.com/intern/diff/D59020736/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129497 Approved by: https://github.com/fduwjj, https://github.com/wz337	2024-06-26 05:51:36 +00:00
Anshul Sinha	c04cec609d	[dtensor][debug] fixing CommDebugMode module collective tracing (#128887 ) Summary The logic for CommDebugMode module collective tracing is incorrect as it only worked for leaf module nodes on the model's module tree. If we had a sub-module that had a collective call along with a nested module inside it, the sub-module was not removed from the module_tracker parent set leading to double-counting collectives. This problem was addressed by checking to make sure the current sub-module was not already in the parent set. The output of the below test cases should remain the same. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/128887 Approved by: https://github.com/XilunWu ghstack dependencies: #128729	2024-06-26 05:25:57 +00:00
Anshul Sinha	bd3a11776f	[dtensor][test] test case suite for comm_mode features (#128729 ) Summary Currently, there is only an example file for comm_mode and its features. I have created test cases that mirror the examples while the more complicated test cases also ensure that comm_mode resets all variables when used multiple times in the same function. This test case suite will also help developers ensure that new code they add to comm_mode does not affect correctness of old features. #128536 Test Plan pytest test/distributed/_tensor/debug/test_comm_mode_features.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128729 Approved by: https://github.com/XilunWu	2024-06-26 05:25:57 +00:00
Tugsbayasgalan Manlaibaatar	6181e65cd8	Nested tensor subclass support (#127431 ) When we have nested tensor subclasses, we need to recursively flatten/unflatten in Fake tensor creation and AOTAUtograd. Most of the PR is about mechanical change which changes today's single level flatten logic to be recursive. Differential Revision: [D58533224](https://our.internmc.facebook.com/intern/diff/D58533224) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127431 Approved by: https://github.com/bdhirsh	2024-06-26 04:45:22 +00:00
Huy Do	cda4d4887d	Skip signals from older runs of the same workflows (#129291 ) I discovered this bug in trymerge when debugging https://github.com/pytorch/pytorch/pull/129013 in which Dr.CI reported no relevant failures while mergebot complained about some unrelated ROCm failures https://github.com/pytorch/pytorch/pull/129013#issuecomment-2183009217. It turns out that mergebot took into account stale signals from older runs of the same workflow here. For example, * https://github.com/pytorch/pytorch/actions/runs/9604985361 was the first run where it had a ROCm failure * While https://github.com/pytorch/pytorch/actions/runs/9608926565 was the second attempt and it was all green Notice that both runs came from the same push to commit [be69191](`be69191f2d`) with [ciflow/rocm/129013](https://github.com/pytorch/pytorch/tree/ciflow/rocm/129013). So, we just need to check the signals from the newer run. Note that Dr.CI handles this part correctly using the logic in https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/drci/drci.ts#L1079-L1088. So, the fix in this PR is to bring the same logic to trymerge. ### Testing `pytest -v test_trymerge.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129291 Approved by: https://github.com/ZainRizvi	2024-06-26 03:49:09 +00:00
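The fix described in the commit above amounts to keeping only the newest run per workflow and commit before evaluating signals. Below is a hedged sketch of that idea, not the trymerge implementation: the record layout, field names, and the function `latest_runs` are hypothetical, and it assumes a larger run id means a newer run. The two run ids are the ones quoted in the commit message.

```python
def latest_runs(runs):
    """Keep only the newest run for each (workflow, head_sha) pair.

    Assumption: run ids increase over time, so the larger id is newer.
    """
    newest = {}
    for run in runs:
        key = (run["workflow"], run["head_sha"])
        if key not in newest or run["run_id"] > newest[key]["run_id"]:
            newest[key] = run
    return list(newest.values())


# Hypothetical records for the two runs mentioned in the commit message.
runs = [
    {"workflow": "rocm", "head_sha": "be69191", "run_id": 9604985361,
     "conclusion": "failure"},
    {"workflow": "rocm", "head_sha": "be69191", "run_id": 9608926565,
     "conclusion": "success"},
]

fresh = latest_runs(runs)
stale_failures = [r for r in fresh if r["conclusion"] == "failure"]
```

With the stale first attempt filtered out, the earlier ROCm failure no longer counts against the merge.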
James Perng	c718e2f43b	[pytorch][logging] add empty wait counter implementation (#128466 ) Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128466 Approved by: https://github.com/c-p-i-o	2024-06-26 03:47:17 +00:00
xinan.lin	54f27b886e	[Inductor UT] Reuse test_distributed_patterns.py for Intel GPU (#129437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129437 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-06-26 02:58:45 +00:00
CaoE	555f71a15b	Fix test_auto_simd in machine with AMX support (#129444 ) Fixes #129438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129444 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-06-26 02:50:55 +00:00
cdzhan	a89a1ed072	[easy][DCP] make BroadcastingTorchSaveReader device generic (#129231 ) test/distributed/checkpoint/test_format_utils.py passes on GPU and other devices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129231 Approved by: https://github.com/fegin	2024-06-26 02:37:30 +00:00
Peter Bell	90d5a6f001	[inductor] Add lowering and codegen for aten.sort (#128458 ) Closes #125633 Benchmarks: \| Shape \| dim \| stable \| compiled \| eager \| speedup \| \|-------------\|-----\|--------\|----------\|---------\|---------\| \| (256, 4096) \| 0 \| False \| 0.73 ms \| 1.26 ms \| 1.7 \| \| (256, 4096) \| 0 \| True \| 0.75 ms \| 1.27 ms \| 1.7 \| \| (4096, 256) \| 1 \| False \| 0.20 ms \| 0.73 ms \| 3.7 \| \| (4096, 256) \| 1 \| True \| 0.21 ms \| 0.73 ms \| 3.5 \| \| (255, 4096) \| 0 \| False \| 1.05 ms \| 1.48 ms \| 1.4 \| \| (255, 4096) \| 0 \| True \| 1.03 ms \| 1.47 ms \| 1.4 \| \| (4096, 255) \| 1 \| False \| 0.52 ms \| 0.98 ms \| 1.9 \| \| (4096, 255) \| 1 \| True \| 0.54 ms \| 1.00 ms \| 1.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/128458 Approved by: https://github.com/lezcano, https://github.com/eellison	2024-06-26 01:36:39 +00:00
Eddie Yan	b7e7a4cb01	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-26 00:49:18 +00:00
Yanbo Liang	9554a9af87	[GPT-benchmark] Distinguish LLM models and micro-benchmarks (#129498 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129498 Approved by: https://github.com/huydhn	2024-06-26 00:25:05 +00:00
Catherine Lee	0d0d42c4a7	test_qat_mobilenet_v2 succeeding on dynamo (#129532 ) https://github.com/pytorch/pytorch/actions/runs/9669572961/job/26677024995 The test is usually marked as slow, so it doesn't get run on dynamo since dynamo doesn't have a slow equivalent. However, it is succeeding, so we might as well do what the logs tell us to do and remove the failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129532 Approved by: https://github.com/malfet, https://github.com/kit1980	2024-06-25 23:55:12 +00:00
Peter Bell	112ef79f29	[inductor] Remove comm-specific node attributes from scheduler (#129084 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129084 Approved by: https://github.com/lezcano	2024-06-25 23:52:19 +00:00
wz337	d1f9e822dd	[DTensor][Test] Update implicit replication unit tests for tensor arg being the first in args list (#127803 ) Change the operands order so we can have test coverage for when the first arg is a tensor arg instead of DTensor arg. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127803 Approved by: https://github.com/XilunWu	2024-06-25 23:51:58 +00:00
Will Feng	575bc1e3af	[Reopen #114036 ] Allow "must recompute" in torch.compile + selective checkpointing (SAC) (#129295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129295 Approved by: https://github.com/Chillee	2024-06-25 23:47:08 +00:00
joydddd	f389541ce0	Add Strided Input test for flex attention (#128915 ) Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-06-25 23:26:34 +00:00
Catherine Lee	87ebd627a7	RS migration - upload sccache stats to s3 instead of rockset (#129490 ) Upload sccache stats to s3 instead of rockset. I don't think we use these anywhere, so it's ok to cut off the ingest into rockset right now. We should consider deleting this entirely if we don't plan on using it. I will work on copying existing data over from rockset to s3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129490 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-06-25 23:23:16 +00:00
PyTorch MergeBot	52341c28e8	Revert "[FSDP2] Ran post-acc-grad hooks manually (#129450 )" This reverts commit 7ebffef4d02a3cc68dbbcf44b92d63c7fe0ebb67. Reverted https://github.com/pytorch/pytorch/pull/129450 on behalf of https://github.com/clee2000 due to broke distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager `7ebffef4d0` https://github.com/pytorch/pytorch/actions/runs/9667812641/job/26671489454. Test got added in https://github.com/pytorch/pytorch/pull/129157 which is before your mergebase ([comment](https://github.com/pytorch/pytorch/pull/129450#issuecomment-2190174363))	2024-06-25 23:13:57 +00:00
Yifu Wang	bbd47f7b2f	Remove ProcessGroupCudaP2P and change async-TP to use SymmetricMemory (#128762 ) This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762 Approved by: https://github.com/wanchaol	2024-06-25 22:32:21 +00:00
Chien-Chin Huang	1c5df9107d	[BE] Fix several incorrect skip tests (#129488 ) These tests may not be skipped properly if the NCCL library exists but CUDA is not available. Differential Revision: [D59013855](https://our.internmc.facebook.com/intern/diff/D59013855/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129488 Approved by: https://github.com/wz337, https://github.com/fduwjj	2024-06-25 22:10:31 +00:00
Shunting Zhang	fd414d6189	[inductor] don't materialize the large sparse matrix in CE bwd (#129043 ) Inductor currently materializes a large sparse matrix in the backward pass for CrossEntropyLoss and loads it to compute gradients of the Softmax input. If we could fuse the sparse matrix computation into the consumer side, we would get both perf and memory usage wins. The Fx graph snippet that constructs this sparse matrix looks like: ``` full_default_3: "bf16[32768, 50257]" = torch.ops.aten.full.default([32768, 50257], 0, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0), pin_memory = False) scatter: "bf16[32768, 50257]" = torch.ops.aten.scatter.value(full_default_3, 1, where_2, -1.0); full_default_3 = where_2 = None ``` Leveraging the following observations: - the scatter is applied upon an all-zero (or more generally a const) tensor - the index tensor for the scatter has a single element on the scatter dimension (in this case, the label tensor) allows us to lower this 'scatter_upon_const_tensor' pattern to a pointwise kernel that can be easily fused with downstream kernels: ``` def inner_fn(idx): selector_idx = list(idx) selector_idx[dim] = 0 # can do this since the index tensor has a single element on the scatter dimension selector = selector_loader(selector_idx) return ops.where( selector == ops.index_expr(idx[dim], torch.int64), ops.constant(val, dtype), ops.constant(background_val, dtype), ) ``` ## Test result on microbenchmark For the microbenchmark added as `test_cross_entropy_loss`, we improve latency from 47.340ms to 42.768ms, memory footprint from 10.524GB to 7.227GB on A100. (on H100, we improve latency from 27.54ms to 23.51ms, memory footprint from 10.574GB to 7.354GB). The saving matches the back-of-envelope calculation. We avoid storing a BF16 tensor with shape [30K, 50K] which is about 3GB in size. 
On A100, avoiding loading and storing such a tensor can save roughly 3GB x 2 / 1.5TB/s = 4ms ## Test result on llm.c We also test this on llm.c and the saving is much larger, especially for memory footprint. The reason is autotuning, which allocates extra memory for benchmarking. (Check https://github.com/pytorch/pytorch/issues/129258 and https://github.com/pytorch/pytorch/pull/129399 for more details). For the llm.c PyTorch implementation on A100, we improve from 171K tokens/s, 33.6G peak memory usage to 180K tokens/s, 18.6G peak memory usage. (A 45% saving of peak memory) ## Test on PyTorch 2.0 Dashboard The optimization is quite general, especially for transformers. We tested this on the PyTorch 2.0 dashboard. Here is the [result](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2017%20Jun%202024%2018%3A07%3A51%20GMT&stopTime=Mon%2C%2024%20Jun%202024%2018%3A07%3A51%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/158/head&lCommit=c62c55e29c65497d495217b6574bb36b0c4da7d4&rBranch=main&rCommit=0d25f096c1beaf8749932a3d6083ad653405ed71). TLDR, for the Huggingface benchmark suite, we get 6% geomean perf improvement and 10% geomean memory footprint improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129043 Approved by: https://github.com/jansel, https://github.com/Chillee	2024-06-25 21:25:50 +00:00
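The back-of-envelope estimate in the commit above can be reproduced with a few lines of arithmetic. This sketch uses the `bf16[32768, 50257]` shape from the graph snippet and reads the bandwidth figure in the message as 1.5 TB/s, which is an assumption; the variable names are illustrative.

```python
# Shape of the materialized tensor, taken from the Fx graph snippet.
rows, cols = 32768, 50257
bytes_per_element = 2  # bfloat16 is 16 bits wide

tensor_bytes = rows * cols * bytes_per_element
tensor_gb = tensor_bytes / 1e9

# Assumed memory bandwidth: 1.5 TB/s, as read from the commit message.
bandwidth_bytes_per_s = 1.5e12

# The tensor is written once and read once, hence the factor of 2.
saved_ms = 2 * tensor_bytes / bandwidth_bytes_per_s * 1e3

print(f"tensor size: {tensor_gb:.2f} GB, estimated saving: {saved_ms:.2f} ms")
```

The result lands a little above 3 GB and a little above 4 ms, consistent with the rounded figures quoted in the message and with the measured latency drop on A100.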
Will Constable	e1499f6342	[C10D] Make new_group eager when used with comm_split (#129284 ) If users pass `device_id` to init_process_group, they enable eager init for the default group. Then if they subsequently call `new_group`, the device_id argument is not required as it should be assumed to match the one used for init_process_group. However, both `init_process_group` and `new_group` apis share a helper function, which expects a `device_id` value that defaults to None. When it's None, eager initialization is disabled. This PR ensures that if a device_id was passed to init_process_group, the same device_id will automatically be fed into the helper function for any new_group calls that follow. Test plan I found an existing test in CI `test_comm_split_subgroup` that failed after my change, because it was asserting that backend comm_split counter did not increment eagerly, and its behavior had changed to increment eagerly. I updated the test in the PR to pass with my change. I also tested locally via simple program with TORCH_CPP_LOG_LEVEL=INFO and observed eager initialization of the 'lows' and 'highs' PGs before the 'Here' print. ``` import torch import torch.distributed as dist dist.init_process_group(backend="nccl", device_id =torch.device(f"cuda:{torch.distributed.get_node_local_rank(0)}")) dist.new_group([0, 1], group_desc="lows") dist.new_group([2, 3], group_desc="highs") print("Here") torch.distributed.destroy_process_group() ``` Output: https://gist.github.com/wconstab/88a5ba0b970244ca1f79133f989e0349 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129284 Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj, https://github.com/d4l3k, https://github.com/nvcastet	2024-06-25 21:09:34 +00:00
Zhengxu Chen	e58ef5b65f	[export] Rewrite exportdb formatting. (#129260 ) Summary: It'll be easier to generate examples if the code doesn't depend on exportdb library. Test Plan: CI Differential Revision: D58886554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129260 Approved by: https://github.com/tugsbayasgalan	2024-06-25 21:04:53 +00:00
Wei Wang	551e412718	[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423 ) Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first. This PR also gives a current read on cu121 and cu124 parity. Essentially reverting #127150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423 Approved by: https://github.com/atalman, https://github.com/eqy	2024-06-25 20:59:49 +00:00
Max Podkorytov	79959d707c	[Inductor][ROCm] Composable Kernel backend for Inductor (#125453 ) This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection. The implementation is heavily influenced by the series of PRs that add the CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are (1) customizing the compiler for the ROCm platform and (2) customizing template code generation for Composable Kernel Universal GEMM instances. We provide config tuning knobs for balancing the compilation time of instance sources against finding the best instance. ### Testing Install the ck library ``` pip install git+https://github.com/rocm/composable_kernel@develop ``` Run the test ``` TORCH_LOGS=+torch._inductor \ pytest --capture=tee-sys test/inductor/test_ck_backend.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453 Approved by: https://github.com/eellison, https://github.com/jansel	2024-06-25 20:54:14 +00:00
DiweiSun	ae0f84d89c	[CI] Enable amp accuracy check for inductor cpu (#127758 ) This enables the Inductor AMP accuracy check on CPU in the CI workflow to catch issues early. Three suites are included: timms, huggingface, and torchbench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127758 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-25 20:34:18 +00:00
Jiashen Cao	45f2876934	[Fix] NumToTensor resulted from numel() and size() in TSCovnerter (#128761 ) #### Issue In jit.trace, the result of torch.numel() is automatically cast to a `LongTensor`, but the conversion dropped that cast. `prim::NumToTensor` was previously converted to `torch.ops.aten.scalar_tensor`, which uses the same `dtype` as the input tensor instead of `LongTensor`. In this PR, we add a cast to convert it to the correct `dtype`. #### Test Plan We re-enable the previously failing test case. * `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128761 Approved by: https://github.com/angelayi	2024-06-25 20:20:03 +00:00
Jeff Daily	e68ee2cadb	TunableOp hotfix (#129281 ) Fixes. - PYTORCH_TUNABLEOP_NUMERICAL_CHECK=1 had a memory leak. - The strided batched gemm size calculation for buffer rotation was incorrect resulting in a mem fault. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129281 Approved by: https://github.com/xw285cornell, https://github.com/eqy, https://github.com/mxz297	2024-06-25 20:12:46 +00:00
Chirag Pandya	1865fe282f	Log whenever we sleep (#129197 ) Summary: Log whenever we sleep for heartbeatTimeout. Useful for debugging stuck jobs. This will eventually turn into a metric. Test Plan: none. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129197 Approved by: https://github.com/Skylion007, https://github.com/d4l3k, https://github.com/wconstab	2024-06-25 20:09:41 +00:00
PyTorch MergeBot	b1f486aff9	Revert "Add warning for weights_only (#129239 )" This reverts commit 381ce0821c3fa2b342f0b8660c76cc27f48543c4. Reverted https://github.com/pytorch/pytorch/pull/129239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm `381ce0821c`, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))	2024-06-25 19:30:07 +00:00
PyTorch MergeBot	7cf454ec52	Revert "Add example for torch.serialization.add_safe_globals (#129396 )" This reverts commit f18becaaf1c7a7bf851e3ae8d215eee8dba688b6. Reverted https://github.com/pytorch/pytorch/pull/129396 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm `381ce0821c`, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))	2024-06-25 19:30:07 +00:00
Tristan Rice	0298560ca2	TCPStore: improve connect and retry logic (#129261 ) We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times. This PR does a few things: * Retry that connect and validate up to the specified timeout. * Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep. * Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141 Test plan: ``` python test/distributed/test_store.py -v ./build/bin/BackoffTest ``` Will do internal testing with some large scale jobs to ensure TCPStore works correctly. At 4k scale: 4x improvement ``` tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (pytorch-3.10) started 0 init 0 set 0 joined all ________________________________________________________ Executed in 1.98 secs fish external usr time 0.93 secs 91.00 micros 0.93 secs sys time 1.98 secs 954.00 micros 1.97 secs tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10 (pytorch-3.10) tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (torchdrive-3.10) started 0 init 0 set 0 joined all ________________________________________________________ Executed in 8.20 secs fish external usr time 2.15 secs 0.00 micros 2.15 secs sys time 2.76 secs 843.00 micros 2.76 secs ``` ```py import time import os import threading from multiprocessing import Pool WORLD_SIZE = 10000 import torch.distributed as dist def run(rank): should_log = rank % (WORLD_SIZE // 10) == 0 if should_log: print(f"started {rank}") store = dist.TCPStore( host_name="devvm4382.nao0.facebook.com", port=29500, world_size=WORLD_SIZE, is_master=rank == 0, use_libuv=True, ) if should_log: 
print(f"init {rank}") store.set(f"key{rank}", "1234") if should_log: print(f"set {rank}") del store def noop(rank): pass print("starting pool") with Pool(WORLD_SIZE) as pool: pool.map(noop, range(WORLD_SIZE), 1) print("pool hot") start = time.time() pool.map(run, range(WORLD_SIZE), 1) print("run finished", time.time()-start) ``` ``` tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py (pytorch-3.10) starting pool pool hot started 0 [W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it. started 1000 init 1000 set 1000 started 2000 init 2000 set 2000 started 3000 init 3000 set 3000 started 4000 init 4000 set 4000 started 5000 init 5000 set 5000 started 6000 init 6000 set 6000 started 7000 init 7000 set 7000 started 8000 init 8000 set 8000 started 9000 init 9000 set 9000 init 0 set 0 run finished 0.705092191696167 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261 Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o	2024-06-25 19:24:22 +00:00
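The retry strategy described in the TCPStore commit above (retry connect and validate up to a timeout, using exponential backoff with jitter instead of a fixed 1s sleep) can be sketched in plain Python. This is only an illustration of the general technique, not the C++ code from the PR; the names `backoff_delays` and `retry_until` and all default values are assumptions made for this sketch.

```python
import random
import time


def backoff_delays(initial=0.05, multiplier=2.0, max_delay=1.0, jitter=0.5, rng=None):
    """Yield an endless sequence of sleep intervals in seconds.

    Hypothetical helper: the base delay grows geometrically up to `max_delay`,
    and each yielded value is scaled by a random factor in
    [1 - jitter, 1 + jitter] so that many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delay = initial
    while True:
        yield delay * (1.0 + jitter * (2.0 * rng.random() - 1.0))
        delay = min(delay * multiplier, max_delay)


def retry_until(operation, timeout, sleep=time.sleep, clock=time.monotonic):
    """Call `operation` until it succeeds or `timeout` seconds have elapsed.

    Only ConnectionError is retried here; that choice is an assumption of
    this sketch. The last error is re-raised once the deadline has passed.
    """
    deadline = clock() + timeout
    for delay in backoff_delays():
        try:
            return operation()
        except ConnectionError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise
            # Never sleep past the deadline.
            sleep(min(delay, remaining))
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise in tests without real waiting.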
Nikita Shulga	816e8a3f21	[MacOS] Improve libomp packaging (#129473 ) Instead of replacing `@rpath/libomp.dylib` with `@loader_path/libomp.dylib`, keep it in place and add `@loader_path` as a new rpath. This should prevent double-loading of the OpenMP runtime, because in the case of `@rpath` the loader is allowed to reuse other libraries, but the `@loader_path` directive forces it to load the library from the location relative to the executable. Test plan: - Prepare the environment ```shell conda create -n py310-cf python=3.10 numpy pip -c conda-forge conda activate py310-cf pip install torch --index-url https://download.pytorch.org/whl/test/cpu ``` - Verify that OpenMP is loaded twice and then crashes ```shell KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())" ``` output: ``` LLVM OMP version: 5.0.20140926 LLVM OMP library type: performance LLVM OMP link type: dynamic LLVM OMP build time: no_timestamp LLVM OMP build compiler: Clang 16.0 LLVM OMP alternative compiler support: yes LLVM OMP API version: 5.0 (201611) LLVM OMP dynamic error checking: no LLVM OMP thread affinity support: no LLVM OMP version: 5.0.20140926 LLVM OMP library type: performance LLVM OMP link type: dynamic LLVM OMP build time: no_timestamp LLVM OMP build compiler: Clang 12.0 LLVM OMP alternative compiler support: yes LLVM OMP API version: 5.0 (201611) LLVM OMP dynamic error checking: no LLVM OMP thread affinity support: no 2.4.0 True zsh: segmentation fault KMP_VERSION=true python -c ``` - Install artifact from this PR and make sure it passes the same test ```shell python -mpip install ~/Downloads/torch-2.5.0.dev20240625-cp310-none-macosx_11_0_arm64.whl KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())" ``` output ``` LLVM OMP version: 5.0.20140926 LLVM OMP library type: performance LLVM OMP 
link type: dynamic LLVM OMP build time: no_timestamp LLVM OMP build compiler: Clang 16.0 LLVM OMP alternative compiler support: yes LLVM OMP API version: 5.0 (201611) LLVM OMP dynamic error checking: no LLVM OMP thread affinity support: no 2.5.0.dev20240625 True tensor(1.0000) ``` - Make sure it still uses bundled OpenMP if none is available in the environment ``` conda uninstall numpy -c conda-forge KMP_VERSION=true python -c "from ctypes import cdll, c_char_p, c_uint32; import torch; from ctypes import cdll, c_char_p, c_uint32; libdyld = cdll.LoadLibrary('libSystem.dylib'); libdyld._dyld_image_count.restype = c_uint32; libdyld._dyld_get_image_name.restype = c_char_p; libdyld._dyld_get_image_name.argtypes = [c_uint32]; print(torch.rand(300, 300).abs().max()); libs = [libdyld._dyld_get_image_name(i).decode('ascii') for i in range(libdyld._dyld_image_count())]; print([l for l in libs if 'libomp.dylib' in l])" ``` Fixes https://github.com/pytorch/pytorch/issues/124497 and https://github.com/pytorch/pytorch/issues/126385 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129473 Approved by: https://github.com/atalman	2024-06-25 19:12:34 +00:00
PyTorch MergeBot	b045878f81	Revert "Remove test_mps_allocator_module XFAIL (#129340 )" This reverts commit c888ee36325148ed99db4298bf2ae739ebbeacdc. Reverted https://github.com/pytorch/pytorch/pull/129340 on behalf of https://github.com/huydhn due to The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](https://github.com/pytorch/pytorch/pull/129340#issuecomment-2189701706))	2024-06-25 18:37:54 +00:00
Andrew Gu	7ebffef4d0	[FSDP2] Ran post-acc-grad hooks manually (#129450 ) FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually. Discussion Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity. Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not. Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually. Caveats - Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass). - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be. - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers. - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`. - The post-accumulate-grad hook runs in the reduce-scatter stream. 
Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream. - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a 3% decrease in MFU when running the optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues. - This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope. Experiments (torchtitan) - Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision: - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped) - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450 Approved by: https://github.com/weifengpy	2024-06-25 18:34:56 +00:00
Yidi Wu	dd00f5e78d	Fixes T192448049 (#129146 ) Differential Revision: D58767610 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129146 Approved by: https://github.com/angelayi	2024-06-25 17:50:15 +00:00
Weizhuo Zhang	53f462c506	Write dynamo benchmarks performance result to csv when throw exceptions (#126764 ) Performance mode issue: When the performance warm-up of a dynamo benchmark fails, the result is not written to the csv file, whereas accuracy mode writes `fail_to_run` even when the dynamo pass fails. As a result, the number of models in the accuracy csv does not match the number in the performance csv. ![image](https://github.com/pytorch/pytorch/assets/84730719/9043d215-130b-46b4-a835-f148c225947c) - Fix: Models whose warm-up failed are now recorded in the csv file, as shown here: ![image](https://github.com/pytorch/pytorch/assets/84730719/7907a3c2-c942-42bb-b31c-55424a0e8117) Accuracy mode issue: `detectron2_fasterrcnn_r` models failed in accuracy mode but passed in performance mode. The accuracy failure is the same as in PR `ee557d8f61`. ``` Dynamic Shape: Traceback (most recent call last): File "benchmarks/dynamo/torchbench.py", line 449, in <module> torchbench_main() File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main main(TorchBenchmarkRunner(), original_dir) File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main process_entry(0, runner, original_dir, args) File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry return run(runner, args, original_dir) File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run assert marked, f"nothing in example_inputs had a dim with {batch_size}" AssertionError: nothing in example_inputs had a dim with 4 ``` ![image](https://github.com/pytorch/pytorch/assets/84730719/f25392f0-f982-46c8-8e2c-a8a25d85a21a) - Fix: same as PR `ee557d8f61`: setting batch_size to 4 is skipped when testing dynamic shapes. 
Dynamic shapes passrate improved from 89% -> 95% \| Comp Item \| Compiler \| suite \| before \| After fix \| \|-----------\|----------\|------------\|------------\|------------\| \| Pass Rate \| Inductor \| torchbench \| 89%, 73/82 \| 95%, 79/83 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764 Approved by: https://github.com/jansel	2024-06-25 17:49:04 +00:00
atalman	e317a8b264	Add guard to use AMX for x86_64 only (#129479 ) Trying to mitigate aarch64 and s390 nightly failures as per this comment: https://github.com/pytorch/pytorch/pull/127195#issuecomment-2189177949 Fixes https://github.com/pytorch/pytorch/issues/129443 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129479 Approved by: https://github.com/nWEIdia, https://github.com/malfet	2024-06-25 17:31:28 +00:00
PyTorch MergeBot	45b2931b7e	Revert "[Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414 )" This reverts commit b24787b7576c184a54d13c1833ada23a395f5c31. Reverted https://github.com/pytorch/pytorch/pull/129414 on behalf of https://github.com/ZainRizvi due to This PR seems to be causing multiple macOS failures. Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))	2024-06-25 17:05:55 +00:00
PyTorch MergeBot	fb40ba6fc2	Revert "[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247 )" This reverts commit aa4ee2cb9e1f9be6bbdd27654e0f768b7fe9be6c. Reverted https://github.com/pytorch/pytorch/pull/127247 on behalf of https://github.com/ZainRizvi due to This PR seems to be causing multiple macOS failures. Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))	2024-06-25 17:05:55 +00:00
PyTorch MergeBot	ad76da6c16	Revert "[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257 )" This reverts commit 7b57ddd38c6d502ba313c0e6b0c92b6787d69986. Reverted https://github.com/pytorch/pytorch/pull/129257 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 `4c1e4c5f30`, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))	2024-06-25 16:48:32 +00:00
PyTorch MergeBot	b38f6d4cd2	Revert "[inductor] Enable FX graph caching in OSS by default (#125863 )" This reverts commit 4c1e4c5f307f9743014a08cf97d3fa8de7e1ce5f. Reverted https://github.com/pytorch/pytorch/pull/125863 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 `4c1e4c5f30`, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))	2024-06-25 16:48:32 +00:00
vinithakv	f8db12a538	Fix logic to find sbgemm in BLAS library (#125227 ) The current logic that sets the HAS_SBGEMM flag is skipped when the BLAS libraries have already been found, i.e., when BLAS is selected through the environment variable BLAS=OpenBLAS. If BLAS_LIBRARIES is already set, the code that checks whether BLAS_LIBRARY has sbgemm is never executed. This commit moves that check out so that it runs unconditionally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125227 Approved by: https://github.com/malfet	2024-06-25 16:34:38 +00:00
Zhengxu Chen	665d6ea05b	[export] Fix IR canonlization. (#129401 ) Summary: As title. We should unpack the results from _canonicalize_graph. Differential Revision: D58963429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129401 Approved by: https://github.com/tugsbayasgalan	2024-06-25 16:33:02 +00:00
Joel Schlosser	e364290718	Support linear backward for NJT with dim > 3 (#129393 ) Replaces usage of `torch.mm()` with `torch.matmul()` in NJT's impl of linear_backward to support higher dims. See [here](https://github.com/pytorch/pytorch/issues/125214#issuecomment-2184968703) for more context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129393 Approved by: https://github.com/soulitzer	2024-06-25 16:06:23 +00:00
Klein Shen	0e6bb7f1ce	[caffe2][be] migrate gloabl static initializer (#128784 ) Summary: The Caffe2 lib has 200+ usages of global static initializers, each of which is a paper cut for startup perf. Details in this post: https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154. This diff migrates StorageImpl.cpp. Additional context: https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154 Test Plan: CI Differential Revision: D58639283 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128784 Approved by: https://github.com/aaronenyeshi	2024-06-25 15:30:49 +00:00
Nikita Shulga	fd4af87855	Fix non-portable path warning (#129474 ) MacOS uses case-insensitive filesystem by default, but it's better to specify include path using proper capitalization Should fix ``` MultiTensorApply.h:4:10: warning: non-portable path to file '<ATen/native/mps/operations/FusedOptimizerOps.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path] #include <Aten/native/mps/operations/FusedOptimizerOps.h> ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129474 Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/qqaatw	2024-06-25 15:17:21 +00:00
drisspg	cb1c56caba	Set target dependencies to always build for sm90a on rowwise scaling (#129402 ) # Summary Instead of landing global builder changes; https://github.com/pytorch/builder/pull/1878 This PR targets only the Rowwise file and adds the sm90a featurs. Verified locally by setting: ``` TORCH_CUDA_ARCH_LIST=9.0 ``` We can see in the build.ninja file that the proper flags are set: ``` build caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o: CUDA_COMPILER__torch_cuda_unscanned_Release /home/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu \|\| cmake_object_order_depends_target_torch_cuda DEFINES = -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS DEP_FILE = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o.d FLAGS = -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 
-Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-unused-function,-Wno-maybe-uninitialized -Wno-deprecated-copy -gencode arch=compute_90a,code=sm_90a INCLUDES = -I/home/drisspg/meta/pytorch/build/aten/src -I/home/drisspg/meta/pytorch/aten/src -I/home/drisspg/meta/pytorch/build -I/home/drisspg/meta/pytorch -I/home/drisspg/meta/pytorch/third_party/onnx -I/home/drisspg/meta/pytorch/build/third_party/onnx -I/home/drisspg/meta/pytorch/third_party/foxi -I/home/drisspg/meta/pytorch/build/third_party/foxi -I/home/drisspg/meta/pytorch/aten/src/THC -I/home/drisspg/meta/pytorch/aten/src/ATen/cuda -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/drisspg/meta/pytorch/build/caffe2/aten/src -I/home/drisspg/meta/pytorch/aten/src/ATen/.. -I/home/drisspg/meta/pytorch/build/nccl/include -I/home/drisspg/meta/pytorch/c10/cuda/../.. -I/home/drisspg/meta/pytorch/c10/.. 
-I/home/drisspg/meta/pytorch/third_party/tensorpipe -I/home/drisspg/meta/pytorch/build/third_party/tensorpipe -I/home/drisspg/meta/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/drisspg/meta/pytorch/torch/csrc/api -I/home/drisspg/meta/pytorch/torch/csrc/api/include -isystem /home/drisspg/meta/pytorch/build/third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/drisspg/meta/pytorch/third_party/protobuf/src -isystem /home/drisspg/meta/pytorch/third_party/ittapi/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda-12.3/include -isystem /home/drisspg/meta/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/drisspg/meta/pytorch/third_party/ideep/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/cudnn_frontend/include OBJECT_DIR = caffe2/CMakeFiles/torch_cuda.dir OBJECT_FILE_DIR = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129402 Approved by: https://github.com/malfet	2024-06-25 13:54:51 +00:00
Li-Huai (Allan) Lin	71ebe5121a	[MPS] Fast math env var (#129007 ) Allow users to decide whether they want to have fast math enabled via env var Pull Request resolved: https://github.com/pytorch/pytorch/pull/129007 Approved by: https://github.com/malfet ghstack dependencies: #129006, #129008	2024-06-25 13:52:07 +00:00
Shangdi Yu	bbdeff76fc	fix add decomposition for complex numbers (#129044 ) Fixes #125745 Bug source: When addition requires broadcasting, adding complex numbers is not implemented correctly in `torch/_inductor/decomposition.py` because `x.view(x.real.dtype)` would multiply the last dimension by 2, and then broadcasting wouldn't work. Fix: re-shape the complex tensors after view and before broadcasting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129044 Approved by: https://github.com/zou3519, https://github.com/lezcano	2024-06-25 11:05:41 +00:00
Sanket Jayant Purandare	6508f0f5d4	Improved backward tracking and attribution, fixed typing for python < 3.10 (#129400 ) For #125323 * Fixes typing for python < 3.10 * Fixes #129390 For #124688 * Improved attribution by registering `register_hook` and `post_accumulate_grad_hook` on params. * Fixed premature per-module bw peak state initialization for AC. * This improves per-module stats; global `peak_mem` was already accurate and remains unaffected. For #128508 * When AC is applied to a `mod (nn.Module)`, the backward order of execution is `pre-bw -> pre-fw -> post-fw -> post-bw`. Since the `ModTracker` maintains the `parents` attribute as a set, the `post-fw` during backward was prematurely removing the module from `parents`. * With the fix we now maintain a per-module counter and only remove a module from `parents` when its counter goes to 0. * Added tests to ensure this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129400 Approved by: https://github.com/awgu, https://github.com/huydhn	2024-06-25 10:54:58 +00:00
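The per-module counter described in the commit above can be illustrated with a small stand-alone sketch. It is not the `ModTracker` implementation; `ParentTracker`, `enter`, and `exit` are hypothetical names chosen here only to show why a counter survives the `pre-bw -> pre-fw -> post-fw -> post-bw` ordering where a plain set does not.

```python
from collections import Counter


class ParentTracker:
    """Hypothetical stand-in for bookkeeping of currently active modules."""

    def __init__(self):
        # Number of currently open enter() calls per module name.
        self._active = Counter()

    @property
    def parents(self):
        # A module counts as a parent while at least one entry is still open.
        return {name for name, count in self._active.items() if count > 0}

    def enter(self, name):
        # Called from a pre-forward or pre-backward hook.
        self._active[name] += 1

    def exit(self, name):
        # Called from a post-forward or post-backward hook; the module is
        # only dropped once every matching enter() has been closed.
        self._active[name] -= 1
        if self._active[name] == 0:
            del self._active[name]
```

With a plain set, the `post-fw` call that happens during recomputation would drop the module even though the enclosing backward has not finished; with the counter it stays in `parents` until `post-bw`.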
Alexander Grund	63474620ab	test_jit: Replace plain assert by test assert (#128950 ) The plain assert doesn't show the values in case of failure Pull Request resolved: https://github.com/pytorch/pytorch/pull/128950 Approved by: https://github.com/zou3519	2024-06-25 09:04:53 +00:00
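The difference the commit above refers to can be reproduced with the standard `unittest` module alone. The sketch below is illustrative and is not taken from `test_jit`; the class `Example` and the helper `failure_messages` are hypothetical names. It runs one test with a plain `assert` and one with `assertEqual`, then collects the failure text of each.

```python
import unittest


def failure_messages():
    """Run two deliberately failing tests and return their failure text.

    The test class is defined inside the function so that test runners
    do not collect the intentionally failing cases on their own.
    """

    class Example(unittest.TestCase):
        def test_plain_assert(self):
            a, b = 1, 2
            assert a == b  # plain assert: no message is attached to the error

        def test_assert_equal(self):
            a, b = 1, 2
            self.assertEqual(a, b)  # test assert: the message includes both values

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Example)
    result = unittest.TestResult()
    suite.run(result)
    # Map the bare method name to the formatted traceback of its failure.
    return {test.id().rsplit(".", 1)[-1]: tb for test, tb in result.failures}
```

Printing `failure_messages()["test_assert_equal"]` shows a traceback ending in `AssertionError: 1 != 2`, which is the information a plain `assert` omits under `unittest`.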
Xuehai Pan	0314c4c101	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-25 08:28:38 +00:00
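A few of the equivalences the commit above relies on can be checked with the standard library. This is a minimal sketch, assuming a made-up relative path; the helper name `ancestors` is hypothetical and not part of the PR.

```python
import os
from pathlib import Path


def ancestors(path_str):
    """Compare the os.path and pathlib spellings for walking up a path."""
    path = Path(path_str)
    return {
        # Step 2: nested dirname calls versus Path(...).parent.parent
        "nested_dirname": os.path.dirname(os.path.dirname(path_str)),
        "parent_parent": str(path.parent.parent),
        # Step 4: four chained .parent accesses versus .parents[3]
        "chained": path.parent.parent.parent.parent,
        "indexed": path.parents[3],
    }


# Hypothetical example path, built with os.path.join so separators match the platform.
example = os.path.join("repo", "tools", "linter", "adapters", "example.py")
info = ancestors(example)
```

Note that `.parents` is zero-based, so `.parents[N - 1]` corresponds to `N` chained `.parent` accesses.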
Fuzzkatt	4ca8eecca4	skip test_graph_capture_oom for jetson (#128661 ) On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error: ``` RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor yield File "/usr/lib/python3.10/unittest/case.py", line 591, in run self._callTestMethod(testMethod) File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod method() File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, **kwargs) File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom with self.assertRaisesRegex(RuntimeError, oom_regex): File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__ self._raiseFailure('"{}" does not match "{}"'.format( File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure raise self.test_case.failureException(msg) AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. " ``` This is a known issue as nvml support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires nvml to be properly loaded, which fails on Jetson. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661 Approved by: https://github.com/eqy, https://github.com/atalman	2024-06-25 08:25:11 +00:00
eqy	8bfd9e9815	[cuDNN] Graph-capturable cuDNN CTCLoss (#128271 ) cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant ~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~ Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup. CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-06-25 06:01:50 +00:00
Jiong Gong	533c4190f9	[inductor][cpp] support nested kernel with indirect indexing (#129223 ) This PR makes sure the current kernel is used for generating CSE variables when nested kernel codegen is involved, e.g., nested CppKernel is used to generate epilogue of CppTemplateKernel. Without the fix, the epilogue with indirect indexing would fail to run. pytest -k test_linear_with_embedding_bias_False_cpu test_cpu_select_algorithm.py Epilogue code Before: ```c++ { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)*m_start)); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp11 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0*x0)), 16); auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 ? tmp3 : tmp0; auto tmp5 = decltype(tmp4)(tmp4 + tmp2); auto tmp6 = tmp1 ? tmp5 : tmp4; auto tmp7 = tmp6; auto tmp8 = c10::convert<int64_t>(tmp7); TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L"); auto tmp10 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384L*tmp6)), 16); auto tmp12 = (tmp11); auto tmp13 = tmp10 + tmp12; tmp13.store(Y + static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))); } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp11 = local_acc_buf[static_cast<long>(x1 + (N0*x0))]; auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 ? tmp3 : tmp0; auto tmp5 = decltype(tmp4)(tmp4 + tmp2); auto tmp6 = tmp1 ? 
tmp5 : tmp4; auto tmp7 = tmp6; auto tmp8 = c10::convert<int64_t>(tmp7); TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L"); TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L"); auto tmp10 = in_ptr3[static_cast<long>(n_start + x1 + (384L*tmp6))]; auto tmp12 = c10::convert<float>(tmp11); auto tmp13 = decltype(tmp10)(tmp10 + tmp12); Y[static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))] = tmp13; } } } ``` Epilogue code After: ```c++ { #pragma GCC ivdep for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)*m_start)); x0+=static_cast<long>(1L)) { for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp13 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0*x0)), 16); auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 < 0; auto tmp5 = tmp4 ? tmp3 : tmp0; auto tmp6 = decltype(tmp5)(tmp5 + tmp2); auto tmp7 = tmp5 < 0; auto tmp8 = tmp7 ? tmp6 : tmp5; auto tmp9 = tmp8; auto tmp10 = c10::convert<int64_t>(tmp9); TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L"); auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384L*tmp8)), 16); auto tmp14 = (tmp13); auto tmp15 = tmp12 + tmp14; tmp15.store(Y + static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))); } #pragma omp simd simdlen(8) for(long x1=static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L)) { auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)]; auto tmp13 = local_acc_buf[static_cast<long>(x1 + (N0*x0))]; auto tmp1 = 64L; auto tmp2 = c10::convert<int64_t>(tmp1); auto tmp3 = decltype(tmp0)(tmp0 + tmp2); auto tmp4 = tmp0 < 0; auto tmp5 = tmp4 ? 
tmp3 : tmp0; auto tmp6 = decltype(tmp5)(tmp5 + tmp2); auto tmp7 = tmp5 < 0; auto tmp8 = tmp7 ? tmp6 : tmp5; auto tmp9 = tmp8; auto tmp10 = c10::convert<int64_t>(tmp9); TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L"); TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L"); auto tmp12 = in_ptr3[static_cast<long>(n_start + x1 + (384L*tmp8))]; auto tmp14 = c10::convert<float>(tmp13); auto tmp15 = decltype(tmp12)(tmp12 + tmp14); Y[static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))] = tmp15; } } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129223 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel	2024-06-25 05:21:00 +00:00
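Comparing the two epilogue listings in #129223, the visible difference is the negative-index wrap-around: the "Before" code uses the index itself as the select condition (`tmp4 = tmp0 ? tmp3 : tmp0`), while the "After" code first computes `tmp0 < 0` and selects on that. A minimal Python sketch of that difference follows. The names `wrap_buggy`, `wrap_fixed`, and `in_bounds` are hypothetical and exist only for illustration; they are not Inductor functions, and the size 64 in the usage note is taken from the listing.

```python
def wrap_buggy(idx: int, size: int) -> int:
    # Mirrors the "Before" listing: the index itself is the condition,
    # so every non-zero index gets `size` added, negative or not.
    return idx + size if idx else idx


def wrap_fixed(idx: int, size: int) -> int:
    # Mirrors the "After" listing: only negative indices are wrapped.
    return idx + size if idx < 0 else idx


def in_bounds(idx: int, size: int) -> bool:
    # Same predicate as the TORCH_CHECK in the generated code.
    return 0 <= idx < size
```

With `size = 64`, `wrap_fixed` leaves a valid index such as 3 untouched and maps -1 to 63, whereas `wrap_buggy` pushes 3 out of range, which would trip the bounds check.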
cdzhan	665dbc2f52	[easy][DCP] Fix test_fine_tuning.py for get/set_state_dict API changes (#129365 ) Update test/distributed/checkpoint/e2e/test_fine_tuning.py for https://github.com/pytorch/pytorch/pull/112203 change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129365 Approved by: https://github.com/fegin	2024-06-25 05:12:02 +00:00
titaiwangms	0e1e289033	[ONNX] Benchmark refactored ONNX export (#129427 ) Reuse torch.onnx.export with torch_onnx patch to test ExportedProgram -> ONNX IR exporter Pull Request resolved: https://github.com/pytorch/pytorch/pull/129427 Approved by: https://github.com/justinchuby	2024-06-25 04:47:53 +00:00
Mikayla Gawarecki	f18becaaf1	Add example for torch.serialization.add_safe_globals (#129396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396 Approved by: https://github.com/albanD ghstack dependencies: #129244, #129251, #129239	2024-06-25 04:19:44 +00:00
Mikayla Gawarecki	381ce0821c	Add warning for weights_only (#129239 ) Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239 Approved by: https://github.com/albanD ghstack dependencies: #129244, #129251	2024-06-25 04:19:44 +00:00
Mikayla Gawarecki	c5f7755e86	Allow BUILD/NEWOBJ instruction for items added via torch.serialization.add_safe_globals (#129251 ) Previously, allowlisting functions/classes via `torch.serialization.add_safe_globals(obj)` for the `weights_only` Unpickler had the following effect: - For a [`GLOBAL`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1939) instruction, `GLOBAL obj.__module__ obj.__name__` would be allowed and translated back to obj to be pushed back to the stack. - For a [`REDUCE`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1982) instruction where we expect the stack to contain `func` and `args`, `func` is allowed if it was added via `add_safe_globals`. However, it did not have an effect on `BUILD` and `NEWOBJ` instructions. Some classes may be rebuilt via the [`NEWOBJ`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L2091-L2104) instruction, which indicates that their constructor should be used to rebuild the class. Further, a [`BUILD`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1984-L2007) instruction might be used if an object's `__reduce__`/`__reduce_ex__` returns a non-None value for `state`, which indicates a `__setstate__` or `__dict__.update`. This PR makes sure that adding objects to the allowlist will also allow `NEWOBJ` and `BUILD` instructions for them. In particular, the update for `NEWOBJ` should unblock allowlisting of [`ScaledMMConfig`](`d4ade877df/float8_experimental/float8_tensor.py (L26-L30)`) in float8_experimental @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/129251 Approved by: https://github.com/albanD ghstack dependencies: #129244	2024-06-25 04:19:44 +00:00
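To see where `NEWOBJ` and `BUILD` come from, the sketch below lists the opcodes the standard-library pickler emits for an ordinary class instance at protocol 2. It uses only `pickle`, `pickletools`, and `argparse` from the standard library and does not touch PyTorch's `weights_only` unpickler; `opcode_names` is a hypothetical helper, and `argparse.Namespace` is just a convenient stand-in for a plain class with instance attributes.

```python
import argparse
import pickle
import pickletools


def opcode_names(obj, protocol=2):
    """Return the names of the pickle opcodes emitted for `obj`."""
    data = pickle.dumps(obj, protocol=protocol)
    return [op.name for op, _arg, _pos in pickletools.genops(data)]


# argparse.Namespace is a plain Python class with instance attributes and no
# custom __reduce__, so protocol 2 rebuilds it via NEWOBJ and restores its
# attributes via BUILD.
names = opcode_names(argparse.Namespace(x=1))
print(names)
```

The class itself is still referenced through a `GLOBAL` instruction, so an allowlist that only handles `GLOBAL` and `REDUCE` would stop at the `NEWOBJ` that follows it.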
Mikayla Gawarecki	1bb1e3463c	Fix allowlisting of builtins for weights_only unpickler (#129244 ) Since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), some functions/classes that were renamed from python 2-->3 will be pickled with their python2 name. This PR ensures that when a `GLOBAL <python2_mod>.<python2_name>` instruction is encountered, it is properly mapped to `<python3_mod>.<python3_name>`, [following the strategy used by pickle](https://github.com/python/cpython/blob/main/Lib/pickle.py#L1590C13-L1593C63). This fix ensures that `add_safe_globals` works properly for such functions/classes (i.e. users will allowlist the python3 func and the weights_only unpickler will do the appropriate translation when checking whether a class was allowlisted). An example is as follows: `__builtin__` was renamed to `builtins`, see the [release notes for Python 3.0](https://docs.python.org/3/whatsnew/3.0.html) > Renamed module `__builtin__` to [`builtins`](https://docs.python.org/3/library/builtins.html#module-builtins) (removing the underscores, adding an ‘s’). The __builtins__ variable found in most global namespaces is unchanged. To modify a builtin, you should use [builtins](https://docs.python.org/3/library/builtins.html#module-builtins), not `__builtins__`! However, since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), builtins will be pickled with their module string as `__builtin__`. ```python >>> import pickle >>> import pickletools >>> print.__module__ 'builtins' >>> with open('print.pkl', 'wb') as f: >>> pickle.dump(print, f, protocol=2) # 2 because this is the default protocol used by pytorch >>> with open('print.pkl', 'rb') as f: >>> pickletools.dis(f) 0: \x80 PROTO 2 2: c GLOBAL '__builtin__ print' # pickle saves the module string as __builtin__ !!! :( 21: q BINPUT 0 23: . 
STOP ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129244 Approved by: https://github.com/albanD	2024-06-25 04:19:44 +00:00
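The python2-to-python3 translation described in #129244 can be sketched with the rename tables that CPython's own unpickler consults. This is an illustration under stated assumptions, not the code added to `torch/serialization.py`: `translate_global` is a hypothetical helper name, and `_compat_pickle` is a private standard-library module, so relying on it outside a demo is a judgment call.

```python
import _compat_pickle  # private stdlib module holding the Python 2 -> 3 rename tables
import pickle


def translate_global(module, name):
    """Map a Python 2 (module, name) pair found in a protocol<=2 pickle
    to its Python 3 equivalent, in the same order pickle.py checks them."""
    if (module, name) in _compat_pickle.NAME_MAPPING:
        return _compat_pickle.NAME_MAPPING[(module, name)]
    if module in _compat_pickle.IMPORT_MAPPING:
        return _compat_pickle.IMPORT_MAPPING[module], name
    return module, name


# Protocol 2 writes the Python 2 module name for builtins...
payload = pickle.dumps(print, protocol=2)
# ...and the translation recovers the name a user would allowlist.
resolved = translate_global("__builtin__", "print")
```

Names that were never renamed pass through unchanged, so an allowlist lookup can always be done on the translated pair.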
Will Feng	aa4ee2cb9e	[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247 ) Test command: `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247 Approved by: https://github.com/bdhirsh ghstack dependencies: #129414	2024-06-25 03:13:38 +00:00
Will Feng	b24787b757	[Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414 ) This makes it easier to do pattern-matching on `fsdp.split_with_sizes_copy` in Inductor passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129414 Approved by: https://github.com/bdhirsh	2024-06-25 03:08:56 +00:00
Isuru Fernando	e6bfa2958b	Add aten._unsafe_masked_index (#116491 ) To generate masked indexing operations that would generate masked loads in triton code Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-25 02:45:02 +00:00
Zain Rizvi	4d04203852	[BE] Runner determinator: Expect usernames to be prefixed with '@' (#129246 ) Expect the username in the runner rollover issue (https://github.com/pytorch/test-infra/issues/5132) to be prefixed with a "@". This will make typos way less likely since GitHub's autocomplete/autoformatting will help out. For now, I've updated the issue to have usernames both with and without the @ while this change rolls out. Testing: Ran the script locally on both this issue and a new test issue and verified they both had the expected output: ``` (venv) (base) ➜ ~/pytorch git:(zainr/improve-get-workflow-type) python .github/scripts/get_workflow_type.py --github-token github_pat_*** --github-issue 5132 --github-user ZainRizvi --github-branch "zainr/stuff" {"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129246 Approved by: https://github.com/zxiiro, https://github.com/huydhn	2024-06-25 02:39:33 +00:00
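The "@"-prefix rule in #129246 could be sketched as below. This is not the logic in `.github/scripts/get_workflow_type.py`; `is_user_listed` is a hypothetical helper, and it assumes (without confirmation from the commit message) that the rollover issue body lists one username per line.

```python
def is_user_listed(issue_body: str, username: str) -> bool:
    """Return True if `username` appears in the issue body as '@username'.

    Assumption: one entry per line; comparison is case-insensitive.
    """
    wanted = "@" + username.lstrip("@").lower()
    listed = {line.strip().lower() for line in issue_body.splitlines()}
    return wanted in listed


# Hypothetical issue body: only the '@'-prefixed entry counts as opted in.
body = "@ZainRizvi\nsomeoneelse\n"
```

Under this sketch a bare `someoneelse` line is ignored, which matches the stated intent of requiring the prefix once the rollout completes.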
Kazuaki Ishizaki	533395e204	Fix build error on s390x (#129326 ) This PR fixes the build error on s390 after #127195. The following is the log of the build on s390x. This is because `SYS_arch_prctl` is not defined on s390x. ``` ... [792/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o [793/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o /usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/cmake/../third_party/benchmark/include -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -I/pytorch/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src -I/pytorch/build/caffe2/../aten/src -I/pytorch/torch/csrc -I/pytorch/third_party/miniz-2.1.0 -I/pytorch/third_party/kineto/libkineto/include -I/pytorch/third_party/kineto/libkineto/src -I/pytorch/third_party/cpp-httplib -I/pytorch/aten/src/ATen/.. -I/pytorch/c10/.. 
-I/pytorch/third_party/FP16/include -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/third_party/fmt/include -I/pytorch/third_party/flatbuffers/include -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/cmake/../third_party/googletest/googlemock/include -isystem /pytorch/cmake/../third_party/googletest/googletest/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/cmake/../third_party/eigen -isystem /pytorch/build/include -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -fPIC -DTORCH_USE_LIBUV -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -c /pytorch/aten/src/ATen/cpu/Utils.cpp 
/pytorch/aten/src/ATen/cpu/Utils.cpp: In function 'bool at::cpu::init_amx()': /pytorch/aten/src/ATen/cpu/Utils.cpp:60:21: error: 'SYS_arch_prctl' was not declared in this scope; did you mean 'SYS_prctl'? 60 \| long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA); \| ^~~~~~~~~~~~~~ \| SYS_prctl [794/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Integration.cpp.o [795/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/GridSampler.cpp.o [796/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CPUGuardImpl.cpp.o [797/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ThreadLocalState.cpp.o [798/2147] Building CXX object caffe2/CMakeFiles/vec_test_all_types_DEFAULT.dir/__/aten/src/ATen/test/vec_test_all_types.cpp.o [799/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Utils.cpp.o [800/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapModeRegistrations.cpp.o [801/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ZeroTensorFallback.cpp.o [802/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o ninja: build stopped: subcommand failed. Building wheel torch-2.5.0a0+git94dc325 -- Building version 2.5.0a0+git94dc325 cmake -GNinja -DBUILD_CAFFE2=0 -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.10/dist-packages -DPython_EXECUTABLE=/usr/bin/python3 -DTORCH_BUILD_VERSION=2.5.0a0+git94dc325 -DUSE_GLOO=0 -DUSE_NUMPY=True /pytorch cmake --build . --target install --config Release Build step 'Execute shell' marked build as failure ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129326 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-06-25 02:39:13 +00:00
Animesh Jain	c4dd752d97	[dynamo][compile-time][inlining-inbuilt-nn-modules] Manually implement nn.Module._call_impl (#129285 ) # Compile time for eager backend ## AlbertForMaskedLM No inlining - 3.65 seconds Inlining on main - 7.48 seconds Inlining + this PR - 2.86 seconds ## MobileBertForMaskedLM No inlining - 26.90 seconds Inlining on main - 48.21 seconds Inlining + this PR - 24.25 seconds Pull Request resolved: https://github.com/pytorch/pytorch/pull/129285 Approved by: https://github.com/jansel ghstack dependencies: #129316, #129315	2024-06-25 01:31:26 +00:00
Animesh Jain	514f9279f8	[dynamo][compile-time] Manually implement nn.Module.__getattr__ to reduce compile time (#129315 ) # Compile time for eager backend ## AlbertForMaskedLM No inlining - 3.65 seconds Inlining on main - 7.48 seconds Inlining + this PR - 6.70 seconds ## MobileBertForMaskedLM No inlining - 26.90 seconds Inlining on main - 48.21 seconds Inlining + this PR - 43.85 seconds Next PR in the stack makes the total compile time better/comparable to no inlining Pull Request resolved: https://github.com/pytorch/pytorch/pull/129315 Approved by: https://github.com/jansel ghstack dependencies: #129316	2024-06-25 01:31:26 +00:00
PyTorch MergeBot	c012013aa6	Revert "Add Strided Input test for flex attention (#128915 )" This reverts commit 41bb81b58279f492e72bd270b3b071dd2953ed8c. Reverted https://github.com/pytorch/pytorch/pull/128915 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its tests are failing in trunk, i.e. `41bb81b582 (26627138290)` ([comment](https://github.com/pytorch/pytorch/pull/128915#issuecomment-2187695317))	2024-06-25 00:43:34 +00:00
Colin Peppler	1315be4893	[aotinductor] only autotune at compile time when enabled via config (#129413 ) internal breakage when enabled. Test Plan: CI Differential Revision: D58965784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129413 Approved by: https://github.com/jingsh, https://github.com/desertfire	2024-06-25 00:41:10 +00:00
Antoni Vros	78e40b271b	Change index_put on GPU to accept FP8 inputs (#128758 ) As the title says, this PR changes the dispatcher for the CUDA index_put_ kernel to accept FP8 inputs. This is useful for Transformers models where the KV cache is FP8 and has been pre-allocated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128758 Approved by: https://github.com/eqy, https://github.com/drisspg	2024-06-25 00:38:03 +00:00
wz337	8b6391ee59	[Test][DTensor] Temporarily skip gloo test for test_depthwise_convolution (#129391 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129391 Approved by: https://github.com/awgu	2024-06-25 00:29:50 +00:00
Shunting Zhang	81de71fdc5	[inductor] fix a double clone in coordesc tuning (#129399 ) It's embarrassing that there is a hidden double clone bug in coordinate descent tuning. In `CachingAutotuner.coordinate_descent_tuning`, we clone mutated args to make sure benchmarking does not cause numerical problems. But later on in `CachingAutotuner.bench` we do that again. This double clone is fine if - the tensor is small - the allocation of the tensor is not on the critical path for memory footprint. But neither holds for quite common usage of cross entropy loss. This is related to the memory usage debugging in https://github.com/pytorch/pytorch/pull/129043 . Note that the general issue of peak memory usage increasing due to autotuning still exists. This bug just makes it worse (since we double allocate). Pull Request resolved: https://github.com/pytorch/pytorch/pull/129399 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-06-25 00:18:51 +00:00
Nikita Shulga	14dc08ddc7	Inductor to fail gracefully on Voltas for bf16 tensors (#129288 ) Volta (sm_7x) GPUs do not have hardware support for the bfloat16 datatype. It is emulated in software, so PyTorch eager can use bfloat16 tensors, but Triton cannot. So if a graph with either CUDA bf16 input or output tensors is used, raise warnings and skip the frame. Add optional parameter `including_emulation` to the `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx._check_triton_bf16_support`. Test plan: Modify `is_bf16_supported` to return False and see that a warning is generated. Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288 Approved by: https://github.com/eqy, https://github.com/jansel	2024-06-25 00:04:13 +00:00
Sam Larsen	4c1e4c5f30	[inductor] Enable FX graph caching in OSS by default (#125863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863 Approved by: https://github.com/eellison, https://github.com/oulgen ghstack dependencies: #129257	2024-06-24 23:39:43 +00:00
Sam Larsen	7b57ddd38c	[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257 ) Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR: * Fix the with_fresh_cache_if_config() decorator * Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257 Approved by: https://github.com/oulgen	2024-06-24 23:39:43 +00:00
Yidi Wu	b22f0f5f51	[torchbind] fix bug of mutating FakeScriptObjects twice in aot_export (#128844 ) This PR does two things: 1. It duplicates the fake script object because aot_export traces the program twice, and the result of the first trace would otherwise make the result of the second trace wrong. 2. It also adds a new test for methods that return constant outputs. Before the PR, there was no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant return operators in the graph because torchbind objects are stateful and deleting them would remove the implicit state mutation inside of the object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844 Approved by: https://github.com/angelayi	2024-06-24 23:14:34 +00:00
joydddd	41bb81b582	Add Strided Input test for flex attention (#128915 ) Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915 Approved by: https://github.com/Chillee, https://github.com/drisspg	2024-06-24 22:56:39 +00:00
yuqingj	00f675bb4c	[Nested Tensor]fix sdpa backward for the special case with ragged second batch dim and constant length (#128349 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128349 Approved by: https://github.com/jbschlosser	2024-06-24 22:35:07 +00:00
Joel Schlosser	7b7f357042	Fix DEBUG=1 asserts with NJT ops (#129014 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014 Approved by: https://github.com/YuqingJ, https://github.com/soulitzer	2024-06-24 22:32:01 +00:00
Isuru Fernando	5f912f480c	Fix max_pool2d decomposition for empty list and integer limits (#129106 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129106 Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/malfet ghstack dependencies: #129096, #129097	2024-06-24 22:19:42 +00:00
Isuru Fernando	e096faaf30	Fix rot90 decomposition for no rotation (#129097 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129097 Approved by: https://github.com/peterbell10 ghstack dependencies: #129096	2024-06-24 22:19:42 +00:00
Isuru Fernando	fbca70718f	Fix scatter lowering when src is a Number (#129096 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129096 Approved by: https://github.com/peterbell10	2024-06-24 22:19:39 +00:00
Zain Rizvi	8edb7b96b1	Enable dynamic rollout for pull workflow (#129243 ) Enables dynamic migration of jobs to the LF AWS account for the pull workflow. For now, it leaves out a few jobs that need a bit more testing, namely Windows and Android runners. The new runners are only given to people specified in this issue: https://github.com/pytorch/test-infra/issues/5132 Note: The non-pull jobs updated are the ones that are synced to jobs in pull.yml (via `sync-tag`) and thus have to be updated whenever their corresponding pull.yml jobs are edited. Based on https://github.com/pytorch/pytorch/pull/128597 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129243 Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/malfet	2024-06-24 22:15:53 +00:00
ajbrent	30bfdf1afc	Errors when 0-dim tensor of complex or bool type passed to aminmax. (#128404 ) Fixes #126742 Added errors for the case of 0-dim tensors of complex or bool types passed to aminmax. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128404 Approved by: https://github.com/janeyx99	2024-06-24 21:46:49 +00:00
PyTorch UpdateBot	18fdc0ae5b	[executorch hash update] update the pinned executorch hash (#129099 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned executorch hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129099 Approved by: https://github.com/pytorchbot	2024-06-24 21:01:40 +00:00
Xuehai Pan	93a33bf3ac	[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 ) Changes: 1. Make some arguments positional-only as we only support Python 3.8+ 2. Clean up `torch.typename(obj)` implementation. 3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001 Approved by: https://github.com/malfet	2024-06-24 18:04:38 +00:00
PyTorch MergeBot	1a54bb0f96	Revert "[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417 )" This reverts commit 4f9399bd0d2bc0cbd14348b80e32b263de5c6bc0. Reverted https://github.com/pytorch/pytorch/pull/126417 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/126417#issuecomment-2186999121))	2024-06-24 16:50:15 +00:00
PyTorch MergeBot	063facf352	Revert "[halide-backend] Generate standalone runtime (#129025 )" This reverts commit 10c64c3b49e2008a50f9229e600c68c8a3d49292. Reverted https://github.com/pytorch/pytorch/pull/129025 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129025#issuecomment-2186995467))	2024-06-24 16:47:25 +00:00
Huy Do	c888ee3632	Remove test_mps_allocator_module XFAIL (#129340 ) Not sure why this test starts to fail (maybe runner update) `8a2fed7e6a/1` or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340 Approved by: https://github.com/kit1980	2024-06-24 16:26:38 +00:00
PyTorch MergeBot	cb4919344a	Revert "[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 )" This reverts commit e53d9590287cbf97521f96d055910394f6e9a849. Reverted https://github.com/pytorch/pytorch/pull/129001 on behalf of https://github.com/XuehaiPan due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/129001#issuecomment-2186944549))	2024-06-24 16:18:43 +00:00
PyTorch MergeBot	7b910285db	Revert "[inductor] Refactor fusion of inplace operations (#128979 )" This reverts commit 72e3aca227ae1e3dc1b91aee415cf27b0cb22f2b. Reverted https://github.com/pytorch/pytorch/pull/128979 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128979#issuecomment-2186846940))	2024-06-24 15:29:40 +00:00
Colin Peppler	df51d0b623	[aotinductor][UserDefinedTritonKernel] use appropriate expr printer when printing args (#129301 ) Encountered the following C++ compile error. ``` Declared in this scope; did you mean ‘std::max’? 619 | auto var_5 = max(1, u0); ``` This PR uses the C++ printer when doing C++ codegen. Before this PR, the Python printer was used even during C++ codegen. Differential Revision: [D58913123](https://our.internmc.facebook.com/intern/diff/D58913123) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129301 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-06-24 15:23:05 +00:00
Xuehai Pan	e53d959028	[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 ) Changes: 1. Make some arguments positional-only as we only support Python 3.8+ 2. Clean up `torch.typename(obj)` implementation. 3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001 Approved by: https://github.com/malfet	2024-06-24 14:35:41 +00:00
soulitzer	c89a9f5d17	Allow SAC policy_fn to return bool for backward compatibility (#129262 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129262 Approved by: https://github.com/Chillee, https://github.com/fmassa ghstack dependencies: #125795, #128545	2024-06-24 13:54:30 +00:00
Andrew Gu	9094248090	[FSDP2] Fixed `unshard` without lazy init (#129241 ) Previously, the `FSDPCommContext` only defines the stream attributes when `FSDPCommContext.init` is called from lazy initialization. This means that if the user calls `module.unshard()` before lazy init (e.g. first forward pass), then it would error in `wait_for_unshard()`. This PR fixes this by making sure that the stream attributes are defined, only with the default stream, at construction time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129241 Approved by: https://github.com/Skylion007, https://github.com/weifengpy	2024-06-24 13:31:54 +00:00
Will Feng	d21f311af8	[Easy][Traceable FSDP2] Skip rocm for the E2E tests (#129339 ) The CUDA implementation of `resize_storage_bytes_` doesn't run on rocm yet, so need to skip it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129339 Approved by: https://github.com/msaroufim	2024-06-24 06:38:33 +00:00
Xuehai Pan	662e9e1076	[BE] enable UFMT for `torch/nn/functional.py` (#128592 ) Part of #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592 Approved by: https://github.com/mikaylagawarecki	2024-06-24 06:24:12 +00:00
leslie-fang-intel	8a2fed7e6a	[Inductor][CPP] Fallback QLinear Binaryfusion from postop sum to binary add when others is view (#128808 ) Summary: In the int8 GEMM template, we view the input from 3D to 2D and view the output back to 3D for QLinear, which makes the output of this QLinear a `view`. If this output view is then an input to a QLinear-Binary fusion, it breaks the assumption of QLinear-Binary with the in-place post op `sum`. For this case we change the post op from in-place `sum` to out-of-place `add`, similar to FP32/BF16 Linear Inplace as in `1208347d09/torch/_inductor/fx_passes/mkldnn_fusion.py (L541-L543)`. Test plan: ``` clear && numactl -C 56-111 -m 1 python -u -m pytest -s -v inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_cpu_input_dim_exceeds_2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128808 Approved by: https://github.com/jgong5 ghstack dependencies: #128804	2024-06-24 01:12:18 +00:00
leslie-fang-intel	287c68c5ec	[Inductor][Quant] Use output dtype torch.uint8 explicitly (#128804 ) Summary Previously, we use `None` as output data type in the lowering of QLinear/QConv for uint8 implicitly. It's not clear and we should use `torch.uint8` explicitly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128804 Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5	2024-06-24 01:08:49 +00:00
PaliC	7b9e6430ed	[Split Build] Add periodic and trunk CI for cuda builds (#129269 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129269 Approved by: https://github.com/atalman	2024-06-23 17:04:37 +00:00
Xuehai Pan	f85d1e845a	[BE] enable UFMT for `torch/nn/*.py` (#128593 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593 Approved by: https://github.com/mikaylagawarecki	2024-06-23 16:05:13 +00:00
Will Feng	dadc0ed4c8	[Traceable FSDP2] Add `aot_eager` backend E2E tests for transformer model (#129157 ) This PR adds Traceable FSDP2 `aot_eager` backend E2E tests for simple MLP as well as transformer model. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129157 Approved by: https://github.com/awgu ghstack dependencies: #129203	2024-06-23 06:11:11 +00:00
Brian Hirsh	b91a9dc328	[Brian's PR #128754 ] Use torch.ops.fsdp.set_ for FSDP2 storage resize; dont functionalize resize_, set_, split_with_sizes_copy.out (#129203 ) This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203 Approved by: https://github.com/bdhirsh	2024-06-23 06:07:19 +00:00
Xuehai Pan	62ccf6d7cd	[BE] enable UFMT for `torch/nn/modules` (#128594 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594 Approved by: https://github.com/mikaylagawarecki	2024-06-23 05:37:57 +00:00
sanketpurandare	440d8fbd4a	FSDP2 Memory Tracker (#125323 ) * __->__ #125323 ### Why do we need the FSDP Memory Tracker? Tuning Decisions 1. What is the expected peak memory with current configuration? 2. If I change my FSDP wrapping, how much effect will it have on peak memory? 3. What is the best batch size to use? 4. What is the maximum sequence length that one can run with current configuration? 5. How does increasing/decreasing the “DP” world size affect peak memory? 6. How much memory do I save if I move the optimizer to the CPU? 7. Which activation checkpointing policy should I use? 8. If I have various SAC policies, how do they compare against each other? 9. What happens if I apply different SAC policies to different FSDP units? 10. If I make my gradient reduction in fp32, what effect will it have on memory? 11. If I want to use a custom mixed precision policy, how will it affect the peak memory? 12. When does it make sense to use HSDP? 13. Can I reshard to a smaller mesh without increasing peak memory substantially? 14. Can I safely disable post forward reshard without causing an OOM? Debugging 1. Which module contributes most to activation memory? 2. Which FSDP unit is holding a lot of unsharded memory? 3. AC is not releasing memory? The FSDP2 Memory Tracker addresses all of the above. 
It is based on: * #124688 * #128508 Example and Output: ``` if __name__== "__main__": from contextlib import nullcontext from functools import partial import torch from torch.distributed._composable import checkpoint from torch.distributed._composable.fsdp import ( CPUOffloadPolicy, fully_shard, MixedPrecisionPolicy, ) from torch.distributed._tensor import DeviceMesh from torch.distributed._tools.fsdp2_mem_tracker import FSDPMemTracker from torch._subclasses.fake_tensor import FakeTensorMode from torch.testing._internal.distributed._tensor.common_dtensor import ( ModelArgs, Transformer, TransformerBlock, ) from torch.testing._internal.distributed.fake_pg import FakeStore dev = torch.device("cuda:0") torch.cuda.set_device(dev) world_size = 4 store = FakeStore() torch.distributed.init_process_group( "fake", rank=0, world_size=world_size, store=store ) mesh = DeviceMesh("cuda", torch.arange(0, world_size)) torch.cuda.empty_cache() torch.manual_seed(42) use_fake_mode = False with FakeTensorMode() if use_fake_mode else nullcontext(): vocab_size = 8192 bsz, seq_len = 32, 1024 with torch.device(dev): model_args = ModelArgs( n_layers=2, n_heads=16, vocab_size=vocab_size, max_seq_len=seq_len, dropout_p=0.1, ) model = Transformer(model_args) foreach = True mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32) offload_policy = CPUOffloadPolicy(pin_memory=not use_fake_mode) reshard_after_forward = True fsdp_config = { } fully_shard_fn = partial( fully_shard, mesh=mesh, reshard_after_forward=reshard_after_forward, offload_policy=offload_policy, mp_policy=mp_policy, ) for module in model.modules(): if isinstance(module, TransformerBlock): checkpoint(module, preserve_rng_state=not use_fake_mode) fully_shard_fn(module) fully_shard_fn(model) optim = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=foreach) torch.manual_seed(42) inp = torch.randint(0, vocab_size, (bsz, seq_len), device=dev) torch.cuda.reset_accumulated_memory_stats() 
torch.cuda.reset_peak_memory_stats() fmt = FSDPMemTracker(model, optim) fmt.track_inputs((inp,)) with fmt: for iter_idx in range(2): loss = model(inp).sum() loss.backward() optim.step() optim.zero_grad() if iter_idx == 0: fmt.reset_mod_stats() mem_stats = torch.cuda.memory_stats() tracker_peak = fmt.get_tracker_snapshot("peak")[dev]["Total"] cuda_peak_active = mem_stats["active_bytes.all.peak"] fmt.display_modulewise_snapshots(depth=4, units="MiB", tabulate=True) fmt.display_snapshot("peak", units="MiB", tabulate=True) print( f"peak active: {cuda_peak_active / (1024**3)} GiB | " f"Tracker Max: {tracker_peak / (1024**3)} GiB" ) if not use_fake_mode: print(f"Accuracy: {tracker_peak/cuda_peak_active}") try: torch.distributed.destroy_process_group() except Exception as e: print(e) ``` <img width="1236" alt="Screenshot 2024-06-21 at 5 16 49 PM" src="https://github.com/pytorch/pytorch/assets/12934972/9be40b8b-e635-4112-b111-418413e6b959"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125323 Approved by: https://github.com/awgu	2024-06-23 05:23:00 +00:00
Animesh Jain	17d1723aee	[dynamo][unspecialized-nn-modules] Remove dead (also incorrect) code (#129316 ) This code is unused because we just inline the `.parameters` call. The code was also wrong because side-effects only track the first level of mutations. An object might not be marked as mutated if one of its child objects (like a dict) is mutated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129316 Approved by: https://github.com/jansel	2024-06-23 03:02:27 +00:00
Huy Do	cac6f99d41	Fix Windows CUDA periodic inductor/test_pattern_matcher test (#129198 ) The check was run on Windows and crashed there because Windows doesn't have triton, i.e. https://github.com/pytorch/pytorch/actions/runs/9606662121/job/26502347998#step:15:13196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129198 Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/malfet	2024-06-23 02:32:27 +00:00
Manuel Candales	749c03406c	[metal] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965 ) Adds _convert_weight_to_int4pack MPS kernel Replaces previous int4mm Metal shader, with shader authored by @kimishpatel which improves perf by ~40% Pull Request resolved: https://github.com/pytorch/pytorch/pull/128965 Approved by: https://github.com/malfet	2024-06-23 02:10:46 +00:00
rzou	856541c701	[custom_op] support default dtype values (#129189 ) This PR: - moves some of the dtype-string utilities into ScalarType.{h, cpp} - adds a new utility to get a mapping from dtype name to the C++ dtype - the parser now checks if the string is a dtype name; if it is then it pulls the C++ dtype from the mapping. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129189 Approved by: https://github.com/albanD ghstack dependencies: #129177, #129178, #129179	2024-06-23 00:13:23 +00:00
Isuru Fernando	3e02ecd740	Test only one sample with huber_loss (#129245 ) Fixes https://github.com/pytorch/pytorch/issues/129238 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129245 Approved by: https://github.com/huydhn	2024-06-22 21:15:39 +00:00
Xuehai Pan	94dc3253a0	[BE][Easy] enable UFMT for `torch/distributed/` (#128870 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870 Approved by: https://github.com/fegin, https://github.com/wconstab	2024-06-22 18:53:28 +00:00
Will Feng	e165a5971f	[Traceable FSDP2] Fix support for CUDA resize_storage_bytes_ (#129215 ) Currently if `x` is a CUDA tensor, calling `x.untyped_storage().resize_()` seems to always go into the `built without cuda` branch of `resize_storage_bytes_()` regardless of whether PyTorch is built with CUDA. I suspect this is because `inductor_ops.cpp` is only included in `libtorch_cpu.so` thus doesn't have the `USE_CUDA` information or ability to link to CUDA-related functions. This PR moves `resize_storage_bytes_()` related custom op functions out of `inductor_ops.cpp` into its standalone file `resize_storage_bytes.cpp` to be included in `libtorch_python.so` instead. This mimics the setup for `StorageMethods.cpp`. This way, `resize_storage_bytes_()` can have access to the CUDA-related functions, which passes the CUDA unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129215 Approved by: https://github.com/jansel	2024-06-22 18:38:47 +00:00
Anshul Sinha	0e6118a68e	[dtensor][debug] added logging module tracing table to file feature (#128721 ) Summary Currently, the only way for users to view the module tracing table is to print it to the console, which can be hard to read. I have added the functionality to comm_debug_mode for a user to log the module tracing table to an output.txt file, giving the user more options to view module tracing. I have implemented the use case in the module tracing examples. The expected output is shown below for MLPModule tracing: <img width="349" alt="Screenshot 2024-06-14 at 10 39 07 AM" src="https://github.com/pytorch/pytorch/assets/50644008/a05288a9-3cdb-483b-8e27-daab50da6251"> Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing Pull Request resolved: https://github.com/pytorch/pytorch/pull/128721 Approved by: https://github.com/tianyu-l, https://github.com/XilunWu ghstack dependencies: #128720	2024-06-22 18:14:13 +00:00
Anshul Sinha	1afd492d88	[dtensor][example] add functionality allowing users to choose which example they'd like to run (#128720 ) Summary The previous example file would run all examples at the same time, leading to confusing output as the 4 processes would interleave their output. To fix this, I have added the functionality to choose which example to run, making the output easier for users to read. Due to importing from torch.testing._internal.distributed._tensor.common_dtensor, the argparser from a file in the dependency tree would overwrite the argparser that I attempted to place in the example file. As a result, I created an argparser in a different file and imported it above the previously mentioned import. Test Plan 1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display 2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display 3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing 4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing 5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -h The first four outputs will be the same as the outputs seen in previous PRs. The expected output for help argument is seen below: <img width="931" alt="Screenshot 2024-06-14 at 10 25 06 AM" src="https://github.com/pytorch/pytorch/assets/50644008/547ca112-1e7a-4769-857a-558292c6fe7b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128720 Approved by: https://github.com/XilunWu	2024-06-22 18:14:13 +00:00
Jason Ansel	10c64c3b49	[halide-backend] Generate standalone runtime (#129025 ) This puts the halide runtime in a global shared object, rather than copying it to each kernel. Having many copies of the runtime causes many issues with cuda. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #126417	2024-06-22 17:39:52 +00:00
Jason Ansel	4f9399bd0d	[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-06-22 17:39:52 +00:00
William Wen	79aabaf626	[3.13, dynamo] codegen PUSH_NULL when callable is codegen'd (#129172 ) Significant bytecode generation API change! The new suggested convention to generating bytecode to call a function is now to wrap instructions that push a callable to the stack with `add_push_null`, then that callable is called with `create_call_function` with `push_null=False` (see diff for examples). In Python 3.13, NULL is now expected to be pushed after the callable. In <=3.12, the NULL was pushed before the callable. This change abstracts away the exact placement of the NULL, but the developer must be aware that a NULL may be needed when codegen'ing a callable. This abstraction also reduces the need for the `push_null=True` option in `create_call_function`, which removes the need to rotate a NULL to the right place on the stack with a sequence of `SWAP` instructions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129172 Approved by: https://github.com/jansel	2024-06-22 17:25:23 +00:00
Mengwei Liu	905dfa186c	Fix ConstraintViolationError exception string when exprs are int (#129271 ) As titled. If `expr1` `expr2` are int, don't need to do `.xreplace`. See example error: ``` UserError: L['args'][0][0].size()[1] = 35 is not equal to L['args'][0][2].size()[1] = 23 ``` Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/129271 Approved by: https://github.com/lezcano	2024-06-22 16:33:40 +00:00
Jiong Gong	920ebccca2	[inductor][cpp] refactor CppTemplateKernel to inherit CppKernel (#129101 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129101 Approved by: https://github.com/leslie-fang-intel	2024-06-22 12:50:37 +00:00
Peter Bell	72e3aca227	[inductor] Refactor fusion of inplace operations (#128979 ) `WeakDep`s force readers to have completed before a mutation overwrites the buffer, but we want to allow fusions to occur for inplace mutations where the same index is read and written. Currently this is achieved by: 1. Identifying the buffers used by the mutating op in its `dep_closure` 2. Not creating `WeakDep`s for buffers in the `dep_closure` 3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical` So we are first over-aggressive in removing `WeakDep`s, then add an ad-hoc fixup. This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to `can_fuse_vertical` which selectively allows inplace operations to fuse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979 Approved by: https://github.com/lezcano ghstack dependencies: #129082, #129083	2024-06-22 12:38:22 +00:00
Peter Bell	88a35b5b64	BE: Use future annotations in _inductor/comms.py (#129083 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129083 Approved by: https://github.com/lezcano ghstack dependencies: #129082	2024-06-22 12:38:22 +00:00
Peter Bell	73ba226d98	[inductor] Linear time dead node elimination (#129082 ) The nodes are already topologically sorted by this point, so DCEing a chain of nodes will take one full iteration per node. Simply reversing the iteration order means all users will be removed before checking a node. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129082 Approved by: https://github.com/lezcano	2024-06-22 12:38:17 +00:00
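The reverse-order idea in #129082 can be illustrated with a small standalone sketch. This is not Inductor's code: `Node`, `has_side_effects`, and `eliminate_dead_nodes` are hypothetical names, and the sketch tracks a set of live names rather than Inductor's user lists. It only shows why one backward pass over a topologically sorted list is enough to drop a whole dead chain.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node type, for illustration only.
    name: str
    inputs: list = field(default_factory=list)  # names of nodes this node reads
    has_side_effects: bool = False


def eliminate_dead_nodes(nodes, outputs):
    """Single reverse pass over a topologically sorted node list.

    Every user of a node comes after it in the list, so walking backwards
    decides the fate of all users before the node itself is visited.
    """
    live = set(outputs)
    kept = []
    for node in reversed(nodes):
        if node.name in live or node.has_side_effects:
            kept.append(node)
            live.update(node.inputs)  # inputs of a live node are live too
    kept.reverse()  # restore topological order
    return kept


nodes = [
    Node("a"),
    Node("b", ["a"]),
    Node("c", ["b"]),  # b -> c -> d is a dead chain: nothing live reads d
    Node("d", ["c"]),
    Node("out", ["a"]),
]
kept_names = [n.name for n in eliminate_dead_nodes(nodes, outputs=["out"])]
print(kept_names)
```

A forward pass over the same list would only discover that `d` is unused on the first sweep, `c` on the second, and so on, which is the one-iteration-per-node behaviour the commit message describes.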
Jiong Gong	cb126711cd	[merge_rule] add more cpp inductor files (#129192 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129192 Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman	2024-06-22 09:04:14 +00:00
PaliC	b57fa8d9c0	[BE] Remove JNI from libtorch builds (#124995 ) Removes jni files from the libtorch build as we do not plan to distribute them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124995 Approved by: https://github.com/malfet	2024-06-22 07:41:54 +00:00
Driss Guessous	9ffdbb5d12	Forward Fix PR for #128683 (#129037 ) Summary: This forward fixes this diff: D58699985 Since we have a few things in flight it would be much better to forward fix this test Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)' Differential Revision: D58767577 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129037 Approved by: https://github.com/vkuzo	2024-06-22 05:50:21 +00:00
PaliC	64743de6d8	[Split Build][BE] consolidate pip install commands (#129253 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129253 Approved by: https://github.com/atalman ghstack dependencies: #129011	2024-06-22 05:49:14 +00:00
PaliC	7661d1220a	[Split Build] Fix typo in pull ci (#129270 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129270 Approved by: https://github.com/atalman	2024-06-22 05:48:01 +00:00
PaliC	b0044e2e18	[Split Build] Support nightly release (#129011 ) This PR adds the split build to our binaries workflow. Validation for the workflow is done using the PR above in conjunction with https://github.com/pytorch/builder/pull/1876. Test Workflow: Check CI in the workflow above Pull Request resolved: https://github.com/pytorch/pytorch/pull/129011 Approved by: https://github.com/atalman	2024-06-22 05:45:14 +00:00
Huy Do	b72ef9df0d	Update torchbench model expected accuracy values after pinning numpy (#129213 ) After pinning numpy on torchbench, we need to move torchbench inductor benchmark jobs out of unstable state asap, so that more failures don't sneak in. I'm updating the expected values here to make trunk green. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129213 Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/desertfire	2024-06-22 04:59:50 +00:00
Aaron Enye Shi	f42d5b6dca	[Memory Snapshot] Make recordAnnotations callback initialize lazily (#129242 ) Summary: Make the recordAnnotations' Record function callback lazily initialize when record memory history starts. This will help reduce the impact on Time To First Batch metric. Test Plan: CI and ran locally. Differential Revision: D58875576 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242 Approved by: https://github.com/zdevito	2024-06-22 04:05:55 +00:00
chilli	858fb05dac	Modify ExternKernelAlloc with NoneLayout to not assign its result to anything (#129188 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129188 Approved by: https://github.com/yifuwang	2024-06-22 02:57:44 +00:00
Will Constable	2f8b301c32	Clean up distributed/CONTRIBUTING.md (#128450 ) Click [here](`cf6c88af48/torch/distributed/CONTRIBUTING.md`) to see the rendered version of the file in this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/128450 Approved by: https://github.com/wanchaol	2024-06-22 02:41:22 +00:00
James Wu	5b14943213	Run TestAOTAutograd test suite with cache (#128222 ) This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache. To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time a cache is missed due to weird reasons, like BypassAOTAutogradCache or FxGraphCacheMiss. We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future. In total, 87 of the tests pass naturally. None of the tests fail in non strict cache mode, so the cache never crashes, it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222 Approved by: https://github.com/bdhirsh	2024-06-22 02:13:28 +00:00
Animesh Jain	c5b9ee7408	[easy][dynamo] Remove try except from call_getattr (#129217 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129217 Approved by: https://github.com/lezcano ghstack dependencies: #129098, #129015	2024-06-21 23:56:00 +00:00
PyTorch MergeBot	1c75ddff35	Revert "[cuDNN] Graph-capturable cuDNN CTCLoss (#128271 )" This reverts commit 40e8675fcbb233c98ec532607d5cd421ec850253. Reverted https://github.com/pytorch/pytorch/pull/128271 on behalf of https://github.com/malfet due to This makes PyTorch buildable only with CuDNN v9 ([comment](https://github.com/pytorch/pytorch/pull/128271#issuecomment-2183576996))	2024-06-21 23:29:20 +00:00
mori360	ef55446538	[FSDP2] Add 'TORCH_LOGS=+fsdp' to log hooks(pre/post forward/backward) and FQN (_init_fqns) (#128663 ) Summary: Add '`TORCH_LOGS=+fsdp`' in the CLI to print fsdp logs Example: `TORCH_LOGS=+fsdp torchrun --standalone --nproc_per_node=2 run_fsdp.py` Description: Add logging to `FSDPParamGroup.pre_forward`, `FSDPParamGroup.post_forward`, `FSDPParamGroup.pre_backward`, and `FSDPParamGroup.post_backward`, to `FSDPState._root_pre_forward` if it is the root, and to `FSDPState._root_post_backward_final_callback`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128663 Approved by: https://github.com/weifengpy, https://github.com/awgu	2024-06-21 23:25:58 +00:00
Menglu Yu	9d1b65b569	[PT2][Observability] Change the log logic (#129201 ) Summary: We only log the multiplier when users change the default value. Test Plan: see signal Differential Revision: D58854330 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129201 Approved by: https://github.com/Skylion007, https://github.com/dshi7	2024-06-21 21:48:34 +00:00
Eddie Yan	40e8675fcb	[cuDNN] Graph-capturable cuDNN CTCLoss (#128271 ) cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant ~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~ Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup. CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271 Approved by: https://github.com/ezyang	2024-06-21 21:40:23 +00:00
Mashrur Morshed	9103b40a47	Fix small typo in docstring in ParameterList (#129193 ) In the docstring of `nn.ParameterList`, ParameterDict.append/extend was being used, which is most likely a typo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129193 Approved by: https://github.com/mikaylagawarecki	2024-06-21 20:53:52 +00:00
Andrew M. James	92ca17d85d	Update triton pin (#126098 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126098 Approved by: https://github.com/bertmaher	2024-06-21 18:46:15 +00:00
Aaron Gokaslan	d52684e9a8	[BE]: Update CUDNN_frontend submodule to v1.5.1 (#128612 ) Updates submodule to cudnn_frontend v1.5.1 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128612 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-06-21 18:17:35 +00:00
soulitzer	ebf25e128c	[autograd] Do not stash version counter for saved tensor (#128545 ) Fixes https://github.com/pytorch/pytorch/issues/128611 We detach using tensor_data, which already preserves the version counter, so there is no reason to save it prior to unpacking: ``` at::TensorBase VariableHooks::tensor_data(const at::TensorBase& self) const { TORCH_CHECK(self.defined(), "cannot call tensor_data() on undefined tensor"); auto self_impl_copy = self.unsafeGetTensorImpl()->shallow_copy_and_detach( /*version_counter=*/self.unsafeGetTensorImpl()->version_counter(), /*allow_tensor_metadata_change=*/ self.unsafeGetTensorImpl()->allow_tensor_metadata_change()); return at::Tensor(self_impl_copy); } ``` This changes the behavior when hooks are involved: - Previously, if you had a hook that replaced the saved tensor with an entirely new tensor, we would've smashed the saved version counter onto that during unpack, which is not quite correct because the tensor returned by user's pack hook is not necessarily aliased to the tensor originally being saved (unlikely), and even if it were, the version counter would already be shared, if the user did their operations not in inference mode (unlikely). - In this PR, we restore the version counter using the version counter from the unpack hook's output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128545 Approved by: https://github.com/albanD ghstack dependencies: #125795	2024-06-21 18:03:06 +00:00
Zhuoran Zhao	58cefaf53b	Fix hipify regular expression for AOTI wrapper (#128912 ) Summary: We need to redefine RE_PYTORCH_PREPROCESSOR here since in hipify_torch, it will apply positive lookbehind (?<=\W) and lookahead (?=\W) to the pattern to avoid matching keyword at the beginning and end of code line. However, this can happen in codegen, which will cause the pattern to not match. Test Plan: ``` buck2 run //caffe2/test/inductor:test_cpp_wrapper_hipify ``` ``` File changed: fbcode//caffe2/test/inductor/test_cpp_wrapper_hipify.py Buck UI: https://www.internalfb.com/buck2/395155fa-b2dc-4892-8c71-74e52c65fa2f Note: Using experimental modern dice Network: Up: 0B Down: 0B (reSessionID-8fcfc520-755c-48f9-bacc-507c62f59231) Jobs completed: 10947. Time elapsed: 0.5s. Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2) BUILD SUCCEEDED /data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:282: NCCL_DEBUG env var is set to None /data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:300: NCCL_DEBUG is forced to WARN from None test_hipify_aoti_driver_header (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok test_hipify_basic_declaration (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok test_hipify_cross_platform (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... 
ok ---------------------------------------------------------------------- Ran 3 tests in 0.262s OK ``` e2e test: ``` TORCH_LOGS="output_code,graph_code" buck2 run mode/{opt,amd-gpu,inplace} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //aiplatform/modelstore/model_generation/gpu_lowering_service:gpu_lowering_cli -- --model_input_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/input.merge" --model_output_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/mi300_inductor_output.merge" --lowering_backend AOT_INDUCTOR --is_ads_model False --aot_inductor_lowering_settings_json='{"use_scripting":true,"preset_lowerer":"standalone_hstu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":4,"output_precision":4, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}' 2>&1 | tee local_benchmark_log.txt ``` Differential Revision: D58705216 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128912 Approved by: https://github.com/desertfire	2024-06-21 18:00:40 +00:00
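The lookaround problem described in #128912 can be reproduced with plain `re`. The sketch below is an illustration only: the keyword and the sample lines are made up, and the `\b`-based pattern is one possible relaxation, not necessarily how the PR redefines `RE_PYTORCH_PREPROCESSOR`.

```python
import re

KEYWORD = "cudaStream_t"  # illustrative keyword, not taken from the PR

# Pattern wrapped the way the commit message describes: a non-word character
# is required on both sides, so a match at the very start or end of a string
# is impossible.
strict = re.compile(r"(?<=\W)" + re.escape(KEYWORD) + r"(?=\W)")

# Assumed alternative: word boundaries also succeed at string start and end.
relaxed = re.compile(r"\b" + re.escape(KEYWORD) + r"\b")

samples = [
    "cudaStream_t stream;",          # keyword at the start of the line
    "void f(cudaStream_t stream);",  # keyword in the middle
    "typedef int cudaStream_t",      # keyword at the end of the line
]

strict_hits = [bool(strict.search(s)) for s in samples]
relaxed_hits = [bool(relaxed.search(s)) for s in samples]
print(strict_hits, relaxed_hits)
```

Generated wrapper code can place a keyword at either edge of a line, which is the case the strict pattern misses.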
iibrahimli	2db33054b3	Disable fast path in `TransformerEncoderLayer` when there are forward (pre-)hooks attached to modules (#128415 ) Fixes #128413 Disable fast-path if there are forward hooks or pre-hooks. Example failure case given in the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128415 Approved by: https://github.com/mikaylagawarecki	2024-06-21 17:38:08 +00:00
Bin Bao	8edd4c71c6	[AOTI][refactor] Remove GridExprCppPrinter (#129142 ) Summary: Previously we thought using CppPrinter is not ABI-compatibility safe, but c10/util/generic_math.h has been changed to header-only implementation, so we can remove GridExprCppPrinter now. Differential Revision: [D58854214](https://our.internmc.facebook.com/intern/diff/D58854214) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129142 Approved by: https://github.com/chenyang78	2024-06-21 17:18:37 +00:00
Jason Ansel	bdc39eef3b	[inductor] Add --inductor-config benchmark flag (#129034 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129034 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #129024, #129033	2024-06-21 16:53:42 +00:00
Jason Ansel	bb4ab59651	[inductor] Run more test on correct device (#129033 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129033 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #129024	2024-06-21 16:53:42 +00:00
Jason Ansel	feb3f3ad77	[inductor] Refactors for Halide backend (#129024 ) Pulling these inductor-related refactors out of the larger Halide backend PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129024 Approved by: https://github.com/shunting314, https://github.com/eellison	2024-06-21 16:53:35 +00:00
chilli	237c4e6163	Improved flexattention bwd perf + added configurations for benchmarks (#129013 ) Before: <img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb"> After: <img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534"> After fixing strides: ![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013 Approved by: https://github.com/drisspg, https://github.com/yanboliang ghstack dependencies: #128938	2024-06-21 15:58:53 +00:00
William Wen	bdd11483ea	[3.13] get C dynamo to compile with python callback and custom frame eval (#129171 ) Start enabling parts of C Dynamo for 3.13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129171 Approved by: https://github.com/jansel, https://github.com/albanD	2024-06-21 15:58:02 +00:00
xinan.lin	b0ae0db815	[Inductor][Intel GPU] Support reduction split. (#129120 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129120 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire ghstack dependencies: #129124	2024-06-21 15:11:59 +00:00
xinan.lin	fb0c51b61c	[Inductor UT] Fix UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722 (#129124 ) [Inductor UT] Fix UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722. Currently, XPU CI does not gate PR merge. So, we have to do some post-CI fixing as some PRs may break XPU CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129124 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2024-06-21 15:08:17 +00:00
PyTorch MergeBot	715b09ae2d	Revert "Fix DEBUG=1 asserts with NJT ops (#129014 )" This reverts commit 2bb8ee602b264b652a9dbd6877da61018054d313. Reverted https://github.com/pytorch/pytorch/pull/129014 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129014#issuecomment-2182922009))	2024-06-21 15:03:02 +00:00
cyy	479ce5e2f4	Remove outdated CUDA code from CMake (#128801 ) It's possible to simplify some CUDA handling logic in CMake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128801 Approved by: https://github.com/r-barnes, https://github.com/malfet	2024-06-21 15:00:00 +00:00
cyy	2c7c286fa4	[1/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129055 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129055 Approved by: https://github.com/r-barnes	2024-06-21 14:56:31 +00:00
lezcano	53be7ff0e4	Make tl.atomic_add relaxed (#129133 ) We don't use any fancy synchronization within our atomic ops; we just want them to be atomic, so it is better to make them relaxed than to use the default acquire/release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129133 Approved by: https://github.com/peterbell10	2024-06-21 14:49:58 +00:00
Bin Bao	62e5d045c0	[AOTI] Auto-tune Triton kernels in a separate block (#129057 ) Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library. There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach adds extra compilation overhead.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effects from input mutation. This can raise the high-water mark for memory consumption.
* Missing Triton kernel autotuning: because kernel autotuning depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.if or torch.while.

This PR is the first step towards solving these problems by moving Triton kernel autotuning to compile time and using random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor will generate separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first pass finishes. After that we run a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen one-pass will come later, once this solution is proven stable and generates kernels as performant as before.
Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057 Approved by: https://github.com/jansel, https://github.com/eellison	2024-06-21 14:34:13 +00:00
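The autotuning idea described in the AOTI commit above (benchmark candidate kernel configurations on random inputs at compile time, rather than running the model on real inputs) can be sketched in plain Python. This is only an illustration under stated assumptions: `autotune`, `make_random_input`, and the toy `scaled_sum` "kernel" with its `block` parameter are hypothetical names, not Inductor or Triton APIs.

```python
import random
import time


def scaled_sum(data, block):
    """Toy stand-in for a kernel: sum `data` in chunks of `block` elements."""
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total


def make_random_input(n, seed=0):
    """Random data replaces real example inputs; only the size matters here."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def autotune(kernel, configs, n, repeats=3):
    """Return the config with the lowest best-of-`repeats` wall time."""
    data = make_random_input(n)
    timings = {}
    for block in configs:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    return min(timings, key=timings.get), timings


# Hypothetical candidate block sizes; the winner depends on the machine.
best_block, timings = autotune(scaled_sum, configs=[16, 256, 4096], n=50_000)
```

Because tuning only needs inputs of the right size, no real input has to be cloned and no model code has to run, which is the property the commit message relies on.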
Sahdev Zala	9795dba1e0	Optim package docstring fix (#129086 ) Fix docstrings in various files in the optim package. This is the last remaining fix for issue #112593. The fix can be verified by running `pydocstyle path-to-file --count`. Fixes #112593 Related #128248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129086 Approved by: https://github.com/janeyx99	2024-06-21 14:30:53 +00:00
Xuehai Pan	b697808056	[BE][Easy] eliminate relative import in `torchgen` (#128872 ) Fix generated by: ```bash ruff check --config 'lint.flake8-tidy-imports.ban-relative-imports="all"' --fix --select=TID $(fd '.pyi?$' torchgen) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128872 Approved by: https://github.com/zou3519	2024-06-21 14:11:46 +00:00
Joel Schlosser	e1c1052829	Backward support for unbind() with NJT (#128032 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032 Approved by: https://github.com/soulitzer	2024-06-21 14:05:23 +00:00
haozhe.zhu	27ae1f981d	[inductor] fix linear_add_bias for autocast case (#129138 ) Previously `linear_add_bias` only supported the case where the added tensor is `bfloat16`.
```
class M(torch.nn.Module):
    def __init__(self, dtype):
        super().__init__()
        self.linear1 = torch.nn.Linear(10, 64, bias=False)
        self.bias1 = torch.randn(64).bfloat16()  # if the bias is not bf16, we will crash

    def forward(self, x):
        return self.linear1(x) + self.bias1
```
For `Autocast(bf16)` cases, `self.bias1` will not be converted to bf16. We also did not check the dtypes of the weight and bias in the pattern matcher, which leads to an error if the weight is bf16 while the bias is fp32. We have 2 options to resolve this:
- Check the bias/weight dtypes and only fold the bias when they have the same dtype.
- Fold them even when they do not have the same dtype, by inserting to_dtypes for the `bias node` to enforce the same dtype as the weight.

This PR chose option 1, since we can't implicitly cast the bias to bf16 here, which would lose precision. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129138 Approved by: https://github.com/jgong5	2024-06-21 14:04:30 +00:00
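The first option in the commit above (fold the added tensor only when its dtype matches the weight's) amounts to a guard in the pattern matcher. A minimal sketch of such a guard, assuming a simplified metadata record: `TensorMeta` and `can_fold_bias` are hypothetical names for illustration and are not part of Inductor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorMeta:
    """Minimal stand-in for the metadata a pattern matcher would inspect."""
    name: str
    dtype: str  # e.g. "bfloat16" or "float32"


def can_fold_bias(weight: TensorMeta, bias: TensorMeta) -> bool:
    """Fold the added tensor into the linear op only if the dtypes agree."""
    return weight.dtype == bias.dtype


weight = TensorMeta("linear1.weight", "bfloat16")
bf16_bias = TensorMeta("bias1", "bfloat16")
fp32_bias = TensorMeta("bias1", "float32")

fold_bf16 = can_fold_bias(weight, bf16_bias)  # dtypes agree, so folding is safe
fold_fp32 = can_fold_bias(weight, fp32_bias)  # mismatch: leave the add unfused
```

Declining to fold on a mismatch keeps the fp32 addition as written, instead of silently casting the bias down to bf16.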
rzou	5d8e23b49c	[custom_op] Support string default values in schema (#129179 ) Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129179 Approved by: https://github.com/albanD ghstack dependencies: #129177, #129178	2024-06-21 13:31:40 +00:00
rzou	08b616281f	[custom ops] Switch out references from old landing page to new landing page (#129178 ) Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129178 Approved by: https://github.com/albanD ghstack dependencies: #129177	2024-06-21 13:31:40 +00:00
rzou	311fadb1fb	[docs] Redirect custom ops landing page to the correct place (#129177 ) I'm moving it to pytorch/tutorials Pull Request resolved: https://github.com/pytorch/pytorch/pull/129177 Approved by: https://github.com/albanD	2024-06-21 13:31:32 +00:00
Yifu Wang	217aac96d7	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers).

### Python API Example

```python
import torch
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of devices identified by a
# ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverage the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypass SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p

`IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose the pointers required for user-defined, custom CUDA comm kernels.

* __->__ #128582

Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-21 08:49:11 +00:00
Simon Fan	f0443ad174	[compiled autograd] flatten runtime inputs with fast path (#129116 ) covered by test_compiled_autograd.py and test_standalone_compile.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/129116 Approved by: https://github.com/jansel ghstack dependencies: #127960, #128905, #128982, #128987, #129181	2024-06-21 08:16:33 +00:00
Simon Fan	d97dfe9313	[compiled autograd] move inputs to cuda with non_blocking=True (#129181 ) non_blocking=True requires first pinning, which shouldn't be a problem given that they are cpu scalars Pull Request resolved: https://github.com/pytorch/pytorch/pull/129181 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #127960, #128905, #128982, #128987	2024-06-21 08:16:33 +00:00
Simon Fan	8f320fd6c6	[compiled autograd] treat input params as static (#128987 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128987 Approved by: https://github.com/eellison, https://github.com/BoyuanFeng ghstack dependencies: #127960, #128905, #128982	2024-06-21 08:16:33 +00:00
Simon Fan	fafa1867d1	[compiled autograd] use in_compiled_autograd_region instead of compiled_autograd_enabled_count (#128982 ) The current implementation of compiled_autograd_enabled_count affects the entire region under the context manager, so if the context manager wraps torch.compile calls unrelated to the backward, they are affected too:
- no lazy compile for compiled fw
- no aot autograd cache for inference graphs

We instead maintain a flag while we execute the compiled backward callable, to isolate the special handling to the compiled backward graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128982 Approved by: https://github.com/jansel ghstack dependencies: #127960, #128905	2024-06-21 08:16:33 +00:00
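The difference between a context-wide counter and a flag that is set only while the compiled backward runs can be shown with a small standalone sketch. All names below (`enable_compiled_autograd`, `run_compiled_backward`, `observe`) are hypothetical and only mirror the idea in the commit above; they are not the actual PyTorch internals.

```python
from contextlib import contextmanager

_enabled_count = 0           # counter style: nonzero anywhere inside the context
_in_backward_region = False  # flag style: True only while the backward executes


@contextmanager
def enable_compiled_autograd():
    """Counter-based scope: everything inside the `with` sees it as enabled."""
    global _enabled_count
    _enabled_count += 1
    try:
        yield
    finally:
        _enabled_count -= 1


def run_compiled_backward(fn):
    """Flag-based scope: special handling applies only during this call."""
    global _in_backward_region
    prior = _in_backward_region
    _in_backward_region = True
    try:
        return fn()
    finally:
        _in_backward_region = prior


def observe():
    """Report what each mechanism says at the current point of execution."""
    return (_enabled_count > 0, _in_backward_region)


with enable_compiled_autograd():
    unrelated = observe()                      # e.g. an unrelated torch.compile call
    backward = run_compiled_backward(observe)  # the compiled backward itself
outside = observe()
```

Here `unrelated` is seen as "enabled" by the counter but not by the flag, which is the isolation the commit is after.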
Simon Fan	68b33453f4	[aot autograd] collect static parameter metadata when graphs fallback to inference (#128905 ) Same as https://github.com/pytorch/pytorch/pull/126820, but for graphs that have requires_grad inputs but no requires_grad outputs, i.e. inference graphs. The implementation of the inference graph fallback was throwing away the static parameter information during metadata recomputation. Also adds a cudagraphs counter to make this easier to test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128905 Approved by: https://github.com/mlazos ghstack dependencies: #127960	2024-06-21 08:16:33 +00:00
Simon Fan	123812790b	[compiled autograd] update benchmarks to use cli flags for fullgraph/dynamic (#127960 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127960 Approved by: https://github.com/jansel	2024-06-21 08:16:33 +00:00
Anshul Sinha	aee512cc9d	[dtensor][op] Fixed stack op strategy (#129018 ) Summary: The previous stack op strategy was causing the input to be resharded, resulting in a list index out of range error. I delayed the resharding until after the input_specs were created so that the new dimension could be inserted, preventing the error above. I have also run all the other test cases to ensure the changes did not introduce any new bugs. Test Plan: pytest test/distributed/_tensor/test_tensor_ops.py -s -k test_stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/129018 Approved by: https://github.com/XilunWu	2024-06-21 08:10:28 +00:00
Animesh Jain	6b5fbc544e	[dynamo] Use polyfill to trace through the attributes of torch.jit.* and lru_cache_wrapper (#128336 ) Earlier we were taking the vt for `obj` and then monkeypatching that `vt.source` to be `obj._torchdynamo_inline`. If one accesses `obj.attr_a`, this would cause problems because Dynamo would then search for it in `obj._torchdynamo_inline.attr_a`. This PR makes it more functional, so that we have different vts for `obj` and `obj._torchdynamo_inline`. Fixes https://github.com/pytorch/pytorch/issues/93698 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128336 Approved by: https://github.com/jansel, https://github.com/yanboliang ghstack dependencies: #129117	2024-06-21 07:44:44 +00:00
Jiong Gong	914d3ca2ba	[inductor][cpp] BF16 AMX micro-gemm support (#127195 ) This PR adds the intrinsics-based micro-gemm for BF16 using Advanced Matrix eXtension (AMX) instructions available in 4th and 5th generation Intel Xeon processors. A compilation check is added to `codecache.py` to check the validity of the compiler support. Also, since AMX requires an initialization in the Linux kernel to use the extra register state, an initialization function is added to do that and is triggered via `codecache.py`. Performance speedups of >=10% on BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C:

Static shapes

Single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |

Dynamic shapes

Single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |

No perf regressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127195 Approved by: https://github.com/jansel	2024-06-21 07:21:47 +00:00
Wu, Chunyuan	632910e2a8	Add test to xfail_list only for abi_compatible (#128506 ) https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode. It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode. We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode. - `test_qlinear_add` is already in the `xfail_list`. - `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-21 07:19:28 +00:00
Sanket Jayant Purandare	62e425ab03	Memory Tracker for tracking Module wise memory (#124688 ) We present a utility, MemTracker, that tracks the module-wise memory for the code executed under its context. The core features that this tool aims to provide are:
1. Capturing 'snapshots' of memory for each module during its execution. Specifically, at 8 points: during pre-forward, post-forward, pre-backward, 2nd pre-forward (if AC is applied), 2nd post-forward (if AC is applied) and post-backward, plus the peak memory snapshots during forward and backward.
2. Each such snapshot provides the per-device (cpu, cuda etc.) memory breakdown in terms of the global parameters, gradients, activations, optimizer states and temporary memory.
3. A summary for each module (that can be analyzed or processed later), in terms of the memory occupied by its own parameters, buffers, inputs and outputs. The remaining components can be derived from these per-module attributes and the corresponding captured snapshots.
4. Recording the global peak memory consumption per device and the respective breakdowns.
5. Ability to do all of this under FakeTensorMode so that all these statistics can be obtained without executing code on real data.
6. Ability to register and track modules, optimizers and any other tensors that are created outside the context of MemTracker.
7. Ability to capture a custom memory snapshot at any point during program execution.
8. Utility functions to display all of these statistics in a user-friendly and human-readable manner.

These features will enable users to anticipate OOMs, debug and pinpoint where the majority of memory comes from, and experiment with different activation checkpointing policies, batch sizes, mixed precision, model architecture features (e.g. number of layers, hidden dimensions, number of attention heads) and inter-device memory movement (e.g. CPU off-loading), among others. Basically anything and everything related to device memory.

* __->__ #128508

Example:

> import torch
> import torchvision.models as models
> from torch.distributed._tools.mem_tracker import MemTracker
> device, dtype = "cuda", torch.float32
> with torch.device(device):
>     model = models.resnet18().to(dtype=dtype)
> optim = torch.optim.Adam(model.parameters(), foreach=True)
> mem_tracker = MemTracker()
> mem_tracker.track_external(model, optim)
> with mem_tracker as mt:
>     for i in range(2):
>         input_batch = torch.randn(256, 3, 224, 224, device=device, dtype=dtype)
>         model(input_batch).sum().backward()
>         optim.step()
>         optim.zero_grad()
>         if i == 0:
>             # to account for lazy init of optimizer state
>             mt.reset_mod_stats()
> mt.display_snapshot("peak", units="MiB", tabulate=True)
> mt.display_modulewise_snapshots(depth=2, units="MiB", tabulate=True)
> # Check for accuracy of peak memory
> tracker_max = mt.get_tracker_snapshot('peak')[device]['Total']
> cuda_max = torch.cuda.max_memory_allocated()
> accuracy = tracker_max / cuda_max
> print(f"Tracker Max: {tracker_max}, CUDA Max: {cuda_max}, Accuracy: {accuracy}")

Output <img width="1197" alt="Screenshot 2024-06-15 at 12 10 12 AM" src="https://github.com/pytorch/pytorch/assets/12934972/83e953db-43dc-4094-90eb-9f1d2ca8e758"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124688 Approved by: https://github.com/awgu	2024-06-21 07:15:32 +00:00
PaliC	2b1b055a96	[Split Build] Fix libtorch_python RPATH (#129088 ) In the split build we end up with an incorrect RPATH for `libtorch_python.so`. This PR fixes said RPATH. What the RPATH should look like:
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/main_so_files/libtorch_python.so | grep "RPATH" (pytorch-3.10)
  RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```
Before
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/split_so_files/libtorch_python.so | grep "RPATH" (pytorch-3.10)
  RPATH /home/sahanp/pytorch/torch/lib:/home/sahanp/pytorch/build/lib:
```
After
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p build/lib/libtorch_python.so | grep "RPATH" (pytorch-3.10)
  RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/pytorch/torch/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```
Testing that this works is in the above PR. Similarly, after running ciflow/binaries the output of objdump -p should not change https://www.diffchecker.com/14PRmCNz/ (checked manywheel py 3.10 cuda 12.1) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129088 Approved by: https://github.com/malfet	2024-06-21 06:49:19 +00:00
Animesh Jain	c008488b9c	[dynamo][guards] Dont run TYPE_MATCH for DICT_LENGTH C++ guard (#129163 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129163 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-06-21 06:27:19 +00:00
cyy	5c676bb8b3	Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-06-21 06:16:44 +00:00
Colin L. Rice	3a2fdbb142	[dynamo] - Add JK killswitch for dynamo compilation. (#128538 ) This allows easy disablement of dynamo in emergency situations where env variables are hard to set. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128538 Approved by: https://github.com/jansel	2024-06-21 06:14:06 +00:00
PyTorch MergeBot	f73b451e78	Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013 )" This reverts commit ff89ebc50a738c734496393dc25313cf197fd0b4. Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the test_torchinductor_opinfo test starts to fail after this commit `ff89ebc50a`, I am reverting to see if it helps trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2182042422))	2024-06-21 05:46:46 +00:00
Deng Weishi	b542825066	Enable deterministic support for oneDNN (#127277 ) This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848. In response to the request for Torchbenchmark models, this PR enables the deterministic attribute for the oneDNN operators on XPU backends, such as convolution, deconvolution and matmul. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui	2024-06-21 05:21:24 +00:00
Animesh Jain	e8dbb45e98	[dynamo][user-defined-object] Check that object is valid (#129117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129117 Approved by: https://github.com/yf225	2024-06-21 04:18:54 +00:00
cyy	e99a24ce7c	Remove TensorImpl_test.cpp (#129054 ) It's not used because of removal of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129054 Approved by: https://github.com/albanD, https://github.com/malfet	2024-06-21 04:17:36 +00:00
Brian Hirsh	880e894c39	[Brian's PR #128981 ] fix dynamo isinstance inlining for nn.Parameter + subclasses (#129162 ) This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128981, with very small changes to work around numpy related errors. For discussions, please see Brian's original PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129162 Approved by: https://github.com/bdhirsh	2024-06-21 03:48:10 +00:00
eellison	8cd9b10456	Fix exp decomp numerics (#129154 ) Our previous implementation would sometimes generate `inf` because we did not do the same numerics tricks as in eager. See comment / [link](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/TransformationHelper.h#L123-L144):
```
# curand_uniform has (0,1] bounds. log(1) is 0 and exponential excludes 0.
# we need log to be not 0, and not underflow when converted to half
# fast __logf approximation can underflow, so set log to -epsilon/2 for 1 or close to 1 args
```
Fix for https://github.com/pytorch/pytorch/issues/127749. Added a test for non-inf, but it would be great to have more robust decomp distribution tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129154 Approved by: https://github.com/bdhirsh, https://github.com/zou3519	2024-06-21 03:21:30 +00:00
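The trick quoted in the comment above (clamp the logarithm to `-epsilon/2` when the uniform sample is 1 or very close to it) can be illustrated with inverse-transform sampling in plain Python. This is a sketch using double precision and the standard library only; `exponential_from_uniform` is a hypothetical name, and the real decomposition operates on tensors and lower-precision dtypes.

```python
import math
import sys

EPS = sys.float_info.epsilon  # machine epsilon for Python floats (float64)


def exponential_from_uniform(u, lam):
    """Map a uniform sample u in (0, 1] to an Exponential(lam) sample.

    For u at or near 1, log(u) would be 0 (or could underflow), so the
    log is replaced by -EPS / 2 to keep the result strictly positive.
    """
    if not (0.0 < u <= 1.0):
        raise ValueError("u must lie in (0, 1]")
    log_u = -EPS / 2 if u >= 1.0 - EPS / 2 else math.log(u)
    return -log_u / lam


at_upper_bound = exponential_from_uniform(1.0, 2.0)  # clamped branch
typical = exponential_from_uniform(0.5, 1.0)         # ordinary branch: ln(2)
```

Without the clamp, `u == 1.0` would produce exactly 0, a value the exponential distribution excludes.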
chilli	ff89ebc50a	Improved flexattention bwd perf + added configurations for benchmarks (#129013 ) Before: <img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb"> After: <img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534"> After fixing strides: ![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013 Approved by: https://github.com/drisspg, https://github.com/yanboliang ghstack dependencies: #128938	2024-06-21 03:01:16 +00:00
Zain Huda	0acd09aecd	[torchrec][pt-d][model store] introduce LocalShardsWrapper for DTensor (#129150 ) Summary: Same as D57688538, recreated because of GH issues. This diff introduces LocalShardsWrapper, which is crucial to migrating from ShardedTensor to DTensor in the TRec state dict representation, along with any changes needed in PT-D and ModelStore to support this. It allows us to extend DTensor to support multiple shards on a rank as well as empty shards on a rank, as needed by TRec sharding logic. This diff also extends the support for LocalShardsWrapper to be used in conjunction with DTensor in checkpointing cases (ModelStore and DCP). See D54375878 for how it is used. LocalShardsWrapper supports the following torch ops:
+ torch.ops._c10d_functional.all_gather_into_tensor.default
+ aten._to_copy.default
+ aten.view.default
+ aten.equal.default
+ aten.detach.default

With extensibility to add more as required by use cases. See https://docs.google.com/document/d/16Ptl50mGFJW2cljdF2HQ6FwsiA0scwbAbjx_4dhabJw/edit?usp=drivesdk for more info regarding design and approach. NOTE: This version of LocalShardsWrapper does not support empty shards; that is added in the next diff enabling CW, D57063512. Test Plan: ` buck test mode/opt -c python.package_style=inplace aiplatform/modelstore/client/tests_gpu:dist_checkpoint_save_load_with_stateful_tests -- --print-passing-details` `buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_tensor_configs -- --print-passing-details` Sandcastle Reviewed By: XilunWu, wanchaol Differential Revision: D58570479 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129150 Approved by: https://github.com/XilunWu	2024-06-21 01:58:51 +00:00
wz337	31c9e3d2f4	[FSDP][Test] Test save model save with FSDP1 and load into FSDP2 applied model (#129028 ) A lot of models have already been saving the model state in FULL_STATE_DICT mode with FSDP1 in APF. This unit test is just to demonstrate FSDP1 -> FSDP2 transition. The use of deprecating APIs in this test is intentional. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129028 Approved by: https://github.com/awgu, https://github.com/fegin	2024-06-21 01:40:58 +00:00
Pian Pawakapan	8758fedbfc	[export] copy sym ops when respecting call module signature (#129153 ) Summary: Export, through AOTAutograd, [deduplicates](`11ff5345d2/torch/fx/experimental/proxy_tensor.py (L198)`) sym_size calls, which can cause issues during unflattening when the sym_size node is used in multiple submodules. If preserve_call_module_signature is set, these nodes can't be passed between submodules as placeholders, so the calls (and any downstream un-duplicated nodes) must be copied. Adding this to unflattener Test Plan: export unflatten test case Reviewed By: TroyGarden, angelayi Differential Revision: D58697231 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129153 Approved by: https://github.com/angelayi	2024-06-21 01:40:22 +00:00
Valentine233	5da428d9eb	[cpu][flash attention] fix attention mask issue (#128816 ) For attention mask in flash attention: - Fix the issue of accessing illegal memory when the last size of mask is 1. - Add UT of attention mask for various shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128816 Approved by: https://github.com/jgong5, https://github.com/drisspg	2024-06-21 01:12:48 +00:00
PyTorch MergeBot	d4022b4658	Revert "[BE] enable UFMT for `torch/nn/modules` (#128594 )" This reverts commit 95ac2d648279ebc73feccf6d8eccafa4b2759de8. Reverted https://github.com/pytorch/pytorch/pull/128594 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128594#issuecomment-2181788935))	2024-06-21 00:50:08 +00:00
PyTorch MergeBot	cc8193c707	Revert "[BE] enable UFMT for `torch/nn/functional.py` (#128592 )" This reverts commit f6e6e55fa7d883a89ba99584f8632c260519ba73. Reverted https://github.com/pytorch/pytorch/pull/128592 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128592#issuecomment-2181783936))	2024-06-21 00:44:16 +00:00
PyTorch MergeBot	9c929f6ce9	Revert "[BE][Easy] enable UFMT for `torch/distributed/` (#128870 )" This reverts commit a0e1e20c4157bb3e537fc784a51d7aef1e754157. Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))	2024-06-21 00:38:28 +00:00
Jiong Gong	9dd8f8cf8b	[cpuinfo][submodule] bump cpuinfo to the latest to support amx isa check (#127505 ) Fix https://github.com/pytorch/pytorch/issues/127368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127505 Approved by: https://github.com/ezyang	2024-06-21 00:17:44 +00:00
Myungjin Lee	c027c8935b	[distributed] NCCL result code update (#128777 ) The nccl result codes are outdated. This PR fixes #128756. Fixes #128756 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128777 Approved by: https://github.com/Skylion007	2024-06-20 23:51:39 +00:00
Huy Do	43060a1dbc	Add shard support to test_inductor (#129160 ) I added one more shard for inductor tests earlier in https://github.com/pytorch/pytorch/pull/129108, but didn't realize that the second shard didn't do any inductor tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/129160 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-06-20 23:41:00 +00:00
Joel Schlosser	31d5753247	Short-term fix to preserve NJT metadata cache in torch.compile (#122836 ) Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile. For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors. NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing. Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836 Approved by: https://github.com/soulitzer	2024-06-20 23:15:53 +00:00
PyTorch MergeBot	63a724d8e1	Revert "Introduce a prototype for SymmetricMemory (#128582 )" This reverts commit 8771e3429c3d7327f08c48d547ad73546d5603b3. Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))	2024-06-20 22:31:29 +00:00
Jing Xu	5fba5d83f0	add xpu for amp (#127276 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to AMP doc. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127276 Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/malfet	2024-06-20 21:49:35 +00:00
Jane Xu	adc14adb88	Fix flakiness with test_binary_op_list_error_cases (#129003 ) So how come this PR fixes any flakiness? Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky. Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR https://github.com/pytorch/pytorch/pull/119408. And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach. So we improve the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003 Approved by: https://github.com/soulitzer	2024-06-20 21:48:22 +00:00
Thanh Ha	61fa3de4cb	ci: Hardcode runner-determinator (#128985 ) Hardcode the runner-determinator script for testing ALI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128985 Approved by: https://github.com/ZainRizvi	2024-06-20 21:14:23 +00:00
PyTorch MergeBot	aace8ffc00	Revert "[BE] enable UFMT for `torch/nn/*.py` (#128593 )" This reverts commit a87d82abd746240e7b46b992fa9df7ae6d3e6d4a. Reverted https://github.com/pytorch/pytorch/pull/128593 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128593#issuecomment-2181562604))	2024-06-20 21:09:44 +00:00
Animesh Jain	f2f4dde2d3	[dynamo] Remove ID_MATCH for FSDPModuleVariable (#129015 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129015 Approved by: https://github.com/yf225 ghstack dependencies: #129098	2024-06-20 19:23:32 +00:00
PyTorch MergeBot	e84cf805d2	Revert "Modularize aten parameter parser and checker (#125308 )" This reverts commit 60bbdc0b40656cf70b2b098c7d715e19f031fb0d. Reverted https://github.com/pytorch/pytorch/pull/125308 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125308#issuecomment-2181327211))	2024-06-20 18:52:05 +00:00
PyTorch MergeBot	254487f288	Revert "Separate AOTI Eager utils as a single file (#125819 )" This reverts commit 18634048a1f939a961b7c96b0acfe78b474c821e. Reverted https://github.com/pytorch/pytorch/pull/125819 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125819#issuecomment-2181317332))	2024-06-20 18:49:08 +00:00
PyTorch MergeBot	73340f0909	Revert "[3/N] Non-Tensor: Support string parameter for aten operations (#125831 )" This reverts commit a52c8ace98afe76dc9e2c330b415972fd1529077. Reverted https://github.com/pytorch/pytorch/pull/125831 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125831#issuecomment-2181313892))	2024-06-20 18:45:41 +00:00
Brian Hirsh	8c2542623b	[Traceable FSDP2] [Dynamo] Add tracing support for out-variant custom ops that return None (#129078 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129078 Approved by: https://github.com/yanboliang	2024-06-20 17:46:13 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	734891ac22	Fix export log script (#128967 ) Summary: Title Test Plan: CI Differential Revision: D58699557 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128967 Approved by: https://github.com/jiashenC	2024-06-20 17:01:00 +00:00
Tijmen Blankevoort	ddb95dbb0d	Fixing equalize with three things and improving functionality (#124632 ) Summary: (1) Make code work when a first layer does not have a bias. (2) Make it possible to provide both modules and module names as input (3) Allow sequences of contiguous layers as input, that then get split into pairs (4) fix documentation to be more clear on inputs to be provided Test Plan: Run this new version of the algorithm on a network and see if it throws errors. There's also this notebook to run and test N5199827. If you tell me where I can find the tests for this code, I can add some simple unit tests as well. Differential Revision: D55895862 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124632 Approved by: https://github.com/jerryzh168	2024-06-20 16:55:56 +00:00
PyTorch MergeBot	832fc35211	Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013 )" This reverts commit 6d2b3c90f144d7b77d51da27e6696192b2b97ebd. Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing a flexattention test to fail on ROCm. Can you please fix that test before remerging this in? See `6d2b3c90f1` for details ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2181133070))	2024-06-20 16:51:41 +00:00
Zhengxu Chen	65286883d4	[export] reland "experimental joint graph API." (#129081 ) Summary: previous diff got reverted despite CI was green. Test Plan: CI Differential Revision: D58790048 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129081 Approved by: https://github.com/tugsbayasgalan	2024-06-20 16:50:53 +00:00
PaliC	fc5b0ff2d7	[BE][Hackaday] deprecate legacy cuda docker image (#128859 ) Fixes https://github.com/pytorch/builder/issues/1795 from the pytorch side specifically for the cuda image Pull Request resolved: https://github.com/pytorch/pytorch/pull/128859 Approved by: https://github.com/atalman	2024-06-20 16:30:49 +00:00
Nikita Shulga	b2a9b8d485	[CpuInductor] Enable NEON ISA detection on Linux ARM (#129075 ) Also, cleanup code a bit to use `x in [y, z]` instead of `x == y or x == z` And do not redefine `at_align`, but instead use `alignas(64)` as was suggested in https://github.com/pytorch/pytorch/pull/128686/files#r1639365978 Test plan: `python3 -c "import torch._inductor.codecache as cc; isa = cc.valid_vec_isa_list()[0];print(str(isa), bool(isa))"` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129075 Approved by: https://github.com/jansel	2024-06-20 16:22:57 +00:00
Huy Do	e0aa992d73	Fix inductor and deploy jobs timing out (#129108 ) Some trunk and periodic jobs are timing out at the moment, including: * `deploy`. This is because https://github.com/pytorch/pytorch/pull/127952 has removed `deploy` config, but there is one left over in periodic. * [periodic / linux-focal-cuda12.4-py3.10-gcc9 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu](https://github.com/pytorch/pytorch/actions/runs/9525590191/job/26260620457). * `inductor`, including `py3.10`, `py3.12`, and `cuda12.1`, `cuda12.4`. The increase comes from this change https://github.com/pytorch/pytorch/pull/128343, so I add another GPU shard. * [inductor / cuda12.1-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9522817887/job/26255069269) * [inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9524651902/job/26260009757) * [inductor-cu124 / cuda12.4-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440205869) * [inductor-cu124 / cuda12.4-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440634200) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129108 Approved by: https://github.com/malfet	2024-06-20 16:03:11 +00:00
Joel Schlosser	2bb8ee602b	Fix DEBUG=1 asserts with NJT ops (#129014 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014 Approved by: https://github.com/YuqingJ, https://github.com/soulitzer	2024-06-20 15:15:28 +00:00
rzou	7178b4e987	[Dynamo x torch_function] fix incorrect source (#128980 ) Fixes https://github.com/pytorch/pytorch/issues/128964 The problem was that we were installing the source for a type incorrectly. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/128980 Approved by: https://github.com/mlazos	2024-06-20 14:54:00 +00:00
Animesh Jain	ea47d542ca	[dynamo][guards] Remove BOOL_FALSE - not needed after C++ guards (#129098 ) PyDict_Size is very fast ... earlier with Python guards, CPython will go through layers of fluff to finally call the PyDict_Size. With C++ guards, it's not needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129098 Approved by: https://github.com/jansel	2024-06-20 14:40:27 +00:00
Oguz Ulgen	54b0006cb2	Evaluate symexprs on load path of cache not write (#128997 ) When caching is enabled, an internal model fails with ``` assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1)) AssertionError: expected size 17==17, stride 57344==54784 at dim=0 ``` looking at this model, the exact problem is when the cache is hit on the forward graph, the generated code for backward fails since the strides of the outputs of forward, passed to backward as inputs, are not what we expected. This PR changes the evaluation logic so that we defer evaluation of output stride exprs to load path as opposed to eagerly doing it on save path. I have not been able to come up with a unit test repro for this problem. Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997 Approved by: https://github.com/ezyang	2024-06-20 08:55:12 +00:00
Li-Huai (Allan) Lin	799acd31b4	[MPS] Add lu_factor (#99269 ) <!-- copilot:summary --> ### <samp>🤖 Generated by Copilot at d75cde1</samp> Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269 Approved by: https://github.com/kulinseth, https://github.com/lezcano	2024-06-20 07:35:29 +00:00
Nikita Shulga	0d25f096c1	[CppInductor] Fix erfinv codegen when non-vectorized isa (#129090 ) Fix erfinv codegen when ISA could not be detected Manual test plan (on MacOS): - Modify `valid_vec_isa_list` to return empty list - Run `python3 inductor/test_torchinductor_opinfo.py -v -k test_comprehensive_erfinv_cpu_bool` Before this change, abovementioned test will fail with ``` Output: /var/folders/rk/fxg20zvx6vvb5bk7cplq4xrc0000gn/T/tmpgic60b6c/ns/cnsp7snp7fyclkm5lsfiyiv3m6c3svevkbhcb3v7pijdfjwlyaij.cpp:11:25: error: use of undeclared identifier 'calc_erfinv' auto tmp2 = calc_erfinv(tmp1); ^ 1 error generated. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129090 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-20 06:09:48 +00:00
chilli	6d2b3c90f1	Improved flexattention bwd perf + added configurations for benchmarks (#129013 ) Before: <img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb"> After: <img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534"> After fixing strides: ![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013 Approved by: https://github.com/drisspg, https://github.com/yanboliang ghstack dependencies: #128938	2024-06-20 05:15:48 +00:00
Will Feng	ad2593cb86	[Animesh's PR #125340 ] [dynamo][fsdp] Track FSDPNNModuleVariable for mutations (#129045 ) This is a copy of Animesh's work in https://github.com/pytorch/pytorch/pull/125340, with very small changes to the unit test. It's needed sooner for the Traceable FSDP2 work, so I copy it here and will work through landing it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129045 Approved by: https://github.com/anijain2305	2024-06-20 04:02:36 +00:00
Li-Huai (Allan) Lin	19f3abcde4	[Docs][MPS] Add mps environment variable table (#129008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129008 Approved by: https://github.com/malfet ghstack dependencies: #129006	2024-06-20 03:30:35 +00:00
Huy Do	609ffaf717	Add more shards for slow CPU and ROCm jobs (#128873 ) As they start to time out in trunk `fc2913fb80/1`. Adding one more shard for slow CPU job is trivial. ROCm runners are harder to find, but I assume that this is ok because slow jobs only run periodically. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128873 Approved by: https://github.com/PaliC	2024-06-20 03:13:19 +00:00
Will Feng	d8db074988	[Traceable FSDP2] [Dynamo] Fix OptimizedModule._initialize to allow tracing into FSDP2 module hooks for module from user-defined module class (#129046 ) This is a workaround to allow inplace fully-sharded module to still go into this branch: `3a185778ed/torch/_dynamo/eval_frame.py (L163)` instead of the second branch: `3a185778ed/torch/_dynamo/eval_frame.py (L166)` If we don't do this, `torch.compile(fully_shard(module_from_user_defined_module_class))` will ignore all module hooks which will break FSDP tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129046 Approved by: https://github.com/anijain2305	2024-06-20 00:15:55 +00:00
Peter Bell	859fa183fe	BE: Use future annotations in inductor scheduler and ir (#128892 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128892 Approved by: https://github.com/lezcano	2024-06-20 00:10:43 +00:00
chilli	a2b1673dfb	[Horace's PR #126446 ] Prevent partitioner from ever saving views (#129039 ) Most work is done by Horace in https://github.com/pytorch/pytorch/issues/126446, this PR just additionally adds the config for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129039 Approved by: https://github.com/Chillee	2024-06-19 23:21:16 +00:00
leslie-fang-intel	9d06e3783d	[Inductor][CPP] Fix the symbolic size cast issue in GEMM Benchmark (#128824 ) Summary: The symbolic size generated from the size hint (a Python int) differs from the C type `long` of the kernel args, which may cause the benchmark to fail to run. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128824 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-19 23:11:53 +00:00
Jithun Nair	a6ac6447b5	Re-enable py3.12 nightly wheel builds and add triton dependency for ROCm (#128525 ) The llnl-hatchet developers have published the py3.12 binaries on [PyPI](https://pypi.org/project/llnl-hatchet/#files). In fact, looking [here](https://download.pytorch.org/whl/nightly/llnl-hatchet), it seems we already have the py3.12 wheels mirrored. This should allow us to re-enable py3.12 binaries for ROCm. This PR reverts commit 9d849d4312cd1e62d97b9e9d58979ec78d36c95f. It also adds the pytorch-triton-rocm dependency for torch wheels on ROCm since pytorch-triton-rocm py3.12 wheels are available now Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128525 Approved by: https://github.com/malfet	2024-06-19 21:56:54 +00:00
Sam Larsen	571a0db132	[inductor] Fix logging for run_and_get_cpp_code (#128794 ) Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794 Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel	2024-06-19 21:32:34 +00:00
cyy	277f2914a5	[9/N] Remove unused functions (#128704 ) MKL can not be enabled on aarch64, and as CI compiles code with `-Werror=unused-function` it will fail to compile with ``` /usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/third_party/foxi -I/var/lib/jenkins/workspace/build/third_party/foxi -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-2.1.0 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. 
-I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/FP16/include -I/var/lib/jenkins/workspace/third_party/tensorpipe -I/var/lib/jenkins/workspace/build/third_party/tensorpipe -I/var/lib/jenkins/workspace/third_party/tensorpipe/third_party/libnop/include -I/var/lib/jenkins/workspace/third_party/fmt/include -I/var/lib/jenkins/workspace/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/src/../include -I/var/lib/jenkins/workspace/third_party/flatbuffers/include -isystem /var/lib/jenkins/workspace/build/third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/workspace/third_party/protobuf/src -isystem /var/lib/jenkins/workspace/third_party/XNNPACK/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/eigen -isystem /var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/workspace/third_party/ideep/include -isystem /var/lib/jenkins/workspace/build/include -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct 
-Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Werror -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp /var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp:426:15: error: ‘at::Tensor at::native::mkl_linear(const at::Tensor&, const at::Tensor&, const at::Tensor&, const std::optional<at::Tensor>&, int64_t)’ defined but not used [-Werror=unused-function] 426 \| static Tensor mkl_linear( \| ^~~~~~~~~~ ``` Follows #128499 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128704 Approved by: https://github.com/malfet	2024-06-19 20:46:45 +00:00
Aleksei Nikiforov	fca408fa29	s390x vectorization: rework operators (#129066 ) Move operators from member functions to free functions. This is needed to fix torch inductor on s390x. This change fixes tests like DynamicShapesMiscTests::test_numpy_min_dynamic_shapes from test/dynamo/test_dynamic_shapes.py This change also fixes recently introduced build failure on s390x. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129066 Approved by: https://github.com/malfet	2024-06-19 20:12:41 +00:00
Huy Do	73f5d2b787	Run ET unit tests on PT CI (#128560 ) This is the first PR to add all existing ET unit tests into PT CI. The goal is to improve the coverage there to avoid breaking change from PT that could break ET. With this, any future unit tests on ET will automatically be run on PT CI. The duration of the job is now 40+ minutes, not too bad. This also fixed the failed ET build in https://github.com/pytorch/pytorch/pull/123043. Adding model coverage is a bit more involved and requires adding new shards, so I will follow up on that in separate PRs. [T192117506](https://www.internalfb.com/intern/tasks/?t=192117506), with the failed diffs D58295865 and D58394154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128560 Approved by: https://github.com/guangy10, https://github.com/digantdesai	2024-06-19 20:08:58 +00:00
PyTorch MergeBot	df94d57c0a	Revert "[export] experimental joint graph API. (#128847 )" This reverts commit 0707811286d1846209676435f4f86f2b4b3d1a17. Reverted https://github.com/pytorch/pytorch/pull/128847 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128847#issuecomment-2179326891))	2024-06-19 19:04:36 +00:00
Aaron Enye Shi	b5d541609d	[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#129072 ) Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations. Test Plan: CI Pulled By: aaronenyeshi Differential Revision: D55941362 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072 Approved by: https://github.com/zdevito	2024-06-19 18:05:41 +00:00
Xu Han	bafd68b4fc	[inductor] fix windows python module ext and func export declaration (#129059 ) I have run the first inductor case on Windows based on the exploration code: https://github.com/pytorch/pytorch/pull/128330 Since a fundamental PR still needs to pass `fb_code`: https://github.com/pytorch/pytorch/pull/128303 this PR lands part of the exploration code: 1. Fix Windows python module ext type: pyd. 2. Add function export declaration for Windows. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129059 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-19 17:51:32 +00:00
Zhengxu Chen	0707811286	[export] experimental joint graph API. (#128847 ) Summary: WARNING: This API is highly unstable and will be subject to change in the future. Add a prototype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph. Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental Differential Revision: D55657917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847 Approved by: https://github.com/tugsbayasgalan	2024-06-19 16:45:27 +00:00
Li-Huai (Allan) Lin	0fc603ece4	[optim] Fused implementation stability table (#129006 ) I'd like to discuss the criteria that we regard an implementation as stable. If there is no existing standard, my initial proposal would be a 6 month period after the commit to regard it as stable. As a result, now Adam and AdamW on CUDA would be considered as stable, while the rest are of beta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006 Approved by: https://github.com/malfet	2024-06-19 16:29:49 +00:00
Jean Schmidt	1b92bdd0ea	[ALI] [Reland] Use LF runners for Lint (#129071 ) Quick experiment with using LF runners for lint jobs. Picking a set of jobs where infra failures would be obvious to most people (lint) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129071 Approved by: https://github.com/malfet	2024-06-19 16:10:51 +00:00
PaliC	236fbcbdf4	[Split Build] Test split build in pull CI workflow (#126813 ) This PR builds the split build in the pull workflow and runs the appropriate tests against them. A single linux cpu and single gpu build were chosen arbitrarily to not add too many tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126813 Approved by: https://github.com/atalman ghstack dependencies: #127934	2024-06-19 15:57:21 +00:00
PaliC	7d33ff59ba	[Split Build]Use same package (#127934 ) This PR removes the second separate package we were using for the libtorch wheel. In terms of testing that this works we will use the PRs above this in the stack. As for sanity checking these are the wheels that are produced by running ``` python setup.py clean && BUILD_LIBTORCH_WHL=1 with-proxy python setup.py bdist_wheel && BUILD_PYTHON_ONLY=1 with-proxy python setup.py bdist_wheel --cmake ``` ``` sahanp@devgpu086 ~/pytorch ((5f15e171…))> ls -al dist/ (pytorch-3.10) total 677236 drwxr-xr-x 1 sahanp users 188 Jun 4 12:19 ./ drwxr-xr-x 1 sahanp users 1696 Jun 4 12:59 ../ -rw-r--r-- 1 sahanp users 81405742 Jun 4 12:19 torch-2.4.0a0+gitca0a73c-cp310-cp310-linux_x86_64.whl -rw-r--r-- 1 sahanp users 612076919 Jun 4 12:19 libtorch-2.4.0a0+gitca0a73c-py3-none-any.whl ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127934 Approved by: https://github.com/atalman	2024-06-19 15:57:21 +00:00
lyb	ffb50fb691	[ONNX] Add onnx::Gelu support for version 20 (#128773 ) Fixes https://github.com/pytorch/pytorch/issues/128772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128773 Approved by: https://github.com/justinchuby	2024-06-19 15:39:02 +00:00
Jean Schmidt	3397d5ef90	Revert "[ALI] Use lf runners for Lint" (#129070 ) Reverts pytorch/pytorch#128978 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129070 Approved by: https://github.com/atalman	2024-06-19 14:48:16 +00:00
Xu Zhao	118f9ceb7c	[inductor][ci] Fix torchbench dependency issue with numpy (#128968 ) For some reason, pip will always upgrade the numpy version even when an older version has been installed. We have to lock numpy version to the old version to make this constraint explicit. Torchbench commit: `23512dbebd` Second attempt to fix #128845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128968 Approved by: https://github.com/eellison	2024-06-19 12:10:50 +00:00
FFFrog	e49525275d	Make TraceUtils.h to be device-agnostic (#126969 ) Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files. In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969 Approved by: https://github.com/c-p-i-o	2024-06-19 09:06:49 +00:00
Zain Rizvi	7fac03aee9	[ALI] Use lf runners for Lint (#128978 )	2024-06-19 10:59:07 +02:00
Daulet Askarov	50567f7081	Pass device to is_pinned call inside TensorProperties.create_from_tensor (#128896 ) Summary: The default input device for is_pinned function is Cuda. This can unnecessarily create Cuda context for CPU tensors when just generating TensorProperties, bloating memory usage. Passing the device to the is_pinned call site inside def create_from_tensor solves this issue. This also fixes Model Store test https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0 which is currently broken on memory usage assertions. Test Plan: UT Differential Revision: D58695006 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128896 Approved by: https://github.com/fegin	2024-06-19 08:50:46 +00:00
Frank Lin	d3e8b8bf47	Remove cuda check in the CUDAGraph destructor (#127382 ) Fixes #125804 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127382 Approved by: https://github.com/eqy, https://github.com/eellison	2024-06-19 08:09:31 +00:00
Bin Bao	ba92f5277f	[inductor][refactor] Unify the use of generate_kernel_call (#128467 ) Summary: Refactor TritonTemplateKernel.call_kernel and ForeachKernel.call_kernel to use wrapper.generate_kernel_call to generate kernel calls instead of explicitly composing the kernel call string. This consolidates the entry point of generate_kernel_call and simplifies later changes in this PR stack. Differential Revision: [D58733631](https://our.internmc.facebook.com/intern/diff/D58733631) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128467 Approved by: https://github.com/shunting314	2024-06-19 07:47:25 +00:00
Colin Peppler	3a185778ed	[aotinductor] Add torch.polar fallback op for shim v2 (#128722 ) Compilation error: ``` $ TORCHINDUCTOR_C_SHIM_VERSION=2 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_LOGS_FORMAT="%(pathname)s:%(lineno)s: %(message)s" TORCH_LOGS="+output_code" python test/inductor/test_cpu_cpp_wrapper.py -k test_polar /tmp/tmp2sp128xj/dy/cdypvu3hvgg3mwxydwbiuddsnmuoi37it3mrpjktcnu6vt4hr3ki.cpp:59:33: error: ‘aoti_torch_cpu_polar’ was not declared in this scope; did you mean ‘aoti_torch_cpu_topk’? ``` Steps: 1. Add aten.polar 2. run `python torchgen/gen.py --update-aoti-c-shim`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128722 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-06-19 05:06:58 +00:00
PyTorch MergeBot	a584b2a389	Revert "Add test to xfail_list only for abi_compatible (#128506 )" This reverts commit df85f34a14dd30f784418624b05bd52b12ab8b0b. Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to The failure shows up in trunk `df85f34a14` ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2177744578))	2024-06-19 04:59:10 +00:00
drisspg	fcf2a1378b	Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 (#128989 ) # Summary First PR got reverted and needed a redo This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do this Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989 Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo	2024-06-19 04:49:39 +00:00
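The scale-shape condition described in the commit above can be sketched in plain Python. This is only an illustration of the stated rule, assuming shapes are given as tuples; the function name `qualifies_for_rowwise_scaling` is hypothetical and is not part of `scaled_mm` or any PyTorch API.

```python
def qualifies_for_rowwise_scaling(x_shape, y_shape, x_scale_shape, y_scale_shape):
    """Check the scale shapes stated in the commit message.

    x is [M, K], y is [K, N]; x's scale must be 1-D of length M and
    y's scale must be 1-D of length N. (Hypothetical helper, shapes as tuples.)
    """
    m, k = x_shape
    k_other, n = y_shape
    if k != k_other:
        raise ValueError(f"inner dimensions differ: {k} vs {k_other}")
    return tuple(x_scale_shape) == (m,) and tuple(y_scale_shape) == (n,)


# x: [4, 8], y: [8, 16] with one scale per row of x and per column of y.
rowwise_ok = qualifies_for_rowwise_scaling((4, 8), (8, 16), (4,), (16,))
# Scalar-shaped scales do not satisfy the rowwise condition.
scalar_scales = qualifies_for_rowwise_scaling((4, 8), (8, 16), (), ())
```

The sketch covers only the shape condition; dtype, layout (TN format), and architecture requirements mentioned in the commit are not modeled.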
Sam Larsen	2f88597aad	[inductor] For internal, allow multiple workers if the method is "subprocess" (#129002 ) Summary: This does not change the current default behavior in fbcode ("fork" if unspecified and no worker processes if unspecified). But it allows us to more easily test the subprocess-based parallel if we override the start method to subprocess. Test Plan: Set `TORCHINDUCTOR_WORKER_START=subprocess` and locally ran all torchbench models listed [here](https://www.internalfb.com/intern/wiki/PyTorch/Teams/PyTorch_Perf_Infra/TorchBench/#torchbench-internal-mode) Differential Revision: D58755021 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129002 Approved by: https://github.com/eellison	2024-06-19 04:28:27 +00:00
Jerry Mannil	1f0a68b572	[ROCm] Fix fp32 atomicAdd for non-MI100 GPUs (#128750 ) Current implementation is very specific to MI100. This is causing performance degradation for other GPUs. Fixes #128631 Benchmarking on MI300X: ``` Before: 1918.5126953125 ms After: 0.8285150527954102 ms ``` Co-authored-by: Jeff Daily <jeff.daily@amd.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128750 Approved by: https://github.com/xw285cornell	2024-06-19 03:56:20 +00:00
Yanbo Liang	acefc5c016	[torch.compile] Enable bwd compilation metrics (#128973 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128973 Approved by: https://github.com/dshi7	2024-06-19 03:45:41 +00:00
chilli	eb9f4da11e	Modified template indexing to broadcast indices to out instead of mask and some other flexattention micro-opts (#128938 ) For headdim=64 and headdim=128 Old: <img width="656" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/2c5d1613-96dc-4300-8dc0-dccaef59e73c"> New: <img width="644" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/730004a8-6d5f-46a5-82a0-2594feb5e192"> Note, this does regress headdim=256. We can unregress it by special casing `headdim=256`, but ehh.... we can do it later Pull Request resolved: https://github.com/pytorch/pytorch/pull/128938 Approved by: https://github.com/drisspg	2024-06-19 03:41:22 +00:00
Yifu Wang	8771e3429c	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them. ### SymmetricMemory `SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers). ### Python API Example ```python from torch._C.distributed_c10d import _SymmetricMemory # Set a store for rendezvousing symmetric allocations on a group of devices # identified by group_name. The concept of groups is logical; users can # utilize predefined groups (e.g., a group of device identified by a # ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator # backends might employ a more efficient communication channel for the actual # rendezvous process and only use the store for bootstrapping purposes. _SymmetricMemory.set_group_info(group_name, rank, world_size, store) # Identical to empty_strided, but allows symmetric memory access to be # established for the allocated tensor via _SymmetricMemory.rendezvous(). # This function itself is not a collective operation. t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name) # Users can write Python custom ops that leverages the symmetric memory access. # Below are examples of things users can do (assuming the group's world_size is 2). 
# Establishes symmetric memory access on tensors allocated via # _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process, # and the mapping between a local memory region and the associated SymmetricMemory # object is unique. Subsequent calls to rendezvous() with the same tensor will receive # the cached SymmetricMemory object. # # The function has a collective semantic and must be invoked simultaneously # from all rendezvous participants. symm_mem = _SymmetricMemory.rendezvous(t) # This represents the allocation on rank 0 and is accessible from all devices. buf = symm_mem.get_buffer(0, (64, 64), torch.float32) if symm_mem.rank == 0: symm_mem.wait_signal(src_rank=1) assert buf.eq(42).all() else: # The remote buffer can be used as a regular tensor buf.fill_(42) symm_mem.put_signal(dst_rank=0) symm_mem.barrier() if symm_mem.rank == 0: symm_mem.barrier() assert buf.eq(43).all() else: new_val = torch.empty_like(buf) new_val.fill_(43) # Contiguous copies to/from a remote buffer utilize copy engines # which bypasses SMs (i.e. no need to load the data into registers) buf.copy_(new_val) symm_mem.barrier() ``` ### Custom CUDA Comm Kernels Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels. ```cpp TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory( const at::Tensor& tensor); class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target { public: ... virtual std::vector<void*> get_buffer_ptrs() = 0; virtual std::vector<void*> get_signal_pad_ptrs() = 0; virtual void** get_buffer_ptrs_dev() = 0; virtual void** get_signal_pad_ptrs_dev() = 0; virtual size_t get_buffer_size() = 0; virtual size_t get_signal_pad_size() = 0; virtual int get_rank() = 0; virtual int get_world_size() = 0; ... }; ``` ### Limitations of IntraNodeComm and ProcessGroupCudaP2p `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. 
This approach: - Leads to awkward UX in which the required workspace needs to be specified upfront. - Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather). - Prevents torch.compile from eliminating all copies. In addition, they only offer out-of-the-box communication kernels and don't expose the pointers required for user-defined, custom CUDA comm kernels. * __->__ #128582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-19 03:38:58 +00:00
Alnis Murtovi	ed5b8432cd	Enable mixed_mm only if casting from lower-bitwidth type to a higher one (#128899 ) This PR changes the behavior of `cuda_and_enabled_mixed_mm` such that mixed_mm is only enabled if we are casting from a lower-bitwidth type to a higher one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128899 Approved by: https://github.com/eellison	2024-06-19 03:12:18 +00:00
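The rule in the commit above (enable mixed_mm only when casting from a lower-bitwidth type to a higher one) can be illustrated with a small standalone check. This is a sketch under assumptions: the dtype names and the `BIT_WIDTH` table are illustrative stand-ins, and `casts_to_wider_type` is a hypothetical helper, not the actual `cuda_and_enabled_mixed_mm` logic.

```python
# Illustrative bit widths keyed by dtype name (not read from torch).
BIT_WIDTH = {"int8": 8, "uint8": 8, "float16": 16, "bfloat16": 16, "float32": 32}


def casts_to_wider_type(src_dtype: str, dst_dtype: str) -> bool:
    """Return True only when the cast goes from fewer bits to strictly more bits."""
    return BIT_WIDTH[src_dtype] < BIT_WIDTH[dst_dtype]


upcast = casts_to_wider_type("int8", "float16")       # 8 bits to 16 bits
downcast = casts_to_wider_type("float32", "float16")  # 32 bits to 16 bits
same_width = casts_to_wider_type("float16", "bfloat16")
```

Under this reading, equal-width casts such as float16 to bfloat16 would not enable mixed_mm; whether the real check treats that case the same way is not stated in the commit message.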
Wu, Chunyuan	df85f34a14	Add test to xfail_list only for abi_compatible (#128506 ) https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode. It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode. We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode. - `test_qlinear_add` is already in the `xfail_list`. - `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-19 01:18:37 +00:00
Thanh Ha	4bc90185fb	fix: Print statements causing parse error (#128969 ) The print statements in the get_workflow_type script are problematic because the shell script calling this script expects the output to be JSON only. This PR resolves this by removing all print statements and converting them to a message field in the JSON return output, so that the output remains valid JSON while still giving us the debug data we are looking for. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128969 Approved by: https://github.com/tylertitsworth, https://github.com/ZainRizvi	2024-06-19 01:17:08 +00:00
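The pattern described in the commit above, carrying debug text in a JSON field instead of printing it, might look roughly like the following. This is a minimal sketch, not the actual get_workflow_type script: the function name, the `workflow_type` key, and the sample values are assumptions made for illustration; only the idea of a `message` field comes from the commit message.

```python
import json


def build_workflow_output(workflow_type: str, debug_messages: list) -> str:
    """Return a single JSON document so the calling shell script can parse stdout.

    Diagnostics that would otherwise be printed are collected into the
    "message" field. (Hypothetical helper and key names.)
    """
    payload = {
        "workflow_type": workflow_type,
        "message": "; ".join(debug_messages),
    }
    return json.dumps(payload)


# The only thing written to stdout would be this one JSON string.
output = build_workflow_output("example-type", ["label not found", "using default"])
parsed = json.loads(output)
```

Because nothing else is written to stdout, a caller can pipe the result straight into a JSON parser and still read the diagnostics from `parsed["message"]`.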
leslie-fang-intel	eda375a490	[Inductor] Remove min/max from inductor opinfo test (#128925 ) Summary Remove `max.binary, min.binary, maximum, minimum` from `inductor_one_sample` op list as we fix the bool vectorization issue in https://github.com/pytorch/pytorch/pull/126841. Test Plan ``` python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_maximum python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_minimum python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_min_binary python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_max_binary ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128925 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-06-19 01:14:27 +00:00
xinan.lin	2458f79f83	[Inductor UT][Intel GPU] Skip newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU (#128881 ) Skip newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU because it has not implemented reduction kernel split. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128881 Approved by: https://github.com/blaine-rister, https://github.com/EikanWang, https://github.com/malfet	2024-06-19 00:44:57 +00:00
PyTorch MergeBot	b0d2fe6299	Revert "Short-term fix to preserve NJT metadata cache in torch.compile (#122836 )" This reverts commit 2a41fc03903de63270d325bd1886a50faf32d7e4. Reverted https://github.com/pytorch/pytorch/pull/122836 on behalf of https://github.com/jbschlosser due to internal test failures with DEBUG=1 asserts ([comment](https://github.com/pytorch/pytorch/pull/122836#issuecomment-2177298245))	2024-06-19 00:28:53 +00:00
PyTorch MergeBot	5ffb032be6	Revert "Backward support for unbind() with NJT (#128032 )" This reverts commit 5dc4f652bc5c068ef15130c955e3f2ffe11f4b74. Reverted https://github.com/pytorch/pytorch/pull/128032 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128032#issuecomment-2177296325))	2024-06-19 00:26:40 +00:00
Jane Xu	35c78668b4	Improve the debugging message for when foreach mta_called (#128991 ) The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern: - a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called. - then, a later test fails deterministically, usually failing to compare two results. ``` ================== 1 failed, 241 deselected, 2 rerun in 1.76s ================== Got exit code 1 Stopping at first consistent failure The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16'] The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16'] ``` So my suspicion is that the first causes the second, but what causes the first? Idk! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to repro this flakiness locally. Also undo the useless changes in #128220 which are actually redundant as Joel and I realized that we set the seed during the setUp of every test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991 Approved by: https://github.com/clee2000	2024-06-19 00:25:09 +00:00
PyTorch MergeBot	99f042d336	Revert "Forward fix to skip ROCm tests for #122836 (#128891 )" This reverts commit 4061b3b8225f522ae0ed6db00111441e7d3cc3d5. Reverted https://github.com/pytorch/pytorch/pull/128891 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128891#issuecomment-2177291249))	2024-06-19 00:21:21 +00:00
Animesh Jain	670b94c9c8	[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484 Approved by: https://github.com/mlazos ghstack dependencies: #128428	2024-06-19 00:06:46 +00:00
Animesh Jain	c5e0b84484	[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428 Approved by: https://github.com/yanboliang, https://github.com/mlazos	2024-06-19 00:06:46 +00:00
cyy	cb5e9183c6	[Caffe2] [2/N] Remove Caffe2 from tests (#128911 ) Follows #128675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128911 Approved by: https://github.com/titaiwangms, https://github.com/r-barnes	2024-06-19 00:05:50 +00:00
Andrew Gu	ac5f565fa7	[FSDP2] Added `set_post_optim_event` (#128975 ) This PR adds `set_post_optim_event` that allows power users to provide their own CUDA event that is recorded after the optimizer step for the FSDP root module to wait the all-gather streams on. ``` def set_post_optim_event(self, event: torch.cuda.Event) -> None: ``` By default, the root would have the all-gather streams wait on the current stream (`wait_stream`), which may introduce false dependencies if there is unrelated computation after the optimizer step and before the wait. For example, this pattern can appear in recommendation models. To avoid those false dependencies while preserving the correctness guarantee, we provide this API so that the user can provide their own CUDA event to wait the all-gather streams on. We include both correctness test (`test_fully_shard_training.py`) and overlap test (`test_fully_shard_overlap.py`). --- One possible way to use the API is to register a post-step hook on the optimizer. For example: `12e8d1399b/test/distributed/_composable/fsdp/test_fully_shard_training.py (L546-L552)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128975 Approved by: https://github.com/sanketpurandare, https://github.com/weifengpy ghstack dependencies: #128884	2024-06-18 22:26:14 +00:00
Jokeren	d9c294c672	[Inductor] Fix arguments passed to triton kernel launch hooks (#128732 ) `binary.launch_enter_hook` is treated as an instance method and will add a `self` argument to the hooks. `CompiledKernel.launch_enter_hook` is a static method, which matches the hook calling convention of profilers (i.e., a single `LazyDict` argument only). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128732 Approved by: https://github.com/shunting314, https://github.com/bertmaher	2024-06-18 22:06:55 +00:00
Xuehai Pan	a0e1e20c41	[BE][Easy] enable UFMT for `torch/distributed/` (#128870 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870 Approved by: https://github.com/fegin ghstack dependencies: #128868, #128869	2024-06-18 21:49:08 +00:00
Xuehai Pan	3b798df853	[BE][Easy] enable UFMT for `torch/distributed/{fsdp,optim,rpc}/` (#128869 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869 Approved by: https://github.com/fegin ghstack dependencies: #128868	2024-06-18 21:49:08 +00:00
Xuehai Pan	cec31050b4	[BE][Easy] enable UFMT for `torch/distributed/{tensor,_tensor}/` (#128868 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128868 Approved by: https://github.com/fegin	2024-06-18 21:49:02 +00:00
Nikita Shulga	e47603a549	Fix weight_norm decomposition behavior (#128956 ) By upcasting norm to float32 to align with CUDA and CPU behaviors `e6d4451ae8/aten/src/ATen/native/WeightNorm.cpp (L56-L59)` Discovered this when started running OpInfo tests, see https://github.com/pytorch/pytorch/actions/runs/9552858711/job/26332062502#step:20:1060 ``` File "/var/lib/jenkins/workspace/test/test_decomp.py", line 185, in op_assert_ref assert orig.dtype == decomp.dtype, f"{i} Operation: {op}" AssertionError: 1 Operation: aten._weight_norm_interface.default ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128956 Approved by: https://github.com/albanD ghstack dependencies: #128955	2024-06-18 21:24:12 +00:00
Aaron Enye Shi	2227da4431	[Profiler] Clean up use_mtia to follow standard use_device instead (#126284 ) Summary: use_mtia should instead set use_device='mtia' similar to cuda, xpu, and privateuseone. Avoid an ever-growing list of use_* arguments. Since use_mtia is specific to FBCode, we don't need a deprecation warning. Test Plan: CI. Differential Revision: D57338005 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/126284 Approved by: https://github.com/fenypatel99	2024-06-18 21:01:03 +00:00
dependabot[bot]	4cc3fb5ee2	Bump urllib3 from 2.2.1 to 2.2.2 in /tools/build/bazel (#128908 ) Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.1 to 2.2.2. - [Release notes](https://github.com/urllib3/urllib3/releases) - [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst) - [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2) --- updated-dependencies: - dependency-name: urllib3 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-06-18 13:38:22 -07:00
Joel Schlosser	5dc4f652bc	Backward support for unbind() with NJT (#128032 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032 Approved by: https://github.com/soulitzer	2024-06-18 20:29:00 +00:00
PyTorch MergeBot	44722c6b10	Revert "[dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453 )" This reverts commit 2b28b107dbafeec18d1095a2002e79511aa241df. Reverted https://github.com/pytorch/pytorch/pull/128453 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))	2024-06-18 20:09:00 +00:00
PyTorch MergeBot	1babeddbbf	Revert "[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484 )" This reverts commit 1f6e84fa6852805e15ddc9583c5f36c3a7f93df8. Reverted https://github.com/pytorch/pytorch/pull/128484 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))	2024-06-18 20:09:00 +00:00
PyTorch MergeBot	5bc9835d64	Revert "[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428 )" This reverts commit c52eda896eb3ec7f8d04b6321861f4c5614a40bb. Reverted https://github.com/pytorch/pytorch/pull/128428 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))	2024-06-18 20:09:00 +00:00
Li-Huai (Allan) Lin	9a7e2519d3	[MPS] Fused Adam & AdamW (#127242 ) Summary: This PR adds fused Adam and AdamW implementations. Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory: Fast math enabled: ``` [---------------------------------------------- Fused Adam ----------------------------------------------] \| Fused: True \| Fused: False 1 threads: ----------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 \| 10 \| 100 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 \| 9 \| 89 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 90 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 83 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 \| 12 \| 94 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 \| 11 \| 88 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 \| 12 \| 90 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 \| 11 \| 100 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 \| 27 \| 100 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 \| 23 \| 100 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 \| 27 \| 100 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 \| 23 \| 98 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500 \| 82 \| 480 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500 \| 72 \| 450 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500 \| 82 \| 450 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500 \| 73 \| 420 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500 \| 91 \| 500 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500 \| 83 \| 400 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500 \| 94 \| 500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500 \| 78 \| 400 
amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500 \| 170 \| 500 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500 \| 140 \| 600 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500 \| 170 \| 600 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500 \| 140 \| 500 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000 \| 250 \| 890 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000 \| 220 \| 850 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000 \| 250 \| 830 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000 \| 220 \| 770 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000 \| 270 \| 870 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000 \| 230 \| 840 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000 \| 270 \| 810 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000 \| 240 \| 800 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 400 \| 1000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 360 \| 2000 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 430 \| 2000 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 360 \| 1300 Times are in milliseconds (ms). 
``` Fast math disabled: ``` [---------------------------------------------- Fused Adam ----------------------------------------------] \| Fused: True \| Fused: False 1 threads: ----------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 \| 10 \| 100 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 \| 9 \| 84 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 84 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 \| 9 \| 79 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 \| 11 \| 93 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 \| 10 \| 90 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 \| 11 \| 91 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 \| 11 \| 81 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 \| 34 \| 100 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 \| 31 \| 100 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 \| 34 \| 95 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 \| 31 \| 100 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500 \| 94 \| 500 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500 \| 82 \| 430 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500 \| 92 \| 430 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500 \| 81 \| 390 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500 \| 98 \| 500 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500 \| 88 \| 430 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500 \| 100 \| 500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500 \| 88 \| 400 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500 \| 210 \| 500 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500 \| 190 \| 610 amsgrad: True, adamWflag: False, 
numel: 1048576, num_tensors: 500 \| 210 \| 510 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500 \| 190 \| 500 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000 \| 300 \| 900 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000 \| 260 \| 850 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000 \| 295 \| 900 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000 \| 260 \| 800 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000 \| 320 \| 910 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000 \| 280 \| 900 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000 \| 320 \| 900 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000 \| 300 \| 900 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 500 \| 2000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000 \| 480 \| 2000 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 540 \| 1500 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000 \| 480 \| 1200 Times are in milliseconds (ms). 
```
```python
def profile_fused_adam():
    import itertools

    import torch  # needed for torch.mps, torch.arange and torch.tensor below
    import torch.utils.benchmark as benchmark
    from torch.optim import adam, adamw

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        # Run a single functional optimizer step, then wait for the MPS queue to drain
        # so that the timer measures the full kernel execution.
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    # Sweep over tensor count, tensor size, Adam vs. AdamW, and amsgrad on/off.
    for num_tensors, numel, adamWflag, amsgrad in itertools.product(
        [100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]
    ):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [
            [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)]
            for _ in range(4)
        ]
        max_exp_avg_sqs = (
            [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)]
            if amsgrad
            else []
        )
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        # Time the fused and the non-fused path on identical inputs.
        for fused in [True, False]:
            t = benchmark.Timer(
                stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                label='Fused Adam',
                sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                globals=locals(),
                description=f"Fused: {fused}",
            ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242 Approved by: https://github.com/kulinseth, https://github.com/janeyx99	2024-06-18 19:59:50 +00:00
Chien-Chin Huang	fe8558b7aa	[DSD] Add unittest to verify HSDP1 + broadcast_from_rank0 (#128755 ) HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0, so we need a unit test to cover this use case. This test relies on the fix from https://github.com/pytorch/pytorch/pull/128446. Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128755 Approved by: https://github.com/Skylion007, https://github.com/wz337 ghstack dependencies: #128685	2024-06-18 19:42:51 +00:00
Sam Larsen	abde6cab4c	Remove compile_threads=1 in test_inductor_collectives.py (#128580 ) Summary: I believe https://github.com/pytorch/pytorch/issues/125235 should be fixed after switching to subprocess-based parallel compile. Test Plan: Ran locally with python-3.9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128580 Approved by: https://github.com/eellison	2024-06-18 19:31:13 +00:00
Boyuan Feng	04a5d3228e	[ts migration] Support prim::tolist and aten::len (#128894 ) Support prim::tolist and aten::len. Add unit tests for prim::min. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128894 Approved by: https://github.com/angelayi	2024-06-18 19:11:07 +00:00
Nikita Shulga	44483972bd	[EZ] Keep weight_norm var name aligned (#128955 ) To keep it aligned with `e6d4451ae8/aten/src/ATen/native/native_functions.yaml (L6484)` I.e. `x`->`v`, `y`->`g` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128955 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-06-18 18:40:59 +00:00
Animesh Jain	bdffd9f0c6	[export] Graph break on nn.Parameter construction (#128935 ) Fixes https://github.com/pytorch/pytorch/issues/126109 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128935 Approved by: https://github.com/angelayi	2024-06-18 18:37:44 +00:00
Chien-Chin Huang	1a527915a6	[DSD] Correctly handle shared parameters for optimizer state_dict (#128685 ) * Fixes https://github.com/pytorch/pytorch/issues/128011 See the discussion in https://github.com/pytorch/pytorch/pull/128076 Current implementation of `set_optimizer_state_dict()` assumes that all the fqns returned by `_get_fqns()` must exist in the optimizer state_dict. This is not true if the model has shared parameters. In such a case, only one fqn of the shared parameters will appear in the optimizer state_dict. This PR addresses the issue. Differential Revision: [D58573487](https://our.internmc.facebook.com/intern/diff/D58573487/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128685 Approved by: https://github.com/LucasLLC	2024-06-18 18:34:32 +00:00
loganthomas	d77a1aaa86	DOC: add note about same sized tensors to dist.gather() (#128676 ) Fixes #103305 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128676 Approved by: https://github.com/wconstab	2024-06-18 18:26:07 +00:00
soulitzer	1877b7896c	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) ### bc-breaking for existing users of the private API: - Existing policy functions must now change their return value to be [CheckpointPolicy](`c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230)`) Enum instead of bool. - To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True` depending on whether you prefer the compiler to override your policy. - Policy function now accepts a `ctx` object instead of `mode` for its first argument. - To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to error if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user cleverly also saves the output of the in-place op). Randomness, views - Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors.)
Tensor object preservation - ~We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors which care about the object identity of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is something that should be documented as part of public API. We call the policy function for all ops except ~~detach~~ (UPDATE: metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`) because these ops may be called a different number of times by AC itself between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute, see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes ctx object as its first argument. The ctx object encapsulates info that may be useful to the user; it currently only holds "is_recompute".
Adding this indirection gives us flexibility to add more attrs later if necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-18 18:18:50 +00:00
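The migration steps listed in the SAC commit above (#125795) can be sketched as a small adapter. This is a hedged illustration rather than code from the PR: `CheckpointPolicy` and `Ctx` below are local stand-ins that mirror only the enum members and the `is_recompute` attribute named in the commit message, `adapt_legacy_policy` is a hypothetical helper name, and ops are represented by plain strings.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CheckpointPolicy(Enum):
    """Local stand-in mirroring the enum members named in the commit message."""

    MUST_SAVE = auto()
    PREFER_SAVE = auto()
    MUST_RECOMPUTE = auto()
    PREFER_RECOMPUTE = auto()


@dataclass
class Ctx:
    """Stand-in for the ctx object, which per the commit only holds is_recompute."""

    is_recompute: bool


def adapt_legacy_policy(legacy_policy_fn, save_policy=CheckpointPolicy.PREFER_SAVE):
    """Wrap an old-style (mode, op, ...) -> bool policy as (ctx, op, ...) -> enum."""

    def policy_fn(ctx, op, *args, **kwargs):
        # Recover the old `mode` string exactly as the commit message suggests.
        mode = "recompute" if ctx.is_recompute else "forward"
        should_save = legacy_policy_fn(mode, op, *args, **kwargs)
        # True maps to a *_SAVE member, False maps to PREFER_RECOMPUTE.
        return save_policy if should_save else CheckpointPolicy.PREFER_RECOMPUTE

    return policy_fn


def legacy_policy(mode, op, *args, **kwargs):
    # Old-style policy: save matmul outputs, recompute everything else.
    return op == "mm"


new_policy = adapt_legacy_policy(legacy_policy)
```

Passing `save_policy=CheckpointPolicy.MUST_SAVE` instead would express that the compiler should not override the decision. With the real API, the resulting policy function would be handed to `create_selective_checkpoint_contexts` through `functools.partial`, as the commit message mentions.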
PyTorch MergeBot	77830d509f	Revert "Introduce a prototype for SymmetricMemory (#128582 )" This reverts commit 7a39755da28d5a109bf0c37f72b364d3a83137b1. Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))	2024-06-18 18:11:43 +00:00
Huy Do	84c86e56bd	Update tracker issues after successfully cherry-picking a PR (#128924 ) This extends the capability of the cherry-pick bot to automatically update the tracker issue with the information. For this to work, the tracker issue needs to be an open one with a `release tracker` label, i.e. https://github.com/pytorch/pytorch/issues/128436. The version from the release branch, i.e. `release/2.4`, is matched against the title of the tracker issue, i.e. `[v.2.4.0] Release Tracker` or `[v.2.4.1] Release Tracker` ### Testing `python cherry_pick.py --onto-branch release/2.4 --classification release --fixes "DEBUG DEBUG" --github-actor huydhn 128718` * On the PR https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174846771 * On the tracker issue https://github.com/pytorch/pytorch/issues/128436#issuecomment-2174846757 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128924 Approved by: https://github.com/atalman	2024-06-18 17:48:47 +00:00
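The branch-to-tracker matching described in the commit above (#128924) could look roughly like the sketch below. It is an assumption about the matching rule based only on the two example titles in the message; `tracker_title_matches` is a hypothetical name and the actual logic in `cherry_pick.py` may differ.

```python
import re


def tracker_title_matches(branch: str, title: str) -> bool:
    """Return True if a tracker title such as '[v.2.4.0] Release Tracker'
    belongs to a release branch such as 'release/2.4'."""
    branch_match = re.fullmatch(r"release/(\d+\.\d+)", branch)
    if branch_match is None:
        return False
    # Escape the dots so that '2.4' does not match e.g. '2x4'.
    version = re.escape(branch_match.group(1))
    return re.match(rf"\[v\.{version}\.\d+\] Release Tracker", title) is not None
```

Both patch-level titles quoted in the message (`[v.2.4.0]` and `[v.2.4.1]`) match `release/2.4` under this rule, while a tracker for another minor version does not.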
eqy	4e03263224	[CUDA][Convolution] Add missing launch bounds to `vol2col_kernel` (#128740 ) Fix "too many resources requested" that can happen with recent toolkits on V100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128740 Approved by: https://github.com/mikaylagawarecki	2024-06-18 17:26:23 +00:00
Kazuaki Ishizaki	26e374e3ca	[EZ] Fix typos in RELEASE.md (#128769 ) This PR fixes typos in `RELEASE.md`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128769 Approved by: https://github.com/yumium, https://github.com/mikaylagawarecki	2024-06-18 17:15:05 +00:00
Guilherme Leobas	9818283da1	re-enable jacrev/jacfwd/hessian after #128028 landed (#128622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128622 Approved by: https://github.com/zou3519	2024-06-18 17:08:58 +00:00
eqy	ec616da518	RNN API cleanup for cuDNN 9.1 (#122011 ) Can potentially avoid a bit of boilerplate if we move directly to cuDNN 9.1's RNN API... Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122011 Approved by: https://github.com/Skylion007	2024-06-18 16:16:38 +00:00
David Berard	108318ad10	[BE][JIT] Handle case where codegen object can be unset (#128951 ) Summary: Unblocks a test that's failing. `codegen` can be unset until `compile` is called. If `codegen` is not set, then just use the kernel name directly. Test Plan: ``` buck2 run //caffe2/test:tensorexpr -- --regex test_simple_add ``` Differential Revision: D58727391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128951 Approved by: https://github.com/aaronenyeshi	2024-06-18 15:40:45 +00:00
Isuru Fernando	4817180601	make fallback for aten.argsort.stable (#128907 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128907 Approved by: https://github.com/lezcano ghstack dependencies: #128343	2024-06-18 14:56:35 +00:00
Xuehai Pan	22d258427b	[BE][Easy] enable UFMT for `torch/distributed/_shard/` (#128867 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128867 Approved by: https://github.com/fegin ghstack dependencies: #128866	2024-06-18 14:39:25 +00:00
Xuehai Pan	e6d4451ae8	[BE][Easy] enable UFMT for `torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/` (#128866 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866 Approved by: https://github.com/fegin	2024-06-18 13:51:53 +00:00
Andrew Gu	f2805a0408	[FSDP2] Added APIs for explicit fwd/bwd prefetching (#128884 ) This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively. ``` def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]) -> None def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]) -> None ``` Motivation FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order. However, there may be cases where with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)). The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem since such a limit would generally apply over the entire model even though it possibly should not. Then, expert users will see a specific all-gather that they want to deviate from this limit, and there is little we can do. Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. 
record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs). Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884 Approved by: https://github.com/ckluk2, https://github.com/weifengpy	2024-06-18 13:32:57 +00:00
Ahmed Gheith	3dd5f0ecbb	Remove circular import (#128875 ) Summary: A spurious import is causing circular dependency errors Test Plan: phabricator signals Differential Revision: D58685676 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128875 Approved by: https://github.com/kit1980	2024-06-18 12:30:13 +00:00
leslie-fang-intel	304c934572	Move MKLDNN Specific IR to Separate File (#126504 ) Summary Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, Move Inductor MKLDNN specific IRs to a separate file. Co-authored-by: Isuru Fernando <ifernando@quansight.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504 Approved by: https://github.com/desertfire, https://github.com/jgong5 ghstack dependencies: #126841, #126940	2024-06-18 09:29:13 +00:00
Chien-Chin Huang	6e43897912	[BE][ptd_fb_test][3/N] Enable TestSlide for MultiThreadedTestCase (#128843 ) Enabling testslide for MultiThreadedTestCase, similar to https://github.com/pytorch/pytorch/pull/127512. Differential Revision: [D58677457](https://our.internmc.facebook.com/intern/diff/D58677457/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128843 Approved by: https://github.com/wz337	2024-06-18 07:05:31 +00:00
Chien-Chin Huang	60baeee59f	[BE] Skip the test if CUDA is not available (#128885 ) As title Differential Revision: [D58690210](https://our.internmc.facebook.com/intern/diff/D58690210/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128885 Approved by: https://github.com/wz337	2024-06-18 07:02:44 +00:00
Will Feng	e3a39d49a0	[Traceable FSDP][Compiled Autograd] Add queue_callback() support (#126366 ) Adds support for `Variable._execution_engine.queue_callback()`, which is used in FSDP2. Important tests: - `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_callback_graph_break_throws_error` - `pytest -rA test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_callback_adds_callback` - `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_callback_adds_callback` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126366 Approved by: https://github.com/xmfan	2024-06-18 06:22:14 +00:00
Chirag Pandya	f7eae27946	Pass params to dump_nccl_trace_pickle (#128781 ) Summary Pass parameters from the request to the dump_nccl_trace_pickle handler. The supported parameters and their values are all lowercase: includecollectives={true, false} includestacktraces={true, false} onlyactive={true, false} An example POST is: /handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true Test Plan: unit tests Differential Revision: [D58640474](https://our.internmc.facebook.com/intern/diff/D58640474) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128781 Approved by: https://github.com/d4l3k	2024-06-18 03:46:57 +00:00
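A request path like the one quoted in the commit above (#128781) can be assembled with the standard library. The helper name `build_dump_request_path` is hypothetical; the handler path and the three lowercase parameter names are taken from the commit message.

```python
from urllib.parse import urlencode


def build_dump_request_path(
    include_collectives: bool, include_stack_traces: bool, only_active: bool
) -> str:
    """Build the request path for the dump_nccl_trace_pickle handler."""

    def flag(value: bool) -> str:
        # The handler expects lowercase 'true' / 'false' strings.
        return "true" if value else "false"

    query = urlencode(
        {
            "includecollectives": flag(include_collectives),
            "includestacktraces": flag(include_stack_traces),
            "onlyactive": flag(only_active),
        }
    )
    return f"/handler/dump_nccl_trace_pickle?{query}"
```

Calling `build_dump_request_path(True, False, True)` reproduces the example path from the commit message.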
Joona Havukainen	d9eaa224f2	Fixes #128429 : NaN in triu op on MPS (#128575 ) Fixes triu op when k > 0 and the lower triangle of the input tensor contains inf leading to NaNs in the computation through complement. Fixed by using select API instead. Fixes #128429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128575 Approved by: https://github.com/kulinseth	2024-06-18 03:44:42 +00:00
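The failure mode in the triu fix above (#128575) can be illustrated in plain Python. The commit message does not show the MPS graph, so the mask-multiply variant below is only an assumed stand-in for an arithmetic formulation that breaks on `inf`; the select variant corresponds to the "select" approach the commit mentions. Both function names are hypothetical.

```python
import math


def triu_by_mask_multiply(matrix, k=0):
    """Zero the entries below the k-th diagonal by multiplying with a 0/1 mask.
    inf * 0.0 is NaN, so infinities in the discarded region leak into the result."""
    return [
        [x * (1.0 if j - i >= k else 0.0) for j, x in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


def triu_by_select(matrix, k=0):
    """Pick either the input value or 0.0; discarded values never enter any arithmetic."""
    return [
        [x if j - i >= k else 0.0 for j, x in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Input with inf in the lower triangle, as in the bug report.
example = [[1.0, 2.0], [math.inf, 3.0]]
masked = triu_by_mask_multiply(example, k=1)
selected = triu_by_select(example, k=1)
```

Here `masked[1][0]` is NaN while `selected[1][0]` is a clean `0.0`, which is why a selection-based formulation is robust to non-finite values in the part of the tensor that gets discarded.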
Tristan Rice	59b4983dc0	DebugPlane: add dump_traceback handler (#128904 ) This adds a `dump_traceback` handler so you can see all running threads for a job. This uses a temporary file as a buffer when calling `faulthandler.dump_traceback` and requires the GIL to be held during dumping. Test plan: ``` python test/distributed/elastic/test_control_plane.py -v -k traceback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128904 Approved by: https://github.com/c-p-i-o	2024-06-18 03:40:16 +00:00
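The buffering approach mentioned in the commit above (#128904) can be sketched with the standard library alone. `faulthandler.dump_traceback` writes straight to a file descriptor, which is why an in-memory `io.StringIO` cannot be used and a temporary file serves as the buffer. The function name `dump_all_tracebacks` is hypothetical and the actual handler may differ in detail.

```python
import faulthandler
import tempfile


def dump_all_tracebacks() -> str:
    """Return the tracebacks of all running threads as text."""
    # A real file is required because faulthandler writes via the file descriptor.
    with tempfile.TemporaryFile() as buffer:
        faulthandler.dump_traceback(file=buffer, all_threads=True)
        buffer.seek(0)
        return buffer.read().decode("utf-8", errors="replace")
```

The returned string lists each thread with its frames, most recent call first, so it can be sent back to the caller as the handler's response body.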
Xu Han	17abbafdfc	[inductor] Fix some windows cpp builder issue (#128765 ) 1. Fix some Windows build args. 2. Fix the C++20 likely issue on Windows; reference: https://github.com/pytorch/pytorch/pull/124997. 3. Remove the compiler return value check: different compilers return different values, so check for exceptions to catch errors instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128765 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-18 03:25:20 +00:00
Joel Schlosser	4061b3b822	Forward fix to skip ROCm tests for #122836 (#128891 ) Fixes broken ROCm tests from #122836. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128891 Approved by: https://github.com/huydhn ghstack dependencies: #127007, #128057, #122836	2024-06-18 03:01:19 +00:00
Animesh Jain	c017c97333	[dynamo][inlining-inbuilt-nn-modules] Update test output (#128880 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128880 Approved by: https://github.com/mlazos ghstack dependencies: #128315, #128748, #128877, #128878	2024-06-18 02:18:09 +00:00
Animesh Jain	4e97d37fd9	[inlining-inbuilt-nn-modules][pre-grad] Adjust efficient_conv_bn_eval_graph for inlining (#128878 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128878 Approved by: https://github.com/mlazos ghstack dependencies: #128315, #128748, #128877	2024-06-18 02:18:09 +00:00
Animesh Jain	22f1793c0a	[dynamo][easy] Use LazyVariableTracker for UserDefinedObject var_getattr (#128877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128877 Approved by: https://github.com/mlazos ghstack dependencies: #128315, #128748	2024-06-18 02:17:56 +00:00
Boyuan Feng	43998711a7	[CUDAGraph] add more docs for cudagraph trees (#127963 ) This PR adds more documentation for CUDAGraph Trees, including - Iteration Support - Input Mutation Support - Dynamic Shape Support - NCCL Support - Reasons for Skipping CUDAGraph Pull Request resolved: https://github.com/pytorch/pytorch/pull/127963 Approved by: https://github.com/eellison	2024-06-18 02:07:07 +00:00
Fuzzkatt	e12fa93b8b	add is_big_gpu(0) check to test_select_algorithm tests in tests/inductor/test_cuda_cpp_wrapper.py (#128652 ) In NVIDIA internal CI, on Jetson devices we are seeing this failure for `python test/inductor/test_cuda_cpp_wrapper.py -k test_addmm_cuda_cuda_wrapper -k test_linear_relu_cuda_cuda_wrapper`: ``` /usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:132: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. warnings.warn( W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm mode frames [('total', 1), ('ok', 1)] stats [('calls_captured', 2), ('unique_graphs', 1)] inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1)] aot_autograd [('total', 1), ('ok', 1)] F ====================================================================== FAIL: test_linear_relu_cuda_cuda_wrapper (__main__.TestCudaWrapper) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper method(args, kwargs) File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 9818, in new_test return value(self) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/opt/pytorch/pytorch/test/inductor/test_cuda_cpp_wrapper.py", line 152, in fn _, code = test_torchinductor.run_and_get_cpp_code( File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 356, in run_and_get_cpp_code result = fn(args, *kwargs) File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 43, in wrapped return fn(args, *kwargs) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File 
"/usr/lib/python3.10/unittest/mock.py", line 1379, in patched return func(newargs, *newkeywargs) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, *kwds) File "/usr/lib/python3.10/contextlib.py", line 79, in inner return func(args, **kwds) File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 62, in test_linear_relu_cuda self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1) File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 3642, in assertEqual raise error_metas.pop()[0].to_error( AssertionError: Scalars are not equal! Expected 1 but got 0. Absolute difference: 1 Relative difference: 1.0 ``` Looking into it, we see the failure is from https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L62. The warning `W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm ` is triggered from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L973. Printing torch.cuda.get_device_properties(0).multi_processor_count returns 16 on the computelab AGX Orin; thus it makes sense that this check is failing, since the min_required_sms is 68, thus not letting it pick the autotune algorithm. Looking at the main for test_select_algorithm.py, we see that these tests should only be run if is_big_gpu(0) is true: https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L344. Thus this PR adds a similar check to the invocation of these tests in test_cuda_cpp_wrapper.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128652 Approved by: https://github.com/soulitzer, https://github.com/eqy	2024-06-18 02:00:04 +00:00
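The reasoning in the commit above (#128652) boils down to a streaming-multiprocessor count check. The sketch below restates it with the numbers quoted in the message; `has_enough_sms` is a hypothetical name used for illustration, whereas the real tests gate on `is_big_gpu(0)`.

```python
import unittest

MIN_REQUIRED_SMS = 68  # threshold quoted in the commit message


def has_enough_sms(multi_processor_count: int, min_required_sms: int = MIN_REQUIRED_SMS) -> bool:
    """Return True if the device has enough SMs for max_autotune_gemm to be picked."""
    return multi_processor_count >= min_required_sms


ORIN_SM_COUNT = 16  # value reported for the AGX Orin in the commit message


class AutotuneDependentTest(unittest.TestCase):
    @unittest.skipUnless(has_enough_sms(ORIN_SM_COUNT), "not enough SMs for max_autotune_gemm")
    def test_autotune_counter(self):
        # Placeholder body; the real tests assert on the autotune counter.
        self.assertTrue(True)
```

With 16 SMs the check is false, so a test that asserts on `select_algorithm_autotune` is skipped instead of failing with "Expected 1 but got 0".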
Huy Do	9e8443b56f	Remove dtype from gpt-fast micro benchmark experiments model name (#128789 ) Per comments on https://github.com/pytorch/test-infra/pull/5344, we already have a dtype column with the same information Pull Request resolved: https://github.com/pytorch/pytorch/pull/128789 Approved by: https://github.com/yanboliang	2024-06-18 01:26:45 +00:00
Shangdi Yu	fbc7559ceb	[custom ops] convert string type annotation to real type (#128809 ) Fixes #105157 Bug source: `from __future__ import annotations` converts type annotations to strings to make forward references easier. However, existing custom ops do not consider strings to be valid types. Fix: We check whether each argument and return type annotation is a string. If so, we try to use `eval` to convert it to a type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128809 Approved by: https://github.com/zou3519	2024-06-18 00:55:50 +00:00
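The fix described in the commit above (#128809) can be sketched as follows. This is an illustration of the general technique, not the code from the PR: `resolve_string_annotations` is a hypothetical helper, and evaluating annotations against the function's module globals is an assumption about which namespace is used.

```python
import inspect


def resolve_string_annotations(fn):
    """Return ({param_name: type}, return_type) with string annotations evaluated."""
    signature = inspect.signature(fn)
    namespace = fn.__globals__  # assumed lookup scope for names in the annotations

    def resolve(annotation):
        # Only strings need conversion; real types pass through unchanged.
        if isinstance(annotation, str):
            return eval(annotation, namespace)
        return annotation

    params = {name: resolve(p.annotation) for name, p in signature.parameters.items()}
    return params, resolve(signature.return_annotation)


def scale(x: "float", times: "int") -> "float":
    # String annotations, as produced by `from __future__ import annotations`.
    return x * times


param_types, return_type = resolve_string_annotations(scale)
```

Since `eval` executes arbitrary expressions, this only makes sense for annotations written by the op author. The standard library's `typing.get_type_hints` performs a similar resolution and may be preferable where it applies.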
leslie-fang-intel	c35ffaf954	[Inductor][CPP] Add ne with VecMask (#126940 ) Summary Fix https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161 which is missing the support of `ne` with `VecMask`. Test Plan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool ``` Co-authored-by: Isuru Fernando <ifernando@quansight.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126841	2024-06-18 00:23:03 +00:00
leslie-fang-intel	beb29836cd	[Inductor][CPP] Add Min/Max with VecMask (#126841 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/126824 which is missing the support of `min/max` with `VecMask`. TestPlan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool ``` Co-authored-by: Isuru Fernando <ifernando@quansight.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-06-18 00:20:32 +00:00
chilli	11ff5345d2	Changed colored logging to only be turned on if printing to interactive terminal (#128874 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128874 Approved by: https://github.com/anijain2305	2024-06-17 23:53:26 +00:00
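The behaviour named in the commit above (#128874) is commonly implemented with an `isatty()` check. The sketch below shows that general pattern; `should_use_color` is a hypothetical name, and the commit message does not say which stream the actual change inspects.

```python
import io
import sys


def should_use_color(stream=None) -> bool:
    """Return True only if the stream is attached to an interactive terminal."""
    stream = sys.stderr if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


# An in-memory buffer, like a log file or a pipe, is not a terminal.
captured_output = io.StringIO()
use_color_for_buffer = should_use_color(captured_output)
```

Gating on this check keeps ANSI escape sequences out of log files and CI output while preserving colors in an interactive shell.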
awayzjj	b70440f0a7	Document the torch.cuda.profiler.profile function (#128216 ) Fixes https://github.com/pytorch/pytorch/issues/127901 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128216 Approved by: https://github.com/malfet, https://github.com/eqy	2024-06-17 23:42:40 +00:00
Edward Z. Yang	95b5ea9cde	Add mark_unbacked (#128638 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128638 Approved by: https://github.com/IvanKobzarev	2024-06-17 23:39:48 +00:00
Xiaodong Wang	8415a4ba98	Back out "[ROCm] TunableOp for gemm_and_bias (#128143 )" (#128815 ) Summary: Original commit changeset: 35083f04fdae Original Phabricator Diff: D58501726 This PR is bringing a large numerical gap. e.g. for 256 x 4096 x 4096 GEMM, if we enable tunable op + DISABLE_ADDMM_HIP_LT=0, the results are way off. Differential Revision: D58660832 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128815 Approved by: https://github.com/mxz297, https://github.com/eqy, https://github.com/malfet	2024-06-17 22:52:27 +00:00
atalman	3b8c9b8ab1	[Docker Release] Test if pytorch was compiled with CUDA before pushing to repo (#128852 ) Related to: https://github.com/pytorch/pytorch/issues/125879 Would check if we are compiled with CUDA before publishing CUDA Docker nightly image Test ``` #18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi #18 1.656 Is torch compiled with cuda: False #18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1 ------ > [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi: 1.656 Is torch compiled with cuda: False ------ Dockerfile:80 -------------------- 79 \| RUN /opt/conda/bin/pip install torchelastic 80 \| >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\ 81 \| >>> echo "Is torch compiled with cuda: ${IS_CUDA}"; \ 82 \| >>> if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \ 83 \| >>> exit 1; \ 84 \| >>> fi 85 \| -------------------- ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! 
-z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1 (base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain --platform="linux/amd64" --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" . #0 building with "default" instance using docker driver ``` Please note: it looks like we are installing from the pytorch channel rather than the nightly channel on PRs, hence CUDA 12.4 is failing since it's not in the pytorch channel yet: https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128852 Approved by: https://github.com/malfet	2024-06-17 22:51:12 +00:00
Xu Zhao	1835e3beab	Fix the inductor ci (#128879 ) Fix the torchbench+inductor ci on trunk due to recent upgrade to numpy 2.0.0rc1. We have to remove DALLE2_pytorch model, since it depends on embedding-reader, which is not compatible with numpy>2: https://github.com/rom1504/embedding-reader/blob/main/requirements.txt#L3 Fixes #128845 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128879 Approved by: https://github.com/eellison	2024-06-17 22:20:33 +00:00
Shengbao Zheng	7baf32b5e7	[c10d] fix p2p group commsplit (#128803 ) Summary: For PointToPoint(sendrecv), the deviceId is lower_rank:higher_rank. This means a p2p group cannot be created through commSplit since it cannot find a parent. Fix this by using the right device key of current rank. Differential Revision: D58631639 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128803 Approved by: https://github.com/shuqiangzhang	2024-06-17 22:07:40 +00:00
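The key format mentioned in the commit above (#128803) can be illustrated with a one-line helper. Only the `lower_rank:higher_rank` format comes from the commit message; `p2p_device_key` is a hypothetical name and the real key construction lives in the C++ process group code.

```python
def p2p_device_key(rank_a: int, rank_b: int) -> str:
    """Build the send/recv key in the 'lower_rank:higher_rank' form."""
    low, high = sorted((rank_a, rank_b))
    return f"{low}:{high}"
```

Both peers derive the same string regardless of who sends, but the string is specific to that pair of ranks, which is consistent with the commit's observation that a parent lookup under this key finds nothing to split from.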
Jun Luo	1fd7496ab2	[MTIA] Fix synchronize API (#128714 ) Reviewed By: fenypatel99 Differential Revision: D58590313 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128714 Approved by: https://github.com/aaronenyeshi	2024-06-17 21:58:46 +00:00
cyy	163847b1bb	[1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128675 Approved by: https://github.com/r-barnes	2024-06-17 21:25:59 +00:00
Yanbo Liang	8953725e6d	[Inductor][FlexAttention] Tune backwards kernel block sizes (#128853 ) This replaces #128767, which was somehow closed by mistake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128853 Approved by: https://github.com/angelayi	2024-06-17 21:10:55 +00:00
Yanbo Liang	a489792bb2	[GPT-benchmark] Fix memory bandwidth for MoE (#128783 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128783 Approved by: https://github.com/Chillee ghstack dependencies: #128768	2024-06-17 21:04:57 +00:00
Yanbo Liang	8c06eae17e	[GPT-benchmark] Add metric: compilation time for GPT models (#128768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128768 Approved by: https://github.com/Chillee	2024-06-17 21:04:57 +00:00
Masaki Kozuki	a59766ee05	replace `AT_ERROR(...)` with `TORCH_CHECK(false, ...)` (#128788 ) As per the title. I encountered the old-fashioned `AT_ERROR(...)` macro by chance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128788 Approved by: https://github.com/mikaylagawarecki	2024-06-17 20:50:22 +00:00
Kurman Karabukaev	0f89e66d17	Validate logs are created by default (#128522 ) Summary: Make sure that logs are captured in default settings Test Plan: ci Differential Revision: D58395812 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128522 Approved by: https://github.com/d4l3k	2024-06-17 20:07:13 +00:00
Huy Do	1577328ea4	Set bash shell on Windows (#128854 ) Attempt to fix the missing python3 command on the new Windows AMI https://github.com/pytorch/pytorch/actions/runs/9551494945/job/26325922503. I added the logic to copy python to python3 to make the command available; it worked with the previous AMI but has started to fail, and the cause is not clear (maybe it's not the AMI, but a new GitHub runner version). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128854 Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman	2024-06-17 19:24:09 +00:00
Mikayla Gawarecki	b181b58857	Fix Storage.filename to not track the filename when storage was mmap-ed with MAP_PRIVATE (#128725 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128725 Approved by: https://github.com/albanD	2024-06-17 18:55:47 +00:00
Catherine Lee	213eba7d2e	Configure mergebot via config (#128840 ) Fixes #ISSUE_NUMBER * Companion to https://github.com/pytorch/test-infra/pull/5312 * See the above for details + possible risks * Without the above PR, this should have no effects Pull Request resolved: https://github.com/pytorch/pytorch/pull/128840 Approved by: https://github.com/huydhn	2024-06-17 18:53:56 +00:00
PyTorch MergeBot	c172b58fe0	Revert "Update DALLE2_pytorch expected accuracy result on CPU (#128718 )" This reverts commit fd27138c4a86bd763a6b8128d940a7c98f951603. Reverted https://github.com/pytorch/pytorch/pull/128718 on behalf of https://github.com/huydhn due to This has reverted back to the previous expected value for some reason `153362fbc9` ([comment](https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174194219))	2024-06-17 18:49:15 +00:00
eellison	5344c41d43	Use forked torchbench branch with pinned numpy (#128856 ) Adds pinned numpy commit to yolov3 dependencies to the existing pinned commit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128856 Approved by: https://github.com/huydhn, https://github.com/PaliC	2024-06-17 18:41:42 +00:00
cyy	d35cdee97f	[Caffe2] Remove caffe2 onnx tests (#128687 ) They are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128687 Approved by: https://github.com/r-barnes	2024-06-17 18:17:58 +00:00
Mihir Patel	153362fbc9	Support HSDP + Monolith Checkpointing (#128446 ) Fixes #128444. Rank 0 check should be in the same group as the broadcast Pull Request resolved: https://github.com/pytorch/pytorch/pull/128446 Approved by: https://github.com/fegin	2024-06-17 16:59:41 +00:00
ibartol	c6b180a316	Created docs (and example) for cudart function in torch.cuda (#128741 ) Fixes #127908 ## Description Created docs to document the torch.cuda.cudart function to solve the issue #127908. I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and checked what the function raises, etc. from the user endpoint and documented it (i.e. I am giving what _lazy_init() will actually raise). Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake. ### Summary of Changes - Added docs for torch.cuda.cudart - Added the cudart function in the autosummary of docs/source/cuda.rst ## Checklist - [X] The issue that is being fixed is referred to in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecessary issues are included in this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741 Approved by: https://github.com/msaroufim	2024-06-17 16:50:37 +00:00
drisspg	fc2913fb80	Remove amax return from _scaled_mm (#128683 ) # Summary The primary reason for the change was the lack of a current use case and the need to work around two Inductor issues: - Tensor arguments as kwarg only - multiple outputs from triton templates If the need for the amax return type arises we can consider either adding it back or, more likely, creating a separate op. In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels. ### Changes: - This removes the amax return type from scaled_mm. We have found that the common use case is to return in "high-precision" (a type with more precision than fp8). This is only relevant when returning in low-precision. - We currently still allow for fp8 returns and scaled result. Perhaps we should also ban this as well... New signature: ```Python def meta_scaled_mm( self: torch.Tensor, mat2: torch.Tensor, scale_a: torch.Tensor, scale_b: torch.Tensor, bias: Optional[torch.Tensor] = None, scale_result: Optional[torch.Tensor] = None, out_dtype: Optional[torch.dtype] = None, use_fast_accum: bool = False, ) -> torch.Tensor: ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683 Approved by: https://github.com/vkuzo	2024-06-17 16:48:00 +00:00
Andrew Hoblitzell	73b78d1cbe	Document the torch.nn.parallel.scatter_gather.gather function (#128566 ) Fixes #127899 ### Description Add docstring to `torch/nn/parallel/scatter_gather.py:gather` function Pull Request resolved: https://github.com/pytorch/pytorch/pull/128566 Approved by: https://github.com/kwen2501	2024-06-17 16:44:17 +00:00
Jiashen Cao	316b729677	[Fix] TS converter constant to tensor (#128442 ) #### Issue Tensor constants were previously lifted directly as inputs in the fx graph, which results in errors for multiple test cases with tensor constants. This PR introduces a fix to convert a tensor constant to a `GetAttr` in the fx graph. This PR also introduces other fixes to maintain a valid `state_dict` for the exported program when there are tensor constants. In short, after tensor constants are converted to `GetAttr`, they are treated as buffers during retracing. The fix will convert those back from buffer to constant. #### Test Plan Add new test cases that generate tensor constants * `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128442 Approved by: https://github.com/angelayi	2024-06-17 16:42:43 +00:00
Xuehai Pan	a87d82abd7	[BE] enable UFMT for `torch/nn/*.py` (#128593 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #128596, #128594, #128592	2024-06-17 16:29:29 +00:00
Xuehai Pan	f6e6e55fa7	[BE] enable UFMT for `torch/nn/functional.py` (#128592 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #128596, #128594	2024-06-17 16:29:29 +00:00
Xuehai Pan	95ac2d6482	[BE] enable UFMT for `torch/nn/modules` (#128594 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #128596	2024-06-17 16:29:25 +00:00
Xuehai Pan	dff6342a0b	[BE][Easy] enable UFMT for `torch/nn/parallel` (#128596 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128596 Approved by: https://github.com/mikaylagawarecki	2024-06-17 16:29:22 +00:00
Zhengxu Chen	bfad0aee44	[export] Preserve requires_grad for export inputs. (#128656 ) Summary: Today meta['val'] on placeholder nodes doesn't preserve the consistent requires_grad information with the original inputs. Seems there's no easy way to fix this directly at proxy tensor layer. This is useful for reexporting joint graph. Test Plan: test_preserve_requires_grad_placeholders Differential Revision: D58555651 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128656 Approved by: https://github.com/tugsbayasgalan	2024-06-17 16:26:08 +00:00
Joel Schlosser	2a41fc0390	Short-term fix to preserve NJT metadata cache in torch.compile (#122836 ) Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile. For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors. NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing. Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836 Approved by: https://github.com/soulitzer ghstack dependencies: #127007, #128057	2024-06-17 15:25:09 +00:00
Sam Larsen	24443fe16a	[inductor] parallel compile: Print traceback detail when there's an exception in a sub-process (#128775 ) Summary: We lose traceback info when an exception occurs in a subprocess because Python traceback objects don't pickle. In the subprocess-based parallel compile, we _are_ logging an exception in the subprocess, but a) those messages are easy to miss because they're not in the traceback output, and b) it seems that logging in the subproc is swallowed by default in internal builds. This PR captures the traceback in the subprocess and makes it available in the exception thrown in the main process. Users now see failures that look like this: ``` ... File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 458, in result return self.__get_result() File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: SubprocException: An exception occurred in a subprocess: Traceback (most recent call last): File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 270, in do_job result = SubprocMain.foo() File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 263, in foo SubprocMain.bar() File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 260, in bar SubprocMain.baz() File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 257, in baz raise Exception("an error occurred") Exception: an error occurred ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128775 Approved by: https://github.com/jansel	2024-06-17 15:10:47 +00:00
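The idea in #128775 (traceback objects do not pickle, so format them as text in the worker and re-raise in the parent) can be sketched as follows. This is a minimal illustration, not the PR's implementation: `run_job`, `unwrap`, and this `SubprocException` class are hypothetical stand-ins, and a `pickle` round trip stands in for the real subprocess pipe.

```python
import pickle
import traceback


class SubprocException(Exception):
    """Hypothetical wrapper that carries a worker traceback as plain text."""

    def __init__(self, details: str) -> None:
        super().__init__(f"An exception occurred in a subprocess:\n\n{details}")
        self.details = details


def run_job(fn, *args):
    """Worker side: never raise; return either the result or the formatted traceback."""
    try:
        return ("ok", fn(*args))
    except Exception:
        # A str is picklable, unlike the traceback object itself.
        return ("error", traceback.format_exc())


def unwrap(payload):
    """Parent side: turn an error payload back into an exception with full detail."""
    status, value = payload
    if status == "error":
        raise SubprocException(value)
    return value


def failing_job():
    raise Exception("an error occurred")


# Simulate sending the payload across a process boundary.
payload = pickle.loads(pickle.dumps(run_job(failing_job)))
try:
    unwrap(payload)
except SubprocException as exc:
    message = str(exc)  # contains the worker's frames, e.g. "in failing_job"
```

Returning a tagged tuple keeps the worker from ever raising across the boundary, so the parent decides how to surface the failure.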
Nikita Shulga	e3093849e5	[Docs] Update links (#128795 ) From https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding to https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html And from https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag to https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html Fixes https://github.com/pytorch/pytorch/issues/128774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128795 Approved by: https://github.com/atalman	2024-06-17 14:55:32 +00:00
Ambareesh Shyam Sundar	0f81473d7b	Update fake tensor error checks for bool tensor subtraction (#128492 ) Fixes #127003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128492 Approved by: https://github.com/soulitzer	2024-06-17 13:41:15 +00:00
Animesh Jain	b0282071c4	[dynamo] override torch.nn.modules.activation._is_make_fx_tracing (#128748 ) Discovered while inlining `MultiHeadAttention` nn Module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128748 Approved by: https://github.com/jansel ghstack dependencies: #128315	2024-06-17 08:49:29 +00:00
Xu Han	b40a033c38	[cpp_extension][inductor] Fix sleef windows depends. (#128770 ) # Issue: While working on enabling inductor on PyTorch Windows, I found a sleef lib dependency issue. <img width="1011" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/423bd854-3c5f-468f-9a64-a392d9b514e3"> # Analysis: After we enabled SIMD on PyTorch Windows (https://github.com/pytorch/pytorch/pull/118980 ), the sleef functions are called from the VEC headers, which makes sleef a dependency. Windows and Linux differ here. ## Linux: Linux exports functions by default, so libtorch_cpu.so statically links to sleef.a and then also exports sleef's functions. <img width="647" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/00ac536c-33fc-4943-a435-25590508840d"> ## Windows: Windows does not export functions by default and has many limitations on exporting functions; reference: https://github.com/pytorch/pytorch/issues/80604 We can't package sleef functions via torch_cpu.dll as on Linux. # Solution: Actually, we already package the sleef static lib as part of the release. We just need to help users link to sleef.lib. 1. Add sleef to cpp_builder for inductor. 2. Add sleef to cpp_extension for C++ extensions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128770 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-17 05:44:34 +00:00
Wang, Eikan	a52c8ace98	[3/N] Non-Tensor: Support string parameter for aten operations (#125831 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831 Approved by: https://github.com/jansel, https://github.com/jgong5	2024-06-17 05:11:29 +00:00
cyy	74e11a4210	Enable clang-tidy on torch/csrc/mps (#128782 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128782 Approved by: https://github.com/Skylion007	2024-06-17 02:19:48 +00:00
cyy	f9dae86222	Concat namespaces in torch/csrc/utils/* (#128787 ) Concat namespaces in torch/csrc/utils/* Pull Request resolved: https://github.com/pytorch/pytorch/pull/128787 Approved by: https://github.com/Skylion007	2024-06-16 23:51:14 +00:00
Mark Saroufim	6cbdbb6c3c	Remove top lev numpy dependency from fuzzer.py (#128759 ) Test CI This fixes issues like this where I don't even intend to use the fuzzer. this way if someone is calling functions from the fuzzer numpy will be imported otherwise the import should not happen at the top of the file ``` >>> import torchao Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/__init__.py", line 26, in <module> from torchao.quantization import ( File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/__init__.py", line 7, in <module> from .smoothquant import * # noqa: F403 File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/smoothquant.py", line 18, in <module> import torchao.quantization.quant_api as quant_api File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 23, in <module> from torchao.utils import ( File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/utils.py", line 2, in <module> import torch.utils.benchmark as benchmark File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/__init__.py", line 4, in <module> from torch.utils.benchmark.utils.fuzzer import * # noqa: F403 File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/utils/fuzzer.py", line 5, in <module> import numpy as np ModuleNotFoundError: No module named 'numpy' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128759 Approved by: https://github.com/Skylion007	2024-06-16 16:34:12 +00:00
leslie-fang-intel	f8d60e0e0a	[Inductor][CPP] Fix Half data type cse cache issue for CPP Backend (#128498 ) Summary Fixing issue: https://github.com/pytorch/pytorch/issues/128263. After https://github.com/pytorch/pytorch/issues/115260, we cached the higher precision cse variable to avoid duplicate casting between buffers. However, it failed to check the original data type. This means if we convert `int32` to `bf16` for `store` and then convert `bf16` back to `fp32` for `load`, it would incorrectly hit the cache and reuse the `int32` cse var. This PR fixes the issue. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_128263 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128498 Approved by: https://github.com/jgong5, https://github.com/zhuhaozhe, https://github.com/jerryzh168	2024-06-16 11:27:13 +00:00
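The cache bug described in #128498 can be illustrated with a small sketch: a cache that remembers which higher-precision variable a downcast came from must also remember that variable's dtype, and only answer an upcast request when the dtypes match. The class and method names below are hypothetical and are not Inductor's actual data structures.

```python
from typing import Dict, Optional, Tuple


class CastCache:
    """Hypothetical cache mapping a low-precision variable to its source variable."""

    def __init__(self) -> None:
        # low-precision var name -> (source var name, source dtype)
        self._origin: Dict[str, Tuple[str, str]] = {}

    def record_downcast(self, low_var: str, src_var: str, src_dtype: str) -> None:
        self._origin[low_var] = (src_var, src_dtype)

    def lookup_upcast(self, low_var: str, dst_dtype: str) -> Optional[str]:
        """Reuse the source variable only if it already has the requested dtype."""
        hit = self._origin.get(low_var)
        if hit is None:
            return None
        src_var, src_dtype = hit
        # Without this dtype check, an int32 source would be reused for an fp32 load.
        return src_var if src_dtype == dst_dtype else None


cache = CastCache()
cache.record_downcast("tmp1_bf16", "tmp0", "int32")    # int32 -> bf16 for a store
cache.record_downcast("tmp3_bf16", "tmp2", "float32")  # fp32 -> bf16 for a store

miss = cache.lookup_upcast("tmp1_bf16", "float32")  # source was int32: no reuse
hit = cache.lookup_upcast("tmp3_bf16", "float32")   # source was fp32: safe to reuse
```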
Will Feng	979edbbe12	[Traceable FSDP2] Dynamo support FSDP2 use_training_state context manager (#127854 ) Improve Dynamo to support the FSDP2 `use_training_state()` context manager. Test command: ` pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_dynamo_trace_use_training_state ` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127854 Approved by: https://github.com/yanboliang	2024-06-16 08:48:52 +00:00
Animesh Jain	e4d8aa4d24	[torchbench] Enable some models with inline_inbuilt_nn_modules (#128315 ) For all models, graph breaks/recompiles reduce. For drq, it increases and this is a legit one. Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128315 Approved by: https://github.com/jansel	2024-06-16 08:37:23 +00:00
xinan.lin	cc518ebd38	[Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 2) (#124147 ) Reuse Inductor test case for Intel GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124147 Approved by: https://github.com/EikanWang, https://github.com/jansel	2024-06-16 08:07:05 +00:00
Blaine Burton Rister	f1ee3589a1	[Inductor] Emit strided block pointer from ModularIndexing and FloorDiv (#127342 ) Summary Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets. Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be: - Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CeilDiv(x, y)` rounds up to 1. - Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size. Note that a slice of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`. 
Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077 Example kernel: ``` triton.jit def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr): xnumel = 4096 xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:] xmask = xindex < xnumel tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK]) tmp1 = tmp0 + tmp0 tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32)) ''', device_str='cuda') ``` Test Plan This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories: - Can we generate strided block pointers for the appropriate shapes? - Powers of 2 - Non-power of 2, but multiple of the maximum block size - Arbitrary leading dimensions, with power of 2 inner dimensions - Weird strides and offsets - Reductions - Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo) - Broadcasts (some variables are missing from the indexing expression) - Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers? - Unsupported static shapes - Unsupported symbolic shapes - Mixing and matching these cases: - Pointwise and reduction in the same kernel - Sanity check the test harness - Do we raise an exception if the expected number of block pointers and the actual number are different? Follow-ups There are a few important cases which this PR can't handle. 
I'm hoping these can be deferred to follow-up PRs: - Handle non-divisible shapes - Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted. - Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton. - Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead? Differential Revision: D56739375 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342 Approved by: https://github.com/jansel, https://github.com/shunting314	2024-06-16 07:35:57 +00:00
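The power-of-2 argument in #127342 can be checked with a short sketch. It mirrors the `block_shape` expressions in the example kernel, where each dimension's block extent is `min(dim_size, ceil_div(XBLOCK, elements_per_index))`. The helper names are hypothetical, and the check only covers the `[32, 16, 8]` shape from that example.

```python
import math
from typing import List


def ceil_div(x: int, y: int) -> int:
    """Integer division rounding up."""
    return -(-x // y)


def block_shape(xblock: int, shape: List[int]) -> List[int]:
    """Map a 1D block size onto an N-D block shape, innermost dimension last."""
    result = []
    for dim, size in enumerate(shape):
        # Number of elements spanned by one step along this dimension.
        inner = math.prod(shape[dim + 1:])
        # If inner > xblock, ceil_div rounds up to 1, as the commit message notes.
        result.append(min(size, ceil_div(xblock, inner)))
    return result


shape = [32, 16, 8]  # 4096 elements, every dimension a power of 2
shapes_by_xblock = {
    xblock: block_shape(xblock, shape)
    for xblock in (1 << exp for exp in range(13))  # 1, 2, 4, ..., 4096
}
```

For every power-of-2 `XBLOCK` up to the total element count, the product of the N-D block shape equals `XBLOCK`, so the same number of elements is loaded as in the 1D case.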
Michael Lazos	a61939467a	Enable passing dynamo-traced complex test (#128771 ) Fixes https://github.com/pytorch/pytorch/issues/118159 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128771 Approved by: https://github.com/anijain2305	2024-06-16 07:28:09 +00:00
BowenBao	ab13980424	[ONNX] Update 'person_of_interest.rst', 'CODEOWNERS' and 'merge_rules.yaml' (#126364 ) The following are all constrained under the ONNX exporter project scope. - `personal_of_interest.rst` - Moving folks no longer working on the project to emeritus. - Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre, who have all made countless contributions to this project. - `CODEOWNERS` - Removing folks no longer working on the project. - Updating new owners who will now be notified with PRs related to the specific file paths. - `merge_rules.yaml` - Removing folks no longer working on the project. 🫡 Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364 Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD	2024-06-16 04:52:16 +00:00
Oguz Ulgen	6079c50910	Make config.fx_graph_remote_cache be three-value switch (#128628 ) Summary: We want to allow for three configurations False: Force off True: Force on None: OFF for OSS and JK config for internal Test Plan: CI Differential Revision: D58535897 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128628 Approved by: https://github.com/masnesral, https://github.com/eellison	2024-06-15 17:52:09 +00:00
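The three-value switch from #128628 reduces to a small resolution function. The sketch below is an assumption about the shape of that logic, not the actual config code: `should_use_remote_cache`, `is_internal_build`, and `jk_enabled` are hypothetical names for the OSS/internal distinction and the JK (internal feature flag) lookup.

```python
from typing import Optional


def should_use_remote_cache(
    setting: Optional[bool],
    is_internal_build: bool,
    jk_enabled: bool,
) -> bool:
    """Resolve a tri-state config value to an on/off decision."""
    if setting is not None:
        # True forces the cache on, False forces it off.
        return setting
    # None: always off for OSS, defer to the JK config for internal builds.
    return is_internal_build and jk_enabled


forced_on = should_use_remote_cache(True, is_internal_build=False, jk_enabled=False)
forced_off = should_use_remote_cache(False, is_internal_build=True, jk_enabled=True)
oss_default = should_use_remote_cache(None, is_internal_build=False, jk_enabled=True)
internal_default = should_use_remote_cache(None, is_internal_build=True, jk_enabled=True)
```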
Sam Larsen	94c0dcbe1d	[inductor] Parallel compile: handle crashes in subprocesses (#128757 ) Summary: If any subprocess in the pool crashes, we get a BrokenProcessPool exception and the whole pool becomes unusable. Handle crashes by recreating the pool. Test Plan: * New unit test * Started a long-running test (`test/inductor/test_torchinductor.py`), periodically killed subprocess manually, made sure the test run recovers and makes progress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128757 Approved by: https://github.com/jansel	2024-06-15 17:35:04 +00:00
David Berard	f0d68120f4	[subclasses] Handle dynamo inputs that are subclass views with (-1) in the view (#128662 ) When handling an input to dynamo that's a view of a subclass, dynamo does some handling to reconstruct the view. Part of this is to construct symints for the input parameters to the view. Previously, the code would just call `create_symbol()` which by default specifies a _positive_ symint (>= 0); this fails in the case where you have an aten::view that was called with a -1. Fix: just specify `positive=None` when calling `create_symbol()`, to avoid restricting the symint to >= 0 or <= 0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128662 Approved by: https://github.com/jbschlosser	2024-06-15 14:58:18 +00:00
Wang, Eikan	18634048a1	Separate AOTI Eager utils as a single file (#125819 ) The key change is code movement. We just moved aoti eager related code from `torch._inductor.utils` to `torch._inductor.aoti_eager` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819 Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire ghstack dependencies: #125308	2024-06-15 13:42:49 +00:00
Yifu Wang	7a39755da2	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them. ### SymmetricMemory `SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers). ### Python API Example ```python from torch._C.distributed_c10d import _SymmetricMemory # Set a store for rendezvousing symmetric allocations on a group of devices # identified by group_name. The concept of groups is logical; users can # utilize predefined groups (e.g., a group of device identified by a # ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator # backends might employ a more efficient communication channel for the actual # rendezvous process and only use the store for bootstrapping purposes. _SymmetricMemory.set_group_info(group_name, rank, world_size, store) # Identical to empty_strided, but allows symmetric memory access to be # established for the allocated tensor via _SymmetricMemory.rendezvous(). # This function itself is not a collective operation. t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name) # Users can write Python custom ops that leverages the symmetric memory access. # Below are examples of things users can do (assuming the group's world_size is 2). 
# Establishes symmetric memory access on tensors allocated via # _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process, # and the mapping between a local memory region and the associated SymmetricMemory # object is unique. Subsequent calls to rendezvous() with the same tensor will receive # the cached SymmetricMemory object. # # The function has a collective semantic and must be invoked simultaneously # from all rendezvous participants. symm_mem = _SymmetricMemory.rendezvous(t) # This represents the allocation on rank 0 and is accessible from all devices. buf = symm_mem.get_buffer(0, (64, 64), torch.float32) if symm_mem.rank == 0: symm_mem.wait_signal(src_rank=1) assert buf.eq(42).all() else: # The remote buffer can be used as a regular tensor buf.fill_(42) symm_mem.put_signal(dst_rank=0) symm_mem.barrier() if symm_mem.rank == 0: symm_mem.barrier() assert buf.eq(43).all() else: new_val = torch.empty_like(buf) new_val.fill_(43) # Contiguous copies to/from a remote buffer utilize copy engines # which bypasses SMs (i.e. no need to load the data into registers) buf.copy_(new_val) symm_mem.barrier() ``` ### Custom CUDA Comm Kernels Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels. ```cpp TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory( const at::Tensor& tensor); class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target { public: ... virtual std::vector<void> get_buffer_ptrs() = 0; virtual std::vector<void> get_signal_pad_ptrs() = 0; virtual void get_buffer_ptrs_dev() = 0; virtual void get_signal_pad_ptrs_dev() = 0; virtual size_t get_buffer_size() = 0; virtual size_t get_signal_pad_size() = 0; virtual int get_rank() = 0; virtual int get_world_size() = 0; ... }; ``` ### Limitations of IntraNodeComm and ProcessGroupCudaP2p Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. 
This approach: - Leads to awkward UX in which the required workspace needs to be specified upfront. - Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather). - Prevents torch.compile from eliminating all copies. In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels. * __->__ #128582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-15 10:20:21 +00:00
Wang, Eikan	60bbdc0b40	Modularize aten parameter parser and checker (#125308 ) In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`. ```C++ using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>; ``` With this PR, we can add support for other parameter types in a more modular way, such as `string`, `int`, `double`, and the other types summarized in the following list. The list is collected from all aten operations and ordered by how often each type is used. - `Tensor` - `bool` - `int64_t` - `TensorList` - `Scalar` - `c10::SymIntArrayRef` - `::std::optional<Tensor>` - `IntArrayRef` - `double` - `c10::SymInt` - `::std::optional<ScalarType>` - `::std::optional<double>` - `::std::optional<bool>` - `::std::optional<Layout>` - `::std::optional<Device>` - `::std::optional<int64_t>` - `Dimname` - `::std::optional<Generator>` - `c10::string_view` - `::std::optional<c10::string_view>` - `OptionalIntArrayRef` - `::std::optional<Scalar>` - `OptionalSymIntArrayRef` - `::std::optional<MemoryFormat>` - `::std::optional<c10::SymInt>` - `ScalarType` - `ArrayRef<Scalar>` - `DimnameList` - `::std::optional<ArrayRef<double>>` - `::std::array<bool,3>` - `::std::optional<DimnameList>` - `c10::List<::std::optional<Tensor>>` - `::std::array<bool,2>` - `Storage` - `::std::array<bool,4>` - `Device` - `DeviceIndex` - `ITensorListRef` - `Stream` - `Layout` - `MemoryFormat` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-15 09:18:44 +00:00
Michael Lazos	de4f379cf2	run mkldnn test with inlining (#128749 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128749 Approved by: https://github.com/anijain2305	2024-06-15 09:04:08 +00:00
Tristan Rice	b50c0e94c2	TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY (#128739 ) This adjusts the settings of the libuv backend to match the older TCPStore. * DEFAULT_BACKLOG: setting this to -1 will enable using the host somaxconn value instead of a hardcoded 16k value. When going over this limit with `tcp_abort_on_overflow` set, connections are reset. * TCP_NODELAY: Since TCPStore primarily sends small messages, there's no benefit to using Nagle's algorithm, and it may add additional latency for store operations. Test plan:
```
python test/distributed/test_store.py -v -k LibUv
```
Benchmark script:
```
import time
import os
import torch.distributed as dist

rank = int(os.environ["RANK"])
store = dist.TCPStore(
    host_name="<server>",
    port=29500,
    world_size=2,
    is_master=(rank == 0),
    use_libuv=True,
)

if rank == 1:
    total_iters = 0
    total_dur = 0
    for iter in range(10):
        iters = 500000
        start = time.perf_counter()
        for i in range(iters):
            store.set(f"key_{i}", f"value_{i}")
        dur = time.perf_counter() - start
        print(f"{iter}. {iters} set, qps = {iters/dur}")
        total_iters += iters
        total_dur += dur
    print(f"overall qps = {total_iters/total_dur}")
else:
    print("sleeping")
    time.sleep(1000000000)
```
On a single host, the performance difference with and without TCP_NODELAY appears negligible. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739 Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o	2024-06-15 07:40:18 +00:00
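The two socket settings named in the TCPStoreLibUvBackend commit can be sketched with Python's standard `socket` module. This is only an illustration of the options themselves, not the libuv or TCPStore code: `host_somaxconn` and `make_listener` are hypothetical helpers, and reading `/proc/sys/net/core/somaxconn` is an assumption that holds on Linux only (elsewhere the sketch falls back to `socket.SOMAXCONN`).

```python
import socket
from pathlib import Path


def host_somaxconn(default: int = socket.SOMAXCONN) -> int:
    """Hypothetical helper: read the host accept-queue limit on Linux."""
    try:
        return int(Path("/proc/sys/net/core/somaxconn").read_text())
    except (OSError, ValueError):
        # Not Linux, or the file is unreadable: use the compile-time constant.
        return default


def make_listener(backlog: int = -1) -> socket.socket:
    """Hypothetical helper: a loopback listener with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY turns off Nagle's algorithm so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    # A negative backlog means "use the host limit" in this sketch.
    sock.listen(backlog if backlog >= 0 else host_somaxconn())
    return sock


listener = make_listener()
nodelay = listener.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
listener.close()
```

A non-zero `nodelay` value confirms the option is set on the socket. Whether it changes throughput depends on the workload, in line with the single-host observation in the commit message.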
cyy	e4c32d14a8	[3/N] Remove inclusion of c10/util/string_utils.h (#128504 ) Follows #128372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128504 Approved by: https://github.com/malfet	2024-06-15 06:38:40 +00:00
Oguz Ulgen	472211c97a	Make assert_size_stride to return all errors (#128764 ) This will help debug some problems I'm encountering, but in general, it is best to show the entire error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128764 Approved by: https://github.com/jansel	2024-06-15 06:32:40 +00:00
Sahdev Zala	4ccbf711e2	Learning Rate Scheduler docstring fix (#128679 ) Fix docstrings in Learning Rate Scheduler. The fix can be verified by running pydocstyle path-to-file --count Related #112593 BEFORE the PR: pydocstyle torch/optim/lr_scheduler.py --count  92  AFTER the PR: pydocstyle torch/optim/lr_scheduler.py --count  0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679 Approved by: https://github.com/janeyx99	2024-06-15 05:30:35 +00:00
Animesh Jain	108adbc726	[dynamo][side effects] Raise assertion error if the object is already tracked for mutation (#128590 ) This issue was pointed out by @tombousso here - https://github.com/pytorch/pytorch/pull/128269#issuecomment-2163755792 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128590 Approved by: https://github.com/mlazos ghstack dependencies: #128715, #128269	2024-06-15 05:07:49 +00:00
Xu Han	9ebf77b13b	Fix Windows inductor definition issue (#128686 ) Changes: 1. Add memory alignment macro support on Windows. 2. Fix `#pragma unroll` not being supported by the MSVC `cl` compiler. `#pragma unroll` causes an error with the MSVC `cl` compiler, but it is supported by `clang` on Windows. We should disable it only for the `__msvc_cl__` compiler, so that builds using `clang` keep the better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128686 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-15 03:02:00 +00:00
Animesh Jain	7e092a62e6	[dynamo] Support weakref objects (#128533 ) Fixes https://github.com/pytorch/pytorch/issues/125720 I was earlier worried that DELETE_* or STORE_* on referent values should result in a graph break, because they could invalidate the weak ref. But then @zou3519 pointed out that weakref invalidation will happen EVENTUALLY; CPython provides no guarantee about when the weakref will be invalidated (even when the user calls del x and x is the last reference). So any code that relies on del x to invalidate the weakref of x right away is BAD code. Therefore we can (ab)use this nuance and simply ignore DELETE_* or STORE_* on the referent objects. The only corner case is when Dynamo is reconstructing the weakref object. Dynamo will have a hard time being correct here, so just SKIP_FRAME in such a case. This is rare. CPython notes: 1) https://docs.python.org/3/library/weakref.html 2) https://docs.python.org/3/reference/datamodel.html#index-2 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128533 Approved by: https://github.com/jansel	2024-06-15 02:16:25 +00:00
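The weakref argument in the commit above can be illustrated with a small standard-library sketch. It is not Dynamo code; `Referent` and `drop_and_check` are hypothetical names. The point is that robust code checks whether `ref()` returns `None` rather than assuming a particular moment of invalidation.

```python
import gc
import weakref


class Referent:
    """Plain class; instances of user-defined classes support weak references."""


def drop_and_check():
    obj = Referent()
    ref = weakref.ref(obj)
    alive_before = ref() is obj  # a strong reference still exists here
    del obj                      # drop the last strong reference
    gc.collect()                 # explicitly request collection before checking
    return ref, alive_before


ref, alive_before = drop_and_check()
# Always test the dereferenced value; never assume the referent is still alive.
dead_after = ref() is None
```

The explicit `gc.collect()` call is what makes the final check dependable in this sketch. Code that reads `ref()` immediately after `del` and expects `None` is relying on the behaviour the commit message warns against.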
Animesh Jain	62a0e39ced	[dynamo][inlining-nn-modules] Update tests with new expected counts (#128463 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128463 Approved by: https://github.com/yanboliang	2024-06-15 02:08:02 +00:00
vasiliy	2d01f87737	Enable torch.empty for float8 dtypes + deterministic mode + cpu (#128744 ) Summary: Enables creating empty float8 tensors for: * cuda when `torch.use_deterministic_algorithms` is set to True * cpu for all settings of `torch.use_deterministic_algorithms` Context for NaN values of float8_e4m3fn and float8_e5m2: https://arxiv.org/pdf/2209.05433, Section 3, Table 1 Context for NaN values of float8_e4m3fnuz and float8_e5m2fnuz: https://arxiv.org/pdf/2206.02915, Section 3.2, "instead of reserving one exponent field to represent Inf and NaN, we reserve only a single codeword (corresponding to negative zero)" Test Plan: ``` python test/test_quantization.py -k test_empty ``` Reviewers: Subscribers: Tasks: Tags: Fixes https://github.com/pytorch/pytorch/issues/128733 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128744 Approved by: https://github.com/malfet, https://github.com/drisspg	2024-06-15 02:05:30 +00:00
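The NaN encoding that the float8 commit above points to for `float8_e4m3fn` can be sketched in plain Python, following the layout in the cited paper (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7, no infinities). `decode_e4m3fn` is a hypothetical helper for illustration and is not part of PyTorch.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8_e4m3fn bit pattern given as an int in [0, 255]."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:
        # The all-ones exponent and mantissa is the only NaN pattern (per sign).
        return math.nan
    if exp == 0:
        # Subnormal numbers: no implicit leading one.
        return sign * (mant / 8) * 2.0 ** -6
    # Normal numbers, including exponent 0xF with mantissa below 0x7.
    return sign * (1 + mant / 8) * 2.0 ** (exp - 7)


largest_finite = decode_e4m3fn(0x7E)
one = decode_e4m3fn(0x38)
nan_value = decode_e4m3fn(0x7F)
```

Because only a single mantissa pattern is reserved at the top exponent, the format keeps finite values up to 448 instead of spending the whole exponent field on Inf and NaN.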
PyTorch MergeBot	846bb30e13	Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )" This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5. Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build `bd72e28314`. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))	2024-06-15 01:58:20 +00:00
PyTorch MergeBot	5efe71f134	Revert "[export] Add print_readable to unflattener (#128617 )" This reverts commit 5d9a609b4f6c94fb930188e4d7c99f53d989c022. Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/huydhn due to Sorry for reverting your change but another failed test shows up in trunk inductor/test_flex_attention.py where it needs to be updated `5d9a609b4f`. I guess it is easier to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2169030779))	2024-06-15 01:46:23 +00:00
Huy Do	f37121bb74	Add model name, quantization and device to gpt_fast micro benchmark output (#128091 ) A small enhancement to https://hud.pytorch.org/benchmark/llms with these columns in the output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128091 Approved by: https://github.com/yanboliang	2024-06-15 01:39:48 +00:00
Fuzzkatt	3f47c72268	add multiprocessing checks in test_dataloader.py (#128244 ) Add multiprocessing checks in test_dataloader.py for tests requiring multiprocessing similar to test_multiprocessing.py: https://github.com/pytorch/pytorch/blob/main/test/test_multiprocessing.py#L41-L52. Change all Jetson skips to TEST_CUDA_IPC checks since that is the root cause of the failures on Jetson in the first place. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128244 Approved by: https://github.com/eqy, https://github.com/malfet	2024-06-15 01:32:55 +00:00
Yueming Hao	73ba432d32	[custom_op]Fix None return schema (#128667 ) Fixes #125044 If users define a schema that returns `None`, it will be parsed to a `torch.NoneType`. Auto functionalization supports `()` as an empty return but not `None`. So a `None` return fails the check for [`can_auto_functionalize`](https://github.com/pytorch/pytorch/blob/findhao/fix_none_return_functionalize/torch/_higher_order_ops/auto_functionalize.py#L71) even though we could treat it as a `()` return. This PR fixes that by skipping the check for a `None` return. I hoped it could be fixed at a [deeper level](`31e44c72ca`), but that fix breaks a lot of existing schemas. So it's better to fix this issue in auto_functionalize.py for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128667 Approved by: https://github.com/zou3519	2024-06-15 00:41:37 +00:00
leslie-fang-intel	6616ad030f	[Inductor] Fix the High Order Op layout issue (#128275 ) Fix the issue: https://github.com/pytorch/pytorch/issues/127995 - In the current implementation of creating `FallbackKernel`, the `device` of the `NoneLayout` is set to `None` when the `example_output` returned from `cls.process_kernel` is `None`. `921aa194c7/torch/_inductor/ir.py (L5632-L5649)` - If an `ExternalKernel schedulerNode` has a None device, the previous buffer will not be flushed before codegen of this `ExternalKernel schedulerNode`, which causes wrong code to be generated. `ef2b5ed500/torch/_inductor/scheduler.py (L2701-L2709)` Test Plan ``` python -u -m pytest -s -v test/higher_order_ops/test_with_effects.py -k test_compile_inductor_external_op_return_none ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128275 Approved by: https://github.com/eellison	2024-06-15 00:33:21 +00:00
angelayi	5d9a609b4f	[export] Add print_readable to unflattener (#128617 ) Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](`17b45e905a/torch/fx/graph_module.py (L824)`)), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module. Example print from `python test/export/test_unflatten.py -k test_unflatten_nested` ``` class UnflattenedModule(torch.nn.Module): def forward(self, x: "f32[2, 3]"): # No stacktrace found for following nodes rootparam: "f32[2, 3]" = self.rootparam # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None # No stacktrace found for following nodes foo: "f32[2, 3]" = self.foo(mul); mul = None bar: "f32[2, 3]" = self.bar(foo); foo = None return (bar,) class foo(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # No stacktrace found for following nodes child1param: "f32[2, 3]" = self.child1param nested: "f32[2, 3]" = self.nested(mul); mul = None # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None return add class nested(torch.nn.Module): def forward(self, mul: "f32[2, 3]"): # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None return div class bar(torch.nn.Module): def forward(self, add: "f32[2, 3]"): # No stacktrace found for following nodes child2buffer: "f32[2, 3]" = self.child2buffer # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None return sub ``` Pull Request resolved: 
https://github.com/pytorch/pytorch/pull/128617 Approved by: https://github.com/zhxchen17, https://github.com/pianpwk	2024-06-15 00:26:04 +00:00
Sanket Jayant Purandare	d67923b955	Adding kwargs to composable AC API to enable full capabilities (#128516 ) Summary: Firstly, this does not change any existing behaviour, since all the default values for kwargs were hardcoded into the ``_checkpoint_without_reentrant_generator`` call. Secondly, this is needed for unlocking the full potential of composable checkpointing making it equivalent to ``torch.utils.checkpoint.checkpoint(use_reentrant=False)``. Finally, an added benefit is now composable checkpointing can be used under ``FakeTensorMode`` by passing ``preserve_rng_state=False``. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128516 Approved by: https://github.com/awgu	2024-06-15 00:23:48 +00:00
Brian Hirsh	271852aa7e	inductor: pre-grad bmm pass shouldn't match if output is mutated (#128570 ) This PR is enough to get this test to pass when using `TORCHDYNAMO_INLINE_INBUILT_NN_MODULES`: ``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1 python test/inductor/test_group_batch_fusion.py -k TestPostGradBatchLinearFusion.test_batch_linear_post_grad_fusion ``` inductor has a pre-grad pass to swap out multiple `linear` layers with `addbmm`, but it also needs to insert an `unbind()` at the end. If that unbind is then followed by a mutation (like `add_()`), the autograd engine will complain (autograd does not let you mutate the output of multiple-out-view ops like unbind). I made a tweak to the pattern matching logic to avoid matching if the output of the linear is used in an op that mutates its input. My hope is that: (1) this situation is rare enough that it won't materially impact pattern matching in real world code (2) I had to use a heuristic for "is an op a mutable op", since the graph we get is from dynamo, so it can contain code like `operator.iadd` in it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128570 Approved by: https://github.com/eellison, https://github.com/mlazos ghstack dependencies: #127927	2024-06-15 00:08:44 +00:00
Brian Hirsh	ba19ed9a1a	FunctionalTensor: dispatch metadata directly to inner tensor (#127927 ) Fixes https://github.com/pytorch/pytorch/issues/127374 The error in the linked repro is: ``` AssertionError: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.sym_storage_offset.default(_to_functional_tensor(FakeTensor(..., device='cuda:0', size=(16, 4), dtype=torch.uint8), device='cuda:0')) ``` Where we hit FakeTensor.__torch_dispatch__, but our input is a C++ `FunctionalTensorWrapper`. What should actually have happened is that the call to `aten.sym_storage_offset` hits the `Functionalize` dispatch key, which should remove the `FunctionalTensorWrapper` and redispatch. I spent some time debugging and haven't actually figured out why this isn't happening. Instead, this PR just skips that step completely, and asks `FunctionalTensor` to directly unwrap the C++ `FunctionalTensorWrapper` when querying tensor metadata. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127927 Approved by: https://github.com/tugsbayasgalan	2024-06-15 00:08:44 +00:00
dilililiwhy	574a2cbcb7	Enable UFMT on common_device_type.py and common_dtype.py (#128490 ) Part of: https://github.com/pytorch/pytorch/issues/123062 Ran lintrunner on: > torch/testing/_internal/common_device_type.py > torch/testing/_internal/common_dtype.py Detail: ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128490 Approved by: https://github.com/ezyang, https://github.com/XuehaiPan	2024-06-15 00:07:42 +00:00
PaliC	0492ec460a	[BE] Remove external testing of torch::deploy (#127952 ) As we don't expect external users of torch::deploy as the library is no longer supported, we will remove external testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127952 Approved by: https://github.com/malfet	2024-06-14 23:32:02 +00:00
cyy	bd72e28314	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang	2024-06-14 23:21:01 +00:00
Tristan Rice	52d4442a00	[c10d] Socket, TCPStore: add better logging (#128673 ) This adds better logging of errors to the socket and TCPStore classes. All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged. It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky. Test plan: ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673 Approved by: https://github.com/c-p-i-o	2024-06-14 23:08:29 +00:00
Yang Chen	4abecd7102	[AOTI] fixed performance issue for AOTI_TORCH_CHECK (#128402 ) We introduced AOTI_TORCH_CHECK in #119220 to resolve slow-compilation time issues. Unfortunately, it caused perf regressions for CPU, as described in issue #126665. After some investigation, it turned out the slow compilation was caused by the use of the builtin function __builtin_expect provided by gcc/clang. Moreover, nuking __builtin_expect doesn't seem to cause any performance penalty, even though its purpose is to improve performance by providing the compiler with branch prediction information. abs latency numbers using the script shared by #126665:

| | before the fix | after the fix |
|---|---|---|
| T5Small | 1019.055694 | 917.875027 |
| T5ForConditionalGeneration | 1009.825196 | 916.369239 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128402 Approved by: https://github.com/desertfire	2024-06-14 23:03:17 +00:00
Huy Do	fd27138c4a	Update DALLE2_pytorch expected accuracy result on CPU (#128718 ) I suspect that the issue shows up because of the new version of https://pypi.org/project/pyarrow/16.1.0/#history released yesterday. The package is a dependency of DALLE2_pytorch https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/DALLE2_pytorch/install.py#L22. I'll just update the expected accuracy result on CPU benchmark because the model fails to run there anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128718 Approved by: https://github.com/malfet	2024-06-14 22:54:21 +00:00
Catherine Lee	d3a4d9e4fe	Update cu124 dynamo benchmark expected values (#128737 ) Missed one in https://github.com/pytorch/pytorch/pull/128589 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128737 Approved by: https://github.com/Skylion007	2024-06-14 22:23:00 +00:00
titaiwangms	bca2cf00ed	[ONNX] Add dynamic axes support to torchscript exporter with dynamo=True (#128371 ) This PR enables specific axes to be dynamic by calling torch.export.export and torch.export.Dim. Features: (1) Turn dynamic_axes into dynamic_shapes (2) Dim constraints remain the same (see the test case that hits constraints). This might give a different user experience, since we didn't have any constraints in torchscript-onnx exporting. (3) If input_names is used in dynamic_axes, ValueError will be raised, as input_names is currently not supported. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128371 Approved by: https://github.com/justinchuby	2024-06-14 21:56:51 +00:00
Isuru Fernando	f103247a14	Run all samples for torchinductor tests (#128343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343 Approved by: https://github.com/lezcano	2024-06-14 21:52:12 +00:00
angelayi	e9c6e8369c	Torchbind call method + effects support (#128397 ) Adds effect token support to torchbind method calls by allowing `with_effects` to take in `torch.ops._higher_order_ops.call_torchbind` as an input. Here is the print from `TORCH_LOGS="aot" python test/export/test_torchbind.py -k test_compile_obj_torchbind_op`: ```python def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2]", arg2_1): # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1266 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos()) cos: "f32[2]" = torch.ops.aten.cos.default(arg1_1) with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, cos); arg0_1 = cos = None getitem: "f32[0]" = with_effects[0]; with_effects = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1267 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos() + 1) cos_1: "f32[2]" = torch.ops.aten.cos.default(arg1_1) add: "f32[2]" = torch.ops.aten.add.Tensor(cos_1, 1); cos_1 = None with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, add); getitem = add = None getitem_2: "f32[0]" = with_effects_1[0]; with_effects_1 = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1268 in f, code: torch.ops._TorchScriptTesting.queue_pop(tq) with_effects_2 = torch._higher_order_ops.effects.with_effects(getitem_2, torch.ops._TorchScriptTesting.queue_pop.default, arg2_1); getitem_2 = None getitem_4: "f32[0]" = with_effects_2[0]; with_effects_2 = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1269 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.sin()) sin: "f32[2]" = torch.ops.aten.sin.default(arg1_1); arg1_1 = None with_effects_3 = torch._higher_order_ops.effects.with_effects(getitem_4, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, sin); getitem_4 = sin = None 
getitem_6: "f32[0]" = with_effects_3[0]; with_effects_3 = None # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1270 in f, code: return tq.pop(), tq.pop() + tq.size(), tq with_effects_4 = torch._higher_order_ops.effects.with_effects(getitem_6, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop'); getitem_6 = None getitem_8: "f32[0]" = with_effects_4[0] getitem_9: "f32[2]" = with_effects_4[1]; with_effects_4 = None with_effects_5 = torch._higher_order_ops.effects.with_effects(getitem_8, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop'); getitem_8 = None getitem_10: "f32[0]" = with_effects_5[0] getitem_11: "f32[2]" = with_effects_5[1]; with_effects_5 = None with_effects_6 = torch._higher_order_ops.effects.with_effects(getitem_10, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'size'); getitem_10 = arg2_1 = None getitem_12: "f32[0]" = with_effects_6[0]; with_effects_6 = None add_1: "f32[2]" = torch.ops.aten.add.Tensor(getitem_11, 0); getitem_11 = None return (getitem_12, getitem_9, add_1) ``` In order to support this, this PR makes the following changes: * Adds `FakeScriptObject` to `CustomObjArgument`, which will be put on the `meta["val"]` of nodes representing torchbind objects. * Adds pickle/deepcopy support to FunctionSchema. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128397 Approved by: https://github.com/ydwu4, https://github.com/zou3519	2024-06-14 21:28:17 +00:00
ibartol	65d3ddcb8b	Add GLIBC requirements for libtorch to solve #113124 (#128135 ) Fixes #113124. ## Description I modified the installing.rst file to address the system requirements and troubleshooting steps for using LibTorch with different GLIBC versions. ### Summary of Changes - Added system requirements specifying the GLIBC version needed for both the cxx11 ABI version and the pre-cxx11 ABI version of LibTorch. - Included a troubleshooting section with instructions on how to check the dependencies of the LibTorch libraries and identify the required GLIBC version using the `ldd lib/libtorch.so` command. ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128135 Approved by: https://github.com/jbschlosser	2024-06-14 21:24:53 +00:00
titaiwangms	e9a29aaa4a	[ONNX] Add upsample trilinear to skip decomp (#128259 ) (1) Add upsample trilinear vec to skip decomposition (2) Add tests to make sure that torch.export.export still decomposes them Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259 Approved by: https://github.com/justinchuby	2024-06-14 21:20:44 +00:00
rzou	e6e102cf85	Dynamo testing: add some skips (#128734 ) The following tests are failing consistently for me locally, so we're going to skip them. They're disabled in CI but it looks like they're just always failing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128734 Approved by: https://github.com/williamwen42 ghstack dependencies: #128731	2024-06-14 20:53:30 +00:00
rzou	11de50f17c	[Dynamo] skip some TorchScript tests (#128731 ) We don't care about the Dynamo x TorchScript composition, so I'm disabling these tests (so they don't get reported as flaky). Not disabling all of the TorchScript tests yet because they have been useful to catch random bugs. Test Plan: - CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731 Approved by: https://github.com/williamwen42	2024-06-14 20:53:30 +00:00
Simon Fan	4b96575a09	[dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196 ) FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user-code that use torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched. For compiled autograd, we're firing pack hooks once and unpack hooks twice right now, I'll look into this separately from this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196 Approved by: https://github.com/soulitzer	2024-06-14 20:28:08 +00:00
Animesh Jain	1aafb9eb90	[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269 ) Fixes https://github.com/pytorch/pytorch/issues/101168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269 Approved by: https://github.com/jansel ghstack dependencies: #128715	2024-06-14 20:17:03 +00:00
Animesh Jain	9c77332116	[torch.compile][ci] Flaky models in CI (similar to DISABLED_TEST) (#128715 ) These models are really flaky. I went into the CI machine and ran the model many times; sometimes it fails, sometimes it passes. Even PyTorch eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in inlining work. There is nothing wrong with inlining; I think these models are on the edge of already-broken accuracy measurement, and inlining is just pushing it in a more broken direction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715 Approved by: https://github.com/eellison	2024-06-14 20:17:03 +00:00
Sanket Jayant Purandare	2e5366fbc0	Extended Module Tracker (#128508 ) This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes. 1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``. 2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``. 3. Only registers the multi-grad hooks if we are in the forward pass. This is important because, a module's pre-fw and post-fw hooks get called in the backward during AC and we do not want to register multi-grad hooks in this case. 4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508 Approved by: https://github.com/wanchaol	2024-06-14 19:48:46 +00:00
Menglu Yu	d50712e5e3	[PT2] add inductor log for unbind_stack_pass (#128684 ) Summary: Currently, we do not log the pass. To better enable pattern hit inspection, we enable it. Test Plan: see signal Differential Revision: D58571992 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128684 Approved by: https://github.com/dshi7	2024-06-14 19:45:55 +00:00
Nikita Shulga	9035fff2de	[BE] Do not test deprecated `torch.nn.utils.weight_norm` (#128727 ) Test `torch.nn.utils.parametrizations.weight_norm` instead Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727 Approved by: https://github.com/kit1980 ghstack dependencies: #128726	2024-06-14 19:14:44 +00:00
Nikita Shulga	27458cc097	[BE] Refactor repeated code in test_weight_norm (#128726 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726 Approved by: https://github.com/kit1980	2024-06-14 19:14:44 +00:00
Colin Peppler	a6bd154a42	[inductor] Support mm decomps for matrices with unbacked sizes (#128655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128655 Approved by: https://github.com/jansel	2024-06-14 18:35:42 +00:00
Nikita Shulga	b94c52dd29	[GHF] Refuse merge to non-default branch (#128710 ) Unless PR is ghstack one Test plan: ``` % GITHUB_TOKEN=$(gh auth token) python3 -c "from trymerge import GitHubPR; pr=GitHubPR('pytorch', 'pytorch', 128591); print(pr.base_ref(), pr.default_branch())" release/2.4 main ``` Fixes: https://github.com/pytorch/test-infra/issues/5339 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128710 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-06-14 18:23:25 +00:00
Zhengxu Chen	be0eec9031	[export] Improve static typing in tracer. (#128552 ) Summary: as title. Test Plan: CI Differential Revision: D58485487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128552 Approved by: https://github.com/angelayi	2024-06-14 17:57:37 +00:00
PyTorch MergeBot	2367161e4b	Revert "[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966 )" This reverts commit c339efaf023b4af056dad4cb2f11c07930ed8af6. Reverted https://github.com/pytorch/pytorch/pull/127966 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/127966#issuecomment-2168505985))	2024-06-14 17:57:23 +00:00
Peter Bell	d7fc871175	[inductor] Improve superfluous mask handling in triton codegen (#128518 ) This takes the logic from `filter_masks` and factors it out into `_has_constant_mask`. I also improve support for `persistent_reduction` kernels by making use of the static RBLOCK value and potentially XBLOCK too in the `no_x_dim` case. I then use this helper when generating the `xmask` and `rmask`, so we can generate them as constants meaning triton can optimize them even if they are included. e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)` before:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = tl.where(rmask & xmask, tmp1, 0)
    tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0))
    tl.store(out_ptr0 + (x0), tmp4, xmask)
```
after:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = tl.full([RBLOCK], True, tl.int1)
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = tl.full([RBLOCK], True, tl.int1)
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0))
    tl.store(out_ptr0 + (x0), tmp3, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518 Approved by: https://github.com/lezcano	2024-06-14 17:52:55 +00:00
Menglu Yu	2357490524	[PT2] Enable shape_padding multiplier adjustment (#128346 ) Summary: Our experiments demonstrate that the current default value of 1.1 may not be the best multiplier, so we make the value adjustable to further improve the QPS. context: https://docs.google.com/document/d/10VjpOJkTv5A4sNX7dD6qT7PyhBxn6LSeLAuaqYtoOto/edit Test Plan: # IG_CTR {F1682138315} Differential Revision: D58373261 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128346 Approved by: https://github.com/jackiexu1992	2024-06-14 17:49:24 +00:00
cyy	d4807da802	Various fixes of torch/csrc files (#127252 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127252 Approved by: https://github.com/r-barnes	2024-06-14 17:31:24 +00:00
Aart Bik	089e76cca3	[traced-graph][sparse] remove redundant assert in sparse prop test (#128523 ) The assertEqualMeta() method already tests that the first argument is a FakeTensor https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523 Approved by: https://github.com/huydhn	2024-06-14 17:05:17 +00:00
Yanbo Liang	1fb4effe7a	[GPT-fast benchmark] Add MLP, gather + gemv, gemv micro benchmark (#128002 ) Output example: ``` \| name \| metric \| target \| actual \| \|------------------------------\|---------------------------\|---------\|---------\| \| layer_norm_bfloat16 \| memory_bandwidth(GB/s) \| 1017 \| 1000.01 \| \| mlp_layer_norm_gelu_bfloat16 \| flops_utilization \| 0.71 \| 0.71 \| \| gemv_int8 \| memory_bandwidth(GB/s) \| 990 \| 984.06 \| \| gemv_bfloat16 \| memory_bandwidth(GB/s) \| 1137 \| 1137.92 \| \| gather_gemv_int8 \| memory_bandwidth(GB/s) \| 1113 \| 1111.09 \| \| gather_gemv_bfloat16 \| memory_bandwidth(GB/s) \| 1249 \| 1248.15 \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128002 Approved by: https://github.com/Chillee	2024-06-14 17:03:22 +00:00
Laith Sakka	4c84af0f5d	Fix indexing and slicing of ranges in dynamo (#128567 ) Fix https://github.com/pytorch/pytorch/issues/128520 Dynamo does not handle range()[binary subscript] or range()[trinary_subscript] correctly. Right now it calls the get_item function, which applies the subscript operation to the list [start, end, step]; this is unrelated to what is expected. In Python, range()[complex subscript] is another range, e.g. range(1, 10, 2)[1:4:1] is range(3, 9, 2). This diff fixes index and slice applications on range. It mimics the implementation in https://github.com/python/cpython/blob/main/Objects/rangeobject.c Pull Request resolved: https://github.com/pytorch/pytorch/pull/128567 Approved by: https://github.com/anijain2305	2024-06-14 16:49:49 +00:00
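The range-slicing semantics described in the commit above can be illustrated in plain Python. This is a minimal sketch, not Dynamo's actual implementation: `slice_range` is a hypothetical helper name, and it relies only on the built-in `slice.indices` method to normalize the subscript against the range's length.

```python
def slice_range(r: range, s: slice) -> range:
    """Return the range equivalent to r[s] (hypothetical helper, for illustration)."""
    # slice.indices clamps start/stop and fills in defaults for a sequence
    # of the given length, following normal Python slicing rules.
    start, stop, step = s.indices(len(r))
    # Positions in the slice map linearly onto values of the original range.
    return range(
        r.start + start * r.step,
        r.start + stop * r.step,
        r.step * step,
    )


result = slice_range(range(1, 10, 2), slice(1, 4, 1))
print(result)  # range(3, 9, 2)
```

Slicing a range never yields the `[start, stop, step]` triple itself; the result is a new range whose start, stop, and step are derived from the original ones.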
PyTorch MergeBot	f75f5987aa	Revert "Extended Module Tracker (#128508 )" This reverts commit 1f46284f9ed5b60981174e689d750b358b19e4c4. Reverted https://github.com/pytorch/pytorch/pull/128508 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/9515753429/job/26230639980 ([comment](https://github.com/pytorch/pytorch/pull/128508#issuecomment-2168405784))	2024-06-14 16:46:03 +00:00
Aaron Orenstein	732b4e9074	Fix generated vararg types (#128648 ) In the generated files torchgen is incorrectly generating types on the varargs. The changes all look like this (changing `*size: _int` to `*size: Union[_int, SymInt]`): ``` --- ./torch/_VF.pyi.sav 2024-06-13 20:36:49.189664629 -0700 +++ ./torch/_VF.pyi 2024-06-13 20:36:57.208894614 -0700 @@ -168,17 +168,17 @@ @overload def _efficientzerotensor(size: Sequence[Union[_int, SymInt]], *, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ... @overload -def _efficientzerotensor(*size: _int, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ... +def _efficientzerotensor(*size: Union[_int, SymInt], dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ... def _embedding_bag(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ... def _embedding_bag_forward_only(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ... @overload ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128648 Approved by: https://github.com/jamesjwu	2024-06-14 16:04:37 +00:00
Kiuk Chung	8629939a51	[torch/c10] Add C10_UBSAN_ENABLED macro and use it to disable SymInt_… (#127967 ) Adds `C10_UBSAN_ENABLED` macro and use it to disable `SymIntTest::Overflows` (fails under `signed-integer-overflow` UBSAN check). Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED` and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`. > NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too... Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967 Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet	2024-06-14 16:01:12 +00:00
PyTorch MergeBot	ee140a198f	Revert "[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591 )" This reverts commit 03e8a4cf45ee45611de77b55b515a8936f60ce31. Reverted https://github.com/pytorch/pytorch/pull/128591 on behalf of https://github.com/atalman due to Contains release only changes should not be landed ([comment](https://github.com/pytorch/pytorch/pull/128591#issuecomment-2168308233))	2024-06-14 15:51:00 +00:00
eellison	c187593418	Prevent expansion of cat indexing to avoid int64 intermediate (#127815 ) Fix for https://github.com/pytorch/pytorch/issues/127652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815 Approved by: https://github.com/shunting314, https://github.com/peterbell10	2024-06-14 15:42:08 +00:00
Andres Lugo-Reyes	c339efaf02	[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966 ) The needle has moved quite a bit on the ROCm backend front. This PR is intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560 This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069, unskipping the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests worked immediately after unskipping. The tests previously marked with xfail have been modified to not expect a failure iff running on ROCm, as they now pass. Behavior is unchanged for them on other architectures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966 Approved by: https://github.com/pruthvistony, https://github.com/zou3519	2024-06-14 15:24:28 +00:00
Huamin Li	c76a9d13cb	Revert D56709309 (#128481 ) Summary: potential fw compatibility issue raised from D58397323 Test Plan: Sandcastle Reviewed By: houseroad Differential Revision: D58443190 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128481 Approved by: https://github.com/desertfire	2024-06-14 14:57:17 +00:00
rzou	9972e5f447	Rename impl_abstract to register_fake, part 2/2 (#123938 ) This PR renames the implementation details of register_fake to align more with the new name. It is in its own PR because this is risky (torch.package sometimes depends on private library functions and implementation details). Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938 Approved by: https://github.com/williamwen42	2024-06-14 14:37:24 +00:00
Zheng, Zhaoqiong	a2d9c430b4	Adding a note for Getting Started with PyTorch on Intel GPUs (#127872 ) Adding a note for Getting Started with PyTorch on Intel GPUs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127872 Approved by: https://github.com/svekars	2024-06-14 14:24:28 +00:00
Luca Wehrstedt	dfc4b608e1	Remove leftover warning causing log spew (#128688 ) This warning was left by mistake, and is uninformative (the user is doing nothing wrong) and causing log spew in trainings. See https://github.com/pytorch/pytorch/pull/120750#discussion_r1638430500 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128688 Approved by: https://github.com/drisspg	2024-06-14 14:08:11 +00:00
Nikita Shulga	e1dfc61250	Document CI/CD security philosophy (#128316 ) Namely: - when use of non-ephemeral runners is OK, vs when it is not - Why binary build pipelines should not use distributed caching - Why temporary CI artifacts should not be considered safe Pull Request resolved: https://github.com/pytorch/pytorch/pull/128316 Approved by: https://github.com/seemethere, https://github.com/atalman	2024-06-14 13:47:25 +00:00
cyy	bfd5ea93e0	Enable clang-tidy on c10/util/Float8.h (#120573 ) This PR clears warnings and enables clang-tidy on c10/util/Float8.h. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120573 Approved by: https://github.com/drisspg	2024-06-14 13:47:07 +00:00
Sanket Jayant Purandare	1f46284f9e	Extended Module Tracker (#128508 ) This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes. 1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``. 2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``. 3. Only registers the multi-grad hooks if we are in the forward pass. This is important because, a module's pre-fw and post-fw hooks get called in the backward during AC and we do not want to register multi-grad hooks in this case. 4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508 Approved by: https://github.com/wanchaol	2024-06-14 12:01:53 +00:00
Isuru Fernando	e397ad6883	Improve codegen for ops.masked in triton (#128054 ) Fixes https://github.com/pytorch/pytorch/issues/127930 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054 Approved by: https://github.com/peterbell10, https://github.com/lezcano	2024-06-14 11:52:56 +00:00
Colin Peppler	7e734e2d08	[inductor] Fix nested indirect indexing case for index_propagation (#128378 ) Tries to fix #127677. # Context Just as @peterbell10 pointed out, we have the following scenario: ``` a = ops.indirect_indexing(...) b = ops.index_expr(a, ...) c = ops.indirect_indexing(b, ...) ``` We can repro this as: ``` def forward(self, arg0_1, arg1_1, arg2_1): iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0), repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1); index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]); index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]); return (index_1,) ``` which should generate a JIT py file like this: ``` def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr): ... tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last') tmp1 = ks0 tmp2 = tmp0 + tmp1 tmp3 = tmp0 < 0 tmp4 = tl.where(tmp3, tmp2, tmp0) # check_bounds() tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) \| ~(xmask), "index out of bounds: 0 <= tmp4 < ks0") def call(): arg0_1, arg1_1, arg2_1 = args buf1 = aten.repeat_interleave.Tensor(arg1_1) buf4 = empty_strided_cuda((u0, 64), (64, 1)) triton_poi_fused_index_select_0.run( buf1, arg2_1, buf4, s0, triton_poi_fused_index_select_0_xnumel, grid=grid(triton_poi_fused_index_select_0_xnumel), stream=stream0) ``` # Issue In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`. `3b555ba477/torch/_inductor/ir.py (L8154-L8160)` When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`. # Approach When creating `indirect` symbols from fallback, specify its range to be `[-size, size -1]` to avoid a lookup error with `indirectX`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-14 10:07:06 +00:00
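The generated kernel in the commit above wraps negative indices and then bounds-checks them, which is why an indirect index is given the range `[-size, size - 1]`. The same arithmetic can be sketched in plain Python; `wrap_index` is a hypothetical name used only for illustration and is not part of Inductor.

```python
def wrap_index(idx: int, size: int) -> int:
    """Wrap a possibly negative index into [0, size) (illustrative helper)."""
    # Valid inputs lie in [-size, size - 1], matching the range the commit
    # assigns to indirect symbols created from a fallback.
    if not -size <= idx <= size - 1:
        raise IndexError(f"index out of bounds: {-size} <= {idx} < {size}")
    # Mirrors tmp4 = tl.where(tmp0 < 0, tmp0 + ks0, tmp0) in the kernel.
    return idx + size if idx < 0 else idx


print(wrap_index(-1, 64))  # 63
print(wrap_index(5, 64))   # 5
```

After wrapping, the result always satisfies `0 <= result < size`, which is the condition the kernel's `tl.device_assert` checks.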
Jason Ansel	99988be423	[halide-backend] Add test shard (#127308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127308 Approved by: https://github.com/shunting314, https://github.com/eellison ghstack dependencies: #128266	2024-06-14 10:02:57 +00:00
Xia, Weiwen	03e8a4cf45	[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591 ) Port #127592 from main to release/2.4 ------ Fixes #127402 - Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py - Add checks of mutation for QLinearPointwiseBinaryPT2E Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592 Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee Pull Request resolved: https://github.com/pytorch/pytorch/pull/128591 Approved by: https://github.com/jgong5, https://github.com/Chillee	2024-06-14 09:31:38 +00:00
PyTorch MergeBot	43ae3073f9	Revert "[traced-graph][sparse] remove redundant assert in sparse prop test (#128523 )" This reverts commit ba3726d02b25dff92762c59d4dffe96a7babfa75. Reverted https://github.com/pytorch/pytorch/pull/128523 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Looks like your changes broke the inductor tests: linux-jammy-cpu-py3.8-gcc11-inductor, linux-jammy-cpu-py3.8-gcc11-inductor, linux-jammy-cpu-py3.8-gcc11-inductor. [Here you can find more details](`ba3726d02b`). ([comment](https://github.com/pytorch/pytorch/pull/128523#issuecomment-2167518145))	2024-06-14 08:27:05 +00:00
Will Constable	0344f95c2e	Add missing #include <array> to thread_name.cpp (#128664 ) I got local compile errors (using clang 14.0.6) due to this missing include after pulling the latest pytorch main. It's totally puzzling why CI appears to pass without this fix. Hopefully someone else will have an idea if we are missing some CI coverage or if I am using a strange build setup locally. The PR introducing the compile errors was https://github.com/pytorch/pytorch/pull/128448. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128664 Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/d4l3k	2024-06-14 07:49:09 +00:00
Anshul Sinha	03725a0512	[dtensor][example] added MLPStacked example for printing sharding (#128461 ) Summary Currently, the comm_mode_feature_examples does not have an example for printing sharding information for a model with nested module. While adding the new example to the suite, I recognized a way to refactor existing examples in order to make them more readable for users. The expected output can be found below: <img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92"> Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461 Approved by: https://github.com/XilunWu ghstack dependencies: #128369, #128451	2024-06-14 07:30:31 +00:00
Anshul Sinha	dd3b79a08f	[dtensor][be] improving readability of comm_mode.py and comm_mode_features_example.py (#128451 ) Summary I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases in order to better reflect what they are about. Removed non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run code. Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451 Approved by: https://github.com/XilunWu ghstack dependencies: #128369	2024-06-14 07:30:31 +00:00
Anshul Sinha	e886122e98	[dtensor][debug] add module level tracing and readable display (#128369 ) Summary Currently, CommDebugMode only allows displaying collective tracing at a model level, whereas a user may require a more detailed breakdown. To make this possible, I have changed the ModuleParamaterShardingTracker by adding a string variable to track the current sub-module as well as a dictionary keeping track of the depths of the submodules in the model tree. The CommDebugMode class was changed by adding a new dictionary keeping track of the module collective counts as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is: <img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791"> Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369 Approved by: https://github.com/XilunWu	2024-06-14 07:30:31 +00:00
yiliu30	4669c6d3ae	[quant][pt2e][quantizer] Support `set_module_name_qconfig` in X86InductorQuantizer (#126044 ) Summary: Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`. For example, only quantize the `sub`: ```python class M(torch.nn.Module): def __init__(self): super().__init__() self.linear = torch.nn.Linear(5, 5) self.sub = Sub() def forward(self, x): x = self.linear(x) x = self.sub(x) return x m = M().eval() example_inputs = (torch.randn(3, 5),) # Set config for a specific submodule. quantizer = X86InductorQuantizer() quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config()) ``` - Added `set_module_name_qconfig` to allow users to set the configuration at the `module_name` level. - Unified the annotation process to follow this order: `module_name_qconfig`, `operator_type_qconfig`, and `global_config`. - Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs. - Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py` as it is common to all quantizers. Test Plan ```bash python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name ``` @Xia-Weiwen @leslie-fang-intel @jgong5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168	2024-06-14 07:13:10 +00:00
Catherine Lee	674be9d3be	Update cu124 dynamo benchmark expected values (#128589 ) I believe this corresponds to changes in https://github.com/pytorch/pytorch/pull/127780 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128589 Approved by: https://github.com/nWEIdia, https://github.com/DanilBaibak	2024-06-14 07:04:34 +00:00
PyTorch MergeBot	18f35d9e12	Revert "Run all samples for torchinductor tests (#128343 )" This reverts commit 41df20c07caecddb6d21d69a125f2998ae9313e8. Reverted https://github.com/pytorch/pytorch/pull/128343 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_avg_pool3d_cuda_float16 and other tests `41df20c07c` https://github.com/pytorch/pytorch/actions/runs/9509191526/job/26213490266. I think this might be a landrace ([comment](https://github.com/pytorch/pytorch/pull/128343#issuecomment-2167275337))	2024-06-14 06:08:17 +00:00
David Berard	f48f7615dc	[easy][subclasses] dynamo.reset() in test_subclass_views (#128659 ) When we don't dynamo.reset(), we don't recompile on different dynamic shapes. Also, some of the returned views were tuples - so when we `* 2`, we actually just copy all the inputs twice in the tuple. I changed it so that it would just return one of the values from the return tuple. Additionally, this exposes a bug that fails with the slice operation, so I skipped it when we're testing with dynamic shapes: ``` File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint return self._str(self._print(expr)) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print return printmethod(expr, kwargs) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add t = self._print(term) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print return printmethod(expr, kwargs) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul a_str = [self.parenthesize(x, prec, strict=False) for x in a] File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp> a_str = [self.parenthesize(x, prec, strict=False) for x in a] File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize return self._print(item) File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print return 
printmethod(expr, **kwargs) File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol assert self.symbol_to_source.get(expr), ( AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}. If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659 Approved by: https://github.com/YuqingJ	2024-06-14 05:18:07 +00:00
amdfaa	9ac08dab1f	Updates diskspace-cleanup for ROCm CI (#127947 ) Gets the location of the docker directory and outputs how much disk space is being used by docker. This is required since the new Cirrascale CI nodes for ROCm have docker root directory in a different partition. Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127947 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-06-14 04:32:38 +00:00
Huy Do	eff01bce21	Only run inductor A100 perf benchmark smoke test periodically (#128677 ) Attempt to mitigate the long queue on A100 as reported in https://github.com/pytorch/pytorch/issues/128627. From what I see, this change `03467b3fed/1` doubles the job duration from 20+ to 40+ minutes. This, together with https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-cu124.yml and maybe an increased number of PRs with `ciflow/inductor`, is contributing to the long queue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128677 Approved by: https://github.com/atalman, https://github.com/desertfire	2024-06-14 02:39:33 +00:00
Aart Bik	ba3726d02b	[traced-graph][sparse] remove redundant assert in sparse prop test (#128523 ) The assertEqualMeta() method already tests that the first argument is a FakeTensor https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523 Approved by: https://github.com/soulitzer	2024-06-14 02:34:51 +00:00
Sahdev Zala	685fcfb40d	Fix docstring in autograd (#128657 ) Fix docstrings in autograd files. The fix can be verified by running pydocstyle path-to-file --count Related #112593 BEFORE the PR:  pydocstyle torch/autograd/anomaly_mode.py --count 8 pydocstyle torch/autograd/__init__.py --count 9 AFTER the PR:  pydocstyle torch/autograd/anomaly_mode.py --count 0 pydocstyle torch/autograd/__init__.py --count 0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128657 Approved by: https://github.com/soulitzer	2024-06-14 02:18:59 +00:00
PyTorch MergeBot	0186b386cd	Revert "[ONNX] Add upsample trilinear to skip decomp (#128259 )" This reverts commit b72989a2b5ac4637612e31e325d7c8233fcbd7a1. Reverted https://github.com/pytorch/pytorch/pull/128259 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its ONNX job is failing in trunk `b72989a2b5` ([comment](https://github.com/pytorch/pytorch/pull/128259#issuecomment-2167058937))	2024-06-14 01:44:26 +00:00
anandptl84	f48ca2561d	Document `torch.cuda.profiler.start` (#128098 ) document https://github.com/pytorch/pytorch/issues/127917 start function of cuda/ profiler.py Fixes 127917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128098 Approved by: https://github.com/aaronenyeshi	2024-06-14 01:44:18 +00:00
Isuru Fernando	41df20c07c	Run all samples for torchinductor tests (#128343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343 Approved by: https://github.com/lezcano	2024-06-14 01:28:32 +00:00
PyTorch MergeBot	6895a5804c	Revert "[checkpoint] Clean up selective activation checkpoint and make public (#125795 )" This reverts commit c472cec5656b9ffb668af97a02d711bdbdf5ebec. Reverted https://github.com/pytorch/pytorch/pull/125795 on behalf of https://github.com/soulitzer due to breaking torchtitan CI ([comment](https://github.com/pytorch/pytorch/pull/125795#issuecomment-2167036157))	2024-06-14 01:14:59 +00:00
Mengwei Liu	6564d63e69	Use mv kernel for small M (#128632 ) Previously we were using: * mv kernel for M == 1 * mm kernel for 1 < M < 4 * llama.cpp inspired mm kernel for M >= 4 This PR consolidates these into only 2 kernels, using the same mv kernel for M < 12. Benchmarked on https://github.com/malfet/llm_experiments/blob/main/metal-perf/int8mm.mm Mac M1 Max, input size M x 4128 x 4096 ![llama cpp shader and ATen shader (2)](https://github.com/pytorch/pytorch/assets/8188269/9e2e3024-c5ea-4303-88bf-ff3646296396) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128632 Approved by: https://github.com/malfet	2024-06-14 01:06:53 +00:00
Sheng Fu	ae2359638b	Save DOT file of graph instead of SVG for GraphTransformObserver (#128634 ) Summary: GraphTransformObserver saves the SVG file of the input/output graph in each inductor pass. In my test with the CMF model, when the graph is large, GraphViz took forever to convert DOT to SVG, which is not acceptable. This diff saves a DOT file instead of an SVG file to speed things up. The DOT file is also an order of magnitude smaller than the SVG. To view these graphs, the user can run dot -Txxx input.dot to convert DOT to any other format. The user can control how many iterations are used to lay out the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details. Test Plan: buck2 test mode/dev-sand caffe2/test:fx -- fx.test_fx_xform_observer.TestGraphTransformObserver Differential Revision: D58539182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634 Approved by: https://github.com/mengluy0125	2024-06-14 00:54:22 +00:00
Scott Wolchok	6f181756dc	Use by-column algorithm for fp16/bf16 CPUBlas gemm_transb kernels (#127318 ) Summary: #96074 (D44340826) changed the algorithm for 16-bit types for gemm_notrans_ and gemm_transb_ for the sake of precision. In this diff, we go back to the old algorithm for gemm_transb_, maintaining precision by allocating temporary space equal to (in elements, so actually double since we are accumulating 16-bit types into fp32) the size of `c` to accumulate into. Test Plan: Used https://github.com/malfet/llm_experiments (benchmarks/benchmark_torch_mm.py) to benchmark before and after: before: ``` mv_nt torch.float32 5.47 usec mv_nt torch.float16 8.45 usec mv_nt torch.bfloat16 183.43 usec mv_ta torch.float32 5.70 usec mv_ta torch.float16 24.17 usec mv_ta torch.bfloat16 97.27 usec notrans torch.float32 5.58 usec notrans torch.float16 25.18 usec notrans torch.bfloat16 63.11 usec trans_a torch.float32 5.59 usec trans_a torch.float16 68.94 usec trans_a torch.bfloat16 311.60 usec trans_b torch.float32 5.63 usec trans_b torch.float16 8.76 usec trans_b torch.bfloat16 29.17 usec ``` after: ``` mv_nt torch.float32 5.53 usec mv_nt torch.float16 8.57 usec mv_nt torch.bfloat16 188.17 usec mv_ta torch.float32 5.78 usec mv_ta torch.float16 28.59 usec mv_ta torch.bfloat16 98.45 usec notrans torch.float32 5.71 usec notrans torch.float16 26.08 usec notrans torch.bfloat16 64.06 usec trans_a torch.float32 5.72 usec trans_a torch.float16 32.21 usec trans_a torch.bfloat16 32.10 usec trans_b torch.float32 5.83 usec trans_b torch.float16 9.05 usec trans_b torch.bfloat16 29.66 usec ``` Also expanded coverage to a range of larger matrix-vector and matrix-matrix sizes. 
before: ``` Matrix-vector: m=1024, n=1024, k=1 ==================== notrans torch.float32 24.75 usec notrans torch.float16 258.04 usec notrans torch.bfloat16 245.64 usec trans_a torch.float32 26.94 usec trans_a torch.float16 692.09 usec trans_a torch.bfloat16 1709.53 usec m=4100, n=4100, k=1 ==================== notrans torch.float32 2811.48 usec notrans torch.float16 4192.06 usec notrans torch.bfloat16 4041.01 usec trans_a torch.float32 2778.38 usec trans_a torch.float16 17218.41 usec trans_a torch.bfloat16 27561.21 usec m=16384, n=16384, k=1 ==================== notrans torch.float32 60157.66 usec notrans torch.float16 64121.38 usec notrans torch.bfloat16 65714.65 usec trans_a torch.float32 84975.39 usec trans_a torch.float16 1024223.33 usec trans_a torch.bfloat16 1078683.21 usec Matrix-matrix: m=1024, n=1024, k=256 ==================== notrans torch.float32 302.55 usec notrans torch.float16 172869.06 usec notrans torch.bfloat16 172837.81 usec trans_a torch.float32 250.03 usec trans_a torch.float16 333373.38 usec trans_a torch.bfloat16 432760.00 usec m=4100, n=4100, k=128 ==================== notrans torch.float32 5278.56 usec notrans torch.float16 1426335.29 usec notrans torch.bfloat16 1404249.37 usec trans_a torch.float32 4818.63 usec trans_a torch.float16 2969936.17 usec trans_a torch.bfloat16 3432565.96 usec m=16384, n=16384, k=16 ==================== notrans torch.float32 72225.71 usec notrans torch.float16 1439875.54 usec notrans torch.bfloat16 1443716.33 usec trans_a torch.float32 221130.21 usec trans_a torch.float16 16910654.17 usec trans_a torch.bfloat16 21447377.63 usec ``` after: ``` Matrix-vector: m=1024, n=1024, k=1 ==================== notrans torch.float32 25.11 usec notrans torch.float16 252.76 usec notrans torch.bfloat16 238.58 usec trans_a torch.float32 26.62 usec trans_a torch.float16 167.40 usec trans_a torch.bfloat16 174.08 usec m=4100, n=4100, k=1 ==================== notrans torch.float32 2774.28 usec notrans torch.float16 3991.70 usec 
notrans torch.bfloat16 3945.44 usec trans_a torch.float32 3011.25 usec trans_a torch.float16 2666.85 usec trans_a torch.bfloat16 2686.95 usec m=16384, n=16384, k=1 ==================== notrans torch.float32 58682.15 usec notrans torch.float16 63077.52 usec notrans torch.bfloat16 63319.33 usec trans_a torch.float32 70549.57 usec trans_a torch.float16 42145.45 usec trans_a torch.bfloat16 42270.13 usec Matrix-matrix: m=1024, n=1024, k=256 ==================== notrans torch.float32 289.37 usec notrans torch.float16 179704.87 usec notrans torch.bfloat16 173490.33 usec trans_a torch.float32 330.89 usec trans_a torch.float16 42466.26 usec trans_a torch.bfloat16 42811.19 usec m=4100, n=4100, k=128 ==================== notrans torch.float32 4793.33 usec notrans torch.float16 1407557.04 usec notrans torch.bfloat16 1388212.17 usec trans_a torch.float32 4714.20 usec trans_a torch.float16 359406.58 usec trans_a torch.bfloat16 350419.42 usec m=16384, n=16384, k=16 ==================== notrans torch.float32 65757.08 usec notrans torch.float16 1427715.71 usec notrans torch.bfloat16 1440883.00 usec trans_a torch.float32 202263.44 usec trans_a torch.float16 1387522.33 usec trans_a torch.bfloat16 1762253.92 usec ``` We are improving, but still have a lot of room for improvement compared to float32 BLAS. Full disclosure: applying this same method to gemm_notrans (which does correspond to notrans in the benchmark's nomenclature) does not improve performance across the board; the 16KB x 16KB x 16 matmul regresses and I haven't figured out why yet. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127318 Approved by: https://github.com/peterbell10, https://github.com/malfet	2024-06-14 00:39:18 +00:00
Alnis Murtovi	18f5357f4f	Introduce heuristic for mixed_mm on A100 (#128232 ) This PR introduces a heuristic for tuned_mixed_mm. The heuristic is only enabled on an A100, because it has only been tested on an A100, and it is only enabled if force_mixed_mm="heuristic". I compared the heuristic to the aten fallback implementation and triton+autotune: Geometric mean speedup: 2.51 ``` m n k triton + autotune (GB/s) aten (GB/s) heuristic (GB/s) used_heuristic speedup (heuristic/aten) 1 4096 4096 456.95 134.59 459.37 True 3.41 1 4096 8192 523.93 138.29 553.50 True 4.00 1 4096 16394 233.70 161.62 234.14 True 1.45 1 8192 4096 633.25 140.64 574.86 True 4.09 1 8192 8192 737.54 147.41 690.26 True 4.68 1 8192 16394 413.67 175.88 408.68 True 2.32 1 16394 4096 717.22 167.22 665.36 True 3.98 1 16394 8192 812.69 177.17 815.90 True 4.61 1 16394 16394 473.17 178.58 435.11 True 2.44 4 4096 4096 479.46 134.80 486.74 True 3.61 4 4096 6333 174.27 106.74 171.64 True 1.61 4 4096 8192 567.14 138.32 571.09 True 4.13 4 4096 12313 179.65 105.91 180.03 True 1.70 4 4096 16394 222.96 145.54 222.81 True 1.53 4 6333 4096 491.78 126.37 473.20 True 3.74 4 6333 6333 268.79 143.40 269.75 True 1.88 4 6333 8192 783.80 135.12 796.23 True 5.89 4 6333 12313 286.35 142.37 287.30 True 2.02 4 6333 16394 362.47 139.66 361.47 True 2.59 4 8192 4096 642.73 140.53 641.88 True 4.57 4 8192 6333 287.65 137.63 287.38 True 2.09 4 8192 8192 738.42 150.16 721.59 True 4.81 4 8192 12313 301.27 146.18 302.31 True 2.07 4 8192 16394 415.37 167.66 393.41 True 2.35 4 12313 4096 823.66 141.81 745.40 True 5.26 4 12313 6333 433.92 148.17 429.83 True 2.90 4 12313 8192 984.60 149.30 988.95 True 6.62 4 12313 12313 452.00 150.87 452.50 True 3.00 4 12313 16394 609.88 159.20 609.71 True 3.83 4 16394 4096 779.44 157.46 777.10 True 4.94 4 16394 6333 402.93 139.50 309.47 True 2.22 4 16394 8192 950.38 175.49 949.67 True 5.41 4 16394 12313 414.62 153.99 315.95 True 2.05 4 16394 16394 497.56 174.97 461.77 True 2.64 16 4096 4096 475.92 134.45 
478.57 True 3.56 16 4096 6333 146.36 112.50 145.35 True 1.29 16 4096 8192 560.00 138.22 557.19 True 4.03 16 4096 12313 152.02 105.06 151.27 True 1.44 16 4096 16394 222.48 156.72 222.88 True 1.42 16 6333 4096 692.41 122.14 696.88 True 5.71 16 6333 6333 220.74 140.90 225.41 True 1.60 16 6333 8192 813.56 140.21 820.28 True 5.85 16 6333 12313 232.48 131.19 232.55 True 1.77 16 6333 16394 367.39 134.93 361.87 True 2.68 16 8192 4096 665.54 140.29 266.24 True 1.90 16 8192 6333 254.77 136.65 240.12 True 1.76 16 8192 8192 750.63 146.26 736.93 True 5.04 16 8192 12313 266.61 127.13 251.81 True 1.98 16 8192 16394 397.25 160.42 390.76 True 2.44 16 12313 4096 857.48 141.36 851.36 True 6.02 16 12313 6333 423.21 132.40 357.55 True 2.70 16 12313 8192 1021.24 145.68 1024.60 True 7.03 16 12313 12313 370.12 143.94 383.52 True 2.66 16 12313 16394 608.52 141.03 608.48 True 4.31 16 16394 4096 826.48 155.94 826.74 True 5.30 16 16394 6333 420.38 144.09 265.23 True 1.84 16 16394 8192 988.07 156.21 984.63 True 6.30 16 16394 12313 431.40 146.92 265.49 True 1.81 16 16394 16394 497.39 167.86 461.79 True 2.75 23 4096 4096 344.43 132.84 338.64 True 2.55 23 4096 6333 195.34 118.48 195.31 True 1.65 23 4096 8192 389.83 140.02 376.62 True 2.69 23 4096 12313 204.49 137.96 204.80 True 1.48 23 4096 16394 242.48 148.99 242.74 True 1.63 23 6333 4096 429.25 126.52 517.75 True 4.09 23 6333 6333 295.56 133.51 296.14 True 2.22 23 6333 8192 594.88 137.05 581.78 True 4.25 23 6333 12313 315.18 131.67 314.64 True 2.39 23 6333 16394 386.46 141.45 386.54 True 2.73 23 8192 4096 553.52 142.05 568.35 True 4.00 23 8192 6333 215.58 139.01 210.86 True 1.52 23 8192 8192 609.21 154.85 528.76 True 3.41 23 8192 12313 220.38 142.93 233.54 True 1.63 23 8192 16394 402.63 158.39 403.21 True 2.55 23 12313 4096 723.54 131.58 581.94 True 4.42 23 12313 6333 307.90 131.58 307.90 True 2.34 23 12313 8192 893.36 129.97 623.72 True 4.80 23 12313 12313 322.40 134.84 317.80 True 2.36 23 12313 16394 512.97 142.31 409.45 True 2.88 23 16394 
4096 703.66 154.54 643.53 True 4.16 23 16394 6333 305.55 127.55 293.17 True 2.30 23 16394 8192 768.12 154.60 681.53 True 4.41 23 16394 12313 311.61 140.92 307.01 True 2.18 23 16394 16394 467.24 171.07 467.29 True 2.73 32 4096 4096 344.71 132.30 338.62 True 2.56 32 4096 6333 206.48 107.59 205.55 True 1.91 32 4096 8192 387.24 137.82 353.12 True 2.56 32 4096 12313 216.35 120.61 214.50 True 1.78 32 4096 16394 242.05 149.92 241.94 True 1.61 32 6333 4096 525.50 127.12 518.02 True 4.08 32 6333 6333 300.50 118.41 296.55 True 2.50 32 6333 8192 600.92 136.99 601.94 True 4.39 32 6333 12313 316.13 136.45 316.03 True 2.32 32 6333 16394 386.11 141.34 386.10 True 2.73 32 8192 4096 546.18 140.18 341.14 True 2.43 32 8192 6333 218.40 130.65 263.42 True 2.02 32 8192 8192 608.29 147.16 542.12 True 3.68 32 8192 12313 225.60 135.04 225.23 True 1.67 32 8192 16394 434.75 160.42 401.28 True 2.50 32 12313 4096 787.80 136.28 583.60 True 4.28 32 12313 6333 316.66 125.76 323.35 True 2.57 32 12313 8192 891.38 128.88 639.50 True 4.96 32 12313 12313 326.11 132.37 325.88 True 2.46 32 12313 16394 521.64 139.47 395.69 True 2.84 32 16394 4096 625.55 158.46 651.16 True 4.11 32 16394 6333 304.14 131.13 284.55 True 2.17 32 16394 8192 767.79 162.95 704.34 True 4.32 32 16394 12313 310.74 137.68 303.39 True 2.20 32 16394 16394 465.92 171.43 465.37 True 2.71 43 4096 4096 345.05 133.87 196.47 True 1.47 43 4096 6333 148.64 99.92 148.97 True 1.49 43 4096 8192 386.50 135.39 214.00 True 1.58 43 4096 12313 190.39 109.36 156.27 True 1.43 43 4096 16394 203.63 150.24 204.05 True 1.36 43 6333 4096 421.35 106.04 132.25 True 1.25 43 6333 6333 224.75 113.01 224.97 True 1.99 43 6333 8192 471.11 117.61 327.39 True 2.78 43 6333 12313 234.55 115.61 234.74 True 2.03 43 6333 16394 311.56 132.24 312.01 True 2.36 43 8192 4096 400.73 140.12 269.11 True 1.92 43 8192 6333 167.32 119.13 168.84 True 1.42 43 8192 8192 435.45 146.98 286.21 True 1.95 43 8192 12313 161.05 127.82 162.78 True 1.27 43 8192 16394 207.16 156.40 208.90 True 
1.34 43 12313 4096 484.01 120.10 313.35 True 2.61 43 12313 6333 234.54 106.63 232.85 True 2.18 43 12313 8192 515.34 130.23 411.70 True 3.16 43 12313 12313 239.39 130.04 239.03 True 1.84 43 12313 16394 316.02 137.39 316.29 True 2.30 43 16394 4096 475.60 152.57 340.97 True 2.23 43 16394 6333 241.21 132.49 208.59 True 1.57 43 16394 8192 499.34 157.43 361.61 True 2.30 43 16394 12313 246.25 132.31 211.68 True 1.60 43 16394 16394 302.90 158.56 277.05 True 1.75 64 4096 4096 280.48 126.82 195.97 True 1.55 64 4096 6333 150.94 101.63 150.48 True 1.48 64 4096 8192 305.47 135.06 211.03 True 1.56 64 4096 12313 158.12 110.06 158.15 True 1.44 64 4096 16394 206.68 136.21 201.28 True 1.48 64 6333 4096 409.11 105.10 296.07 True 2.82 64 6333 6333 229.98 108.46 230.59 True 2.13 64 6333 8192 469.32 112.24 330.58 True 2.95 64 6333 12313 245.02 117.16 244.84 True 2.09 64 6333 16394 317.78 125.80 318.37 True 2.53 64 8192 4096 323.42 139.92 267.31 True 1.91 64 8192 6333 167.51 118.45 167.56 True 1.41 64 8192 8192 341.13 146.71 284.88 True 1.94 64 8192 12313 172.21 123.42 171.97 True 1.39 64 8192 16394 217.22 153.18 216.99 True 1.42 64 12313 4096 482.19 123.32 311.82 True 2.53 64 12313 6333 238.73 123.88 238.66 True 1.93 64 12313 8192 516.32 122.11 330.50 True 2.71 64 12313 12313 248.73 125.32 296.82 True 2.37 64 12313 16394 314.98 134.06 320.31 True 2.39 64 16394 4096 476.59 154.58 340.84 True 2.20 64 16394 6333 240.54 119.60 214.82 True 1.80 64 16394 8192 501.36 149.02 359.45 True 2.41 64 16394 12313 244.65 126.01 222.47 True 1.77 64 16394 16394 302.48 160.36 283.66 True 1.77 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128232 Approved by: https://github.com/Chillee	2024-06-14 00:31:22 +00:00
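The "speedup (heuristic/aten)" column and the geometric mean summary in the commit above can be reproduced with a short calculation. This is a minimal sketch, assuming only the first three rows of the table as input (so its result is not the 2.51 reported for the full table); the variable names are illustrative and not taken from the PR.

```python
from statistics import geometric_mean

# (aten GB/s, heuristic GB/s) for the first three rows of the table above.
rows = [(134.59, 459.37), (138.29, 553.50), (161.62, 234.14)]

# Per-shape speedup is heuristic throughput divided by aten throughput.
speedups = [heuristic / aten for aten, heuristic in rows]

# The geometric mean is the n-th root of the product of the n ratios, which
# keeps one very large ratio from dominating the summary.
summary = geometric_mean(speedups)

print([round(s, 2) for s in speedups], round(summary, 2))
```

A geometric mean is the usual choice for averaging ratios such as speedups, because it treats a 2x gain and a 2x loss symmetrically.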
cyy	9ebec1f345	Enable Wunused-function in torch_cpu (#128576 ) Follows #128499 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128576 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-06-14 00:12:58 +00:00
Jane Xu	6767e38267	Fix manual licensing (#128630 ) It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for: clog - BSD-3 eigen - MPL-2.0 ffnvcodec - LGPL-2.1 -> hungarian - Permissive (free to use) irrlicht - The Irrlicht Engine License (zlib/libpng) -> pdcurses - Public Domain for core -> sigslot - Public Domain test - BSD-3 Vulkan - Apache-2.0 or MIT fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967 This PR addresses the manual licensing mismatches mentioned above (the two bolded; one is getting addressed in #128085), but as everything else is generated by pulling through other files, I did not address those. It is unclear what needs to be updated for the remaining ones to be accurate, or whether they are inaccurate today. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630 Approved by: https://github.com/malfet	2024-06-14 00:12:09 +00:00
Yidi Wu	afdaa7fc95	[while_loop] expose it as torch.while_loop (#128562 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128562 Approved by: https://github.com/zou3519	2024-06-13 23:44:10 +00:00
chilli	c486e2ab64	Add coloring to fx graph print out (#128476 ) Note: Won't land immediately, at least I'll need to add a color option to the field. But curious if any tests fail. Old: <img width="1294" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/c3a750ed-5e54-4621-b2e4-be5481be15b6"> New: <img width="1303" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/3a1f1adc-6f3a-413e-8b87-ee53da9bf4ed"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128476 Approved by: https://github.com/ezyang	2024-06-13 23:39:04 +00:00
rzou	61421c42c0	[custom_op] don't invoke autograd.Function when unnecessary (#127976 ) This matches our autograd logic for pytorch native operators. There's no need to invoke an autograd.Function if we're under a torch.no_grad() or if none of the inputs have requires_grad=True (invoking an autograd.Function results in (noticeable) overhead). Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976 Approved by: https://github.com/williamwen42	2024-06-13 23:38:23 +00:00
titaiwangms	b72989a2b5	[ONNX] Add upsample trilinear to skip decomp (#128259 ) (1) Add upsample trilinear vec to skip decomposition (2) Add tests to make sure that torch.export.export still decomposes them Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259 Approved by: https://github.com/justinchuby	2024-06-13 23:31:34 +00:00
Jane Xu	8c20f53a5e	Try seeding individual foreach tests (#128220 ) A first easy attempt to deflake foreach Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220 Approved by: https://github.com/ZainRizvi, https://github.com/crcrpar, https://github.com/huydhn	2024-06-13 22:42:16 +00:00
Animesh Jain	865d7b3424	[Reland][dynamo] Enable some inlining inbuilt nn module tests (#128440 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-06-13 22:39:22 +00:00
Shangdi Yu	3a0006ef22	Remove global variable SIZE, and fix linter warning (#128559 ) - Resolve a TODO by removing global variable `SIZE`. - Fix a linter warning in `test/test_nestedtensor.py`. `pytest pytorch/test/test_sort_and_select.py` and `pytest test/test_nestedtensor.py` pass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128559 Approved by: https://github.com/kit1980, https://github.com/Skylion007	2024-06-13 22:09:51 +00:00
Andrew Hoblitzell	6211e67e49	Document `torch.jit.frontend.get_default_args` (#128408 ) Fixes #127896 ### Description Add docstring to `torch/jit/frontend.py:get_default_args` function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408 Approved by: https://github.com/malfet	2024-06-13 21:49:16 +00:00
Andrew Gu	bf8a05f483	[FSDP2] Included module FQN in `FSDPParamGroup` `record_function`s (#128624 ) This PR adds the module FQN into the `FSDPParamGroup` `record_function`s for improved clarity in profiler traces. Differential Revision: [D58544809](https://our.internmc.facebook.com/intern/diff/D58544809) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128624 Approved by: https://github.com/ckluk2	2024-06-13 21:35:33 +00:00
PyTorch MergeBot	c8e9656a12	Revert "Add test to xfail_list only for abi_compatible (#128506 )" This reverts commit 49366b2640df1cba5a3b40bedd31b57b08529612. Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes an inductor test to fail in trunk `49366b2640` ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2166824714))	2024-06-13 21:30:07 +00:00
Jing Xu	8763d44bf1	add xpu to torch.compile (#127279 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.compile doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127279 Approved by: https://github.com/dvrogozh, https://github.com/svekars	2024-06-13 21:15:09 +00:00
Yifu Wang	790138fdc7	Add profiler annotation for fused_all_gather_matmul and fused_matmul_reduce_scatter (#127556 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127556 Approved by: https://github.com/awgu ghstack dependencies: #127454, #127455	2024-06-13 20:52:46 +00:00
Yifu Wang	3b28dc6c9d	Improve the scheduling for fused_matmul_reduce_scatter (#127455 ) In fused_all_gather_matmul, each rank copies its shard into its local p2p buffer, performs a barrier, then performs (copy -> matmul) for each remote shard. The (copy -> matmul)s for remote shards run on two streams without synchronization. This not only allows for computation/communication overlapping, but also computation/computation overlapping, which alleviates the wave quantization effect caused by computation decomposition. However, the synchronization-free approach doesn't work well with fused_matmul_reduce_scatter, in which there's a barrier in every step. Without synchronization between the two streams, a matmul in one stream can delay a barrier in the other stream, further delaying the copy waiting for the barrier. This PR addresses the issue by adding synchronization between the two streams such that the matmul of step i can only start after the barrier of step i-1 completes. With this approach, we lose the computation/computation overlapping, but avoid slowdown due to delayed barrier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455 Approved by: https://github.com/Chillee ghstack dependencies: #127454	2024-06-13 20:52:46 +00:00
Arun Pa	c0b40ab42e	doc string for torch.jit.frontend.get_jit_class_def method (#128391 ) Fixes #127904 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128391 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-06-13 19:51:02 +00:00
James Wu	a3af32c2fb	Add functionality to make ViewAndMutationData (slightly more) cache safe (#127618 ) This PR changes the traced_tangents field of ViewAndMutationMeta to be cache safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangent's field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, only store the result of __tensor_flatten__() on any FakeTensors representing subclasses. That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of __tensor_flatten__ we won't save to the cache. To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of tensor_flatten() instead of an entire FakeTensor. Let me know if we should change the name of the function. By doing this, we can now run the dynamic shapes cache test with autograd turned on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618 Approved by: https://github.com/bdhirsh	2024-06-13 19:45:33 +00:00
Sam Larsen	39193b10e8	[inductor] fx graph cache: memoize devices to make cache key calculation more predictable (#128366 ) Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this: - We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key. - Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID. - That pickle behavior means that we can see different cache keys if an object id appears more than once in the hashed objects vs. being functionally equivalent but distinct objects. The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation? I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366 Approved by: https://github.com/oulgen, https://github.com/eellison	2024-06-13 19:25:14 +00:00
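The pickle behavior described in the commit above can be reproduced with the standard library alone. This is a minimal sketch, not the PR's implementation: plain tuples stand in for `torch.device('cpu')`, and `make_device` and `canonicalize` are hypothetical helpers introduced only for illustration.

```python
import pickle

# Hypothetical stand-in for torch.device("cpu"): tuples that are equal by
# value. tuple([...]) builds a fresh object on every call.
def make_device():
    return tuple(["cpu", 0])

shared = make_device()
same_object = [shared, shared]                     # one object, two references
distinct_objects = [make_device(), make_device()]  # two equal objects

# pickle memoizes by id(): the second reference to `shared` is written as a
# memo lookup, while the second distinct tuple is written out again.
bytes_same = pickle.dumps(same_object)
bytes_distinct = pickle.dumps(distinct_objects)

def canonicalize(objs):
    """Map every object to the first equal object seen (illustrative helper)."""
    seen = {}
    return [seen.setdefault(obj, obj) for obj in objs]

bytes_same_fixed = pickle.dumps(canonicalize(same_object))
bytes_distinct_fixed = pickle.dumps(canonicalize(distinct_objects))

print(same_object == distinct_objects)           # compares by value
print(bytes_same == bytes_distinct)              # compares the pickled bytes
print(bytes_same_fixed == bytes_distinct_fixed)  # after canonicalizing
```

Hashing `bytes_same` and `bytes_distinct` would give two different cache keys for inputs that compare equal, which is the cache-miss scenario the commit describes. Routing equal objects through one canonical instance before pickling removes the dependence on object identity.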
Shunting Zhang	c54e358bdb	enable comprehensive padding internally (#128555 ) Summary: The feature was previously disabled in fbcode due to breaking the deterministic NE unit tests. Now that it has been on in OSS for quite a while and we have verified that it has no NE impact on CMF, we want to update the unit test and enable the feature. Test Plan: ``` time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus' ``` Differential Revision: D58425432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555 Approved by: https://github.com/eellison	2024-06-13 19:20:00 +00:00
Isuru Fernando	cdc37e4bff	Add a shape property to IR nodes (#127818 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127818 Approved by: https://github.com/peterbell10	2024-06-13 19:11:52 +00:00
Xuehai Pan	5a80d2df84	[BE] enable UFMT for `torch/nn/utils` (#128595 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128595 Approved by: https://github.com/Skylion007	2024-06-13 18:34:57 +00:00
Bin Bao	9f55c80a9f	[AOTI] Fix a minimal_arrayref_interface test failure (#128613 ) Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needs to be explicitly created in order to pass a pointer to the tensor as the optional tensor parameter. Test Plan: CI Differential Revision: D58528575 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613 Approved by: https://github.com/hl475	2024-06-13 18:25:04 +00:00
vasiliy	a265556362	inductor fusion logs: make it easier to attribute to aten graph (#127159 ) Summary: I want to be able to look at inductor fusion logs and reason about which parts of the aot_autograd aten graph were fused / not fused. This PR adds a short description of each buffer to the fusion logs. Example for forward of `Float8Linear`: ``` torch._inductor.scheduler.__fusion: ===== attempting fusion (1/10): 13 nodes ===== torch._inductor.scheduler.__fusion: fuse_nodes_once, candidates: torch._inductor.scheduler.__fusion: SchedulerNode(name='buf0'), Reduction(['[254201]', 'max', 'origins={abs_1, max_1}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf3'), Reduction(['[114688]', 'max', 'origins={abs_2, max_2}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf6'), Pointwise(['[]', 'origins={reciprocal_1, convert_element_type_6, clamp_min_2, mul_2, copy_1, reciprocal_3, convert_element_type_5}']) torch._inductor.scheduler.__fusion: ExternKernelSchedulerNode(name='buf10') torch._inductor.scheduler.__fusion: SchedulerNode(name='buf2'), Pointwise(['[]', 'origins={full_default}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf8'), Pointwise(['[8192, 7168]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf4'), Reduction(['[512]', 'max', 'origins={abs_2, max_2}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf13'), Pointwise(['[8192, 7168]', 'origins={clone_2}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf7'), Pointwise(['[16384, 8192]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}']) torch._inductor.scheduler.__fusion: 
ExternKernelSchedulerNode(name='buf9') torch._inductor.scheduler.__fusion: SchedulerNode(name='buf1'), Reduction(['[528]', 'max', 'origins={abs_1, max_1}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf5'), Pointwise(['[]', 'origins={convert_element_type, clamp_min, convert_element_type_1, copy, reciprocal_2, mul, reciprocal}']) torch._inductor.scheduler.__fusion: SchedulerNode(name='buf12'), Pointwise(['[8192, 16384]', 'origins={clone_1}']) torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf7: no shared data torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf12: no shared data torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf1: numel/rnumel mismatch (reduce) (528, 1), (254201, 528) torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf1: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf12 with buf1: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf7: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf12: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf8: intermediate nodes between node1 & node2 torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf13: no shared data torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf4: numel/rnumel mismatch (reduce) (512, 1), (114688, 512) torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf4: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf13 with buf4: nodes numel incompatibility torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf8: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf13: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1) torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf9: node2 is extern or nop 
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf9: node2 is extern or nop torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf9: node2 is extern or nop torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf9: node2 is extern or nop torch._inductor.scheduler.__fusion: cannot fuse buf9 with buf10: node1 is extern or nop torch._inductor.scheduler.__fusion: found 4 possible fusions torch._inductor.scheduler.__fusion: fusing buf7 with buf12 torch._inductor.scheduler.__fusion: fusing buf8 with buf13 torch._inductor.scheduler.__fusion: fusing buf4 with buf6 torch._inductor.scheduler.__fusion: fusing buf1 with buf5 torch._inductor.scheduler.__fusion: completed fusion round (1/10): fused 13 nodes into 9 nodes ``` Test Plan: will add tests after we align that some version of this can land. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127159 Approved by: https://github.com/mlazos	2024-06-13 18:22:02 +00:00
JJ Asghar	de9a072ac4	Updating the `sigslot` license to Public Domain (#128085 ) It seems that Sigslot's license is Public Domain, not Apache 2. https://sigslot.sourceforge.net Pull Request resolved: https://github.com/pytorch/pytorch/pull/128085 Approved by: https://github.com/janeyx99	2024-06-13 18:13:54 +00:00
Thanh Ha	8733c4f4be	docs: Add link to test-infra issue (#128608 ) It's not immediately obvious from this file that the issue being referred to is in another repo. Add that detail and link to make it easier for folks reading this code to jump to the correct issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128608 Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/ZainRizvi	2024-06-13 18:00:53 +00:00
PyTorch MergeBot	dd19c9150c	Revert "[aota] compiled forward outputs requires_grad alignment with eager (#128016 )" This reverts commit b459713ca75f6ab7c8a59acec0258e0f77904ada. Reverted https://github.com/pytorch/pytorch/pull/128016 on behalf of https://github.com/bdhirsh due to fix torchbench regression ([comment](https://github.com/pytorch/pytorch/pull/128016#issuecomment-2166446841))	2024-06-13 17:56:42 +00:00
Yifu Wang	52f529105d	force_stride_order on fused_all_gather_matmul/fused_matmul_reduce_scatter's operands to avoid a copy due to layout transformation (#127454 ) When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation. This copy can be avoided if the lhs operand already has the following stride order: lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride() In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454 Approved by: https://github.com/Chillee	2024-06-13 17:52:37 +00:00
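The stride order named in the commit above, `lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()`, can be worked out from the shape alone. This is a minimal pure-Python sketch of that arithmetic; `contiguous_strides` and `gather_friendly_strides` are hypothetical helper names, not PyTorch APIs, and the sketch does not call `inductor_prims.force_stride_order`.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of this shape."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


def gather_friendly_strides(shape, gather_dim):
    """Strides after moving gather_dim to the front, making the result
    contiguous, and moving the dimension back to its original position."""
    # Dimension order after movedim(gather_dim, 0).
    order = [gather_dim] + [d for d in range(len(shape)) if d != gather_dim]
    moved_strides = contiguous_strides([shape[d] for d in order])
    # movedim(0, gather_dim) only permutes the view, so each original
    # dimension keeps the stride it had in the moved, contiguous layout.
    strides = [0] * len(shape)
    for position, dim in enumerate(order):
        strides[dim] = moved_strides[position]
    return tuple(strides)


print(gather_friendly_strides((2, 3, 4), gather_dim=1))
print(gather_friendly_strides((2, 3, 4), gather_dim=0))
```

With `gather_dim=0` the result is the ordinary contiguous layout, which matches the commit's note that the extra copy only arises when `gather_dim/scatter_dim != 0`.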
Joel Schlosser	d5780396c7	Skip debug asserts for mixed dense, subclass views in autograd_not_implemented_fallback (#128057 ) Fixes #125503 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128057 Approved by: https://github.com/albanD, https://github.com/soulitzer ghstack dependencies: #127007	2024-06-13 17:13:02 +00:00
Joel Schlosser	9a8917fdbd	Naive CPU kernels for jagged <-> padded dense conversions (#127007 ) This PR introduces naive CPU impls for: * `_jagged_to_padded_dense_forward()` * `_padded_dense_to_jagged_forward()` On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007 Approved by: https://github.com/davidberard98	2024-06-13 17:13:02 +00:00
Animesh Jain	a0604193a2	handle call_function with Parameter args in DDPOptimizer splitting (#128034 ) When nn module inlining is enabled, modules are replaced with the underlying function calls in the output fx graph. For example: ``` class GraphModule(torch.nn.Module): def forward(self, L_x_: "f32[1024, 1024]"): l_x_ = L_x_ # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_structured_trace.py:284 in forward, code: return self.layers(x) l__self___layers_0: "f32[1024, 1024]" = self.L__self___layers_0(l_x_); l_x_ = None l__self___layers_1: "f32[1024, 1024]" = self.L__self___layers_1(l__self___layers_0); l__self___layers_0 = None return (l__self___layers_1,) ``` becomes ``` class GraphModule(torch.nn.Module): def forward(self, L_self_layers_0_weight: "f32[1024, 1024]", L_self_layers_0_bias: "f32[1024]", L_x_: "f32[1024, 1024]", L_self_layers_1_weight: "f32[1024, 1024]", L_self_layers_1_bias: "f32[1024]"): l_self_layers_0_weight = L_self_layers_0_weight l_self_layers_0_bias = L_self_layers_0_bias l_x_ = L_x_ l_self_layers_1_weight = L_self_layers_1_weight l_self_layers_1_bias = L_self_layers_1_bias # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias) input_1: "f32[1024, 1024]" = torch._C._nn.linear(l_x_, l_self_layers_0_weight, l_self_layers_0_bias); l_x_ = l_self_layers_0_weight = l_self_layers_0_bias = None input_2: "f32[1024, 1024]" = torch._C._nn.linear(input_1, l_self_layers_1_weight, l_self_layers_1_bias); input_1 = l_self_layers_1_weight = l_self_layers_1_bias = None return (input_2,) ``` When performing splitting, the DDP optimizer does not handle the inlined graph, because it does not handle function calls: previously there were no function calls with params as inputs, only calls to modules. This diff addresses that by using the example_value in the arguments to determine the Parameter arguments of a function call and the Parameter properties. 
This addresses https://github.com/pytorch/pytorch/issues/127552. Running the optimizer on the code above with inlining yields the following splitting: ``` ---submod_0 graph--- graph(): %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_] %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_weight] %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_bias] %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {}) return linear ---submod_1 graph--- graph(): %input_1 : [num_users=1] = placeholder[target=input_1] %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_weight] %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_bias] %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%input_1, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {}) return linear ---final graph--- graph(): %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_weight] %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_bias] %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_] %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_weight] %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_bias] %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {}) %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {}) return (submod_1,) --------------- ``` whereas without inlining it 
used to be ``` ---submod_0 graph--- graph(): %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_] %l__self___layers_0 : [num_users=1] = call_module[target=L__self___layers_0](args = (%l_x_,), kwargs = {}) return l__self___layers_0 /data/users/lsakka/pytorch/pytorch/torch/_inductor/compile_fx.py:133: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. warnings.warn( ---submod_1 graph--- graph(): %l__self___layers_0 : [num_users=1] = placeholder[target=l__self___layers_0] %l__self___layers_1 : [num_users=1] = call_module[target=L__self___layers_1](args = (%l__self___layers_0,), kwargs = {}) return l__self___layers_1 ---final graph--- graph(): %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_] %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_,), kwargs = {}) %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0,), kwargs = {}) return (submod_1,) --------------- ``` TESTING: (1) running ``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1 pytest test/distributed/test_dynamo_distributed.py -k ``` results in a reduction in failures from 6 to 2 with this PR. The two remaining failures are FSDP related; they do not seem trivial and involve many details, so they are left for future work. Co-authored-by: Animesh Jain <anijain@umich.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128034 Approved by: https://github.com/anijain2305, https://github.com/wconstab	2024-06-13 17:07:27 +00:00
lezcano	3e3435678c	Remove some implications from the static_eval pattern matcher (#128500 ) We should be able to remove this as, with the new canonicalisation, we have that `a < b` and `-a > -b` should be canonicalised to the same expression (if SymPy does not interfere too much). nb. I thought this would further cut the compilation time, but I was running the benchmarks wrong (not removing triton's cache, oops). It turns out that after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500 Approved by: https://github.com/ezyang ghstack dependencies: #128410, #128411	2024-06-13 16:50:00 +00:00
lezcano	0fdd8d84fa	Do not generate -1* in SymPy expressions when canonicalising (#128411 ) Partially addresses https://github.com/pytorch/pytorch/issues/128150 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128411 Approved by: https://github.com/ezyang ghstack dependencies: #128410	2024-06-13 16:49:59 +00:00
lezcano	bdeb9225b0	Do not call `get_implications` unnecessarily (#128410 ) This should improve compilation times. With this PR and the patch in the original issue, I get a compilation time of `Compilation time: 307.30 second`. Fixes https://github.com/pytorch/pytorch/issues/128398 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128410 Approved by: https://github.com/Chillee	2024-06-13 16:49:55 +00:00
cyy	e2a72313e8	Concat namespaces of torch/csrc/profiler code and other fixes (#128606 ) Improve namespaces and modernize codebase of torch/csrc/profiler code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128606 Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi	2024-06-13 16:46:34 +00:00
Tristan Rice	7c370d2fb0	expose set_thread_name to Python and set thread names (#128448 ) This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process. Threads named: * torchrun/elastic * PyTorch dataloader worker processes + pin memory thread * TCPStore * ProcessGroupNCCL background threads * WorkerServer httpserver thread Test plan: ``` $ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL \| grep pt_' 3264281 3264281 pts/45 00:00:02 pt_elastic 3264281 3267950 pts/45 00:00:00 pt_elastic ``` dataloading ```py import torch import time from torch.utils.data import ( DataLoader, Dataset, ) class NoopDataset(Dataset): def __getitem__(self, index): return index def __len__(self): return 10 dataloader = DataLoader(NoopDataset(), num_workers=2) for i, x in enumerate(dataloader): print(i, x) time.sleep(10000) ``` ``` $ python3 ~/scripts/dataloader_test.py $ ps -eL \| grep pt_ 1228312 1228312 pts/45 00:00:02 pt_main_thread 1228312 1230058 pts/45 00:00:00 pt_main_thread 1228312 1230059 pts/45 00:00:00 pt_main_thread 1230052 1230052 pts/45 00:00:00 pt_data_worker 1230052 1230198 pts/45 00:00:00 pt_data_worker 1230052 1230740 pts/45 00:00:00 pt_data_worker 1230055 1230055 pts/45 00:00:00 pt_data_worker 1230055 1230296 pts/45 00:00:00 pt_data_worker 1230055 1230759 pts/45 00:00:00 pt_data_worker ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448 Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro	2024-06-13 16:38:23 +00:00
Zain Rizvi	b05b8d3989	[EZ][ALI Migration] Add logging for workflow type determination (#128619 ) To help figure out what went wrong when the wrong label appears to have been set Pull Request resolved: https://github.com/pytorch/pytorch/pull/128619 Approved by: https://github.com/zxiiro, https://github.com/clee2000	2024-06-13 16:37:07 +00:00
Yidi Wu	e9b81e4edf	Fakify torch bind input by default (#128454 ) Summary: Try a reland of https://github.com/pytorch/pytorch/pull/127116 after some fixes landed Differential Revision: D58418251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128454 Approved by: https://github.com/angelayi	2024-06-13 16:25:11 +00:00
PyTorch MergeBot	c63ccead5e	Revert "[dynamo] Enable some inlining inbuilt nn module tests (#128440 )" This reverts commit 1602c7d0c861a4382746ccb18c76d8703a636f4e. Reverted https://github.com/pytorch/pytorch/pull/128440 on behalf of https://github.com/clee2000 due to new test broke internally D58501220 ([comment](https://github.com/pytorch/pytorch/pull/128440#issuecomment-2166127531))	2024-06-13 16:14:37 +00:00
Oguz Ulgen	17b45e905a	Fix get output code when caching is enabled (#128445 ) Summary: Improve output code retrieval mechanism so that it works in the presence of cache hits. Test Plan: ci Differential Revision: D58429602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128445 Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/masnesral	2024-06-13 16:00:30 +00:00
Aaron Gokaslan	93a14aba6e	[BE]: Update mypy to 1.10.0 (#127717 ) Updates mypy to the latest and greatest. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717 Approved by: https://github.com/ezyang	2024-06-13 15:57:13 +00:00
Wu, Chunyuan	49366b2640	Add test to xfail_list only for abi_compatible (#128506 ) https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode. It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode. We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode. - `test_qlinear_add` is already in the `xfail_list`. - `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-13 15:32:15 +00:00
xinan.lin	cf7adc2fa1	[Inductor] Update Intel GPU Triton commit pin. (#124842 ) Update Intel triton for Pytorch 2.4 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124842 Approved by: https://github.com/EikanWang	2024-06-13 14:34:37 +00:00
Tom Ritchford	edb45dce85	Add OpInfo entry for as_strided_copy (#127231 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231 Approved by: https://github.com/lezcano	2024-06-13 13:58:47 +00:00
rzou	7cc07a3eb1	[custom_op] stop using nonlocals to store information (#128547 ) Fixes https://github.com/pytorch/pytorch/issues/128544 Fixes https://github.com/pytorch/pytorch/issues/128535 We had a problem with multithreading where the nonlocals were being clobbered. In the first place, we stored these nonlocals because we wanted to ferry information from an autograd.Function.apply to autograd.Function.forward. Our new approach is: - pass the information directly as an input to the autograd.Function.apply. This means that the autograd.Function.forward will receive the information too. - this messes up ctx.needs_input_grad, which has an element per input to forward. The user should not see the additional information we passed. We fix this by temporarily overriding ctx.needs_input_grad to the right thing. - this exposed a bug in that ctx.needs_input_grad wasn't correct for TensorList inputs. This PR fixes that too. Test Plan: - existing and new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547 Approved by: https://github.com/williamwen42, https://github.com/soulitzer	2024-06-13 13:36:39 +00:00
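The approach described in 7cc07a3eb1 (#128547), passing the information as an explicit input instead of stashing it in a nonlocal, can be illustrated with a minimal stdlib-only sketch. The names `apply_with_info`, `forward` and `run_threads` are hypothetical stand-ins for illustration and are not the actual custom_op or `autograd.Function` code.

```python
import threading


def forward(info, x):
    # The information arrives as an ordinary argument, so each call
    # sees exactly what its own caller passed in.
    return (info, x * 2)


def apply_with_info(info, x):
    # Hypothetical stand-in for autograd.Function.apply: the extra
    # information is forwarded explicitly instead of via a nonlocal
    # that another thread could overwrite in the meantime.
    return forward(info, x)


def run_threads(n):
    results = [None] * n

    def worker(i):
        # Each thread writes only to its own slot.
        results[i] = apply_with_info(f"thread-{i}", i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because nothing is shared between calls, the result for each thread depends only on its own arguments, regardless of scheduling.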
IvanKobzarev	2b9465d62a	[aota] Allow some mutations in backward (#128409 ) https://github.com/pytorch/pytorch/issues/127572 Allow mutations in backward on forward inputs if: 1/ the mutation does not touch metadata (enforced at compilation time); 2/ when create_graph=True, the mutated input does not require grad (enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled()). Adds input_joint_info to track mutations of inputs during joint tracing. It is a separate field in ViewAndMutationMeta because it is filled only after joint fn tracing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409 Approved by: https://github.com/bdhirsh	2024-06-13 12:09:08 +00:00
Laith Sakka	d0c08926d1	allow inlining functions in _python_dispatch and _is_make_fx_tracing (#128485 ) This fixes graph breaks in the torch_multimodal_clip benchmark. Co-authored-by: Animesh Jain <anijain@umich.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128485 Approved by: https://github.com/anijain2305 ghstack dependencies: #128428	2024-06-13 09:56:39 +00:00
Jiong Gong	1fd2cd26a0	[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 ) As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows: 1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling. 2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops. 3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen. 4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545 Approved by: https://github.com/jansel	2024-06-13 09:46:22 +00:00
Jason Ansel	c897651392	[inductor] Add BackendFeature gating (#128266 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266 Approved by: https://github.com/shunting314	2024-06-13 07:31:51 +00:00
Yu, Guangye	88974fedd0	Clean up xpu ut to make CI happy (#128383 ) # Motivation Before #127611 merged, the xpu-specific UT `test/test_xpu.py` was skipped temporarily. This PR aims to fix the UT bug introduced by #127741. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128383 Approved by: https://github.com/EikanWang	2024-06-13 07:06:41 +00:00
Eddie Yan	ce79b09415	[CUDA][Sparse] Change comparison function of `test_sparse_semi_structured.py` and bump tolerances for `sp24_matmuls` (#128553 ) Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553 Approved by: https://github.com/jcaip	2024-06-13 06:58:07 +00:00
Nikita Shulga	0678742924	[MPS] Add Metal implementation of exp op (#128421 ) To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor) Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU Fix bug in non-contiguous tensors handling Fixes https://github.com/pytorch/pytorch/issues/84936 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421 Approved by: https://github.com/kulinseth ghstack dependencies: #128373, #128375	2024-06-13 06:53:17 +00:00
Wang, Eikan	14c9eb5ed2	Add XPU code owners (#128486 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128486 Approved by: https://github.com/atalman, https://github.com/malfet	2024-06-13 06:33:45 +00:00
Catherine Lee	518c9e6455	Forward fix lint (#128587 ) merge at will After https://github.com/pytorch/pytorch/pull/125968 and https://github.com/pytorch/pytorch/pull/127693 landrace Pull Request resolved: https://github.com/pytorch/pytorch/pull/128587 Approved by: https://github.com/huydhn	2024-06-13 06:19:03 +00:00
Animesh Jain	c52eda896e	[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428 Approved by: https://github.com/yanboliang, https://github.com/mlazos ghstack dependencies: #126578, #128440, #128470, #128453, #128484	2024-06-13 06:08:56 +00:00
Animesh Jain	1f6e84fa68	[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484 Approved by: https://github.com/mlazos ghstack dependencies: #126578, #128440, #128470, #128453	2024-06-13 06:08:56 +00:00
Shaz Qadeer	ea541dd965	SymIntify cross_entropy_loss_prob_target numel call (#128141 ) This PR replaces call to ```numel``` with ```sym_numel``` in cross_entropy_loss_prob_target. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128141 Approved by: https://github.com/ezyang	2024-06-13 05:37:17 +00:00
Mengwei Liu	ade3d07483	GGML inspired int8 MM Metal shader (#127646 ) ## Context This PR ports the GGML int8 per channel matrix multiplication and matrix vector multiplication metal shaders into the ATen library. llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE ## Key Changes Made the following changes to the original code: * Memory layout of weight and scales is different from llama.cpp. * Weight dequantization (scales multiplication) is done after MM is finished. * Following PyTorch naming convention (M, K, N and assuming row major). ## Benchmark When M = 1, mv shader improves existing ATen int8mm by 40%. When M > 4, mm shader outperforms existing ATen int8mm up to 10x for a large M, as shown below. ![image](https://github.com/pytorch/pytorch/assets/8188269/fd9eff71-c538-4263-a7b5-f96fe479ae9d) Hence the kernel chooses different shaders based on M. ## Test Plan Tests are passing: ``` ❯ python test/test_mps.py -v -k _int8_ /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so Expected in: <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? warn( test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... 
ok test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok ---------------------------------------------------------------------- Ran 12 tests in 1.180s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646 Approved by: https://github.com/malfet	2024-06-13 05:23:56 +00:00
Michael Lazos	b86b4ace88	Invalidate eager params when inlining and freezing nn modules (#128543 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128543 Approved by: https://github.com/anijain2305	2024-06-13 04:50:17 +00:00
Xuehai Pan	83bb9b7c53	[BE] explicitly export subpackage `torch.utils` (#128342 ) Resolves #126401 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128342 Approved by: https://github.com/Skylion007 ghstack dependencies: #127707	2024-06-13 04:39:16 +00:00
Edward Z. Yang	2229884102	Introduce int_oo (#127693 ) In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range. After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better. But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. test/test_sympy_utils.py describes some basic properties of the number, and torch/utils/_sympy/numbers.py has the actual implementation. The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments. Fixes https://github.com/pytorch/pytorch/issues/127396 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693 Approved by: https://github.com/lezcano ghstack dependencies: #126905	2024-06-13 04:08:20 +00:00
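The motivation in 2229884102 (#127693) can be illustrated with a toy, stdlib-only model: with `sys.maxsize` as an upper bound, a size-increasing computation yields ever larger concrete integers, whereas a symbolic bound simply absorbs it. The class `ToyIntInfinity` below is hypothetical and is not the implementation in torch/utils/_sympy/numbers.py; it only models a positive infinity multiplied or added with plain ints.

```python
import sys


class ToyIntInfinity:
    """Toy positive 'integer infinity': absorbs growth, compares above any int."""

    # Unlike a floating-point infinity, it advertises itself as an integer.
    is_integer = True

    def __mul__(self, other):
        if isinstance(other, int) and other > 0:
            return self
        return NotImplemented  # zero or negative factors are out of scope here

    __rmul__ = __mul__

    def __add__(self, other):
        if isinstance(other, int):
            return self
        return NotImplemented

    __radd__ = __add__

    def __gt__(self, other):
        return isinstance(other, int)

    def __lt__(self, other):
        return False

    def __repr__(self):
        return "int_oo"


def doubled_upper_bound(upper):
    # A size-increasing computation such as size * 2 applied to an upper bound.
    return upper * 2
```

With a concrete bound, `doubled_upper_bound(sys.maxsize)` produces a larger Python integer each time it is applied; with the toy symbolic bound, the same object comes back unchanged.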
Shivam Raikundalia	d3b8230639	Fix profiler_kineto Clang errors (#128464 ) Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense. Test Plan: Make sure that all tests under static_tests/lint_root pass on Phabricator. Differential Revision: D58431005 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464 Approved by: https://github.com/aaronenyeshi	2024-06-13 03:10:50 +00:00
PyTorch MergeBot	d630e1e838	Revert "[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269 )" This reverts commit f2d7f235a684c593f5a1ff2ca0b47b47274bfe85. Reverted https://github.com/pytorch/pytorch/pull/128269 on behalf of https://github.com/anijain2305 due to incorrect ([comment](https://github.com/pytorch/pytorch/pull/128269#issuecomment-2164267320))	2024-06-13 03:04:26 +00:00
Jing Xu	7fe9ab9ccc	update amp example to device-agnostic (#127278 ) As support for Intel GPU has been upstreamed, this PR is to make the AMP example doc device-agnostic. Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127278 Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/svekars	2024-06-13 02:01:16 +00:00
cyy	3f9b8446cf	[8/N] Remove unused functions (#128499 ) Follows #128407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128499 Approved by: https://github.com/malfet	2024-06-13 01:15:11 +00:00
Xu Han	ede74940a1	optimize vec isa check dispatch logical. (#128320 ) Optimize the CPU vec ISA check dispatch by architecture; this makes the code easier to read and maintain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128320 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-13 01:06:34 +00:00
Yidi Wu	c1cd946818	[cond] add a set_ and data mutation expected failure test (#128457 ) A follow up of the discussion in https://github.com/pytorch/pytorch/pull/126936. Cond errors out early because of a graph break triggered by DelayGraphBreakVariable, which is created due to `aten.set_` [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/tensor.py#L366-L376). We might need to see what happened to this test if we allow graph break in higher order op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128457 Approved by: https://github.com/zou3519	2024-06-13 00:16:59 +00:00
soulitzer	c472cec565	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to raise an error if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user cleverly also saves the output of the in-place) Randomness, views - Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors) Tensor object preservation - We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors which care about the object identity of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. 
The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is a documented part of the public API. We call the policy function for all ops except detach because detach is itself called a different number of times by AC between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute, see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes a ctx object as its first argument. The ctx is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary. "bc-breaking" for existing users of the private API: - Existing policy functions must now change their return value to use the Enum. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-12 23:57:33 +00:00
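The policy-function contract described in c472cec565 (#125795), a ctx first argument carrying `is_recompute` and a possibly stateful policy object shared between forward and recompute, can be sketched without torch. Everything below is a hypothetical model for illustration: `Policy`, `Ctx`, `SaveEveryOtherMatmul` and `run_pass` are invented names, ops are plain strings, and only the enum member names are taken from the commit message.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    # Names follow the {MUST,PREFER}_{SAVE,RECOMPUTE} scheme from the message.
    MUST_SAVE = 0
    PREFER_SAVE = 1
    MUST_RECOMPUTE = 2
    PREFER_RECOMPUTE = 3


@dataclass
class Ctx:
    # Hypothetical context object; per the message it currently only
    # carries "is_recompute".
    is_recompute: bool


class SaveEveryOtherMatmul:
    """Stateful policy keeping separate counters for forward and recompute."""

    def __init__(self):
        self.counts = {False: 0, True: 0}

    def __call__(self, ctx, op, *args, **kwargs):
        if op != "matmul":
            return Policy.PREFER_RECOMPUTE
        n = self.counts[ctx.is_recompute]
        self.counts[ctx.is_recompute] += 1
        # Save the 1st, 3rd, 5th, ... matmul; prefer recomputing the others.
        return Policy.MUST_SAVE if n % 2 == 0 else Policy.PREFER_RECOMPUTE


def run_pass(policy_fn, ops, is_recompute):
    # Calls the same policy object once per op, as one pass would.
    ctx = Ctx(is_recompute=is_recompute)
    return [policy_fn(ctx, op) for op in ops]
```

Because the same policy object is reused for both passes, keying its state on `ctx.is_recompute` is one way to make the recompute pass reach the same decisions as the forward pass.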
pradeepfn	25b7537a27	doc comment typo fixes and improvements (#128512 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128512 Approved by: https://github.com/LucasLLC	2024-06-12 23:55:09 +00:00
Huamin Li	eb1db6702f	[2nd try][AOTI] Switch to use shim v2 (#128521 ) Test Plan: Sandcastle Differential Revision: D58470269 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128521 Approved by: https://github.com/desertfire	2024-06-12 23:44:24 +00:00
Andrey Talman	4423e1bbdc	[release] Increase version 2.4.0->2.5.0 (#128514 ) Same as https://github.com/pytorch/pytorch/pull/121974 Branch cut for 2.4.0 completed hence advance main version to 2.5.0 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128514 Approved by: https://github.com/malfet	2024-06-12 23:40:01 +00:00
Angela Yi	3bc2004f91	[ts_converter] Fix prim::dtype (#128517 ) Summary: prim::dtype has the signature `(Tensor a) -> int`, where it gets the dtype of the tensor and returns the integer corresponding to this dtype based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (e.g. torch.float32). This causes some control flow to behave incorrectly, specifically where it checks if `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always return False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch. Test Plan: 7/22 internal models are now convertible and runnable in eager and sigmoid! P1410243909 Reviewed By: jiashenC Differential Revision: D58469232 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517 Approved by: https://github.com/jiashenC	2024-06-12 23:02:50 +00:00
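The type mismatch described in 3bc2004f91 (#128517) can be reproduced in a small torch-free sketch. Dtypes are modelled as plain strings, the table is only an excerpt of the ScalarType enum order, and `prim_dtype_old` / `prim_dtype_fixed` are hypothetical names, not the converter's functions.

```python
# First few entries of the ScalarType enum, in order (an excerpt, not the full list).
SCALAR_TYPE_INDEX = {
    "uint8": 0,
    "int8": 1,
    "int16": 2,
    "int32": 3,
    "int64": 4,
    "float16": 5,
    "float32": 6,
    "float64": 7,
}


def prim_dtype_old(dtype_name):
    # Old conversion: returns the dtype itself (modelled here as its name),
    # so membership tests against integers can never succeed.
    return dtype_name


def prim_dtype_fixed(dtype_name):
    # Fixed conversion: returns the integer index, matching `(Tensor a) -> int`.
    return SCALAR_TYPE_INDEX[dtype_name]
```

With the old behaviour, `prim_dtype_old("float64") in [3, 5, 7]` is False even though float64 is one of the intended dtypes; with the integer mapping the same check is True.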
Edward Z. Yang	2fa6f80b13	Perform reciprocal optimization with foreach_div (#128433 ) Fixes https://github.com/pytorch/pytorch/issues/114165 Internal xref https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433 Approved by: https://github.com/awgu	2024-06-12 22:57:03 +00:00
Shaz Qadeer	8db4a41973	Use computeStorageNbytesContiguous if possible (#128515 ) ```at::detail::computeStorageNbytesContiguous``` does fewer data-dependent tests compared to ```at::detail::computeStorageNbytes```. Therefore, use of former is more likely to succeed with dynamic shapes. This PR detects is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative solution to the unsound solution I had first proposed in another [PR](#128141). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515 Approved by: https://github.com/ezyang	2024-06-12 22:53:06 +00:00
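The difference that 8db4a41973 (#128515) relies on can be sketched in plain Python. This is a simplified model under stated assumptions: overflow checks are omitted, and the function names are illustrative rather than the C++ signatures of `at::detail::computeStorageNbytes` and `at::detail::computeStorageNbytesContiguous`.

```python
def storage_nbytes_contiguous(sizes, itemsize, storage_offset=0):
    # Contiguous case: only the element count matters, so there is no
    # branching on the value of any individual size.
    numel = 1
    for s in sizes:
        numel *= s
    return itemsize * (storage_offset + numel)


def storage_nbytes_strided(sizes, strides, itemsize, storage_offset=0):
    # General case: needs a data-dependent check for empty dimensions,
    # which is the kind of test that is awkward with symbolic shapes.
    span = 1
    for size, stride in zip(sizes, strides):
        if size == 0:
            return 0
        span += stride * (size - 1)
    return itemsize * (storage_offset + span)
```

For a contiguous layout both functions agree, for example sizes `(2, 3)` with strides `(3, 1)` and a 4-byte item give 24 bytes either way; only the strided version has to branch on whether a size is zero.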
Prachi Gupta	e2610240f9	[ROCm] Enable several inductor UTs (#127761 ) Fixes #ISSUE_NUMBER Needs https://github.com/pytorch/pytorch/pull/125396 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127761 Approved by: https://github.com/peterbell10, https://github.com/pruthvistony	2024-06-12 22:47:45 +00:00
Joel Schlosser	bb3cf8a339	Lift inductor lowerings for jagged <-> padded dense kernels (#125968 ) This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops: * `_jagged_to_padded_dense_forward()` * `_padded_dense_to_jagged_forward()` * NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968 Approved by: https://github.com/davidberard98	2024-06-12 22:46:09 +00:00
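For readers unfamiliar with the jagged <-> padded dense conversions that bb3cf8a339 (#125968) provides lowerings for, here is a toy list-based model. It is only a sketch: the function names mirror the op names loosely, but the argument names and list representation are assumptions and do not reflect the ATen signatures.

```python
def jagged_to_padded_dense(values, offsets, max_length, padding_value=0):
    # offsets[i]:offsets[i + 1] delimits row i inside the flat `values` list.
    rows = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        row = values[start:end][:max_length]
        rows.append(row + [padding_value] * (max_length - len(row)))
    return rows


def padded_dense_to_jagged(padded, offsets):
    # The output length equals offsets[-1], i.e. it depends on the data in
    # `offsets` unless the caller supplies it up front (the role of total_L).
    values = []
    for row, start, end in zip(padded, offsets[:-1], offsets[1:]):
        values.extend(row[: end - start])
    return values
```

The second function shows why the output shape is data-dependent when `total_L` is not provided: the number of output elements is only known after reading the offsets.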
Sam Larsen	b4a7b543e5	Add targeted unit tests for guards-related functions used in the codecache (#128482 ) Summary: Add a few unit tests that exercise `produce_guards_expression` and `evaluate_guards_expression` (and specifically "ToFloat" "FloatTrueDiv" added in https://github.com/pytorch/pytorch/pull/128418) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128482 Approved by: https://github.com/ezyang ghstack dependencies: #128418	2024-06-12 22:41:50 +00:00
Wang, Eikan	1f302d6885	Support aten operations with out tensor (#124926 ) This PR intends to support the aten operations with the `out` tensor. Currently, the AOT compile always does NOT keep input tensor mutations. According to the comments, this is because it has not encountered such a use case. > For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to. However, for aten operations, it is popular that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph. Take `clamp` as an example as follows. ```python out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0) inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0) min_tensor = inp_tensor - 0.05 max_tensor = inp_tensor + 0.05 torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor) ``` W/O this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None return (clamp_max, clamp_max) ``` W/ this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); 
clamp_min = arg2_1 = None copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None return (copy_,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi	2024-06-12 22:31:59 +00:00
Shengbao Zheng	f4edd67fe7	[c10d] fix OSS commSplit bug (#128459 ) Summary: D56907877 modified OSS commSplit. However, commSplit must be called on every rank, even a rank that has no color. ncclCommSplit does not create a communicator for no-color ranks, so this line of code can throw an error like `NCCL WARN CommUserRank : comm argument is NULL`. Revert this change from D56907877. Test Plan: CI Differential Revision: D58436088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459 Approved by: https://github.com/shuqiangzhang	2024-06-12 22:29:01 +00:00
rzou	f39ab8a0fe	Fix side effect pruning (#128028 ) Summary: The previous side effect pruning algorithm would keep many dead cell variables alive. For example, in https://github.com/pytorch/pytorch/issues/125078, the compiled function has one return but there were three in the Dynamo graph due to two dead cell variables not being pruned away. This PR adds a corrected algorithm. "new cell variables" are alive if they can be reached from one of the following: 1. any of the tx.symbolic_locals or tx.stack (that is, if they are involved in a return from the function or intermediate variable during a graph break). Example: an alive NestedUserFunctionVariable 2. "mutations to pre-existing objects". Example: appending a NestedUserFunctionVariable to a global list The new algorithm reflects this, but please let me know if there are more cases to handle. Test Plan: - existing tests (afaict, test/dynamo/test_python_autograd is the best SideEffects test case we have) - see in test/dynamo/test_higher_order_ops that the expecttests changed -- the functorch dynamo graphs no longer return dead cellvars. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028 Approved by: https://github.com/jansel	2024-06-12 22:25:37 +00:00
cyy	3008644297	[Caffe2] Remove remaining unused perfkernels (#128477 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128477 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-06-12 22:19:36 +00:00
Sam Larsen	55a6b38f52	[inductor] enable fx graph cache on torchbench (#128239 ) Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239 Approved by: https://github.com/oulgen	2024-06-12 22:15:02 +00:00
Huy Do	6206da55ef	Fix lint after #119459 (#128558 ) TSIA Pull Request resolved: https://github.com/pytorch/pytorch/pull/128558 Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet	2024-06-12 22:11:37 +00:00
Animesh Jain	2b28b107db	[dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128453 Approved by: https://github.com/yf225 ghstack dependencies: #126578, #128440, #128470	2024-06-12 22:03:45 +00:00
James Wu	6aef2052ea	Save backward graphs lazily to cache (#126999 ) This PR makes it so we lazily save to the cache on backward call instead of saving ahead of time always. We have to pass a closure to post_compile to prevent cyclic dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999 Approved by: https://github.com/bdhirsh ghstack dependencies: #126791	2024-06-12 21:58:34 +00:00
rzou	87072dcfdb	Change Dynamo's custom ops warning message to be less spammy (#128456 ) This is a short-term fix (for 2.4). In the longer term we should fix https://github.com/pytorch/pytorch/issues/128430 The problem is that `warnings.warn` calls inside Dynamo print every time. Python warnings are supposed to print once unless their cache is reset, and Dynamo ends up resetting that cache every time it runs. As a workaround we provide our own warn_once cache that is keyed on the warning msg. I am not worried about this increasing memory usage because that's effectively what Python's warnings.warn cache does. Test Plan: - fix tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456 Approved by: https://github.com/anijain2305	2024-06-12 21:57:12 +00:00
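To illustrate the workaround described in 87072dcfdb, here is a minimal sketch of a warn-once helper keyed on the message text. It is not the code from the PR: the names `warn_once` and `_seen_messages` are chosen here for illustration, and the sketch only assumes the standard-library `warnings` module.

```python
import warnings

# Hypothetical process-wide cache of messages already emitted.
# It lives outside the `warnings` module, so it survives resets of
# Python's own "already warned" registry.
_seen_messages: set = set()


def warn_once(msg: str, category=UserWarning) -> bool:
    """Emit `msg` at most once per process; return True if it was emitted."""
    if msg in _seen_messages:
        return False
    _seen_messages.add(msg)
    # stacklevel=2 attributes the warning to the caller of warn_once.
    warnings.warn(msg, category, stacklevel=2)
    return True
```

The cache grows by one entry per distinct message, which matches the commit's remark that memory use is comparable to what Python's own warning registry keeps.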
haozhe.zhu	c53d65b3d3	[inductor] fix linear add bias pattern (#128473 ) Fix https://github.com/pytorch/pytorch/issues/128287. Previously, the assertions in `linear_add_bias` were fragile: ``` assert packed_weight_node.name == "_reorder_linear_weight" assert transpose_weight_node.name == "permute_default" ``` The `name` can change to `_reorder_linear_weight_id, permute_default_id` if the graph has more than one reorder/permute. Checking `target` instead of `name` solves this issue. The UT is also updated to match more than one `linear_add_bias` pattern to cover this case. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473 Approved by: https://github.com/jgong5	2024-06-12 21:55:35 +00:00
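The difference between matching on `name` and matching on `target` in c53d65b3d3 can be shown with a small stand-in. The `Node` class below is hypothetical (it is not `torch.fx.Node`), and the `_1` suffix is only an assumed example of how a second node's name might be made unique.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a graph node: `name` must be unique within
    # a graph, while `target` identifies the operation the node calls.
    name: str
    target: str


def is_reorder_by_name(node: Node) -> bool:
    # Fragile: only the first node of its kind keeps the bare name.
    return node.name == "_reorder_linear_weight"


def is_reorder_by_target(node: Node) -> bool:
    # Robust: every node calling the same op shares the same target.
    return node.target == "_reorder_linear_weight"


first = Node(name="_reorder_linear_weight", target="_reorder_linear_weight")
second = Node(name="_reorder_linear_weight_1", target="_reorder_linear_weight")
```

With two reorder nodes in one graph, the name check accepts only `first`, while the target check accepts both.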
Kurman Karabukaev	bb13fad7aa	Share TCPStore by default when using c10d rdzv handler (#128096 ) Summary: A number of features rely on TCPStore as a control plane. By default the TCPStore server is started on the rank0 trainer, which can create a race condition: rank0 may exit (on error or gracefully) and any other ranks still reading or writing will fail. Solution: TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server. Static rendezvous in torchelastic already supports sharing of the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler. Any handler that would like to manage the TCPStore has to: - Return true on `use_agent_store` property - `RendezvousInfo`.`RendezvousStoreInfo`#[`master_addr/master_port`] values refer to managed TCPStore (those are returned on `next_rendezvous` call) Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change) Test Plan: `cat ~/workspace/dist-demo/stores.py` ~~~ import torch import logging import sys import torch.distributed as dist import torch import os import time logger = logging.getLogger(__name__) logger.addHandler(logging.StreamHandler(sys.stderr)) logger.setLevel(logging.INFO) def _run_test(store): if dist.get_rank() == 1: logger.info("Rank %s is sleeping", dist.get_rank()) time.sleep(5) key = "lookup_key" logger.info("Checking key %s in store on rank %s", key, dist.get_rank()) store.check([key]) else: logger.info("rank %s done", dist.get_rank()) def main() -> None: use_gpu = torch.cuda.is_available() dist.init_process_group(backend="nccl" if use_gpu else "gloo") dist.barrier() logger.info(f"Hello World from rank {dist.get_rank()}") host = os.environ['MASTER_ADDR'] port = os.environ['MASTER_PORT'] world_size = os.environ['WORLD_SIZE'] logger.info("testing 
TCPStore") store = dist.TCPStore( host_name=host, port=int(port), world_size=int(world_size), ) _run_test(store) if __name__ == "__main__": main() ~~~ With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option) ~~~ (pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. WARNING:__main__: *************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. ************************************* Hello World from rank 1 Hello World from rank 2 Hello World from rank 0 testing TCPStore testing TCPStore testing TCPStore rank 2 done Rank 1 is sleeping rank 0 done Checking key lookup_key in store on rank 1 ~~~ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 ~~~ (pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified. WARNING:__main__: ************************************* Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*************************************** Hello World from rank 0 Hello World from rank 2 Hello World from rank 1 testing TCPStore testing TCPStore testing TCPStore rank 0 done rank 2 done Rank 1 is sleeping Checking key lookup_key in store on rank 1 [rank1]: Traceback (most recent call last): [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module> [rank1]: main() [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main [rank1]: _run_test(store) [rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test [rank1]: store.check([key]) [rank1]: torch.distributed.DistNetworkError: Connection reset by peer E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python Traceback (most recent call last): File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module> main() File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main run(args) File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run elastic_launch( File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ 
/home/kurman/workspace/dist-demo/stores.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-06-05_17:40:22 host : devgpu011.cln5.facebook.com rank : 1 (local_rank: 1) exitcode : 1 (pid: 2279237) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ ~~~ Differential Revision: D58180193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096 Approved by: https://github.com/shuqiangzhang	2024-06-12 21:49:42 +00:00
Michael Lazos	c0ea8fc3a3	Disable inlining nn modules on static inputs tests (#128529 ) With inlining NN modules these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529 Approved by: https://github.com/anijain2305 ghstack dependencies: #128528	2024-06-12 21:40:29 +00:00
Michael Lazos	ff3ba99320	Disable inline nn modules on unstable ptr test (#128528 ) With inlining NN modules these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528 Approved by: https://github.com/anijain2305	2024-06-12 21:40:29 +00:00
Andrea Frittoli	1026b7cfbe	Add docstring for the torch.typename function (#128129 ) Fixes: #127885 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128129 Approved by: https://github.com/malfet	2024-06-12 21:34:20 +00:00
Aaron Orenstein	cba840fde9	Fix accidental variable shadow (#128460 ) Fixes #128322 We should probably crank up clang's warning levels... Test: ``` import torch def addmv_slice(input, mat, vec, slice_op): vec = vec[slice_op] res = torch.addmv(input, mat, vec) # traced line: 25 return res torch._dynamo.reset() model_opt = torch.compile(addmv_slice) input = torch.empty(size=[11]).uniform_(-1, 1) mat = torch.empty([11, 128]).uniform_(-10.0, 20.0) vec = torch.empty([256]).uniform_(-10.0, 20.0) slice_op = slice(None, None, 2) out = model_opt(input, mat, vec, slice_op) vec = torch.empty([384]).uniform_(-10.0, 20.0) slice_op = slice(None, None, 3) out = model_opt(input, mat, vec, slice_op) ``` Before this change the test fails with: ``` torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function getitem>(*(FakeTensor(..., size=(s0,)), slice(None, None, s1)), **{}): slice step cannot be zero ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128460 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-06-12 21:14:04 +00:00
Zhengxu Chen	0444e89931	[export] Remove replace_sym_size_ops_pass (#128443 ) Summary: Not needed anymore. Test Plan: CI Differential Revision: D58429458 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128443 Approved by: https://github.com/angelayi	2024-06-12 21:03:06 +00:00
Joel Schlosser	67e6c76a18	Support apply_(callable) sugar for CPU NJTs (#125416 ) Example: ```python nt.apply_(lambda x: x * 2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125416 Approved by: https://github.com/soulitzer	2024-06-12 20:30:57 +00:00
Xuehai Pan	dd143d44cc	[BE] enable UFMT for top-level files `torch/*.py` (#127707 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127707 Approved by: https://github.com/ezyang	2024-06-12 20:15:05 +00:00
James Wu	cc231a8e2b	First version of AOTAutogradCache (#126791 ) This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry. Each AOTAutogradCacheEntry has: - A CompiledForward and optionally a CompiledBackward - A bunch of metadata. CompiledForward and CompiledBackward each save the key to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit. On cache miss: - Run AOTAutograd, up to AOTAutogradDispatch.post_compile. - Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we always compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario. - Return the resulting object On cache hit: - Run AOTAutogradCacheEntry.post_compile() on the cache key. - This attempts to load the forward and backward graphs from FXGraphCache - As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata. For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off. V0 Guards behavior: FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does not mean that the tensor values passed in are the same: only that their symints are). 
That is, AOTAutograd and Inductor never create new guards based on symints with different sources than those passed to it by inductor. We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff. Testing: We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791 Approved by: https://github.com/bdhirsh	2024-06-12 20:04:44 +00:00
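The hit/miss rule described for AOTAutogradCache in cc231a8e2b (an entry stores keys into FXGraphCache, and a lookup counts as a hit only if those keys still resolve) can be sketched with plain dictionaries. This is an illustration of the described flow, not the PR's implementation: `Entry`, `fx_graph_cache`, `aot_cache`, and `lookup` are hypothetical names, and guards and post-compile wrappers are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    # Hypothetical analogue of an AOTAutogradCacheEntry: it stores keys into
    # the inner graph cache rather than the compiled objects themselves.
    forward_key: str
    backward_key: Optional[str] = None
    metadata: dict = field(default_factory=dict)


fx_graph_cache: Dict[str, Callable] = {}  # stand-in for FXGraphCache
aot_cache: Dict[str, Entry] = {}          # stand-in for AOTAutogradCache


def lookup(key: str) -> Optional[Tuple[Callable, Optional[Callable], dict]]:
    """Return (forward, backward, metadata) on a hit, or None on a miss."""
    entry = aot_cache.get(key)
    if entry is None:
        return None
    forward = fx_graph_cache.get(entry.forward_key)
    if forward is None:
        return None  # inner cache lost the forward graph: treat as a miss
    backward = None
    if entry.backward_key is not None:
        backward = fx_graph_cache.get(entry.backward_key)
        if backward is None:
            return None  # backward graph was expected but is gone
    return forward, backward, entry.metadata
```

On a miss, the caller would compile as usual and then store a new `Entry` once the inner cache has handed back its keys.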
Wanchao Liang	7775fee10f	[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431 ) as titled, this PR refactors the PrepareModuleInput style to have common method prepare_input_arg, allow both args/kwargs to reuse this logic This also fixes https://github.com/pytorch/pytorch/issues/128365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431 Approved by: https://github.com/awgu	2024-06-12 19:16:33 +00:00
Joel Schlosser	ec1fdda196	Fix jagged NT softmax semantics (#119459 ) Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong) After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459 Approved by: https://github.com/soulitzer	2024-06-12 19:12:03 +00:00
PyTorch MergeBot	817ce6835b	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit 4c971932e839fc5da2b91906ad028d4654932bca. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))	2024-06-12 18:47:52 +00:00
DanilBaibak	6d1b1ddd3e	Select Runner Label Dynamically (#127287 ) Updated `get_workflow_type.py` logic to dynamically select a prefix for the runner label. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127287 Approved by: https://github.com/ZainRizvi	2024-06-12 18:47:47 +00:00
PyTorch MergeBot	7db501ba2b	Revert "[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350 )" This reverts commit 45dccfddcd8fce804f50075484421ade27f1f021. Reverted https://github.com/pytorch/pytorch/pull/128350 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128350#issuecomment-2163669538))	2024-06-12 18:35:18 +00:00
mori360	d71f92213c	[DSD] keep 'exp_avg' as DTensor after torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#128004 ) Fixes #126950 `ptd_state_dict` with `broadcast_from_rank0=False` might miss 2 condition checks in the `set_optimizer_state_dict` Here we add another condition `full_state_dict=True` with corresponding tensor distribution without broadcasting if broadcast_from_rank0=False Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004 Approved by: https://github.com/fegin	2024-06-12 18:14:56 +00:00
Tharindu Patabandi	624e8ae491	Documentation for is_dependent function (#128197 ) Docstring for torch.distributions.constraints.is_dependent Fixes #127900 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128197 Approved by: https://github.com/fritzo, https://github.com/malfet	2024-06-12 17:50:41 +00:00
Shashank Shekhar	a70a7337d2	Update torch.nanmean() docstring to mention input dtype requirement (#128155 ) Fixes #120570 ## Description Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex. Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated. ## Checklist - [X] The issue that is being fixed is referred in the description. - [X] Only one issue is addressed in this pull request. - [x] Labels from the issue that this PR is fixing are added to this pull request. - [X] No unnecessary issues are included into this pull request. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155 Approved by: https://github.com/malfet	2024-06-12 17:46:36 +00:00
anandptl84	0f52dc7e51	Document `torch.cuda.profiler.stop` (#128196 ) Fixes #127918 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128196 Approved by: https://github.com/malfet, https://github.com/eqy	2024-06-12 17:39:43 +00:00
PyTorch MergeBot	5001f41b90	Revert "Make TraceUtils.h to be device-agnostic (#126969 )" This reverts commit 648625b230e8e6e7478fb219ff4f0aa6a45070f5. Reverted https://github.com/pytorch/pytorch/pull/126969 on behalf of https://github.com/clee2000 due to failing internal builds D58443769 ([comment](https://github.com/pytorch/pytorch/pull/126969#issuecomment-2163462600))	2024-06-12 16:32:57 +00:00
PyTorch MergeBot	f89574fa23	Revert "Pass params to dump_nccl_trace_pickle (#128307 )" This reverts commit eb567b1f40233667b982f81e3a75deec0fdfd9ca. Reverted https://github.com/pytorch/pytorch/pull/128307 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert 126969 ([comment](https://github.com/pytorch/pytorch/pull/128307#issuecomment-2163459399))	2024-06-12 16:29:51 +00:00
PyTorch MergeBot	81e4e12f02	Revert "Support aten operations with out tensor (#124926 )" This reverts commit cba195c8edd6c7149036ef0767772d11fff5390e. Reverted https://github.com/pytorch/pytorch/pull/124926 on behalf of https://github.com/clee2000 due to newly added test broke in internal D58444103. Test passed in OSS CI though ([comment](https://github.com/pytorch/pytorch/pull/124926#issuecomment-2163441547))	2024-06-12 16:20:04 +00:00
PyTorch MergeBot	c5172b8de8	Revert "[AOTI] Switch to use shim v2 (#127674 )" This reverts commit 9a38cae299e5ffd8143182bec878c28f96cfd72a. Reverted https://github.com/pytorch/pytorch/pull/127674 on behalf of https://github.com/clee2000 due to tests failed internally D56709309 ([comment](https://github.com/pytorch/pytorch/pull/127674#issuecomment-2163436728))	2024-06-12 16:17:07 +00:00
Xu Han	9e39c62908	correct avx512_vnni isa name. (#128318 ) `x86` currently has two vnni ISAs: `avx2_vnni` and `avx512_vnni`. This PR corrects the function name to `avx512_vnni`. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2024-06-12 16:12:49 +00:00
PyTorch MergeBot	f2dcbe89d6	Revert "Prevent expansion of cat indexing to avoid int64 intermediate (#127815 )" This reverts commit 793df7b7cb1473004837f5867f4c1c4b2b0f751d. Reverted https://github.com/pytorch/pytorch/pull/127815 on behalf of https://github.com/clee2000 due to the newly added test is failing internally D58444153. Test exists in opensource and passed in OSS CI, maybe env difference? ([comment](https://github.com/pytorch/pytorch/pull/127815#issuecomment-2163421968))	2024-06-12 16:09:22 +00:00
Kulin Seth	8df56afc20	Add support in Python API for the recommended max working set size. (#128289 ) Adds ways for users to request recommended max size for Metal on Mac. It plumbs through https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc Can be used like ``` max_memory = torch.mps.recommended_max_memory() print ("Recommended Max Memory : ", (max_memory/(1024*1024*1024)), "GB") ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289 Approved by: https://github.com/malfet	2024-06-12 16:03:57 +00:00
Jeff Daily	b19c2319e4	[ROCm] TunableOp for gemm_and_bias (#128143 ) Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap. This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked. CC @xw285cornell Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143 Approved by: https://github.com/Skylion007	2024-06-12 15:53:39 +00:00
Aaron Orenstein	3c971d2ef3	Flip default value for mypy disallow_untyped_defs [final] (#127836 ) Not requiring all functions to have types allows a lot of 'Any' types to slip in, which poison types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types. The preceding stack of PRs (cut up simply to keep the number of file changes per PR "reasonable") adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts it will probably be necessary to have several passes through before landing this final PR which turns the option on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-06-12 15:28:42 +00:00
PyTorch MergeBot	15ab636007	Revert "Fix side effect pruning (#128028 )" This reverts commit a55d0d9718c11eb2897423c78eff18b168dd0a06. Reverted https://github.com/pytorch/pytorch/pull/128028 on behalf of https://github.com/clee2000 due to broke test in internal D58443816. Test exists in external too though ([comment](https://github.com/pytorch/pytorch/pull/128028#issuecomment-2163249251))	2024-06-12 14:55:57 +00:00
Wu, Chunyuan	5ef70faaa7	Revert "Make torch_geometric models compatible with export (#123403 )" (#128377 ) This reverts commit d78991a7381adb3df5e9b63c365db4506643edce. This PR reverts https://github.com/pytorch/pytorch/pull/123403 to fix the performance regression as discussed in https://github.com/pytorch/pytorch/issues/127513#issuecomment-2158835653. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128377 Approved by: https://github.com/jgong5, https://github.com/angelayi, https://github.com/desertfire	2024-06-12 14:53:01 +00:00
PyTorch MergeBot	71f491554c	Revert "First version of AOTAutogradCache (#126791 )" This reverts commit abc3eec22d38079bee855fbcb75da62a9558284c. Reverted https://github.com/pytorch/pytorch/pull/126791 on behalf of https://github.com/DanilBaibak due to The changes broke a number of linux jobs ([comment](https://github.com/pytorch/pytorch/pull/126791#issuecomment-2163081643))	2024-06-12 13:59:29 +00:00
James Wu	abc3eec22d	First version of AOTAutogradCache (#126791 ) This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry. Each AOTAutogradCacheEntry has: - A CompiledForward and optionally a CompiledBackward - A bunch of metadata. CompiledForward and CompiledBackward each save the key to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit. On cache miss: - Run AOTAutograd, up to AOTAutogradDispatch.post_compile. - Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we always compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario. - Return the resulting object On cache hit: - Run AOTAutogradCacheEntry.post_compile() on the cache key. - This attempts to load the forward and backward graphs from FXGraphCache - As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata. For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off. V0 Guards behavior: FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does not mean that the tensor values passed in are the same: only that their symints are). 
That is, AOTAutograd and Inductor never create new guards based on symints with different sources than those passed to it by inductor. We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff. Testing: We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791 Approved by: https://github.com/bdhirsh	2024-06-12 13:44:30 +00:00
Xia, Weiwen	2e065f2486	[Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#127592 ) Fixes #127402 - Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py - Add checks of mutation for QLinearPointwiseBinaryPT2E Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592 Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee	2024-06-12 10:49:16 +00:00
Xuehai Pan	46a35a1ed4	[BE] enable UFMT for `torch/__init__.py` (#127710 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127710 Approved by: https://github.com/ezyang ghstack dependencies: #127703, #127708, #127709	2024-06-12 10:40:23 +00:00
Xuehai Pan	26433b86de	[BE][Easy] sort `__all__` in `torch/__init__.py` (#127709 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127709 Approved by: https://github.com/ezyang ghstack dependencies: #127703, #127708	2024-06-12 10:21:36 +00:00
Tom Ritchford	2386045e4f	Add OpInfo entry for alias_copy (#127232 ) (#128142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142 Approved by: https://github.com/lezcano	2024-06-12 09:39:58 +00:00
Jiong Gong	1edcb31d34	[RELAND][inductor][cpp] bf16/fp16 gemm template computed with fp32 (#128472 ) reland for https://github.com/pytorch/pytorch/pull/126068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128472 Approved by: https://github.com/desertfire	2024-06-12 08:37:16 +00:00
Animesh Jain	ebb00a92bd	[dynamo] Skip freezing expect failure for inlining inbuilt nn modules (#128470 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128470 Approved by: https://github.com/mlazos ghstack dependencies: #126578, #128440	2024-06-12 08:21:50 +00:00
Animesh Jain	1602c7d0c8	[dynamo] Enable some inlining inbuilt nn module tests (#128440 ) Co-authored-by: Laith Sakka <lsakka@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440 Approved by: https://github.com/williamwen42, https://github.com/jansel ghstack dependencies: #126578	2024-06-12 08:21:50 +00:00
Xuehai Pan	04037f3d22	[BE] sort imports in `torch/__init__.py` (#127708 ) ---- - Sort import via `usort` - Change relative import `from . import xxx` to absolute import `from torch import xxx` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127708 Approved by: https://github.com/ezyang ghstack dependencies: #127703	2024-06-12 08:03:54 +00:00
Eddie Yan	0b331fd5d7	[CUDA] Abate `SoftMax.cu` compiler warning spam (#128468 ) Avoids excessively spammy warnings such as ``` pytorch/aten/src/ATen/native/cuda/SoftMax.cu(844): warning #191-D: type qualifier is meaningless on cast type [&] { const auto& the_type = input.scalar_type(); constexpr const char* at_dispatch_name = "host_softmax"; at::ScalarType _st = ::detail::scalar_type(the_type); ; switch (_st) { case at::ScalarType::Double: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Double)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Double), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Double>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = 
(at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= 
(!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Float: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Float)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? 
If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Float), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Float>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), 
"/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Half: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Half)) { do { 
::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Half), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Half>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, 
input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); 
c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::BFloat16: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::BFloat16)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::BFloat16), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::BFloat16>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const 
uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_sizesizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, 
scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } default: do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false. " "(Could this error message be improved? If so, " "please report an enhancement request to PyTorch.)", ::c10::str('"', at_dispatch_name, "\" not implemented for '", toString(_st), "'")))); }; } while (false); } }() ``` and ``` SoftMax.cu:844: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘long unsigned int’ [-Wsign-compare] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128468 Approved by: https://github.com/valentinandrei	2024-06-12 07:47:14 +00:00
Sam Larsen	8b3daf1768	Add FloatTrueDiv and ToFloat to SYMPY_INTERP (#128418 ) Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this: ``` Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ... ``` It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprising to me is that we're only hitting this problem with the FX graph cache enabled. We can create such guards, but we've never actually evaluated any? Test Plan: `TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418 Approved by: https://github.com/ezyang	2024-06-12 06:26:43 +00:00
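The idea of a name-to-function interpretation map, as described in the commit above, can be sketched in plain Python. This is only an illustration under stated assumptions: the table `INTERP`, the tuple-based expression encoding, and the helpers `evaluate` and `ceil_term` are hypothetical and are not the actual `SYMPY_INTERP` implementation in PyTorch.

```python
import math
import operator

# Hypothetical interpretation table (an assumption for illustration, not
# PyTorch's SYMPY_INTERP): maps an operator name to a Python callable.
INTERP = {
    "ToFloat": float,
    "FloatTrueDiv": operator.truediv,
    "CeilToInt": math.ceil,
    "Ne": operator.ne,
    "Mul": operator.mul,
    "Sub": operator.sub,
}


def evaluate(expr, env):
    """Evaluate a nested tuple expression such as ("Ne", a, b).

    Tuples are operator applications, strings are symbol names looked up
    in `env`, and anything else is treated as a literal.
    """
    if isinstance(expr, tuple):
        name, *args = expr
        if name not in INTERP:
            # This is the failure mode the commit describes: an operator
            # appears in a guard but has no registered interpretation.
            raise KeyError(f"no interpretation registered for {name!r}")
        return INTERP[name](*(evaluate(arg, env) for arg in args))
    if isinstance(expr, str):
        return env[expr]
    return expr


def ceil_term(symbol):
    """Build CeilToInt(FloatTrueDiv(ToFloat(8*symbol) - 4.0, 8.0))."""
    scaled = ("ToFloat", ("Mul", 8, symbol))
    return ("CeilToInt", ("FloatTrueDiv", ("Sub", scaled, 4.0), 8.0))


# A guard with the same shape as the one quoted in the commit message.
guard = ("Ne", ("Mul", ceil_term("t0"), ceil_term("t1")), ceil_term("t1"))
```

Removing an entry such as `"ToFloat"` from the table makes `evaluate` raise `KeyError` as soon as a guard that uses it is evaluated, which is the kind of gap that only shows up once guards are actually evaluated rather than merely created.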
PyTorch MergeBot	a421699998	Revert "[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431 )" This reverts commit 089f9a116ac8b2c14d6351b52614b529caba126b. Reverted https://github.com/pytorch/pytorch/pull/128431 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Your changes broke the linter. Here you can find more details - `089f9a116a` ([comment](https://github.com/pytorch/pytorch/pull/128431#issuecomment-2162197858))	2024-06-12 06:25:53 +00:00
Xuehai Pan	dcc0093dba	[BE][Easy] export explicitly imported public submodules (#127703 ) Add top-level submodules `torch.{storage,serialization,functional,amp,overrides,types}` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127703 Approved by: https://github.com/ezyang	2024-06-12 05:52:18 +00:00
diwei sun	62311257ad	Add 1 test case for Convtranspose1D in op microbenchmark (#127216 ) Operator ConvTranspose1d suffers a performance regression with a specific shape, #120982. This PR adds that shape to the op-level benchmark. I reproduced the regression for ConvTranspose1d with shape [2016, 1026, 1024, 256, 1, 224]. Here is the summary: Hardware info: Intel SPR8480-56cores per socket with frequency=2.1G. Performance comparison between torch 1.13 vs. torch 2.2 Benchmarking PyTorch1.13: ConvTranspose1d Mode: Eager Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu Forward Execution Time (s) : 0.96s Benchmarking PyTorch2.2: ConvTranspose1d Mode: Eager Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu Forward Execution Time (s) : 7.988s Also benchmarking for 7 rounds to check the variance.
| | Round1 | Round2 | Round3 | Round4 | Round5 | Round6 | Round7 | Normalized Variance |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Pytorch1.13 | 0.971 | 0.972 | 0.969 | 0.970 | 0.972 | 0.970 | 0.971 | 0.0002% |
| Pytorch 2.2 | 8.064 | 8.053 | 8.027 | 7.927 | 7.971 | 7.929 | 7.902 | 0.0059% |
| Ratio v2.2 vs. v1.13 (lower is better) | 8.31 | 8.28 | 8.29 | 8.18 | 8.20 | 8.18 | 8.14 | |
Reproduce script: numactl -N 0 python -m pt.conv_test Pull Request resolved: https://github.com/pytorch/pytorch/pull/127216 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman	2024-06-12 05:33:54 +00:00
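The ratio row in the benchmark summary above is the per-round PyTorch 2.2 time divided by the PyTorch 1.13 time. A minimal sketch that recomputes it from the rounded timings in the commit message follows; the variable names are illustrative, and because the inputs are already rounded, the recomputed values may differ slightly from the listed row in the second decimal place.

```python
# Per-round forward execution times in seconds, copied from the commit message.
torch_1_13 = [0.971, 0.972, 0.969, 0.970, 0.972, 0.970, 0.971]
torch_2_2 = [8.064, 8.053, 8.027, 7.927, 7.971, 7.929, 7.902]

# Ratio row as listed in the commit message.
listed_ratio = [8.31, 8.28, 8.29, 8.18, 8.20, 8.18, 8.14]

# Slowdown factor per round: newer time divided by older time.
recomputed = [new / old for new, old in zip(torch_2_2, torch_1_13)]

for round_no, (ratio, listed) in enumerate(zip(recomputed, listed_ratio), start=1):
    print(f"Round{round_no}: recomputed {ratio:.3f}, listed {listed:.2f}")
```

Either way the conclusion of the commit message is unchanged: the 2.2 build is roughly eight times slower on this shape.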
Wanchao Liang	089f9a116a	[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431 ) As titled, this PR refactors the PrepareModuleInput style to have a common method, prepare_input_arg, so that both args and kwargs can reuse this logic. This also fixes https://github.com/pytorch/pytorch/issues/128365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431 Approved by: https://github.com/awgu	2024-06-12 05:22:24 +00:00
Natalia Gimelshein	77a0ca66e4	Add threadfence to 2-stage reduction for correct writes visibility (#128455 ) The final block accumulating the 2-stage reduction result has to complete the acquire pattern to make sure the writes of all other blocks are visible to it, see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=atom#release-and-acquire-patterns Pull Request resolved: https://github.com/pytorch/pytorch/pull/128455 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-06-12 04:13:36 +00:00
Animesh Jain	c0b87afcad	[RELAND2][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 ) Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically. Fixes https://github.com/pytorch/pytorch/issues/111837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578 Approved by: https://github.com/jansel	2024-06-12 04:09:23 +00:00
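The point about `__init__` issuing STORE_ATTR can be seen directly in CPython bytecode. The sketch below uses only the standard `dis` module; `TinyModule` and `attribute_stores` are hypothetical names chosen for illustration, and the class is a plain Python stand-in rather than an `nn.Module`.

```python
import dis


class TinyModule:
    """Plain-Python stand-in for a module whose __init__ assigns members."""

    def __init__(self):
        self.weight = 1.0
        self._modules = {}


def attribute_stores(func):
    """Return the attribute names written by STORE_ATTR in func's bytecode."""
    return [
        ins.argval
        for ins in dis.get_instructions(func)
        if ins.opname == "STORE_ATTR"
    ]


print(attribute_stores(TinyModule.__init__))
```

Each `self.<name> = ...` assignment compiles to one STORE_ATTR instruction, so a tracer that steps through `__init__` observes every member as it is first written.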
loganthomas	02e7519ac3	DOC: strip inaccurate either float32 or float64 statement from set_default_type (#128192 ) Fixes #126647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128192 Approved by: https://github.com/malfet	2024-06-12 03:57:48 +00:00
cyy	8cf302dce4	[5/N] Change static functions in headers to inline (#128406 ) Follows #128286 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128406 Approved by: https://github.com/ezyang	2024-06-12 03:25:54 +00:00
Kazuaki Ishizaki	86b5df3e71	Documenting the torch.fx.annotate.annotate function (#128337 ) Fixes #127903 This PR adds docstring to the `torch.fx.annotate.annotate` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128337 Approved by: https://github.com/malfet	2024-06-12 03:06:32 +00:00
Tuan Trieu	7c2058338a	Improve convert fp32 to fp16 fx pass (#127829 ) Summary: Improve the convert fp32 to fp16 fx pass to use to_dtype node and const folding instead of inplace conversion. Test Plan: ``` buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder ``` Differential Revision: D57803843 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829 Approved by: https://github.com/Skylion007	2024-06-12 02:50:37 +00:00
PyTorch MergeBot	3ddec713b8	Revert "[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177 )" This reverts commit cac7a22b92478d897488688010e562b7bd36b97f. Reverted https://github.com/pytorch/pytorch/pull/128177 on behalf of https://github.com/clee2000 due to broke test/test_quantization.py::TestQuantizedLinear::test_qlinear_cudnn on sm86 tests `cac7a22b92` https://github.com/pytorch/pytorch/actions/runs/9470648757/job/26100448913. Probably a landrace, test ran on the PR and succeed ([comment](https://github.com/pytorch/pytorch/pull/128177#issuecomment-2161977110))	2024-06-12 02:20:15 +00:00
William Wen	85eeb90d2c	[dynamo] Fix graph breaks related to HF ModelOutput (#127780 ) Fixes https://github.com/pytorch/pytorch/issues/126028 and https://github.com/pytorch/pytorch/issues/126027. Changes: - Support building `CustomizedDictVariable` in `VariableBuilder` (but only for HF `ModelOutput` subclasses) - Remove `DataClassVariable` since it's not really being used anywhere (`CustomizedDictVariable` can be used instead) - Support side effects for `CustomizedDictVariable` - Allow `NO_HASATTR` leaf guard on `DictSubclassGuardManager` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127780 Approved by: https://github.com/jansel, https://github.com/anijain2305	2024-06-12 02:16:24 +00:00
Sam Larsen	7f6daf289b	[inductor] parallel compile: set LD_LIBRARY_PATH for sub-processes in internal (#128376 ) Test Plan: `TORCHINDUCTOR_WORKER_START=subprocess TORCHINDUCTOR_COMPILE_THREADS=16 buck run mode/opt scripts/slarsen/torch_compile:run` Differential Revision: D58371264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128376 Approved by: https://github.com/eellison	2024-06-12 01:55:53 +00:00
Jiashen Cao	3d55d84ec2	[Fix] Check tensor dtype before using torch.allclose in _trace log (#128438 ) #### Issue `torch.allclose` errors out during logging due to different dtypes. #### Test * `pytest test/test_jit.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128438 Approved by: https://github.com/angelayi	2024-06-12 01:52:09 +00:00
Wei Chen	bb2a995529	Back out "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466 )" (#128432 ) Summary: Original commit changeset: c7d2e6b13922 Original Phabricator Diff: D57618942 Differential Revision: D58383241 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128432 Approved by: https://github.com/ezyang, https://github.com/Yuzhen11	2024-06-12 01:34:32 +00:00
cyy	9538bf4e7c	[2/N] Remove inclusion of c10/util/string_utils.h (#128372 ) Follows #128300. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128372 Approved by: https://github.com/aaronenyeshi	2024-06-12 01:18:20 +00:00
cyy	219da29dfd	[7/N] Remove unused functions (#128407 ) Follows #128309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128407 Approved by: https://github.com/ezyang	2024-06-12 01:10:33 +00:00
cyy	fb013ecb24	Remove unused private List::ptr_to_first_element (#128405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128405 Approved by: https://github.com/ezyang	2024-06-12 01:07:14 +00:00
Kurman Karabukaev	6af4c6acad	Migrate test to internal base class, fixes (#128367 ) Summary:
## Remove etcd deps
converted tests to non-etcd based rdzv handler so that tests don't have dependency on etcd server
## Adopt pytorch test conventions
- test starts with `test_TESTS.py`
- Test base class is torch.testing._internal.common_utils.TestCase
- include __main__ handler
## reduce test timing (used to take > 300 seconds):
3.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_env_with_torchelastic
2.59s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_tcp_with_torchelastic
2.33s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_worker_raise_exception
2.33s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_run_path
2.30s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_auto_configurations
2.24s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched_with_logs_spec_defined
2.24s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched
2.17s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_multiple_agents
2.12s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic
2.08s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations
1.32s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_standalone
1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_number_configurations
1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_with_env_vars
1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python
1.05s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python_caffe2_bc
1.04s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_bash
1.03s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_default_nproc
0.04s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_logs_logs_spec_entrypoint_must_be_defined
0.01s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_agent_raise_exception
0.01s call test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_shutdown
Test Plan: pytest --durations=0 test/distributed/launcher/run_test.py Differential Revision: D58388182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128367 Approved by: https://github.com/d4l3k	2024-06-12 01:03:40 +00:00
Bin Bao	786c24a4cd	[inductor] Always realize sigmoid for CPU (#128339 ) Summary: Currently the cpu backend prefers to always realize exp because it's a heavy op on CPU. For the same reason, we need to realize sigmoid as well. This solves a problem in llama2 inference where exp was repeated many times in an inner loop. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128339 Approved by: https://github.com/eellison, https://github.com/helloguo, https://github.com/jansel, https://github.com/jgong5, https://github.com/peterbell10	2024-06-12 00:46:33 +00:00
PyTorch MergeBot	5d8c7f39d4	Revert "Introduce int_oo (#127693 )" This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c. Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))	2024-06-11 23:36:08 +00:00
PyTorch MergeBot	c9c1fed065	Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 )" This reverts commit c13e03c87428b986972a48d8fc78dbffc2579f63. Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))	2024-06-11 23:34:03 +00:00
Andrew Hoblitzell	94fea82d66	init sub comment (#128082 ) Fixes #127905 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082 Approved by: https://github.com/titaiwangms	2024-06-11 22:42:35 +00:00
Andrea Frittoli	447173198b	Add docstring for the torch.fx.operator_schemas.create_type_hint func… (#128139 ) Fixes: #127916 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128139 Approved by: https://github.com/SherlockNoMad	2024-06-11 22:42:11 +00:00
angelayi	b79d056e76	[export] Fix unflattener for preserving modules containing unused inputs (#128260 ) Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs. This also fixes unflattener issues in D57829276. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260 Approved by: https://github.com/pianpwk	2024-06-11 22:32:08 +00:00
Chirag Pandya	eb567b1f40	Pass params to dump_nccl_trace_pickle (#128307 ) Summary: Pass parameters from request to dump_nccl_trace_pickle handler. The supported parameters + value are all lowercase. includecollectives={true, false} includestacktraces={true, false} onlyactive={true, false} Example post is: /handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true Test Plan: unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307 Approved by: https://github.com/d4l3k ghstack dependencies: #128191	2024-06-11 22:28:53 +00:00
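The request format described in the commit above (#128307) can be illustrated with a small parser. This is a minimal sketch, not PyTorch's handler: `parse_dump_params` is a hypothetical helper, and the defaults it applies when a parameter is absent are assumptions. Only the three parameter names, the lowercase `true`/`false` values, and the example path come from the commit message.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names come from the commit message; the default values are assumed.
ASSUMED_DEFAULTS = {
    "includecollectives": True,
    "includestacktraces": True,
    "onlyactive": False,
}


def parse_dump_params(path: str) -> dict:
    """Hypothetical helper: extract the boolean flags from a request path."""
    query = parse_qs(urlsplit(path).query)
    flags = dict(ASSUMED_DEFAULTS)
    for name in ASSUMED_DEFAULTS:
        if name in query:
            value = query[name][-1]  # last occurrence wins if repeated
            # Values are lowercase only, as stated in the commit message.
            if value not in ("true", "false"):
                raise ValueError(f"{name} must be 'true' or 'false', got {value!r}")
            flags[name] = value == "true"
    return flags


example = (
    "/handler/dump_nccl_trace_pickle"
    "?includecollectives=true&includestacktraces=false&onlyactive=true"
)
print(parse_dump_params(example))
```

Rejecting anything other than the two lowercase literals keeps a mistyped `True` from silently falling back to a default.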
Chirag Pandya	1dd2431f86	[Test] Add test for only_active flag (#128191 ) Summary: Add a unit test for the only_active flag to _dump_nccl_trace API call. With this flag, we only expect active records to be returned. Test Plan: Unit test. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191 Approved by: https://github.com/d4l3k	2024-06-11 22:26:01 +00:00
Andrew Hoblitzell	5fcb5f0c8b	init reshape_from_tensor_shape comment (#128171 ) Fixes #127897 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171 Approved by: https://github.com/titaiwangms	2024-06-11 21:56:33 +00:00
rzou	a55d0d9718	Fix side effect pruning (#128028 ) Summary: The previous side effect pruning algorithm would keep many dead cell variables alive. For example, in https://github.com/pytorch/pytorch/issues/125078, the compiled function has one return but there were three in the Dynamo graph due to two dead cell variables not being pruned away. This PR adds a corrected algorithm. "new cell variables" are alive if they can be reached from one of the following: 1. any of the tx.symbolic_locals or tx.stack (that is, if they are involved in a return from the function or intermediate variable during a graph break). Example: an alive NestedUserFunctionVariable 2. "mutations to pre-existing objects". Example: appending a NestedUserFunctionVariable to a global list The new algorithm reflects this, but please let me know if there are more cases to handle. Test Plan: - existing tests (afaict, test/dynamo/test_python_autograd is the best SideEffects test case we have) - see in test/dynamo/test_higher_order_ops that the expecttests changed -- the functorch dynamo graphs no longer return dead cellvars. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028 Approved by: https://github.com/jansel	2024-06-11 21:40:48 +00:00
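The liveness rule described in the commit above (#128028) is a reachability check, which can be sketched as follows. This is an illustrative sketch and not Dynamo's implementation: the `live_cells` function, the name-to-names `references` mapping, and the object names are all hypothetical. The `roots` list stands in for the two sources named in the commit message (`tx.symbolic_locals`/`tx.stack`, and mutations to pre-existing objects).

```python
from collections import deque


def live_cells(new_cells, references, roots):
    """Return the new cell variables reachable from any root.

    references maps an object name to the names it refers to (hypothetical model).
    """
    seen = set()
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(references.get(node, ()))
    return {cell for cell in new_cells if cell in seen}


references = {
    "returned_fn": ["cell_a"],       # a live nested function closes over cell_a
    "global_list": ["appended_fn"],  # mutation to a pre-existing object
    "appended_fn": ["cell_b"],
    "dead_fn": ["cell_c"],           # nothing reaches dead_fn, so cell_c is dead
}
roots = ["returned_fn", "global_list"]

print(sorted(live_cells({"cell_a", "cell_b", "cell_c"}, references, roots)))
# → ['cell_a', 'cell_b']
```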
Andrew Gu	8c1247cffb	[BE] Fixed CPU autocast warning (#127774 ) This PR fixes ``` /data/users/andgu/pytorch/torch/utils/checkpoint.py:1398: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127774 Approved by: https://github.com/soulitzer, https://github.com/Skylion007, https://github.com/tianyu-l	2024-06-11 21:33:35 +00:00
Will Feng	70a1e85718	[Traceable FSDP2] Use custom ops for AllGather copy-in / copy-out and ReduceScatter copy-in (#127856 ) Making these operations into custom ops helps Inductor identify these ops and enforce the FSDP communication op ordering. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127856 Approved by: https://github.com/awgu	2024-06-11 20:15:03 +00:00
PyTorch MergeBot	adb699189b	Revert "[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 )" This reverts commit b2d602306a9eb19e30328cbaee941c874f8148a9. Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/clee2000 due to failed internal test D58394084. Author has forward fix but includes external changes so reverting is a bit easier to coordinate ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2161481839))	2024-06-11 19:41:41 +00:00
eqy	45dccfddcd	[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350 ) CC @vedaanta-nvidia @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/128350 Approved by: https://github.com/Skylion007	2024-06-11 19:22:21 +00:00
yuqingj	3e09123797	Enable UFMT on test_nestedtensor.py (#128359 ) split it into two PRs since it is more than 2k lines of change Pull Request resolved: https://github.com/pytorch/pytorch/pull/128359 Approved by: https://github.com/davidberard98	2024-06-11 19:14:04 +00:00
BowenBao	61f922c2ca	Fix 'get_real_value' on placeholder nodes (#127698 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698 Approved by: https://github.com/jansel ghstack dependencies: #127695, #127696	2024-06-11 18:57:25 +00:00
BowenBao	984b1a8c35	Fix 'get_attr' call in dynamo 'run_node' (#127696 ) Fixes #124858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696 Approved by: https://github.com/jansel ghstack dependencies: #127695	2024-06-11 18:57:25 +00:00
Jing Xu	205410cb44	add xpu to torch.tensors (#127280 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.tensors doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280 Approved by: https://github.com/svekars	2024-06-11 18:13:01 +00:00
Eddie Yan	cac7a22b92	[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177 ) Similar in spirit to #125790, hopefully addresses failures seen for cuDNN 9.1 upgrade: https://github.com/pytorch/pytorch/pull/128166 CC @nWEIdia @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/128177 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2024-06-11 18:09:25 +00:00
Wanchao Liang	8a09940a54	[inductor] fix compile time regression by caching get_gpu_type (#128363 ) We observed a significant compile time regression in torchtitan when turning on 2D parallel + torch.compile recently, so I decided to get a deeper understanding of why. It turns out this affects all training jobs that have functional collectives captured in the graph, not only 2D parallel (2D parallel was just the job that happened to have collectives captured in the TP region). The root cause is that during inductor lowering we call the comm analysis pass to get an estimated collective time for each collective node in the graph, and for each collective node checked we call `get_gpu_type()`, which under the hood calls `torch.utils.collect_env.run` to get the GPU info. However, this call is super expensive! It effectively spawns a new process and calls `nvidia-smi` to get the GPU info, so the cost is linear in the number of collective nodes in the graph. See https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75 The fix is to add an lru cache to the function, so that we only call this once and reuse the cached results afterwards. The torchtitan benchmark shows: * before this fix: 2D parallel + fp8 compile time: 6min + * after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement) There is more room to improve the compile time, but this PR is trying to fix the biggest regression I found so far. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363 Approved by: https://github.com/yf225	2024-06-11 18:02:13 +00:00
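The caching fix described in the commit above (#128363) follows a common pattern, sketched here with the standard library. This is a minimal sketch under stated assumptions, not the actual patch: `get_device_info` is a hypothetical stand-in for `get_gpu_type()`, and the counter replaces the real process spawn so the effect of the cache is visible.

```python
import functools

expensive_calls = 0  # counts how often the slow path actually runs


@functools.lru_cache(maxsize=None)
def get_device_info() -> str:
    """Hypothetical stand-in for a query that spawns a process on every call."""
    global expensive_calls
    expensive_calls += 1  # in the real case this would be the subprocess cost
    return "fake-gpu"


# Simulate one lookup per collective node in a graph.
for _ in range(100):
    get_device_info()

print(expensive_calls)               # the slow path ran only once
print(get_device_info.cache_info())  # hit/miss counters from lru_cache
```

Because the function takes no arguments, the cache holds a single entry, so the cost no longer grows with the number of collective nodes. The trade-off is that the result is never refreshed within a process, which is acceptable only when the value cannot change at runtime.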
PyTorch MergeBot	1d233b8f50	Revert "Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704 )" This reverts commit c38b3381a12a0ec033dd417827c530c4474b8165. Reverted https://github.com/pytorch/pytorch/pull/126704 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))	2024-06-11 17:45:20 +00:00
PyTorch MergeBot	491c4a5dcb	Revert "Make sure #126704 is BC for torch.save-ed `nn.Module` (#128344 )" This reverts commit 841d87177a900c2bbd59b6589165189141c4e8bb. Reverted https://github.com/pytorch/pytorch/pull/128344 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))	2024-06-11 17:45:20 +00:00
Angela Yi	4345d98663	[dynamo] Fix for #127696 (#128358 ) Test Plan: `buck2 test @//mode/dev-nosan //executorch/exir/backend/...` https://www.internalfb.com/intern/testinfra/testrun/12666373989243932 Differential Revision: D58384518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128358 Approved by: https://github.com/ydwu4	2024-06-11 16:43:15 +00:00
ankurneog	a838e90964	Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970 ) ### Motivation The Intel Gaudi accelerator (device name hpu) is seen to have a good pass rate with the PyTorch framework UTs; however, being an out-of-tree device, we face challenges in adapting the device to natively run the existing PyTorch UTs under pytorch/test. The UTs are nevertheless a good indicator of the device stack health, and as such we run them regularly with adaptations. Although we can add the Gaudi/HPU device to generate the device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes, skipping, and overriding opInfo. With significant changes introduced in every PyTorch release, maintaining these adaptations becomes difficult and time-consuming. Hence with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded. The eventual goal is to introduce Gaudi out-of-tree support as equivalent to in-tree devices ### Changes Add HPUTestBase of type DeviceTypeTestBase specifying appropriate attributes for Gaudi/HPU. Include code to check whether the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device type tests ### Additional Context Please refer to the following RFC: https://github.com/pytorch/rfcs/pull/63/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970 Approved by: https://github.com/albanD	2024-06-11 16:35:17 +00:00
David Berard	29081059b6	[Static Runtime] Fix & run gen_static_runtime_ops (#128299 ) gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise. I added a number of ops to the blocklist: ``` + "_nested_tensor_storage_offsets", + "_nested_get_values", # no CPU backend + "_nested_get_values_copy", # no CPU backend + "_nested_view_from_jagged", # testing needs to be patched + "_nested_view_from_jagged_copy", # testing needs to be patched + "_nested_view_from_buffer", # testing needs to be patched + "_nested_view_from_buffer_copy", # testing needs to be patched + "_int_mm", # testing needs to be patched + "_to_sparse_csc", # testing needs to be patched + "_to_sparse_csr", # testing needs to be patched + "segment_reduce", # testing needs to be patched ``` Most of these are added just because testing doesn't work right now. Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though. Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299 Approved by: https://github.com/YuqingJ	2024-06-11 16:27:39 +00:00
Nikita Shulga	f8c45996d5	[MPS] Make erfinv compilable for bfloat16 (#128375 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128375 Approved by: https://github.com/Skylion007 ghstack dependencies: #128373	2024-06-11 16:04:11 +00:00
Aaron Orenstein	c13e03c874	Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374 Approved by: https://github.com/Skylion007	2024-06-11 15:58:28 +00:00
Nikita Shulga	053930e194	[MPS][BE] Remove code duplication (#128373 ) Use `scalarToMetalTypeString` instead of `getMetalType` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128373 Approved by: https://github.com/Skylion007	2024-06-11 15:58:04 +00:00
Huamin Li	9a38cae299	[AOTI] Switch to use shim v2 (#127674 ) Differential Revision: D56709309 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127674 Approved by: https://github.com/desertfire	2024-06-11 15:01:25 +00:00
kareem mohiddeen shaik	55901fb3da	[fx] Preserve Fx graph node order in partitioner across runs (#115621 ) Fixes #ISSUE_NUMBER partitioner generates different graph in recompilation on each run Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621 Approved by: https://github.com/ezyang	2024-06-11 14:04:52 +00:00
IvanKobzarev	fc77fdca6f	[guard_size_oblivious] Add gso ExpandUtils:_sym_to (#128224 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128224 Approved by: https://github.com/ezyang	2024-06-11 14:01:34 +00:00
FFFrog	648625b230	Make TraceUtils.h to be device-agnostic (#126969 ) Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files. In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969 Approved by: https://github.com/c-p-i-o	2024-06-11 08:38:07 +00:00
Peter Bell	207c2248a8	[inductor] Fix lowering full with SymBool value (#128213 ) Fixes #128161, fixes #128095 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128213 Approved by: https://github.com/lezcano	2024-06-11 08:33:35 +00:00
Colin L Reliability Rice	a206dcc79e	fb_memcache: Move to fbcode from thirdparty (#128174 ) Summary: The fb_memcache injections location and path is changing. Test Plan: Existing tests should pass. Reviewed By: bertmaher, oulgen Differential Revision: D57973772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128174 Approved by: https://github.com/oulgen	2024-06-11 07:46:12 +00:00
Animesh Jain	f2d7f235a6	[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269 ) Fixes https://github.com/pytorch/pytorch/issues/101168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269 Approved by: https://github.com/jansel ghstack dependencies: #128295, #126578, #128268, #128254	2024-06-11 07:09:04 +00:00
Michael Lazos	402b289f3b	Properly register parameter for binary folding test (#128356 ) This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on the nn modules would be considered constant by dynamo; with inlining NN modules, this is no longer the case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356 Approved by: https://github.com/anijain2305 ghstack dependencies: #128355	2024-06-11 06:48:26 +00:00
Michael Lazos	a32157c67c	Mark params static if inlining modules and freezing (#128355 ) Today inlining builtin nn modules is not compatible with parameter freezing. Freezing parameters and then constant folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining builtin nn modules this assumption is broken and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which will dispatch to the correct constant-folded graph for each distinct set of parameters, or 3) recompile. This PR implements 3 by introducing guards on the parameter pointers. This choice was made because freezing is relatively rare and performance sensitive. Option 2 had many more unknowns, and option 1 is not viable due to the drop in performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355 Approved by: https://github.com/anijain2305	2024-06-11 06:48:26 +00:00
Lourenco Matos	24e7f29099	Lowering for avg_pool_3d_backward (Fixes:#127101) (#127722 ) We implemented a lowering for the avg_pool3d_backward operation and created tests for it. We ran some benchmarks and achieved the following results: ``` [-------------- avgpool_3d_backwards --------------] \| Decomposed \| Eager 16 threads: ---------------------------------------- (3, 5, 400, 200, 200) \| 6061 \| 11160 (3, 5, 300, 200, 200) \| 4547 \| 8372 (3, 5, 200, 200, 200) \| 3032 \| 5585 (3, 5, 300, 300, 300) \| 10100 \| 18840 (3, 5, 100, 100, 100) \| 381 \| 703 (3, 5, 100, 300, 200) \| 2270 \| 4190 (8, 8, 128, 128, 128) \| 3397 \| 6253 (2, 3, 150, 150, 150) \| 520 \| 947 (1, 3, 128, 128, 128) \| 161 \| 299 (8, 16, 64, 64, 64) \| 851 \| 1569 (1, 1, 50, 50, 50) \| 17 \| 11 (3, 5, 20, 40, 40) \| 17 \| 30 (3, 5, 10, 20, 20) \| 17 \| 11 (1, 1, 10, 10, 10) \| 16 \| 11 (3, 5, 5, 10, 10) \| 17 \| 11 (3, 5, 2, 5, 5) \| 17 \| 11 ``` These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations. We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes. Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication. We diffed the kernels and they are identical. Fixes #127101 Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722 Approved by: https://github.com/jansel	2024-06-11 06:39:04 +00:00
Oguz Ulgen	5b5d269d34	Speed up fx graph iteration by implementing it in C++ (#128288 ) Before this change ``` python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s) ``` After this change ``` python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s) ``` 5.7x improvement Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288 Approved by: https://github.com/jansel, https://github.com/albanD	2024-06-11 05:48:31 +00:00
PyTorch MergeBot	fa88f390a0	Revert "[inductor] enable fx graph cache on torchbench (#128239 )" This reverts commit 734e8f6ad7e7f0fa0341fb658f1f986225173f5f. Reverted https://github.com/pytorch/pytorch/pull/128239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to surface a bunch of inductor failures in trunk `734e8f6ad7` ([comment](https://github.com/pytorch/pytorch/pull/128239#issuecomment-2159789242))	2024-06-11 04:53:38 +00:00
Ke Wen	fe39c07826	[pipelining][doc] Remove duplicated words (#128368 ) "for execution" is used in both step titles Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368 Approved by: https://github.com/wconstab ghstack dependencies: #128361	2024-06-11 04:52:57 +00:00
Wang, Eikan	cba195c8ed	Support aten operations with out tensor (#124926 ) This PR intends to support the aten operations with the `out` tensor. Currently, the AOT compile always does NOT keep input tensor mutations. According to the comments, this is because it has not encountered such a use case. > For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to. However, for aten operations, it is popular that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph. Take `clamp` as an example as follows. ```python out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0) inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0) min_tensor = inp_tensor - 0.05 max_tensor = inp_tensor + 0.05 torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor) ``` W/O this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None return (clamp_max, clamp_max) ``` W/ this PR ```python def forward(self): arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]"; arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec) clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); 
clamp_min = arg2_1 = None copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None return (copy_,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi	2024-06-11 04:35:27 +00:00
Edward Z. Yang	16e67be7f1	Also preserve unbacked SymInts when partitioning as backward inputs (#128338 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128338 Approved by: https://github.com/IvanKobzarev	2024-06-11 04:27:09 +00:00
zengxian	7afffdf48b	[CI] Comment hf_T5_generate, hf_GPT2 and timm_efficientnet in inductor cpu smoketest for performance unstable issue (#127588 ) Fixes #126993 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127588 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire	2024-06-11 03:12:11 +00:00
Animesh Jain	ca45649eb5	[easy][dynamo][inline work] Fix test with inlining inbuilt nn modules (#128254 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128254 Approved by: https://github.com/williamwen42 ghstack dependencies: #128295, #126578, #128268	2024-06-11 03:02:51 +00:00
Animesh Jain	665e568381	[inductor][inlining nn module] Skip batchnorm version check test for inlining (#128268 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128268 Approved by: https://github.com/zou3519 ghstack dependencies: #128295, #126578	2024-06-11 03:02:51 +00:00
Ke Wen	4077cdd589	[pipelining][doc] Update arg list of pipeline API (#128361 ) And document the use of `build_stage` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128361 Approved by: https://github.com/wconstab	2024-06-11 02:55:17 +00:00
cyy	e4bd0adca5	[6/N] Remove unused functions (#128309 ) Follows #127185 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128309 Approved by: https://github.com/ezyang	2024-06-11 02:46:33 +00:00
eellison	793df7b7cb	Prevent expansion of cat indexing to avoid int64 intermediate (#127815 ) Fix for https://github.com/pytorch/pytorch/issues/127652 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815 Approved by: https://github.com/shunting314, https://github.com/peterbell10	2024-06-11 02:41:07 +00:00
Andrew Hoblitzell	d1d9bc7aa6	init add comment (#128083 ) Fixes #127898 ### Description Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128083 Approved by: https://github.com/titaiwangms	2024-06-11 02:37:04 +00:00
Mikayla Gawarecki	841d87177a	Make sure #126704 is BC for torch.save-ed `nn.Module` (#128344 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128344 Approved by: https://github.com/albanD ghstack dependencies: #126906, #126704	2024-06-11 02:26:06 +00:00
Arun Pa	3b555ba477	Add docstring for torch.utils.data.datapipes.decoder.basichandlers (#128018 ) Fixes #127912 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128018 Approved by: https://github.com/andrewkho	2024-06-11 01:32:45 +00:00
Sam Larsen	734e8f6ad7	[inductor] enable fx graph cache on torchbench (#128239 ) Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239 Approved by: https://github.com/oulgen	2024-06-11 00:40:31 +00:00
cyy	99f5a85a09	[Clang Tidy] Fix misc-header-include-cycle errors in clang-tidy and ignore some files (#127233 ) Since there are such cycles in libfmt and PyTorch, which are detected by clang-tidy. ``` /home/cyy/pytorch/third_party/fmt/include/fmt/format-inl.h:25:10: error: circular header file dependency detected while including 'format.h', please check the include path [misc-header-include-cycle,-warnings-as-errors] 25 \| #include "format.h" \| ^ /home/cyy/pytorch/third_party/fmt/include/fmt/format.h:4530:12: note: 'format-inl.h' included from here 4530 \| # include "format-inl.h" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127233 Approved by: https://github.com/ezyang	2024-06-10 23:49:58 +00:00
Jun Luo	f843ccbb1a	[MTIA] Add set_device support (#128040 ) Summary: Support set_device API in MTIA backend. Reviewed By: gnahzg Differential Revision: D58089498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040 Approved by: https://github.com/gnahzg	2024-06-10 23:42:52 +00:00
cyy	30875953a4	[1/N] Remove inclusion of c10/util/string_utils.h (#128300 ) As a first step to remove it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128300 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-06-10 23:40:47 +00:00
cyy	2126ae186e	Remove caffe2/perfkernels files (#128186 ) These files are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128186 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-06-10 23:40:18 +00:00
Jiashen Cao	739aa224ec	[Fix] Parameter un/lifting issues in the TorchScript to ExportedProgram converter (#127975 ) This PR fixes issues related to parameters and inputs lifting in the converter. #### Issue 1 ``` > Graph[linear.weights, bias.weights, x.1] %1 ... %2 ... %3 = CreateObject() > Block 0[] %linear.0 = GetAttr(linear)[%3] > Block 0.0[] %weight.0 = GetAttr(weights)[%linear.0] > Block 1[] ... ``` * Model parameters for the top level module should be unlifted, while parameters from sub-blocks should be lifted. #### Fixes * Bottom-up traversal (i.e., start from the inner most block) to figure out which parameters to be lifted for sub-blocks. #### Test Plan * Add test cases for nested block without control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_param` * Add test cases for nested block with control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param` #### Outcome ##### TorchScript ``` graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu), %m1.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m1.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m1.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m1.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m2.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m2.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu), %m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu), %m2.m2.linear.weight : Float(3, 3, 
strides=[3, 1], requires_grad=0, device=cpu), %m2.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu)): %15 : __torch__.export.test_converter.___torch_mangle_14.SuperNestedM1 = prim::CreateObject() %16 : NoneType = prim::Constant(), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %17 : int = prim::Constant[value=1](), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:34 %18 : Tensor = aten::max(%x.1), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19 %19 : Tensor = aten::gt(%18, %17), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19 %20 : bool = aten::Bool(%19), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19 %21 : Tensor = prim::If(%20), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:16 block0(): %linear.6 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1:: %m1.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m1"](%15), scope: export.test_converter.SuperNestedM1:: %24 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %25 : Tensor = aten::gt(%24, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %26 : bool = aten::Bool(%25), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %27 : Tensor = prim::If(%26), scope: 
export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16 block0(): %linear.10 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %m1.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %linear.12 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %weight.4 : Tensor = prim::GetAttr[name="weight"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.4 : Tensor = prim::GetAttr[name="bias"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %33 : Tensor = aten::linear(%x.1, %weight.4, %bias.4), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.6 : Tensor = prim::GetAttr[name="weight"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.6 : Tensor = prim::GetAttr[name="bias"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %36 : Tensor = aten::linear(%33, %weight.6, %bias.6), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%36) block1(): %linear.14 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %m2.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m1.1), scope: 
export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %linear.16 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %weight.8 : Tensor = prim::GetAttr[name="weight"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.8 : Tensor = prim::GetAttr[name="bias"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %42 : Tensor = aten::linear(%x.1, %weight.8, %bias.8), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.2 : Tensor = prim::GetAttr[name="weight"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %bias.2 : Tensor = prim::GetAttr[name="bias"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 %45 : Tensor = aten::linear(%42, %weight.2, %bias.2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%45) %weight.10 : Tensor = prim::GetAttr[name="weight"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %bias.10 : Tensor = prim::GetAttr[name="bias"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %48 : Tensor = aten::linear(%27, %weight.10, %bias.10), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%48) block1(): %linear.8 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1:: %m2.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m2"](%15), 
scope: export.test_converter.SuperNestedM1:: %51 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %52 : Tensor = aten::gt(%51, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %53 : bool = aten::Bool(%52), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19 %54 : Tensor = prim::If(%53), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16 block0(): %linear.1 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %m1 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %linear.5 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %weight.1 : Tensor = prim::GetAttr[name="weight"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.1 : Tensor = prim::GetAttr[name="bias"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %60 : Tensor = aten::linear(%x.1, %weight.1, %bias.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.3 : Tensor = prim::GetAttr[name="weight"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.3 : Tensor = 
prim::GetAttr[name="bias"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %63 : Tensor = aten::linear(%60, %weight.3, %bias.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%63) block1(): %linear.3 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %m2 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %linear : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %weight.5 : Tensor = prim::GetAttr[name="weight"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.5 : Tensor = prim::GetAttr[name="bias"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %69 : Tensor = aten::linear(%x.1, %weight.5, %bias.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 %weight.12 : Tensor = prim::GetAttr[name="weight"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %bias.12 : Tensor = prim::GetAttr[name="bias"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 %72 : Tensor = aten::linear(%69, %weight.12, %bias.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%72) %weight : Tensor = prim::GetAttr[name="weight"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %bias : 
Tensor = prim::GetAttr[name="bias"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear %75 : Tensor = aten::linear(%54, %weight, %bias), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15 -> (%75) return (%21) ``` ##### ExportedProgram ``` ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", x_1: "f32[3]"): # No stacktrace found for following nodes max_1: "f32[]" = torch.ops.aten.max.default(x_1) gt: "b8[]" = torch.ops.aten.gt.Scalar(max_1, 1); max_1 = None # File: <eval_with_key>.137:23 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_2, cond_false_2, [l_args_3_0_, l_args_3_13_, l_args_3_5_, l_args_3_12_, l_args_3_14_, l_args_3_1_, l_args_3_3_, l_args_3_4_, l_args_3_7_, l_args_3_10_, l_args_3_11_, l_args_3_2_, l_args_3_6_, l_args_3_8_, l_args_3_9_]); l_args_0_ = cond_true_2 = cond_false_2 = l_args_3_0_ = l_args_3_13_ = l_args_3_5_ = l_args_3_12_ = l_args_3_14_ = l_args_3_1_ = l_args_3_3_ = l_args_3_4_ = l_args_3_7_ = l_args_3_10_ = l_args_3_11_ = l_args_3_2_ = l_args_3_6_ = l_args_3_8_ = l_args_3_9_ = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_linear_weight, p_linear_bias, x_1, p_m1_linear_weight, p_m1_m1_linear_bias, p_m1_linear_bias, p_m1_m2_linear_weight, p_m1_m2_linear_bias, p_m1_m1_linear_weight, p_m2_m2_linear_bias, 
p_m2_m1_linear_weight, p_m2_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_weight, p_m2_linear_bias]); gt = true_graph_0 = false_graph_0 = p_linear_weight = p_linear_bias = x_1 = p_m1_linear_weight = p_m1_m1_linear_bias = p_m1_linear_bias = p_m1_m2_linear_weight = p_m1_m2_linear_bias = p_m1_m1_linear_weight = p_m2_m2_linear_bias = p_m2_m1_linear_weight = p_m2_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_weight = p_m2_linear_bias = None getitem: "f32[3]" = conditional[0]; conditional = None return (getitem,) class <lambda>(torch.nn.Module): def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.134:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None) sum_1: "f32[]" = torch.ops.aten.sum.default(x_1) # File: <eval_with_key>.134:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1); sum_default = None gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1); sum_1 = None # File: <eval_with_key>.134:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_0, cond_false_0, [l_args_3_12__true_branch, l_args_3_1__true_branch, l_args_3_5__1, l_args_3_14__true_branch, l_args_3_7__true_branch, l_args_3_3__true_branch, l_args_3_4__true_branch]); gt_scalar = cond_true_0 = cond_false_0 = l_args_3_12__true_branch = l_args_3_1__true_branch = l_args_3_5__1 = l_args_3_14__true_branch = l_args_3_7__true_branch = l_args_3_3__true_branch = l_args_3_4__true_branch = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = 
torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m1_linear_weight, p_m1_linear_bias, x_1, p_m1_m1_linear_bias, p_m1_m1_linear_weight, p_m1_m2_linear_weight, p_m1_m2_linear_bias]); gt = true_graph_0 = false_graph_0 = p_m1_linear_weight = p_m1_linear_bias = x_1 = p_m1_m1_linear_bias = p_m1_m1_linear_weight = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None getitem: "f32[3]" = conditional[0]; conditional = None # File: <eval_with_key>.134:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1); getitem = l_args_3_0__1 = l_args_3_13__1 = None linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias); getitem = p_linear_weight = p_linear_bias = None return (linear,) class <lambda>(torch.nn.Module): def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.130:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_7__true_branch, l_args_3_14__true_branch); l_args_3_5__1 = l_args_3_7__true_branch = l_args_3_14__true_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m1_linear_weight, p_m1_m1_linear_bias); x_1 = p_m1_m1_linear_weight = p_m1_m1_linear_bias = None # File: <eval_with_key>.130:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1); linear_default = l_args_3_12__1 = l_args_3_1__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias); linear = p_m1_linear_weight = p_m1_linear_bias = None return (linear_1,) class <lambda>(torch.nn.Module): def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", 
p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.131:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_3__false_branch, l_args_3_4__false_branch); l_args_3_5__1 = l_args_3_3__false_branch = l_args_3_4__false_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m2_linear_weight, p_m1_m2_linear_bias); x_1 = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None # File: <eval_with_key>.131:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1); linear_default = l_args_3_12__1 = l_args_3_1__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias); linear = p_m1_linear_weight = p_m1_linear_bias = None return (linear_1,) class <lambda>(torch.nn.Module): def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"): # File: <eval_with_key>.135:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None) sum_1: "f32[]" = torch.ops.aten.sum.default(x_1) # File: <eval_with_key>.135:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1); sum_default = None gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1); sum_1 = None # File: <eval_with_key>.135:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_1, cond_false_1, [l_args_3_2__false_branch, l_args_3_5__1, l_args_3_9__false_branch, l_args_3_11__false_branch, l_args_3_6__false_branch, l_args_3_10__false_branch, l_args_3_8__false_branch]); 
gt_scalar = cond_true_1 = cond_false_1 = l_args_3_2__false_branch = l_args_3_5__1 = l_args_3_9__false_branch = l_args_3_11__false_branch = l_args_3_6__false_branch = l_args_3_10__false_branch = l_args_3_8__false_branch = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m2_linear_weight, x_1, p_m2_linear_bias, p_m2_m1_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_bias, p_m2_m2_linear_weight]); gt = true_graph_0 = false_graph_0 = p_m2_linear_weight = x_1 = p_m2_linear_bias = p_m2_m1_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_bias = p_m2_m2_linear_weight = None getitem: "f32[3]" = conditional[0]; conditional = None # File: <eval_with_key>.135:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1); getitem = l_args_3_0__1 = l_args_3_13__1 = None linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias); getitem = p_linear_weight = p_linear_bias = None return (linear,) class <lambda>(torch.nn.Module): def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"): # File: <eval_with_key>.132:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_11__true_branch, l_args_3_6__true_branch); l_args_3_5__1 = l_args_3_11__true_branch = l_args_3_6__true_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m1_linear_weight, p_m2_m1_linear_bias); x_1 = p_m2_m1_linear_weight = p_m2_m1_linear_bias = None # File: <eval_with_key>.132:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1); linear_default = l_args_3_2__1 = l_args_3_9__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, 
p_m2_linear_weight, p_m2_linear_bias); linear = p_m2_linear_weight = p_m2_linear_bias = None return (linear_1,) class <lambda>(torch.nn.Module): def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"): # File: <eval_with_key>.133:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_8__false_branch, l_args_3_10__false_branch); l_args_3_5__1 = l_args_3_8__false_branch = l_args_3_10__false_branch = None linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m2_linear_weight, p_m2_m2_linear_bias); x_1 = p_m2_m2_linear_weight = p_m2_m2_linear_bias = None # File: <eval_with_key>.133:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1); linear_default = l_args_3_2__1 = l_args_3_9__1 = None linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias); linear = p_m2_linear_weight = p_m2_linear_bias = None return (linear_1,) Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_weight'), target='linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_bias'), target='linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_weight'), target='m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_bias'), target='m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_weight'), target='m1.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_bias'), target='m1.m1.linear.bias', persistent=None), 
InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_weight'), target='m1.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_bias'), target='m1.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_weight'), target='m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_bias'), target='m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_weight'), target='m2.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_bias'), target='m2.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_weight'), target='m2.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_bias'), target='m2.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x_1'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='getitem'), target=None)]) Range constraints: {} ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127975 Approved by: https://github.com/angelayi, https://github.com/ydwu4	2024-06-10 23:24:16 +00:00
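The bottom-up traversal described in the commit above (#127975) can be sketched roughly as follows. This is only an illustration under assumptions, not the converter's actual code: `Block`, `collect_lifted`, and `params_to_lift` are hypothetical names, and parameters are modeled as plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for a TorchScript block (not the real IR type)."""
    name: str
    used_params: set = field(default_factory=set)
    children: list = field(default_factory=list)


def collect_lifted(block, lifted):
    # Post-order walk: visit the innermost blocks first, then merge what
    # they need into the enclosing block.
    needed = set(block.used_params)
    for child in block.children:
        needed |= collect_lifted(child, lifted)
    lifted[block.name] = needed
    return needed


def params_to_lift(top):
    lifted = {}
    collect_lifted(top, lifted)
    # Parameters of the top-level graph stay unlifted; only sub-blocks
    # receive lifted parameters as explicit inputs.
    del lifted[top.name]
    return lifted


inner = Block("block0.0", {"m1.m1.linear.weight"})
outer = Block("block0", {"m1.linear.weight"}, [inner])
top = Block("graph", {"linear.weight"}, [outer])
result = params_to_lift(top)
```

In this sketch an enclosing sub-block must also receive every parameter its nested blocks use, which is why the walk starts from the innermost block.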
Animesh Jain	b2d602306a	[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 ) Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically. Fixes https://github.com/pytorch/pytorch/issues/111837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578 Approved by: https://github.com/jansel ghstack dependencies: #128295	2024-06-10 23:11:04 +00:00
Animesh Jain	05711eece9	[dynamo][inlining inbuilt modules] Ensure BC for nn_module_stack (#128295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128295 Approved by: https://github.com/ydwu4	2024-06-10 23:11:04 +00:00
Yidi Wu	a287ff75d0	Use init_torchbind_implementations in inductor torchbind tests. (#128341 ) Summary: To unify how we load the torch bind libraries for testing. Test Plan: Existing tests. Differential Revision: D58372372 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128341 Approved by: https://github.com/angelayi	2024-06-10 23:02:48 +00:00
PyTorch MergeBot	4bbadeee8a	Revert "Set simdlen based on ATEN_CPU_CAPABILITY (#123514 )" This reverts commit b66e3f0957b96b058c9b632ca60833d9717a9d8a. Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/clee2000 due to broke test/inductor/test_torchinductor.py::CpuTests::test_new_cpp_build_logical_cpu on periodic test on the no gpu tests `b66e3f0957` https://github.com/pytorch/pytorch/actions/runs/9453518547/job/26040077301 ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2159433432))	2024-06-10 22:46:01 +00:00
Simon Fan	2176ef7dfa	[compiled autograd] support .backward(inputs=) (#128252 ) autograd already marks nodes as needed or not before calling compiled autograd, so our worklist already skips nodes not specified in the `inputs` kwarg. For the .backward(inputs=) case, I'm keeping the grads as outputs, just like for .grad(inputs=); this is to still guard on graph_output when we collect the nodes. This does not get DCE'd right now, and is ignored in the post graph bytecode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128252 Approved by: https://github.com/jansel	2024-06-10 22:20:51 +00:00
loganthomas	583a56d5a8	DOC: add docstring to construct_and_record_rdzv_event() (#128189 ) Fixes #127902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189 Approved by: https://github.com/kurman	2024-06-10 22:17:33 +00:00
Mikayla Gawarecki	c38b3381a1	Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704 ) Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437 - `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook` - Add a test as this API was previously untested - `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`) ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~ - Document the issue pointed out by https://github.com/pytorch/pytorch/issues/117437: for the private `_register_state_dict_hook`, the semantic of returning a new state_dict is only respected for the root module - Remove this semantic for the public `register_state_dict_post_hook` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126704 Approved by: https://github.com/albanD ghstack dependencies: #126906	2024-06-10 21:50:17 +00:00
Mikayla Gawarecki	a2d4fea872	[easy] Move state_dict hooks tests to test_module_hooks and decorate tests that call load_state_dict with swap (#126906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126906 Approved by: https://github.com/albanD	2024-06-10 21:50:17 +00:00
Edward Z. Yang	58083ffb10	Improve unbacked reasoning involving has internal overlap (#128332 ) Fixes https://github.com/pytorch/pytorch/issues/122477 Partially addresses https://github.com/pytorch/pytorch/issues/116336 This PR is slightly overkill: not only does it disable the overlap test when there are unbacked SymInts, it also improves the is non-overlapping and dense test for some more unbacked situations. We technically don't need the latter change, but I was already deep in the sauce and just went ahead and did it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128332 Approved by: https://github.com/lezcano	2024-06-10 21:49:38 +00:00
Andrea Frittoli	6630dcd53c	Add docstring for the torch.serialization.default_restore_location function (#128132 ) Fixes: #127887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128132 Approved by: https://github.com/mikaylagawarecki	2024-06-10 21:33:56 +00:00
laithsakka	3a2d0755a4	enable test_ParameterList with dynamo if nn module inlining enabled only (#128308 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128308 Approved by: https://github.com/anijain2305	2024-06-10 21:25:40 +00:00
IvanKobzarev	b459713ca7	[aota] compiled forward outputs requires_grad alignment with eager (#128016 ) Original issue: https://github.com/pytorch/pytorch/issues/114338 We assume only two possible, mutually exclusive scenarios: 1. Running the compiled region for training (any of the inputs has requires_grad) - Produced differentiable outputs should have requires_grad. 2. Running the compiled region for inference (none of the inputs has requires_grad) - No outputs have requires_grad. Even if the user runs the region under no_grad(), an input Tensor with requires_grad puts us in the training scenario (1). With the current state that means: 1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad 2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `with torch.enable_grad()` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016 Approved by: https://github.com/bdhirsh	2024-06-10 20:51:22 +00:00
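The decision rule in the commit above (#128016) can be illustrated with a small sketch. It does not use PyTorch and is not the AOTAutograd implementation: `FakeTensor` and `pick_scenario` are hypothetical stand-ins, and the `grad_enabled` argument exists only to show that it is ignored.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only models requires_grad."""
    requires_grad: bool = False


def needs_autograd(inputs):
    # Depends only on the inputs, never on the ambient grad mode.
    return any(t.requires_grad for t in inputs)


def pick_scenario(inputs, grad_enabled):
    # grad_enabled is deliberately unused: an input that requires grad
    # selects the training scenario even under no_grad().
    return "training" if needs_autograd(inputs) else "inference"


under_no_grad = pick_scenario([FakeTensor(True), FakeTensor(False)], grad_enabled=False)
plain_inference = pick_scenario([FakeTensor(False)], grad_enabled=True)
```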
Guilherme Leobas	4460e481bc	Disable jacrev/jacfwd/hessian if compiling with dynamo (#128255 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128255 Approved by: https://github.com/zou3519	2024-06-10 20:47:53 +00:00
PyTorch MergeBot	90bb510ece	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit 348b181a97abc2e636a6c18e5880a78e5d1dab94. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))	2024-06-10 20:44:42 +00:00
Xiaodong Wang	38e0a0440c	[AMD] Default to hipblaslt in gemm (#127944 ) Summary: It has been a constant pain that we have to specify env var to go with the hipblaslt path. The default path is very slow on MI300. Therefore, let's default to hipblaslt. Differential Revision: D58150764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127944 Approved by: https://github.com/aaronenyeshi, https://github.com/houseroad	2024-06-10 19:55:21 +00:00
Aaron Orenstein	946f554c8f	Flip default value for mypy disallow_untyped_defs [10+1/11] (#128293 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128293 Approved by: https://github.com/oulgen	2024-06-10 19:32:44 +00:00
Nikita Shulga	55646554b7	[EZ] Fix typos in SECURITY.md (#128340 ) permisisons -> permissions lates -> latest Pull Request resolved: https://github.com/pytorch/pytorch/pull/128340 Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/kit1980	2024-06-10 19:21:39 +00:00
Edward Z. Yang	9cab5987bd	Introduce int_oo (#127693 ) In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range. After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better. But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. test/test_sympy_utils.py describes some basic properties of the number, and torch/utils/_sympy/numbers.py has the actual implementation. The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments. Fixes https://github.com/pytorch/pytorch/issues/127396 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693 Approved by: https://github.com/lezcano ghstack dependencies: #126905	2024-06-10 19:09:53 +00:00
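To make the motivation in the commit above (#127693) concrete, here is a minimal sketch of an integer-typed infinity in plain Python. It is an assumption-laden illustration, not the implementation in torch/utils/_sympy/numbers.py: the class `IntInfinity` and the constant `INT_OO` are hypothetical, and only positive infinity with a few operations is modeled.

```python
import sys
from functools import total_ordering


@total_ordering
class IntInfinity:
    """Hypothetical integer-typed +infinity (not the real int_oo)."""
    is_integer = True

    def __eq__(self, other):
        return isinstance(other, IntInfinity)

    def __hash__(self):
        return hash("IntInfinity")

    def __gt__(self, other):
        # Larger than every finite int, but not larger than itself.
        return not isinstance(other, IntInfinity)

    def __mul__(self, other):
        # Only positive finite factors are modeled in this sketch.
        if isinstance(other, int) and other > 0:
            return self
        return NotImplemented

    __rmul__ = __mul__

    def __repr__(self):
        return "int_oo"


INT_OO = IntInfinity()

# A finite sentinel bound grows past the 64-bit range once it is doubled.
finite_upper = sys.maxsize - 1
doubled_finite = finite_upper * 2

# A symbolic bound stays the same object under size-increasing computation.
doubled_symbolic = INT_OO * 2
```

The finite sentinel forces arbitrary-precision arithmetic as soon as a bound is scaled, while the symbolic bound absorbs the multiplication and still reports `is_integer`.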
PyTorch MergeBot	db2fa7b827	Revert "[export] FIx unflattener for preserving modules containing unused inputs (#128260 )" This reverts commit 093a4ff5f859ccbbd8ba62dd189f76e5faadfb04. Reverted https://github.com/pytorch/pytorch/pull/128260 on behalf of https://github.com/angelayi due to breaking windows test ([comment](https://github.com/pytorch/pytorch/pull/128260#issuecomment-2159050726))	2024-06-10 18:42:33 +00:00
angelayi	093a4ff5f8	[export] FIx unflattener for preserving modules containing unused inputs (#128260 ) Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs. This also fixes unflattener issues in D57829276. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260 Approved by: https://github.com/pianpwk	2024-06-10 18:39:33 +00:00
Sam Larsen	fa8ec8e718	[dynamo] handle hashable exceptions in trace_rules lookup (#128078 ) Summary: Found during user empathy day when attempting to hash a fractions.Fraction object before it was fully constructed. See https://github.com/pytorch/pytorch/issues/128075 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128078 Approved by: https://github.com/anijain2305	2024-06-10 18:23:22 +00:00
Aaron Enye Shi	136bdb96cb	Update Kineto submodule with fix to test_basic_chrome_trace (#128333 ) Summary: We've updated the sort_index in Kineto chrome traces to support device ids up to 16 devices. This should make chrome trace rows be ordered in the same way as CUDA. We need to update the unit test as well. Test Plan: Ran locally the changing test: ``` $ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:test_profiler_cuda -- --exact 'caffe2/test:test_profiler_cuda - test_basic_chrome_trace (profiler.test_profiler.TestProfiler)' File changed: fbcode//caffe2/third_party/kineto.submodule.txt Buck UI: https://www.internalfb.com/buck2/f4fd1e9a-99f1-4422-aeed-b54903c64146 Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498639845776 Network: Up: 5.4KiB Down: 8.6KiB (reSessionID-0329120e-7fa2-4bc0-b539-7e58058f8fce) Jobs completed: 6. Time elapsed: 1:01.2s. Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D58362964 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/128333 Approved by: https://github.com/Skylion007	2024-06-10 18:12:34 +00:00
Andrea Frittoli	83941482f7	Add docstring for the torch.distributed.elastic.utils.distributed.get_free_port function (#128133 ) Fixes: #127914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128133 Approved by: https://github.com/H-Huang	2024-06-10 18:10:58 +00:00
Menglu Yu	08d038f8a8	[PT2] Fix a typo and lint problem (#128258 ) Summary: Titled Test Plan: see signal Differential Revision: D58310169 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128258 Approved by: https://github.com/dshi7, https://github.com/Yuzhen11	2024-06-10 18:03:40 +00:00
Shengbao Zheng	46948300a2	[c10d] integrate PMI NCCL initialization to NCCL-PG (#128243 ) Summary: Move broadcastUniqueID check to NCCLUtils Differential Revision: D58273755 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128243 Approved by: https://github.com/wconstab	2024-06-10 17:20:03 +00:00
Chirag Pandya	ab3a0b192a	[RFC] add per-collective timeout value in flight recorder (#128190 ) Summary: Add timeout value field on every collected record. Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190 Approved by: https://github.com/wconstab	2024-06-10 17:12:57 +00:00
Edward Z. Yang	8e482e909b	Add some guard to size oblivious has_internal_overlap (#128328 ) This doesn't actually help on https://github.com/pytorch/pytorch/issues/122477 but I noticed this modest improvement so sure, why not. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128328 Approved by: https://github.com/Skylion007	2024-06-10 17:11:26 +00:00
Sheng Fu	7b9c5e0e3f	Turn on GraphTransformObserver for inductor (#127962 ) The FX graphs for some PT2 models are very complicated, and Inductor usually goes through many passes of graph optimization to generate the final FX graph. It’s very difficult to see the change in each pass and to check whether the optimized graph is correct and optimal. GraphTransformObserver is an observer that listens to all add/erase node events on a GraphModule during a graph transform pass and saves the changed nodes. When the pass is done, if there is any change in the graph, GraphTransformObserver saves SVG files of the input graph and the output graph for that pass. This PR enables GraphTransformObserver for inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127962 Approved by: https://github.com/jansel	2024-06-10 16:49:02 +00:00
PyTorch MergeBot	ca561d639b	Revert "Fix 'get_attr' call in dynamo 'run_node' (#127696 )" This reverts commit b741819b0580204e6a6b60c62ce44dacaf7787c8. Reverted https://github.com/pytorch/pytorch/pull/127696 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))	2024-06-10 16:29:20 +00:00
PyTorch MergeBot	d22287d1ad	Revert "Fix 'get_real_value' on placeholder nodes (#127698 )" This reverts commit 19b31d899a78a6806314bcc73b88172dabf0c26e. Reverted https://github.com/pytorch/pytorch/pull/127698 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))	2024-06-10 16:29:20 +00:00
PyTorch MergeBot	3b73f5de3a	Revert "Add OpInfo entry for alias_copy (#127232 ) (#128142 )" This reverts commit 04da6aeb61f4d57bf73ed1054dd897abbcceca83. Reverted https://github.com/pytorch/pytorch/pull/128142 on behalf of https://github.com/DanilBaibak due to The changes broke the test_output_match_alias_copy_cpu_complex64 test. ([comment](https://github.com/pytorch/pytorch/pull/128142#issuecomment-2158793878))	2024-06-10 16:17:16 +00:00
Isuru Fernando	c993f1b37f	Fix edge cases for gather in inductor (#126893 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126893 Approved by: https://github.com/peterbell10 ghstack dependencies: #126876	2024-06-10 15:31:03 +00:00
Tom Ritchford	04da6aeb61	Add OpInfo entry for alias_copy (#127232 ) (#128142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142 Approved by: https://github.com/lezcano	2024-06-10 15:01:53 +00:00
CaoE	b66e3f0957	Set simdlen based on ATEN_CPU_CAPABILITY (#123514 ) It is part of https://github.com/pytorch/pytorch/issues/123224. Set simdlen based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA like eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-06-10 09:02:14 +00:00
Xu Han	df43d5843e	fix miss isa bool check (#128274 ) The new cpp builder missed the ISA bool (dry-compile) check. @jgong5 found this missing, so I submitted this PR to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128274 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-06-10 02:45:46 +00:00
cyy	26f6a87ae9	[5/N] Remove unused functions (#127185 ) Follows #128193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127185 Approved by: https://github.com/ezyang	2024-06-10 01:57:49 +00:00
Peter Bell	d3817d8a60	Don't create python tuple when _maybe_handle_torch_function is called from C++ (#128187 ) Marginal overhead reduction when calling through the `torch.ops` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128187 Approved by: https://github.com/lezcano ghstack dependencies: #128183, #128184, #128185	2024-06-10 00:16:59 +00:00
Peter Bell	cd2ad29afe	[inductor] Reduce binding overhead of _reinterpret_tensor (#128185 ) Going through the dispatcher + pybind11 + torch.ops adds about 2 us overhead per call compared to `PyArgParser`. Note that views of inputs are reconstructed by AOTAutograd before being returned to the python code, so dispatching for autograd's sake shouldn't be required here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128185 Approved by: https://github.com/lezcano ghstack dependencies: #128183, #128184	2024-06-09 23:33:03 +00:00
Peter Bell	253fa9c711	[AOTAutograd] Remove runtime import from view replay function (#128184 ) `gen_alias_from_base` spends about ~0.5 us in this import statement, which is called for each view in the graph output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128184 Approved by: https://github.com/lezcano ghstack dependencies: #128183	2024-06-09 23:33:03 +00:00
Peter Bell	55b2a0a002	[AOTAutograd] Use _set_grad_enabled instead of no_grad (#128183 ) This saves ~1us of overhead from each inductor graph call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128183 Approved by: https://github.com/lezcano	2024-06-09 23:33:03 +00:00
Masahiro Hiramori	5e7377e044	[Dynamo][TVM] Make the `opt_level` parameter adjustable (#127876 ) Fixes #127874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127876 Approved by: https://github.com/jansel	2024-06-09 21:38:00 +00:00
Shuqiang Zhang	c7e2c9c37e	[c10d][doc] add a doc page for NCCL ENVs (#128235 ) Addressing issue: https://github.com/pytorch/pytorch/issues/128204 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128235 Approved by: https://github.com/wconstab	2024-06-09 16:08:38 +00:00
Chirag Pandya	0bf2fe522a	[RFC] Provide optional switches to _dump_nccl_trace (#127651 ) Summary: Data from PyTorch distributed is mostly useful during initial stages of model development. Provide options to reduce data sent/dumped. `_dump_nccl_trace` takes 3 optional switches. Default as before returns everything - `includeCollectives`: option to also include collectives: Default is True. - `includeStacktraces`: option to include stack traces in collectives. Default is True. - `onlyActive`: option to only send active collective work - i.e. not completed. Default is False (i.e. send everything) Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651 Approved by: https://github.com/wconstab	2024-06-09 14:00:57 +00:00
PyTorch MergeBot	75b0720a97	Revert "Use hidden visibility in OBJECTCXX files (#127265 )" This reverts commit 669560d51aa1e81ebd09e2aa8288d0d314407d82. Reverted https://github.com/pytorch/pytorch/pull/127265 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it causes this failure https://github.com/pytorch/vision/issues/8478 on vision where its C++ extension could not be loaded on macOS ([comment](https://github.com/pytorch/pytorch/pull/127265#issuecomment-2156401838))	2024-06-09 09:05:17 +00:00
eqy	4c971932e8	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-09 06:53:34 +00:00
Edward Z. Yang	3964a3ec73	Complete revamp of float/promotion sympy handling (#126905 ) At a high level, the idea behind this PR is: * Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.) * Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers. The story begins in torch/utils/_sympy/functions.py. Here, I make some changes to how we represent certain operations in sympy expressions: * FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing). * ModularIndexing, LShift, RShift now assert they are given integer inputs. * Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver * TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2*53 beyond what first coercing the integer to floats and then doing true division. Trunc is split to TruncToFloat and TruncToInt. * Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result. 
* RoundDecimal updated to consistently only ever return a float * Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing) In torch/__init__.py, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information. We also need to introduce some new op handlers in torch/_inductor/ops_handler.py: * `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy * `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv` These changes have consequences. First, we need to make some administrative changes: * Actually wire up these Sympy functions from SymInt/SymFloat in torch/fx/experimental/sym_node.py, including the new promotion rules (promote2) * Add support for new Sympy functions in torch/utils/_sympy/interp.py, torch/utils/_sympy/reference.py * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here * Add printer support for the Sympy functions in torch/_inductor/codegen/common.py, torch/_inductor/codegen/cpp_utils.py, torch/_inductor/codegen/triton.py. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. 
TODO: The additions here are not exhaustive yet * Update ValueRanges logic to use new sympy functions in torch/utils/_sympy/value_ranges.py. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions. In torch/fx/experimental/symbolic_shapes.py we need to make some symbolic reasoning adjustments: * Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now * `_assert_bound_is_rational` is no more, we no longer generate rational bounds * Don't intersect non-int value ranges with the `int_range` * Support more sympy Functions for guard SYMPY_INTERP * Assert the type of value range is consistent with the variable type The new asserts uncovered necessary bug fixes: * torch/_inductor/codegen/cpp.py, torch/_inductor/select_algorithm.py, torch/_inductor/sizevars.py - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions * torch/_inductor/utils.py - make sure you actually pass in sympy.Expr to these functions * torch/_inductor/ir.py - make_contiguous_strides_for takes int/SymInt, not sympy.Expr! * torch/export/dynamic_shapes.py - don't use infinity to represent int ranges, instead use sys.maxsize - 1 Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at test/test_proxy_tensor.py Reland notes. 
This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/ It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905 Approved by: https://github.com/xadupre, https://github.com/lezcano	2024-06-09 06:20:25 +00:00
PyTorch UpdateBot	31c3fa6cf5	[audio hash update] update the pinned audio hash (#128178 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128178 Approved by: https://github.com/pytorchbot	2024-06-09 04:29:04 +00:00
cyy	7bfd1db53a	[4/N] Change static functions in headers to inline (#128286 ) Follows #128194. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128286 Approved by: https://github.com/Skylion007, https://github.com/XuehaiPan	2024-06-09 03:08:53 +00:00
Anshul Sinha	f681e3689b	[dtensor][experiment] experimenting with displaying distributed model parameters and printing sharding info (#127987 ) Summary Example code to display distributed model parameters and verify them against ground truth. Also prints sharding information. Test Plan torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127987 Approved by: https://github.com/XilunWu ghstack dependencies: #127358, #127360, #127630	2024-06-09 00:14:07 +00:00
Anshul Sinha	2c2cf1d779	[dtensor][experiment] experimenting with displaying model parameters (#127630 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): Summary Example code to display model parameters and verify them against ground truth. Also expanded on moduletracker to accomplish this. Test Plan python3 torch/distributed/_tensor/examples/display_sharding_example.py * #127987 * __->__ #127630 * #127360 * #127358 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127630 Approved by: https://github.com/XilunWu ghstack dependencies: #127358, #127360	2024-06-09 00:14:07 +00:00
Xinya Zhang	d34075e0bd	Add Efficient Attention support on ROCM (#124885 ) This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation Known limitations: - Only supports MI200/MI300X GPUs - Does not support varlen - Does not support `CausalVariant` - Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null - Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM. This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129 `PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885 Approved by: https://github.com/malfet	2024-06-08 22:41:05 +00:00
James Wu	6e7a23475d	[easy] Run autograd if any mutations on inputs that require grad (#128229 ) If any inputs are mutated that require grad, even if all the outputs don't require grad, we should still run autograd with a backwards graph. This fixes two tests: test_input_mutation_alias_everything and test_view_detach. Fixes #128035 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128229 Approved by: https://github.com/aorenste	2024-06-08 21:18:38 +00:00
Will Feng	aee154edbe	[Traceable FSDP2] Make FSDPParam._unsharded_param creation traceable (#127245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127245 Approved by: https://github.com/awgu	2024-06-08 21:10:15 +00:00
Pritam Damania	0dd55ee159	Fix bug in _update_process_group API (#128262 ) `local_used_map_` was undefined when `find_unused_parameters=False`, which resulted in an error when we ran `local_used_map_.fill_(0);`. Added a unit test as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128262 Approved by: https://github.com/awgu	2024-06-08 19:52:24 +00:00
Animesh Jain	3494f3f991	[dynamo] Skip inlining builtin nn modules for torch.compile inside cond (#128247 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128247 Approved by: https://github.com/ydwu4 ghstack dependencies: #128246	2024-06-08 19:20:00 +00:00
Animesh Jain	33972dfd58	[easy][inline-inbuilt-nn-modules] Fix expected graph for control flow test (#128246 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128246 Approved by: https://github.com/ydwu4	2024-06-08 19:20:00 +00:00
Aaron Orenstein	57536286e2	Flip default value for mypy disallow_untyped_defs [10/11] (#127847 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127847 Approved by: https://github.com/oulgen ghstack dependencies: #127842, #127843, #127844, #127845, #127846	2024-06-08 18:50:06 +00:00
Aaron Orenstein	8db9dfa2d7	Flip default value for mypy disallow_untyped_defs [9/11] (#127846 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127846 Approved by: https://github.com/ezyang ghstack dependencies: #127842, #127843, #127844, #127845	2024-06-08 18:50:06 +00:00
Aaron Orenstein	27f9d3b0a1	Flip default value for mypy disallow_untyped_defs [8/11] (#127845 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127845 Approved by: https://github.com/oulgen ghstack dependencies: #127842, #127843, #127844	2024-06-08 18:49:56 +00:00
Aaron Orenstein	038b927590	Flip default value for mypy disallow_untyped_defs [7/11] (#127844 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127844 Approved by: https://github.com/oulgen ghstack dependencies: #127842, #127843	2024-06-08 18:49:45 +00:00
Aaron Orenstein	7c12cc7ce4	Flip default value for mypy disallow_untyped_defs [6/11] (#127843 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843 Approved by: https://github.com/oulgen ghstack dependencies: #127842	2024-06-08 18:49:29 +00:00
Aaron Orenstein	3a0d088517	Flip default value for mypy disallow_untyped_defs [5/11] (#127842 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842 Approved by: https://github.com/oulgen	2024-06-08 18:49:18 +00:00
Aaron Orenstein	62bcdc0ac9	Flip default value for mypy disallow_untyped_defs [4/11] (#127841 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127841 Approved by: https://github.com/oulgen	2024-06-08 18:36:48 +00:00
Aaron Orenstein	afe15d2d2f	Flip default value for mypy disallow_untyped_defs [3/11] (#127840 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127840 Approved by: https://github.com/oulgen	2024-06-08 18:28:01 +00:00
Aaron Orenstein	ea614fb2b1	Flip default value for mypy disallow_untyped_defs [2/11] (#127839 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839 Approved by: https://github.com/oulgen	2024-06-08 18:23:08 +00:00
Aaron Orenstein	dcfa7702c3	Flip default value for mypy disallow_untyped_defs [1/11] (#127838 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838 Approved by: https://github.com/oulgen	2024-06-08 18:16:33 +00:00
Chien-Chin Huang	2369c719d4	[DSD][BE] Cleanup unused variables and rename variables to avoid exposure to the users (#128249 ) These APIs and variables should not be exposed to users as they are designed to be used internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128249 Approved by: https://github.com/wz337	2024-06-08 17:12:17 +00:00
PyTorch MergeBot	02a901f1e9	Revert "[RFC] Provide optional switches to _dump_nccl_trace (#127651 )" This reverts commit 0a761f0627130e739f0e2748e3f71a0c347552c4. Reverted https://github.com/pytorch/pytorch/pull/127651 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/127651#issuecomment-2156076838))	2024-06-08 15:30:04 +00:00
PyTorch MergeBot	57a24c4fdb	Revert "[RFC] add per-collective timeout value in flight recorder (#128190 )" This reverts commit 09cccbc1c74c9d1157c1caca5526e79ee9b7ea01. Reverted https://github.com/pytorch/pytorch/pull/128190 on behalf of https://github.com/atalman due to Sorry need to revert this, in conflict with https://github.com/pytorch/pytorch/pull/127651 that needs reverting ([comment](https://github.com/pytorch/pytorch/pull/128190#issuecomment-2156075318))	2024-06-08 15:25:27 +00:00
Xuehai Pan	348b181a97	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007	2024-06-08 15:25:03 +00:00
Bin Bao	917387f66d	[AOTI] fix a constant tensor device move issue (#128265 ) Summary: When copying a constant tensor to another device, `.to` returns a fake tensor and causes a problem when a real tensor is expected. Test Plan: CI Differential Revision: D58313034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128265 Approved by: https://github.com/chenyang78	2024-06-08 13:23:49 +00:00
cyy	695502ca65	[3/N] Change static functions in headers to inline (#128194 ) Follows #127764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128194 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-06-08 08:06:31 +00:00
Edward Z. Yang	73d6ec2db6	Increase verbosity of FX graph dumps (#128042 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128042 Approved by: https://github.com/aorenste	2024-06-08 07:24:58 +00:00
Ke Wen	0e6c204642	[pipelining] Friendly error message when not traceable (#128276 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128276 Approved by: https://github.com/H-Huang	2024-06-08 06:36:11 +00:00
PyTorch MergeBot	44371bd432	Revert "[dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 )" This reverts commit 7ede78f9f5d7e6c993faa1a70a5f0b0eaec5640d. Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2155836555))	2024-06-08 06:35:34 +00:00
PyTorch MergeBot	6e13c7e874	Revert "[dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158 )" This reverts commit 747fc35ff54154ddec2a5ab5661f57c28d65c591. Reverted https://github.com/pytorch/pytorch/pull/128158 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128158#issuecomment-2155835787))	2024-06-08 06:32:28 +00:00
PyTorch MergeBot	94165dba7b	Revert "[dynamo] Inline the getattr of fx graph and proxy graph (#128172 )" This reverts commit 662a78f957fb89e53ebeba7deb880561e10ecaf6. Reverted https://github.com/pytorch/pytorch/pull/128172 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128172#issuecomment-2155835201))	2024-06-08 06:29:36 +00:00
Wanchao Liang	8a0bc8c9ee	[fsdp2] simplify fsdp_param logic with DTensorSpec (#128242 ) as titled, we can use a single DTensorSpec to save the SPMD sharding spec, plus the global shape/stride to simplify the FSDPParam logic Pull Request resolved: https://github.com/pytorch/pytorch/pull/128242 Approved by: https://github.com/awgu	2024-06-08 05:56:41 +00:00
Shaz Qadeer	cbb7e3053f	View specialization (#127641 ) This PR adds specialization shortcuts for converting n-d to 1-d and 1-d to 2-d views. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127641 Approved by: https://github.com/ezyang	2024-06-08 05:52:52 +00:00
chilli	310f80995b	Added memory budget to partitioner (#126320 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320 Approved by: https://github.com/shunting314	2024-06-08 05:52:40 +00:00
chilli	ffc202a1b9	Added remove_noop_ops to joint_graph_passes (#124451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124451 Approved by: https://github.com/ezyang, https://github.com/fmassa	2024-06-08 05:48:11 +00:00
Wanchao Liang	c446851829	[fsdp2] update foreach_reduce accumulate_grad (#128117 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128117 Approved by: https://github.com/awgu	2024-06-08 05:13:57 +00:00
Ke Wen	613c7d270d	[pipelining] Format doc (#128279 ) - Should use two dots around `var` - Wrap lines - Add section cross ref Pull Request resolved: https://github.com/pytorch/pytorch/pull/128279 Approved by: https://github.com/H-Huang ghstack dependencies: #128273, #128278	2024-06-08 04:59:04 +00:00
Ke Wen	2e42671619	[pipelining] Rename to stage.py and schedules.py (#128278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128278 Approved by: https://github.com/H-Huang ghstack dependencies: #128273	2024-06-08 04:42:35 +00:00
Ke Wen	0e3fe694d1	[pipelining] Restore a stage constructor for tracer path (#128273 ) If the user has modified the stage module out of place, for example with `mod = DDP(mod)` or `mod = torch.compile(mod)`, they need a stage builder other than `pipe.build_stage()`. This PR provides an API to do so: ``` def build_stage( stage_module, stage_index, pipe.info(), ... ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273 Approved by: https://github.com/wconstab	2024-06-08 04:42:35 +00:00
Wu, Chunyuan	8a45cf4c64	[AOTI] align data_size of the constants (#127610 ) https://github.com/pytorch/pytorch/pull/124272 set the alignment for `consts_o`, but if the `data_size` of any tensor in `consts_o` is not divisible by the alignment, the tensors that follow it are no longer aligned, resulting in poor performance on CPU. We align the `data_size` as well in this PR and pad the serialized bytes. Since the `size` of the tensor, not the `data_size`, is used when creating a tensor from the serialized bytes ([link](`f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L236-L259)`)), there won't be a correctness issue. `data_size` is only used to record the [bytes_read](`f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L217)`). This PR will improve the performance on CPU for 4 models in HF, 7 models in TIMM and 1 model in Torchbench. For the unit test, I add a bias value whose original `data_size` is not divisible by the alignment to test the correctness: ``` constants_info_[0].dtype = static_cast<int32_t>(at::kFloat); constants_info_[0].data_size = 64; /* was 40 before this PR */ constants_info_[0].shape = {10}; constants_info_[1].dtype = static_cast<int32_t>(at::kFloat); ...... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127610 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-08 04:31:00 +00:00
Iris Z	1d84c7e100	[DeviceMesh] Update get_group and add get_all_groups (#128097 ) Fixes #121984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-06-08 04:28:56 +00:00
Jason Ansel	6e5c2a1a3b	[inductor] Add missing files to torch_key (#128230 ) Previously, all subdirs (like torch.inductor.codegen) were not hashed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128230 Approved by: https://github.com/oulgen	2024-06-08 03:26:48 +00:00
Yidi Wu	6220602943	[torchbind] support query schema of methods (#128267 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128267 Approved by: https://github.com/angelayi	2024-06-08 03:20:44 +00:00
PyTorch MergeBot	0ef5229569	Revert "Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030 )" This reverts commit fdf1666b20f63e4acf01798f009e478d997a7f7f. Reverted https://github.com/pytorch/pytorch/pull/128030 on behalf of https://github.com/nWEIdia due to breaking cuda12.1 test_cuda, see HUD https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor ([comment](https://github.com/pytorch/pytorch/pull/128030#issuecomment-2155764546))	2024-06-08 02:34:06 +00:00
Will Constable	f9508b4c1f	[pipelining] Update Pipelining Docs (#128236 ) ---- - Bring PipelineStage/Schedule more front-and-center - provide details on how to manually construct PipelineStage - move tracer example and manual example below so the high-level flow (e2e) is closer to the top Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236 Approved by: https://github.com/H-Huang ghstack dependencies: #128201, #128228	2024-06-08 02:03:46 +00:00
Andrew Hoblitzell	fe74bbd6f0	init sigmoid comments (#127983 ) Fixes #127913 ### Description Add docstring to `torch/onnx/symbolic_opset9.py`:`sigmoid` function ### Checklist - [x] The issue that is being fixed is referred in the description - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/127983 Approved by: https://github.com/xadupre	2024-06-08 01:48:00 +00:00
Ke Wen	921aa194c7	[pipelining] Move modify_graph_op_device to _IR.py (#128241 ) This part is more IR related. Thus moving from `PipelineStage` constructor to `pipe.build_stage(..., device, ...)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128241 Approved by: https://github.com/wconstab ghstack dependencies: #128240	2024-06-08 01:35:07 +00:00
Ke Wen	ad96f991a5	[pipelining] Add pipe.build_stage() (#128240 ) The `PipelineStage` name has been given to the manual side, so this adds a method under `Pipe` to create a PipelineStage. Moved `PipeInfo` to utils.py to avoid a circular dependency between `_IR` and `PipelineStage`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-08 01:26:02 +00:00
Li-Huai (Allan) Lin	5ef081031e	[MPS] Include MPSGraphVenturaOps.h for complex types on macOS 12 (#127859 ) Fixes this on macOS 12: ``` /Users/qqaatw/Forks/pytorch/aten/src/ATen/native/mps/operations/FastFourierTransform.mm:108:60: error: use of undeclared identifier 'MPSDataTypeComplexFloat16'; did you mean 'MPSDataTypeFloat16'? (inputTensor.dataType == MPSDataTypeFloat16) ? MPSDataTypeComplexFloat16 : MPSDataTypeComplexFloat32; ^~~~~~~~~~~~~~~~~~~~~~~~~ MPSDataTypeFloat16 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127859 Approved by: https://github.com/kulinseth	2024-06-08 00:54:30 +00:00
Alnis Murtovi	647815049e	Inductor: Allow small sizes of m for mixed mm autotuning (#127663 ) For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056. I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used. For the example in #127056: - Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s. - If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s. - With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663 Approved by: https://github.com/Chillee	2024-06-08 00:46:16 +00:00
cyy	ef2b5ed500	[4/N] Remove unused functions (#128193 ) Follows #128179 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128193 Approved by: https://github.com/ezyang	2024-06-08 00:09:26 +00:00
Animesh Jain	39dd4740e6	[inductor][dynamo-inline-nn-modules] Fix test with inlining flag (#128200 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128200 Approved by: https://github.com/Skylion007 ghstack dependencies: #128001, #126578, #128158, #128172	2024-06-07 23:51:58 +00:00
Howard Huang	bef586111a	[pipelining] pipelining.rst updates (#128228 ) fix some nits and add `PipelineStage` (manual) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128228 Approved by: https://github.com/wconstab ghstack dependencies: #128201	2024-06-07 23:29:54 +00:00
Chirag Pandya	09cccbc1c7	[RFC] add per-collective timeout value in flight recorder (#128190 ) Summary: Add timeout value field on every collected record. Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190 Approved by: https://github.com/wconstab	2024-06-07 23:29:35 +00:00
Catherine Lee	11f2d8e823	Move inductor cuda 124 jobs to a separate workflow that is not triggered by ciflow/inductor (#128250 ) https://github.com/pytorch/pytorch/pull/127825 The majority of the g5 runner usage comes from inductor (its something like 2x everything else) in the past week, inductor ran 1300 ish times on PRs and 300 times on main. Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs. I was under the impression that cu124 is experimental currently and eventually we'll need to switch to it, so this will stay until we switch or inductor uses much fewer runners Are we expected to be able to handle two versions of cuda in CI? Because currently we cannot, at least not comfortably Pull Request resolved: https://github.com/pytorch/pytorch/pull/128250 Approved by: https://github.com/huydhn	2024-06-07 23:01:52 +00:00
laithsakka	5b3624117a	update test_issue175 to handle inline_inbuilt_nn_modules (#128026 ) With inlining, the output graph has more function calls; reflect that in the test that counts the number of function calls. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128026 Approved by: https://github.com/anijain2305 ghstack dependencies: #127553	2024-06-07 22:07:16 +00:00
Xu Han	ba81c3c290	[inductor] add cpp builder code. (take 2) (#125849 ) A fully manual rebase of the code of PR: https://github.com/pytorch/pytorch/pull/124045 The old PR seems to have crashed due to too many commits and too many rebases. Please reference: https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588 ------- It is the first step of RFC https://github.com/pytorch/pytorch/issues/124245. Changes: 1. Add cpp builder code; the new cpp_builder supports Windows OS. 2. Add a CPU ISA checker which is cross-OS and exported from the backend cpuinfo. 3. Switch the compiler ISA checker to the new cpp builder. 4. CppCodeCache uses the new ISA checker. 5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-07 20:49:58 +00:00
dshi7	3a620a0f65	bug fix of dynamo_timed in cprofile (#128203 ) Fixes #ISSUE_NUMBER fb-only: "Entire Frame" was missing before this change. Before: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f565966006-TrainingApplication/20240527/rank_0/5_0_1/compilation_metrics_23.html After: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f569854578-TrainingApplication/20240606/rank_0/0_0_0/compilation_metrics_16.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/128203 Approved by: https://github.com/Chillee	2024-06-07 20:47:27 +00:00
Catherine Lee	8892ddaacc	[TD] Test removal on sm86 (#127131 ) Yolo I'm excited to break CI :') Pull Request resolved: https://github.com/pytorch/pytorch/pull/127131 Approved by: https://github.com/huydhn, https://github.com/ZainRizvi	2024-06-07 20:19:18 +00:00
angelayi	fdf1666b20	Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030 ) aten.lerp decomposition causes prims::copy_strided to appear in the graph, which is not core aten. Internal ref: https://fb.workplace.com/groups/pytorch.edge.users/permalink/1525644288305859/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/128030 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-06-07 20:12:52 +00:00
Ke Wen	e647ea55a3	[pipelining] redirect README to document (#128205 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128205 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-07 19:34:52 +00:00
Howard Huang	dcb63fcedb	[pipelining] Remove num_microbatches from stage (#128201 ) This is similar to https://github.com/pytorch/pytorch/pull/127979, but instead of removing `num_microbatches` from schedule, we remove it from `PipelineStage`. This also means that during `PipelineSchedule` init we need to setup the buffers for the stage(s). Pull Request resolved: https://github.com/pytorch/pytorch/pull/128201 Approved by: https://github.com/kwen2501	2024-06-07 18:56:44 +00:00
Aaron Gokaslan	cafbcb6376	[BE]: Update ruff to 0.4.8 (#128214 ) Updates ruff to 0.4.8. Some minor fixes, but notably it is 10% faster on microbenchmarks and should further reduce local and CI runtime of the linter. Also includes a few bugfixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128214 Approved by: https://github.com/ezyang	2024-06-07 18:41:35 +00:00
Will Constable	8ca4cefc7d	[C10D] Ensure gil is not released when calling toPyBytes (#128212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128212 Approved by: https://github.com/Skylion007, https://github.com/XilunWu	2024-06-07 18:24:10 +00:00
_daohang_	0a6df4fca6	delete inductor config.trace.compile_profile (#127143 ) Fixes #ISSUE_NUMBER https://fb.workplace.com/groups/257735836456307/posts/687858786777341/?comment_id=687861123443774&reply_comment_id=687865486776671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127143 Approved by: https://github.com/Chillee	2024-06-07 18:05:50 +00:00
Xu Zhao	82d7a36a27	Added torchao nightly workflow (#128152 ) Summary: Add torchao benchmark workflow, upload the artifacts to GHA. X-link: https://github.com/pytorch/benchmark/pull/2273 Test Plan: ``` python run_benchmark.py torchao --ci ``` Differential Revision: D58140479 Pulled By: xuzhao9 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128152 Approved by: https://github.com/jerryzh168	2024-06-07 17:52:15 +00:00
Shunting Zhang	0c7f4353e5	[inductor] simplify indexing (#127661 ) This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002 We found that the bad perf of the int8_unpack kernel is caused by sub-optimal indexing. In this PR we introduce 2 indexing optimizations: 1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`. 2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b. With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us). Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661 Approved by: https://github.com/jansel	2024-06-07 17:51:30 +00:00
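Both rewrites described in the commit above can be sanity-checked numerically. The sketch below is a brute-force check in plain Python, not inductor code; it assumes `ModularIndexing(x, div, mod)` means `(x // div) % mod` and that indices are non-negative integers. The helper names are hypothetical.

```python
def modular_indexing(x: int, div: int, mod: int) -> int:
    """Assumed semantics of ModularIndexing(x, div, mod): (x // div) % mod."""
    return (x // div) % mod


def check_floordiv_expansion(limit: int = 64) -> bool:
    """Check x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2 over a small grid."""
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(limit)
        for x2 in range(limit)
    )


def check_modular_merge(a: int, b: int, limit: int = 512) -> bool:
    """Check that the nested form equals ModularIndexing(x, 1, b) for 0 <= x < limit."""
    return all(
        modular_indexing(modular_indexing(x, 1, a), 1, b) == modular_indexing(x, 1, b)
        for x in range(limit)
    )
```

With `a = 8, b = 4` the merge holds for every tested `x`; with `a = 6, b = 4` it fails (for example at `x = 6`), which shows why the "a is a multiple of b" condition is needed.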
Animesh Jain	662a78f957	[dynamo] Inline the getattr of fx graph and proxy graph (#128172 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128172 Approved by: https://github.com/yanboliang ghstack dependencies: #128001, #126578, #128158	2024-06-07 17:14:58 +00:00
BowenBao	19b31d899a	Fix 'get_real_value' on placeholder nodes (#127698 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698 Approved by: https://github.com/jansel ghstack dependencies: #127695, #127696	2024-06-07 17:13:43 +00:00
BowenBao	b741819b05	Fix 'get_attr' call in dynamo 'run_node' (#127696 ) Fixes #124858 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696 Approved by: https://github.com/jansel ghstack dependencies: #127695	2024-06-07 17:13:43 +00:00
BowenBao	3aa623d407	Fix assume_constant_result for UnspecializedNNModuleVariable methods (#127695 ) Fixes #127509 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127695 Approved by: https://github.com/jansel	2024-06-07 17:13:43 +00:00
Zain Rizvi	754e6d4ad0	Make jobs with LF runners still pass lint (#128175 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128175 Approved by: https://github.com/huydhn	2024-06-07 17:13:04 +00:00
Xilun Wu	85758fa5ae	[c10d][TCPStore] make TCPStore server use libuv by default (#127957 ) Summary This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability. We hope this improvement will give users a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one. What it changes This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to be noticeable to users except for significantly faster TCPStore startup in large-scale jobs. One thing to note is that we do not support the initialization approach where the user passes in a socket for the libuv backend. We plan to support it as a next step, but we choose to disable it until it is fully tested. If you are initializing TCPStore with this approach, see the next section to keep using the old TCPStore server. Fallback/Remain using the old TCPStore server For users who want to stay with the old TCPStore backend, there are 3 ways: 1. If the user is directly instantiating a TCPStore object, the user can pass in the argument `use_libuv=False` to use the old TCPStore server backend, e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`. 2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")` 3. Or, the user can set the environment variable `USE_LIBUV` to `"0"` when launching. These 3 approaches are listed in order of precedence.
That is, if the user specifies `use_libuv=0` in `init_method` and also sets environment var `USE_LIBUV="1"`, the former will take effect and the TCPStore backend instantiated will be the old one instead of the one using libuv. Operating Systems Compatibility From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label. Test Plan `pytest test/distributed/test_store.py` note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time. `test/distributed/elastic/utils/distributed_test.py` TODO 1. Update the doc at - https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store - https://pytorch.org/docs/stable/distributed.html#tcp-initialization 2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman 3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.
Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957 Approved by: https://github.com/kurman ghstack dependencies: #127956	2024-06-07 16:53:01 +00:00
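The precedence rule in the commit above (constructor argument, then `init_method` query string, then `USE_LIBUV` environment variable, else the libuv default) can be sketched in plain Python. `resolve_use_libuv` is a hypothetical helper for illustration, not a PyTorch API, and treating every value other than `"0"` as "use libuv" is an assumption.

```python
import os
from urllib.parse import parse_qs, urlparse


def resolve_use_libuv(explicit=None, init_method=None, environ=None) -> bool:
    """Return True if the libuv backend would be chosen under the stated precedence.

    explicit    -- value of a use_libuv=... constructor argument, or None if not given
    init_method -- rendezvous URL such as "tcp://host:port?use_libuv=0", or None
    environ     -- mapping used instead of os.environ (handy for testing)
    """
    environ = os.environ if environ is None else environ

    # 1. An explicit constructor argument wins over everything else.
    if explicit is not None:
        return bool(explicit)

    # 2. Next, a use_libuv option in the init_method query string.
    if init_method:
        query = parse_qs(urlparse(init_method).query)
        if "use_libuv" in query:
            return query["use_libuv"][-1] != "0"  # assumption: only "0" disables

    # 3. Then the USE_LIBUV environment variable.
    if "USE_LIBUV" in environ:
        return environ["USE_LIBUV"] != "0"  # assumption: only "0" disables

    # 4. Otherwise the new default described in the commit: libuv.
    return True
```

For example, `resolve_use_libuv(init_method="tcp://localhost:29500?use_libuv=0", environ={"USE_LIBUV": "1"})` selects the old backend, matching the scenario spelled out in the commit message.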
Xilun Wu	6c824cd9fb	[BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend (#127956 ) Summary The use of TORCH_ERROR in TCPStore libuv backend code needs update. Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127956 Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever	2024-06-07 16:53:01 +00:00
Howard Huang	b9b89ed638	[pipelining] fix LoopedBFS (#127796 ) # Issues Currently two issues need to be fixed with LoopedBFS: 1. The wrap around send operation to the looped around stage blocks will cause a hang. For some reason this doesn't surface on single node, but on multihost this surfaces in a hang. <img width="1311" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/210d9d18-455f-4f65-8a11-7ce2c1ec73fd"> 2. When microbatches are popped off in `backward_one_chunk` will automatically use the `bwd_chunk_id` starting from 0. This works for interleaved 1f1b and 1f1b, but for loopedBFS we want to pop from starting at `num_microbatches - 1`. Same needs to be fixed for gpipe? # Changes - Update LoopedBFS implementation to share `_step_microbatches` with `Interleaved1F1B` - Also share the tests between the two schedules for varying num_microbatches, local_stages, and world_sizes - Update `backward_one_chunk` to optionally take a `bwd_chunk_id` argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127796 Approved by: https://github.com/wconstab	2024-06-07 16:46:38 +00:00
Mu-Chu Lee	d9696ea624	[AOTInductor] [Tooling] Update NaN and INF Checker for AOTInductor (#127574 ) Summary: 1. Integrate NaN and INF checker with existing config, controllable by env var. 2. Move inject point of NaN & INF checker earlier, this could prevent buffer freeing before check. 3. Inject debugging code in Kernel level, which prevents us trying to read buffers that are fused inplace and into a single kernel. Test Plan: Debugging utility. Test and check by existing tests with env var: ``` TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda ``` Reviewed By: ColinPeppler Differential Revision: D57989176 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-06-07 16:46:26 +00:00
Prachi Gupta	fc6e3ff96d	[ROCm] Update triton pin to fix libtanh issue (#125396 ) There were some internal build issues related to tanh when we moved to upstream triton in ROCm. These issues were fixed by the following triton commit: https://github.com/triton-lang/triton/pull/3810 . This PR moves the triton pin to incorporate that change. Added some skips for unit tests that regressed due to the triton commit bump in this PR. Needs https://github.com/pytorch/pytorch/pull/127968 since this PR introduces a triton dependency on llnl-hatchet, which doesn't have py3.12 wheels available currently. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396 Approved by: https://github.com/pruthvistony, https://github.com/malfet	2024-06-07 16:23:04 +00:00
PyTorch MergeBot	128952625b	Revert "Added memory budget to partitioner (#126320 )" This reverts commit 2184cdd29128a924583e4702489177f83fb8270a. Reverted https://github.com/pytorch/pytorch/pull/126320 on behalf of https://github.com/ZainRizvi due to The new test_ac.py fails on ROCm machines ([comment](https://github.com/pytorch/pytorch/pull/126320#issuecomment-2155141886))	2024-06-07 16:15:03 +00:00
cyy	c219fa5eb9	[3/N] Remove unused functions (#128179 ) Following https://github.com/pytorch/pytorch/pull/128005, this PR continues to remove unused functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128179 Approved by: https://github.com/ezyang	2024-06-07 16:13:16 +00:00
Sam Larsen	8d16a73f0f	Manipulate triton_hash_with_backend so that it doesn't contain any keywords (#128159 ) Summary: See https://github.com/pytorch/pytorch/issues/127637 where "def" appears in the backend_hash and causes a problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128159 Approved by: https://github.com/jansel	2024-06-07 16:10:44 +00:00
Sam Larsen	852b7b4c99	[inductor] Enable subprocess-based parallel compile as the default (#126817 ) Differential Revision: [D58239826](https://our.internmc.facebook.com/intern/diff/D58239826) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817 Approved by: https://github.com/eellison ghstack dependencies: #128037, #128086	2024-06-07 16:10:11 +00:00
PyTorch MergeBot	ac51f782fe	Revert "Complete revamp of float/promotion sympy handling (#126905 )" This reverts commit 2f7cfecd86009a9d396fdbdcdfb4ba7a005db16b. Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))	2024-06-07 16:01:46 +00:00
PyTorch MergeBot	23c156cd2d	Revert "[inductor] simplify indexing (#127661 )" This reverts commit 901226ae837bd4629b34735c84a3481c4988bb5b. Reverted https://github.com/pytorch/pytorch/pull/127661 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with https://github.com/pytorch/pytorch/pull/126905 which needs to be reverted, will be relanding it ([comment](https://github.com/pytorch/pytorch/pull/127661#issuecomment-2155115388))	2024-06-07 15:58:36 +00:00
cyy	a1b664adeb	Add default values to PyTorchMemEffAttention::AttentionKernel::Params members (#112215 ) Default values were added to Params in order to eliminate CUDA warnings like ``` and the implicitly-defined constructor does not initialize ‘PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::accum_t PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::Params::scale’ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112215 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-06-07 15:54:07 +00:00
Ke Wen	3090667cf9	[pipelining] pipeline() taking microbatch as example input (#128163 ) Changed the API of `pipeline()` to take microbatch instead of full batch as example args. Main purpose is to: - make this API more atomic; - decouple tracing frontend from runtime info like `num_chunks`. Side effects: - Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object. - User has to create example microbatch input. - Chunk spec stuff are now all moved to runtime side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163 Approved by: https://github.com/H-Huang	2024-06-07 15:51:53 +00:00
PyTorch MergeBot	224b4339e5	Revert "Make ValueRange repr less chatty by default (#128043 )" This reverts commit f0dd11df5534ae074ad2d090e6700576a22719d6. Reverted https://github.com/pytorch/pytorch/pull/128043 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with [#126905](https://github.com/pytorch/pytorch/pull/126905) which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128043#issuecomment-2155091732))	2024-06-07 15:43:39 +00:00
James Wu	6e75024ff0	Run TestAOTAutograd with dynamo (#128047 ) My goal is to run these tests with the autograd cache on, but first I want them running with dynamo. These tests already caught an interesting issue so I thought it would be helpful to just have them. Next up I'll have a second subclass of these tests, run them twice, and expect a cache hit the second time from autograd. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128047 Approved by: https://github.com/ezyang	2024-06-07 15:42:28 +00:00
GdoongMathew	771be55bb0	Documenting `torch.onnx.operator.shape_as_tensor` (#128051 ) Fixes #127890 This PR adds docstring to the `torch.onnx.operator.shape_as_tensor` function. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128051 Approved by: https://github.com/xadupre	2024-06-07 15:20:18 +00:00
zabboud	3f9798a4fd	add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055 ) Fixes #127891 Fixes #127893 Fixes #127894 Fixes #127907 Fixes #127910 ## Description Add docstring to `masked_fill`, `expand`, `select`, `unsqueeze`, and `cat` functions in torch.onnx.symbolic_opset9.py remaining pydocstyle errors: 257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128055 Approved by: https://github.com/xadupre	2024-06-07 15:17:22 +00:00
Howard Huang	543a870943	[pipelining] Rename ManualPipelineStage -> PipelineStage (#128157 ) Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157 Approved by: https://github.com/wconstab	2024-06-07 09:24:16 +00:00
Will Feng	5f81265572	[Traceable FSDP2] Return early from _register_post_backward_hook when compile (#127864 ) Dynamo doesn't support `RegisterPostBackwardFunction` very well yet. This PR skips it and relies on `root_post_backward_callback` under compile. We will improve `RegisterPostBackwardFunction` support in Q3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127864 Approved by: https://github.com/awgu	2024-06-07 09:19:07 +00:00
chunyuan	7efaeb1494	[AOTI] docs: add suggestion to turn on freezing on CPU (#128010 ) With https://github.com/pytorch/pytorch/pull/124350 landed, it is now suggested in AOTI to turn on freezing on CPU to get better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128010 Approved by: https://github.com/desertfire	2024-06-07 08:57:02 +00:00
Pian Pawakapan	0c16800b4a	[pipelining] include lifted constants in input_to_state (#128173 ) Previous PR only looked at state dict to determine inputs to state, missing out on lifted tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/128173 Approved by: https://github.com/kwen2501	2024-06-07 08:40:54 +00:00
Ke Wen	01601ebd41	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-07 08:11:58 +00:00
Jason Ansel	70724bdbfe	Bugfix for nondeterminstic torch_key (#128111 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128111 Approved by: https://github.com/oulgen	2024-06-07 07:17:39 +00:00
Simon Fan	00c6ca4459	[compiled autograd][cudagraphs] Inputs runtime wrapper to move cpu scalars to cuda (#125382 ) CPU scalars are most commonly used for the philox random seed. Right now, any CPU input will skip cudagraphing the entire graph. We need both the traced graph and the runtime inputs to be cudaified. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125382 Approved by: https://github.com/jansel	2024-06-07 07:12:46 +00:00
Ke Wen	190f06d468	[pipelining] Lower _configure_data_parallel_mode to stage (#127946 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127946 Approved by: https://github.com/wconstab ghstack dependencies: #127935	2024-06-07 07:06:23 +00:00
Will Feng	a448b3ae95	[Traceable FSDP2] Check hasattr('fsdp_pre_all_gather') only when not compile (#127855 ) Dynamo doesn't support `hasattr(inner_tensor, "fsdp_post_all_gather")` yet. We will work on this support in Q3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127855 Approved by: https://github.com/awgu	2024-06-07 06:36:40 +00:00
Sun, Jiayi	2ff312359c	skip hf_T5_generate in dynamic shape test (#121129 ) As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` failed with dynamic shape testing, we propose to skip the dynamic batch size testing of this model in this PR. * Error msg is ``` File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV guards = output_graph.shape_env.produce_guards( File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards raise ConstraintViolationError( torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic". - Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4). ``` * Root Cause is This error happens while creating guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked` I run it with TORCH_LOGS="+dynamic" and got the key line : `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)` The reason for this error is that the batch dimension of `inputs_tensor` in the dynamic batch size test is marked as dynamic shape `s0`, so the batch dimension of `scores` generated by a series of operations with `inputs_tensor` is also `s0`. However, because the function of creating `attention_mask` is not in Dynamo but in python. The batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked` generated by a series of operations with `attention_mask` is also the real shape `4`, not the dynamic shape `s0`. 
The current line of `scores += position_bias_masked` requires creating a guard and check whether the batch dimension of `scores` is always equal to the batch dimension of `position_bias_masked`, Eq(s0, 4), the error happens. So the root cause of this error is that the function of creating `attention_mask` not in Dynamo but in python. The reason why the function of `attention_mask` not in Dynamo is that Dynamo has a graph break on this function (happened in the [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error: `torch._dynamo.exc.Unsupported: Tensor.item` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129 Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang	2024-06-07 06:28:29 +00:00
Stonepia	d943357a21	[XPU] Add xpu support of `make triton` (#126513 ) This PR is to add XPU support for `make triton`. If a user wishes to use Triton with XPU support, the user needs to install the [intel-xpu-backend-for-triton](https://github.com/intel/intel-xpu-backend-for-triton). This PR allows the user to easily install Triton for xpu backend support: ``` # clone the pytorch repo export USE_XPU=1 make triton ``` The XPU version of triton will always be built from the source. It will cat the commit id from `.ci/docker/ci_commit_pins/triton-xpu.txt`, for example, `b8c64f64c18d8cac598b3adb355c21e7439c21de`. So the final call would be like: ``` pip install --force-reinstall "git+https://github.com/intel/intel-xpu-backend-for-triton@b8c64f64c18d8cac598b3adb355c21e7439c21de#subdirectory=python" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126513 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-06-07 06:25:47 +00:00
laithsakka	68cc63ae27	introduce skipIfNNModuleInlined and skip test_cpu_cuda_module_after_dynamo (#128023 ) See https://github.com/pytorch/pytorch/issues/127636 for details about the issue. TL;DR: when inlining is enabled, we create a fake tensor while tracing in dynamo and try to perform aten.add.Tensor between two tensors on different devices; without inlining we do not hit that operation during tracing. ``` Failed running call_function <built-in function add>((FakeTensor(..., size=(20, 20), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(20, 20))), *{}): Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices cpu, cuda:0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128023 Approved by: https://github.com/anijain2305 ghstack dependencies: #127487, #127553	2024-06-07 06:00:33 +00:00
laithsakka	7e48d6a497	reset dynamo in test_do_not_skip_side_effects unit test loop to avoid dynamo cache limit hit (#127487 ) Fixes https://github.com/pytorch/pytorch/issues/127483 When nn module inlining is enabled, all recompilations are counted against the same frame, hence we hit the cache limit for test_do_not_skip_side_effects. Without inlining things are different: each time we hit a new Object Model we do not consider that a recompilation, as explained in https://github.com/pytorch/pytorch/issues/127483 For that test we do not really care about cache size, hence I reset dynamo in the main loop. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127487 Approved by: https://github.com/anijain2305	2024-06-07 06:00:33 +00:00
Sam Larsen	dc8e3c2e90	[inductor] subproc parallel compile: initialize future before sending work to the pool (#128086 ) Summary: I got reports of intermittent failures in CI and the logs show errors like this: ``` CRITICAL:concurrent.futures:Future 139789013754560 in unexpected state: FINISHED ``` I can't repro locally, but seems clear that we should initialize the future _before_ sending work to the subprocess pool since it could finish before we call set_running_or_notify_cancel() Differential Revision: [D58239829](https://our.internmc.facebook.com/intern/diff/D58239829) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128086 Approved by: https://github.com/jansel ghstack dependencies: #128037	2024-06-07 04:17:35 +00:00
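The ordering issue described in the #128086 entry above can be illustrated with the standard library alone. This is a minimal sketch, not the inductor code: `submit` is a hypothetical helper, and a plain thread stands in for the subprocess pool. The point it shows is that `set_running_or_notify_cancel()` is called on the future before the work is handed off, so a fast worker cannot finish the future while it is still pending.

```python
import threading
from concurrent.futures import Future


def submit(fn, *args) -> Future:
    """Hypothetical helper: run fn(*args) on a worker thread and return a Future."""
    future: Future = Future()
    # Mark the future as running *before* dispatching the work. If this call
    # happened after dispatch, a fast worker could already have set the result,
    # and the future would be found in the FINISHED state here.
    future.set_running_or_notify_cancel()

    def worker() -> None:
        try:
            future.set_result(fn(*args))
        except Exception as exc:  # forward failures to the caller
            future.set_exception(exc)

    threading.Thread(target=worker, daemon=True).start()
    return future


if __name__ == "__main__":
    print(submit(pow, 2, 5).result(timeout=5))
```

Swapping the two steps (dispatch first, then `set_running_or_notify_cancel()`) reintroduces the race: the call only succeeds if the worker has not completed yet.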
Sam Larsen	6a2bf48cfa	[inductor] subproc parallel-compile: start thread last in init (#128037 ) Summary: Observed on an internal workload: the helper thread started and attempted to access member variables before they were initialized. Differential Revision: [D58239827](https://our.internmc.facebook.com/intern/diff/D58239827) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128037 Approved by: https://github.com/Skylion007, https://github.com/eellison	2024-06-07 04:17:35 +00:00
Sam Larsen	e8e0bdf541	[inductor] parallel-compile: call triton_key() before forking (#127639 ) Summary: A user reported severe slowdown on a workload when using parallel compile. The issue is that in some environments, the process affinity changes after forking such that all forked subprocesses use a single logical processor. Described here: https://github.com/pytorch/pytorch/issues/99625. That requires a separate fix, but during debugging we noticed that we can at least optimize the expensive call to triton_key() before forking. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127639 Approved by: https://github.com/eellison, https://github.com/anijain2305	2024-06-07 04:12:57 +00:00
Ke Wen	96806b1777	[pipelining][doc] Add frontend description and change tracer example (#128070 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128070 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-07 04:09:36 +00:00
Wanchao Liang	3df53c2a8f	[dtensor] directly return local_tensor under no_grad (#128145 ) as titled, skip the autograd function and directly return the local_tensor if it's under no_grad context, this would avoid creating views Pull Request resolved: https://github.com/pytorch/pytorch/pull/128145 Approved by: https://github.com/awgu ghstack dependencies: #128112	2024-06-07 04:01:47 +00:00
Animesh Jain	747fc35ff5	[dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128158 Approved by: https://github.com/jansel ghstack dependencies: #128001, #126578	2024-06-07 03:50:33 +00:00
Aidyn-A	5e5bbdb35e	[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640 ) The first DDP bucket is always created with the size `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the one main bucket size if it was supplied by the user. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640 Approved by: https://github.com/wanchaol	2024-06-07 03:33:33 +00:00
Ke Wen	4d0ece8196	[pipelining] Consolidate chunk counting between stage and schedule (#127935 ) We used to have two backward chunk id counting systems, one at schedule level, the other at stage level. (Which makes safety dependent on the two advancing hand-in-hand.) This PR consolidates the counting system to the schedule side only, which would pass `mb_index` to the following stage calls: `forward_one_chunk` `backward_one_chunk` `get_bwd_send_ops` ... Pull Request resolved: https://github.com/pytorch/pytorch/pull/127935 Approved by: https://github.com/H-Huang	2024-06-07 03:33:18 +00:00
Brian Hirsh	476bfe6cce	fix torch.compile with triton kernels under inference_mode (#124489 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124489 Approved by: https://github.com/albanD	2024-06-07 03:29:37 +00:00
Pian Pawakapan	50155e825b	[export] provide refine function for automatically accepting dynamic shapes suggested fixes (#127436 ) Summary: Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes. Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the error message from a ConstraintViolationError message containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used. Example usage for the automatic dynamic shapes workflow: ``` # export, fail, parse & refine suggested fixes, re-export try: export(model, inps, dynamic_shapes=dynamic_shapes) except torch._dynamo.exc.UserError as exc: new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes) export(model, inps, dynamic_shapes=new_shapes) ``` For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅 Test Plan: test_export tests Differential Revision: D57409142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436 Approved by: https://github.com/avikchaudhuri	2024-06-07 03:29:06 +00:00
Mikayla Gawarecki	65aa16f968	Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814 )" (#128170 ) https://github.com/pytorch/pytorch/issues/128165 :( This reverts commit a7b1dd82ff3063894fc665ab0c424815231c10e6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128170 Approved by: https://github.com/drisspg, https://github.com/albanD	2024-06-07 01:44:14 +00:00
GdoongMathew	f99409903c	Documenting `torch.distributions.utils.clamp_probs` (#128136 ) Fixes https://github.com/pytorch/pytorch/issues/127889 This PR adds docstring to the `torch.distributions.utils.clamp_probs` function. Co-authored-by: Svetlana Karslioglu <svekars@meta.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128136 Approved by: https://github.com/janeyx99, https://github.com/svekars, https://github.com/malfet	2024-06-07 00:49:41 +00:00
Oguz Ulgen	740cd0559f	Filter non input symexprs from codecache guards (#128052 ) Summary: Dynamo lifts all symexprs that appear in the inputs to top level which means that we do not need to look at guards that contain symexprs that do not appear in the inputs. Prune them. Test Plan: added two new tests Differential Revision: D58200476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128052 Approved by: https://github.com/ezyang, https://github.com/masnesral	2024-06-07 00:48:49 +00:00
Kazuaki Ishizaki	117ab34891	Documenting the torch.utils.collect_env.get_pretty_env_info function (#128123 ) Fixes #127888 This PR adds docstring to the `torch.utils.collect_env.get_pretty_env_info` function Pull Request resolved: https://github.com/pytorch/pytorch/pull/128123 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-06-07 00:43:18 +00:00
Shunting Zhang	901226ae83	[inductor] simplify indexing (#127661 ) This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002 We found that the cause of bad perf for the int8_unpack kernel is sub-optimal indexing. In this PR we introduce 2 indexing optimizations: 1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`. 2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b. With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us). Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661 Approved by: https://github.com/jansel	2024-06-06 23:57:45 +00:00
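The two rewrites named in the #127661 entry above are plain integer identities, which can be checked by brute force. The sketch below is not inductor code: `modular_indexing` is a stand-in that assumes `ModularIndexing(x, divisor, modulus)` means `(x // divisor) % modulus`, and the helper names and value ranges are chosen only for illustration.

```python
def modular_indexing(x: int, divisor: int, modulus: int) -> int:
    # Assumed semantics of ModularIndexing(x, divisor, modulus).
    return (x // divisor) % modulus


def floordiv_rewrite_holds(x1: int, x2: int) -> bool:
    # Rewrite 1: x1 * 1024 + x2 // 2  ->  (x1 * 2048 + x2) // 2
    return x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2


def modular_merge_holds(x: int, a: int, b: int) -> bool:
    # Rewrite 2: nested ModularIndexing collapses when a is a multiple of b.
    nested = modular_indexing(modular_indexing(x, 1, a), 1, b)
    return nested == modular_indexing(x, 1, b)


# Exhaustively check small non-negative index values.
rewrite1_ok = all(
    floordiv_rewrite_holds(x1, x2) for x1 in range(16) for x2 in range(4096)
)
rewrite2_ok = all(
    modular_merge_holds(x, a, b)
    for b in range(1, 9)
    for a in range(b, 65, b)  # every a here is a multiple of b
    for x in range(512)
)

if __name__ == "__main__":
    print(rewrite1_ok, rewrite2_ok)
```

The second identity depends on the divisibility condition: with `a = 6` and `b = 4`, for example, `x = 7` gives `(7 % 6) % 4 = 1` but `7 % 4 = 3`.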
Animesh Jain	7ede78f9f5	[dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578 ) Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578 Approved by: https://github.com/jansel ghstack dependencies: #128001	2024-06-06 23:05:49 +00:00
Animesh Jain	e5b3387166	[dynamo] Bugfix for nn parameter construction (#128001 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128001 Approved by: https://github.com/jansel	2024-06-06 23:05:49 +00:00
brightonanc	6dfdce92ba	Fixed typos in the complex numbers portion of the autograd docs (#127948 ) This PR fixes several typos in the complex numbers section of the docs for autograd. Only documentation was altered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127948 Approved by: https://github.com/soulitzer	2024-06-06 22:47:04 +00:00
Jiashen Cao	56a3d276fe	Handle custom op during TorchScript to ExportedProgram conversion (#127580 ) #### Description Handle custom ops during TorchScript to ExportedProgram conversion ```python torch.library.define( "mylib::foo", "(Tensor x) -> Tensor", lib=lib, ) # PyTorch custom op implementation @torch.library.impl( "mylib::foo", "CompositeExplicitAutograd", lib=lib, ) def foo_impl(x): return x + x # Meta function of the custom op. @torch.library.impl_abstract( "mylib::foo", lib=lib, ) def foo_meta(x): return x + x class M(torch.nn.Module): def forward(self, x): return torch.ops.mylib.foo(x) ``` #### Test Plan * Add a test case where custom op is called and converted. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_custom_op` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127580 Approved by: https://github.com/angelayi	2024-06-06 22:06:51 +00:00
joncrall	80fa2778ed	Update types for verbose in lr_scheduler (#127943 ) I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool | str`. This PR adds a `| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943 Approved by: https://github.com/janeyx99	2024-06-06 21:59:22 +00:00
Chirag Pandya	0a761f0627	[RFC] Provide optional switches to _dump_nccl_trace (#127651 ) Summary: Data from PyTorch distributed is mostly useful during initial stages of model development. Provide options to reduce data sent/dumped. `_dump_nccl_trace` takes 3 optional switches. Default as before returns everything - `includeCollectives`: option to also include collectives: Default is True. - `includeStacktraces`: option to include stack traces in collectives. Default is True. - `onlyActive`: option to only send active collective work - i.e. not completed. Default is False (i.e. send everything) Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651 Approved by: https://github.com/wconstab	2024-06-06 21:59:09 +00:00
Eddie Yan	54fe2d0e89	[cuDNN][quantization] skip qlinear test in cuDNN v9.1.0 (#128166 ) #120006 only very recently unskipped this test 3 days ago so we don't consider it a blocker for cuDNNv9 for now CC @atalman Pull Request resolved: https://github.com/pytorch/pytorch/pull/128166 Approved by: https://github.com/atalman, https://github.com/nWEIdia	2024-06-06 21:43:29 +00:00
Andrea Frittoli	04272a0e12	Add docstring for the torch.ao.quantization.utils.get_combined_dict function (#128127 ) Fixes: #127906 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128127 Approved by: https://github.com/jerryzh168	2024-06-06 21:22:09 +00:00
Howard Huang	baaa914bf7	[small] test clean up (#128079 ) remove unnecessary line: https://github.com/pytorch/pytorch/issues/123733 add main so test can be run `python ...`: https://github.com/pytorch/pytorch/issues/124906 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128079 Approved by: https://github.com/awgu	2024-06-06 21:21:40 +00:00
Andrew M. James	9554300436	[inductor][codegen] Codegen constexpr globals and constexpr annotated globals correctly. (#126195 ) [Triton #3762](https://github.com/triton-lang/triton/pull/3762) disallows access to globals which are not `tl.constexpr`. Triton has always treated captured globals this way, but it now requires this to be explicit in user code. Updated codegen to make sure these variables are defined before writing the kernel source when compiling a user defined triton kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126195 Approved by: https://github.com/alexbaden, https://github.com/bertmaher	2024-06-06 20:50:11 +00:00
chilli	2184cdd291	Added memory budget to partitioner (#126320 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320 Approved by: https://github.com/shunting314	2024-06-06 20:32:29 +00:00
atalman	7e059b3c95	Add a call to validate docker images after build step is complete (#127768 ) Adds validation to docker images. As discussed here: https://github.com/pytorch/pytorch/issues/125879 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127768 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2024-06-06 20:25:39 +00:00
Masahiro Hiramori	e8670f6aea	[Dynamo][TVM] Support macOS and Linux/aarch64 platforms (#128124 ) Fixes #128122 With this fix, I've confirmed that the repro works on the platforms below. - macOS 14.5 (arm64) - Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120-tegra aarch64) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128124 Approved by: https://github.com/malfet	2024-06-06 19:47:11 +00:00
Eddie Yan	de4f8b9946	[BE]: Update cudnn to 9.1.0.70 (#123475 ) cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out... CC @Skylion007 @malfet Co-authored-by: Wei Wang <weiwan@nvidia.com> Co-authored-by: atalman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia, https://github.com/atalman	2024-06-06 18:45:22 +00:00
Catherine Lee	fba21edf5b	[CI] Ensure inductor/test_cpu_cpp_wrapper is actually run in inductor_cpp_wrapper_abi_compatible (#126717 ) `inductor/test_cpu_cpp_wrapper` is not actually being run in the `inductor_cpp_wrapper_abi_compatible` test config. The cpu device type gets removed in `d28868c7e8/torch/testing/_internal/common_device_type.py (L733)` so `d28868c7e8/test/inductor/test_cpu_cpp_wrapper.py (L396)` returns false. Feel free to make a PR with a different way to do this (a better RUN_CPU check?) Add a skip for a failing test. I am not equipped to fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126717 Approved by: https://github.com/ZainRizvi	2024-06-06 18:23:52 +00:00
Catherine Lee	936225d7b2	[mergebot] Fix pending unstable jobs being viewed as failed (#128080 ) https://github.com/pytorch/pytorch/pull/128038#issuecomment-2150802030 In the above, pending unstable jobs get put into the ok_failed_checks list, and because there are a lot of unstable jobs, it exceeds the threshold and merge fails. I don't think unstable jobs should be considered in the ok failed checks threshold, only flaky and broken trunk jobs should be considered there. Change looks big, but main thing is that unstable jobs don't get included in the check for how many flaky failures there are. The other changes are mostly renames so things are clearer Pull Request resolved: https://github.com/pytorch/pytorch/pull/128080 Approved by: https://github.com/huydhn	2024-06-06 18:22:20 +00:00
Andrew Gu	32fb68960e	[FSDP2] Added experimental warning to `unshard` API (#128138 ) There is still ongoing discussion on how this API should work. Current approach: - The pre-all-gather ops run in the default stream and the all-gather is called from the default stream with `async_op=True`. - Pros: - The all-gather input and output tensors are allocated in the default stream, so there is no increased memory fragmentation across stream pools. - There is no need for additional CUDA synchronization. The API is self-contained. - Cons: - The pre-all-gather ops (e.g. cast from fp32 -> bf16 and all-gather copy-in device copies) cannot overlap with other default stream compute. The biggest concern here is for CPU offloading, the H2D copies cannot overlap. Alternative approach: - Follow the default implicit prefetching approach, where the pre-all-gather ops and all-gather run in separate streams. - Pros: - The pre-all-gather ops can overlap with default stream compute. - Cons: - We require an API that should be called after the last optimizer step (namely, last op that modified sharded parameters) and before the first `unshard` call that has the all-gather streams wait for the default stream. The API is no longer self-contained and now has a complementary API. - The all-gather input and output tensors are allocated in separate streams (not the default stream), so there can be increased memory fragmentation across pools. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128138 Approved by: https://github.com/wanchaol ghstack dependencies: #128100	2024-06-06 18:18:42 +00:00
laithsakka	78a6b0c479	update test_reformer_train test to handle nn module inlining (#127467 ) number of call nodes increase due to inlining before inlining: ``` class GraphModule(torch.nn.Module): def forward(self, function_ctx, cat: "f32[1, s0, 512]"): # No stacktrace found for following nodes _set_grad_enabled = torch._C._set_grad_enabled(False) # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:283 in backward, code: grad_attn_output, grad_hidden_states = torch.chunk( chunk = torch.chunk(cat, 2, dim = -1); cat = None getitem: "f32[1, s0, 256]" = chunk[0] getitem_1: "f32[1, s0, 256]" = chunk[1]; chunk = None # No stacktrace found for following nodes _set_grad_enabled_1 = torch._C._set_grad_enabled(True) return (getitem_1, None) ``` after inlining: ``` class GraphModule(torch.nn.Module): def forward(self, s0: "Sym(s0)", L_hidden_states_: "f32[1, s0, 256]", L_self_layers_0_weight: "f32[256, 256]", L_self_layers_0_bias: "f32[256]", L_self_layer_norm_weight: "f32[512]", L_self_layer_norm_bias: "f32[512]", L_self_layer_norm_normalized_shape_0_: "Sym(512)"): l_hidden_states_ = L_hidden_states_ l_self_layers_0_weight = L_self_layers_0_weight l_self_layers_0_bias = L_self_layers_0_bias l_self_layer_norm_weight = L_self_layer_norm_weight l_self_layer_norm_bias = L_self_layer_norm_bias l_self_layer_norm_normalized_shape_0_ = L_self_layer_norm_normalized_shape_0_ # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:332 in forward, code: hidden_states = torch.cat([hidden_states, hidden_states], dim=-1) hidden_states: "f32[1, s0, 512]" = torch.cat([l_hidden_states_, l_hidden_states_], dim = -1); l_hidden_states_ = None # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:333 in forward, code: hidden_states = _ReversibleFunction.apply( function_ctx = torch.autograd.function.FunctionCtx() # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:258 in forward, code: hidden_states, attn_output = 
torch.chunk(hidden_states, 2, dim=-1) chunk = torch.chunk(hidden_states, 2, dim = -1); hidden_states = None hidden_states_1: "f32[1, s0, 256]" = chunk[0] attn_output: "f32[1, s0, 256]" = chunk[1]; chunk = None # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias) attn_output_1: "f32[1, s0, 256]" = torch._C._nn.linear(attn_output, l_self_layers_0_weight, l_self_layers_0_bias); attn_output = l_self_layers_0_weight = l_self_layers_0_bias = None # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:272 in forward, code: ctx.save_for_backward(attn_output.detach(), hidden_states.detach()) detach: "f32[1, s0, 256]" = attn_output_1.detach() detach_1: "f32[1, s0, 256]" = hidden_states_1.detach() # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:279 in forward, code: return torch.cat([attn_output, hidden_states], dim=-1) hidden_states_2: "f32[1, s0, 512]" = torch.cat([attn_output_1, hidden_states_1], dim = -1); attn_output_1 = hidden_states_1 = None # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/normalization.py:201 in forward, code: return F.layer_norm( hidden_states_3: "f32[1, s0, 512]" = torch.nn.functional.layer_norm(hidden_states_2, (l_self_layer_norm_normalized_shape_0_,), l_self_layer_norm_weight, l_self_layer_norm_bias, 1e-12); hidden_states_2 = l_self_layer_norm_normalized_shape_0_ = l_self_layer_norm_weight = l_self_layer_norm_bias = None # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:352 in forward, code: hidden_states = torch.nn.functional.dropout( hidden_states_4: "f32[1, s0, 512]" = torch.nn.functional.dropout(hidden_states_3, p = 0.5, training = True); hidden_states_3 = None return (hidden_states_4,) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127467 Approved by: https://github.com/anijain2305 ghstack dependencies: #126444, #127146, #127424, #127440	2024-06-06 17:56:36 +00:00
Yu, Guangye	304956e1fb	Switch to torch.float16 on XPU AMP mode (#127741 ) # Motivation Previously, the default dtype for AMP on XPU was aligned with the CPU. To align with other GPUs, we intend to change the default dtype for AMP to `torch.float16`. This change aims to save users the effort of converting models from `torch.float16` to `torch.bfloat16`, or vice versa when they want to run the model on different types of GPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127741 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-06-06 17:40:13 +00:00
Yifu Wang	1d0c1087dd	Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599 Approved by: https://github.com/wanchaol ghstack dependencies: #126598	2024-06-06 17:18:12 +00:00
Pritam Damania	e9c5144cbc	Fix bug in update_process_group DDP API (#128092 ) Fix bug in `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...` Added a unit test as well that reproduced the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092 Approved by: https://github.com/awgu, https://github.com/fegin	2024-06-06 17:10:42 +00:00
albanD	2ffdf556ea	Add back API that some people rely on in torch.cuda.amp.grad_scaler namespace (#128056 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128056 Approved by: https://github.com/kit1980, https://github.com/eqy	2024-06-06 17:02:32 +00:00
Aaron Gokaslan	2d47385f0f	[BE]: Enable ruff TCH rules and autofixes for better imports (#127688 ) Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules, which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future. This will make the initial PyTorch import faster and will reduce cyclic dependencies. Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688 Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet	2024-06-06 16:55:58 +00:00
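For readers unfamiliar with the pattern that the #127688 entry above refers to, here is a minimal sketch of an import that is needed only for a type hint being placed under `TYPE_CHECKING`. The function `total_length` and the choice of `Sequence` are illustrative assumptions, not code from the PR.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by static type checkers only; skipped at runtime, so the
    # import costs nothing on startup and cannot create an import cycle.
    from collections.abc import Sequence


def total_length(items: "Sequence[str]") -> int:
    # The annotation is a string, so the name is never looked up at runtime.
    return sum(len(item) for item in items)


if __name__ == "__main__":
    print(total_length(["ab", "c"]))
```

The trade-off is that the guarded name is unavailable at runtime, so it can appear only in annotations that are quoted or otherwise not evaluated.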
Wanchao Liang	4f87f47ea1	[dtensor] reuse DTensorSpec as much as possible (#128112 ) As titled: given that our DTensorSpec is immutable, we can always reuse the spec if the input/output have the same tensor metadata. This helps in two ways: 1. We don't need to re-calculate the hash every time we produce a DTensorSpec, which reduces runtime operator overhead. 2. It reduces the DTensor construction overhead. A local benchmark on an 800-parameter clip_grad_norm shows that for foreach_norm the CPU overhead drops from 11ms -> 7.8ms (around 30% improvement) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128112 Approved by: https://github.com/awgu	2024-06-06 16:55:50 +00:00
Edward Z. Yang	f0dd11df55	Make ValueRange repr less chatty by default (#128043 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128043 Approved by: https://github.com/lezcano	2024-06-06 16:42:48 +00:00
eqy	0de6d2427f	Bump tolerances for `inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda` attempt 2 (#128048 ) CC @nWEIdia @huydhn @Skylion007 Same thing but also bump backward tolerances... Pull Request resolved: https://github.com/pytorch/pytorch/pull/128048 Approved by: https://github.com/Skylion007	2024-06-06 16:17:43 +00:00
PyTorch MergeBot	a5b86a1ec0	Revert "FP8 rowwise scaling (#125204 )" This reverts commit 5dc912822913b3d90f4938891c7eca722a057cf1. Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))	2024-06-06 16:12:34 +00:00
Joona Havukainen	a5ba9b2858	Fix for addcdiv contiguous problem (#124442 ) Fixes issue number #118115 Co-authored-by: Siddharth Kotapati <skotapati@apple.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442 Approved by: https://github.com/kulinseth	2024-06-06 16:09:18 +00:00
PyTorch MergeBot	c58d3af3b4	Revert "Add OpInfo entry for alias_copy (#127232 )" This reverts commit 457df212e1c6e1aa4f1eb2ad6ee292052d7c07e1. Reverted https://github.com/pytorch/pytorch/pull/127232 on behalf of https://github.com/clee2000 due to broke [onnx](https://github.com/pytorch/pytorch/actions/runs/9397057801/job/25880181144) and [mps](https://github.com/pytorch/pytorch/actions/runs/9397057805/job/25879818705) tests, [hud link](`457df212e1`) , base is 15 days old, the onnx test xfailed on the pr but the xfail was removed so if you rebase itll surface, mps build failed so no mps tests were run on the pr ([comment](https://github.com/pytorch/pytorch/pull/127232#issuecomment-2152848758))	2024-06-06 15:44:47 +00:00
Jithun Nair	9d849d4312	Disable py3.12 nightly wheel builds for ROCm (#127968 ) Triton commit bump PR https://github.com/pytorch/pytorch/pull/125396 reverted due to missing llnl-hatchet dependency for triton. Workaround is to disable py3.12 binary build jobs for ROCm on PyTorch CI until llnl-hatchet publishes py3.12 wheels on [PyPI](https://pypi.org/project/llnl-hatchet/#files) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127968 Approved by: https://github.com/atalman, https://github.com/pruthvistony	2024-06-06 15:17:35 +00:00
PyTorch MergeBot	48a54146e7	Revert "[dynamo] Support ndarray.dtype attribute access (#124490 )" This reverts commit 4adee71155bec4e419bac32be2cbc1763bc6c98f. Reverted https://github.com/pytorch/pytorch/pull/124490 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/124490#issuecomment-2152664749))	2024-06-06 14:21:29 +00:00
Hengwen Tong	f08fd8e9e3	Remove redundant device guard in Resize.h (#126498 ) In https://github.com/pytorch/pytorch/pull/113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30). The newly inserted device guard has a clearer and more confined scope, and it's hard to tell the exact purpose and scope of the [old device guard](`78ffe49a3f/aten/src/ATen/native/cuda/Resize.h (L41)`). Removing the old guard has a negligible positive performance impact and makes the code more understandable. Thanks Pull Request resolved: https://github.com/pytorch/pytorch/pull/126498 Approved by: https://github.com/eqy, https://github.com/lezcano	2024-06-06 13:01:42 +00:00
Xuehai Pan	c97e3ebb96	Fix wrongly exposed variables in `torch/__init__.py` (#127795 ) <img width="609" alt="image" src="https://github.com/pytorch/pytorch/assets/16078332/964c6707-1856-4c2c-8cd8-ce1d96d38d36"> This PR removes temporary variables in `torch/__init__.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127795 Approved by: https://github.com/albanD	2024-06-06 08:31:41 +00:00
Tom Ritchford	457df212e1	Add OpInfo entry for alias_copy (#127232 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127232 Approved by: https://github.com/lezcano	2024-06-06 07:46:26 +00:00
Michael Lazos	f5328542b5	Allow multiple cudagraph recordings per compiled graph (#126822 ) ### Introduction/Problem Today when dynamo traces a builtin nn module (nn.Linear for example) it will specially handle parameters of that module by storing them as constant attributes of the graph. This requires that dynamo guard on the ID of the NNModule because if the instance of the module changes, we need to retrace and recollect the new parameters as attributes of the graph. This creates a 1:1 compiled graph to cudagraph relationship. With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and critically, if there are multiple identical layers in a model, we only need to compile one of those layers once, and reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling. Since the parameters could now possibly change across calls to the compiled artifact, these need to be inputs to the graph instead of attributes. This introduces a problem for cudagraphs - previously cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now since the compiled artifact needs to be agnostic to the actual instance of the NN module being used these parameter memory locations may vary. Previously cudagraphs simply copies varying inputs to cudagraph owned memory, but since the parameters are quite large, this is catastrophic for performance. ### Solution To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters are propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded vs previously where this would be an error (because this previously should not happen). This enables a 1:many compiled graph to cudagraph relationship. 
Across similar modules we will re-record cudagraphs and dispatch the correct graph if parameter pointers match when the cudagraph is executed. ### Next steps (if needed) It is theoretically possible that a user passes Parameters that change frequently as inputs to model code - if this is a common issue this design allows for dynamo to pass metadata indicating which parameters were created in a builtin NN Module context to only permit those parameters to have the multi-cudagraph behavior, but this PR does not implement this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822 Approved by: https://github.com/eellison ghstack dependencies: #126820, #126821	2024-06-06 06:39:59 +00:00
Michael Lazos	5a3bea1e88	Remove unused arg to GraphLowering (#126821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126821 Approved by: https://github.com/eellison ghstack dependencies: #126820	2024-06-06 06:39:59 +00:00
Michael Lazos	70ba6f0ab6	Collect static parameter metadata in aot (#126820 ) Collect the indices of the static parameters to pass down to cudagraphs in order to re-record if necessary. This location was chosen in order to allow us to restrict this (if needed) in the future by setting metadata in dynamo. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126820 Approved by: https://github.com/bdhirsh	2024-06-06 06:39:50 +00:00
Andrew Gu	c8ff1cd387	[FSDP2] Changed `test_register_forward_method` to use multiprocess test (#128100 ) The test seems to be flaky due to multi-threaded process group. This PR converts the test to use normal multi-process `ProcessGroupNCCL` to fix the flakiness. This PR closes https://github.com/pytorch/pytorch/issues/126851. Interestingly, the original MTPG version passes for me on devgpu. Either way, the new version also passes on devgpu, so we can see in CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128100 Approved by: https://github.com/weifengpy	2024-06-06 06:34:02 +00:00
Michael Lazos	638f543ac2	Enable single nadam test (#128087 ) https://github.com/pytorch/pytorch/issues/117150 has been fixed Pull Request resolved: https://github.com/pytorch/pytorch/pull/128087 Approved by: https://github.com/xmfan	2024-06-06 06:25:00 +00:00
Jiashen Cao	cd42b95047	Handle aten::__contains__ during TorchScript to ExportedProgram conversion (#127544 ) #### Description Add support for converting `prim::__contains__` from TorchScript IR to ExportedProgram, e.g., ```python class MIn(torch.nn.Module): def forward(self, x: torch.Tensor): return x.dtype in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] ``` #### Test Plan * Add test cases to cover both contains IR resulted from primitive types or Tensor. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_contains` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127544 Approved by: https://github.com/angelayi	2024-06-06 05:00:13 +00:00
cyy	68eb771265	[2/N] Remove unused test functions (#128005 ) Following #127881, this PR continues to remove unused test functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128005 Approved by: https://github.com/ezyang	2024-06-06 03:41:32 +00:00
Edward Z. Yang	2f7cfecd86	Complete revamp of float/promotion sympy handling (#126905 ) At a high level, the idea behind this PR is: * Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.) * Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers. The story begins in torch/utils/_sympy/functions.py. Here, I make some changes to how we represent certain operations in sympy expressions: * FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing). * ModularIndexing, LShift, RShift now assert they are given integer inputs. * Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver * TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2*53 beyond what first coercing the integer to floats and then doing true division. Trunc is split to TruncToFloat and TruncToInt. * Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result. 
* RoundDecimal updated to consistently only ever return a float * Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing) In torch/__init__.py, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information. We also need to introduce some new op handlers in torch/_inductor/ops_handler.py: * `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy * `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv` These changes have consequences. First, we need to make some administrative changes: * Actually wire up these Sympy functions from SymInt/SymFloat in torch/fx/experimental/sym_node.py, including the new promotion rules (promote2) * Add support for new Sympy functions in torch/utils/_sympy/interp.py, torch/utils/_sympy/reference.py * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here * Add printer support for the Sympy functions in torch/_inductor/codegen/common.py, torch/_inductor/codegen/cpp_utils.py, torch/_inductor/codegen/triton.py. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. 
TODO: The additions here are not exhaustive yet * Update ValueRanges logic to use new sympy functions in torch/utils/_sympy/value_ranges.py. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions. In torch/fx/experimental/symbolic_shapes.py we need to make some symbolic reasoning adjustments: * Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now * `_assert_bound_is_rational` is no more, we no longer generate rational bounds * Don't intersect non-int value ranges with the `int_range` * Support more sympy Functions for guard SYMPY_INTERP * Assert the type of value range is consistent with the variable type The new asserts uncovered necessary bug fixes: * torch/_inductor/codegen/cpp.py, torch/_inductor/select_algorithm.py, torch/_inductor/sizevars.py - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions * torch/_inductor/utils.py - make sure you actually pass in sympy.Expr to these functions * torch/_inductor/ir.py - make_contiguous_strides_for takes int/SymInt, not sympy.Expr! * torch/export/dynamic_shapes.py - don't use infinity to represent int ranges, instead use sys.maxsize - 1 Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at test/test_proxy_tensor.py Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905 Approved by: https://github.com/xadupre, https://github.com/lezcano	2024-06-06 02:29:45 +00:00
Janani Sriram	c1a43a69e4	[NestedTensor] Add error checks for unbind operator coverage when ragged_idx != 1 (#128058 ) Summary: Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`: - The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension. Test Plan: Added the following unit tests: `test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` Reviewed By: davidberard98 Differential Revision: D57989082 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058 Approved by: https://github.com/davidberard98	2024-06-06 01:56:12 +00:00
PyTorch MergeBot	9795c4224b	Revert "[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640 )" This reverts commit e98662bed99df57b7d79f9fc1cbe670afc303235. Reverted https://github.com/pytorch/pytorch/pull/121640 on behalf of https://github.com/clee2000 due to Sorry but it looks like you're failing `distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op `. The build failed so the tests didn't run, consider rebasing, there have been a couple of PRs lately related to cudnn so you probably are either based on a bad or too old of a commit `e98662bed9` https://github.com/pytorch/pytorch/actions/runs/9392731942/job/25868060913 ([comment](https://github.com/pytorch/pytorch/pull/121640#issuecomment-2151258585))	2024-06-06 01:50:18 +00:00
sdp	b4a0161449	Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390 ) Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase. ------- As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU. This PR enables XPU build on Windows as the first step of #126719: - Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows. - Build oneDNN GPU library on Windows. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang	2024-06-06 01:41:06 +00:00
Kazuaki Ishizaki	6adcf21b2b	Documenting the torch.cuda.nccl.version function (#128022 ) Fixes #127892 This PR adds docstring to the torch.cuda.nccl.version function Pull Request resolved: https://github.com/pytorch/pytorch/pull/128022 Approved by: https://github.com/malfet	2024-06-06 01:13:07 +00:00
Edward Z. Yang	bf2c05352e	Make length == stop size oblivious too (#128050 ) This doesn't do anything right now (need some other PRs to activate) but since it edits a header file it would be better to land this earlier. Context: https://github.com/pytorch/pytorch/pull/127693 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128050 Approved by: https://github.com/Skylion007, https://github.com/lezcano	2024-06-06 01:09:37 +00:00
Adam J. Stewart	80d34217c6	Typo fixes: et al. (#127811 ) "et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811 Approved by: https://github.com/janeyx99	2024-06-06 01:03:25 +00:00
Edward Z. Yang	d3ad84c38f	Use pexpr, not texpr in Triton launch codegen (#128038 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128038 Approved by: https://github.com/Skylion007	2024-06-06 00:45:59 +00:00
albanD	8bcebc8dae	Add runtime dependency on setuptools for cpp_extensions (#127921 ) As per title, since this was removed from the builtin python binary in 3.12 and we use it in `torch.utils.cpp_extension.*`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127921 Approved by: https://github.com/Skylion007	2024-06-05 23:59:38 +00:00
cyy	2fd75667b4	[Caffe2]Remove Caffe2 scripts and benchmarks (#126747 ) Due to removal of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126747 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-06-05 23:46:31 +00:00
Aidyn-A	e98662bed9	[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640 ) The first DDP bucket is always created with the size of `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to set `bucket_cap_mb` as the one main bucket size if it was supplied by the user. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640 Approved by: https://github.com/wanchaol	2024-06-05 23:44:54 +00:00
Tristan Rice	ffaea656b5	WorkerServer: add support for binding to TCP (#127986 ) This adds support for the WorkerServer binding to TCP as well as the existing unix socket support. ```py server = _WorkerServer("", 1234) ``` Test plan: Added unit test ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127986 Approved by: https://github.com/c-p-i-o	2024-06-05 22:56:32 +00:00
Xuehai Pan	a7c596870d	[BE][Eazy] remove `torch.torch.xxx` usages (#127800 ) NB: `torch` is exposed in `torch/__init__.py`. So there can be `torch.torch.torch.xxx`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800 Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet	2024-06-05 21:53:49 +00:00
titaiwangms	4123323eff	[ONNX] Single function for torch.onnx.export and torch.onnx.dynamo_export (#127974 ) Add `dynamo: bool = True` as a switch in `torch.onnx.export` to provide users an option to try `torch.onnx.dynamo_export`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127974 Approved by: https://github.com/justinchuby	2024-06-05 21:27:46 +00:00
Catherine Lee	01694eaa56	Move cuda 12.4 jobs to periodic for both pull and inductor (#127825 ) Moves 12.4 sm86/a10g jobs in pull to trunk Moves 12.4 cuda non sm86 jobs to periodic Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm which seems to give important signal There has been a lot of queueing for cuda runners due to the addition of jobs for cuda 12.4, so move those jobs to other workflows that are run less often Co-authored-by: Andrey Talman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127825 Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet	2024-06-05 21:01:36 +00:00
Animesh Jain	8184cd85fc	[fake tensor] Set _is_param for base fake tensors for views (#127823 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127823 Approved by: https://github.com/eellison, https://github.com/ezyang ghstack dependencies: #127972	2024-06-05 20:26:52 +00:00
Animesh Jain	626dc934d1	[dynamo][pippy] Hotfix for nn_module_stack for pippy usecase (#127972 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127972 Approved by: https://github.com/ydwu4	2024-06-05 20:14:50 +00:00
rk7697	72e863df27	Update _learnable_fake_quantize.py (#127993 ) Remove sentence "For literature references, please see the class _LearnableFakeQuantizePerTensorOp." and add "s" to "support" (Possibly) Fixes #99107 (But not sure, sorry) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127993 Approved by: https://github.com/jerryzh168	2024-06-05 20:02:33 +00:00
atalman	6e545392cd	Move nongpu workflows from trunk to periodic (#128049 ) We don't need to run them on every PR. These are used to test for graceful degradation of GPU. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128049 Approved by: https://github.com/clee2000	2024-06-05 18:31:26 +00:00
rzou	6412c6060c	[reland] Refresh OpOverloadPacket if a new OpOverload gets added (#128000 ) If a user accesses an OpOverloadPacket, then creates a new OpOverload, then uses the OpOverloadPacket, the new OpOverload never gets hit. This is because OpOverloadPacket caches OpOverloads when it is constructed. This PR fixes the problem by "refreshing" the OpOverloadPacket if a new OpOverload gets constructed and the OpOverloadPacket exists. Test Plan: - new tests This is the third land attempt. The first one was reverted for breaking internal tests, the second was reverted for being erroneously suspected of causing a perf regression. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128000 Approved by: https://github.com/albanD	2024-06-05 17:57:09 +00:00
Chien-Chin Huang	bb68b54be0	[BE][ptd_fb_test][1/N] Enable testslide (#127512 ) This change enables Testslide, which gives us more readable output, import time, etc. The PR was previously stamped as https://github.com/pytorch/pytorch/pull/126460, but the old PR has some ghexport issue. Differential Revision: [D57919583](https://our.internmc.facebook.com/intern/diff/D57919583/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127512 Approved by: https://github.com/wz337, https://github.com/Skylion007	2024-06-05 17:45:15 +00:00
Arun Pa	3acbfd602e	Document torch.utils.collect_env.get_env_info function (#128021 ) Fixes #127911 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128021 Approved by: https://github.com/malfet	2024-06-05 17:44:47 +00:00
willfengg	6454e95824	[FSDP2] enable CI for torch.compile(root Transformer) (#127832 ) This CI shows that FSDP2 works with a `torch.compile` root model, since FSDP1 can do the same. Compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`. Compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832 Approved by: https://github.com/awgu	2024-06-05 17:29:46 +00:00
Andrew M. James	4adee71155	[dynamo] Support ndarray.dtype attribute access (#124490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124490 Approved by: https://github.com/lezcano ghstack dependencies: #125717	2024-06-05 17:20:01 +00:00
Chien-Chin Huang	a9cc147fa1	[DSD][FSDP1] Deprecate FSDP.state_dict_type and redirect users to DSD (#127794 ) Summary: As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/127794 Approved by: https://github.com/awgu ghstack dependencies: #127793	2024-06-05 16:55:05 +00:00
Peter Bell	9acc19f8da	[inductor] Take absolute value of strides when picking loop order (#127425 ) Fixes #126860 The stride hint is found by comparing the value of the indexing expression evaluated at `idx` set to all zeros and at `idx[dim] = 1`. This causes a problem for padded inputs where 0 and 1 are still in the padded region. In particular, for reflection padding this causes the stride to be negative. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127425 Approved by: https://github.com/lezcano	2024-06-05 16:48:22 +00:00
Chien-Chin Huang	22964d1007	[DSD] Deprecate submodules feature for DSD (#127793 ) Summary: Getting a partial state_dict and setting the state_dict with the type Dict[nn.Module, Dict[str, Any]] is too complicated and can confuse users. The feature can be achieved by simple pre-processing and post-processing by users. So this PR adds a deprecation warning to the feature. The previous PR, https://github.com/pytorch/pytorch/pull/127070, assumed no one was using the feature and removed it without a grace period. This seems to be too aggressive and caused some concerns. This PR adds the deprecation warning and tests. We will remove the support in 2.5. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127793 Approved by: https://github.com/LucasLLC	2024-06-05 16:31:29 +00:00
drisspg	5dc9128229	FP8 rowwise scaling (#125204 ) # Summary This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do this. Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204 Approved by: https://github.com/lw, https://github.com/malfet	2024-06-05 15:46:40 +00:00
Jiashen Cao	4f9fcd7156	Handle unpacking during TorchScript to ExportedProgram conversion (#127419 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127419 Approved by: https://github.com/angelayi	2024-06-05 15:27:13 +00:00
cyy	9f2c4b9342	Replace with standard type traits in torch/csrc (#127852 ) In preparation to clean up more type traits. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127852 Approved by: https://github.com/ezyang	2024-06-05 15:22:48 +00:00
cyy	3d617333e7	Simplify CMake code (#127683 ) Due to the recent adoption of find(python), it is possible to further simplify some CMake code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127683 Approved by: https://github.com/ezyang	2024-06-05 15:17:31 +00:00
cyy	df75a9dc80	Remove Caffe2/onnx (#127991 ) Remove Caffe2/onnx since it is not used. Other tiny fixes are also applied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127991 Approved by: https://github.com/ezyang	2024-06-05 15:10:12 +00:00
Nikita Shulga	d48c25c7d1	[BE] Fix missing-prototypes errors in Metal backend (#127994 ) By declaring a bunch of functions static. Removed `USE_PYTORCH_METAL` from the list of flags that suppress `-Werror=missing-prototypes`. This will prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/127942 from sneaking past CI, which builds PyTorch with Metal support. Use nested namespaces. Remove spurious semicolon after TORCH_LIBRARY declaration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127994 Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi	2024-06-05 14:58:19 +00:00
Huy Do	8992141dba	Restore MPS testing on MacOS 13 and m2 metal (#127853 ) The runners are ready now https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13, we want to keep some MacOS 13 runner for mps coverage until MacOS 15 is out. This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582. The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853 Approved by: https://github.com/malfet	2024-06-05 14:44:00 +00:00
Andrew M. James	879d01afcb	[dynamo][numpy] Add unsigned integer dtypes (#125717 ) We should support these to whatever extent we can. The corresponding `torch.uint<w>` types are defined, so I don't see an issue with generating the various casting rules and allowing them to trace. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125717 Approved by: https://github.com/lezcano	2024-06-05 14:33:47 +00:00
hippocookie	4ce5322a1f	Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165 ) Fixes some files in #123062 Run lintrunner on files: test_shape_ops.py test_show_pickle.py test_sort_and_select.py ```bash $ lintrunner --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165 Approved by: https://github.com/ezyang	2024-06-05 14:31:26 +00:00
Weizhuo Zhang	faabda4fc9	[Inductor] Skip model_fail_to_load and eager_fail_to_run models in inductor benchmarks test (#127210 ) Aligned with test-infra repo, we skipped `model_fail_to_load` and `eager_fail_to_run` models Refer code logic: `d3b79778f8/torchci/rockset/inductor/__sql/compilers_benchmark_performance.sql (L57-L58)` ```SQL WHERE filename LIKE '%_accuracy' AND filename LIKE CONCAT( '%_', : dtypes, '_', : mode, '_', : device, '_%' ) AND _event_time >= PARSE_DATETIME_ISO8601(:startTime) AND _event_time < PARSE_DATETIME_ISO8601(:stopTime) AND (workflow_id = :workflowId OR :workflowId = 0) AND accuracy != 'model_fail_to_load' AND accuracy != 'eager_fail_to_run' ), ``` Comp Item \| Compiler \| suite \| Before \| After fix -- \| -- \| -- \| -- \| -- Pass Rate \| Inductor \| torchbench \| 96%, 80/83 \| 100%, 80/80 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127210 Approved by: https://github.com/jansel	2024-06-05 14:23:09 +00:00
weiyusheng	c3949b20a1	Opt model save and load (#126374 ) ## save&load support for OptimizedModule [Issue Description](https://github.com/pytorch/pytorch/pull/101651) English is not my native language; please excuse typing errors. This pr is based on commit b9588101c4d3411b107fdc860acfa8a72c642f91\ I'll do something with the merge conflicts later ### test result for test/dynamo Conclusion:\ It performs the same as before as far as I can see. ENV(CPU only):\ platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.5.0\ configfile: pytest.ini\ plugins: anyio-3.7.1, cpp-2.3.0, flakefinder-1.1.0, xdist-3.3.1, xdoctest-1.1.0, metadata-3.1.1, html-4.1.1, hypothesis-5.35.1, rerunfailures-14.0 #### before this pr: [before](https://github.com/pytorch/pytorch/files/15329370/before.md) #### after this pr: [after](https://github.com/pytorch/pytorch/files/15329376/after.md) ### some changes 1. add test_save_and_load to test/dynamo/test_modules.py with & without "backend='inductor'" 2. add \_\_reduce\_\_ function to OptimizedModule and derived classes of _TorchDynamoContext for pickling & unpickling 3. change the wrappers into wrapper classes ( including convert_frame_assert, convert_frame, catch_errors_wrapper in torch/_dynamo/convert_frame.py & wrap_backend_debug in torch/_dynamo/repro/after_dynamo.py ) 4. change self.output.compiler_fn into innermost_fn(self.output.compiler_fn) in torch/_dynamo/symbolic_convert.py to get the origin compiler_fn and to avoid the "compiler_fn is not eager" condition Pull Request resolved: https://github.com/pytorch/pytorch/pull/126374 Approved by: https://github.com/msaroufim, https://github.com/jansel	2024-06-05 13:01:16 +00:00
PyTorch MergeBot	9a8ab778d3	Revert "[BE]: Update cudnn to 9.1.0.70 (#123475 )" This reverts commit c490046693e77e254664e19d940e9b05a1da18ef. Reverted https://github.com/pytorch/pytorch/pull/123475 on behalf of https://github.com/huydhn due to CUDA trunk jobs are pretty red after this change, and the forward fix https://github.com/pytorch/pytorch/pull/127984 does not look working ([comment](https://github.com/pytorch/pytorch/pull/123475#issuecomment-2149258430))	2024-06-05 08:59:53 +00:00
ibartol	bb2de3b101	Fixed broken link and removed unfinished sentence from issue #126367 (#127938 ) Fixes #126367. ## Description Fixed a broken link in the pytorch/docs/source/torch.compiler_faq.rst doc and deleted a few words that were extra according to the issue tagged above. ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecessary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/127938 Approved by: https://github.com/msaroufim	2024-06-05 07:37:32 +00:00
dan_the_3rd	4a384d813b	[SDPA/memeff] Backport changes from xFormers to PT (#127090 ) Backporting a few fixes from xFormers: * Bug fixes for local attention (which is not exposed in PT at the moment) * Massively reduced memory usage on the BW pass (see also https://github.com/facebookresearch/xformers/pull/1028) Essentially this will also make xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127090 Approved by: https://github.com/drisspg	2024-06-05 07:33:27 +00:00
cyy	b054470db2	Remove unused functions (#127881 ) Some unused functions detected by g++ warnings can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127881 Approved by: https://github.com/zou3519	2024-06-05 05:21:24 +00:00
Shuqiang Zhang	30788739f4	[c10d] add a simple test to demonstrate the user usage of collectives (#127665 ) Summary: While experimenting with the unit test, it seemed useful to add a simple example of a user function that can be used with different subclasses of _ControlCollectives, and to test that the user function executes correctly. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665 Approved by: https://github.com/d4l3k	2024-06-05 04:32:11 +00:00
Pian Pawakapan	e505132797	[export] track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS for export runtime asserts (#127554 ) Track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1 in export so it doesn't omit runtime asserts. Differential Revision: D57978699 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127554 Approved by: https://github.com/tugsbayasgalan	2024-06-05 04:16:54 +00:00
PyTorch MergeBot	d5cb5d623a	Revert "Complete revamp of float/promotion sympy handling (#126905 )" This reverts commit fb696ef3aa34e20c0fef1c0210a397abd3ea5885. Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))	2024-06-05 03:57:58 +00:00
Howard Huang	55a4ef80c4	[pipelining] test pipeline_order in schedule (#127559 ) Add a unit test to validate the pipeline order for different `num_stages`, `num_microbatches`, `num_world_size` combinations. This doesn't actually run the schedule but just validates that the ordering of processed microbatches is valid, and therefore doesn't require GPUs / multiple processes. Will add more combinations and negative tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127559 Approved by: https://github.com/wconstab ghstack dependencies: #127084, #127332	2024-06-05 03:51:27 +00:00
Nikita Shulga	71e684bfae	[BE][Mac] Add missing prototypes (#127988 ) Really confused how CI did not catch this one, but this triggers missing prototype errors if compiled from scratch on MacOS Sonoma using clang-15 Fixes https://github.com/pytorch/pytorch/issues/127942 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127988 Approved by: https://github.com/Skylion007, https://github.com/huydhn	2024-06-05 02:16:50 +00:00
cyy	ce4436944c	Fix IOS builds (#127985 ) IOS builds fail these days, fix it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127985 Approved by: https://github.com/ezyang	2024-06-05 02:14:43 +00:00
Mikayla Gawarecki	a135776307	Remove tensor subclass detection logic from weights_only unpickler (#127808 ) Remove logic to auto-detect and allow subclasses that did not override certain methods from the weights_only unpickler from https://github.com/pytorch/pytorch/pull/124331 for 2.4 release Subclasses should be loadable using `torch.serialization.add_safe_globals` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127808 Approved by: https://github.com/malfet	2024-06-05 02:14:30 +00:00
Feng Yuan	8e496046e5	Update torch-xpu-ops pin (ATen XPU implementation) (#127879 ) Support AMP GradScaler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127879 Approved by: https://github.com/EikanWang	2024-06-05 02:13:46 +00:00
Dmovic	6c07e2c930	fix redundant tensor (#127850 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127850 Approved by: https://github.com/mikaylagawarecki	2024-06-05 02:03:02 +00:00
Cory Modlin	8830b81208	[c10d] Add commCreateFromRanks to c10d (#127421 ) (#127982 ) This is a duplicate of: https://github.com/pytorch/pytorch/pull/127421 which we can't merge. its landed internally already Summary: `ncclCommCreateFromRanks` - described in this [document](https://docs.google.com/document/d/1QIRkAO4SAQ6eFBpxE51JmRKRAH2bwAHn8OIj69XuFqQ/edit#heading=h.5g71oqe3soez), replaces `ncclCommSplit` in NCCLX versions 2.21.5+. The difference is that `ncclCommCreateFromRanks` is given a list of active ranks and is collective only over those ranks as opposed to `ncclCommSplit` for which you give it a color for every rank including NO_COLOR for inactive ranks and the collective is over the entire world. This diff connects `ncclCommCreateFromRanks` to `c10d` `ncclCommSplit` will still be available at the NCCL API but, in this diff, is not used starting at version 2.21.5 Split the python test and implementation of `split()` for internal FB and external OSS builds. The diff defines `"USE_C10D_NCCL_FBCODE"` as a compiler option. When defined, we use the version of split in the newly created `NCCLUtils.cpp` in the `fb` directory. The `fb` directory is not shipit-ed to github. The same API is used for `split()` in both the `ncclx` and `nccl` versions adding `ranks` to the API. This argument is not used in the `nccl` version nor in the 2.18 `ncclx` version where `ncclCommSplit()` is used instead of `ncclCommCreateFromRanks()` in `ncclx` This diff was squashed with D57343946 - see D57343946 for additional review comments. Test Plan: for 2.18.3-1 and 2.21.5-1 versions: ``` buck2 run fbcode//mode/opt -c param.use_nccl=True -c fbcode.nvcc_arch=a100 -c hpc_comms.use_ncclx="$VERSION" -c fbcode.enable_gpu_sections=true fbcode//caffe2/test/distributed/fb:test_comm_split_subgroup_x ``` ``` BUILD SUCCEEDED ... 
ok ---------------------------------------------------------------------- Ran 1 test in 10.210s OK ~/scripts ``` OSS build: `[cmodlin@devgpu003.vll5 ~/fbsource/third-party/ncclx/v2.21.5-1 (e56338cfa)]$ ./maint/oss_build.sh` OSS build output: ``` ... ncclCommHash 197dce9b413e2775 nccl commDesc example_pg Dump from comm 0x4708aa0 rings: [[0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]] Dump from comm 0x4708aa0 commDesc: example_pg Dump from comm 0x4708aa0 nRanks: 1 Dump from comm 0x4708aa0 nNodes: 1 Dump from comm 0x4708aa0 node: 0 Dump from comm 0x4708aa0 localRanks: 1 Dump from comm 0x4708aa0 localRank: 0 Dump from comm 0x4708aa0 rank: 0 Dump from comm 0x4708aa0 commHash: "197dce9b413e2775" 2024-05-24T09:02:54.385543 devgpu003:3040664:3040744 [0][AsyncJob]ctran/backends/ib/CtranIb.cc:143 NCCL WARN CTRAN-IB : No active device found. 2024-05-24T09:02:54.385607 devgpu003:3040664:3040744 [0][AsyncJob]ctran/mapper/CtranMapper.cc:187 NCCL WARN CTRAN: IB backend not enabled Created NCCL_SPLIT_TYPE_NODE type splitComm 0x11c76d0, rank 0 ~/fbsource/third-party/ncclx/v2.21.5-1 ``` Reviewed By: wconstab, wesbland Differential Revision: D56907877 Fixes #ISSUE_NUMBER Co-authored-by: Cory Modlin <cmodlin@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127982 Approved by: https://github.com/izaitsevfb	2024-06-05 00:19:52 +00:00
Howard Huang	7fdfb88f03	[pipelining] rewrite interleaved 1f1b (#127332 ) ## Context Interleaved 1F1B has multiple points in the schedule where communication is both criss-crossed across ranks leading to hangs due to 1. looped nature of schedules, 2. batched nature of forward + backward in 1f1b phase. <img width="1370" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/a07c2b1d-8a99-420b-9ba3-32a0115d228b"> In the current implementation, it is difficult to fix these hangs since it requires `dist.recv` from a prior point in time, but each rank operates on its own step schedule and does not have knowledge of other ranks operations to perform the `recv` prior to their own `send`. ## New implementation The new implementation is split into 2 parts: 1. Creating the pipeline order. Each rank will create the timestep normalized ordering of all schedule actions across all ranks. This is created once during the initialization of the schedule class. The timestep between each rank is normalized as each rank can only have 1 computation action (forward or backward) during that timestep. <img width="1065" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/196f2347-7ff4-49cf-903b-d8db97d1156f"> 3. Executing the pipeline order. Once the pipeline order is determined, execution is simple because as each rank will perform its send to its peer (based on whether they did forward and backward). Now that each rank has a global understanding of the schedule, they can check their previous and next neighbor ranks to see if they need to recv any activations/gradients from them. Therefore, during execution, each rank is aligned and executing the same time step. ## Benefits - Implementation is faster since 1f1b computation can now be split up in two time steps, 1 for forward and 1 for backward. 
- Debugging is easier since we can now determine which timestep each rank is hung on - Testing is easier since we can just validate the pipeline order, without running the schedule. This allows us to test on large amount of ranks without actually needing the GPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127332 Approved by: https://github.com/wconstab ghstack dependencies: #127084	2024-06-04 23:46:05 +00:00
Shunting Zhang	1f67cfd437	[inductor] raise tolerance for cspdarknet (#127949 ) cspdarknet was previously flaky, but after https://github.com/pytorch/pytorch/pull/127367 it fails consistently. This is probably due to a small numerical change from that PR, which makes inductor generate different code because of different loop orders. Raise tolerance to pass CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127949 Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/eqy	2024-06-04 23:28:20 +00:00
PyTorch MergeBot	907cb28f67	Revert "Inductor: Allow small sizes of m for mixed mm autotuning (#127663 )" This reverts commit d8d0bf264a736c7fb3cd17799a1c1aba4addf8d9. Reverted https://github.com/pytorch/pytorch/pull/127663 on behalf of https://github.com/soulitzer due to breaks torch ao CI, see: https://github.com/pytorch/pytorch/issues/127924 ([comment](https://github.com/pytorch/pytorch/pull/127663#issuecomment-2148554128))	2024-06-04 23:06:43 +00:00
Jiashen Cao	f4b05ce683	Add registry for TorchScript to ExportedProgram conversion (#127464 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127464 Approved by: https://github.com/ydwu4, https://github.com/angelayi	2024-06-04 22:53:00 +00:00
rzou	0eb9ec958a	Revert "Inductor respects strides for custom ops by default (#126986 )" (#127923 ) This reverts commit dd64ca2a02434944ecbc8f3e186d44ba81e3cb26. There's a silent incorrectness bug with needs_fixed_stride_order=True and mutable custom ops, so it's better to flip the default back to avoid silent incorrectness. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127923 Approved by: https://github.com/williamwen42	2024-06-04 22:25:45 +00:00
Svetlana Karslioglu	20f966a8e0	Ignore undocumented PipelineSchedule.step (#127955 ) Ignore undocumented PipelineSchedule.step to fix doc build: https://github.com/pytorch/pytorch/actions/runs/9372492435/job/25805861083?pr=127938#step:11:1284 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127955 Approved by: https://github.com/kit1980	2024-06-04 22:11:09 +00:00
Mikayla Gawarecki	a7b1dd82ff	Default XLA to use swap_tensors path in nn.Module._apply (#126814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814 Approved by: https://github.com/JackCaoG, https://github.com/albanD ghstack dependencies: #127313	2024-06-04 21:40:49 +00:00
Ting Lu	1b704a160f	Add linker script optimization flag to CMAKE rule for CUDA ARM wheel (#127514 ) Original PR - https://github.com/pytorch/pytorch/pull/127220 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127514 Approved by: https://github.com/Aidyn-A, https://github.com/atalman	2024-06-04 20:51:44 +00:00
PyTorch MergeBot	6dc0a291b9	Revert "[dynamo] Bugfix for nn parameter construction (#127806 )" This reverts commit f27c4dd862bf79f37019ef277957cd577d57b66f. Reverted https://github.com/pytorch/pytorch/pull/127806 on behalf of https://github.com/PaliC due to causing nn tests to fail ([comment](https://github.com/pytorch/pytorch/pull/127806#issuecomment-2148393903))	2024-06-04 20:51:41 +00:00
Tristan Rice	597922ba21	Reapply "distributed debug handlers (#126601 )" (#127805 ) This reverts commit 7646825c3eb687030c4f873b01312be0eed80174. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805 Approved by: https://github.com/PaliC	2024-06-04 19:44:30 +00:00
Anshul Sinha	e76b28c765	[dtensor][debug] added c10d alltoall_ and alltoall_base_ to CommDebugMode (#127360 ) Summary Added c10d alltoall_ and alltoall_base tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127360 Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang ghstack dependencies: #127358	2024-06-04 18:29:48 +00:00
Anshul Sinha	01e6d1cae4	[dtensor][debug] added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing_ to CommDebugMode (#127358 ) Summary Added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127358 Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang	2024-06-04 18:29:48 +00:00
PyTorch MergeBot	9a25ff77af	Revert "[inductor] Enable subprocess-based parallel compile as the default (#126817 )" This reverts commit cf77e7dd9770caf65e898ac2ee82045aa0408e30. Reverted https://github.com/pytorch/pytorch/pull/126817 on behalf of https://github.com/huydhn due to There are lots of flaky inductor failure showing up in trunk after this commit `cf77e7dd97`, so I am trying to revert this to see if this helps ([comment](https://github.com/pytorch/pytorch/pull/126817#issuecomment-2148143502))	2024-06-04 18:26:12 +00:00
Animesh Jain	f27c4dd862	[dynamo] Bugfix for nn parameter construction (#127806 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127806 Approved by: https://github.com/jansel ghstack dependencies: #127785, #127802	2024-06-04 18:25:46 +00:00
Animesh Jain	569c5e72e7	[dynamo] Unspec nn module when global backward hooks are present (#127802 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127802 Approved by: https://github.com/jansel ghstack dependencies: #127785	2024-06-04 18:25:46 +00:00
Animesh Jain	c7e936a56a	[dynamo] Tensorvariable - track grad with _grad field (#127785 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127785 Approved by: https://github.com/jansel	2024-06-04 18:25:46 +00:00
Shan19900305	3bcc3cddb5	Using scalarType instead string in function _group_tensors_by_device_and_dtype. (#127869 ) Now that torch.dtype can pass through pybind11, modify function _group_tensors_by_device_and_dtype to use the scalar type. This avoids converting between torch.dtype and string on the Python and C++ sides. @ezyang @bdhirsh Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869 Approved by: https://github.com/ezyang	2024-06-04 18:19:33 +00:00
PyTorch MergeBot	0ff60236ab	Revert "Retire torch.distributed.pipeline (#127354 )" This reverts commit b9c058c203ee38032594f898f27cd8404f113a63. Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit `b9c058c203` ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))	2024-06-04 18:19:31 +00:00
chuanqiw	627d2cd87d	[CI] disable td for xpu ci test by default (#127611 ) Because td was enabled by default for the xpu CI tests, many test cases (75%) were skipped in CI. This let some failures escape CI, for example issue #127539. This PR depends on PR #127595 having landed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127611 Approved by: https://github.com/etaf, https://github.com/atalman	2024-06-04 17:15:10 +00:00
Hu Niu	36e9b71613	Enable UFMT on test/test_jit_fuser_te.py (#127759 ) Part of #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127759 Approved by: https://github.com/ezyang	2024-06-04 16:56:03 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	ff32f6c93b	Use freshly traced jit-traced module to be used in export analysis (#127577 ) Summary: When we export an already traced module, it seems to modify some global state, causing the traced modules to fail to run. For now, we are only logging for test cases, so it is probably OK to trace a fresh copy for use in export. Test Plan: CI Differential Revision: D57983518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127577 Approved by: https://github.com/pianpwk	2024-06-04 16:54:23 +00:00
Eddie Yan	c490046693	[BE]: Update cudnn to 9.1.0.70 (#123475 ) cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out... CC @Skylion007 @malfet Co-authored-by: Wei Wang <weiwan@nvidia.com> Co-authored-by: atalman <atalman@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia	2024-06-04 16:33:06 +00:00
Aaron Orenstein	97ea2b5d83	documentation for pattern_matcher.py (#127459 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127459 Approved by: https://github.com/oulgen ghstack dependencies: #127457, #127458	2024-06-04 15:24:47 +00:00
Aaron Orenstein	7a60a75256	Add typing annotations to pattern_matcher.py (#127458 ) Turn on `mypy: disallow-untyped-defs` in pattern_matcher.py and fix the fallout. There are still a bunch of `type: ignore` annotations which should eventually be ironed out. In the process found a bug: #127457 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127458 Approved by: https://github.com/Skylion007 ghstack dependencies: #127457	2024-06-04 15:24:47 +00:00
Aaron Orenstein	9adfa143d7	fix post_grad pattern (#127457 ) The lowering pattern built by cuda_and_enabled_mixed_mm_and_not_int8() was using ListOf() incorrectly - ListOf() is meant to represent a single repeating pattern - but cuda_and_enabled_mixed_mm_and_not_int8() was passing two patterns - I think based on the comment it's trying to build a sequence which would be represented by an actual list, not ListOf(). The behavior of the existing pattern would be to pass the second pattern as the `partial` parameter of `ListOf` which is meant to be a boolean - so it's almost certainly not what was intended. I tried changing it to be what I thought was the intended behavior but then the resnet152 test failed accuracy - so I'm just preserving the existing behavior with the correct parameter types. Found when adding annotations to pattern_matcher.py (#127458) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127457 Approved by: https://github.com/oulgen	2024-06-04 15:24:41 +00:00
cyy	f8c6d43524	Concat namespaces and other fixes in torch/csrc/utils (#127833 ) It contains formatting and other minor fixes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127833 Approved by: https://github.com/ezyang	2024-06-04 15:12:45 +00:00
Valeriu	91461601b6	[TORCH_FA2_flash_api] Update total_q to the reshaped query 0th dimension (#127524 ) There is a difference (and a bug) between the TORCH_FA2_flash_api:mha_varlen_fwd and FA2_flash_api:mha_varlen_fwd at the query transposition (GQA) step. ``` at::Tensor temp_q = q; if (seqlenq_ngroups_swapped) { temp_q = q.reshape( ... ... } const int total_q = q.sizes()[0]; CHECK_SHAPE(temp_q, total_q, num_heads, head_size_og); ``` When doing query transposition we need to update total_q to the reshaped query 0th dimension, i.e.: ``` const int total_q = temp_q.sizes()[0]; ``` The original FA2_flash_api:mha_varlen_fwd does not introduce a new variable temp_q but overwrites the q value directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127524 Approved by: https://github.com/drisspg	2024-06-04 14:44:45 +00:00
IvanKobzarev	c209fbdc53	[inductor] Fix missing unbacked def for unbacked in input expr (#127770 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127770 Approved by: https://github.com/ezyang	2024-06-04 14:43:01 +00:00
cyy	059cae6176	[Caffe2] Remove Caffe2 proto and other files (#127655 ) Remove Caffe2 proto files altogether. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127655 Approved by: https://github.com/ezyang	2024-06-04 14:22:21 +00:00
PyTorch MergeBot	4c074a9b8b	Revert "[torchbind] always fakify script object by default in non-strict export (#127116 )" This reverts commit c27882ffa8c1c7e4cf8ebc6c2f879e5b6c8814ad. Reverted https://github.com/pytorch/pytorch/pull/127116 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/127116#issuecomment-2147459339))	2024-06-04 12:53:19 +00:00
Edward Z. Yang	fb696ef3aa	Complete revamp of float/promotion sympy handling (#126905 ) At a high level, the idea behind this PR is: * Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.) * Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers. The story begins in torch/utils/_sympy/functions.py. Here, I make some changes to how we represent certain operations in sympy expressions: * FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing). * ModularIndexing, LShift, RShift now assert they are given integer inputs. * Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver * TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2*53 beyond what first coercing the integer to floats and then doing true division. Trunc is split to TruncToFloat and TruncToInt. * Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result. 
* RoundDecimal updated to consistently only ever return a float * Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing) In torch/__init__.py, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information. We also need to introduce some new op handlers in torch/_inductor/ops_handler.py: * `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy * `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv` These changes have consequences. First, we need to make some administrative changes: * Actually wire up these Sympy functions from SymInt/SymFloat in torch/fx/experimental/sym_node.py, including the new promotion rules (promote2) * Add support for new Sympy functions in torch/utils/_sympy/interp.py, torch/utils/_sympy/reference.py * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here * Add printer support for the Sympy functions in torch/_inductor/codegen/common.py, torch/_inductor/codegen/cpp_utils.py, torch/_inductor/codegen/triton.py. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. 
TODO: The additions here are not exhaustive yet * Update ValueRanges logic to use new sympy functions in torch/utils/_sympy/value_ranges.py. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions. In torch/fx/experimental/symbolic_shapes.py we need to make some symbolic reasoning adjustments: * Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now * `_assert_bound_is_rational` is no more, we no longer generate rational bounds * Don't intersect non-int value ranges with the `int_range` * Support more sympy Functions for guard SYMPY_INTERP * Assert the type of value range is consistent with the variable type The new asserts uncovered necessary bug fixes: * torch/_inductor/codegen/cpp.py, torch/_inductor/select_algorithm.py, torch/_inductor/sizevars.py - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions * torch/_inductor/utils.py - make sure you actually pass in sympy.Expr to these functions * torch/_inductor/ir.py - make_contiguous_strides_for takes int/SymInt, not sympy.Expr! * torch/export/dynamic_shapes.py - don't use infinity to represent int ranges, instead use sys.maxsize - 1 Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at test/test_proxy_tensor.py Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905 Approved by: https://github.com/xadupre, https://github.com/lezcano	2024-06-04 11:47:32 +00:00
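The precision point made for IntTrueDiv and `int_truediv` in the commit above can be illustrated in plain Python, without PyTorch. This is only a sketch: it assumes the threshold written as `2*53` in the message refers to 2**53 (the point past which not every integer is exactly representable as a float), and the values `a` and `b` are chosen here purely for illustration.

```python
# Python's int / int rounds the exact quotient once; converting each operand
# to float first rounds three times (a, b, and the quotient).
a = 3 * (2**53 + 1)  # illustrative value above 2**53, not exactly representable
b = 3

python_style = a / b               # integer true division, correctly rounded
cast_first = float(a) / float(b)   # coerce to float, then divide

print(python_style)  # 9007199254740992.0
print(cast_first)    # 9007199254740994.0
```

The exact quotient is 2**53 + 1, so the integer division lands on the nearest float (2**53), while the cast-first path is off by one representable step.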
Jack Taylor	db515b6ac7	[ROCm] Fix error in torch.cuda initialisation if amdsmi is not available (#127528 ) Reported in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/15874 When nvml_count is set via `9f73c65b8f/torch/cuda/__init__.py (L834)` If amdsmi is not available this will throw an error ``` File "python3.10/site-packages/torch/cuda/__init__.py", line 634, in _raw_device_count_amdsmi except amdsmi.AmdSmiException as e: NameError: name 'amdsmi' is not defined ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127528 Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/pruthvistony, https://github.com/atalman	2024-06-04 11:16:02 +00:00
Andrew Gu	49048e7f26	[FSDP2] Fixed variable shadowing of `module` (#127776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127776 Approved by: https://github.com/wanchaol ghstack dependencies: #127771	2024-06-04 10:27:34 +00:00
Yifu Wang	f325b39303	Introduce Inductor passes to micro-pipeline all-gather-matmul and matmul-reduce-scatter in certain cases (#126598 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126598 Approved by: https://github.com/wanchaol	2024-06-04 09:06:56 +00:00
Sam Larsen	cf77e7dd97	[inductor] Enable subprocess-based parallel compile as the default (#126817 ) Differential Revision: [D58056502](https://our.internmc.facebook.com/intern/diff/D58056502) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817 Approved by: https://github.com/eellison	2024-06-04 07:48:32 +00:00
Ke Wen	b9c058c203	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-04 07:03:26 +00:00
Ke Wen	6abca6a564	[export][unflatten] More strictly respect scope when removing inputs (#127607 ) Code snippet from TorchTitan (LLaMa): ``` for layer in self.layers.values(): h = layer(h, self.freqs_cis) ``` `self.freqs_cis` is a buffer of root module (`self`). It is also an explicit arg in the call signature of original `layer` modules. If not respecting scope -- `freqs_cis`'s scope only corresponds to root -- `_sink_param` can remove `freqs_cis` from `layer`'s call signature, resulting in runtime error. There are two fixes in this PR: 1. We filter out the `inputs_to_state` corresponding to the current scope, using existing code that does prefix matching. 2. We delay the removal of param inputs from `call_module` nodes' `args`, till `_sink_param` call on that submodule returns. The return now returns information on which input is actually removed by the submodule, thus more accurate than just doing: ``` for node in call_module_nodes: node.args = tuple(filter(lambda n: n.name not in inputs_to_state, node.args)) ``` Before the PR: ![Screenshot 2024-05-31 at 1 40 24 AM](https://github.com/pytorch/pytorch/assets/6676466/a2e06b18-44d5-40ca-b242-0edab45075b7) After the PR: ![Screenshot 2024-05-31 at 1 43 41 AM](https://github.com/pytorch/pytorch/assets/6676466/b72afb94-cdfa-420d-b88b-29a92bf2a0c0) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127607 Approved by: https://github.com/pianpwk	2024-06-04 06:43:54 +00:00
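The difference between the name-based filter quoted in the commit above and removing only the inputs a submodule reports can be sketched in plain Python. Everything here is a hypothetical stand-in: the `Node` class, the shape of `inputs_to_state`, and the `removed_by_submodule` set are assumptions for illustration, not the actual unflatten code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a graph node; only the name is modeled.
    name: str


def filter_by_name(args, inputs_to_state):
    # The approach quoted in the commit message: drop every arg whose
    # name appears in inputs_to_state, regardless of scope.
    return tuple(filter(lambda n: n.name not in inputs_to_state, args))


def filter_by_report(args, removed_names):
    # Drop only the args the submodule reported as actually removed.
    return tuple(n for n in args if n.name not in removed_names)


# Assumed shape: input name -> list of state paths.
inputs_to_state = {"freqs_cis": ["freqs_cis"], "w": ["layers.0.w"]}
call_args = (Node("h"), Node("freqs_cis"), Node("w"))
removed_by_submodule = {"w"}  # freqs_cis belongs to the root scope only

naive = [n.name for n in filter_by_name(call_args, inputs_to_state)]
scoped = [n.name for n in filter_by_report(call_args, removed_by_submodule)]
print(naive)   # ['h']
print(scoped)  # ['h', 'freqs_cis']
```

In the name-based version `freqs_cis` disappears from the call even though the layer still expects it as an explicit argument, which is the runtime error the commit describes.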
Masahiro Hiramori	e216df48c8	[Dynamo][TVM] Fix ignored `trials` argument for MetaSchedule (#127747 ) Fixes #127746 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127747 Approved by: https://github.com/jansel	2024-06-04 06:13:02 +00:00
Andrew Gu	2122c9e2a9	[BE] Enabled lintrunner on torch/distributed/utils.py (#127771 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127771 Approved by: https://github.com/wanchaol, https://github.com/Skylion007	2024-06-04 06:10:33 +00:00
Ke Wen	ef77f2ca4a	[pipelining] Simple 1F1B schedule (#127673 ) ![Screenshot 2024-05-31 at 9 13 18 PM](https://github.com/pytorch/pytorch/assets/6676466/ecf3ca24-33a6-4188-9f7c-df6e96311caa) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127673 Approved by: https://github.com/wconstab	2024-06-04 06:09:51 +00:00
satheeshhab	f4b77ce8e2	Masked scale meta function registration #119984 (#127389 ) Fixes #119984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127389 Approved by: https://github.com/cpuhrsch	2024-06-04 06:09:17 +00:00
cyy	e7cb43a2d2	Check unused variables in tests (#127498 ) Enables unused variable checks in CMake. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127498 Approved by: https://github.com/ezyang	2024-06-04 05:35:25 +00:00
Boyuan Feng	2ad0e4197d	[ts-migration] support aten::__is__, aten::__isnot__, aten::__not__, profiler::_record_function_enter_new, profiler::_record_function_exit (#127656 ) Support more ops in ts converter and add unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127656 Approved by: https://github.com/SherlockNoMad	2024-06-04 04:51:29 +00:00
Yanbo Liang	8d153e0bab	[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728 Approved by: https://github.com/Chillee	2024-06-04 04:32:03 +00:00
Yanbo Liang	e793ae220f	[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678 Approved by: https://github.com/Chillee	2024-06-04 04:27:24 +00:00
Nikita Shulga	dae757c971	Specify supported OS matrix (#127816 ) Windows-10 or newer, manylinux-2014, MacOS-11 or newer (but only on Apple Silicon). Fixes https://github.com/pytorch/pytorch/issues/126679 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127816 Approved by: https://github.com/kit1980, https://github.com/huydhn	2024-06-04 04:25:41 +00:00
Will Constable	22368eac10	[FSDP2] Fix submesh slicing to enable 3D parallelism (#127585 ) Ensures the submesh used to create sharded parameters are created on a submesh that excludes the Pipeline Parallelism dimension. Also cleans up the logic for storing placements to no longer consider the outer / global dims. Since we store an 'spmd' submesh, we can avoid this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127585 Approved by: https://github.com/wanchaol	2024-06-04 04:24:09 +00:00
Yanbo Liang	69f5b66132	[Inductor] FlexAttention backward kernel optimization (#127208 ) BWD Speedups (before this PR): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|-------------------\|---------------\|----------------\| \| Average \| 0.211 \| \| \| \| \| Max \| 0.364 \| (16, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| \| Min \| 0.044 \| (2, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| ``` BWD Speedups (after this PR, though not optimizing block size yet): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|---------------\|----------------\| \| Average \| 0.484 \| \| \| \| \| Max \| 0.626 \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| \| Min \| 0.355 \| (8, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| ``` A few things remain to be done as follow-ups: * Optimize the default block size on A100/H100. * Support different seqlen for Q and K/V. * Support dynamic shapes for backward. * Enhance unit tests to check that there is no ```nan``` value in any grad. I think we should make some changes to ```test_padded_dense_causal``` because it has invalid inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208 Approved by: https://github.com/Chillee	2024-06-04 04:22:41 +00:00
Aaron Gokaslan	2498ef7490	Fix scheduler typehints (#127769 ) Fixes scheduler typehints Pull Request resolved: https://github.com/pytorch/pytorch/pull/127769 Approved by: https://github.com/jansel	2024-06-04 04:19:06 +00:00
Xilun Wu	6580a18f86	[c10d][BE] fix test_init_pg_and_rpc_with_same_socket (#127654 ) Summary: fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py`, which missed a call to destroy the created ProcessGroup before exiting the test function. This led to an "init PG twice" error in the test. Test Plan: `pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket` `ciflow/periodic`, since this test is included in `.ci/pytorch/multigpu-test.sh` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127654 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-06-04 04:00:28 +00:00
Menglu Yu	7e906ec9e5	[PT2][Optimus] Improve group batch fusion with same parent/users fusion enablement (#127648 ) Summary: Currently, we fuse the ops at a random place. Here we enable same parent/users fusion so that a follow-up split cat elimination becomes possible. Context https://docs.google.com/document/d/1MSZY23wKD2keW2Z-DfAI1DscDERHKjOJAnuB5bxa06I/edit Test Plan: # local reproduce ``` buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "pm_cmf" --flow_id 559694026 ``` P1386889671 Differential Revision: D58037636 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127648 Approved by: https://github.com/jackiexu1992	2024-06-04 03:41:44 +00:00
mori360	c32fe6b279	[FSDP] keep paras in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644 ) Fixes https://github.com/pytorch/pytorch/issues/126948 In the previous code, inside the `_load_optim_state_dict` function under the `info.broadcast_from_rank0` condition, `optim_state_dict` holds the parameters based on `optim`. The changes here aim to synchronize the differing parameters. Unit tests are in `test_optim_state_dict_para_matching` in `test_state_dict.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127644 Approved by: https://github.com/fegin	2024-06-04 03:32:22 +00:00
Kiuk Chung	4d0386ce1c	[torch/jit-runtime] Add explicit include of <chrono> to torch/jit/run… (#127779 ) Added an explicit include to `<chrono>` in `jit/runtime/logging.h` since `std::chrono::time_point<std::chrono::high_resolution_clock>` is directly referenced in the header. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127779 Approved by: https://github.com/albanD	2024-06-04 02:12:17 +00:00
Nikita Shulga	ddef7c350f	Add comments about runner labels (#127827 ) To distinguish between org-wide and repo-specific runners as well as highlight where they are hosted (by DevInfra, LF or various partners). Delete unused `bm-runner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127827 Approved by: https://github.com/huydhn	2024-06-04 02:06:43 +00:00
Chien-Chin Huang	1208347d09	[inductor][ez] fix loop ordering test (#127807 ) I didn't realize that the main block is not being run when inductor tests are being run in FBCode via remote GPUs. This is a quick fix. I've tested it in both OSS and FBCode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127807 Approved by: https://github.com/eellison, https://github.com/jansel	2024-06-04 01:14:34 +00:00
Jirka	41033a4274	PyPI: fix link to images to be rendered (#127798 ) It addresses a long-pending issue on PyPI. The [package description](https://pypi.org/project/torch/2.3.0/) is the repo's Readme, but compared to GitHub rendering, PyPI accepts only raw images linked via Markdown images. ![image](https://github.com/pytorch/pytorch/assets/6035284/1d8e51d5-c8c1-4f92-b323-f7684879adb4) This minor link edit points the image links at the raw images so that they render correctly on PyPI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127798 Approved by: https://github.com/albanD	2024-06-04 00:59:58 +00:00
cyy	05fa05cbae	[2/N] Change static functions in headers to inline (#127764 ) Follows #127727 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127764 Approved by: https://github.com/Skylion007	2024-06-04 00:49:04 +00:00
haozhe.zhu	dbf39a6e63	[inductor] fix linear_add_bias path (#127597 ) Previously, the `linear_add_bias` path did not work. This PR fixes it and adds more unit tests for it. Test Plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_add_bias ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127597 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-04 00:39:01 +00:00
Joel Schlosser	b42cfcabc4	Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 ) PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`: * `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()` * `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()` CPU impls for these new ATen ops will be added in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946 Approved by: https://github.com/davidberard98	2024-06-03 23:41:54 +00:00
eqy	ac568fc007	[CUDNN] Remove defunct cuDNN V8 API build flag (#120006 ) The flag basically does nothing following #95722 Let's see if the quantization tests break CC @malfet @atalmanagement Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006 Approved by: https://github.com/malfet	2024-06-03 22:42:05 +00:00
Jeff Daily	0e7bd7fedd	[ROCm] TunableOp improvements (#124362 ) - use less memory; smaller default hipblaslt workspace size - options to avoid cache effects - icache flush option - rotating buffers during tuning - python APIs - unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362 Approved by: https://github.com/xw285cornell	2024-06-03 22:30:11 +00:00
Scott Wolchok	0f1f0d3015	Onboard ARM bfloat16 to gemv fast path (#127484 ) Summary: Used bfloat16 dot support from #127477 to write a bfloat16 transposed fast path and integrated it. Test Plan: Ran https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py before and after on my Apple M1 Pro. Before: ``` mv_nt torch.float32 6.77 usec mv_nt torch.float16 8.24 usec mv_nt torch.bfloat16 184.74 usec mv_ta torch.float32 5.71 usec mv_ta torch.float16 27.95 usec mv_ta torch.bfloat16 98.06 usec notrans torch.float32 5.55 usec notrans torch.float16 25.11 usec notrans torch.bfloat16 63.55 usec trans_a torch.float32 5.62 usec trans_a torch.float16 74.48 usec trans_a torch.bfloat16 313.19 usec trans_b torch.float32 5.68 usec trans_b torch.float16 8.18 usec trans_b torch.bfloat16 14.96 usec ``` After: ``` mv_nt torch.float32 5.40 usec mv_nt torch.float16 8.25 usec mv_nt torch.bfloat16 12.81 usec mv_ta torch.float32 5.69 usec mv_ta torch.float16 27.94 usec mv_ta torch.bfloat16 98.18 usec notrans torch.float32 5.60 usec notrans torch.float16 25.17 usec notrans torch.bfloat16 63.22 usec trans_a torch.float32 5.61 usec trans_a torch.float16 69.32 usec trans_a torch.bfloat16 316.62 usec trans_b torch.float32 5.60 usec trans_b torch.float16 8.09 usec trans_b torch.bfloat16 14.61 usec ``` Note large improvement in mv_nt torch.bfloat16 case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127484 Approved by: https://github.com/malfet ghstack dependencies: #127477, #127478	2024-06-03 22:14:16 +00:00
Scott Wolchok	f6ca822366	Patch ARM Half use_gemv_fast_path gate to avoid kernel duplication (#127478 ) Summary: The existing code didn't gate the fast path, so the fast path had to duplicate the stock kernel. Now we gate it and delete the duplicate kernel. Test Plan: Existing tests. Flipped the TORCH_INTERNAL_ASSERT_DEBUG_ONLY to non-debug and forced to fail (locally) to make sure we had test coverage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127478 Approved by: https://github.com/malfet ghstack dependencies: #127477	2024-06-03 22:14:16 +00:00
Scott Wolchok	6faa3d5f18	Onboard ARM bfloat16 to gemm-by-dot-product-for-gemm_transa_ infrastructure (#127477 ) Summary: This gets us a baseline level of reasonable performance for bfloat16 matrix-vector and matrix-matrix multiplication on my Apple M1. I've intentionally left using intrinsics for future work. Test Plan: Used https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (modified to run larger sizes) to benchmark a range of LLM-interesting matrix-vector and matrix-matrix sizes on my Apple M1 Pro. bfloat16 performance is improved across the board (except possibly for very small cases) and now exceeds float32 performance (as it should) for the matrix-vector cases. Before: ``` Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.71 usec trans_b torch.bfloat16 0.81 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.93 usec trans_b torch.bfloat16 0.98 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2194.31 usec trans_b torch.float16 661.27 usec trans_b torch.bfloat16 3758.42 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 5792.04 usec trans_b torch.float16 1789.98 usec trans_b torch.bfloat16 10120.67 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6101.22 usec trans_b torch.float16 1927.34 usec trans_b torch.bfloat16 10469.47 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 18353.20 usec trans_b torch.float16 5161.06 usec trans_b torch.bfloat16 29601.69 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.14 usec trans_b torch.float16 0.85 usec trans_b torch.bfloat16 1.19 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.47 usec trans_b torch.float16 1.85 usec trans_b torch.bfloat16 1.75 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4416.40 usec trans_b torch.float16 
2688.36 usec trans_b torch.bfloat16 14987.33 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6140.24 usec trans_b torch.float16 7467.26 usec trans_b torch.bfloat16 40295.52 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6143.10 usec trans_b torch.float16 7298.04 usec trans_b torch.bfloat16 41393.43 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17650.72 usec trans_b torch.float16 21346.63 usec trans_b torch.bfloat16 116849.98 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== trans_b torch.float32 1.05 usec trans_b torch.float16 1.03 usec trans_b torch.bfloat16 1.69 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.05 usec trans_b torch.float16 3.08 usec trans_b torch.bfloat16 2.95 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2323.99 usec trans_b torch.float16 5265.45 usec trans_b torch.bfloat16 29942.40 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6202.01 usec trans_b torch.float16 14677.90 usec trans_b torch.bfloat16 80625.18 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6112.05 usec trans_b torch.float16 14340.52 usec trans_b torch.bfloat16 82799.99 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 17650.65 usec trans_b torch.float16 42551.43 usec trans_b torch.bfloat16 236081.08 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.26 usec trans_b torch.float16 1.34 usec trans_b torch.bfloat16 2.69 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.60 usec trans_b torch.float16 5.81 usec trans_b torch.bfloat16 5.34 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2328.05 usec trans_b torch.float16 10526.58 usec trans_b torch.bfloat16 60028.28 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6243.35 usec trans_b torch.float16 28505.08 usec trans_b torch.bfloat16 163670.15 usec 
m=4096, n=11008, k=16 ==================== trans_b torch.float32 5870.11 usec trans_b torch.float16 28597.89 usec trans_b torch.bfloat16 165404.88 usec m=32000, n=4096, k=16 ==================== trans_b torch.float32 17746.27 usec trans_b torch.float16 83393.87 usec trans_b torch.bfloat16 472313.13 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.35 usec trans_b torch.float16 2.01 usec trans_b torch.bfloat16 4.68 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.19 usec trans_b torch.float16 10.98 usec trans_b torch.bfloat16 10.13 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2525.29 usec trans_b torch.float16 23106.71 usec trans_b torch.bfloat16 122987.04 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6131.34 usec trans_b torch.float16 57537.41 usec trans_b torch.bfloat16 327825.00 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6395.01 usec trans_b torch.float16 57456.33 usec trans_b torch.bfloat16 331325.58 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 19078.68 usec trans_b torch.float16 167735.08 usec trans_b torch.bfloat16 975736.88 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.40 usec trans_b torch.float16 6.07 usec trans_b torch.bfloat16 16.83 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.78 usec trans_b torch.float16 40.35 usec trans_b torch.bfloat16 37.21 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4827.60 usec trans_b torch.float16 84341.24 usec trans_b torch.bfloat16 478917.75 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 11879.96 usec trans_b torch.float16 226484.33 usec trans_b torch.bfloat16 1289465.50 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 10707.75 usec trans_b torch.float16 229200.58 usec trans_b torch.bfloat16 1327416.67 usec m=32000, 
n=4096, k=128 ==================== trans_b torch.float32 33306.32 usec trans_b torch.float16 662898.21 usec trans_b torch.bfloat16 3815866.63 usec ``` After: ``` Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 0.77 usec trans_b torch.float16 0.72 usec trans_b torch.bfloat16 0.77 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.73 usec trans_b torch.float16 0.93 usec trans_b torch.bfloat16 1.56 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2195.22 usec trans_b torch.float16 675.40 usec trans_b torch.bfloat16 1038.29 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 5980.27 usec trans_b torch.float16 1806.08 usec trans_b torch.bfloat16 2756.46 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6339.95 usec trans_b torch.float16 1844.71 usec trans_b torch.bfloat16 2726.52 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 18137.17 usec trans_b torch.float16 6020.75 usec trans_b torch.bfloat16 8612.89 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.24 usec trans_b torch.float16 0.91 usec trans_b torch.bfloat16 1.07 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.58 usec trans_b torch.float16 1.96 usec trans_b torch.bfloat16 2.11 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4583.43 usec trans_b torch.float16 3014.04 usec trans_b torch.bfloat16 4434.04 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6245.55 usec trans_b torch.float16 7513.82 usec trans_b torch.bfloat16 11207.80 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6096.22 usec trans_b torch.float16 7688.82 usec trans_b torch.bfloat16 11143.72 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17982.88 usec trans_b torch.float16 22001.28 usec trans_b torch.bfloat16 32470.62 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== 
trans_b torch.float32 1.05 usec trans_b torch.float16 1.02 usec trans_b torch.bfloat16 1.44 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.07 usec trans_b torch.float16 3.10 usec trans_b torch.bfloat16 3.38 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2245.43 usec trans_b torch.float16 5597.87 usec trans_b torch.bfloat16 8775.08 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6227.68 usec trans_b torch.float16 15102.41 usec trans_b torch.bfloat16 22457.37 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6082.16 usec trans_b torch.float16 15131.57 usec trans_b torch.bfloat16 21860.15 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 19659.00 usec trans_b torch.float16 45075.64 usec trans_b torch.bfloat16 67746.75 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.31 usec trans_b torch.float16 1.41 usec trans_b torch.bfloat16 2.04 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.66 usec trans_b torch.float16 5.76 usec trans_b torch.bfloat16 6.37 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2271.34 usec trans_b torch.float16 11198.46 usec trans_b torch.bfloat16 16893.54 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6266.85 usec trans_b torch.float16 29342.49 usec trans_b torch.bfloat16 45159.22 usec m=4096, n=11008, k=16 ==================== trans_b torch.float32 5999.16 usec trans_b torch.float16 29157.43 usec trans_b torch.bfloat16 43295.81 usec m=32000, n=4096, k=16 ==================== trans_b torch.float32 18028.83 usec trans_b torch.float16 89626.88 usec trans_b torch.bfloat16 128164.62 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.38 usec trans_b torch.float16 2.03 usec trans_b torch.bfloat16 3.29 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.24 usec trans_b torch.float16 10.58 
usec trans_b torch.bfloat16 11.97 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2591.56 usec trans_b torch.float16 21683.62 usec trans_b torch.bfloat16 32657.68 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6468.43 usec trans_b torch.float16 57811.33 usec trans_b torch.bfloat16 89263.21 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6034.74 usec trans_b torch.float16 59372.56 usec trans_b torch.bfloat16 88107.85 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 18609.27 usec trans_b torch.float16 167298.00 usec trans_b torch.bfloat16 255116.37 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.44 usec trans_b torch.float16 6.11 usec trans_b torch.bfloat16 10.92 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.80 usec trans_b torch.float16 40.26 usec trans_b torch.bfloat16 44.82 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4773.29 usec trans_b torch.float16 84458.54 usec trans_b torch.bfloat16 131248.58 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 12249.16 usec trans_b torch.float16 234411.87 usec trans_b torch.bfloat16 351970.71 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 11439.24 usec trans_b torch.float16 233347.04 usec trans_b torch.bfloat16 354475.96 usec m=32000, n=4096, k=128 ==================== trans_b torch.float32 33803.03 usec trans_b torch.float16 688157.54 usec trans_b torch.bfloat16 1048221.42 usec ``` Also ran the stock configuration; it was unchanged, indicating that we need to integrate this path with torch.mv separately, which will come in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127477 Approved by: https://github.com/malfet	2024-06-03 22:14:10 +00:00
Xuehai Pan	01fc22056a	[BE] enable UFMT for `torch/masked/` (#127715 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127715 Approved by: https://github.com/cpuhrsch	2024-06-03 22:01:49 +00:00
Xiaodong Wang	406532f864	[AMD] Fix power_draw api (#127729 ) Summary: `average_socket_power` only gives me NA, so we need to change it to `current_socket_power`. Test Plan: Before, `torch.cuda.power_draw` gives me NA; after, it gives me the right power reading (e.g. 441). Differential Revision: D58047484 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127729 Approved by: https://github.com/nmacchioni, https://github.com/eqy	2024-06-03 21:46:50 +00:00
Yidi Wu	c27882ffa8	[torchbind] always fakify script object by default in non-strict export (#127116 ) This diff can be risky for internal tests: any torchbind class that hasn't registered a fake class will fail and we should fix them. We've gained some confidence that this can work e2e by implementing FakeTensorQueue for TBE models in sigmoid with [D54210823](https://www.internalfb.com/diff/D54210823). Differential Revision: [D57991002](https://our.internmc.facebook.com/intern/diff/D57991002) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127116 Approved by: https://github.com/zou3519 ghstack dependencies: #127113, #127114	2024-06-03 21:38:57 +00:00
Yidi Wu	3efac92888	[torchbind] support torch.compile with aot_eager backend (#127114 ) Differential Revision: [D57991001](https://our.internmc.facebook.com/intern/diff/D57991001) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127114 Approved by: https://github.com/zou3519 ghstack dependencies: #127113	2024-06-03 21:38:57 +00:00
Yidi Wu	c6dc624690	[torchbind] remove test cases that don't fakify script objects (#127113 ) As titled. Differential Revision: [D57991003](https://our.internmc.facebook.com/intern/diff/D57991003) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127113 Approved by: https://github.com/zou3519	2024-06-03 21:38:50 +00:00
Zain Huda	6d4ec9b2ec	[RFC] Introduce Checkpointable for DCP (#127540 ) (#127628 ) Summary: # Introduce Checkpointable interface for DCP to support arbitrary tensor subclasses for checkpointing Authors: * zainhuda ## Summary This diff adds a CheckpointableTensor interface to allow for future compatibility for any tensor subclass with DCP in a clean and maintainable way. ## Motivation For TorchRec sharding migration from ShardedTensor to DTensor, we create a tensor subclass that is stored by DTensor to support TorchRec's sharding schemes (e.g., empty shards, multiple shards on a rank). ## Proposed Implementation View the CheckpointableTensor interface implementation, in which we introduce the minimal set of methods needed to be compatible with DCP. These methods are expected to be implemented by any tensor subclass, which is then checkpointable by DCP. ## Drawbacks No drawbacks, it extends functionality in a clean and maintainable way. ## Alternatives An alternative design was to create paths that check for certain attributes in tensor subclasses, which can get messy and make it hard to maintain or understand why they were there in the first place. Test Plan: Sandcastle cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC Differential Revision: D57970603 Pulled By: iamzainhuda Pull Request resolved: https://github.com/pytorch/pytorch/pull/127628 Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/fegin	2024-06-03 21:21:55 +00:00
Edward Z. Yang	a4064da8ca	Always simplify sympy expressions before printing. (#127543 ) This is important because if a replacement has happened during inductor lowering, we may have stale symbols in sympy expressions that we need to replace away. Do this at the very end. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127543 Approved by: https://github.com/lezcano	2024-06-03 20:36:14 +00:00
Xinya Zhang	ef9451ac8d	Move the build of AOTriton to base ROCM docker image. (#127012 ) Mitigates #126111 AOTriton, as a math library, takes a long time to build. However, this library is not moving as fast as PyTorch itself, and it is not cost-efficient to build it for every CI check. This PR moves the build of AOTriton from PyTorch to its base docker image, avoiding duplicated and long build times. Pre-this-PR: * PyTorch base docker build job duration: 1.1-1.3h * PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node) Post-this-PR: * PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node) * PyTorch build job duration: <20 min Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012 Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn	2024-06-03 20:35:22 +00:00
Ke Wen	941316f821	[pipelining] Stress test schedules with multi iters (#127475 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127475 Approved by: https://github.com/wconstab	2024-06-03 20:24:07 +00:00
Xiangyang (Mark) Guo	db9d457a3f	Use sleef on macOS Apple silicon by default (#126509 ) Use sleef ~~for aarch64~~ on macOS Apple silicon by default. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126509 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-06-03 19:33:06 +00:00
PyTorch MergeBot	2fc907971a	Revert "[Inductor] FlexAttention backward kernel optimization (#127208 )" This reverts commit f7171313abf14d9501a330457140b2f8a01c9985. Reverted https://github.com/pytorch/pytorch/pull/127208 on behalf of https://github.com/yanboliang due to test_flex_attention is failing internally ([comment](https://github.com/pytorch/pytorch/pull/127208#issuecomment-2145830810))	2024-06-03 18:13:27 +00:00
PyTorch MergeBot	3f45fa63f2	Revert "[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728 )" This reverts commit 10e3406ea5d115a54a7d753d33110762eb6c07ff. Reverted https://github.com/pytorch/pytorch/pull/127728 on behalf of https://github.com/yanboliang due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127728#issuecomment-2145822667))	2024-06-03 18:10:46 +00:00
PyTorch MergeBot	c35b65715c	Revert "[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678 )" This reverts commit e2e3ca94ccce1c0abbfd75ac0368793e1756c268. Reverted https://github.com/pytorch/pytorch/pull/127678 on behalf of https://github.com/atalman due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127678#issuecomment-2145821489))	2024-06-03 18:07:57 +00:00
GdoongMathew	3437177e2b	Quick Fix on #126854 , deepcopy `lr` and other possible `base_parameters` (#127190 ) * Apply `deepcopy` to every base parameters (`initial_lr`, `max_lr`) when instantiating `LRScheduler`. Fixes #126854 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127190 Approved by: https://github.com/janeyx99	2024-06-03 18:06:31 +00:00
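The message for 3437177e2b does not spell out why the copy is needed. One plausible reading, assumed here rather than taken from the PR, is that a mutable `lr` object shared between a parameter group and its recorded `initial_lr` would change together under in-place updates. The sketch below shows that aliasing effect in plain Python, using a one-element list as a stand-in for a tensor-valued learning rate; the variable names are illustrative only.

```python
import copy

# Mutable stand-in for a tensor-valued learning rate (assumption: the real
# case involves an object that can be updated in place).
group = {"lr": [0.1]}

aliased_initial_lr = group["lr"]                # same object, no copy
copied_initial_lr = copy.deepcopy(group["lr"])  # independent snapshot

# An in-place update, such as a scheduler step might perform.
group["lr"][0] *= 0.5

print("aliased:", aliased_initial_lr)
print("copied: ", copied_initial_lr)
```

The aliased value follows the in-place update, while the deep-copied value keeps the original learning rate.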
Alnis Murtovi	d8d0bf264a	Inductor: Allow small sizes of m for mixed mm autotuning (#127663 ) For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056. I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used. For the example in #127056: - Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s. - If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s. - With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663 Approved by: https://github.com/Chillee	2024-06-03 17:53:48 +00:00
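The three bandwidth figures quoted in d8d0bf264a can be turned into relative speedups with simple arithmetic. The snippet below does only that; the variable names are chosen here for illustration, and the numbers are the ones stated in the commit message for the author's machine.

```python
# Bandwidths reported in the commit message, in GB/s.
no_autotuning = 146.0        # skip_triton evaluates to true
autotuned_block_m_32 = 614.0 # autotuning enabled, BLOCK_M >= 32
autotuned_block_m_16 = 772.0 # autotuning enabled, BLOCK_M = 16


def speedup(new_gbps, old_gbps):
    """Ratio of two bandwidth measurements."""
    return new_gbps / old_gbps


print(f"autotuning vs. none:         {speedup(autotuned_block_m_32, no_autotuning):.2f}x")
print(f"BLOCK_M=16 vs. none:         {speedup(autotuned_block_m_16, no_autotuning):.2f}x")
print(f"BLOCK_M=16 vs. BLOCK_M>=32:  {speedup(autotuned_block_m_16, autotuned_block_m_32):.2f}x")
```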
Janani Sriram	7c3740d388	[NestedTensor] Extend coverage for unbind when ragged_idx != 1 (#127493 ) Summary: Extend coverage for the `NestedTensor` `unbind` operator to cases in which `ragged_idx != 1`. Currently, the `unbind` operator in the `NestedTensor` class splits a tensor along the 0-th dimension, where the `ragged_idx` property, which controls the jagged dimension upon which `unbind` splits, is 1. This diff extends support for `ragged_idx != 1` in `NestedTensor`s, allowing `unbind` to split a tensor along a jagged dimension greater than 0 for `NestedTensor`s with and without the `lengths` property. Test Plan: Added the following unit tests: `test_unbind_ragged_idx_equals_2_cpu`, `test_unbind_ragged_idx_equals_3_cpu`, and `test_unbind_ragged_idx_equals_last_dim_cpu` verify that `unbind` works for all jagged dimensions greater than 1, for `NestedTensor`s without `lengths`. ``` test_unbind_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_ragged_idx_equals_last_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_lengths_cpu` and `test_unbind_with_lengths_ragged_idx_equals_1_cpu` verify that `unbind` works when the jagged dimension is 1, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_with_lengths_ragged_idx_equals_1_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_lengths_ragged_idx_equals_2_cpu` and `test_unbind_with_lengths_ragged_idx_equals_3_cpu` verify that `unbind` works when the jagged dimension is greater than 1, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok test_unbind_with_lengths_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... 
ok ``` `test_unbind_with_lengths_ragged_idx_equals_0_cpu` verifies that `unbind` fails when the jagged dimension is 0 (the batch dimension), for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_0_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`. ``` test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` `test_unbind_with_wrong_lengths_cpu` verifies that `unbind` fails when the lengths exceed the limitations set by offsets, for `NestedTensor`s with `lengths`. ``` test_unbind_with_wrong_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok ``` Differential Revision: D57942686 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127493 Approved by: https://github.com/davidberard98	2024-06-03 17:46:12 +00:00
angelayi	4d32de14b6	[export] Handle serializing duplicate getitem nodes (#127633 ) We ran into a graph that looks something like the following, where we have 2 getitem calls to the same index (%getitem, %getitem_2 both query topk[0]): ``` graph(): %x : [num_users=1] = placeholder[target=x] %topk : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%x, 2), kwargs = {}) %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {}) %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 1), kwargs = {}) %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {}) %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%getitem, %getitem_2), kwargs = {}) %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_tensor, 2), kwargs = {}) return (mul, getitem_1) ``` The duplicate getitem call gets created during a pass, so there are a couple of solutions: 1. Change serializer to support the case of duplicate getitem calls 2. Change the pass so that it doesn’t produce duplicate getitem calls 3. Add a pass which dedups the getitem calls As a framework, we should do 1 and 3 (through a CSE pass). This PR implements solution 1. However, the serializer currently does some special handling for getitem nodes -- instead of directly serializing the getitem nodes, we serialize the output of the node that outputs a list of tensors (the %topk node in this example) into a list of nodes, one for each output ([%getitem, %getitem_1]). This fails when we have duplicate getitem nodes to the same index (%getitem_2), since we do not record that duplicate getitem node anywhere. So, the solution this PR takes is that the serializer will deduplicate the getitem nodes (%getitem_2 will be replaced with %getitem). 
This would result in a semantically correct graph, but not necessarily one that is node-to-node identical to the original fx graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127633 Approved by: https://github.com/ydwu4	2024-06-03 17:25:51 +00:00
Aaron Gokaslan	12c4a2c297	[BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716 ) Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test is actually testing the dunder method. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716 Approved by: https://github.com/ezyang	2024-06-03 17:22:13 +00:00
Wanchao Liang	21144ce570	[dtensor] implement scatter op with simple replication (#126713 ) As titled, implement the torch.scatter op with a simple replication strategy; need to follow up and see whether we can actually support any sharding pattern. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126713 Approved by: https://github.com/tianyu-l ghstack dependencies: #126712	2024-06-03 16:16:28 +00:00
Wanchao Liang	ded580a594	[dtensor] standardize multi mesh-dim strategy with utils (#126712 ) This PR standardizes multi mesh-dim strategy generation by unifying a util that expands a single mesh-dim strategy into a multi mesh-dim strategy, making strategy generation simpler. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126712 Approved by: https://github.com/tianyu-l	2024-06-03 16:16:28 +00:00
PyTorch MergeBot	d1fad416a8	Revert "Add aten._unsafe_masked_index (#116491 )" This reverts commit f03f8bc901a6c9038308a6353e8d280f4b5628f5. Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))	2024-06-03 15:51:50 +00:00
atalman	53f001c599	Revert "correct BLAS input (#126200 )" (#127762 ) This reverts commit ea13e9a097aaa875a2b404822579b7f8b62ea291. Looks like this could have caused: https://github.com/pytorch/pytorch/actions/runs/9346105069/job/25722431775#step:17:984 aarch64 test failures: ``` + echo 'Checking that MKLDNN is available on aarch64' Checking that MKLDNN is available on aarch64 + pushd /tmp /tmp / + python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)' Error: Process completed with exit code 1. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127762 Approved by: https://github.com/PaliC, https://github.com/malfet	2024-06-03 15:49:48 +00:00
Shuqiang Zhang	8677508167	[c10d] guard gpu context during abort (#127363 ) This is a mitigation for an internal out-of-memory issue on GPU0 that happened during comms abort; this PR was tested internally and fixed the out-of-memory issue. Note: this is supposed to be a mitigation only, as the ideal fix should be within the NCCL comm libs, which should set the right CUDA context before any CUDA call and restore it to its exact previous state. ncclCommDestroy/ncclCommAbort -> commReclaim -> commDestroySync (https://fburl.com/code/pori1tka) In commDestroySync, it thinks that the "current device context" is not the same as the comm's device context. It tries to: 1) save the current context 2) set the comm's device context 3) clean up things 4) restore the "previously stored context" by another cudaSetDevice. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127363 Approved by: https://github.com/wconstab	2024-06-03 15:41:11 +00:00
Aidyn-A	430cdfc0ac	[ATen][Native] fixes sparse SPMV on aarch64 (#127642 ) Fixes #127491 In #127491 result was allocated as `result = at::empty(...)`, which does not guarantee `result` being filled by zeros, therefore `torch.mv` was producing non-finite values. This happened mainly because the corner case (`beta = 0`) of `addmv` was not taken care of, as it should be just like in any other `addmv`/`addmm`: `923edef31c/aten/src/ATen/native/mkl/SparseBlasImpl.cpp (L307-L311)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127642 Approved by: https://github.com/malfet	2024-06-03 15:38:27 +00:00
Zain Rizvi	badf898df2	Remove unstable ARC jobs (#127563 ) Disable these jobs since we're no longer trying to enable ARC Pull Request resolved: https://github.com/pytorch/pytorch/pull/127563 Approved by: https://github.com/huydhn	2024-06-03 15:30:06 +00:00
James Wu	63d7ffe121	Retry of D58015187 Move AsyncCompile to a different file (#127691 ) Summary: This is a retry of https://github.com/pytorch/pytorch/pull/127545/files and D58015187, fixing the internal test that also imported codecache Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now Differential Revision: D58054611 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691 Approved by: https://github.com/oulgen	2024-06-03 15:29:41 +00:00
PaliC	3f8b8f08c8	[Split Build] Make libtorch_global_deps accessible from libtorch wheel (#127570 ) Title Pull Request resolved: https://github.com/pytorch/pytorch/pull/127570 Approved by: https://github.com/atalman, https://github.com/malfet	2024-06-03 15:14:29 +00:00
PyTorch MergeBot	d05cddfe23	Revert "FP8 rowwise scaling (#125204 )" This reverts commit 923edef31c7f3e98a14625724f2019b1422dcb26. Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2145422196))	2024-06-03 15:00:21 +00:00
Isuru Fernando	f03f8bc901	Add aten._unsafe_masked_index (#116491 ) To generate masked indexing operations that would generate masked loads in triton code Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491 Approved by: https://github.com/lezcano, https://github.com/peterbell10	2024-06-03 14:44:03 +00:00
Edward Z. Yang	d6963e769c	Force Inductor output code to be dumped even if it fails to compile (#127700 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127700 Approved by: https://github.com/oulgen	2024-06-03 14:06:53 +00:00
Daniil Kutz	f343f98710	[jit] Validate mobile module fields parsed by flatbuffer loader (#127437 ) Fixes an error in the `torch.jit.load` Python API function that causes a crash in the C backend of PyTorch. The mobile module is successfully parsed from flatbuffer format, but its fields are used without any validation. Fixes #127434 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127437 Approved by: https://github.com/davidberard98	2024-06-03 08:48:12 +00:00
Xilun Wu	e017b56c0c	[dtensor] local_map UX change: keep func signature and be compatible with Tensor input (#126924 ) Summary: This PR makes two changes to `local_map`: 1. It regulates how the user can access `DeviceMesh` inside the `func` argument of `local_map`. This means `local_map` will strictly follow the `func` signature without implicitly passing any argument to `func`. If the user wants to use `DeviceMesh` inside `func`, this mesh must be explicitly passed to `func` as an argument by the user. For example, ``` def user_function(device_mesh, /, args, kwargs): USER CODE HERE local_func = local_map(func=user_function, ...) dtensor_out = local_func(device_mesh, dtensor_input, ...) ``` Before this PR, user code was like: ``` def user_function(device_mesh, /, args, kwargs): USER CODE HERE local_func = local_map(func=user_function, ...) dtensor_out = local_func(dtensor_input, ...) # local_map passes mesh implicitly for user ``` 2. `local_map` now supports mixed use of `torch.Tensor` and `DTensor` in arguments: - Pure torch.Tensor case: no `DTensor` argument is passed in, all tensor arguments are `torch.Tensor`. Bypass the `in_placements` check and unwrapping steps. The output will not be wrapped into `DTensor` but directly returned. - Pure DTensor case: no `torch.Tensor` argument is passed in, all tensor arguments are `DTensor`. This follows the default rule: `in_placements` check, unwrapping arguments, pass into `func`, wrapping the `torch.Tensor` output into `DTensor` if the `out_placements` is not `None`. - Mix of the above two: some arguments are `torch.Tensor` while some are `DTensor`. Only perform `in_placements` check and unwrapping on `DTensor` arguments. For output processing, it's the same as Pure DTensor case. Test: `pytest test/distributed/_tensor/experimental/test_local_map.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126924 Approved by: https://github.com/wanchaol	2024-06-03 08:41:59 +00:00
diwei sun	2d1ad0c31a	[CI] Add freezing for cpu inductor accuracy test in inductor CI (#124715 ) This PR enables '--freezing' when running the dynamo accuracy check in CI. Background: Issue [#124286](https://github.com/pytorch/pytorch/issues/124286) is not captured by CI since freezing is not enabled for cpu-inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124715 Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman, https://github.com/desertfire	2024-06-03 07:37:30 +00:00
Yanbo Liang	10e3406ea5	[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728 Approved by: https://github.com/Chillee	2024-06-03 07:15:46 +00:00
Chien-Chin Huang	6d21685b45	[DSD] Fixes various bugs for broadcast_from_rank0 (#127635 ) Fixes https://github.com/pytorch/pytorch/issues/126285 Summary: 1. Fixes https://github.com/pytorch/pytorch/issues/126285 2. Broadcasts one tensor at a time to avoid OOM. 3. Adds some docstrings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127635 Approved by: https://github.com/weifengpy	2024-06-03 06:35:21 +00:00
Feng Yuan	48846cd164	Update torch-xpu-ops pin (ATen XPU implementation) (#127730 ) Regular bi-weekly pin update. 1. Port operator-related PyTorch unit tests. The existing operators in torch-xpu-ops are covered by: 1) operator-specific tests, like test_binary_ufuncs.py; 2) common operator tests, like test_ops.py. 2. Fix bugs under the latest PyTorch unit test scope, https://github.com/intel/torch-xpu-ops/tree/release/2.4/test/xpu. In total, 297 ATen operators are implemented in torch-xpu-ops. https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml Pull Request resolved: https://github.com/pytorch/pytorch/pull/127730 Approved by: https://github.com/EikanWang	2024-06-03 05:55:00 +00:00
Yanbo Liang	e2e3ca94cc	[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678 Approved by: https://github.com/Chillee	2024-06-03 04:35:50 +00:00
cyy	288df042c5	[1/N] Change static functions in headers to inline (#127727 ) So that it may fix some tricky linking issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127727 Approved by: https://github.com/ezyang	2024-06-03 04:34:36 +00:00
cyy	1b182ea0d2	Remove c10::guts::{conjunction,disjunction} (#127726 ) They are not used in PyTorch OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127726 Approved by: https://github.com/ezyang	2024-06-03 04:06:21 +00:00
leslie-fang-intel	3399ad8d9d	[Inductor][CPP] Add UT for bitwise right shift (#127731 ) Summary: Per the discussion in https://github.com/pytorch/pytorch/issues/127310, `bitwise_right_shift` failed in Torch 2.1 but passes with the latest PyTorch. Add the UT in this PR to ensure correctness. Test Plan: ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_bitwise_right_shift ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127731 Approved by: https://github.com/Skylion007	2024-06-03 04:05:41 +00:00
Yanbo Liang	7e97b33fbb	[Dynamo] Log backward graph compilation metrics (#126629 ) Fixes #125313 Compilation metric logs for the code example at #125313: ``` %s CompilationMetrics(compile_id='0/0', frame_key='1', co_name='forward', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=10, cache_size=0, accumulated_cache_size=0, guard_count=11, shape_env_guard_count=0, graph_op_count=1, graph_node_count=3, graph_input_count=1, start_time=1716247236.6165977, entire_frame_compile_time_s=7.926939964294434, backend_compile_time_s=7.887059926986694, inductor_compile_time_s=4.108498811721802, code_gen_time_s=3.97833514213562, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons={"'skip function graph_break in file /home/ybliang/local/pytorch/torch/_dynamo/decorators.py'"}, dynamo_time_before_restart_s=0.025330543518066406, has_guarded_code=True, is_fwd=True) %s CompilationMetrics(compile_id='1/0', frame_key='2', co_name='torch_dynamo_resume_in_forward_at_12', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=12, cache_size=0, accumulated_cache_size=0, guard_count=10, shape_env_guard_count=0, graph_op_count=2, graph_node_count=5, graph_input_count=1, start_time=1716247244.544928, entire_frame_compile_time_s=0.10148310661315918, backend_compile_time_s=0.08753013610839844, inductor_compile_time_s=0.03691983222961426, code_gen_time_s=0.022417306900024414, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons=set(), dynamo_time_before_restart_s=0.0, has_guarded_code=True, is_fwd=True) tensor([[-0.1622, -0.0000, -0.0000, 0.5643, -0.0000, 0.0000, -0.5087, 0.0914, -0.0000, -0.0421]], grad_fn=<CompiledFunctionBackward>) %s CompilationMetrics(compile_id='1/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, 
cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.026738643646240234, code_gen_time_s=0.016446352005004883, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False) %s CompilationMetrics(compile_id='0/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.14563536643981934, code_gen_time_s=0.08652091026306152, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126629 Approved by: https://github.com/ezyang	2024-06-03 03:55:33 +00:00
PyTorch MergeBot	84776d7597	Revert "[BE]: Update mypy to 1.10.0 (#127717 )" This reverts commit 30213ab0a7b27277e76ea9dd707ce629a63d91ee. Reverted https://github.com/pytorch/pytorch/pull/127717 on behalf of https://github.com/huydhn due to I am not sure why but the failures look legit and they are showing up in trunk `30213ab0a7` ([comment](https://github.com/pytorch/pytorch/pull/127717#issuecomment-2144183347))	2024-06-03 02:52:47 +00:00
bigning	e57f51b80f	Update _dedup_save_plans.py (#126569 ) To resolve https://github.com/pytorch/pytorch/issues/125740, save each tensor on the lowest rank. Fixes #125740 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126569 Approved by: https://github.com/LucasLLC	2024-06-03 01:55:03 +00:00
Yash Rathore	fec8ef8c17	[Aten][BlasKernel] Add function prototype to fix compiler error (#127719 ) Adds a prototype for function `fp16_dot_with_fp32_arith()` in `aten/src/ATen/native/BlasKernel.cpp`. Without this patch the build fails on Apple silicon/macOS (CPU) with the error `no previous prototype for function 'fp16_dot_with_fp32_arith' [-Werror,-Wmissing-prototypes]`. The function cannot be marked `static` because its use is not limited to this file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127719 Approved by: https://github.com/Skylion007	2024-06-02 23:41:43 +00:00
Xuehai Pan	8b08b0f340	[BE] enable ruff rule `Q` from flake8-quotes (#127713 ) Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes: - [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003) - [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713 Approved by: https://github.com/ezyang	2024-06-02 23:25:26 +00:00
Edward Z. Yang	139b9c6529	Avoid reference cycle in inner closure (#127711 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127711 Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb	2024-06-02 21:28:46 +00:00
Aaron Gokaslan	30213ab0a7	[BE]: Update mypy to 1.10.0 (#127717 ) Updates mypy to the latest and greatest. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717 Approved by: https://github.com/ezyang	2024-06-02 21:07:23 +00:00
Kiuk Chung	fb53cd6497	[aten_cuda/flash_attn] Add typename to template argument Kernel_trait… (#127634 ) Adds the `typename` keyword to the template argument `Kernel_traits::TiledMma` and `Kernel_traits::TiledMmaSdP` (which are dependent type names) when calling the template function `pytorch_flash::convert_layout_acc_Aregs`. Without `typename` flash_attention kernels do not compile with Clang under C++20 since Clang compiles the entire .cu file in a single pass as opposed to NVCC which split compiles the host and device code. Adding `typename` seems to be OK under NVCC based on CI cuda builds succeeding. Below is the excerpt of the compilation error: ``` third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:46:24: note: expanded from macro 'ALIBI_SWITCH' 46 \| #define ALIBI_SWITCH BOOL_SWITCH \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:132:5: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd_seqk_parallel<pytorch_flash::Flash_bwd_ke rnel_traits<160, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here 132 \| run_flash_bwd_seqk_parallel<Kernel_traits, Is_dropout>(params, stream); \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:280:13: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd<pytorch_flash::Flash_bwd_kernel_traits<1 60, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here 280 \| run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 64, 8, 4, 4, 4, false, true, T>, Is_dropout>(params, stream); \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:36:26: note: expanded from macro 'DROPOUT_SWITCH' 36 \| #define DROPOUT_SWITCH BOOL_SWITCH \| ^ third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:12:5: note: in instantiation of function 
template specialization 'pytorch_flash::run_mha_bwd_hdim160<cutlass::half_t>' request ed here 12 \| run_mha_bwd_hdim160<cutlass::half_t>(params, stream); \| ^ In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:7: In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:12: third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_kernel.h:543:86: error: missing 'typename' prior to dependent type name 'Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>::TiledMmaSdP' 543 \| Tensor tPrP = make_tensor(rP.data(), pytorch_flash::convert_layout_acc_Aregs<Kernel_traits::TiledMmaSdP>(rP.layout())); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127634 Approved by: https://github.com/Skylion007	2024-06-02 16:25:02 +00:00
rzou	08653fe355	Beef up the allow_in_graph docs (#127117 ) We make the following changes: - most of the time when someone uses allow_in_graph, they actually wanted to make a custom op. We add a link to the custom ops landing page and explain the differences between allow_in_graph and custom ops. - we warn people against using allow_in_graph footguns and document them. Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127117 Approved by: https://github.com/jansel, https://github.com/albanD	2024-06-02 15:00:46 +00:00
Aaron Gokaslan	e24a87ed8d	[BE][Ez]: Apply PYI059 - Generic always come last (#127685 ) Generic baseclass should always be last or unexpected issues can occur, especially in non-stub files (such as with MRO). Applies autofixes from the preview PYI059 rule to fix the issues in the codebase. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127685 Approved by: https://github.com/ezyang	2024-06-02 13:38:58 +00:00
Aaron Gokaslan	c2547dfcc3	[BE][Ez]: Enable ruff PYI019 (#127684 ) Tells pytorch to use typing_extensions.Self when it's able to. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127684 Approved by: https://github.com/ezyang	2024-06-02 13:38:33 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings whose messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
Sheng Fu	c1dd3a615f	Implement Graph Transform Observer (#127427 ) Summary: Implement Graph Transform Observer Differential Revision: D57887518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427 Approved by: https://github.com/angelayi	2024-06-02 06:49:47 +00:00
cyy	4e7f497bb3	[Submodule] Remove ios-cmake (#127694 ) It has not been updated for a long time and CI iOS builds don't rely on it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127694 Approved by: https://github.com/ezyang	2024-06-02 04:40:21 +00:00
Michael Lazos	2129903aa3	Properly detect nested torch function args (#127496 ) Dynamo was not detecting nested torch function classes in containers. This was due to pytree compatibility for variable trackers being removed. Fixes https://github.com/pytorch/pytorch/issues/127174 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496 Approved by: https://github.com/anijain2305	2024-06-02 03:43:22 +00:00
Colin Peppler	16578e8584	[symbolic shapes] if symbol not in var_ranges default to unknown range (#127681 ) Purpose of this PR is to get around this error: https://github.com/pytorch/pytorch/issues/127677 Differential Revision: D58048558 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127681 Approved by: https://github.com/lezcano	2024-06-02 02:28:40 +00:00
titaiwangms	4fd777ed59	[ONNX] Add quantized layer norm op to opset 17 (#127640 ) Fixes #126160 Continue #126555 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127640 Approved by: https://github.com/justinchuby	2024-06-02 02:10:02 +00:00
xinan.lin	c19ad112f6	[Inductor UT][Intel GPU] Skip test case which doesn't currently work on the XPU stack but newly re-enabled by community. (#127629 ) The Inductor UT test/inductor/test_triton_heuristics.py:test_artificial_zgrid, which was previously skipped, was recently enabled by the PR https://github.com/pytorch/pytorch/pull/127448. However, the test doesn't currently work on the XPU stack (it hangs on the GPU), so this PR skips the test for Intel GPU instead of marking it as an expected failure. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127629 Approved by: https://github.com/EikanWang, https://github.com/peterbell10	2024-06-02 01:00:33 +00:00
Boyuan Feng	2cef2fc2b4	[ts migration] support aten::dim, aten::len, aten::__getitem__ (#127593 ) - Add support for aten::dim, aten::len, aten::__getitem__ for torchscript to export converter. - Add unit tests Co-authored-by: cyy <cyyever@outlook.com> Co-authored-by: Menglu Yu <mengluy@meta.com> Co-authored-by: Animesh Jain <anijain@umich.edu> Co-authored-by: Simon Fan <xmfan@meta.com> Co-authored-by: Zain Rizvi <ZainR@meta.com> Co-authored-by: Tugsbayasgalan (Tugsuu) Manlaibaatar <tmanlaibaatar@meta.com> Co-authored-by: titaiwangms <titaiwang@microsoft.com> Co-authored-by: Yueming Hao <yhao@meta.com> Co-authored-by: IvanKobzarev <ivan.kobzarev@gmail.com> Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com> Co-authored-by: Edward Z. Yang <ezyang@meta.com> Co-authored-by: Bin Bao <binbao@meta.com> Co-authored-by: Feny Patel <fenypatel@meta.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: xinan.lin <xinan.lin@intel.com> Co-authored-by: Zain Huda <zainhuda@meta.com> Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Wei Wang <weiwan@nvidia.com> Co-authored-by: Jason Ansel <jansel@meta.com> Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Co-authored-by: Iris Z <31293777+wz337@users.noreply.github.com> Co-authored-by: Wang, Eikan <eikan.wang@intel.com> Co-authored-by: angelayi <yiangela7@gmail.com> Co-authored-by: Svetlana Karslioglu <svekars@meta.com> Co-authored-by: Yanbo Liang <ybliang8@gmail.com> Co-authored-by: Catherine Lee <csl@fb.com> Co-authored-by: Kwanghoon An <kwanghoon@meta.com> Co-authored-by: Brian Hirsh <hirsheybar@fb.com> Co-authored-by: Robert Mast <rmast@live.nl> Co-authored-by: drisspg <drisspguessous@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127593 Approved by: https://github.com/SherlockNoMad, https://github.com/malfet	2024-06-02 00:36:33 +00:00
Oguz Ulgen	0d9e527c4d	Remove tensor storage_offset/storage_bytes from the cache key (#127319 ) Summary: We observed differences in these fields and inductor does not specialize on them so it is safe to remove them from the key. Test Plan: CI Reviewed By: masnesral Differential Revision: D57871276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127319 Approved by: https://github.com/masnesral	2024-06-02 00:28:43 +00:00
eqy	2e779166eb	[Functorch][cuDNN] Bump tolerances for `test_vmapjvpvjp` (#127355 ) cuDNN can select a winograd kernel for this case which slightly affects tolerances... Pull Request resolved: https://github.com/pytorch/pytorch/pull/127355 Approved by: https://github.com/zou3519, https://github.com/Skylion007	2024-06-01 21:22:55 +00:00
Sam Larsen	6e2e09f6cc	[inductor] fix redis-related env vars in remote_cache.py (#127583 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127583 Approved by: https://github.com/oulgen	2024-06-01 19:55:25 +00:00
Wei Wang	b505e86475	[Inductor][CI][CUDA 12.4] Update dynamic_inductor_timm_training.csv - change gluon_inception_v3 from fail_accuracy to pass (#127672 ) From the HUD, most of the time the "X" is due to "improved_accuracy" for gluon_inception_v3. ![image](https://github.com/pytorch/pytorch/assets/143543872/d4f70377-2756-4921-872d-587426f00302) https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm Pull Request resolved: https://github.com/pytorch/pytorch/pull/127672 Approved by: https://github.com/eqy, https://github.com/Skylion007	2024-06-01 19:12:43 +00:00
PyTorch MergeBot	17dea09b15	Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814 )" This reverts commit bfdec93395f675a0e5a59e95aef9104ac8f5081a. Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))	2024-06-01 18:46:16 +00:00
PyTorch MergeBot	82cd7a7dab	Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 )" This reverts commit fa426b096b3635daab6ce26b44d50f3baab5a4e5. Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))	2024-06-01 18:46:16 +00:00
Lucas Pasqualin	42312a52b3	[DSD] Adds type_check param to copy state dict utils (#127417 ) [DSD] Adds type_check param to copy state dict utils. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127417 Approved by: https://github.com/fegin	2024-06-01 17:50:52 +00:00
Aaron Gokaslan	edffb28d39	[BE][Ez]: Enable B019 - flags memory leaks through LRU cache on method (#127686 ) Flags potential mem leaks through LRUCache and will hopefully make future contributors rethink this pattern which can cause memleaks. noqas the violations we currently have (should be fixed later) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127686 Approved by: https://github.com/c-p-i-o	2024-06-01 17:19:24 +00:00
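The pattern that B019 flags can be illustrated with a minimal sketch. The class names `Leaky` and `Fixed` and the helper `is_collected` are hypothetical and do not come from the PR. The idea is that `functools.lru_cache` on a method stores `self` as part of the cache key, so the class-level cache keeps every instance alive; one possible alternative, assumed here for illustration, is a per-instance cache.

```python
import functools
import gc
import weakref


class Leaky:
    # The cache lives on the class-level function and keys on (self, x),
    # so every instance that calls double() stays referenced by the cache.
    @functools.lru_cache(maxsize=None)  # noqa: B019
    def double(self, x):
        return 2 * x


class Fixed:
    def __init__(self):
        # Per-instance cache: owned by the instance and collected with it.
        self.double = functools.lru_cache(maxsize=None)(self._double)

    def _double(self, x):
        return 2 * x


def is_collected(cls):
    """Create an instance, use its cache, drop it, and report whether it was freed."""
    obj = cls()
    obj.double(21)
    ref = weakref.ref(obj)
    del obj
    gc.collect()  # also breaks the self -> cache -> bound method -> self cycle
    return ref() is None


leaky_collected = is_collected(Leaky)
fixed_collected = is_collected(Fixed)
print(leaky_collected, fixed_collected)
```

The per-instance variant trades a shared cache for one cache per object, which is usually acceptable when instances are short-lived.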
PyTorch MergeBot	22f392ba40	Revert "[easy?] Move AsyncCompile to a different file (#127235 )" This reverts commit f58fc16e8f059232f452a333f32e14ff681e12af. Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))	2024-06-01 17:16:16 +00:00
PyTorch MergeBot	d49dc8f4b8	Revert "Add noqa to prevent lint warnings (#127545 )" This reverts commit f9937afd4f87fbb4844642ae2f587b13b5caa08c. Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))	2024-06-01 17:12:46 +00:00
PyTorch MergeBot	114c752b14	Revert "Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495 )" This reverts commit ee08cf57924a4230edad3101666890d8fe050c75. Reverted https://github.com/pytorch/pytorch/pull/127495 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/127495#issuecomment-2143508218))	2024-06-01 16:39:06 +00:00
Animesh Jain	efcea2d2fd	[dynamo] Support __getitem__ on NNModuleVariable __dict__ (#126956 ) Moves further along (but still fails) for the testcase in https://github.com/pytorch/pytorch/pull/126875 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126956 Approved by: https://github.com/jansel ghstack dependencies: #126923	2024-06-01 15:22:45 +00:00
Jane Xu	4129c3e596	Let us find out why we wrote foreach meta regs (#127623 ) Turns out it was for no reason!...well, after realizing that these ops are all CompositeExplicit, their meta impls come for free. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127623 Approved by: https://github.com/mikaylagawarecki ghstack dependencies: #127412	2024-06-01 13:58:18 +00:00
Jane Xu	ac60bdaf01	Allow slow foreach to run for any backend, not just CPU (#127412 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412 Approved by: https://github.com/albanD	2024-06-01 13:58:18 +00:00
Animesh Jain	4aa7a1efcf	[dynamo] Initial exception handling support (#126923 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126923 Approved by: https://github.com/williamwen42, https://github.com/jansel	2024-06-01 13:00:32 +00:00
Bin Bao	25994a7ed1	[AOTI] Fix a bug when mutated buffer meets .to (#127671 ) Summary: Before this change, the added unit test will trigger: `AssertionError: Can not find the original value for L__self____tensor_constant0_cuda0`. The reason is GraphLowering.constant_name could rename a constant with a device suffix but AOTI requires that new name being registered properly. Differential Revision: [D58047165](https://our.internmc.facebook.com/intern/diff/D58047165) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127671 Approved by: https://github.com/ColinPeppler, https://github.com/22quinn	2024-06-01 12:30:56 +00:00
haozhe.zhu	c3be459f26	[inductor] fix mkldnn linear binary fusion check ut (#127296 ) In this PR: (1) Fix the unary fusion for bf16 conv/linear. Previously we registered the same fusion pattern for `bf16` and `fp16` and did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but in the later replacement we found an unexpected float16, so we did not fuse them. We fix it by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern. ``` def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None): def fn(match): matched = _is_single_computation_op(computation_op, lowp_dtype)(match) # previously we do not check lowp_dtype here ``` This was not exposed before because we only checked the match count, which was correct anyway because the pattern matched. To address this, we add a check on the number of `generated_kernel`. If the op is not fused, there will be an additional kernel to compute the post op. (2) Previously the ut ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary ``` did not check the fusion status; this PR fixes that. (3) Extend `test_conv_binary` to test with lp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-06-01 11:10:29 +00:00
Shan19900305	e62925930f	Clear dest impl extra_meta_ info when shallow_copy_from src impl to dest impl. (#127616 ) tensorA.data = tensorB calls the shallow_copy_from function to copy tensorB's metadata and storage into tensorA. If tensorB's extra_meta_ is nullptr, tensorA's old extra_meta_ is kept, which contaminates the new metadata in tensorA. @ezyang @bdhirsh Pull Request resolved: https://github.com/pytorch/pytorch/pull/127616 Approved by: https://github.com/ezyang	2024-06-01 06:54:32 +00:00
Alex Baden	554265d450	[Inductor]: Use new device-agnostic libdevice import from triton.language (#127348 ) Triton refactored `libdevice` in `5e6952d8c5` While both imports still appear to work under CUDA, this change is required to pull the correct libdevice variants under the Intel XPU backend. I am working on developing a test that catches this behavior. The easiest path would be to enable `test/inductor/test_triton_kernels.py` under the XPU backend, but a different group at Intel manages that test and I need to see if they already have an enabling plan. I am not sure the double `libdevice` import (see line 22 where I have the nolint flag) is really necessary but have yet to find a conclusive test case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127348 Approved by: https://github.com/etaf, https://github.com/peterbell10	2024-06-01 06:15:33 +00:00
Huy Do	7ef7c265d4	Ack codecvt_utf8_utf16 as a deprecated func in C++17 (#127659 ) https://en.cppreference.com/w/cpp/header/codecvt. The build started to fail on macOS after migrating it to macOS 14 with a newer toolchain. For example `57baae9c9b`. As there is no clear alternative to the deprecated function yet, I just ack the warning to fix the build and complete the migration https://github.com/pytorch/pytorch/issues/127490 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127659 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-06-01 04:31:39 +00:00
a-gardner1	3c1cf03fde	Add fake impl for aten.unique_dim (#126561 ) Follow-up to #113118 and #124306. Developed in coordination with the solution to https://github.com/microsoft/onnxscript/pull/1547 This PR adds the missing fake tensor implementation for `aten.unique_dim`, thus enabling tracing and compilation of `torch.unique` when `dim` is not None. Local testing has proceeded with the following simple script (provided that one has checked out the changes in https://github.com/microsoft/onnxscript/pull/1547): ```python import onnx import onnxruntime as ort import logging import numpy as np onnx_program = torch.onnx.dynamo_export( lambda x: torch.unique(x, dim=0, return_inverse=True), torch.arange(10), export_options=torch.onnx.ExportOptions( dynamic_shapes=True, diagnostic_options=torch.onnx.DiagnosticOptions( verbosity_level=logging.DEBUG))) onnx_program.save("torch_unique.onnx") onnx_inputs = onnx_program.adapt_torch_inputs_to_onnx(torch.arange(10)) onnx_outputs = onnx_program(*onnx_inputs) loaded_onnx_program = onnx.load("torch_unique.onnx") onnx.checker.check_model(loaded_onnx_program) ort_session = ort.InferenceSession("torch_unique.onnx") inputs = np.random.randint(0, 10, 10) print(f"Inputs: {inputs}") outputs = ort_session.run(None, { "l_x_": inputs }) print(f"Outputs: {outputs}") print("Success") ``` Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126561 Approved by: https://github.com/ezyang	2024-06-01 04:03:10 +00:00
Wang, Eikan	25447ba241	Always Link libtorch and libtorch_cpu to ensure the functionality for AOT mode (#127381 ) Fix #126763: The root cause is that the produced library does not link any torch library because the vec ISA is invalid, and then it cannot run into another path without linking `libtorch` and `libtorch_cpu`. https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L1637-L1642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127381 Approved by: https://github.com/desertfire	2024-06-01 01:47:41 +00:00
Masaki Kozuki	df53cc7114	[reland] "[reland] `_foreach_copy` with different src/dst dtypes" (#127186 ) Fixes #115171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127186 Approved by: https://github.com/ezyang	2024-06-01 01:25:10 +00:00
Huamin Li	ff8042bcfb	Enable AOTI shim v2 build and add into libtorch (#125211 ) Summary: Follow up of https://github.com/pytorch/pytorch/pull/125087 This diff will create shim v2 header and cpp file and corresponding build Differential Revision: D56617546 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125211 Approved by: https://github.com/desertfire	2024-05-31 23:56:11 +00:00
Zain Rizvi	a8c9b26534	[BE] Fix dependabot security errors (#127567 ) Fixes https://github.com/pytorch/pytorch/security/dependabot/36 and https://github.com/pytorch/pytorch/security/dependabot/37 by deleting spurious dependency Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127567 Approved by: https://github.com/malfet	2024-05-31 23:00:07 +00:00
Yanbo Liang	f7171313ab	[Inductor] FlexAttention backward kernel optimization (#127208 ) BWD Speedups (before this PR): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|-------------------\|---------------\|----------------\| \| Average \| 0.211 \| \| \| \| \| Max \| 0.364 \| (16, 16, 512, 64) \| relative_bias \| torch.bfloat16 \| \| Min \| 0.044 \| (2, 16, 4096, 64) \| causal_mask \| torch.bfloat16 \| ``` BWD Speedups (after this PR, though not optimizing block size yet): ``` \| Type \| Speedup \| shape \| score_mod \| dtype \| \|---------\|-----------\|--------------------\|---------------\|----------------\| \| Average \| 0.484 \| \| \| \| \| Max \| 0.626 \| (2, 16, 512, 256) \| head_bias \| torch.bfloat16 \| \| Min \| 0.355 \| (8, 16, 4096, 128) \| relative_bias \| torch.bfloat16 \| ``` There are a few things need to do as follow-ups: * Optimized default block size on A100/H100. * Support different seqlen for Q and K/V. * Support dynamic shapes for backward. * Enhance unit tests to check there is no ```nan``` value in any grad. I think we should make some changes to ```test_padded_dense_causal``` because it has invalid inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208 Approved by: https://github.com/Chillee	2024-05-31 22:56:10 +00:00
Huy Do	57baae9c9b	Migrating CI/CD jobs to macOS 14 (#127582 ) We have half the fleet on macOS 14 already and it has been running fine so far https://github.com/pytorch/pytorch/issues/127490. So, I'm preparing the final push to replace the rest of them. This also switches the release build from 13 to 14 (GitHub runners) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127582 Approved by: https://github.com/atalman	2024-05-31 22:30:59 +00:00
Zain Rizvi	02248b73eb	[EZ] Port over all test-infra scale configs to lf runners (#127645 ) Follow up to https://github.com/pytorch/pytorch/pull/127578 Since GPU builds seem to be working correctly, porting over all remaining scale configs from [the org-wide scale config file](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml) The naming convention here is all temporary. We'll figure out something better before completing the migration Pull Request resolved: https://github.com/pytorch/pytorch/pull/127645 Approved by: https://github.com/malfet	2024-05-31 22:24:41 +00:00
Lucas Pasqualin	bb1468d506	Updates state dict in state dict loader (#127617 ) Fixes #125096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127617 Approved by: https://github.com/Skylion007, https://github.com/fegin	2024-05-31 21:59:10 +00:00
David Berard	f33beb767d	[NestedTensor] Use maybe_mark_dynamic instead of mark_dynamic (#127453 ) Fixes #127097 TL;DR: dimensions marked with mark_dynamic can result in assertion failures if the marked-dynamic dimensions get specialized. In NJT, we don't care _that_ much that a dimension is marked as dynamic. So instead, mark with `maybe_mark_dynamic` which suggests that a dimension should be dynamic, but doesn't fail if the dimension gets specialized. Background: NJT marks the values tensor as dynamic: `49ad90349d/torch/nested/_internal/nested_tensor.py (L122)` It does this for two reasons: 1. Conceptual: We know that this dimension _should_ be dynamic; it's a nested tensor, so the sequence lengths will _probably_ vary between batches in the common case. Therefore, we should compile it as dynamic to prevent needing a recompile to trigger automatic dynamic shapes. 2. Implementation detail: Right now we run into issues with torch.compile / tensor_unflatten / other details when the dimensions are not marked as dynamic. We have some attempts to remove this (e.g. https://github.com/pytorch/pytorch/pull/126563) but while testing this I wasn't able to get all tests to pass, so there could be potential regressions here if we removed the mark_dynamic. Justification for this change 1. Conceptual: AFAIK, we don't care enough about the dynamism of this dimension to error out if we specialize. We'd prefer that we don't have to recompile to get automatic dynamic shapes, but it's also better to not have this issue (and not to force the user to go hunt down all the other equivalent shapes to mark them as dynamic as well). This solution allows us to suggest the dynamism but not force it. 2. Implementation detail: This still marks the dimension as symbolic at the beginning of dynamo tracing, so we will (probably) avoid a lot of the issues we run into when we completely remove the `mark_dynamic` decorators. 
Differential Revision: [D57933779](https://our.internmc.facebook.com/intern/diff/D57933779) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127453 Approved by: https://github.com/soulitzer, https://github.com/YuqingJ	2024-05-31 21:32:12 +00:00
albanD	6bfc6e0875	Add back private function torch.cuda.amp.autocast_mode._cast (#127433 ) This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433 Approved by: https://github.com/zou3519, https://github.com/guangyey	2024-05-31 20:48:15 +00:00
drisspg	923edef31c	FP8 rowwise scaling (#125204 ) # Summary This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met: - `x`'s scale should be a 1-dimensional tensor of length `M`. - `y`'s scale should be a 1-dimensional tensor of length `N`. It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row". The following two PRs were required to enable local builds: - [PR #126185](https://github.com/pytorch/pytorch/pull/126185) - [PR #125523](https://github.com/pytorch/pytorch/pull/125523) ### Todo We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`? The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954 #### ifdef I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \ defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do this. Kernel Credit: @jwfromm Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204 Approved by: https://github.com/lw	2024-05-31 20:09:08 +00:00
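The shape conditions above can be sketched in plain Python. This is not the fp8 kernel: it uses ordinary floats, the function name `rowwise_scaled_mm` is hypothetical, and it assumes the scales act as multiplicative dequantization factors, so that `out[i][j] = scale_x[i] * scale_y[j] * sum_k x[i][k] * y[k][j]`.

```python
def rowwise_scaled_mm(x, y, scale_x, scale_y):
    """Reference matmul with one scale per row of x and one per column of y.

    x has shape [M, K], y has shape [K, N]; scale_x has length M and
    scale_y has length N, mirroring the conditions in the PR description.
    """
    m, k, n = len(x), len(y), len(y[0])
    assert len(scale_x) == m and len(scale_y) == n
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = sum(x[i][p] * y[p][j] for p in range(k))
            # Assumption: scales are applied multiplicatively to the accumulator.
            row.append(scale_x[i] * scale_y[j] * acc)
        out.append(row)
    return out


x = [[1, 2], [3, 4]]          # M=2, K=2
y = [[1, 0, 2], [0, 1, 3]]    # K=2, N=3
out = rowwise_scaled_mm(x, y, scale_x=[0.5, 2.0], scale_y=[1.0, 2.0, 0.5])
print(out)
```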
Edward Z. Yang	806e6257f3	Unconditionally assign symbolic shapes as locals (#127486 ) Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8493858177307906 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127486 Approved by: https://github.com/albanD	2024-05-31 20:01:44 +00:00
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
Robert Mast	ea13e9a097	correct BLAS input (#126200 ) Fixes #32407 With this little correction to Dependencies.cmake it is possible to build an MKL-free version of Pytorch up from version v2.0.0 by explicitly choosing another MKL-free BLAS. This pullrequest fulfills the "if not already present" part of the original comment in Dependencies.cmake: "setting default preferred BLAS options if not already present." It's tested with this Action-.yml: ``` name: Build PyTorch v2.0.0 without AVX on: push: branches: - v2.0.0 pull_request: branches: - v2.0.0 jobs: build: runs-on: ubuntu-20.04 defaults: run: shell: bash -el {0} steps: - name: Checkout repository uses: actions/checkout@v4 with: #repository: 'pytorch/pytorch' #ref: 'v2.3.0' submodules: 'recursive' - uses: conda-incubator/setup-miniconda@v3 with: auto-activate-base: true activate-environment: true python-version: 3.10.13 - name: Install Dependencies - Common - Linux 2 run: \| conda info conda list conda install nomkl conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses export PYTORCH_CPU_CAPABILITY=cpu export ATEN_CPU_CAPABILITY_DEFAULT=cpu export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"} export ATEN_CPU_CAPABILITY=default export USE_NNPACK=0 export MAX_JOBS=4 export USE_CUDA=0 export USE_ROCM=0 export BLAS=OpenBLAS export CMAKE_ARGS="-D CMAKE_BUILD_TYPE=Release -D USE_AVX=OFF -D USE_NNPACK=OFF -D C_HAS_AVX_2=OFF -D C_HAS_AVX2_2=OFF -D CXX_HAS_AVX_2=OFF -D CXX_HAS_AVX2_2=OFF -D CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS=OFF -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))") -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DPYTHON_EXECUTABLE:FILEPATH=`which python`" pip install build wheel typing_extensions python setup.py bdist_wheel - name: Archive production artifacts uses: actions/upload-artifact@v4 with: name: dist-without-markdown path: \| 
dist !dist/*/.md ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126200 Approved by: https://github.com/jgong5, https://github.com/kit1980	2024-05-31 19:38:42 +00:00
PyTorch MergeBot	bbf892dd58	Revert "Add back private function torch.cuda.amp.autocast_mode._cast (#127433 )" This reverts commit 6e0eeecc7cd4dc389683e35d1f2e34738e09e597. Reverted https://github.com/pytorch/pytorch/pull/127433 on behalf of https://github.com/fbgheith due to depends on https://github.com/pytorch/pytorch/pull/126898 which is failing internally and needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/127433#issuecomment-2142869610))	2024-05-31 19:35:15 +00:00
Bin Bao	1103444870	[AOTI] Add back include_pytorch for specifying link paths (#126802 ) Summary: Running the dashboard in cpp wrapper mode sometimes hits errors like "undefined symbol: aoti_torch_empty_stride", although this cannot be reproduced locally and seems to happen only on the dashboard CI. Differential Revision: [D57911442](https://our.internmc.facebook.com/intern/diff/D57911442) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126802 Approved by: https://github.com/chenyang78 ghstack dependencies: #126916, #127037	2024-05-31 19:32:52 +00:00
Brian Hirsh	8af1c655e5	improve eager overhead of _disable_dynamo (#127325 ) It seems like `_disable_dynamo` actually has a fair amount of overhead (especially when it was added to `DTensor.__new__`): this change speeds up @wanchaol 's repro from 0.380 -> 0.312s: P1378202570 (that repro runs a vanilla MLP using 2D parallelism, and calls the DTensor constructor 1280 times). It looks like most of the slowdown comes from repeatedly running `import torch._dynamo` and constructing an instance of `torch._dynamo.disable(fn, recursive)` on every call to the constructor - this PR caches it on the first invocation. ~~Update: I realized I cannot use `torch.compiler.is_compiling` to know when to fast-path, because when we hit a graph break, cpython will be running so it will return False.~~ ~~As a test / potential fix, I added a new config, `torch._dynamo.config._is_compiling` that is set to True always inside a compiled region (even on frames that are run by cpython). This definitely seems to do what I want in terms of knowing when to fastpath and avoid overhead - although interested in feedback on how reasonable this is~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127325 Approved by: https://github.com/wanchaol, https://github.com/anijain2305	2024-05-31 19:30:47 +00:00
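The caching idea described above can be sketched without PyTorch. This is not the actual `_disable_dynamo` code: `make_expensive_wrapper` is a hypothetical stand-in for the import-and-wrap step, and `lazy_cached` is an illustrative name. The sketch only shows building the wrapper on the first call and reusing it afterwards.

```python
import functools

build_count = 0  # counts how often the expensive wrapper is constructed


def make_expensive_wrapper(fn):
    """Hypothetical stand-in for an expensive step (e.g. a lazy import plus wrapping)."""
    global build_count
    build_count += 1

    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapped


def lazy_cached(fn):
    cached = None  # filled in on the first call only

    @functools.wraps(fn)
    def inner(*args, **kwargs):
        nonlocal cached
        if cached is None:
            cached = make_expensive_wrapper(fn)
        return cached(*args, **kwargs)

    return inner


@lazy_cached
def construct(x):
    return x + 1


# 1280 calls, matching the number of constructor calls mentioned in the repro.
results = [construct(i) for i in range(1280)]
print(build_count)
```

Deferring construction to the first call keeps the import lazy, which is why the wrapper is not simply built at decoration time in this sketch.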
Kwanghoon An	b704c7cf0f	Re trying Support min/max carry over for eager mode from_float method (#127576 ) Summary: Original commit changeset: 2605900516c8 Original Phabricator Diff: D57977896 Test Plan: Re enabling due to prod failure Reviewed By: jerryzh168 Differential Revision: D57978925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127576 Approved by: https://github.com/jerryzh168	2024-05-31 19:08:07 +00:00
Catherine Lee	121c55d8d1	Old branch deletion script to also delete old ciflow tags (#127625 ) Change branch deletion script to also delete left over ciflow tags that the bot doesn't get to, as well as the one created by triggering a workflow on HUD Example run https://github.com/pytorch/pytorch/actions/runs/9322082915/job/25662376463?pr=127625 (didn't actually delete the tag, but lists what tags it would delete) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127625 Approved by: https://github.com/huydhn	2024-05-31 18:54:54 +00:00
Yanbo Liang	0be06b08fc	[GPT-fast benchmark] Merge GPT-fast and micro benchmark output as one CSV file (#127586 ) Consolidate GPT-fast models benchmark with micro-benchmark, and save output as one CSV file with the same format as https://github.com/pytorch/pytorch/pull/126754#issue-2307296847. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127586 Approved by: https://github.com/Chillee	2024-05-31 18:50:49 +00:00
Svetlana Karslioglu	4a0d96e496	Add a GH action to autolabel docathon PRs (#127569 ) To ease oncall burden for the docathon PR reviewers and ensure all PRs are correctly labeled, adding this GH action that will look for the issue number in the PR and if that issue has a docathon-h1-2024 label, then it would propagate the labels from the issues into the PR. It should not conflict with the existing labelers because we use ``pull_request.add_to_labels`` - credit @kit1980. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127569 Approved by: https://github.com/kit1980	2024-05-31 17:57:07 +00:00
angelayi	b2f5fd8efb	[ts_converter] Basic support for prim::If conversion (#127336 ) Script module: ``` graph(%self : __torch__.M, %x.1 : Tensor, %y.1 : Tensor): %11 : int = prim::Constant[value=1]() %5 : bool = aten::Bool(%x.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:19 %21 : Tensor = prim::If(%5) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:16 block0(): %8 : Tensor = aten::mul(%y.1, %y.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:28:27 -> (%8) block1(): %12 : Tensor = aten::add(%y.1, %y.1, %11) # /data/users/angelayi/pytorch2/test/export/test_converter.py:30:27 -> (%12) return (%21) ``` ExportedProgram: ``` ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, x_1: "b8[]", y_1: "i64[]"): # File: <eval_with_key>.23:9 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_0, cond_false_0, [l_args_3_0_]); l_args_0_ = cond_true_0 = cond_false_0 = l_args_3_0_ = None true_graph_0 = self.true_graph_0 false_graph_0 = self.false_graph_0 conditional = torch.ops.higher_order.cond(x_1, true_graph_0, false_graph_0, [y_1]); x_1 = true_graph_0 = false_graph_0 = y_1 = None return (conditional,) class <lambda>(torch.nn.Module): def forward(self, y_1: "i64[]"): # File: <eval_with_key>.20:6 in forward, code: mul_tensor = torch.ops.aten.mul.Tensor(l_args_3_0__1, l_args_3_0__1); l_args_3_0__1 = None mul: "i64[]" = torch.ops.aten.mul.Tensor(y_1, y_1); y_1 = None return mul class <lambda>(torch.nn.Module): def forward(self, y_1: "i64[]"): # File: <eval_with_key>.21:6 in forward, code: add_tensor = torch.ops.aten.add.Tensor(l_args_3_0__1, l_args_3_0__1, alpha = 1); l_args_3_0__1 = None add: "i64[]" = torch.ops.aten.add.Tensor(y_1, y_1); y_1 = None return add ``` This PR also adds support for TupleIndex and incorporates some changes from https://github.com/pytorch/pytorch/pull/127341 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127336 Approved by: 
https://github.com/BoyuanFeng	2024-05-31 17:46:16 +00:00
cyy	3e66052e16	Improve python3 discovery code in CMake (#127600 ) The improvement is based on my comments in #124613 and it also fixes the current linux-s390x-binary-manywheel CI failures. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127600 Approved by: https://github.com/Skylion007	2024-05-31 17:29:06 +00:00
Wang, Eikan	8d7393cb5e	Update triton-xpu commit pin merge rules for XPU (#127203 ) Add the ".ci/docker/ci_commit_pins/triton-xpu.txt" to the XPU merge rules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127203 Approved by: https://github.com/atalman	2024-05-31 17:19:19 +00:00
Iris Z	1699edaabb	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-05-31 17:06:36 +00:00
Aaron Gokaslan	8bf2c0a203	[BE][Ez]: Update ruff to 0.4.6 (#127614 ) Update ruff linter to 0.4.6. Uneventful PR that fixes bugs and reduces false positives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127614 Approved by: https://github.com/albanD	2024-05-31 17:01:50 +00:00
PyTorch MergeBot	58b461d57a	Revert "[ROCm] Update triton pin to fix libtanh issue (#125396 )" This reverts commit 19333d1eb9b8965edd6c8a52fd59b5c67b4fb523. Reverted https://github.com/pytorch/pytorch/pull/125396 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/125396#issuecomment-2142638237))	2024-05-31 16:51:39 +00:00
Jason Ansel	225ec08e35	Fix typo in .ci/docker/ubuntu-cuda/Dockerfile (#127503 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127503 Approved by: https://github.com/nWEIdia, https://github.com/Skylion007	2024-05-31 16:50:35 +00:00
Wei Wang	67f0807042	[Inductor] [CI] [CUDA] Skip the failed models and tests the better way (#127150 ) Address subtasks in https://github.com/pytorch/pytorch/issues/126692 After enabling the disabled shards, the following two models regressed (for cu124 configuration): dynamic_inductor_timm_training.csv cspdarknet53,pass,7 (expected) \| cspdarknet53,fail_accuracy,7 (actual) eca_botnext26ts_256,pass,7 (expected) \| eca_botnext26ts_256,fail_accuracy,7 (actual) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127150 Approved by: https://github.com/huydhn, https://github.com/eqy, https://github.com/atalman	2024-05-31 16:35:57 +00:00
Chien-Chin Huang	64c581a1d4	[DSD] Make distributed state_dict support torch.distributed is not initialized case (#127385 ) Fixes https://github.com/pytorch/pytorch/issues/124942 Summary: Allow DSD to support loading the regular optimizer state_dict and can be used when torch.distributed.is_initialized() is False. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127385 Approved by: https://github.com/wz337 ghstack dependencies: #127070, #127071, #127384	2024-05-31 16:28:16 +00:00
Chien-Chin Huang	8b4ad3a8d9	[DSD] Unify the API signatures of set_model_state_dict and set_optimizer_state_dict (#127384 ) Summary: Allow the optim_state_dict argument to be a positional argument. This makes sense because it is a required argument, and it makes the function signature consistent with set_model_state_dict without causing BC issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127384 Approved by: https://github.com/wz337 ghstack dependencies: #127070, #127071	2024-05-31 16:24:51 +00:00
Chien-Chin Huang	bd868eeb28	[DSD] Support flattening the optimizer state_dict when saving and unflattening when loading (#127071 ) Fixes https://github.com/pytorch/pytorch/issues/126595 What does this PR do? This PR flattens the optimizer state_dict when saving and unflattens it when loading, similar to what TorchRec does. The current `get_optimizer_state_dict()` converts the parameter IDs to FQNs in order to avoid any conflict with different optimizers on different ranks. The currently returned optimizer state_dict looks like this: ``` { "state": { "layer1.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor}, "layer2.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor}, }, "param_group": [ {"lr": 0.0, "betas": (0.9, 0.95), ..., "params": ["layer1.weight", "layer2.weight"]} ] } ``` While this avoids the conflict and supports the use case of merging multiple optimizers (e.g., optimizer in backward), the current optimizer state_dict still cannot support MPMD (e.g., pipeline parallelism). The root cause is `param_group`. `param_group` cannot generate unique keys during saving -- DCP flattens the dict, but for `param_group` it produces keys like `param_group.lr` or `param_group.params`. These keys will conflict when using pipeline parallelism. This PR flattens the optimizer state_dict into the following form: ``` { "state.layer1.weight.step": 10, "state.layer2.weight.step": 10, "state.layer1.weight.exp_avg": SomeTensor, "state.layer2.weight.exp_avg": SomeTensor, "state.layer1.weight.exp_avg_sq": SomeTensor, "state.layer2.weight.exp_avg_sq": SomeTensor, "param_group.layer1.weight.lr" : 0.1, "param_group.layer2.weight.lr" : 0.1, "param_group.layer1.weight.betas" : (0.9, 0.95), "param_group.layer2.weight.betas" : (0.9, 0.95), } ``` This allows distributed state_dict (DSD) to support MPMD (e.g., pipeline parallelism). Pros and Cons Pros 1. 
Can support optimizer resharding (e.g., changing the parallelisms from 3D to 2D or changing the number of workers). 2. Users don't need to manually add a prefix to different optimizers. 3. Allows users to merge the optimizer states easily. One use case is loop-based pipeline parallelism. Cons 1. The implementation makes a strong assumption about the structure of `param_groups` and its values. If the assumption changes or some customized optimizers do not meet it, the implementation will break. 2. Extra values are saved in the checkpoints. The assumption here is that `param_group` generally contains scalars, which are cheap to save. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127071 Approved by: https://github.com/wconstab, https://github.com/wz337 ghstack dependencies: #127070	2024-05-31 16:20:36 +00:00
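The flattening described in the commit above can be illustrated with a small pure-Python sketch. This is not the code from the PR: the function name `flatten_optim_state_dict` is hypothetical, placeholder strings stand in for tensors, and the `"param_group"` key follows the spelling used in the commit message (a real `torch.optim` state_dict uses `"param_groups"`).

```python
from typing import Any, Dict


def flatten_optim_state_dict(osd: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: flatten an FQN-keyed optimizer state_dict.

    Every per-parameter state entry becomes "state.<fqn>.<name>", and every
    hyperparameter of a group is copied once per parameter in that group as
    "param_group.<fqn>.<name>", so all keys are unique per parameter.
    """
    flat: Dict[str, Any] = {}
    for fqn, state in osd["state"].items():
        for name, value in state.items():
            flat[f"state.{fqn}.{name}"] = value
    for group in osd["param_group"]:
        for fqn in group["params"]:
            for name, value in group.items():
                if name != "params":  # the FQN list itself is not stored
                    flat[f"param_group.{fqn}.{name}"] = value
    return flat


# Placeholder strings stand in for tensors in this illustration.
example = {
    "state": {
        "layer1.weight": {"step": 10, "exp_avg": "T1", "exp_avg_sq": "T2"},
        "layer2.weight": {"step": 10, "exp_avg": "T3", "exp_avg_sq": "T4"},
    },
    "param_group": [
        {"lr": 0.1, "betas": (0.9, 0.95),
         "params": ["layer1.weight", "layer2.weight"]},
    ],
}
flat = flatten_optim_state_dict(example)
```

Because each hyperparameter is duplicated per parameter, the keys no longer collide across pipeline stages, at the cost of the extra scalar values noted under Cons.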
Chien-Chin Huang	6b1b8d0193	[DSD] Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict (#127070 ) Summary: This is a very complicated signature that is hard for users to reason about. Remove the support of this feature. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127070 Approved by: https://github.com/wz337	2024-05-31 16:16:05 +00:00
Zain Huda	a010fa9e24	[DCP] Fix variable spelling (#127565 ) Summary: tsia Test Plan: sandcastle Differential Revision: D57983752 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127565 Approved by: https://github.com/wz337, https://github.com/fegin	2024-05-31 15:32:08 +00:00
xinan.lin	75e7588f47	[Inductor UT] Fix expected failure but pass for test case on Intel GPU. (#127595 ) The XPU expected-failure test case `TritonCodeGenTests.test_codegen_config_option_dont_assume_alignment` should have been marked as expected to pass after PR #126261 merged, but because the test was flaky, the case was skipped when the PR landed. The "expected failure but passed" error then surfaced in the periodic test: https://github.com/pytorch/pytorch/actions/runs/9302864965/job/25605549183#step:14:2082. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127595 Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/peterbell10, https://github.com/atalman	2024-05-31 15:32:00 +00:00
Mikayla Gawarecki	4644def434	Update docstring for weights_only (#127575 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127575 Approved by: https://github.com/malfet	2024-05-31 14:27:31 +00:00
Feny Patel	cddb8dbebe	add workloadd events to pytorch (#127415 ) Summary: add workloadd events to pytorch Test Plan: CIs Differential Revision: D57914472 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127415 Approved by: https://github.com/sraikund16	2024-05-31 14:25:44 +00:00
Bin Bao	10a92b5f84	[AOTI] Fix a bool value codegen issue when calling custom ops (#127398 ) Summary: fixes https://github.com/pytorch/pytorch/issues/127392 Differential Revision: [D57911527](https://our.internmc.facebook.com/intern/diff/D57911527) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127398 Approved by: https://github.com/angelayi, https://github.com/chenyang78 ghstack dependencies: #126916, #127037	2024-05-31 14:01:36 +00:00
Bin Bao	17c5b6508b	[AOTI] Support _CollectiveKernel in the cpp wrapper mode (#127037 ) Summary: _CollectiveKernel appears in TorchBench moco training. It's a special Fallback op that requires extra care. Differential Revision: [D57911441](https://our.internmc.facebook.com/intern/diff/D57911441) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127037 Approved by: https://github.com/malfet ghstack dependencies: #126916	2024-05-31 13:58:50 +00:00
Bin Bao	413b81789f	[AOTI][refactor] Unify val_to_arg_str and val_to_cpp_arg_str (#126916 ) Summary: Now fallback argument type information has been passed, so time to unify val_to_arg_str and val_to_cpp_arg_str Differential Revision: [D57907751](https://our.internmc.facebook.com/intern/diff/D57907751) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126916 Approved by: https://github.com/chenyang78	2024-05-31 13:56:11 +00:00
Edward Z. Yang	aaef7b29e9	Only register _inductor_test ops if not running with deploy (#127557 ) Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8498194410207616 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127557 Approved by: https://github.com/zou3519	2024-05-31 13:34:23 +00:00
PyTorch MergeBot	029b3ec775	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit dae33a4961addb5847dbb362e7bb907bbfc64929. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/PaliC due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/126068#issuecomment-2141992307))	2024-05-31 12:33:25 +00:00
cyy	a6bae1f6db	Remove more caffe2 files (#127511 ) Remove more caffe2 files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127511 Approved by: https://github.com/r-barnes	2024-05-31 11:26:27 +00:00
IvanKobzarev	df0c69f32d	[inductor] Add fallback for collectives size estimation for unbacked (#127562 ) Differential Revision: [D57982928](https://our.internmc.facebook.com/intern/diff/D57982928) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127562 Approved by: https://github.com/yifuwang	2024-05-31 11:15:46 +00:00
Animesh Jain	f4d7cdc5e6	[dynamo] Add current instruction to BlockStackEntry (#127482 ) Will be used by exception handling in later PRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127482 Approved by: https://github.com/jansel	2024-05-31 08:58:53 +00:00
Yueming Hao	2a03bf5a14	[inductor] fix grid z bug for large grid (#127448 ) Fixes #123210 `2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1733-L1753)` If a kernel's y_grid is larger than 65535, it will be split into multiple z grids. The above grad_fn does this split before the kernel launch; however, the computations for yoffset and the y_grid are incorrect. For example, if we have xy numel of `(1*XBLOCK, 65537*YBLOCK)`, this function will return an [xyz]_grid of (1, 32768, 2). XBLOCK and YBLOCK here are used for the following `get_grid_dim`. Let's use their default values (4, 1024). `2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1734)` [xyz]_grid = (1, 32768, 2) means the workload is divided into two z grids. Because the triton kernel generation still follows the xy dimensions, an example generated kernel is shown below. ```python @triton.jit def triton_(in_ptr0, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr): ynumel = 65537*1024 xnumel = 1*4 yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK yindex = yoffset + tl.arange(0, YBLOCK)[None, :] ymask = yindex < ynumel xoffset = tl.program_id(0) * XBLOCK xindex = xoffset + tl.arange(0, XBLOCK)[:, None] xmask = xindex < xnumel x2 = xindex y0 = yindex % 128 y1 = (yindex // 128) y3 = yindex tmp0 = tl.load(in_ptr0 + (y0 + (128*x2) + (512*y1)), xmask, eviction_policy='evict_last') tl.store(out_ptr0 + (x2 + (4*y3)), tmp0, xmask) ``` For a triton block with xyz index (0, 0, 1), its yoffset and xoffset are both 0 based on the computation `yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK` and `xoffset = tl.program_id(0) * XBLOCK`. So, this triton block will access the very first elements of the input. However, the correct yoffset should be `(y_index + z_index * y_grid ) * YBLOCK`, which is the starting position of the 2nd z grid. 
At the same time, because we used `y_grid = y_grid // div` to compute the maximum number of elements in the y dimension, the y_grid is 32768. The total number of y grids is 32768*2 = 65536, which is less than the actual 65537. So, we should use `y_grid = ceildiv(y_grid, div)` to compute the y grid so that the remaining grids are not dropped. #123210 is not about AOTInductor; the root cause is the triton kernel generated by torchinductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127448 Approved by: https://github.com/eellison	2024-05-31 08:01:34 +00:00
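The arithmetic in the commit above can be checked with a short pure-Python sketch. It is not the inductor code: `split_y_grid` and `y_block_index` are hypothetical names, and the way `div` is derived from the 65535 limit is an assumption; only `ceildiv(y_grid, div)` and the `(y_index + z_index * y_grid)` offset come from the commit message.

```python
MAX_GRID_DIM = 65535  # per-dimension limit quoted in the commit message


def ceildiv(a: int, b: int) -> int:
    """Integer division rounding up."""
    return -(-a // b)


def split_y_grid(y_blocks: int) -> tuple:
    """Hypothetical helper: split y blocks into (y_grid, z_grid).

    Rounding up guarantees y_grid * z_grid >= y_blocks, so no block is lost.
    """
    if y_blocks <= MAX_GRID_DIM:
        return y_blocks, 1
    div = ceildiv(y_blocks, MAX_GRID_DIM)  # assumed choice of divisor
    return ceildiv(y_blocks, div), div


def y_block_index(pid_y: int, pid_z: int, y_grid: int) -> int:
    """Linear y block index; multiply by YBLOCK to get yoffset."""
    return pid_y + pid_z * y_grid


y_grid, z_grid = split_y_grid(65537)
floor_coverage = (65537 // z_grid) * z_grid   # coverage with floor division
ceil_coverage = y_grid * z_grid               # coverage with ceildiv
first_block_of_second_z = y_block_index(0, 1, y_grid)
```

With floor division the two z grids cover only 65536 of the 65537 blocks, while the rounded-up grid covers all of them; programs beyond the real extent are expected to be masked off by the `ymask = yindex < ynumel` check in the generated kernel.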
titaiwangms	4935a019e4	[ONNX] Update decomposition table to core ATen ops (#127353 ) Fixes #125894 Prior to this PR, some core ATen ops were missing from the decomposition table because we thought they might be decomposed into prim ops, as they are under _refs. This PR adds them back according to `f6ef832e87/torch/_decomp/__init__.py (L253)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127353 Approved by: https://github.com/justinchuby	2024-05-31 06:35:47 +00:00
cyy	0c5faee372	Replace python::python with Python::Module (#127485 ) Use found Python::Module target Pull Request resolved: https://github.com/pytorch/pytorch/pull/127485 Approved by: https://github.com/ezyang	2024-05-31 05:57:05 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	b5e85b8ecc	Add deferred_runtime_assertion pass after run_decompositions (#127305 ) Summary: We also want to reinsert the deferred_runtime passes after run_decompositions. Test Plan: CI Reviewed By: zhxchen17 Differential Revision: D57802237 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127305 Approved by: https://github.com/BoyuanFeng	2024-05-31 05:45:28 +00:00
Zain Rizvi	ae47152ca8	Expand supported labels to most self-hosted linux pull.yml workflows (#127578 ) Initial set of runners added in https://github.com/pytorch/pytorch/pull/127566 seem to be working. Expanding to include more machine types, especially GPU machines Pull Request resolved: https://github.com/pytorch/pytorch/pull/127578 Approved by: https://github.com/huydhn	2024-05-31 05:40:16 +00:00
Simon Fan	ec098b88b6	[compiled autograd] torch.compile API (#125880 ) - enter existing compiled autograd ctx manager before entering torch.compile frames Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880 Approved by: https://github.com/jansel	2024-05-31 04:38:20 +00:00
cyy	ee08cf5792	Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495 ) Unnecessary TORCH_CHECK(false) are changed to macro coverage as mentioned in #127371 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127495 Approved by: https://github.com/ezyang	2024-05-31 04:27:20 +00:00
Animesh Jain	159632aecd	[dynamo] Support hasattr on BuiltinVariable (#127372 ) Fixes https://github.com/pytorch/pytorch/issues/127172 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127372 Approved by: https://github.com/williamwen42, https://github.com/yanboliang ghstack dependencies: #127377	2024-05-31 04:23:56 +00:00
Animesh Jain	bb6bfd9ad8	[dynamo][compile-time] Cache the child guard managers (#127377 ) Reduces compile time of the MobileBertForMaskedLM model from 39 seconds to 26 seconds. This was a regression introduced by #125202. Before that PR, compile time was 24 seconds. The extra two seconds are just because we are going through an enormous number of guards. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127377 Approved by: https://github.com/jansel	2024-05-31 04:23:56 +00:00
Menglu Yu	f264745ff1	[interformer] batch pointwise op + unbind stack pass in post grad (#126959 ) Summary: Tested on H100 with single GPU, and the bs is set to 64. Test Plan: # local script ``` buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64 ``` baseline: P1370993922 \| Metric \| Value \| \|:-------------------\|:-------------\| \| Latency \| 120.84 ms \| \| Model size \| 5.93 G bytes \| \| Flops/example \| 62.22 GB \| \| TFLOPS \| 32.95 \| \| MFU \| 4.12% \| \| Activation/example \| 128.17 MB \| proposal: P1371676068 config ``` torch._inductor.config.pre_grad_fusion_options = {} torch._inductor.config.post_grad_fusion_options = { "batch_aten_mul": {"min_fuse_set_size": 50}, "batch_aten_sigmoid": {"min_fuse_set_size": 50}, "batch_aten_relu": {"min_fuse_set_size": 50}, "batch_linear_post_grad": {"min_fuse_set_size": 50}, "unbind_stack_aten_pass": {}, } ``` \| Metric \| Value \| \|:-------------------\|:-------------\| \| Latency \| 117.30 ms \| \| Model size \| 5.93 G bytes \| \| Flops/example \| 62.65 GB \| \| TFLOPS \| 34.18 \| \| MFU \| 4.27% \| \| Activation/example \| 163.12 MB \| Differential Revision: D57595173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126959 Approved by: https://github.com/jackiexu1992	2024-05-31 03:54:43 +00:00
cyy	8629f9b3f2	Remove more unused variables in tests (#127510 ) Follows #127379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127510 Approved by: https://github.com/Skylion007, https://github.com/r-barnes	2024-05-31 03:39:45 +00:00
Edward Z. Yang	0aaac68c57	Add structured logging for tensor fakeification (#126879 ) This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs when they are triggered from Dynamo. The logs look like this: ``` V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0} V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0} V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0} ``` The `describer_id` is used to disambiguate ids. We expect it to be unique per frame id, but if there is a bug it possibly is not. Note you will get redundant dumps when evaluation restarts. tlparse can use this to give a visualization of input tensors to a model, you could also use this to generate example inputs to run graphs on. Some care is taken to avoid redumping the tensor metadata multiple times, which would happen ordinarily because AOTAutograd refakifies everything after Dynamo, to deal with metadata mutation. Partially fixes https://github.com/pytorch/pytorch/issues/126644 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126879 Approved by: https://github.com/jamesjwu	2024-05-31 01:58:44 +00:00
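As a rough illustration of how the structured log lines quoted in the commit above could be consumed, the sketch below splits one of the sample lines into its logging prefix and JSON payload. The helper name `parse_structured_line` is hypothetical, and the assumption that the payload always starts after the first `"] "` is based only on the sample shown; this is not how tlparse itself is implemented.

```python
import json

# One of the sample lines from the commit message, copied verbatim.
SAMPLE = (
    "V0522 08:13:25.268000 140224882566144 "
    "torch/_subclasses/meta_utils.py:1594] "
    '{"describe_source": {"describer_id": 0, "id": 0, "source": "L[\'x\']"}, '
    '"frame_id": 0, "frame_compile_id": 0, "attempt": 0}'
)


def parse_structured_line(line: str):
    """Hypothetical helper: return (prefix, payload dict) for one log line.

    Assumes the JSON payload follows the first "] " in the line.
    """
    prefix, _, payload = line.partition("] ")
    return prefix, json.loads(payload)


prefix, record = parse_structured_line(SAMPLE)
source = record["describe_source"]["source"]
key = (record["frame_id"], record["describe_source"]["describer_id"])
```

Grouping records by `(frame_id, describer_id)` reflects the commit's note that `describer_id` is expected to be unique per frame id.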
Pian Pawakapan	b1792a622d	[pipelining] handle param aliasing (#127471 ) Adds support for parameter aliasing in pipelining. Does this by reading the state_dict, and creating a map of id -> valid tensor FQNs (to be used in _sink_params). Assigns additional FQN attributes that may be used, runs _sink_params(), and then deletes unused attributes. Shares some similarity with how export's unflattener does it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127471 Approved by: https://github.com/kwen2501	2024-05-31 01:52:57 +00:00
Shunting Zhang	d535de1747	[inductor] remove reordering_reindex (#127367 ) This fixes the loop ordering issue for avg_pool2d here (https://github.com/pytorch/pytorch/issues/126255#issuecomment-2117931529). The reason we cannot fuse the 2 kernels for avg_pool2d is ComputedBuffer.iter_reordering_reindex. Take a simpler example: ``` def f(x, y): """ Add a matmul since inductor may force layout for output. """ return (x.sum(dim=-1) + 1) @ y # Make the first 2 dimensions unmergeable on purpose so that # ComputedBuffer.iter_reordering_reindex will be updated. x = rand_strided([20, 20, 30], [30, 900, 1], device="cuda") y = torch.randn(20, 20) ``` Suppose x.sum is stored to x2. The computed buffer for x2 will remember that we have reordered its first and second dimensions (i.e. loop order [1, 0]). Later on, when we decide the loop order for x2 while computing 'x2 + 1', we pick loop order [1, 0] according to the stride analysis. We then use the saved ComputedBuffer.iter_reordering_reindex to further reorder the loops. The net effect is that we use loop order [0, 1], which prevents the pointwise kernel from fusing with the reduction kernel. I feel that we don't need ComputedBuffer.iter_reordering_reindex, and test results show that removing it has a neutral impact on the dashboard [link](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2022%20May%202024%2017%3A30%3A29%20GMT&stopTime=Wed%2C%2029%20May%202024%2017%3A30%3A29%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/153/head&lCommit=195f42cf1a414d2d1a0422b8a081a85ff52b7d20&rBranch=main&rCommit=d6e3e89804c4063827ea21ffcd3d865e5fe365d9) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127367 Approved by: https://github.com/jansel	2024-05-31 01:36:43 +00:00
PyTorch MergeBot	7646825c3e	Revert "distributed debug handlers (#126601 )" This reverts commit 3d541835d509910fceca00fc5a916e9718c391d8. Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))	2024-05-31 01:21:24 +00:00
cyy	d44daebdbc	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-31 01:20:45 +00:00
feifan	da9fb670d2	Nadam support the flag for "maximize" (#127214 ) Fixes https://github.com/pytorch/pytorch/issues/126642 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127214 Approved by: https://github.com/janeyx99	2024-05-31 01:11:16 +00:00
PyTorch MergeBot	f6e303fa47	Revert "[DeviceMesh] Adding nD slicing support back (#127465 )" This reverts commit e72232f8f032b970b74da18200678b3a4617bf95. Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint `e72232f8f0`, the error does not look like a trivial fix, so I revert the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))	2024-05-31 00:43:13 +00:00
atalman	af5ed05416	Include triton in py3.12 binaries (#127547 ) Additional Builder PR: https://github.com/pytorch/builder/pull/1846/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127547 Approved by: https://github.com/williamwen42	2024-05-31 00:30:10 +00:00
Benson Ma	fc73d07e5e	[c10d] Decorate methods in `NCCLUtils.hpp` with `TORCH_API` (#127550 ) Summary: User-defined PyTorch modules that use `C10D_NCCL_CHECK` run into undefined symbol errors when loaded by `torch.library.load()`, because the symbols have not been exported. This change exports the symbols needed to resolve those runtime errors. Test Plan: PyTorch CI Differential Revision: D57977944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127550 Approved by: https://github.com/Skylion007	2024-05-31 00:17:25 +00:00
Sergii Dymchenko	a2bff4dc8c	Fix lint (#127584 ) Trivial fix after https://github.com/pytorch/pytorch/pull/124678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127584 Approved by: https://github.com/huydhn	2024-05-31 00:00:11 +00:00
wz337	e72232f8f0	[DeviceMesh] Adding nD slicing support back (#127465 ) Fixes #126530 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465 Approved by: https://github.com/wconstab	2024-05-30 23:55:21 +00:00
Shuqiang Zhang	214dd44608	[c10d] add Work's numel to logger for debugging purposes (#127468 ) Summary: We have seen some cases where all ranks call into a collective but it gets stuck, probably due to incorrect tensor sizes. This adds the size info to the logging for debugging. It also takes the chance to consolidate all logger-related status metrics into one struct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127468 Approved by: https://github.com/wconstab	2024-05-30 23:32:33 +00:00
Scott Wolchok	620ec081ec	Extract inner loops into separate function for ARM64 fp16_dot_with_fp32_arith (#127476 ) Summary: Preparing to generalize to bf16. (This should not be committed unless the following bf16 PR is committed!) Test Plan: Spot-checked llm_experiments benchmark result to make sure it didn't regress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127476 Approved by: https://github.com/malfet ghstack dependencies: #127435, #127451	2024-05-30 23:28:17 +00:00
Scott Wolchok	603bde1de3	Use efficient ARM fp16 dot product for gemm_transa_ general case (#127451 ) Summary: This doesn't change the overall gemm algorithm away from repeated dot products, just uses our efficient fp16 dot product developed for the gemv case. It seems to improve performance for every prompt length I tested. Test Plan: Use https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py , edited to test only the trans_b (really gemm_transa_) case for the sizes outlined in the output. Before: ``` Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 1.05 usec trans_b torch.float16 0.97 usec trans_b torch.bfloat16 1.06 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.80 usec trans_b torch.float16 0.97 usec trans_b torch.bfloat16 1.00 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2160.75 usec trans_b torch.float16 659.77 usec trans_b torch.bfloat16 3800.13 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 6343.68 usec trans_b torch.float16 1789.42 usec trans_b torch.bfloat16 10098.34 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6217.20 usec trans_b torch.float16 1874.47 usec trans_b torch.bfloat16 10490.30 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 17934.45 usec trans_b torch.float16 5323.81 usec trans_b torch.bfloat16 29320.80 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.40 usec trans_b torch.float16 1.22 usec trans_b torch.bfloat16 1.22 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.52 usec trans_b torch.float16 1.33 usec trans_b torch.bfloat16 1.77 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4317.09 usec trans_b torch.float16 15541.04 usec trans_b torch.bfloat16 15032.29 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6191.19 usec trans_b torch.float16 40436.29 usec trans_b torch.bfloat16 
40626.93 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6049.22 usec trans_b torch.float16 42367.16 usec trans_b torch.bfloat16 42482.43 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17611.36 usec trans_b torch.float16 117368.54 usec trans_b torch.bfloat16 116958.85 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== trans_b torch.float32 1.04 usec trans_b torch.float16 1.71 usec trans_b torch.bfloat16 1.74 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.10 usec trans_b torch.float16 2.01 usec trans_b torch.bfloat16 2.91 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2456.23 usec trans_b torch.float16 30112.76 usec trans_b torch.bfloat16 29941.58 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6236.12 usec trans_b torch.float16 80361.22 usec trans_b torch.bfloat16 80466.64 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6236.10 usec trans_b torch.float16 82990.74 usec trans_b torch.bfloat16 83899.80 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 17606.43 usec trans_b torch.float16 234397.38 usec trans_b torch.bfloat16 237057.29 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.31 usec trans_b torch.float16 2.67 usec trans_b torch.bfloat16 2.72 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.66 usec trans_b torch.float16 3.36 usec trans_b torch.bfloat16 5.18 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2504.24 usec trans_b torch.float16 60896.53 usec trans_b torch.bfloat16 59852.49 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6407.11 usec trans_b torch.float16 163294.92 usec trans_b torch.bfloat16 161199.10 usec m=4096, n=11008, k=16 ==================== trans_b torch.float32 6132.30 usec trans_b torch.float16 167244.77 usec trans_b torch.bfloat16 170064.35 usec m=32000, n=4096, k=16 
==================== trans_b torch.float32 17635.56 usec trans_b torch.float16 475020.00 usec trans_b torch.bfloat16 476332.29 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.40 usec trans_b torch.float16 4.67 usec trans_b torch.bfloat16 4.80 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.24 usec trans_b torch.float16 6.10 usec trans_b torch.bfloat16 10.03 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2660.63 usec trans_b torch.float16 122436.04 usec trans_b torch.bfloat16 121687.96 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6405.60 usec trans_b torch.float16 324708.42 usec trans_b torch.bfloat16 324866.67 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6566.74 usec trans_b torch.float16 330801.04 usec trans_b torch.bfloat16 332561.79 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 18610.84 usec trans_b torch.float16 944578.75 usec trans_b torch.bfloat16 940674.33 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.48 usec trans_b torch.float16 16.43 usec trans_b torch.bfloat16 17.11 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.83 usec trans_b torch.float16 22.31 usec trans_b torch.bfloat16 37.00 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4806.59 usec trans_b torch.float16 485338.83 usec trans_b torch.bfloat16 478835.08 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 12109.51 usec trans_b torch.float16 1300928.58 usec trans_b torch.bfloat16 1293181.63 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 11223.70 usec trans_b torch.float16 1326119.92 usec trans_b torch.bfloat16 1330395.12 usec m=32000, n=4096, k=128 ==================== trans_b torch.float32 33485.34 usec trans_b torch.float16 3869227.17 usec trans_b torch.bfloat16 3792905.00 usec ``` After: ``` 
Matrix-vector: m=8, n=128, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.71 usec trans_b torch.bfloat16 0.81 usec m=128, n=8, k=1 ==================== trans_b torch.float32 0.75 usec trans_b torch.float16 0.93 usec trans_b torch.bfloat16 0.98 usec m=4096, n=4096, k=1 ==================== trans_b torch.float32 2194.31 usec trans_b torch.float16 661.27 usec trans_b torch.bfloat16 3758.42 usec m=11008, n=4096, k=1 ==================== trans_b torch.float32 5792.04 usec trans_b torch.float16 1789.98 usec trans_b torch.bfloat16 10120.67 usec m=4096, n=11008, k=1 ==================== trans_b torch.float32 6101.22 usec trans_b torch.float16 1927.34 usec trans_b torch.bfloat16 10469.47 usec m=32000, n=4096, k=1 ==================== trans_b torch.float32 18353.20 usec trans_b torch.float16 5161.06 usec trans_b torch.bfloat16 29601.69 usec Matrix-matrix (prompt len 4: m=8, n=128, k=4 ==================== trans_b torch.float32 2.14 usec trans_b torch.float16 0.85 usec trans_b torch.bfloat16 1.19 usec m=128, n=8, k=4 ==================== trans_b torch.float32 1.47 usec trans_b torch.float16 1.85 usec trans_b torch.bfloat16 1.75 usec m=4096, n=4096, k=4 ==================== trans_b torch.float32 4416.40 usec trans_b torch.float16 2688.36 usec trans_b torch.bfloat16 14987.33 usec m=11008, n=4096, k=4 ==================== trans_b torch.float32 6140.24 usec trans_b torch.float16 7467.26 usec trans_b torch.bfloat16 40295.52 usec m=4096, n=11008, k=4 ==================== trans_b torch.float32 6143.10 usec trans_b torch.float16 7298.04 usec trans_b torch.bfloat16 41393.43 usec m=32000, n=4096, k=4 ==================== trans_b torch.float32 17650.72 usec trans_b torch.float16 21346.63 usec trans_b torch.bfloat16 116849.98 usec Matrix-matrix (prompt len 8: m=8, n=128, k=8 ==================== trans_b torch.float32 1.05 usec trans_b torch.float16 1.03 usec trans_b torch.bfloat16 1.69 usec m=128, n=8, k=8 ==================== trans_b torch.float32 2.05 
usec trans_b torch.float16 3.08 usec trans_b torch.bfloat16 2.95 usec m=4096, n=4096, k=8 ==================== trans_b torch.float32 2323.99 usec trans_b torch.float16 5265.45 usec trans_b torch.bfloat16 29942.40 usec m=11008, n=4096, k=8 ==================== trans_b torch.float32 6202.01 usec trans_b torch.float16 14677.90 usec trans_b torch.bfloat16 80625.18 usec m=4096, n=11008, k=8 ==================== trans_b torch.float32 6112.05 usec trans_b torch.float16 14340.52 usec trans_b torch.bfloat16 82799.99 usec m=32000, n=4096, k=8 ==================== trans_b torch.float32 17650.65 usec trans_b torch.float16 42551.43 usec trans_b torch.bfloat16 236081.08 usec Matrix-matrix (prompt len 16: m=8, n=128, k=16 ==================== trans_b torch.float32 1.26 usec trans_b torch.float16 1.34 usec trans_b torch.bfloat16 2.69 usec m=128, n=8, k=16 ==================== trans_b torch.float32 1.60 usec trans_b torch.float16 5.81 usec trans_b torch.bfloat16 5.34 usec m=4096, n=4096, k=16 ==================== trans_b torch.float32 2328.05 usec trans_b torch.float16 10526.58 usec trans_b torch.bfloat16 60028.28 usec m=11008, n=4096, k=16 ==================== trans_b torch.float32 6243.35 usec trans_b torch.float16 28505.08 usec trans_b torch.bfloat16 163670.15 usec m=4096, n=11008, k=16 ==================== trans_b torch.float32 5870.11 usec trans_b torch.float16 28597.89 usec trans_b torch.bfloat16 165404.88 usec m=32000, n=4096, k=16 ==================== trans_b torch.float32 17746.27 usec trans_b torch.float16 83393.87 usec trans_b torch.bfloat16 472313.13 usec Matrix-matrix (prompt len 32: m=8, n=128, k=32 ==================== trans_b torch.float32 1.35 usec trans_b torch.float16 2.01 usec trans_b torch.bfloat16 4.68 usec m=128, n=8, k=32 ==================== trans_b torch.float32 1.19 usec trans_b torch.float16 10.98 usec trans_b torch.bfloat16 10.13 usec m=4096, n=4096, k=32 ==================== trans_b torch.float32 2525.29 usec trans_b torch.float16 23106.71 usec trans_b 
torch.bfloat16 122987.04 usec m=11008, n=4096, k=32 ==================== trans_b torch.float32 6131.34 usec trans_b torch.float16 57537.41 usec trans_b torch.bfloat16 327825.00 usec m=4096, n=11008, k=32 ==================== trans_b torch.float32 6395.01 usec trans_b torch.float16 57456.33 usec trans_b torch.bfloat16 331325.58 usec m=32000, n=4096, k=32 ==================== trans_b torch.float32 19078.68 usec trans_b torch.float16 167735.08 usec trans_b torch.bfloat16 975736.88 usec Matrix-matrix (prompt len 128: m=8, n=128, k=128 ==================== trans_b torch.float32 2.40 usec trans_b torch.float16 6.07 usec trans_b torch.bfloat16 16.83 usec m=128, n=8, k=128 ==================== trans_b torch.float32 1.78 usec trans_b torch.float16 40.35 usec trans_b torch.bfloat16 37.21 usec m=4096, n=4096, k=128 ==================== trans_b torch.float32 4827.60 usec trans_b torch.float16 84341.24 usec trans_b torch.bfloat16 478917.75 usec m=11008, n=4096, k=128 ==================== trans_b torch.float32 11879.96 usec trans_b torch.float16 226484.33 usec trans_b torch.bfloat16 1289465.50 usec m=4096, n=11008, k=128 ==================== trans_b torch.float32 10707.75 usec trans_b torch.float16 229200.58 usec trans_b torch.bfloat16 1327416.67 usec m=32000, n=4096, k=128 ==================== trans_b torch.float32 33306.32 usec trans_b torch.float16 662898.21 usec trans_b torch.bfloat16 3815866.63 usec ``` torch.float16 performance seems to be improved for all except the m=128, n=8, k=128 case, where it is roughly neutral. This case motivated the addition of the "first-tier tail fixup" in the dot kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127451 Approved by: https://github.com/malfet ghstack dependencies: #127435	2024-05-30 23:28:17 +00:00
Scott Wolchok	74b89b9283	Extract dot-product functions from fp16_gemv_trans gemv kernels (#127435 ) Summary: Refactoring step before we attempt to use these to implement a less bad fp16 GEMM. Test Plan: Existing tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127435 Approved by: https://github.com/malfet	2024-05-30 23:28:17 +00:00
eellison	a3c00e4331	[Easy] Move V.fake_mode inside of replace_by_example (#127494 ) Was writing docs and saw that we always have this duplicated usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127494 Approved by: https://github.com/shunting314, https://github.com/aorenste	2024-05-30 23:23:42 +00:00
Rohan Varma	f9a1bc2c65	[FSDP] Remove _sync_module_states (#124678 ) Remove this unused API Differential Revision: [D56445639](https://our.internmc.facebook.com/intern/diff/D56445639/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124678 Approved by: https://github.com/awgu	2024-05-30 23:02:09 +00:00
laithsakka	029af29e6d	support operator.index function (#127440 ) Fix https://github.com/pytorch/pytorch/issues/127426 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127440 Approved by: https://github.com/mlazos ghstack dependencies: #126444, #127146, #127424	2024-05-30 22:44:18 +00:00
Huy Do	3b88c27c46	Mark DynamicShapesExportTests::test_retracibility_dict_container_inp_out as slow (#127558 ) Same as https://github.com/pytorch/pytorch/pull/117896, another slowpoke `DynamicShapesExportTests::test_retracibility_dict_container_inp_out` has recently shown up on macOS. For example, https://ossci-raw-job-status.s3.amazonaws.com/log/25585713394 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127558 Approved by: https://github.com/clee2000	2024-05-30 22:40:48 +00:00
PyTorch MergeBot	e02971fcfb	Revert "Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165 )" This reverts commit a288b95d4e5ceed327c5bdb9696331aa87688d60. Reverted https://github.com/pytorch/pytorch/pull/127165 on behalf of https://github.com/atalman due to lint is failing ([comment](https://github.com/pytorch/pytorch/pull/127165#issuecomment-2140930658))	2024-05-30 22:06:46 +00:00
Peter Bell	4ee003abdf	[inductor] Repeat should not return a view (#127533 ) Fixes #127474 `as_strided` unwraps views and looks at the underlying storage, so it isn't legal to lower `repeat`, which should return a new storage, into a view. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127533 Approved by: https://github.com/lezcano	2024-05-30 21:38:59 +00:00
hippocookie	a288b95d4e	Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165 ) Fixes some files in #123062 Run lintrunner on files: test_shape_ops.py test_show_pickle.py test_sort_and_select.py ```bash $ lintrunner --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165 Approved by: https://github.com/ezyang	2024-05-30 21:34:16 +00:00
dilililiwhy	f471482eb2	Try to include NCCL related header file with macro USE_C10D_NCCL (#127501 ) Fixes #ISSUE_NUMBER Try to include NCCL related header file with macro USE_C10D_NCCL, so that third-party device compilation will not be interrupted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127501 Approved by: https://github.com/ezyang	2024-05-30 21:33:41 +00:00
Xuehai Pan	6849b80411	Add `ninja` as dev dependency (#127380 ) `ninja` is required to build C++ extensions in tests. ```pytb ERROR: test_autograd_cpp_node (__main__.TestCompiledAutograd) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/PanXuehai/Projects/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper method(args, *kwargs) File "test/inductor/test_compiled_autograd.py", line 1061, in test_autograd_cpp_node module = torch.utils.cpp_extension.load_inline( File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1643, in load_inline return _jit_compile( File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1718, in _jit_compile _write_ninja_file_and_build_library( File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1800, in _write_ninja_file_and_build_library verify_ninja_availability() File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1849, in verify_ninja_availability raise RuntimeError("Ninja is required to load C++ extensions") RuntimeError: Ninja is required to load C++ extensions To execute this test, run the following from the base repo dir: python test/inductor/test_compiled_autograd.py -k TestCompiledAutograd.test_autograd_cpp_node ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127380 Approved by: https://github.com/ezyang	2024-05-30 21:22:42 +00:00
Xu Zhao	094183dba6	[torchbench][pt2] Enable Huggingface and Timm models for internal buck runner (#127460 ) Summary: Add huggingface and timm model runs to the internal pt2 benchmark runner. Test Plan: Testing huggingface model: ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BlenderbotSmallForCausalLM --performance --training --device=cuda --amp 33/ 33 +0 frames 2s 13 graphs 13 graph calls 0/ -12 = 0% ops 0% time ``` Testing timm model: ``` $ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only coat_lite_mini --performance --training --device=cuda --amp loading model: 0it [00:11, ?it/s] cuda train coat_lite_mini 8/ 8 +0 frames 4s 2 graphs 2 graph calls 0/ -1 = 0% ops 0% time ``` Differential Revision: D57930582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127460 Approved by: https://github.com/HDCharles, https://github.com/huydhn	2024-05-30 21:18:28 +00:00
cyy	bf2f5e70dd	Fix warnings in SmallVector (#127250 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127250 Approved by: https://github.com/ezyang	2024-05-30 21:13:20 +00:00
Zain Rizvi	ad1b18ab2f	Add repo-specific scale config files (#127566 ) Part of moving pytorch/pytorch CI infra to a Linux Foundation-run AWS account. For self-hosted runners that can run jobs from just a single repo, the runner scalers expect the scale-config files to be stored in the repo itself. These scale-config files define how the Linux Foundation's self-hosted runners are configured. They apply to runners that are only available to the pytorch/pytorch and pytorch/pytorch-canary repos. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127566 Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/atalman	2024-05-30 21:08:45 +00:00
PyTorch MergeBot	846f79e61a	Revert "Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199 )" This reverts commit 18a3f781e6382e2222d7c30c18136267407f9953. Reverted https://github.com/pytorch/pytorch/pull/127199 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MacOS trunk job `18a3f781e6 (25619618844)` ([comment](https://github.com/pytorch/pytorch/pull/127199#issuecomment-2140834363))	2024-05-30 20:45:31 +00:00
Howard Huang	cce2192396	[pipelining] Support calling multiple recv fwd/bwd ops (#127084 ) Currently, only a single `get_fwd_recv_ops` or `get_bwd_recv_ops` can be called before `forward_one_chunk` and `backward_one_chunk` since they both share the same chunk_id counter. This creates a separate `recv_chunk_id` counter so that recvs can be accumulated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127084 Approved by: https://github.com/wconstab	2024-05-30 20:15:52 +00:00
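The counter split described in #127084 can be pictured with a minimal sketch. The class below is hypothetical (`StageSketch` and its string "ops" are illustrative and are not the `torch.distributed.pipelining` implementation); it only shows why a dedicated `recv_chunk_id` lets several receives be queued before the matching forward calls run.

```python
class StageSketch:
    """Hypothetical pipeline stage; not the real pipelining stage class."""

    def __init__(self):
        self.chunk_id = 0       # advanced only by forward_one_chunk
        self.recv_chunk_id = 0  # advanced only by get_fwd_recv_ops

    def get_fwd_recv_ops(self):
        # Each call describes the receive for the next not-yet-requested chunk.
        ops = [f"recv(chunk={self.recv_chunk_id})"]
        self.recv_chunk_id += 1
        return ops

    def forward_one_chunk(self):
        # Runs the next chunk in order, independent of how many recvs were queued.
        done = self.chunk_id
        self.chunk_id += 1
        return done


stage = StageSketch()
# Accumulate two receives before running any forward step.
recvs = stage.get_fwd_recv_ops() + stage.get_fwd_recv_ops()
ran = [stage.forward_one_chunk(), stage.forward_one_chunk()]
print(recvs, ran)
```

With a single shared counter, the second `get_fwd_recv_ops` call would describe the same chunk as the first, because nothing advances the counter until `forward_one_chunk` runs.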
Howard Huang	aa3d041830	[pipelining] Fix block comments for doc rendering (#127418 ) Previous: <img width="915" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/14626937-7d79-4a7a-9d0b-3fcfe64b4667"> <img width="926" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/58ab009c-3f93-46d7-a04f-499a2a0ba390"> New: https://docs-preview.pytorch.org/pytorch/pytorch/127418/distributed.pipelining.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/127418 Approved by: https://github.com/wconstab	2024-05-30 20:10:07 +00:00
Boyuan Feng	ff23c5b7d7	[cudagraph] improve log for mutating static input tensor addresses (#127145 ) Summary: This diff adds more logging for cudagraph when a static input tensor address mutates. For each placeholder whose static input tensor address mutates, we log its name, changed data pointer address, and the input stack trace. Since some placeholders may have an empty stack trace, we find the first user with a non-empty stack trace and print that stack trace instead. Test Plan: buck2 run fbcode//caffe2/test/inductor:cudagraph_trees -- --r test_static_inputs_address_mutation_log Differential Revision: D57805118 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127145 Approved by: https://github.com/eellison	2024-05-30 19:57:32 +00:00
Prachi Gupta	19333d1eb9	[ROCm] Update triton pin to fix libtanh issue (#125396 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396 Approved by: https://github.com/pruthvistony, https://github.com/nmacchioni	2024-05-30 19:26:58 +00:00
Rohan Varma	2cb6f20867	Warn env vars only once during program (#127046 ) This avoids logs being excessively noisy in some training runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127046 Approved by: https://github.com/kwen2501, https://github.com/wconstab	2024-05-30 19:10:53 +00:00
Boyuan Feng	4afc5c7bb9	[torchscript] Handle prim::device and prim::dtype (#127466 ) - Support prim::device and prim::dtype during torchscript migration to export - Add unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127466 Approved by: https://github.com/SherlockNoMad	2024-05-30 18:35:44 +00:00
Mikayla Gawarecki	fa426b096b	Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819 Approved by: https://github.com/albanD ghstack dependencies: #127313, #126814	2024-05-30 18:28:13 +00:00
Mikayla Gawarecki	bfdec93395	Default XLA to use swap_tensors path in nn.Module._apply (#126814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814 Approved by: https://github.com/JackCaoG, https://github.com/albanD ghstack dependencies: #127313	2024-05-30 18:28:13 +00:00
Ali Waheed	39cf2f8e66	Added sorting notes for eig/eigvals (#127492 ) Fixes #58034 @lezcano , Added suggested comments for eig and eigvals in the documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127492 Approved by: https://github.com/lezcano, https://github.com/kit1980	2024-05-30 18:13:22 +00:00
Chen Lai	7827afca14	Copy the constant folding pass to the pass under export/passes folder (#127456 ) It's a generic pass and I'm trying to find a good place to host it. It's currently needed by quantization flow. See context in D55930580, it's too much effort to land a fix in the inductor folder. Differential Revision: [D57934182](https://our.internmc.facebook.com/intern/diff/D57934182/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127456 Approved by: https://github.com/angelayi	2024-05-30 18:04:08 +00:00
James Wu	f9937afd4f	Add noqa to prevent lint warnings (#127545 ) This is to prevent the import from being removed due to unused import. What's annoying about this is that it's not consistently running: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545 Approved by: https://github.com/masnesral	2024-05-30 17:56:49 +00:00
PyTorch MergeBot	12d6446507	Revert "[inductor] fix mkldnn linear binary fusion check ut (#127296 )" This reverts commit cdeb242fc977210e211fd77b217320205c9f4042. Reverted https://github.com/pytorch/pytorch/pull/127296 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the tests is failing on trunk ROCm. Please help fix and reland the change https://github.com/pytorch/pytorch/actions/runs/9302535020/job/25606932572 ([comment](https://github.com/pytorch/pytorch/pull/127296#issuecomment-2140334323))	2024-05-30 17:18:23 +00:00
PyTorch MergeBot	e9a6bbbf7c	Revert "[CI] add xpu test in periodic workflow (#126410 )" This reverts commit 30d98611a3a35287c47ded9647f0b4c81fbdf036. Reverted https://github.com/pytorch/pytorch/pull/126410 on behalf of https://github.com/malfet due to Let's sync up on the test strategy/policies here ([comment](https://github.com/pytorch/pytorch/pull/126410#issuecomment-2140269549))	2024-05-30 17:01:02 +00:00
cyy	8777443d73	Remove FindMatlabMex.cmake (#127414 ) It is not used anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127414 Approved by: https://github.com/ezyang	2024-05-30 16:26:35 +00:00
Daniil Kutz	b506d37331	Fix multiple errors while parsing NativeFunctions from YAML (#127413 ) Fixing multiple errors in parse_native_yaml when loading NativeFunctions from Yaml file. Add assertions that validates parsed data. Fixes #127404, #127405, #127406, #127407, #127408, #127409, #127410, #127411 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127413 Approved by: https://github.com/ezyang	2024-05-30 16:25:04 +00:00
PyTorch MergeBot	ea5c17de90	Revert "Add torchao nightly testing workflow (#126885 )" This reverts commit d938170314fa89acaad6b06fbbaac6b98f1e618f. Reverted https://github.com/pytorch/pytorch/pull/126885 on behalf of https://github.com/atalman due to Broke inductor periodic test ([comment](https://github.com/pytorch/pytorch/pull/126885#issuecomment-2140139486))	2024-05-30 16:23:06 +00:00
cyy	be7be9fa16	[Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102 ) This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following https://github.com/pytorch/pytorch/pull/124987. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125102 Approved by: https://github.com/ezyang	2024-05-30 16:19:53 +00:00
Shunting Zhang	576c5ef1dd	[inductor] fix some tests in test_max_autotune.py (#127472 ) Fix https://github.com/pytorch/pytorch/issues/126176 . We should not use torch.empty to generate input data if we are going to do any accuracy test. torch.empty may return NaN. In that case both the reference and the actual result may contain NaN at the same index, but `NaN != NaN`, so the test fails. Also, whether torch.empty returns NaN is not deterministic; it may depend on other tests that ran earlier. Generating random data instead of calling torch.empty fixes the problem. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127472 Approved by: https://github.com/eellison, https://github.com/jansel	2024-05-30 16:04:48 +00:00
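The `NaN != NaN` pitfall behind #127472 can be reproduced without PyTorch. The sketch below uses plain Python lists as stand-ins for tensors; the helper names are illustrative and are not part of the test suite. It shows an exact element-wise comparison failing on identical data that contains a NaN, which is why the fix generates random inputs rather than reading uninitialized memory.

```python
import math
import random


def all_equal(reference, actual):
    """Naive element-wise equality, as a strict accuracy check might do."""
    return all(r == a for r, a in zip(reference, actual))


def all_equal_nan_aware(reference, actual):
    """Treat NaN at the same index in both inputs as a match."""
    return all(
        r == a or (math.isnan(r) and math.isnan(a))
        for r, a in zip(reference, actual)
    )


nan = float("nan")
# Stand-in for uninitialized memory that happens to hold a NaN.
reference = [1.0, nan, 3.0]
actual = [1.0, nan, 3.0]

print(all_equal(reference, actual))            # False
print(all_equal_nan_aware(reference, actual))  # True

# Random inputs, as in the fix, never contain NaN in the first place.
random_inputs = [random.random() for _ in range(3)]
print(any(math.isnan(x) for x in random_inputs))  # False
```

A NaN-aware comparison would also make the check pass, but it leaves the nondeterministic inputs in place; well-defined inputs remove the source of the flakiness.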
Aaron Orenstein	d2df0f56a3	Fix compilation_latency regression caused by #127060 (#127326 ) It seems that while #127060 improved the speed for tacotron2 it introduced a compilation_latency regression for some of the TIMM benchmarks. The original change was to precompute the Dep metadata - but apparently some benchmarks have few enough overlaps that precomputing O(n) deps was slower than ignoring O(n^2) deps. So change it to go back to computing the Dep metadata on demand but to then cache the result. `dm_nfnet_f0` was a good example because on the dashboard it showed an increase from 140s -> 154s. ``` python benchmarks/dynamo/timm_models.py --performance --cold-start-latency --training --amp --backend inductor --dynamic-shapes --dynamic-batch-only --device cuda --total-partitions 5 --partition-id 1 --output timm-0.csv --only dm_nfnet_f0 ``` Looking at the compilation_latency result. On viable (d6e3e8980): 172.777958 176.725071 177.907955 On viable with #127060 and #127061 fully backed out: 158.305166 158.688560 160.791187 On viable w/ this change: 160.094164 160.201845 161.752157 I think that's probably close enough considering the variance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127326 Approved by: https://github.com/oulgen	2024-05-30 15:37:08 +00:00
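The "compute on demand, then cache" approach described in #127326 can be sketched with `functools.cached_property`. The `Node` class and its `dep_metadata` attribute are hypothetical stand-ins, not Inductor's actual scheduler types; the point is only that metadata is computed for nodes that are actually queried, and at most once per node.

```python
from functools import cached_property


class Node:
    """Hypothetical scheduler node; names are illustrative only."""

    compute_calls = 0  # counts how often the metadata is actually computed

    def __init__(self, name, reads):
        self.name = name
        self.reads = reads

    @cached_property
    def dep_metadata(self):
        # Runs on first access only; the result is then stored on the instance.
        Node.compute_calls += 1
        return frozenset(self.reads)


nodes = [Node("a", ["x", "y"]), Node("b", ["y", "z"]), Node("c", ["w"])]

# Only the two nodes being compared pay for the computation.
shared = nodes[0].dep_metadata & nodes[1].dep_metadata
# Repeating the query hits the cache and does not recompute.
shared_again = nodes[0].dep_metadata & nodes[1].dep_metadata

print(sorted(shared), Node.compute_calls)  # ['y'] 2
```

Node `c` is never queried, so its metadata is never built. That is the case the commit message describes, where eager precomputation for every node cost more than it saved.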
rzou	ffe506e853	Better graph break msg (and warning) on Dynamo x Python C++ extension (#127301 ) Dynamo graph breaks on Python C/C++ extensions (e.g. pybinded functions). The usual way to handle this is to turn those extensions into custom ops. This PR adds a nicer graph break message and also changes it to unconditionally warn on this graph break (because graph break messages are usually not visible). Fixes https://github.com/pytorch/pytorch/issues/126799 Test Plan: - new test Pull Request resolved: https://github.com/pytorch/pytorch/pull/127301 Approved by: https://github.com/jansel ghstack dependencies: #127291, #127292, #127400, #127423	2024-05-30 14:54:29 +00:00
rzou	c9beea13ac	Rewrite existing links to custom ops gdocs with the landing page (#127423 ) NB: these links will be live after the docs build happens, which is once a day. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127423 Approved by: https://github.com/jansel, https://github.com/williamwen42 ghstack dependencies: #127291, #127292, #127400	2024-05-30 14:54:29 +00:00
lezcano	18a3f781e6	Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199 ) We don't need to generate so many samples for these very expensive ops. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199 Approved by: https://github.com/peterbell10, https://github.com/zou3519 ghstack dependencies: #125580	2024-05-30 14:45:58 +00:00
lezcano	48538d3d14	Implement svd_lowrank and pca_lowrank for complex numbers (#125580 ) We fix a number of bugs previously present in the complex implementation. We also heavily simplify the implementation, using, among other things, that we now have conjugate views. I saw there is a comment regarding how slow some checks on this function are. As such, I removed quite a few of the combinations of inputs to make the OpInfo lighter. I still left a couple relevant examples to not regress coverage though. Fixes https://github.com/pytorch/pytorch/issues/122188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125580 Approved by: https://github.com/pearu, https://github.com/peterbell10	2024-05-30 14:45:58 +00:00
Isuru Fernando	3fb8a0b627	Fix nextafter in inductor CPP codegen (#126876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126876 Approved by: https://github.com/peterbell10, https://github.com/jgong5	2024-05-30 14:08:16 +00:00
PyTorch MergeBot	ce63b676f3	Revert "[compiled autograd] torch.compile API (#125880 )" This reverts commit e1c322112a3d7b128b42e27f68bc9a714bfd9a09. Reverted https://github.com/pytorch/pytorch/pull/125880 on behalf of https://github.com/atalman due to sorry your PR broke lint, need to revert ([comment](https://github.com/pytorch/pytorch/pull/125880#issuecomment-2139605376))	2024-05-30 13:53:31 +00:00
albanD	6e0eeecc7c	Add back private function torch.cuda.amp.autocast_mode._cast (#127433 ) This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433 Approved by: https://github.com/zou3519, https://github.com/guangyey	2024-05-30 13:29:23 +00:00
Sam Larsen	3f5d8636aa	[inductor] Copy RedisRemoteCacheBackend into pytorch (#127480 ) Summary: We need an implementation of RedisRemoteCacheBackend with the same API that we're using for FbMemcacheRemoteFxGraphCacheBackend. So we'll stop using the Triton implementation and adapt a version for use by inductor. I also renamed parameters and cache entries to match our cache terminology. Test Plan: Ran this command twice and inspected log output to ensure I got cache hits: ``` TORCH_LOGS=+torch._inductor.codecache TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 python benchmarks/dynamo/torchbench.py --performance --inductor --device cuda --training --amp --print-compilation-time --only dcgan ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127480 Approved by: https://github.com/oulgen	2024-05-30 13:08:10 +00:00
haozhe.zhu	cdeb242fc9	[inductor] fix mkldnn linear binary fusion check ut (#127296 ) In this PR: (1) Fix the unary fusion for bf16 conv/linear. Previously we registered the same fusion pattern for `bf16` and `fp16`, and we did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but in the later replacement we found an unexpected float16 and therefore did not fuse. We fix it by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern. ``` def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None): def fn(match): matched = _is_single_computation_op(computation_op, lowp_dtype)(match) # previously we do not check lowp_dtype here ``` This was not exposed before because we only checked the match count, and the match count was correct anyway because the pattern did match. To address this, we add a check on the number of `generated_kernel`. If the op is not fused, there will be an additional kernel to compute the post op. (2) Previously the UT ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary ``` did not check the fusion status; fix it in this PR. (3) Extend `test_conv_binary` to test with lp. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-05-30 12:29:36 +00:00
Dmitry Rogozhkin	9f73c65b8f	xpu: pass MAX_JOBS building xpu_mkldnn_proj (#126562 ) mkldnn is quite a big project, and MAX_JOBS support is essential when building on a system with a large number of CPUs and limited memory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126562 Approved by: https://github.com/jgong5, https://github.com/guangyey, https://github.com/albanD	2024-05-30 12:10:33 +00:00
chuanqiw	30d98611a3	[CI] add xpu test in periodic workflow (#126410 ) Works for https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126410 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-05-30 12:10:15 +00:00
Yifu Wang	1071437169	Introduce cuda_p2p based fused_all_gather_matmul and fused_matmul_reduce_scatter (#126634 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126634 Approved by: https://github.com/Chillee, https://github.com/wanchaol	2024-05-30 12:10:11 +00:00
titaiwangms	705346bf8d	[ONNX] Skip optimizer when it fails (#127349 ) continue #127039 (1) Skip optimizer when it fails (2) Update onnx, ort, and onnx-script (3) The update to onnx-script results in the actual optimizer and rewriter enabling in this PR, and https://github.com/pytorch/pytorch/pull/123379 did not update onnx-script. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127349 Approved by: https://github.com/justinchuby	2024-05-30 07:08:45 +00:00
Mikayla Gawarecki	cd06ae0cb8	Relax use_count constraints for swap_tensors when AccumulateGrad holds a reference (#127313 ) ### Before this PR: `torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1 ```python a = torch.randn(2, 3, requires_grad=True) b = torch.randn(2, 4) out = a * 2 out.sum().backward() # Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward # torch.utils.swap_tensors(a, b) del out # Calling swap_tensors here would pass torch.utils.swap_tensors(a, b) ``` ### After this PR: `torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1 or 2 IF the second reference is held by `AccumulateGrad` A pre-hook will be registered on the `AccumulateGrad` node so that it will fail if it is called (i.e. if user attempts to backward through the graph). ```python a = torch.randn(2, 3, requires_grad=True) b = torch.randn(2, 4) out = a * 2 out.sum().backward() # Calling swap_tensors here is ok torch.utils.swap_tensors(a, b) # If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors ``` ### Application to `nn.Module` This issue is especially pertinent in context of `nn.Module` where parameters will have `AccumulateGrad` nodes initialized after forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, this would fail at the `m.cpu()` but we want users to be able to do something like the following, and instead raise an error if the user ever attempts to backward through the poisoned `AccumulateGrad` node ```python import torch import torch.nn as nn m = nn.Linear(3, 5) inp = torch.randn(2, 3) out = m(inp) out.sum().backward() m.cpu() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313 Approved by: https://github.com/soulitzer	2024-05-30 07:06:55 +00:00
William Wen	d44ab8ba6d	[dynamo] utility to generate bytecode from template function (#127359 ) This will be helpful in reducing some of the hardcoded and python-version-dependent bytecode generation in various places in dynamo - e.g. resume function generation and object reconstruction. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127359 Approved by: https://github.com/jansel ghstack dependencies: #127329	2024-05-30 06:37:32 +00:00
Alex Baden	5d316c81be	[Inductor] Add 0 initialization to Triton masked loads (#127311 ) For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur. Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to a tee). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends? - we did not see any performance change from 0-initializing in the Intel XPU backend but one could imagine compiler optimizations to remove paths that depend on undefined) by adding an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a clue that masked load behavior requires explicit initialization, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added `other` to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine if that function was actually being called. Fixes #126535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311 Approved by: https://github.com/jansel	2024-05-30 04:50:54 +00:00
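The masked-load semantics discussed in #127311 can be emulated in plain Python. This is not Triton code and `masked_load` is a hypothetical helper; the `UNDEFINED` sentinel stands in for whatever a backend leaves in masked-out lanes when no `other` value is supplied.

```python
UNDEFINED = object()  # stand-in for "unspecified by the language"


def masked_load(values, mask, other=UNDEFINED):
    """Pure-Python emulation of a masked load; not the Triton API."""
    return [v if m else other for v, m in zip(values, mask)]


values = [5, 7, 9, 11]
mask = [True, True, False, False]

without_other = masked_load(values, mask)
with_other = masked_load(values, mask, other=0)

print(with_other)                   # [5, 7, 0, 0]
# A later conditional is well defined only because `other` was given.
print([x > 0 for x in with_other])  # [True, True, False, False]
```

In this emulation, evaluating `x > 0` over `without_other` would raise a `TypeError` on the sentinel. That is a loud analogue of the silent undefined behavior that an explicit `other` guards against.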
laithsakka	3947731887	enable test_parameter_free_dynamic_shapes test when nn module inlining is on (#127424 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127424 Approved by: https://github.com/mlazos ghstack dependencies: #126444, #127146	2024-05-30 04:20:07 +00:00
Anshul Sinha	15cc9f2e7e	[dtensor][be] added checksAssert function and refactored test cases (#127356 ) Summary Added c10d checksAsserts functions to reduce written lines of code and refactored test cases. Merged one test case into another. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127356 Approved by: https://github.com/XilunWu ghstack dependencies: #127025, #127029, #127040, #127134, #127334	2024-05-30 03:48:17 +00:00
Anshul Sinha	998f38814c	[dtensor][debug] added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode (#127334 ) Summary Added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127334 Approved by: https://github.com/XilunWu, https://github.com/yifuwang ghstack dependencies: #127025, #127029, #127040, #127134	2024-05-30 03:48:17 +00:00
James Wu	f58fc16e8f	[easy?] Move AsyncCompile to a different file (#127235 ) By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes. To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235 Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral	2024-05-30 02:43:02 +00:00
chilli	e0fc1ab625	Forward fix for templates + views (#127446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127446 Approved by: https://github.com/eellison	2024-05-30 02:34:35 +00:00
Tristan Rice	3d541835d5	distributed debug handlers (#126601 ) This adds debug handlers as described in: * https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy) * https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy) This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR. This adds 2 handlers out of the box: * `/handler/ping` for testing purposes * `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-05-30 02:21:08 +00:00
Simon Fan	e1c322112a	[compiled autograd] torch.compile API (#125880 ) - enter existing compiled autograd ctx manager before entering torch.compile frames Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880 Approved by: https://github.com/jansel	2024-05-30 02:10:06 +00:00
SandishKumarHN	da39461d61	[optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418 ) this PR address the comments in this PR #124904 - Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py - Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers - Move to OptimizerInfo framework. - For failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32, Adam_cuda_float32 - Added toleranceOverride in this PR - created a issue #127000 ``` > (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( /home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system. warnings.warn( test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'lr': 0.1, 'fused': True} {'lr': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad' test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'fused': True} {'capturable': True, 'fused': True} {'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} {'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True} ok test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... 
{'fused': True} {'fused': True} {'lr': 0.01, 'fused': True} {'lr': 0.01, 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'lr': tensor(0.0010), 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'dampening': 0.5, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} {'weight_decay': 0.1, 'maximize': True, 'fused': True} ok ---------------------------------------------------------------------- Ran 8 tests in 16.117s OK (skipped=1) > lintrunner test/test_cuda.py ---------------------------------------------------------------------- ok No lint issues. > lintrunner torch/testing/_internal/common_optimizers.py ---------------------------------------------------------------------- ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418 Approved by: https://github.com/janeyx99	2024-05-30 01:47:41 +00:00
PyTorch MergeBot	67739d8c6f	Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 )" This reverts commit 699db7988d84d163ebb6919f78885e4630182a7a. Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))	2024-05-30 01:16:57 +00:00
rzou	1abcac9dab	New Custom Ops Documentation landing page (#127400 ) We create a new landing page for PyTorch custom ops (suggested by jansel). All of our error messages will link here, and I'll work with the docs team to see if we can boost SEO for this page. NB: the landing page links some non-searchable webpages. Two of those (the Python custom ops tutorial and C++ custom ops tutorial) will turn into actual webpages when PyTorch 2.4 comes around. I'll make the third one (the Custom Operators Manual) once it stabilizes (we continuously add new things to it and the length means that we might want to create a custom website for it to make the presentation more digestible). Test Plan: - view docs preview. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127400 Approved by: https://github.com/jansel ghstack dependencies: #127291, #127292	2024-05-30 01:06:04 +00:00
saadelkouari	49ad90349d	Correct error message for aten::_local_scalar_dense on meta tensor (#124554 ) registering a meta for aten::_local_scalar_dense with a different error message. Fixes pytorch#119588 Co-authored-by: Edward Z. Yang <ezyang@mit.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124554 Approved by: https://github.com/ezyang	2024-05-30 00:50:29 +00:00
Jiashen Cao	d66f12674c	Handle tuple and dict during TorchScript to ExportedProgram conversion (#127341 ) * Add some test cases for testing List, Tuple, and Dict * Refactor the conversion code slightly * Add a logic to handle Dict Pull Request resolved: https://github.com/pytorch/pytorch/pull/127341 Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi	2024-05-30 00:08:09 +00:00
Lei Ding	f14dc3bde8	Fix check message (#126951 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126951 Approved by: https://github.com/Skylion007, https://github.com/kit1980	2024-05-29 23:58:09 +00:00
Edward Z. Yang	76fc58c160	Document the legacy constructor for Tensor (#122625 ) Fixes https://github.com/pytorch/pytorch/issues/122408 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122625 Approved by: https://github.com/albanD	2024-05-29 23:23:19 +00:00
Shan19900305	7931eee5c5	Support torch.dtype as parameter in pybind11 cpp extension. (#126865 ) Support torch.dtype as parameter in pybind11 cpp extension. Example: ` cpp_extension.my_ops(self, other, torch.dtype) ` @ezyang @bdhirsh Co-authored-by: Edward Z. Yang <ezyang@mit.edu> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126865 Approved by: https://github.com/ezyang	2024-05-29 23:19:32 +00:00
cyy	8ea1dc8748	Use Python::NumPy target (#127399 ) Now that we use FindPython, use it again for numpy detection. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127399 Approved by: https://github.com/malfet	2024-05-29 23:17:58 +00:00
lezcano	0fa2c5b049	Fix mask propagation in the presence of where (#125574 ) Before, when calling ops.where, masks were not properly propagated. We now restrict the optimisation to `ops.masked`, which I think is what the original code intended to do. I'm not 100% sure that this code introduces no bugs even in the masked case, but this is a strict improvement over the previous state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125574 Approved by: https://github.com/peterbell10 ghstack dependencies: #114471, #126783	2024-05-29 23:17:41 +00:00
Darshan Sanghani	15a7916c0e	Ability to capture Process Groups information into Execution Traces (#126995 ) Contains a method added to the ExecutionTraceObserver class to record the snapshot of the current process group config upon tracing start. Unit test: ``` (pytorch) [dsang@devgpu021.nha2 ~/github/pytorch-fork (viable/strict)]$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. warn("TorchScript support for functional optimizers is" test_ddp_profiling_execution_trace (__main__.TestDistBackendWithSpawn.test_ddp_profiling_execution_trace) ... /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. warn("TorchScript support for functional optimizers is" /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead. warn("TorchScript support for functional optimizers is" NCCL version 2.20.5+cuda12.0 [rank1]:[W523 16:06:01.705774398 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. 
Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank0]:[W523 16:06:01.705905760 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator()) [rank1]:[W523 16:06:01.715182258 execution_trace_observer.cpp:819] Enabling Execution Trace Observer printing pg info into trace [rank0]:[W523 16:06:01.715841805 execution_trace_observer.cpp:819] Enabling Execution Trace Observer printing pg info into trace [rank1]:[W523 16:06:01.727881877 execution_trace_observer.cpp:831] Disabling Execution Trace Observer [rank0]:[W523 16:06:01.728792871 execution_trace_observer.cpp:831] Disabling Execution Trace Observer Execution trace saved at /tmp/tmpdsov4ngi.et.json [{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': 
''}]}] Execution trace saved at /tmp/tmpsdiqy6az.et.json [{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}] ok ---------------------------------------------------------------------- Ran 1 test in 24.447s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126995 Approved by: https://github.com/briancoutinho, https://github.com/sraikund16	2024-05-29 23:16:17 +00:00
Nikita Shulga	3174e6cb8e	[Temp][CI] Run older MPS tests/Mac builds on MacOS 13 (#127428 ) To avoid ambiguity while migration outlined in https://github.com/pytorch-labs/pytorch-gha-infra/pull/399 is in progress. Otherwise, MPS jobs for Ventura can be accidentally scheduled on Sonoma or builds, which might result in flaky failures on trunk. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127428 Approved by: https://github.com/huydhn	2024-05-29 22:58:41 +00:00
PaliC	9257a0698b	[Split Build] Load dependencies from libtorch in __init__.py (#126826 ) This PR makes it such that we search for a libtorch wheel when initializing pytorch in order to find the necessary shared libraries. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126826 Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi	2024-05-29 22:03:50 +00:00
Wanchao Liang	b0ef363972	[dtensor] rename _Partial -> Partial for all imports (#127420 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127420 Approved by: https://github.com/awgu	2024-05-29 21:42:40 +00:00
Catherine Lee	d99b115eb3	Fix delete old branches workflow (#127442 ) The ubuntu runner started using 2.45.1 (prev 2.43.2), which includes `1f49f7506f` (changes +00:00 timezone to Z) Python versions prior to 3.11 do not support Z when parsing isoformat, so update the workflow to use 3.11 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127442 Approved by: https://github.com/huydhn, https://github.com/malfet	2024-05-29 21:17:09 +00:00
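The timezone issue described in the commit above can be reproduced with the standard library alone. The sketch below is not the workflow's actual code: `parse_git_timestamp` is a hypothetical helper that normalizes a trailing `Z`, so the same timestamp also parses on Python versions before 3.11.

```python
from datetime import datetime, timezone


def parse_git_timestamp(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC.

    Hypothetical helper for illustration; not part of the PyTorch workflow.
    """
    # datetime.fromisoformat() only accepts 'Z' from Python 3.11 onward,
    # so rewrite it as the equivalent '+00:00' offset first.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


parsed = parse_git_timestamp("2024-05-29T21:17:09Z")
```

The commit itself takes the other route and moves the workflow to Python 3.11, which avoids touching the parsing code at all.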
JackCaoG	38a33c3202	don't call .item in onehot for XLA (#127335 ) We found that `nn.functional.one_hot` will cause a graph break due to the item call in the native implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127335 Approved by: https://github.com/ezyang	2024-05-29 20:37:26 +00:00
cyy	84b5aa9a68	[Caffe2] [Reland] Remove Caffe2 proto files (#127394 ) Reland of #126134, which was reverted due to the wrong base. Now that #126705 has been relanded, it's time to reland this one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127394 Approved by: https://github.com/r-barnes	2024-05-29 20:37:02 +00:00
Yuanhao Ji	92d081e228	[Docs] Add `str` type to `cuda.get_device_name()` and `cuda.get_device_capability()` function (#126743 ) Fixes #126400 The `get_device_name()` and `get_device_capability()` functions allow passing in a string, but this is not stated in the doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126743 Approved by: https://github.com/eqy, https://github.com/kit1980	2024-05-29 20:09:52 +00:00
Kwanghoon An	24a4bfdcc2	[AdaRound] Make versatile for data / extra param for callback function (#126891 ) Summary: For a sequential speech model, there can be cases where model(data) does not work correctly as the feed-forward call: the speech model uses a different type of Criterion (a.k.a. loss function) to feed data to individual components such as the encoder, predictor, and joiner. Hence we need an extra parameter to pass a feed-forward wrapper. Differential Revision: D57680391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126891 Approved by: https://github.com/jerryzh168	2024-05-29 20:05:27 +00:00
Kwanghoon An	c404b2968c	Support min/max carry over for eager mode from_float method (#127309 ) Summary: After QAT is completed, or when a pre-tuned weight observer is supplied by a tunable PTQ algorithm, the observer should not be overwritten again using the given weight, at least never for static QAT. By design, dynamic QAT also does not need to re-run the weight observer. This is a fix. Test Plan: Signals Differential Revision: D57747749 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127309 Approved by: https://github.com/jerryzh168	2024-05-29 19:33:26 +00:00
Sam Larsen	82a370ae3a	Revert "Refresh OpOverloadPacket if a new OpOverload gets added (#126863 )" (#127366 ) This reverts commit ed734178abc99bc1d83ad2c61d3a1e4d4f5d20c8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127366 Approved by: https://github.com/zou3519	2024-05-29 19:26:06 +00:00
Jane Xu	05e99154ee	Allow int vals to go down the fastpath for _foreach_max (#127303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303 Approved by: https://github.com/albanD ghstack dependencies: #127187	2024-05-29 19:08:58 +00:00
Jane Xu	601c5e085d	Add _foreach_max (#127187 ) This PR adds _foreach_max support, the second reduction foreach op we have :D I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first. Caveats! - We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath! - MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later. - This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187 Approved by: https://github.com/albanD	2024-05-29 19:08:58 +00:00
Nikita Shulga	90f4b3fcb2	PyTorch Distributed security assumptions (#127403 ) To highlight, that PyTorch Distributed should only be used in a trusted environment and never on the nodes with open network access, which is very similar in spirit to https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md#running-a-tensorflow-server Thanks to @Xbalien and @K1ingzzz for drawing attention to missing documentation on distributed workloads security assumptions Pull Request resolved: https://github.com/pytorch/pytorch/pull/127403 Approved by: https://github.com/wconstab	2024-05-29 19:08:20 +00:00
laithsakka	5196ef1b59	support builtin id function on user defined object variables. (#127146 ) Fix: https://github.com/pytorch/pytorch/pull/127146 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127146 Approved by: https://github.com/anijain2305 ghstack dependencies: #126444	2024-05-29 19:00:37 +00:00
lancerts	ff65b18fcf	Update the is_causal explanation in the SDPA doc (#127209 ) Fixes #126873 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127209 Approved by: https://github.com/drisspg	2024-05-29 18:53:17 +00:00
cyy	9cc0d56fdc	Remove unused variables in tests (#127379 ) Reland test fixes in #127161 and reduce reduce_ops_test into floating point types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127379 Approved by: https://github.com/ezyang	2024-05-29 18:30:51 +00:00
Xu Zhao	d938170314	Add torchao nightly testing workflow (#126885 ) Add and test torchao nightly testing workflow. This workflow will be triggered under the following conditions: 1. If the PR has ciflow/torchao label 2. Manual trigger It will run the torchao benchmark on torchbench/timm/huggingface model workloads with 5 configs (noquant, autoquant, int8dynamic, int8weightonly, int4weightonly). The output will be updated to the PT2 Dashboard: https://hud.pytorch.org/benchmark/compilers Pull Request resolved: https://github.com/pytorch/pytorch/pull/126885 Approved by: https://github.com/huydhn	2024-05-29 18:22:29 +00:00
Scott Wolchok	090a031d6f	Use bit_cast instead of UB type-pun-via-union in Half.h (#127321 ) Summary: Type punning via union has undefined behavior due to the strict aliasing rule. bit_cast does the same thing safely (using memcpy under the hood). Test Plan: CI Godbolt demonstrates that doing this via memcpy still generates the same instructions: https://godbolt.org/z/PhePzd4Ex Pull Request resolved: https://github.com/pytorch/pytorch/pull/127321 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-05-29 17:43:50 +00:00
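The idea behind `bit_cast` in the commit above is to reinterpret a value's bytes by copying them rather than by aliasing. As a rough Python analogue only (this is not the C++ code in Half.h), the standard `struct` module can do the same byte-level reinterpretation for half-precision floats; `half_bits` and `bits_to_half` are hypothetical helper names.

```python
import struct


def half_bits(value: float) -> int:
    """Return the 16-bit pattern of `value` stored as an IEEE 754 half float."""
    # Pack as a little-endian half float ('e'), then reread the same two
    # bytes as an unsigned 16-bit integer ('H'): a copy, not an alias.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bits_to_half(bits: int) -> float:
    """Inverse of half_bits: reinterpret a 16-bit pattern as a half float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


one_bits = half_bits(1.0)
```

Because the bytes are copied between two typed views, no value is ever read through a type it was not written as, which is the property the memcpy-based `bit_cast` provides in C++.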
Danielle Pintz	8b5cbb7c68	Improve NLLLoss docs (#127346 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127346 Approved by: https://github.com/mikaylagawarecki	2024-05-29 17:29:06 +00:00
rzou	28de9143a3	opcheck should be usable without optional dependencies (#127292 ) This PR excises opcheck's dependency on torch.testing._internal.common_utils, (which comes with dependencies on expecttest and hypothesis). We do this by moving what we need to torch.testing._utils and adding a test for it. Fixes #126870, #126871 Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/127292 Approved by: https://github.com/williamwen42 ghstack dependencies: #127291	2024-05-29 17:17:49 +00:00
Pian Pawakapan	8a31c2aa84	[export] allow complex guards as runtime asserts (#127129 ) With the current state of export's dynamic shapes, we struggle with guards and constraints that are beyond the current dynamic shapes language, expressed with dims and derived dims. While we can compile and guarantee correctness for guards within the current language (e.g. min/max ranges, linear relationships, integer divisibility) we struggle to dynamically compile guards which extend beyond that. For these "complex" guards, we typically do either of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs. In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](`d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)`), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph. Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. 
Popular examples revolve around reshapes:
```
# reshape
def forward(self, x, y):  # x: [s0, s1], y: [s2]
    return x.reshape([-1]) + y  # guard s0 * s1 = s2
```
This leads to the following exported program:
```
class GraphModule(torch.nn.Module):
    def forward(self, x: "f32[s0, s1]", y: "f32[s2]"):
        sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0)
        mul: "Sym(-s2)" = -1 * sym_size_int; sym_size_int = None
        sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0)
        sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1)
        mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2; sym_size_int_1 = sym_size_int_2 = None
        add: "Sym(s0*s1 - s2)" = mul + mul_1; mul = mul_1 = None
        eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0; add = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'"); eq = None
        view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]); x = None
        add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y); view = y = None
        return (add_1,)
```
Another case is symbol divisibility:
```
def forward(self, x):  # x: [s0, s1]
    return x.reshape([-1, x.shape[0] - 1])  # Eq(Mod(s0*s1, s0 - 1), 0)
```
Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        # check that negation of first guard also shows up as runtime assertion
        if x.shape[0] == y.shape[0]:  # False
            return x + y
        elif x.shape[0] == y.shape[0] ** 3:  # False
            return x + 2, y + 3
        elif x.shape[0] ** 2 == y.shape[0] * 3:  # True
            return x * 2.0, y * 3.0
```
For the above graph we will generate 3 runtime assertions: the negation of the first 2, and the 3rd condition as a guard. 
One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given. As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related. Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value: see this [PR](https://github.com/pytorch/pytorch/pull/126925) This PR doesn't change any behavior around data-dependent errors/unbacked symints yet, that could be further work. NOTE: will take naming change suggestions for the flag :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129 Approved by: https://github.com/avikchaudhuri	2024-05-29 17:15:25 +00:00
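To make the reshape guard from the commit above concrete, here is a minimal pure-Python sketch of what a runtime assertion for `s0 * s1 = s2` checks. It does not use `torch.export`; `check_reshape_guard` is a hypothetical stand-in for the `_assert_scalar` node in the exported graph, operating on plain shape tuples.

```python
def check_reshape_guard(x_shape, y_shape):
    """Mimic the runtime assertion for the guard s0 * s1 = s2.

    Hypothetical illustration: takes the shapes of x ([s0, s1]) and y ([s2])
    and returns the shape of x.reshape([-1]) + y if the guard holds.
    """
    s0, s1 = x_shape
    (s2,) = y_shape
    # Same arithmetic as the graph: build s0*s1 - s2 and compare it with 0.
    if s0 * s1 - s2 != 0:
        raise AssertionError(
            "Runtime assertion failed for guard s0 * s1 = s2: "
            f"s0={s0}, s1={s1}, s2={s2}"
        )
    return (s0 * s1,)


ok_shape = check_reshape_guard((2, 3), (6,))
```

Inputs such as `(2, 3)` and `(5,)` compile fine under this scheme but fail at call time, which is the trade-off the flag makes: dynamism is kept and correctness is enforced when the program runs.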
Richard Barnes	cc6e72d882	Drop caffe2 core tests and some other stuff (#127089 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127089 Approved by: https://github.com/Skylion007	2024-05-29 17:11:45 +00:00
cyy	e8e327ba82	Cover clang-tidy to torch/csrc/onnx/init.cpp (#127393 ) Enabling it will not cause issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127393 Approved by: https://github.com/Skylion007	2024-05-29 17:05:28 +00:00
cyy	7de1352457	[1/N] Replace exceptions with static_assert(false) in some templates (#127371 ) This PR tries to report some failures at build time. Once the build fails, it generally indicates that we can wrap the code inside some conditional macros, and it is a hint to further reduce the built code size. The sizeof operations were used to ensure that the assertion depends on specific template instantiations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127371 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2024-05-29 16:14:00 +00:00
cyy	c69562caf9	[Caffe2]Remove more caffe2 files (#126628 ) They are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126628 Approved by: https://github.com/albanD	2024-05-29 16:08:48 +00:00
Andrew M. James	80a8fc07b2	[dynamo] Handle np.iinfo/finfo/dtype as input (#124482 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124482 Approved by: https://github.com/lezcano ghstack dependencies: #124481	2024-05-29 16:00:15 +00:00
Derek	9a8e8101a8	Fix wording in nn.Linear docstring. (#127240 ) Definition (Linear Transformation): A mapping $T : V \to W$ between $F$-vector spaces $V,W$ is called a linear transformation if and only if a) $T(u+v)=T(u)+T(v)$, b) $T(cv)=cT(v)$ for all $u, v \in V$, $c \in F$. Consequently, $T(0_V)=0_W$. Thus $x \mapsto xA^T+b$ for nonzero $b$ is not a linear transformation, but is often referred to as an affine linear transformation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127240 Approved by: https://github.com/soulitzer, https://github.com/albanD	2024-05-29 14:55:40 +00:00
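The definition in the commit above can be checked numerically. The following is a small sketch in plain Python (no PyTorch); `affine` is a hypothetical helper computing $xA^T + b$, and the matrix and vectors are arbitrary illustrative values.

```python
def affine(x, A, b):
    """Compute x A^T + b for a vector x, a matrix A (one row per output), bias b."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]


A = [[1.0, 2.0], [3.0, 4.0]]
b = [1.0, -1.0]          # nonzero bias
u, v = [1.0, 0.0], [0.0, 1.0]

# Additivity test: T(u + v) versus T(u) + T(v).
t_of_sum = affine([ui + vi for ui, vi in zip(u, v)], A, b)
sum_of_t = [p + q for p, q in zip(affine(u, A, b), affine(v, A, b))]

# A linear map must send the zero vector to zero; this one returns b instead.
t_of_zero = affine([0.0, 0.0], A, b)
```

Here `t_of_sum` and `sum_of_t` differ by exactly `b`, and `t_of_zero` equals `b`, so both properties fail whenever the bias is nonzero. Setting `b = [0.0, 0.0]` restores additivity.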
Andrew M. James	ade075444f	[dynamo] Support numpy.dtype (#124481 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124481 Approved by: https://github.com/lezcano	2024-05-29 14:45:14 +00:00
Aaron Gokaslan	bf966588f1	[BE][Ez]: Update cudnn_frontend submodule to v1.4.0 (#127175 ) Updates the cudnn_frontend submodule to the latest 1.4.0 version. Should be a straightforward, header-only submodule update. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127175 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-05-29 14:23:38 +00:00
Nikita Shulga	0910429d72	[BE][CMake] Use FindPython module (#124613 ) FindPythonInterp and FindPythonLibs have been deprecated since cmake-3.12. Replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive). This makes PyTorch buildable with the python3 binary shipped with Xcode on macOS. TODO: Get rid of `FindNumpy` as it's part of the Python package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613 Approved by: https://github.com/cyyever, https://github.com/Skylion007	2024-05-29 13:17:35 +00:00
Bin Bao	942d9abd66	[AOTI] Update reinplace to cover mutated buffer (#127297 ) Summary: Unlike JIT Inductor, AOTI currently unlifts weights and buffers from input args, so the reinplace pass didn't really work for AOTI because it only checks mutation on placeholder, which led to excessive memory copies for kv_cache updates in LLM models. This PR removes those memory copies and roughly offers a 2x speedup. In the future, we will revert the unlift logic in AOTI and make the behavior consistent with JIT Inductor. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127297 Approved by: https://github.com/peterbell10, https://github.com/chenyang78	2024-05-29 13:07:53 +00:00
Richard Barnes	af69a52f06	Reapply "Remove more of caffe2 (#126705 )" (#127317 ) This reverts commit 00fe0a0d795680ade029fc552f33fffed75c0250. Originally was unnecessarily reverted by an oncall. Landing again. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127317 Approved by: https://github.com/izaitsevfb	2024-05-29 12:20:25 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
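The fallback path described in the entry above (passing `category=FutureWarning` to `warnings.warn`) can be sketched with the standard library alone. `old_api` and `new_api` are hypothetical names invented for this sketch; the PR's preferred route, the `typing_extensions.deprecated` decorator, is not shown because it needs a third-party package. One reason to prefer `FutureWarning` is that Python shows it to end users by default, while `DeprecationWarning` is ignored by default unless triggered from code in `__main__`.

```python
import warnings


def new_api(x):
    """Hypothetical replacement function."""
    return x * 2


def old_api(x):
    """Hypothetical deprecated function that forwards to new_api."""
    warnings.warn(
        "old_api is deprecated, use new_api instead",
        category=FutureWarning,  # explicit category instead of the default UserWarning
        stacklevel=2,            # attribute the warning to the caller, not to old_api itself
    )
    return new_api(x)


# Capture the warning so the sketch can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api(21)

print(result, caught[0].category.__name__)
```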
cyy	699db7988d	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-29 11:58:03 +00:00
Valeriu	02b1cdab23	[Sync torch_FA2 and FA2 flash_api] + [Expose seqused_k & alibi_slopes arguments] (#126520 ) 1. Expose seqused_k & alibi_slopes arguments: - This can be used when your sequence length k is not the full extent of the tensor. This is useful for kv cache scenarios and was not previously supported in the FA2 TORCH integration. We need these arguments for external xformers lib call to the _flash_attention_forward API. Before: ``` std::optional<Tensor> seqused_k = c10::nullopt; std::optional<Tensor> alibi_slopes = c10::nullopt; ``` After: ``` _flash_attention_forward(... std::optional<Tensor>& seqused_k, std::optional<Tensor>& alibi_slopes, ``` 2. There is a difference between the TORCH_FA2_flash_api:mha_fwd and FA2_flash_api:mha_fwd (same for mha_varlen_fwd) at the query transposition (GQA) step. The CHECK_SHAPE is applied on the original query vs the reshaped query. This causes an error (because of the shape constraint) for such inputs: ``` q = torch.randn([7, 1, 4, 256], dtype=torch.bfloat16, device='cuda') k = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda') v = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda') ``` ![image](https://github.com/pytorch/pytorch/assets/927999/77ea6bf6-b6e9-4f3f-96a9-8d952956ddd9) - I've modified the code as little as possible, but if you prefer a more verbose change like the following, don't hesitate to tell me: ``` at::Tensor swapped_q = seqlenq_ngroups_swapped ? q.reshape({batch_size, num_heads_k, num_heads / num_heads_k, head_size_og}).transpose(1, 2) : q; if (seqlenq_ngroups_swapped) { seqlen_q = num_heads / num_heads_k; num_heads = num_heads_k; } CHECK_SHAPE(swapped_q, batch_size, seqlen_q, num_heads, head_size_og); ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126520 Approved by: https://github.com/drisspg	2024-05-29 11:54:44 +00:00
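The shape bookkeeping in the verbose C++ variant quoted above can be traced with plain tuples. This is a rough sketch under stated assumptions: `swapped_query_shape` is a hypothetical helper, the `swapped` flag stands in for `seqlenq_ngroups_swapped` (whose exact trigger condition is not given in the entry), and no tensors are involved, only shapes.

```python
def swapped_query_shape(q_shape, num_heads_k, swapped):
    """Return the query shape that CHECK_SHAPE should see after the optional GQA swap.

    q_shape is (batch_size, seqlen_q, num_heads, head_size).
    """
    batch_size, seqlen_q, num_heads, head_size = q_shape
    if not swapped:
        return q_shape
    # The reshape only preserves the element count when seqlen_q == 1.
    assert seqlen_q == 1, "swap sketch assumes a single query position"
    assert num_heads % num_heads_k == 0, "query heads must group evenly over kv heads"
    ngroups = num_heads // num_heads_k
    # reshape -> (batch, num_heads_k, ngroups, head_size); transpose(1, 2) -> shape below.
    # After the swap: seqlen_q = ngroups and num_heads = num_heads_k.
    return (batch_size, ngroups, num_heads_k, head_size)


# Shapes taken from the failing inputs in the entry above.
q_shape = (7, 1, 4, 256)
k_shape = (7, 51, 1, 256)
checked_shape = swapped_query_shape(q_shape, num_heads_k=k_shape[2], swapped=True)
print(checked_shape)
```

Checking the original `(7, 1, 4, 256)` query against the swapped `seqlen_q = 4` and `num_heads = 1` is the mismatch the entry describes; checking the reshaped query is consistent.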
Jiong Gong	dae33a4961	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-29 11:15:41 +00:00
WeihaoCui	65af1a9c26	FIX the document of distributed.new_group() (#122703 ) Currently, the documentation of distributed.new_group() says that it returns `None` when the current rank is not in the newly created process group. However, it actually returns `GroupMember.NON_GROUP_MEMBER`. I have checked the code and think it is more appropriate to fix the documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122703 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-05-29 09:40:25 +00:00
Sam Larsen	6c81856dca	[inductor] Add a subprocess-based parallel compile (#126816 ) Summary: Adds a "safe" parallel compile implementation that a) Popens a sub-process with an entry point we control, and b) Uses a ProcessPoolExecutor in that sub-process to perform parallel compiles. This change essentially squashes these two implementations from jansel, but removes the "thread-based" approach since benchmarking revealed that compile-time performance was poor compared to the existing impl: https://github.com/pytorch/pytorch/pull/124682 https://github.com/pytorch/pytorch/pull/122941 This PR adds the implementation, but defaults to the existing "fork". I'll submit a separate change to enable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126816 Approved by: https://github.com/jansel	2024-05-29 09:40:21 +00:00
Jiong Gong	92bc444ee3	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-29 09:12:03 +00:00
lezcano	00999fd8b1	Prefer flip over index_select (#126783 ) It's faster and has a lower memory footprint in eager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126783 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #114471	2024-05-29 09:10:25 +00:00
lezcano	8a21532e53	Fix constant propagation pass (#114471 ) This pass was broken in a number of ways, as we were not generating asserts whenever we took it, even though we need to. While doing so, we found that the analysis we were using for choosing whether to generate asserts or not for dynamic shapes was completely broken. Eliminating indirect indexing in this way allows for a number of optimisations. In particular, we can now fuse against these kernels (indirect indexing disallows fusions). The new strategy is as follows: - We always propagate sympy expressions if we can. - If an expression was an indirect_indexing, we call `check_bounds` - We also call `check_bounds` within `CSEProxy.indirect_indexing` - The checks are issued in the buffer where they would go if they were used in a load - This makes them always be codegen'd before the load and stores - In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine. We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure that issuing an assert plays well with all kinds of C++ vectorisation. For now, we rely on the logic within `_maybe_evaluate_static` to prove these bounds. This logic is rather limited though. In the future, we might want to rely on Z3 here to be able to prove bounds in a more general way. Supersedes https://github.com/pytorch/pytorch/pull/113068 Fixes https://github.com/pytorch/pytorch/issues/121251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471 Approved by: https://github.com/peterbell10	2024-05-29 09:10:25 +00:00
Animesh Jain	51b22d9cf2	[dynamo] Support enum construction (#127364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127364 Approved by: https://github.com/yanboliang ghstack dependencies: #127263	2024-05-29 08:09:21 +00:00
Jason Ansel	ad7700bfdb	[inductor] Misc changes (#127307 ) Pulling unrelated changes out of the larger halide PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/127307 Approved by: https://github.com/yanboliang	2024-05-29 08:00:06 +00:00
Jiong Gong	cef776bcd1	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info. 1. Cpp template infrastructure Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates. 2. Initial FP32 gemm template This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction. 3. Correctness and performance The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs including the optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details. 
Static shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \|------------\|-------------\|--------------\|--------------\| \| Multi-threaded (baseline) \| 1.47x \| 1.36x \| 1.91x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.36x \| 1.92x \| \| Single-threaded (baseline) \| 1.56x \| 1.19x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.52x \| Key models being sped up: drq: 1.14x soft_act: 1.12 cait_m36_384: 1.18x Dynamic shapes \| Benchmark \| torchbench \| huggingface \| timm_models \| \| --- \| --- \| --- \| --- \| \| Multi-threaded (baseline) \| 1.43x \| 1.28x \| 1.85x \| \| Multi-threaded (max-autotune) \| 1.47x \| 1.28x \| 1.85x \| \| Single-threaded (baseline) \| 1.55x \| 1.20x \| 1.51x \| \| Single-threaded (max-autotune) \| 1.56x \| 1.19x \| 1.53x \| Key models being sped up: BERT_pytorch: 1.22x pyhpc_turbulent: 1.13x soft_actor_critic: 1.77x BlenderbotForCausalLM: 1.09x cait_m36_384: 1.17x Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-29 07:37:41 +00:00
William Wen	719589c9bf	[dynamo] move bytecode tests from test_misc to new bytecode test file (#127329 ) Also merge with bytecode hook test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127329 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-05-29 06:10:59 +00:00
Wanchao Liang	a60b06bd2b	[dtensor] update public API docs (#127340 ) This PR updates the API documentation for the public-facing APIs. Each API needs more examples, but the plan is to add them in a separate PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127340 Approved by: https://github.com/wz337 ghstack dependencies: #127338, #127339	2024-05-29 05:18:47 +00:00
Wanchao Liang	2c9a420da3	[dtensor] move some modules to private namespace (#127339 ) as titled, moving some modules that are mainly for DTensor private usage to be a private module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127339 Approved by: https://github.com/awgu ghstack dependencies: #127338	2024-05-29 05:18:47 +00:00
Wanchao Liang	72ef2555e3	[dtensor] make Partial placement public (#127338 ) As titled, partial placement is standardized right now and I think we would want to expose this as a public API to allow users to annotate the sharding layout more easily, given that we already have internal and external use cases that use Partial. Keeping the old _Partial name for a while for BC reasons. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127338 Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/kwen2501	2024-05-29 05:18:47 +00:00
William Wen	5359af0c7e	[dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341 ) Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-05-29 05:18:04 +00:00
laithsakka	cdf2133186	Add compile time profiler for non fbcode targets (#126904 ) This is a tool that allow profiling compile time using strobelight profiler, its a meta only tool. but works on non-fbcode targets. A follow up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py. example test: ``` run python tools/strobelight/examples/compile_time_profile_example.py ``` ``` python torch/utils/_strobelight/examples/compile_time_profile_example.py strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282
334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330 strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events. strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497 strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events. 
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558 strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events. ``` or pass TORCH_COMPILE_STROBELIGHT=TRUE for any torch compile python program. ex running on XLNetLMHeadModel. ``` TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel ``` result: Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904 Approved by: https://github.com/aorenste ghstack dependencies: #126444	2024-05-29 05:06:37 +00:00
Boyuan Feng	2b72e2a596	[Cudagraph] better support for streams (#126809 ) This PR fixes Issue #124391. There are two root causes. ### Root Cause 1 [better support for stream during cudagraph capture] When recording a new function, CUDA graph tree records memory block states (e.g., address, size, allocated, etc) via `getCheckpointState`. Let's say the record is called `block_state`. Later, CUDA graph tree would like to recover exactly the same memory block states by `apply_checkpoint_execution_state_in_allocator`, which a) frees all memory blocks; b) allocates all recorded block states (regardless of `block_state->allocated`); c) frees blocks with `block_state->allocated == False`; and d) checks that block_state matches the remaining blocks (e.g., `block_state->ptr == block->ptr`). An error may occur when multiple streams exist during recording. [Note](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2149-L2152) that a block will not be merged with other blocks if it is used by some streams, even if `block->allocated==False`. This may lead to a mismatch between `block_state->ptr` and `block->ptr` in `apply_checkpoint_execution_state_in_allocator`. This PR solves the issue by not inserting events when they come from a stream used during cudagraph capture. The reason is that we know all events or streams used during cudagraph capture must have completed before cudagraph capture finishes. ### Root Cause 2 [fix a bug in checkpoint state] When we call getCheckpointState, we create the block state. At that time, we do not record block->device, so block_state->device == 0 no matter the real value of block->device. See [how](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L744-L750) BlockState is created from a block. When the block state is used during setSegmentStateToCheckpoint, we use [block_state.device (=0)](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L1526). This leads to errors. 
We fixed this issue by recording block->device into block_state in getCheckpointState. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126809 Approved by: https://github.com/eellison	2024-05-29 04:52:35 +00:00
Shengbao Zheng	a41f828da7	[c10d] fix group_name/group_desc set up in eager initialization (#127053 ) Summary: ProcessGroupNCCL sets up group_name/desc in the c10d log and NCCL when initializing the NCCL communicator. In eager initialization mode, pg_name and pg_desc are set after communicator initialization, so the information won't be available in the PyTorch log or the NCCL communicator. This PR fixes this by setting pg_name/desc earlier. Differential Revision: D57759816 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127053 Approved by: https://github.com/wconstab, https://github.com/kwen2501	2024-05-29 04:42:34 +00:00
dshi7	932e04142d	extract calculate_time_spent from print_time_report (#127362 ) Fixes #ISSUE_NUMBER wrap certain steps in a separate function for easier TTFB instrumentation (fb internal use case) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127362 Approved by: https://github.com/yanboliang, https://github.com/mengluy0125	2024-05-29 04:37:15 +00:00
PaliC	a25b28a753	[Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328 ) Creates an option to just build the libtorch portion of pytorch such that we have the necessary .so files. Then it builds a torch package using the libtorch wheel. These options are enabled using ` BUILD_LIBTORCH_WHL` and `BUILD_PYTHON_ONLY`. We run ``` BUILD_LIBTORCH_WHL=1 python setup.py install python setup.py clean BUILD_PYTHON_ONLY=1 python setup.py install ``` to produce ``` sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ls /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/ (pytorch-3.10) libshm.so* libtorch_global_deps.so* libtorch_python.so* sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ldd build/lib/libtorch_python.so (pytorch-3.10) linux-vdso.so.1 (0x00007ffdc2d37000) libtorch.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch.so (0x00007f539fe99000) libshm.so => /home/sahanp/pytorch/build/lib/libshm.so (0x00007f539fe90000) libcudnn.so.8 => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007f539e800000) libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f539e400000) libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f539e000000) libm.so.6 => /lib64/libm.so.6 (0x00007f539fda5000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f539ebe5000) libc.so.6 => /lib64/libc.so.6 (0x00007f539dc00000) /lib64/ld-linux-x86-64.so.2 (0x00007f539fea0000) libtorch_cpu.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cpu.so (0x00007f5392400000) libtorch_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cuda.so (0x00007f5380000000) librt.so.1 => /lib64/librt.so.1 (0x00007f539fd9e000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f539fd99000) libdl.so.2 => /lib64/libdl.so.2 (0x00007f539fd94000) libc10.so => 
/home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10.so (0x00007f539eb07000) libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007f537ec00000) libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007f537ce00000) libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007f5378800000) libomp.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/libomp.so (0x00007f539e707000) libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x00007f5377e00000) libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f5377a00000) libc10_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10_cuda.so (0x00007f539ea6a000) libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007f5368400000) libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007f535ee00000) libcusolver.so.11 => /usr/local/cuda/lib64/libcusolver.so.11 (0x00007f534c800000) libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f5346200000) libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f533f800000) libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f531e800000) libutil.so.1 => /lib64/libutil.so.1 (0x00007f539ea63000) libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f531b800000) sahanp@devgpu086 ~/pytorch (detached HEAD\|REBASE-i 3/5)> ldd build/lib/libtorch_global_deps.so (pytorch-3.10) linux-vdso.so.1 (0x00007ffc265df000) libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007fa93fc00000) libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007fa93de00000) libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007fa939800000) libm.so.6 => /lib64/libm.so.6 (0x00007fa940f05000) libcudart.so.12 => 
/usr/local/cuda/lib64/libcudart.so.12 (0x00007fa939400000) libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa939000000) libgomp.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgomp.so.1 (0x00007fa93fb07000) libc.so.6 => /lib64/libc.so.6 (0x00007fa938c00000) libdl.so.2 => /lib64/libdl.so.2 (0x00007fa940efe000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa940ef9000) /lib64/ld-linux-x86-64.so.2 (0x00007fa940ff5000) librt.so.1 => /lib64/librt.so.1 (0x00007fa940ef2000) libstdc++.so.6 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libstdc++.so.6 (0x00007fa93921d000) libgcc_s.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgcc_s.so.1 (0x00007fa93faec000) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126328 Approved by: https://github.com/atalman	2024-05-29 04:33:56 +00:00
Ke Wen	8090145936	[pipelining] add back support for multi-use parameters/buffers (#126653 ) ## Motivation Resolves #126626 to support TorchTitan. With this PR, we add back support for cases where a parameter or buffer is used in multiple stages. An example of such usage is in LLaMA (torchtitan), code snippet: ``` for layer in self.layers.values(): h = layer(h, self.freqs_cis) ``` ## Solution Step 1: Remove the previous guards of `if len(node.users) == 1`. Step 2: Call `move_param_to_callee` multiple times, one for each stage ("callee"). Step 3: Delay deletion of the `get_attr` node (for getting the param) from root till this param has been sunk into each stage that uses it. The PR also cleans up the old code around this (dropping the TRANSMIT mode and supporting REPLICATE mode only). ## Test Changed the `ExampleCode` model to use `mm_param1` in multiple stages. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126653 Approved by: https://github.com/pianpwk	2024-05-29 03:36:47 +00:00
Jon Janzen	781f26240a	Add script to copy distributed commits to stable branch (#126918 ) This will be used as part of a prototype of a stable pytorch with a fast-moving distributed folder Tasks: T189915739 Test plan: I ran the script in a few configurations on my local machine. It worked as expected Pull Request resolved: https://github.com/pytorch/pytorch/pull/126918 Approved by: https://github.com/seemethere, https://github.com/malfet	2024-05-29 03:33:44 +00:00
Jiashen Cao	10d2373abd	Add a registry for GraphModuleSerializer (#126550 ) This PR adds a registration function and a global registry for GraphModuleSerializer. After this PR, custom serialization methods can be done through registration instead of subclassing for ease of maintenance. ## Changes - Add a test case where it injects custom op to test serialization. - Add custom op handler - Change allowed op for verifier Co-authored-by: Zhengxu Chen <zhxchen17@outlook.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126550 Approved by: https://github.com/zhxchen17	2024-05-29 03:12:48 +00:00
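The "registration instead of subclassing" idea in the entry above follows a common registry pattern. The sketch below illustrates that general pattern only: `register_handler`, `serialize_node`, the registry dict, and the op name `my_namespace.custom_op` are all hypothetical and are not the `GraphModuleSerializer` API added by the PR.

```python
from typing import Callable, Dict

# Hypothetical global registry mapping an op name to its serialization handler.
_HANDLERS: Dict[str, Callable[[tuple], dict]] = {}


def register_handler(op_name: str):
    """Decorator that records a handler for op_name in the global registry."""
    def decorator(fn: Callable[[tuple], dict]) -> Callable[[tuple], dict]:
        _HANDLERS[op_name] = fn
        return fn
    return decorator


def serialize_node(op_name: str, args: tuple) -> dict:
    """Dispatch to the registered handler; unknown ops are rejected explicitly."""
    handler = _HANDLERS.get(op_name)
    if handler is None:
        raise ValueError(f"no serialization handler registered for {op_name!r}")
    return handler(args)


# A custom op plugs in by registering a function, with no serializer subclass needed.
@register_handler("my_namespace.custom_op")
def _serialize_custom_op(args: tuple) -> dict:
    return {"target": "my_namespace.custom_op", "args": list(args)}


serialized = serialize_node("my_namespace.custom_op", (1, 2))
print(serialized)
```

The design trade-off is that a registry lets independent packages add handlers without sharing a class hierarchy, at the cost of global mutable state that tests may need to reset.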
PyTorch MergeBot	cdbb2c9acc	Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 )" This reverts commit 4fdbaa794f9d5af2f171f772a51cb710c51c925f. Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))	2024-05-29 03:02:35 +00:00
PyTorch MergeBot	7a506dd005	Revert "[Caffe2]Remove Caffe2 proto files (#126134 )" This reverts commit a40658481ada9ecfd5716513a8537818c79cb3ef. Reverted https://github.com/pytorch/pytorch/pull/126134 on behalf of https://github.com/malfet due to Broke bazel builds, see https://github.com/pytorch/pytorch/actions/runs/9278148147/job/25528691981 ([comment](https://github.com/pytorch/pytorch/pull/126134#issuecomment-2136373096))	2024-05-29 01:53:45 +00:00
cyy	669560d51a	Use hidden visibility in OBJECTCXX files (#127265 ) Since it can eliminate some linker warnings on MacOS Pull Request resolved: https://github.com/pytorch/pytorch/pull/127265 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-05-29 01:40:23 +00:00
PyTorch MergeBot	52e448a7f9	Revert "Enable Wunused-variable on tests (#127161 )" This reverts commit 6436a6407d9d65c42efb8e55beeb8b391b67fd64. Reverted https://github.com/pytorch/pytorch/pull/127161 on behalf of https://github.com/malfet due to Broke ReduceTests on Windows (by testing more), see https://github.com/pytorch/pytorch/actions/runs/9274944325/job/25519484937 ([comment](https://github.com/pytorch/pytorch/pull/127161#issuecomment-2136339435))	2024-05-29 01:09:45 +00:00
Michael Hsu	85172fbe84	Back out "Prevent partitioner from ever saving views (#126446 )" (#127316 ) Summary: Revert "Prevent partitioner from ever saving views (#126446)" due to a torchinductor failure on CU Training Framework tests. Reviewed By: Chillee Differential Revision: D57868343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127316 Approved by: https://github.com/Chillee	2024-05-29 00:29:44 +00:00
cyy	a40658481a	[Caffe2]Remove Caffe2 proto files (#126134 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126134 Approved by: https://github.com/r-barnes	2024-05-29 00:22:14 +00:00
Janani Sriram	f4cbcff8ef	[TorchScript] Expand TorchScript __init__ annotation warning (#127045 ) Summary: Expand TorchScript `__init__` annotation warning to `list` and `dict` with reference to GSD task T187638414 and annotation warning reproduction D56834720. Currently, the TorchScript compiler ignores and throws `UserWarning`s for the following annotation types for empty values within the `__init__` function: `List`, `Dict`, `Optional`. However, the compiler should additionally cover warnings for `list` and `dict`. This diff adds support for `list` and `dict`. Test Plan: Added 4 new unit tests: `test_annotated_empty_list_lowercase` and `test_annotated_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values. ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_list_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_dict_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` `test_annotated_with_jit_empty_list_lowercase` and `test_annotated_with_jit_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values with the jit annotation. ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_list_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. 
Build failure 0 ``` ``` (base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_dict_lowercase ... Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 ``` Differential Revision: D57752002 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127045 Approved by: https://github.com/davidberard98	2024-05-28 23:49:10 +00:00
Richard Barnes	1be7e4086a	Drop caffe2 nomnigraph (#127086 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127086 Approved by: https://github.com/Skylion007	2024-05-28 23:20:46 +00:00
Peter Bell	f6ef832e87	[inductor] Use symbolic_hint when bounding fallback size hint (#127262 ) The previous fallback ignores any known hint values in the expression and only looks at the value ranges. By using the `symbolic_hint` we will use both hints and value ranges. Also removed the recursive use of `size_hint` on the bounds, since these should always be constants. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127262 Approved by: https://github.com/lezcano ghstack dependencies: #127251	2024-05-28 22:51:45 +00:00
Peter Bell	26a8fa3a06	[inductor] Restore ExpandView sanity checks (#127251 ) This restores the assertion removed in #124864 The handling of unbacked symints is incidental, the main purpose of this assert was to catch bugs in lowerings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127251 Approved by: https://github.com/lezcano	2024-05-28 22:51:45 +00:00
Andrew Gu	db0a0ecb60	[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024 ) This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank. This was motivated from an ask on Slack :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024 Approved by: https://github.com/weifengpy, https://github.com/wanchaol	2024-05-28 22:51:36 +00:00
Anshul Sinha	6b24155827	[dtensor][debug] added c10d gather, reduce, scatter tracing to CommDebugMode (#127134 ) Summary Added c10d gather, reduce, and scatter tracing to CommDebugMode and edited test case in test_comm_mode to include added features. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127134 Approved by: https://github.com/XilunWu ghstack dependencies: #127025, #127029, #127040	2024-05-28 22:48:07 +00:00
eqy	a76faff71c	[NCCL][CUDA] Optionally avoid rethrowing CUDA Errors in NCCL Watchdog (#126587 ) Doesn't affect current behavior by default, for #126544 I'm not sure what the exact mechanism is here but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. However this rethrown error causes the process to be terminated as it cannot be handled from user code (which doesn't have visibility of the watchdog thread). Pull Request resolved: https://github.com/pytorch/pytorch/pull/126587 Approved by: https://github.com/kwen2501	2024-05-28 22:17:15 +00:00
eellison	93bfe57144	cudagraphs: fix backward hooks & fsdp interaction (#126914 ) Fixes > ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE Error that would occur when composing pt2 fsdp and cudagraphs. Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations. from code comment: ``` # this output represents a fresh allocated tensor. # We return the same TensorImpl from run to run to avoid overhead. # autograd.Function will reset the Autograd meta of output tensors # as part of aot_autograd, but _backward_hooks are stored on tensors separately, # so we need to manually reset hooks. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914 Approved by: https://github.com/awgu, https://github.com/xmfan	2024-05-28 22:07:41 +00:00
Chirag Pandya	4154c8358a	[BE] Wrap store check in a try/catch (#127030 ) Summary: Global store may already have been destroyed when we do the check. This leads to a Null Pointer Exception. This caused a SEV in Production. Stack trace from crash: ``` [trainer2]:# 5 c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) [trainer2]:# 6 c10d::ProcessGroupNCCL::heartbeatMonitor() ``` Test Plan: Will deploy in small training job and with `NCCL_DUMP_ON_TIMEOUT` set. Job should complete with no exceptions. Reviewers: Subscribers: Tasks: T190163458 Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/127030 Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang	2024-05-28 20:57:36 +00:00
Pian Pawakapan	f206c5c628	[export] handle new roots & root swapping in derived dims suggested fixes (#125543 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543 This PR addresses 2 issues with derived dim suggested fixes, 1) newly introduced roots, and 2) root swapping. 1 \| Newly introduced roots appear with modulo guards, e.g. Mod(dx, 2) = 0 suggests dx is a derived dim equal to 2 * _dx, introducing a new root _dx. Currently the final suggested fixes handle this correctly, but we can get intermediate results where related derived dims don't rely on a unified root, and are a mixture of min/max range and derived suggestions. For example: ``` "dx": {"eq": 3*_dx-1, "max": 36} "dy": {"eq": dx+1} This should lead to suggested fixes _dx = Dim('_dx', max=12) dx = 3 * _dx - 1 dy = 3 * _dx ``` This PR prettifies the suggested fixes routine by unifying to a single root, and making each intermediate suggestion either a derived dim or min/max range, not both. 2 \| The current suggested fixes for derived dims can lead to root dims/derived dims being swapped, e.g. `dy - 1, dy` -> `dx, dx + 1`. This leads to problematic suggested fixes that look like `dy - 1 = Dim("dy - 1")` since we don't have access to the original variable name. This PR only adds a suggested fix for the root dim, and removes all other derived suggestions. 
For example, with the export test case test_derived_dim_out_of_order_simplified: ``` _dimz = torch.export.Dim("_dimz", min=6, max=8) dimy = _dimz - 1 dimx = dimy - 1 dimz = torch.export.Dim("dimz", min=6, max=8) # doesn't work, should be = _dimz class Foo(torch.nn.Module): def forward(self, x, y, z): return x + y[1:] + z[2:] foo = Foo() u, v, w = torch.randn(5), torch.randn(6), torch.randn(7) export( foo, (u, v, w), dynamic_shapes=({0: dimx}, {0: dimy}, {0: dimz}), ) ``` Before: ``` Suggested fixes: _dimz = Dim('_dimz', min=3, max=9223372036854775807) # 2 <= _dimz - 1 <= 9223372036854775806 _dimz - 2 = Dim('_dimz - 2', min=4, max=6) _dimz = Dim('_dimz', min=2, max=9223372036854775806) # 2 <= _dimz <= 9223372036854775806 _dimz - 1 = _dimz - 1 dimz = _dimz ``` New suggested fixes: ``` Suggested fixes: dimz = _dimz ``` Note: This assumes the specified derived relations between dims are correct. This should be valid because: 1) if the relation is plain wrong (e.g. (dx, dx - 1) provided with inputs (6, 4)), this gets caught beforehand in produce_guards. 2) if the relation is correct but does not match the emitted guard, for example: ``` def forward(self, x, y): return x.reshape([-1]) + y # guard: s0 * 2 = s1 dx = Dim("dx") export( model, (torch.randn(6, 2), torch.randn(12)), dynamic_shapes={"x": (dx, 2), "y": (dx + 6, )} ) ``` This produces two linear equations, leading to specialization since a) produce_guards is able to solve for a concrete value, and b) the export constraint solver will anyways force specializations due to range constraints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543 Approved by: https://github.com/avikchaudhuri	2024-05-28 20:41:43 +00:00
cyy	0a9d73a814	Remove c10::guts::bool_constant and c10::guts::negation (#127300 ) They are not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127300 Approved by: https://github.com/r-barnes	2024-05-28 19:55:20 +00:00
lancerts	03005bb655	Improve the clarity of the torch.Tensor.backward doc (#127201 ) Improve the clarity of the torch.Tensor.backward doc, particularly wrt the arg `gradient`. Reference https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html, ``` We need to explicitly pass a gradient argument in Q.backward() because it is a vector. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself ``` @janeyx99 feel free to assign to the corresponding reviewers, thanks Co-authored-by: Jeffrey Wan <soulitzer@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127201 Approved by: https://github.com/soulitzer	2024-05-28 19:25:51 +00:00
Manuel Candales	f600faf248	[metal] Improve perf of int4pack_mm shader (#127135 ) Using vectorized data types and using SIMD groups to optimize memory access pattern Pull Request resolved: https://github.com/pytorch/pytorch/pull/127135 Approved by: https://github.com/malfet	2024-05-28 18:22:58 +00:00
feifan	c9172d4471	print default value in FunctionSignature (#127059 ) Fixes #[126758](https://github.com/pytorch/pytorch/issues/126758) and #[126759](https://github.com/pytorch/pytorch/issues/126759) The output information in the issue is not accurate because `FunctionSignature::toString()` prints the schema strings without defaults. `cb6ef68caa/torch/csrc/utils/python_arg_parser.cpp (L1282-L1283)` This PR adds a `default_value` field to save the default string, which should be printed. Alternatively, we could add a new API to convert `default_bool/default_int` back to a string, which is slightly more complicated. Result: ![image](https://github.com/pytorch/pytorch/assets/37650440/f58a4cbf-b0f4-4c81-9106-59f0d35c54ea) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127059 Approved by: https://github.com/janeyx99	2024-05-28 18:04:31 +00:00
Nikita Shulga	045309aa35	[MPS] Enable toch.mm and friends for complex dtypes (#127241 ) - Add `supportedFloatingOrComplexType` - Change dtype check to those - Extend low-precision fp32 list to complex types - Mark conv2d as supported now, as it was failing due to tighter accuracy constraints than the same op for the float32 dtype Fixes https://github.com/pytorch/pytorch/issues/127178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127241 Approved by: https://github.com/janeyx99	2024-05-28 17:56:13 +00:00
IvanKobzarev	829f594d7d	[small] guard_size_oblivious, skip check for meta (#127298 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127298 Approved by: https://github.com/ezyang	2024-05-28 17:53:08 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	9521528f71	Log export result of torch.jit.trace to scuba (#126900 ) Summary: We want to track how well torch.jit.trace can be converted to export in large scale. As a first step, we log all of torch.jit.trace unittests whether we can convert the traced module to export module OR we can export the model directly Test Plan: CI Differential Revision: D57629682 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126900 Approved by: https://github.com/SherlockNoMad	2024-05-28 17:49:34 +00:00
PyTorch MergeBot	3f79e09515	Revert "Made some minor improvements to flexattention perf + added more autotune configs (#126811 )" This reverts commit 84e59f052d4342ac9453703be55758de102e20d3. Reverted https://github.com/pytorch/pytorch/pull/126811 on behalf of https://github.com/PaliC due to breaking on V100s / internal tests ([comment](https://github.com/pytorch/pytorch/pull/126811#issuecomment-2135798983))	2024-05-28 17:48:26 +00:00
Jiashen Cao	254783ce80	[Fix]: populate input parameter name when convert TorchScript to ExportedProgram (#126787 ) ## Goal As title ## Design Based on the fact that each TorchScript module has a `code` property which provides the original source code for the `forward` function, I implemented a function to extrapolate `forward` function signature by using the AST parser. Some other tradeoff * Directly parsing src code as string --> will be very buggy * Directly using `compile` function in Python to get the function object --> raises a lot of exceptions because of missing packages or undefined variable names Pull Request resolved: https://github.com/pytorch/pytorch/pull/126787 Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan	2024-05-28 17:33:44 +00:00
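The AST-based approach described in #126787 can be sketched with the standard-library `ast` module. This is a minimal illustration, not the code from the PR: the helper `forward_arg_names` is a hypothetical name, and `example_code` is a hand-written stand-in for the string a TorchScript module's `code` property would provide.

```python
import ast
from typing import List


def forward_arg_names(source: str) -> List[str]:
    """Return the parameter names of `forward` (excluding `self`).

    The source is only parsed, never compiled or executed, so undefined
    names such as `Tensor` or `torch` in the source do not raise.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "forward":
            return [arg.arg for arg in node.args.args if arg.arg != "self"]
    raise ValueError("no `forward` function found in source")


# Hand-written stand-in for the text of a module's `code` property.
example_code = """
def forward(self, x: Tensor, y: Tensor) -> Tensor:
    return torch.add(x, y)
"""

names = forward_arg_names(example_code)
```

Parsing avoids both pitfalls listed in the message: it is more robust than ad-hoc string splitting, and unlike `compile` followed by execution it never needs the packages or variables referenced in the source to exist.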
Jez Ng	122282111d	[inductor][reland] Various improvements to error handling during autotuning (#126847 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126847 This is a reland of [D56764094](https://www.internalfb.com/diff/D56764094) / https://github.com/pytorch/pytorch/pull/125762. It was originally reverted due to rebase conflicts. Original commit changeset: 45875a1e5de2 Original Phabricator Diff: [D56764094](https://www.internalfb.com/diff/D56764094) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126847 Approved by: https://github.com/chenyang78	2024-05-28 17:22:26 +00:00
Swayam	df360e2add	Update derivatives.yaml (#127193 ) Fixed a typo in docs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127193 Approved by: https://github.com/soulitzer	2024-05-28 16:56:03 +00:00
Angela Yi	cbb79a2baf	[export] Disable backend decomps for capture_pre_autograd (#127120 ) Differential Revision: D57785713 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127120 Approved by: https://github.com/ydwu4	2024-05-28 16:37:13 +00:00
cyy	c40408850a	[1/N] Fix clang-tidy warnings in aten/src/ATen/cuda/ (#127183 ) Fixes clang-tidy warnings Pull Request resolved: https://github.com/pytorch/pytorch/pull/127183 Approved by: https://github.com/soulitzer, https://github.com/Skylion007	2024-05-28 15:35:29 +00:00
cyy	3d88c618d5	Concat namespaces in torch/csrc/profiler and other fixes (#127266 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127266 Approved by: https://github.com/soulitzer	2024-05-28 15:21:32 +00:00
rzou	4d4d2a96f2	Add space in MetaFallbackKernel.cpp error message (#127291 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127291 Approved by: https://github.com/Skylion007	2024-05-28 13:54:38 +00:00
atalman	a6b994ed54	Fix lint after #126845 (#127286 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127286 Approved by: https://github.com/NicolasHug, https://github.com/DanilBaibak	2024-05-28 12:38:27 +00:00
chilli	ec8b254ef4	Refactored template codegen to explicitly set current body when generating code (#127144 ) The main motivation for this refactor is that today, when generating templates, this is what happens. ``` def_kernel() # registers hook for fully generating function definition store_output() # registers hook for generating the output store. also keeps a number of things generated on `self.body`. ``` Later on, when we codegen the template: `f8c4c268da/torch/_inductor/codegen/simd.py (L1402)` ``` epilogue_node.codegen() # Also writes to body! template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body` ``` Today, this is fine, as long as `store_output` is the last function called in the template. However, there's a couple things we probably want to do with kernels that makes this annoying. 1. In FlexAttention backwards, we might want a `modification` to be positioned after the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`. 2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322) 3. The current code also makes it quite difficult to support fusion into multiple output nodes. To resolve this, I do two things: 1. I remove the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies. 2. I add functions that allow you to finalize specific hooks on `PartialRender`. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127144 Approved by: https://github.com/jansel	2024-05-28 09:49:13 +00:00
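The "isolated bodies selected by a context" design described in #127144 can be sketched in plain Python. The class `TemplateKernelSketch` and its `writeline` method are hypothetical stand-ins, not Inductor's `TritonTemplateKernel`; only the names `subgraph_bodies` and `set_subgraph_body` are taken from the commit message, and the emitted lines are placeholder strings.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


class TemplateKernelSketch:
    """Hypothetical kernel with no default body; each hook writes to its own."""

    def __init__(self) -> None:
        self.subgraph_bodies: Dict[str, List[str]] = {}
        self._current: Optional[str] = None

    @contextmanager
    def set_subgraph_body(self, name: str) -> Iterator[None]:
        """Make `name` the active body for the duration of the context."""
        previous = self._current
        self._current = name
        self.subgraph_bodies.setdefault(name, [])
        try:
            yield
        finally:
            # Restore the previously active body, so contexts can nest.
            self._current = previous

    def writeline(self, line: str) -> None:
        if self._current is None:
            raise RuntimeError("no subgraph body is active")
        self.subgraph_bodies[self._current].append(line)


kernel = TemplateKernelSketch()
with kernel.set_subgraph_body("store_output"):
    kernel.writeline("store accumulator to output")
# A later hook writes to its own body without touching store_output's state.
with kernel.set_subgraph_body("modification"):
    kernel.writeline("apply score modification")
```

Because every write goes to an explicitly selected body, the order in which template functions are called no longer matters, which is the property the commit message says the implicit `self.body` lacked.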
Jiang, Yanbing	457b9f7397	Optimize mask memory for flash attention (#126961 ) The PR optimizes the mask memory for flash attention. Instead of directly converting the whole mask to fp32, we do the conversion block-wisely. This can decrease the peak memory usage (we test in https://huggingface.co/microsoft/Phi-3-mini-128k-instruct, peak memory usage reduces ~50%) and have some performance improvements as well. ### Performance result single socket in Intel (R) Xeon (R) CPU Max 9480 batch_size = 12, q_seq_len = 1030, kv_seq_len = 1179, n_head = 3, head_dim = 33, mask_dim = 4, bool_mask = 0 \| Forward speedup \| Backward speedup -- \| -- \| -- float64 \| 0.82% \| 3.76% float32 \| 2.2% \| 3.9% bfloat16 \| 16.15% \| 7.56% segment-anything-fast Follow https://github.com/pytorch-labs/segment-anything-fast/tree/main/experiments Single socket in Intel (R) Xeon (R) CPU Max 9480 Dtype: bfloat16, models: vit_b and vit_h, test in `SDPA` and `Triton` commit https://github.com/pytorch-labs/segment-anything-fast/blob/main/experiments/run_experiments.py#L199-L200, select the time of 20th iteration. \| vit_b \| \| vit_h \| -- \| -- \| -- \| -- \| -- \| attn_mask w/o block-wise \| attn_mask w/ block-wise \| attn_mask w/o block-wise \| attn_mask w/ block-wise SDPA\| 10.95s/it \| 6.59s/it \| 19.93s/it \| 12.33s/it Triton \| 10.66s/it \| 7.12s/it \| 19.87s/it \| 12.26s/it Pull Request resolved: https://github.com/pytorch/pytorch/pull/126961 Approved by: https://github.com/Valentine233, https://github.com/jgong5	2024-05-28 09:12:18 +00:00
Animesh Jain	1507d5205a	[dynamo][fsdp] Skip Dynamo tracing of __getattr__ if its top-level frame (#127263 ) The generated bytecode for the first frame is below. Inlined comments about the LOAD_ATTR which causes Dynamo to trigger again on `__getattr__`. ~~~ [__bytecode] MODIFIED BYTECODE fn /data/users/anijain/pytorch2/test/dynamo/test_activation_checkpointing.py line 1129 [__bytecode] 1129 0 COPY_FREE_VARS 1 [__bytecode] 2 RESUME 0 [__bytecode] 4 PUSH_NULL [__bytecode] 6 LOAD_GLOBAL 10 (__compiled_fn_1) [__bytecode] 18 LOAD_FAST 0 (x) [__bytecode] 20 LOAD_DEREF 1 (mod) [__bytecode] 22 LOAD_ATTR 6 (_checkpoint_wrapped_module) [__bytecode] 32 LOAD_CONST 1 (0) [__bytecode] 34 BINARY_SUBSCR [__bytecode] 44 LOAD_ATTR 7 (weight) [__bytecode] 54 LOAD_DEREF 1 (mod) [__bytecode] 56 LOAD_ATTR 6 (_checkpoint_wrapped_module) [__bytecode] 66 LOAD_CONST 1 (0) [__bytecode] 68 BINARY_SUBSCR [__bytecode] 78 LOAD_ATTR 8 (bias) # When this optimized bytecode is executed, these two lines call the __getattr__ of ActivationWrapper module. # Dynamo gets invoked on __getattr__. # If we had inlined __getattr__ during the tracing, we would have seen the LOAD_ATTR # on more low level data structures like _modules, obviating the need for CPython # to call python overridden __getattr__. But today, UnspecializedNNModuleVariable # calls python getattr at tracing time (instead of inlining it), resulting in LOAD_ATTR # on the module itself. # To prevent Dynamo from tracing __getattr__ on the optimized bytecode, # we can check if it is the top-level frame and just skip it. [__bytecode] 88 LOAD_DEREF 1 (mod) [__bytecode] 90 LOAD_ATTR 0 (a) [__bytecode] 100 PRECALL 4 [__bytecode] 104 CALL 4 [__bytecode] 114 UNPACK_SEQUENCE 1 [__bytecode] 118 RETURN_VALUE ~~~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127263 Approved by: https://github.com/yf225	2024-05-28 08:16:53 +00:00
cyy	d6e3e89804	Remove c10::void_t (#127248 ) OSS version doesn't use it anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127248 Approved by: https://github.com/ezyang	2024-05-28 06:59:20 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	246311c944	Unconditionally add asserts after export (#127132 ) Summary: Today AOTAutograd drops some of the assert nodes, so we reapply them after strict export. Test Plan: CI Reviewed By: angelayi Differential Revision: D57786907 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127132 Approved by: https://github.com/zhxchen17	2024-05-28 06:31:39 +00:00
cyy	e4b245292f	Remove caffe2::tensorrt target code from cuda.cmake (#127204 ) Following #126542. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127204 Approved by: https://github.com/ezyang	2024-05-28 04:42:14 +00:00
cyy	c6b36ec2f9	Remove calls of deprecated _aminmax (#127182 ) While #125995 is pending, the calls should be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127182 Approved by: https://github.com/ezyang	2024-05-28 03:51:45 +00:00
youkaichao	d957c2d5de	[Doc] update default magma cuda version in readme (#122125 ) Since we use cuda 12.1 by default now, it would be better to update the doc. Many people (including me), want to directly copy-paste commands in readme 😉 Let's make our life easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122125 Approved by: https://github.com/malfet	2024-05-28 03:37:23 +00:00
Shaz Qadeer	7c61e7be5c	Address issue #125307 (#126351 ) PyTorch overrides SymPy's Mod and does its own symbolic simplification. Inspired by issue #125307, this PR adds one more simplification tactic. Fixes #125307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126351 Approved by: https://github.com/ezyang	2024-05-28 02:03:24 +00:00
hippocookie	8979412442	Enable ufmt format on test files (#126845 ) Fixes some files in #123062 Run lintrunner on files: test/test_nnapi.py, test/test_numba_integration.py, test/test_numpy_interop.py, test/test_openmp.py, test/test_optim.py ```bash $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126845 Approved by: https://github.com/ezyang	2024-05-28 01:42:07 +00:00
cyy	57000708fc	Remove c10::invoke_result (#127160 ) Following #124169, it can be safely removed from the OSS version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127160 Approved by: https://github.com/ezyang	2024-05-28 01:39:28 +00:00
cyy	6436a6407d	Enable Wunused-variable on tests (#127161 ) This PR enables unused-variable warnings in tests and fixes some test code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127161 Approved by: https://github.com/ezyang	2024-05-28 01:37:46 +00:00
cyy	70d8bc2da1	Fix various errors in TCPStoreLibUvBackend.cpp (#127230 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127230 Approved by: https://github.com/Skylion007	2024-05-27 19:14:01 +00:00
Feny Patel	0ff2f8b522	update kineto submodule hash (#126780 ) Summary: update kineto submodule hash Test Plan: CIs Differential Revision: D57620964 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126780 Approved by: https://github.com/Skylion007	2024-05-27 18:11:48 +00:00
Oguz Ulgen	25a9262ba4	Add structured logging for fx graph cache hash (#127156 ) Summary: Add structured logging for fx graph cache hash so that we can debug MAST jobs easily. Test Plan: ad hoc testing Differential Revision: D57791537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127156 Approved by: https://github.com/jamesjwu	2024-05-27 17:18:41 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in the PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	c7f6fbfa9d	Revert "[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024 )" This reverts commit 9117779b0a178ec5ca548585a97bcb44be631644. Reverted https://github.com/pytorch/pytorch/pull/127024 on behalf of https://github.com/atalman due to failing in CI ([comment](https://github.com/pytorch/pytorch/pull/127024#issuecomment-2133566325))	2024-05-27 14:12:09 +00:00
PyTorch MergeBot	7121ea6f70	Revert "Add compile time profiler for non fbcode targets (#126904 )" This reverts commit 575cb617db4043dd7a76aaf523dc3ab7ee07e7a5. Reverted https://github.com/pytorch/pytorch/pull/126904 on behalf of https://github.com/atalman due to Broke nightly smoke test ([comment](https://github.com/pytorch/pytorch/pull/126904#issuecomment-2133418687))	2024-05-27 12:52:09 +00:00
PyTorch MergeBot	00fe0a0d79	Revert "Remove more of caffe2 (#126705 )" This reverts commit f95dbc12761cb4466099b0e9a3667057ca39272b. Reverted https://github.com/pytorch/pytorch/pull/126705 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126705#issuecomment-2133325449))	2024-05-27 11:59:14 +00:00
Jeeja	1110edb94b	Fix stream type to generic in comms default hooks (#120069 ) In comms default_hooks, the decompress stream is hardcoded to the cuda type. Fix this to use a generic type based on the grad tensor device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120069 Approved by: https://github.com/jgong5, https://github.com/fegin	2024-05-27 10:27:30 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 )" This reverts commit 7763c83af67eebfdd5185dbe6ce15ece2b992a0f. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
PyTorch MergeBot	4608971f7a	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit 0d1e22855022a04a8601a2d94f3079950283ba5d. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	343a41fba8	Revert "[inductor][cpp] epilogue support for gemm template (#126019 )" This reverts commit 56c412d9063de3dc8163b8e1b0b9b5bf9581ad05. Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	68fddebf84	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit 4aa43d11f332b2d7b8f19b4da5ceba612133889d. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	ed9951ace7	Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 )" This reverts commit 43baabe9b94c86bd36ba4a00f501e52d833d7ec8. Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))	2024-05-27 09:01:45 +00:00
PyTorch MergeBot	4c2e671a3b	Revert "[Inductor][CPP] Add Min/Max with VecMask (#126841 )" This reverts commit 1ef4306ab11410a506e0868543a466e87ea879b5. Reverted https://github.com/pytorch/pytorch/pull/126841 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))	2024-05-27 08:58:01 +00:00
PyTorch MergeBot	5247446396	Revert "[Inductor][CPP] Add ne with VecMask (#126940 )" This reverts commit f8c4c268da67e9684f3287b7468f36a5a27c6a0b. Reverted https://github.com/pytorch/pytorch/pull/126940 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))	2024-05-27 08:58:01 +00:00
PyTorch MergeBot	60523fa674	Revert "Move MKLDNN Specific IR to Separate File (#126504 )" This reverts commit bf2909b871579a78e841b661b9b0c302f311d010. Reverted https://github.com/pytorch/pytorch/pull/126504 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))	2024-05-27 08:58:01 +00:00
chuanqiw	ff63e8bac8	[CI] fix doctest case by adding requires (#126855 ) With the triton update, the new dependency `llnl-hatchet` will be introduced. And `pydot` is a dependency of `llnl-hatchet`. So the doctest case `torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0` won't be skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126855 Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/peterbell10	2024-05-27 07:40:27 +00:00
feifan	22712ba5c5	Radam support the flag for "maximize" (#126765 ) Fixes #[126642](https://github.com/pytorch/pytorch/issues/126642) I reference the maximize in `Adam` and add `Radam's` maximize flag. If this pr is OK, I will add another pr for `Nadam`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126765 Approved by: https://github.com/janeyx99	2024-05-27 06:34:50 +00:00
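A minimal sketch of what a `maximize` flag usually means for an optimizer, assuming it follows the `Adam` behavior that #126765 says it references: the gradient's sign is flipped before the update, so the step ascends instead of descends. The `step` helper is a hypothetical plain gradient step used only for illustration; it is not RAdam's update rule.

```python
def step(param, grad, lr=0.1, maximize=False):
    """One plain gradient step; maximize=True flips the gradient sign."""
    if maximize:
        grad = -grad
    return param - lr * grad


# f(x) = -(x - 3)**2 peaks at x = 3 and has gradient -2 * (x - 3).
x = 0.0
for _ in range(100):
    x = step(x, -2.0 * (x - 3.0), maximize=True)

print(round(x, 6))
```

With `maximize=False` the same loop would move away from the peak, which is why the flag matters for objectives that should be maximized.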
cyy	5cca904c51	[3/N] Enable clang-tidy in aten/src/ATen/detail/ (#127184 ) Following #127168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127184 Approved by: https://github.com/jansel	2024-05-27 06:28:07 +00:00
Ting Lu	1c2e221e25	CUDA 12.4 ARM wheel integration to CD - nightly build (#126174 ) rebasing https://github.com/pytorch/pytorch/pull/124112. too many conflict files, so starting a new PR. Test https://github.com/pytorch/builder/pull/1775 (merged) for ARM wheel addition Test https://github.com/pytorch/builder/pull/1828 (merged) for setting MAX_JOBS Current issue to follow up: https://github.com/pytorch/pytorch/issues/126980 Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126174 Approved by: https://github.com/nWEIdia, https://github.com/atalman	2024-05-27 05:50:36 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
cyy	4fdbaa794f	[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-05-27 03:54:03 +00:00
Peter Bell	6aa5bb1a76	[inductor] Support persistent reductions for dynamic shapes (#126684 ) Currently persistent reductions are only supported when the reduction dimension is static, however we only really need to know that the rnumel is bounded. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126684 Approved by: https://github.com/lezcano	2024-05-27 02:30:20 +00:00
leslie-fang-intel	bf2909b871	Move MKLDNN Specific IR to Separate File (#126504 ) Summary Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, Move Inductor MKLDNN specific IRs to a separate file. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504 Approved by: https://github.com/desertfire, https://github.com/jgong5 ghstack dependencies: #126841, #126940	2024-05-27 00:48:09 +00:00
Peter Bell	39de62845a	[decomp] Fix default values missing from inplace `rrelu` decomposition (#126978 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126978 Approved by: https://github.com/lezcano	2024-05-26 23:49:40 +00:00
Xiaodong Wang	06934518a2	[AMD] Fix deprecated amdsmi api (#126962 ) Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by `c551c3caed`. So fixing this in a backward compatible way Differential Revision: D57711088 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962 Approved by: https://github.com/eqy, https://github.com/izaitsevfb	2024-05-26 20:11:23 +00:00
chilli	ee6cb6daa1	Turn the mutation dependency of MutationOutput to weak deps (#127151 ) A writeup of how mutation works in Inductor: https://docs.google.com/document/d/1P0fSq4Nm-3CvdUe9v-mLdEWD3dgIHUf1czQXMmQsuxc/edit Pull Request resolved: https://github.com/pytorch/pytorch/pull/127151 Approved by: https://github.com/oulgen ghstack dependencies: #127148, #127149	2024-05-26 01:21:03 +00:00
leslie-fang-intel	f8c4c268da	[Inductor][CPP] Add ne with VecMask (#126940 ) Summary Fix https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161 which is missing the support of `ne` with `VecMask`. Test Plan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #126841	2024-05-25 23:54:48 +00:00
leslie-fang-intel	1ef4306ab1	[Inductor][CPP] Add Min/Max with VecMask (#126841 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/126824 which is missing the support of `min/max` with `VecMask`. TestPlan ``` python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-05-25 23:52:21 +00:00
chilli	b8ee7d0cc1	Change direct uses of MutationOutput to `mark_node_as_mutating` (#127149 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127149 Approved by: https://github.com/oulgen ghstack dependencies: #127148	2024-05-25 23:47:39 +00:00
chilli	3817c4f9fa	Unify add_fake_dep and add_mutation_dep, as they're literally the same thing (#127148 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127148 Approved by: https://github.com/oulgen	2024-05-25 23:47:39 +00:00
cyy	9bead53519	[2/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127168 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127168 Approved by: https://github.com/Skylion007	2024-05-25 22:50:02 +00:00
Xuehai Pan	a28bfb5ed5	[4/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort functorch (#127125 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122, #127123, #127124	2024-05-25 22:45:38 +00:00
Xuehai Pan	35ea5c6b22	[3/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torchgen (#127124 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122, #127123	2024-05-25 19:20:03 +00:00
Xuehai Pan	0dae2ba5bd	[2/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort caffe2 (#127123 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127123 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122	2024-05-25 18:26:34 +00:00
Zhenbin Lin	da141b096b	Enable UFMT on test/test_hub.py (#127155 ) Partially addresses #123062 Ran lintrunner on: test/test_hub.py Detail: ``` $ lintrunner -a --take UFMT test/test_hub.py ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127155 Approved by: https://github.com/Skylion007	2024-05-25 18:23:24 +00:00
PyTorch MergeBot	12d11fe4e5	Revert "reset dynamo cache before each test (#126586 )" This reverts commit bd24991f461476036d6ba20fed92651c7e46ef7c. Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/malfet due to Broke tons of tests, see `bd24991f46` ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2131365576))	2024-05-25 17:17:19 +00:00
James Wu	71eafe9e97	Refactor dispatch logic to clarify control flow (#126402 ) As discussed, this cleans up the code so that create_aot_dispatcher literally chooses an aot_dispatch function and runs it. Moves wrapper logic to jit_compile_runtime_wrappers, and adds aot_dispatch_export to handle export cases in one place. This also makes aot_dispatch_* return the same type always: a Callable and the forward metadata, instead of returning different number of arguments in export cases. Callers that don't care about fw_metadata can just ignore it. Added return type hints to enforce the same exact interface among all the aot_dispatch_* functions. It'd be nice to move the checks from the synthetic base and dedup wrappers that have to do with export outside of those wrappers, but it's probably fine for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126402 Approved by: https://github.com/oulgen, https://github.com/bdhirsh ghstack dependencies: #126193	2024-05-25 16:06:34 +00:00
Aaron Orenstein	7642cdef25	Improve fusable_read_and_write() (#127061 ) Related to https://github.com/pytorch/pytorch/issues/98467 The tacotron2 benchmark creates a lot of nodes which fusion then checks. This improves some of the perf of that checking. `can_fuse_vertical` calls `fusable_read_and_write` on O(read deps * write deps) combinations - but only cares about write deps that are MemoryDeps - so do the isinstance check outside the inner loop to save O(read deps) when it won't matter anyway. Also moves `fusable_read_and_write` to a instance method (instead of a closure) since it doesn't actually capture any variables. I also tried pre-splitting the read deps into `StarDep` vs `MemoryDep` but that didn't actually make any perf difference. Testing: ``` time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2 ``` Before this change: 10m15s After this change: 9m31s Related to #98467 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127061 Approved by: https://github.com/peterbell10, https://github.com/jansel ghstack dependencies: #127060	2024-05-25 15:17:25 +00:00
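The loop restructuring described in #127061 can be sketched as follows: the `isinstance` test on each write is hoisted out of the inner loop, so writes that are not `MemoryDep`s never trigger a scan over the reads. The `MemoryDep`/`StarDep` classes and the `fusable` predicate here are simplified stand-ins (assumptions for illustration), not Inductor's real definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryDep:  # stand-in for a dependency with a concrete memory index
    name: str


@dataclass(frozen=True)
class StarDep:  # stand-in for a whole-buffer dependency
    name: str


def fusable(read, write):
    # Placeholder predicate; the real check compares more than names.
    return read.name == write.name


def matched_reads(reads, writes):
    """Names of reads that some MemoryDep write can satisfy."""
    matched = set()
    for write in writes:
        # Type check hoisted out of the inner loop: non-MemoryDep writes
        # skip the O(len(reads)) scan entirely.
        if not isinstance(write, MemoryDep):
            continue
        for read in reads:
            if fusable(read, write):
                matched.add(read.name)
    return matched


reads = [MemoryDep("buf0"), StarDep("buf1"), MemoryDep("buf2")]
writes = [MemoryDep("buf0"), StarDep("buf1"), StarDep("buf2")]
print(matched_reads(reads, writes))
```

The result is the same as checking the type inside the inner loop; only the number of checks changes.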
Aaron Orenstein	6c79299a35	Improve score_fusion_memory() (#127060 ) Related to #98467 The tacotron2 benchmark creates a lot of nodes which fusion then checks. This improves some of the perf of that checking. `score_fusion_memory` is called O(n^2) times - so by moving the set union, `has_unbacked_symbols` check, and `numbytes_hint` out of the loop we call them O(n) times and the O(n^2) call gets cheaper. Testing: ``` time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2 ``` Before this change: 12m33s After this change: 10m15s Pull Request resolved: https://github.com/pytorch/pytorch/pull/127060 Approved by: https://github.com/peterbell10, https://github.com/jansel	2024-05-25 15:17:25 +00:00
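A rough illustration of the idea in #127060, under the assumption that the per-node work (here, a set union) does not depend on the partner node: compute it once per node in an O(n) pass, so the O(n^2) pairwise loop only does the cheap part. `score_all_pairs` and its scoring rule are hypothetical, not Inductor's `score_fusion_memory`.

```python
from itertools import combinations


def score_all_pairs(nodes):
    """nodes maps a node name to (reads, writes), each a set of buffer names."""
    # O(n): build each node's read/write union once instead of once per pair.
    touched = {name: reads | writes for name, (reads, writes) in nodes.items()}
    # O(n^2): the pairwise work is now only a set intersection.
    return {
        (a, b): len(touched[a] & touched[b])
        for a, b in combinations(sorted(touched), 2)
    }


nodes = {
    "n0": ({"a"}, {"b"}),
    "n1": ({"b"}, {"c"}),
    "n2": ({"d"}, {"e"}),
}
scores = score_all_pairs(nodes)
print(scores)
```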
Xuehai Pan	ba3b05fdf3	[1/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort stdlib (#127122 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122 Approved by: https://github.com/kit1980	2024-05-25 08:25:50 +00:00
Wu, Chunyuan	4a997de8b9	[AOTI] support freezing for MKLDNN (#124350 ) ## Description Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451. This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first. We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so. ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time. ### Test plan: ```sh python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu ``` ### TODOs in follow-up PRs 1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` is introduced in https://github.com/pytorch/pytorch/pull/119220). 2. Freezing in non-ABI compatible mode will work with the support in this PR. For ABI compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`. `6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-05-25 07:15:36 +00:00
Yu, Guangye	e7a42702f9	generalize custom_fwd&custom_bwd to be device-agnostic (#126531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #126527	2024-05-25 06:48:16 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to deprecate them and strongly recommend that developers use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic and put them into `torch/amp/autocast_mode.py` and re-expose to `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecation warning. No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them. To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
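One way to migrate call sites for the `GradScaler` deprecation in #126527, sketched under two assumptions: PyTorch may not be installed where the snippet runs, and older releases may lack `torch.amp.GradScaler`. The `make_grad_scaler` helper is hypothetical, not part of PyTorch.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch
    torch = None


def make_grad_scaler(device="cuda", **kwargs):
    """Prefer the device-agnostic constructor, fall back to the old one."""
    if torch is None:
        return None
    amp = getattr(torch, "amp", None)
    if amp is not None and hasattr(amp, "GradScaler"):
        # New spelling: the device is the first argument.
        return amp.GradScaler(device, **kwargs)
    # Older releases: only the device-specific class exists.
    return torch.cuda.amp.GradScaler(**kwargs)


# A disabled scaler is enough to exercise both code paths on any machine.
scaler = make_grad_scaler("cpu", enabled=False)
```

Once the minimum supported PyTorch version provides `torch.amp.GradScaler`, the fallback branch can be dropped and the constructor called directly.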
Catherine Lee	ef86a27dba	Mark test_set_per_process_memory_fraction serial (#127087 ) Occasionally OOMs Also should probably give the entire GPU for this anyways Pull Request resolved: https://github.com/pytorch/pytorch/pull/127087 Approved by: https://github.com/huydhn	2024-05-25 06:26:47 +00:00
dshi7	0f67d38f0f	add TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS (#127017 ) tlparse prints a failure description like this: > dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True. This adds an OS environment variable so the option is easier to set for testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017 Approved by: https://github.com/jackiexu1992	2024-05-25 05:42:41 +00:00
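A small sketch of how the environment variable from #127017 might be used from a test script. It assumes the variable is read once when the config module is imported, so it has to be set before the first `import torch`. The `env_flag` parser and the `"1"` convention are assumptions for illustration; the commit message does not state which values are accepted.

```python
import os

# Set the variable before the first `import torch` in the process; config
# modules typically read the environment once, at import time.
os.environ["TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS"] = "1"


def env_flag(name, default="0"):
    # Hypothetical parser: treats the literal string "1" as enabled.
    return os.environ.get(name, default) == "1"


enabled = env_flag("TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS")
print(enabled)
```

The in-process equivalent named in the failure message is setting `torch._dynamo.config.capture_dynamic_output_shape_ops = True`.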
chilli	84e59f052d	Made some minor improvements to flexattention perf + added more autotune configs (#126811 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126811 Approved by: https://github.com/drisspg, https://github.com/yanboliang, https://github.com/Neilblaze	2024-05-25 05:03:31 +00:00
cyy	9f11fc666a	[1/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127057 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127057 Approved by: https://github.com/Skylion007	2024-05-25 04:55:52 +00:00
Shunting Zhang	bd24991f46	reset dynamo cache before each test (#126586 ) In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests. This PR clears the dynamo cache before each unit test so that unit test results are more deterministic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586 Approved by: https://github.com/jansel	2024-05-25 04:48:09 +00:00
Ke Wen	8bd26ecf0b	[pipelining] test composability with DDP and FSDP (#127066 ) Added to `multigpu` test config, which is run periodically. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127066 Approved by: https://github.com/H-Huang, https://github.com/wconstab ghstack dependencies: #127136, #126931	2024-05-25 04:30:40 +00:00
Ke Wen	c1d2564acf	[pipelining] Add grad test for interleaved schedules (#126931 ) Added `test_grad_with_manual_interleaved`: - Model: `MultiMLP` - Tested schedules: Interleaved1F1B, LoopedBFS - Two stages per rank ``` Rank 0 stages: [0, 2] Rank 1 stages: [1, 3] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931 Approved by: https://github.com/wconstab ghstack dependencies: #127136	2024-05-25 04:13:28 +00:00
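The stage placement listed in #126931 (rank 0 holds stages `[0, 2]`, rank 1 holds `[1, 3]`) is consistent with a round-robin layout where stage `s` lives on rank `s % world_size`. A minimal sketch under that assumption; `stages_for_rank` is a hypothetical helper, not part of the pipelining API.

```python
def stages_for_rank(rank, world_size, stages_per_rank):
    """Stage indices held by `rank` when stages are dealt out round-robin."""
    return [rank + i * world_size for i in range(stages_per_rank)]


# Two ranks, two stages per rank, as in the test described above.
for rank in range(2):
    print(f"Rank {rank} stages: {stages_for_rank(rank, 2, 2)}")
```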
Ke Wen	eaace67444	[pipelining] do not check inputs for non-0 stages (#127136 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127136 Approved by: https://github.com/wconstab	2024-05-25 04:13:28 +00:00
James Wu	cc9a3412d4	Implement a post_compile step for aot_dispatch_autograd (#126193 ) This PR moves the post compile portion of aot_dispatch_autograd into runtime_wrappers.py. Completing this allows us to run the post compile section on its own when warm starting. I considered leaving this thing in jit_compile_runtime_wrappers, but we're gonna run into circular dependency issues later if we don't move it over Pull Request resolved: https://github.com/pytorch/pytorch/pull/126193 Approved by: https://github.com/bdhirsh ghstack dependencies: #126907	2024-05-25 03:24:20 +00:00
Oguz Ulgen	52bcf120e5	Make inductor config hashing more portable (#127022 ) Summary: masnesral and I noticed that config contains non portable artifacts. Lets fix that. Test Plan: adhoc testing Differential Revision: D57748025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127022 Approved by: https://github.com/masnesral	2024-05-25 03:01:33 +00:00
Jane Xu	665637714f	Remove SparseAdam weird allowance of raw Tensor input (#127081 ) This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used. ------ # BC Breaking note As of this PR, SparseAdam will become consistent with the rest of our optimizers in that it will only accept containers of Tensors/Parameters/param groups and fully complete deprecation of this path. Hitherto, the SparseAdam constructor had allowed raw tensors as the params argument to the constructor. Now, if you write the following code, there will be an error similar to every other optim: "params argument given to the optimizer should be an iterable of Tensors or dicts" ``` import torch param = torch.rand(16, 32) optimizer = torch.optim.SparseAdam(param) ``` Instead you should replace the last line with ``` optimizer = torch.optim.SparseAdam([param]) ``` to no longer error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081 Approved by: https://github.com/soulitzer	2024-05-25 02:58:24 +00:00
cyy	29a1f62f23	Replace c10::invoke_result with std::invoke_result (#124169 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124169 Approved by: https://github.com/swolchok	2024-05-25 02:42:13 +00:00
Huy Do	9ef6f8dfc1	Fix typo in inductor workflow for CUDA 12.4 jobs (#127121 ) Discovered by @clee2000. The change was introduced in https://github.com/pytorch/pytorch/pull/121956 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127121 Approved by: https://github.com/clee2000, https://github.com/Skylion007	2024-05-25 02:36:39 +00:00
Ke Wen	ed838793df	[pipelining] Remove qualname mapping (#127018 ) `QualnameMapMixin` was intended to provide a mapping from new FQN of the piped model to the FQN of the original model. It was there because previous tracers and flattening during tracing would modify the FQNs. Now that we use unflattener, the FQN of the stage modules are the same as the original FQNs. We don't need `QualnameMapMixin` any more. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127018 Approved by: https://github.com/H-Huang	2024-05-25 02:32:40 +00:00
drisspg	5f15110499	Update dispatch stub to make SDPA routing cleaner (#126832 ) # Summary Adds a public method to dispatchstub to check if a fn has been registered for a device. We use this new function to clean up the dispatching logic for SDPA, as well as make the private use dispatching simpler: #126392 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126832 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-05-25 01:40:53 +00:00
Shunting Zhang	db9c6aeec6	Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970 )" (#126594 ) This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7. enable the test since it's fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594 Approved by: https://github.com/huydhn ghstack dependencies: #126593	2024-05-25 01:27:02 +00:00
Shunting Zhang	b03dc3d167	don't check memory format for empty tensors (#126593 ) Fix https://github.com/pytorch/pytorch/issues/125967 . The test actually fails for empty 4D or 5D tensors when checking memory format. I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?) I just skip the check for empty tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593 Approved by: https://github.com/ezyang	2024-05-25 01:19:45 +00:00
Animesh Jain	84f8cd22ac	[dynamo][TensorVariable] Support "if param.grad_fn" usecase (#126960 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126960 Approved by: https://github.com/jansel ghstack dependencies: #126922	2024-05-25 01:09:26 +00:00
Sheng Fu	bbeb0906c4	Register creak_node_hook (#126671 ) Differential Revision: D57469157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126671 Approved by: https://github.com/angelayi	2024-05-24 23:32:15 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	72f0bdcc22	Remove torch._constrain_as_value (#127103 ) Summary: This API doesn't do anything useful and should be subsumed by torch._check. Test Plan: CI Differential Revision: D57786740 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127103 Approved by: https://github.com/angelayi	2024-05-24 22:49:46 +00:00
Jason Ansel	d5bf3a98db	[inductor] Refactor indexing() into triton.py (#127047 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127047 Approved by: https://github.com/shunting314 ghstack dependencies: #126944, #126945	2024-05-24 22:46:20 +00:00
Jason Ansel	92433217cb	[inductor] Misc refactors (#126945 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126945 Approved by: https://github.com/shunting314 ghstack dependencies: #126944	2024-05-24 22:46:20 +00:00
Jason Ansel	1b6e3e3bcb	[inductor] Refactor part of IterationRangesEntry into triton.py (#126944 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126944 Approved by: https://github.com/shunting314	2024-05-24 22:46:20 +00:00
Anshul Sinha	83617017e0	[dtensor][debug] add c10d allreduce_coalesced_ tracing to CommDebugMode (#127040 ) Summary Added c10d all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode.py. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127040 Approved by: https://github.com/XilunWu ghstack dependencies: #127025, #127029	2024-05-24 22:25:44 +00:00
Michael Lazos	59052071b7	Disallow fusions of foreach and reductions (#127048 ) Fixes https://github.com/pytorch/pytorch/issues/120857 This currently isn't supported until we enable foreach reduction kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127048 Approved by: https://github.com/weifengpy	2024-05-24 21:35:06 +00:00
James Wu	023c1baf82	Add global configurations to cache key (#126907 ) This adds a bunch of global configurations to the cache key. There's definitely more I haven't added, but this is just an audit of all of the `torch.*` globals that are used in jit_compile_runtime_wrappers, runtime_wrappers, etc. It also makes the hash details object subclass FXGraphHashDetails, which implements other hashed data like configs inductor depends on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126907 Approved by: https://github.com/aorenste	2024-05-24 21:26:46 +00:00
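Why global configuration belongs in a cache key, as #126907 describes, can be shown with a toy hash: if a global setting changes what gets compiled but is left out of the key, two different artifacts collide. The `cache_key` function and the setting names below are hypothetical; they are not the `FXGraphHashDetails` implementation.

```python
import hashlib
import json


def cache_key(graph_repr, global_config):
    """Hash the graph together with the global settings it was compiled under."""
    payload = json.dumps(
        {"graph": graph_repr, "globals": global_config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = cache_key("add(x, y)", {"grad_enabled": True, "deterministic": False})
key_b = cache_key("add(x, y)", {"grad_enabled": False, "deterministic": False})
key_c = cache_key("add(x, y)", {"deterministic": False, "grad_enabled": True})
print(key_a != key_b, key_a == key_c)
```

Serializing with `sort_keys=True` keeps the key stable when the same settings are supplied in a different order.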
dan_the_3rd	c133665d4a	[CUDA] Parallelize upsampling OPS across the batch/channel dimension. (#127082 ) This can make this operation 200x+ faster on modern GPUs for small grid sizes, as otherwise this kernel is scheduled with a single block (!) Tested on A100 with: ``` python test/test_nn.py TestNNDeviceTypeCUDA ``` Benchmarks FW Ran on A100 / bf16

## Forward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 5855us (0.07 GB/s) | 38us (10 GB/s) | 154x |
| 768 | 16x16 | 7x7 | 5214us (0.08 GB/s) | 37us (11 GB/s) | 138x |
| 768 | 16x16 | 14x14 | 2314us (0.27 GB/s) | 36us (17 GB/s) | 63x |
| 768 | 16x16 | 16x16 | 1232us (0.59 GB/s) | 33us (21 GB/s) | 36x |
| 768 | 32x32 | 6x6 | 19442us (0.07 GB/s) | 98us (15 GB/s) | 197x |
| 768 | 32x32 | 7x7 | 16918us (0.09 GB/s) | 89us (17 GB/s) | 188x |
| 768 | 32x32 | 14x14 | 6023us (0.28 GB/s) | 69us (25 GB/s) | 86x |
| 768 | 32x32 | 16x16 | 3455us (0.52 GB/s) | 55us (32 GB/s) | 62x |
| 768 | 48x48 | 6x6 | 38597us (0.08 GB/s) | 179us (18 GB/s) | 214x |
| 768 | 48x48 | 7x7 | 34700us (0.09 GB/s) | 163us (20 GB/s) | 211x |
| 768 | 48x48 | 14x14 | 10647us (0.33 GB/s) | 112us (31 GB/s) | 94x |
| 768 | 48x48 | 16x16 | 7388us (0.49 GB/s) | 100us (36 GB/s) | 73x |
| 768 | 64x64 | 6x6 | 76288us (0.07 GB/s) | 310us (19 GB/s) | 246x |
| 768 | 64x64 | 7x7 | 54981us (0.1 GB/s) | 257us (23 GB/s) | 213x |
| 768 | 64x64 | 14x14 | 16565us (0.37 GB/s) | 169us (36 GB/s) | 97x |
| 768 | 64x64 | 16x16 | 12037us (0.51 GB/s) | 141us (43 GB/s) | 84x |
| 1024 | 16x16 | 6x6 | 8123us (0.06 GB/s) | 44us (12 GB/s) | 183x |
| 1024 | 16x16 | 7x7 | 7017us (0.08 GB/s) | 45us (12 GB/s) | 155x |
| 1024 | 16x16 | 14x14 | 3150us (0.27 GB/s) | 45us (18 GB/s) | 69x |
| 1024 | 16x16 | 16x16 | 1695us (0.57 GB/s) | 41us (23 GB/s) | 40x |
| 1024 | 32x32 | 6x6 | 25918us (0.07 GB/s) | 120us (16 GB/s) | 214x |
| 1024 | 32x32 | 7x7 | 22622us (0.09 GB/s) | 108us (18 GB/s) | 208x |
| 1024 | 32x32 | 14x14 | 8245us (0.28 GB/s) | 87us (26 GB/s) | 94x |
| 1024 | 32x32 | 16x16 | 4599us (0.53 GB/s) | 68us (35 GB/s) | 67x |
| 1024 | 48x48 | 6x6 | 51486us (0.08 GB/s) | 219us (20 GB/s) | 234x |
| 1024 | 48x48 | 7x7 | 46501us (0.09 GB/s) | 202us (22 GB/s) | 229x |
| 1024 | 48x48 | 14x14 | 14280us (0.33 GB/s) | 145us (32 GB/s) | 98x |
| 1024 | 48x48 | 16x16 | 9877us (0.49 GB/s) | 125us (39 GB/s) | 79x |
| 1024 | 64x64 | 6x6 | 101731us (0.07 GB/s) | 378us (20 GB/s) | 268x |
| 1024 | 64x64 | 7x7 | 73465us (0.1 GB/s) | 320us (24 GB/s) | 229x |
| 1024 | 64x64 | 14x14 | 22109us (0.37 GB/s) | 218us (37 GB/s) | 101x |
| 1024 | 64x64 | 16x16 | 16081us (0.51 GB/s) | 178us (46 GB/s) | 90x |
| 1536 | 16x16 | 6x6 | 12546us (0.06 GB/s) | 61us (13 GB/s) | 205x |
| 1536 | 16x16 | 7x7 | 11064us (0.07 GB/s) | 63us (13 GB/s) | 175x |
| 1536 | 16x16 | 14x14 | 4839us (0.26 GB/s) | 62us (20 GB/s) | 77x |
| 1536 | 16x16 | 16x16 | 2630us (0.55 GB/s) | 59us (24 GB/s) | 44x |
| 1536 | 32x32 | 6x6 | 38898us (0.07 GB/s) | 170us (17 GB/s) | 227x |
| 1536 | 32x32 | 7x7 | 34079us (0.09 GB/s) | 155us (19 GB/s) | 219x |
| 1536 | 32x32 | 14x14 | 12632us (0.27 GB/s) | 124us (28 GB/s) | 101x |
| 1536 | 32x32 | 16x16 | 6900us (0.53 GB/s) | 98us (37 GB/s) | 70x |
| 1536 | 48x48 | 6x6 | 77272us (0.08 GB/s) | 316us (21 GB/s) | 243x |
| 1536 | 48x48 | 7x7 | 70153us (0.09 GB/s) | 291us (23 GB/s) | 240x |
| 1536 | 48x48 | 14x14 | 21500us (0.33 GB/s) | 208us (34 GB/s) | 103x |
| 1536 | 48x48 | 16x16 | 14851us (0.49 GB/s) | 181us (40 GB/s) | 81x |
| 1536 | 64x64 | 6x6 | 152669us (0.07 GB/s) | 548us (21 GB/s) | 278x |
| 1536 | 64x64 | 7x7 | 110348us (0.1 GB/s) | 466us (25 GB/s) | 236x |
| 1536 | 64x64 | 14x14 | 33350us (0.36 GB/s) | 316us (38 GB/s) | 105x |
| 1536 | 64x64 | 16x16 | 24173us (0.51 GB/s) | 263us (47 GB/s) | 91x |
| 4096 | 16x16 | 6x6 | 34638us (0.06 GB/s) | 138us (16 GB/s) | 249x |
| 4096 | 16x16 | 7x7 | 31590us (0.07 GB/s) | 144us (16 GB/s) | 218x |
| 4096 | 16x16 | 14x14 | 13203us (0.26 GB/s) | 149us (23 GB/s) | 88x |
| 4096 | 16x16 | 16x16 | 7328us (0.53 GB/s) | 143us (27 GB/s) | 51x |
| 4096 | 32x32 | 6x6 | 103802us (0.07 GB/s) | 405us (19 GB/s) | 256x |
| 4096 | 32x32 | 7x7 | 91354us (0.08 GB/s) | 372us (22 GB/s) | 245x |
| 4096 | 32x32 | 14x14 | 34501us (0.26 GB/s) | 312us (29 GB/s) | 110x |
| 4096 | 32x32 | 16x16 | 18465us (0.52 GB/s) | 247us (39 GB/s) | 74x |

## Backward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 78656us (0.0 GB/s) | 323us (1 GB/s) | 243x |
| 768 | 16x16 | 7x7 | 67167us (0.0 GB/s) | 292us (1 GB/s) | 230x |
| 768 | 16x16 | 14x14 | 27478us (0.02 GB/s) | 229us (2 GB/s) | 119x |
| 768 | 16x16 | 16x16 | 131us (5.59 GB/s) | 56us (13 GB/s) | 2x |
| 768 | 32x32 | 6x6 | 271752us (0.0 GB/s) | 888us (1 GB/s) | 305x |
| 768 | 32x32 | 7x7 | 224110us (0.0 GB/s) | 813us (1 GB/s) | 275x |
| 768 | 32x32 | 14x14 | 85365us (0.02 GB/s) | 450us (3 GB/s) | 189x |
| 768 | 32x32 | 16x16 | 67700us (0.02 GB/s) | 360us (5 GB/s) | 187x |
| 768 | 48x48 | 6x6 | 593709us (0.0 GB/s) | 1988us (1 GB/s) | 298x |
| 768 | 48x48 | 7x7 | 485566us (0.0 GB/s) | 1694us (1 GB/s) | 286x |
| 768 | 48x48 | 14x14 | 164059us (0.02 GB/s) | 897us (3 GB/s) | 182x |
| 768 | 48x48 | 16x16 | 134317us (0.02 GB/s) | 674us (5 GB/s) | 199x |
| 768 | 64x64 | 6x6 | 1026651us (0.0 GB/s) | 3360us (1 GB/s) | 305x |
| 768 | 64x64 | 7x7 | 770901us (0.0 GB/s) | 2584us (2 GB/s) | 298x |
| 768 | 64x64 | 14x14 | 277850us (0.02 GB/s) | 1556us (3 GB/s) | 178x |
| 768 | 64x64 | 16x16 | 236245us (0.02 GB/s) | 1144us (5 GB/s) | 206x |
| 1024 | 16x16 | 6x6 | 106638us (0.0 GB/s) | 341us (1 GB/s) | 312x |
| 1024 | 16x16 | 7x7 | 90886us (0.0 GB/s) | 314us (1 GB/s) | 288x |
| 1024 | 16x16 | 14x14 | 36572us (0.02 GB/s) | 292us (2 GB/s) | 124x |
| 1024 | 16x16 | 16x16 | 171us (5.69 GB/s) | 56us (17 GB/s) | 3x |
| 1024 | 32x32 | 6x6 | 356900us (0.0 GB/s) | 936us (2 GB/s) | 380x |
| 1024 | 32x32 | 7x7 | 299139us (0.0 GB/s) | 870us (2 GB/s) | 343x |
| 1024 | 32x32 | 14x14 | 113205us (0.02 GB/s) | 576us (4 GB/s) | 196x |
| 1024 | 32x32 | 16x16 | 90886us (0.02 GB/s) | 458us (5 GB/s) | 198x |
| 1024 | 48x48 | 6x6 | 786896us (0.0 GB/s) | 2127us (2 GB/s) | 369x |
| 1024 | 48x48 | 7x7 | 640515us (0.0 GB/s) | 1837us (2 GB/s) | 348x |
| 1024 | 48x48 | 14x14 | 218720us (0.02 GB/s) | 1152us (4 GB/s) | 189x |
| 1024 | 48x48 | 16x16 | 178827us (0.02 GB/s) | 863us (5 GB/s) | 207x |
| 1024 | 64x64 | 6x6 | 1379991us (0.0 GB/s) | 3589us (2 GB/s) | 384x |
| 1024 | 64x64 | 7x7 | 1047466us (0.0 GB/s) | 2774us (2 GB/s) | 377x |
| 1024 | 64x64 | 14x14 | 370139us (0.02 GB/s) | 1999us (4 GB/s) | 185x |
| 1024 | 64x64 | 16x16 | 316501us (0.02 GB/s) | 1470us (5 GB/s) | 215x |
| 1536 | 16x16 | 6x6 | 159057us (0.0 GB/s) | 477us (1 GB/s) | 332x |
| 1536 | 16x16 | 7x7 | 135578us (0.0 GB/s) | 441us (1 GB/s) | 306x |
| 1536 | 16x16 | 14x14 | 53002us (0.02 GB/s) | 400us (3 GB/s) | 132x |
| 1536 | 16x16 | 16x16 | 252us (5.79 GB/s) | 55us (26 GB/s) | 4x |
| 1536 | 32x32 | 6x6 | 545653us (0.0 GB/s) | 1323us (2 GB/s) | 412x |
| 1536 | 32x32 | 7x7 | 447491us (0.0 GB/s) | 1248us (2 GB/s) | 358x |
| 1536 | 32x32 | 14x14 | 173491us (0.02 GB/s) | 787us (4 GB/s) | 220x |
| 1536 | 32x32 | 16x16 | 136395us (0.02 GB/s) | 633us (5 GB/s) | 215x |
| 1536 | 48x48 | 6x6 | 1198639us (0.0 GB/s) | 3057us (2 GB/s) | 392x |
| 1536 | 48x48 | 7x7 | 985549us (0.0 GB/s) | 2645us (2 GB/s) | 372x |
| 1536 | 48x48 | 14x14 | 331419us (0.02 GB/s) | 1581us (4 GB/s) | 209x |
| 1536 | 48x48 | 16x16 | 270972us (0.02 GB/s) | 1186us (6 GB/s) | 228x |
| 1536 | 64x64 | 6x6 | 2094282us (0.0 GB/s) | 5214us (2 GB/s) | 401x |
| 1536 | 64x64 | 7x7 | 1593449us (0.0 GB/s) | 4086us (2 GB/s) | 389x |
| 1536 | 64x64 | 14x14 | 559244us (0.02 GB/s) | 2828us (4 GB/s) | 197x |
| 1536 | 64x64 | 16x16 | 469471us (0.02 GB/s) | 2057us (6 GB/s) | 228x |
| 4096 | 16x16 | 6x6 | 430494us (0.0 GB/s) | 1008us (2 GB/s) | 427x |
| 4096 | 16x16 | 7x7 | 360346us (0.0 GB/s) | 1015us (2 GB/s) | 354x |
| 4096 | 16x16 | 14x14 | 142868us (0.02 GB/s) | 988us (3 GB/s) | 144x |
| 4096 | 16x16 | 16x16 | 658us (5.93 GB/s) | 56us (69 GB/s) | 11x |
| 4096 | 32x32 | 6x6 | 1425928us (0.0 GB/s) | 2796us (2 GB/s) | 509x |
| 4096 | 32x32 | 7x7 | 1188862us (0.0 GB/s) | 2906us (2 GB/s) | 409x |
| 4096 | 32x32 | 14x14 | 464286us (0.02 GB/s) | 1965us (4 GB/s) | 236x |
| 4096 | 32x32 | 16x16 | 363903us (0.02 GB/s) | 1588us (6 GB/s) | 229x |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127082 Approved by: https://github.com/fmassa	2024-05-24 21:17:12 +00:00
Chien-Chin Huang	b0871f9b33	[DSD] Add a test to verify FSDP lazy initialization case (#127069 ) Summary: Distributed state_dict should not error out because the `model.state_dict()` will trigger FSDP to initialize. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127069 Approved by: https://github.com/wz337	2024-05-24 21:09:11 +00:00
Bin Bao	7394ec7123	[AOTI][refactor] Update DTYPE_TO_CPP mapping (#126915 ) Summary: Use more consistent cpp int types in DTYPE_TO_CPP Pull Request resolved: https://github.com/pytorch/pytorch/pull/126915 Approved by: https://github.com/chenyang78	2024-05-24 21:03:12 +00:00
Sijia Chen	800f461b2a	[User-Written Triton] Handle the `scf.for` and `scf.while` case (#127065 ) Summary: This is the official fix for the issue reported in https://fb.workplace.com/groups/1075192433118967/permalink/1427865377851669/ The root cause is that the MLIR mutation analysis doesn't find the mutated tensors, which made AOT autograd think there were no users of the Triton kernel and remove it 😔 --- Triton IR: P1369315213 Wrong Analyze Graph: P1364305956 Right Analyze Graph: P1369324977 Test Plan: buck2 run mode/opt scripts/liptds/domain_kernels:triton_dcpp_flash unit tests Differential Revision: D57606053 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127065 Approved by: https://github.com/oulgen, https://github.com/chenyang78	2024-05-24 21:01:13 +00:00
Shuo Ding	dce29a8a87	Replaced same with assertEqual in two files (#126994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126994 Approved by: https://github.com/masnesral	2024-05-24 20:50:36 +00:00
PyTorch MergeBot	c34f8c7f91	Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 )" This reverts commit 5e69e11d098a2cfccc8a59377c431e9c71cab9a8. Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to sorry the Dr CI fix hasn't been merged yet and its still failing `5e69e11d09` https://github.com/pytorch/pytorch/actions/runs/9228887299/job/25393895252 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2130305958))	2024-05-24 20:26:07 +00:00
Scott Wolchok	fdda9a22c3	Performance parity for 32-bit-precision in FP16 ARM matrix-vector kernel using FMLAL instruction (#127033 ) Summary: I discovered this instruction by checking all the intrinsics on https://arm-software.github.io/acle/neon_intrinsics/advsimd.html . Test Plan: Existing test coverage benchmarked custom sizes with https://github.com/malfet/llm_experiments benchmarks/benchmark/torch_mm.py:
```
m=1024, n=1024, k=1
====================
trans_b torch.float16 43.93 usec
Using FP16 accumulation
trans_b torch.float16 43.76 usec
m=4100, n=4100, k=1
====================
trans_b torch.float16 719.35 usec
Using FP16 accumulation
trans_b torch.float16 719.33 usec
m=4104, n=4104, k=1
====================
trans_b torch.float16 727.79 usec
Using FP16 accumulation
trans_b torch.float16 702.72 usec
m=16384, n=16384, k=1
====================
trans_b torch.float16 18465.11 usec
Using FP16 accumulation
trans_b torch.float16 11435.28 usec
```
also checked the default sizes. Relevant output before:
```
mv_nt torch.float16 13.05 usec
trans_b torch.float16 13.69 usec
Using FP16 accumulation
mv_nt torch.float16 8.65 usec
trans_b torch.float16 9.24 usec
```
after:
```
mv_nt torch.float16 8.66 usec
trans_b torch.float16 8.85 usec
Using FP16 accumulation
mv_nt torch.float16 8.52 usec
trans_b torch.float16 8.60 usec
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127033 Approved by: https://github.com/malfet, https://github.com/Skylion007 ghstack dependencies: #126745, #126746, #126793, #126794, #126877, #127016	2024-05-24 19:47:50 +00:00
Scott Wolchok	1d3aa08327	Cleanup: use c10::ForceUnroll and constexpr variables in ARM FP16 matrix-vector fast path (#127016 ) Summary: Just straightforward code cleanup in this path. Test Plan: Existing CI, double-checked benchmark_torch_mm didn't regress as per previous diffs in stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127016 Approved by: https://github.com/peterbell10 ghstack dependencies: #126745, #126746, #126793, #126794, #126877	2024-05-24 19:47:50 +00:00
cyy	67d52d7fcb	[caffe2] Remove import_legacy.cpp (#126149 ) I think they are for Caffe2 and should be deleted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126149 Approved by: https://github.com/r-barnes	2024-05-24 19:47:32 +00:00
Joel Schlosser	5e69e11d09	Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 ) PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`: * `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()` * `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()` CPU impls for these new ATen ops will be added in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946 Approved by: https://github.com/davidberard98	2024-05-24 19:16:29 +00:00
Bin Bao	9d4731f952	[AOTI] Disable stack allocation for OSS (#125732 ) Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720, #126801	2024-05-24 19:10:33 +00:00
Bin Bao	72d30aa026	[AOTI] Fix an int array codegen issue (#126801 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains symbol expression, we can't declare it with constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720	2024-05-24 19:10:33 +00:00
Bin Bao	71f1aebe1f	[AOTI] Add more fallback ops (#126720 ) Summary: These ops are in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720 Approved by: https://github.com/chenyang78	2024-05-24 19:10:33 +00:00
Svetlana Karslioglu	f508cd6e00	Update assigntome job (#127027 ) Updating for the new docathon Pull Request resolved: https://github.com/pytorch/pytorch/pull/127027 Approved by: https://github.com/kit1980	2024-05-24 19:04:51 +00:00
Aaron Gokaslan	3cb16ebf08	[BE]: Update ruff to 0.4.5 (#126979 ) Update ruff to 0.4.5 and addresses some false negatives that have been found in the newer version. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126979 Approved by: https://github.com/ezyang	2024-05-24 18:38:35 +00:00
Yifu Wang	4a09117d16	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 <img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn <img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR

`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.

Usage:

```
# Using ProcessGroupCudaP2P
dist.init_process_group(backend="cuda_p2p", ...)

# Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
pg_options = ProcessGroupCudaP2P.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

# Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
pg_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

# Using ProcessGroupCudaP2P while specifying both
# ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
pg_options = ProcessGroupCudaP2P.Options()
pg_options.nccl_options = ProcessGroupNCCL.Options()
dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

# Down-casting the backend to access p2p buffers for cuda_p2p specific
# optimizations
if is_cuda_p2p_group(group):
    backend = get_cuda_p2p_backend(group)
    if required_p2p_buffer_size > backend.get_buffer_size():
        # fallback
    p2p_buffer = backend.get_p2p_buffer(...)
else:
    # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-24 18:33:18 +00:00
Yidi Wu	01f04230cf	[cond] support torch built in function as subgraph (#126909 ) Fixes https://github.com/pytorch/pytorch/issues/126818. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126909 Approved by: https://github.com/zou3519 ghstack dependencies: #127026	2024-05-24 18:31:43 +00:00
Yidi Wu	2d6d2dbc0b	[dynamo] make callable(nn_module) return True (#127026 ) Before this PR, we had a graph break for `callable(nn_module)`:
```python
import torch
from torch import nn


class M(nn.Module):
    def forward(self, x):
        return x.sin()


def f(m):
    return callable(m)


res = torch.compile(f, fullgraph=True)(M())
```
```
Traceback (most recent call last):
  File "/data/users/yidi/pytorch/t.py", line 17, in <module>
    out = torch.compile(f, backend="eager", fullgraph=True)(M())
  File "/data/users/yidi/pytorch/torch/_dynamo/eval_frame.py", line 414, in _fn
    return fn(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 1077, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 456, in _convert_frame_assert
    return _compile(
  File "/data/users/yidi/pytorch/torch/_utils_internal.py", line 74, in wrapper_function
    return function(*args, **kwargs)
  File "/home/yidi/.conda/envs/pytorch/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 799, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/data/users/yidi/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 618, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/data/users/yidi/pytorch/torch/_dynamo/bytecode_transformation.py", line 1167, in transform_code_object
    transformations(instructions, code_options)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 177, in _fn
    return fn(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 564, in transform
    tracer.run()
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 2244, in run
    super().run()
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 886, in run
    while self.step():
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 801, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 496, in wrapper
    return inner_fn(self, inst)
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 1255, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 739, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
    return handler(tx, args, kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 711, in <lambda>
    return lambda tx, args, kwargs: obj.call_function(
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
    return handler(tx, args, kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 835, in builtin_dipatch
    unimplemented(error_msg)
  File "/data/users/yidi/pytorch/torch/_dynamo/exc.py", line 216, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: builtin: callable [<class 'torch._dynamo.variables.nn_module.NNModuleVariable'>] False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127026 Approved by: https://github.com/jansel	2024-05-24 18:31:43 +00:00
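For reference, the eager-mode behaviour that the compiled function is expected to match can be shown without PyTorch: `callable(obj)` is true whenever the object's type defines `__call__`. The `Module` class below is a hypothetical stand-in for `torch.nn.Module` (which also defines `__call__`), not the real class, so this is only a sketch of the semantics and does not exercise Dynamo.

```python
class Module:
    """Hypothetical stand-in for torch.nn.Module: __call__ forwards to forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class M(Module):
    def forward(self, x):
        return x + 1


def f(m):
    # The built-in checks whether type(m) defines __call__.
    return callable(m)


result = f(M())
```

Because the answer depends only on the object's type, it can in principle be resolved at trace time, which is consistent with the commit title stating that `callable(nn_module)` now returns `True`.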
cyy	f2c6fddbe1	Remove unnecessary const_cast and other fixes (#127054 ) Removes unnecessary const casts and copies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127054 Approved by: https://github.com/Skylion007	2024-05-24 18:05:06 +00:00
Andrew Gu	9117779b0a	[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024 ) This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank. This was motivated from an ask on Slack :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024 Approved by: https://github.com/weifengpy, https://github.com/wanchaol ghstack dependencies: #127004	2024-05-24 17:09:12 +00:00
Mikayla Gawarecki	87f79af24d	Fix map_location for wrapper subclass and device tensors that go through numpy (#126728 ) Fixes https://github.com/pytorch/pytorch/issues/124418 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126728 Approved by: https://github.com/albanD	2024-05-24 16:39:30 +00:00
Nikita Shulga	4ff9113e3d	[MPS] Add `_weight_int8pack_mm` tests (#127041 ) As well as extend the test to cover MV cases (where A matrix is 1xM) Limit int8 op testing to 32x32 matrix sizes for now Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041 Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales	2024-05-24 16:08:06 +00:00
Nikita Shulga	194950c0ca	Default ThreadPool size to number of physical cores (#125963 ) TODO: Some benchmarks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125963 Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/gajjanag, https://github.com/jgong5	2024-05-24 16:06:48 +00:00
PyTorch MergeBot	5ae9daa4a2	Revert "[AOTI] support freezing for MKLDNN (#124350 )" This reverts commit 654afb6f3ae3ddbd926a753f9af95a6f6e22131c. Reverted https://github.com/pytorch/pytorch/pull/124350 on behalf of https://github.com/clee2000 due to Seems to have broken inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_freezing_non_abi_compatible_cpu `654afb6f3a` https://github.com/pytorch/pytorch/actions/runs/9224838183/job/25382780192 ([comment](https://github.com/pytorch/pytorch/pull/124350#issuecomment-2129889809))	2024-05-24 16:03:07 +00:00
Eli Simhayev	2ac739cc80	[DOCS] Fixed KLDiv example (#126857 ) Small import fix to make the example run Pull Request resolved: https://github.com/pytorch/pytorch/pull/126857 Approved by: https://github.com/albanD	2024-05-24 15:39:50 +00:00
Shunting Zhang	4105f91cfc	[inductor] fix an assertion for node debug str (#127021 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127021 Approved by: https://github.com/aorenste	2024-05-24 13:37:05 +00:00
Wu, Chunyuan	654afb6f3a	[AOTI] support freezing for MKLDNN (#124350 )
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451. This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first. We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during compilation and restore the opaque tensor when loading the compiled .so. The ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.
### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```
### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` was introduced in https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI compatible mode will work with the support in this PR. For ABI compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`. `6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-05-24 13:34:04 +00:00
Jiong Gong	43baabe9b9	[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 ) As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows: 1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling. 2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops. 3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen. 4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019, #126068	2024-05-24 12:29:06 +00:00
Jiong Gong	4aa43d11f3	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-24 12:24:35 +00:00
Jiong Gong	56c412d906	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-24 12:14:12 +00:00
rzou	dd64ca2a02	Inductor respects strides for custom ops by default (#126986 ) Previously, the default was that Inductor did not respect strides for all (builtin and custom) ops unless the op has a "needs_fixed_stride_order" tag on it. This PR changes it so that:
- inductor doesn't respect strides for builtin ops. To change the behavior, one can add the "needs_fixed_stride_order" tag
- inductor does respect strides for custom ops. To change the behavior, one can add the "does_not_need_fixed_stride_order" tag

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126986 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-05-24 11:11:18 +00:00
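The new default described in this commit can be summarised as a small decision function. This is a toy model of the rule as stated in the message, not Inductor's actual implementation: the function name `respects_strides` and its signature are hypothetical, and the precedence when both tags are present is an arbitrary choice made here because the message does not cover that case.

```python
NEEDS_FIXED = "needs_fixed_stride_order"
DOES_NOT_NEED_FIXED = "does_not_need_fixed_stride_order"


def respects_strides(is_custom_op, tags=frozenset()):
    """Toy model: would Inductor respect the strides of this op's inputs?"""
    if NEEDS_FIXED in tags:
        # Explicit opt-in (assumed here to win if both tags are given).
        return True
    if DOES_NOT_NEED_FIXED in tags:
        # Explicit opt-out.
        return False
    # No tag: custom ops default to respecting strides, builtin ops do not.
    return is_custom_op


defaults = {
    "builtin": respects_strides(is_custom_op=False),
    "custom": respects_strides(is_custom_op=True),
}
```

Under this reading, only an untagged custom op changes behaviour relative to the previous default.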
Aaron Orenstein	f14cdc570d	Fix to #126656 (#127050 ) Fix failure from fbcode - in the case of a foreach node the fake `group` needs to be hashable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127050 Approved by: https://github.com/DanilBaibak ghstack dependencies: #126656	2024-05-24 10:56:53 +00:00
PyTorch MergeBot	47c976b904	Revert "[AOTI] Add more fallback ops (#126720 )" This reverts commit 19cd4484ec8449b8c5ebf46be1f8f2fcbace8c6c. Reverted https://github.com/pytorch/pytorch/pull/126720 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))	2024-05-24 09:07:07 +00:00
PyTorch MergeBot	f749c5def8	Revert "[AOTI] Fix an int array codegen issue (#126801 )" This reverts commit ff617ab6c8f6f67ae912fbcd45a913a89e19effb. Reverted https://github.com/pytorch/pytorch/pull/126801 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))	2024-05-24 09:07:07 +00:00
PyTorch MergeBot	fd9cdeed19	Revert "[AOTI] Disable stack allocation for OSS (#125732 )" This reverts commit 599e684ad6f34dd069eff8611f45e25b7695a339. Reverted https://github.com/pytorch/pytorch/pull/125732 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))	2024-05-24 09:07:07 +00:00
Richard Barnes	f95dbc1276	Remove more of caffe2 (#126705 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126705 Approved by: https://github.com/malfet	2024-05-24 06:53:08 +00:00
Jiong Gong	0d1e228550	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since this is an initial implementation, we are still working on further performance improvements in follow-up PRs, including kernel optimizations as well as fusions. Perf gains are observed only on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.
Static shapes

| Benchmark | torchbench | huggingface | timm_models |
|---|---|---|---|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up: drq: 1.14x soft_act: 1.12 cait_m36_384: 1.18x

Dynamic shapes

| Benchmark | torchbench | huggingface | timm_models |
|---|---|---|---|
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up: BERT_pytorch: 1.22x pyhpc_turbulent: 1.13x soft_actor_critic: 1.77x BlenderbotForCausalLM: 1.09x cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-24 06:26:33 +00:00
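The layering this commit describes (cache blocking in the template, register blocking in a micro-kernel) can be illustrated with a minimal pure-Python sketch. It is not the generated C++: the function names, block-size parameters and default values are all hypothetical, thread decomposition and `B` prepacking are omitted, and unlike the template described above it tolerates sizes that are not multiples of the register block.

```python
def micro_gemm(A, B, C, m0, n0, k0, mr, nr, kc):
    """Accumulate the mr x nr tile of C at (m0, n0) from a kc-deep slice of A and B."""
    for i in range(m0, m0 + mr):
        for j in range(n0, n0 + nr):
            acc = C[i][j]
            for k in range(k0, k0 + kc):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, mc=4, nc=4, kc=4, mr=2, nr=2):
    """Compute C = A @ B with cache blocks (mc, nc, kc) and register tiles (mr, nr)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for mb in range(0, M, mc):              # cache block over rows of A
        m_end = min(mb + mc, M)
        for nb in range(0, N, nc):          # cache block over columns of B
            n_end = min(nb + nc, N)
            for kb in range(0, K, kc):      # cache block over the reduction dim
                k_len = min(kc, K - kb)
                for m0 in range(mb, m_end, mr):       # register tiles in the block
                    for n0 in range(nb, n_end, nr):
                        micro_gemm(A, B, C, m0, n0, kb,
                                   min(mr, m_end - m0),
                                   min(nr, n_end - n0),
                                   k_len)
    return C


A = [[float(i * 7 + k) for k in range(7)] for i in range(5)]
B = [[float(k * 3 + j) for j in range(3)] for k in range(7)]
C = blocked_gemm(A, B)
```

Splitting the loops this way leaves the outer levels responsible only for choosing which tiles to visit, so a different micro-kernel could be swapped in without touching them.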
Scott Wolchok	505b8ceaa2	Double registers per iteration in FP32-arithmetic FP16 ARM gemv kernel (#126877 ) Summary: I found that doubling this significantly improved performance, but doubling again did not, so I stopped here. Test Plan: CI Benchmarked with llm_experiments repo as previously in stack; relevant data: before: trans_b torch.float16 1396.11 usec (4100) trans_b torch.float16 1399.54 usec (4104) after: trans_b torch.float16 1096.00 usec (4100) trans_b torch.float16 1093.47 usec (4104) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126877 Approved by: https://github.com/malfet ghstack dependencies: #126745, #126746, #126793, #126794	2024-05-24 05:57:09 +00:00
Scott Wolchok	e8fa0f10c5	Quadruple registers per iteration in ARM64 FP16 kernel (#126794 ) The machine has plenty of registers we weren't using. This looks like it might improve performance a couple percent, though there is noise so I'm not certain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126794 Approved by: https://github.com/malfet ghstack dependencies: #126745, #126746, #126793	2024-05-24 05:57:09 +00:00
daitian1995	f6366454db	Add privateuse1 in FSDP's sharded grad scaler (#126971 ) 1. add privateuse1 in FSDP's sharded grad scaler 2. support found_inf copy for more devices Pull Request resolved: https://github.com/pytorch/pytorch/pull/126971 Approved by: https://github.com/awgu, https://github.com/weifengpy	2024-05-24 05:54:25 +00:00
drisspg	2f6954c7c3	Update the modification api (#127035 ) # Summary Updates the modification jinja template's api so as to specify the output_name for the fixed buffer. Also updates flex-attention's usage to make the algorithm clearer and more closely aligned with the vmap impl. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127035 Approved by: https://github.com/Chillee	2024-05-24 04:45:34 +00:00
Andrew Gu	894efcd0e9	[DTensor] Supported simple replicate strategy for SVD (#127004 ) This PR adds a simple strategy to always replicate for `torch.linalg.svd()`. This is to help unblock some GaLore exploration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127004 Approved by: https://github.com/wanchaol	2024-05-24 04:34:43 +00:00
Aaron Orenstein	70dc59c55f	Fix perf regression caused by #122074 (#126996 ) The original change was about 9.5% slower than before #122074. This improves it to be only about 1.4% slower. Also touched up some unrelated nits that the linter complained about. Fixes #126293 Ran torchbench 3 times on each change. Perf values before (stable), after (fix), and with #122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:  43.948x 45.754x 44.906x
fix:     47.505x 49.987x 47.493x
backout: 48.243x 48.199x 48.192x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:  15.224x 13.286x 15.354x
fix:     16.402x 16.370x 16.183x
backout: 16.554x 16.675x 16.787x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:  1.712x 1.651x 1.640x
fix:     1.804x 1.798x 1.792x
backout: 1.864x 1.824x 1.836x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996 Approved by: https://github.com/jansel	2024-05-24 04:27:22 +00:00
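The commit message does not say how the 9.5% and 1.4% figures were derived from the three runs per model. One plausible reading, sketched below as an assumption, is to average the three runs for each configuration, compare against the backout average per model, and then average the per-model slowdowns. The variable and function names are illustrative only.

```python
from statistics import mean

# Speedup values copied from the benchmark output in the commit message.
runs = {
    "pyhpc_isoneutral_mixing": {
        "stable": [43.948, 45.754, 44.906],
        "fix": [47.505, 49.987, 47.493],
        "backout": [48.243, 48.199, 48.192],
    },
    "pyhpc_equation_of_state": {
        "stable": [15.224, 13.286, 15.354],
        "fix": [16.402, 16.370, 16.183],
        "backout": [16.554, 16.675, 16.787],
    },
    "lennard_jones": {
        "stable": [1.712, 1.651, 1.640],
        "fix": [1.804, 1.798, 1.792],
        "backout": [1.864, 1.824, 1.836],
    },
}


def slowdown_pct(variant):
    """Mean per-model slowdown of `variant` relative to backout, in percent."""
    per_model = [
        (1 - mean(r[variant]) / mean(r["backout"])) * 100
        for r in runs.values()
    ]
    return mean(per_model)


stable_slowdown = slowdown_pct("stable")
fix_slowdown = slowdown_pct("fix")
```

Under this assumed averaging the results land close to the quoted figures, with the stable build between 9% and 10% slower than the backout and the fix between 1% and 2% slower.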
Angela Yi	cb6ef68caa	Propagate tokens in aotautograd (#127028 ) Test Plan: `buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 938593492 --output /tmp/938593492.zip --use-torchrec-eager-mp --use-manifold` Differential Revision: D57750072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127028 Approved by: https://github.com/tugsbayasgalan	2024-05-24 03:23:17 +00:00
PyTorch MergeBot	99a11efc8a	Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 )" This reverts commit e2f081837f4276c1a6a37739bd28157f62004a06. Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to I think dr ci is wrong and the windows build failure is real `e2f081837f` https://github.com/pytorch/pytorch/actions/runs/9216826622/job/25357819877 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2128388126))	2024-05-24 02:37:46 +00:00
drisspg	cfb374dc73	[BE] Create grad check util (#126991 ) # Summary Add small utility func for deciding if we should compute LSE and update to also check for gradMode Pull Request resolved: https://github.com/pytorch/pytorch/pull/126991 Approved by: https://github.com/cpuhrsch	2024-05-24 02:36:00 +00:00
Anshul Sinha	27594be3ed	[dtensor][be] remove repeated test in test_comm_mode.py (#127029 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127029 Approved by: https://github.com/XilunWu ghstack dependencies: #127025	2024-05-24 01:42:13 +00:00
Anshul Sinha	89c638f9a5	[dtensor][debug] add all_reduce_coalesced tracing to CommDebugMode (#127025 ) Summary Added all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode test suite. Test Plan pytest test/distributed/_tensor/debug/test_comm_mode.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/127025 Approved by: https://github.com/XilunWu	2024-05-24 01:42:13 +00:00
laithsakka	575cb617db	Add compile time profiler for non fbcode targets (#126904 ) This is a tool that allow profiling compile time using strobelight profiler, its a meta only tool. but works on non-fbcode targets. A follow up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py. example test: ``` run python tools/strobelight/examples/compile_time_profile_example.py ``` ``` python torch/utils/_strobelight/examples/compile_time_profile_example.py strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282
334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330 strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3 strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events. strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497 strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events. 
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558 strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3 strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events. ``` or pass TORCH_COMPILE_STROBELIGHT=TRUE for any torch compile python program. ex running on XLNetLMHeadModel. ``` TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel ``` result: Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904 Approved by: https://github.com/aorenste ghstack dependencies: #126693	2024-05-24 01:39:40 +00:00
Joel Schlosser	e2f081837f	Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946 ) PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`: * `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()` * `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()` CPU impls for these new ATen ops will be added in a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946 Approved by: https://github.com/davidberard98	2024-05-24 00:42:59 +00:00
Richard Barnes	3f5b59eef4	[codemod] c10::optional -> std::optional in caffe2/aten/src/ATen/DeviceGuard.h +117 (#126901 ) Summary: Generated with ``` fbgs -f '.*\.(cpp\|cxx\|cc\|h\|hpp\|cu\|cuh)$' c10::optional -l \| perl -pe 's/^fbsource.fbcode.//' \| grep -v executorch \| xargs -n 50 perl -pi -e 's/c10::optional/std::optional/g' ``` - If you approve of this diff, please use the "Accept & Ship" button :-) (117 files modified.) Test Plan: Sandcastle Reviewed By: palmje Pull Request resolved: https://github.com/pytorch/pytorch/pull/126901 Approved by: https://github.com/Skylion007, https://github.com/eqy	2024-05-24 00:26:15 +00:00
cyy	95e5c994f9	[Submodule] Clear USE_QNNPACK build option (#126941 ) Following the removal of QNNPACK third-party module #126657, we can clear more build system code. Also third_party/neon2sse was removed because it is not used. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126941 Approved by: https://github.com/ezyang	2024-05-24 00:12:56 +00:00
PyTorch MergeBot	dfabae5b89	Revert "[pipelining] Add grad test for interleaved schedules (#126931 )" This reverts commit abf6d4e6bc1a9a0e08bfc2204560ca7858fa90cd. Reverted https://github.com/pytorch/pytorch/pull/126931 on behalf of https://github.com/clee2000 due to newly added test fails distributed/pipelining/test_schedule.py::ScheduleTest::test_grad_with_manual_interleaved_ScheduleClass0 `abf6d4e6bc` https://github.com/pytorch/pytorch/actions/runs/9214413308/job/25352507591, pull workflow failed on startup on PR, so no distributed tests ran at all ([comment](https://github.com/pytorch/pytorch/pull/126931#issuecomment-2128228496))	2024-05-23 23:51:29 +00:00
Pian Pawakapan	2db13633e7	[export] disable forced specializations, even when solvable with single var (#126925 ) Summary: Previously https://github.com/pytorch/pytorch/pull/124949 added the ability to disable forced specializations on dynamic shapes for export, keeping dynamism for complex guards instead of specializing, allowing unsoundness by having the user fail at runtime. It avoided disabling one case: single-variable equality guards, where a variable is specified as dynamic but can be solvable for a concrete value, suggesting the correct behavior is specialization. For example, guard : Eq(s0 // 4, 400) suggests s0 should specialize to 1600. In debugging, some users (e.g. APS) would like to keep this dynamic, and defer to failing at runtime instead. This PR adds this, so now all forced specializations should be turned off. Mostly this should be used for debugging, since it produces unsoundness, and lets the user proceed with (probably) incorrect dynamism. Test Plan: export tests Differential Revision: D57698601 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126925 Approved by: https://github.com/angelayi	2024-05-23 23:43:30 +00:00
James Wu	6eac3f45c7	Add basic sanity checks for graph ops to cache key (#124745 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124745 Approved by: https://github.com/bdhirsh	2024-05-23 23:37:43 +00:00
Aart Bik	ff82e2e7cf	[traced-graph][sparse] propagate sparsity metadata into traced graph (#117907 ) Propagate sparsity metadata from sparse tensors of torch.sparse into the traced graph representation (which would be useful for a JIT backend that supports a "sparse compiler"). This is a first careful attempt, since the actual "meta" feature still seems incomplete for coo and completely lacking for csr/csc/bsr/bsc. For background see forum postings (with examples): https://discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/195145 https://dev-discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/1803 And feature request: https://github.com/pytorch/pytorch/issues/117188 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117907 Approved by: https://github.com/pearu, https://github.com/ezyang	2024-05-23 22:46:46 +00:00
Yueming Hao	93ba5e7291	Fix typo for input (#126981 ) The variable name should be `cloned_inputs` rather than `clone_inputs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126981 Approved by: https://github.com/xuzhao9	2024-05-23 22:08:14 +00:00
William Wen	d11e44c0d0	Reset grad state across unittests (#126345 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126345 Approved by: https://github.com/ezyang	2024-05-23 21:16:39 +00:00
Catherine Lee	a31a60d85b	Change run_test.py arg parsing to handle additional args better (#126709 ) Do not inherit parser from common_utils * I don't think we use any variables in run_test that depend on those, and I think all tests except doctests run in a subprocess so they will parse the args in common_utils and set the variables. I don't think doctests wants any of those variables? Parse known args, add the extra args as extra, pass the extra ones along to the subprocess Removes the first instance of `--` I think I will miss run_test telling me if an arg is valid or not Pull Request resolved: https://github.com/pytorch/pytorch/pull/126709 Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/Flamefire	2024-05-23 21:08:12 +00:00
Catherine Lee	09a73da190	Downgrade requests to 2.31.0 for ios and android (#126989 ) Ex https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342181353 https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342182105 2.32.0 isn't on the conda channels yet? Is there a way to add them? If not here's a PR to downgrade Pull Request resolved: https://github.com/pytorch/pytorch/pull/126989 Approved by: https://github.com/atalman, https://github.com/malfet	2024-05-23 21:02:50 +00:00
wz337	0d2ac9782b	[FSDP1] Update docstring to include device_mesh arg (#126589 ) Fixes #126548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126589 Approved by: https://github.com/wanchaol	2024-05-23 20:40:48 +00:00
Wei Wang	0902929d58	[CUDA] [CI]: Enable CUDA 12.4 CI (#121956 ) Reference PR: https://github.com/pytorch/pytorch/pull/93406 Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121956 Approved by: https://github.com/atalman	2024-05-23 20:37:47 +00:00
Ke Wen	abf6d4e6bc	[pipelining] Add grad test for interleaved schedules (#126931 ) Added `test_grad_with_manual_interleaved`: - Model: `MultiMLP` - Tested schedules: Interleaved1F1B, LoopedBFS - Two stages per rank ``` Rank 0 stages: [0, 2] Rank 1 stages: [1, 3] ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931 Approved by: https://github.com/wconstab ghstack dependencies: #126812, #126721, #126735, #126927	2024-05-23 20:26:08 +00:00
Ke Wen	c46b38bc75	[pipelining] Generalize definition of MultiMLP for testing interleaved schedules (#126927 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126927 Approved by: https://github.com/wconstab ghstack dependencies: #126812, #126721, #126735	2024-05-23 20:26:08 +00:00
Will Constable	6b39146b3f	[pipelining] Validate stage input/output shape/dtype (#126732 ) Address the classes of user errors stemming from (possibly) unintentional dynamic shapes usage or mismatch of configuration time and run time data shapes/dtypes. The goal is to ensure a clear error is raised rather than relying on some underlying error to bubble up when a tensor shape is not compatible, or worse, having a silent correctness issue. Classes of shape/dtype errors * (a) error is thrown within the stage-module forward code, but may be hard to understand/trace back to an input issue * (b) silent correctness issue happens inside the stage-module forward, but the expected output shape is still produced * (c) the stage-module produces an output that is locally correct, but not matching the expectation of the following stage, leading to a hang or correctness issue down the line How validation helps Input shape validation - improves debuggability of case (a) - guards against case (b) - only needed on first stage, since subsequent stages use pre-allocated recv buffers that can't change shape/size even if they wanted to Output shape validation - guards against case (c) Validation of first stage input and all stages' outputs inductively verifies all shapes Shape/dtype are most critical as they literally affect the number of bytes on the wire. Strides and other tensor properties may also (?) matter, and the validation function can be adjusted accordingly if needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126732 Approved by: https://github.com/kwen2501	2024-05-23 20:16:06 +00:00
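The validation idea described in the commit above (#126732) can be sketched in a few lines. This is only an illustration and not the code from that PR: `TensorMeta`, `validate_tensor_metadata`, and `validate_stage_io` are hypothetical names, and dtypes are compared by their string form so that the sketch runs without PyTorch installed.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical record of the shape/dtype fixed at configuration time."""

    shape: Tuple[int, ...]
    dtype: str


def validate_tensor_metadata(desc: str, expected, actual) -> None:
    """Raise a descriptive error if `actual` deviates from `expected`.

    Both arguments may be any object exposing `.shape` and `.dtype`.
    """
    if tuple(actual.shape) != tuple(expected.shape):
        raise ValueError(
            f"{desc}: expected shape {tuple(expected.shape)}, "
            f"got {tuple(actual.shape)}"
        )
    if str(actual.dtype) != str(expected.dtype):
        raise ValueError(
            f"{desc}: expected dtype {expected.dtype}, got {actual.dtype}"
        )


def validate_stage_io(desc: str, expected: Sequence, actual: Sequence) -> None:
    """Check every tensor of a stage's inputs (or outputs) in order."""
    if len(expected) != len(actual):
        raise ValueError(
            f"{desc}: expected {len(expected)} tensors, got {len(actual)}"
        )
    for i, (exp, act) in enumerate(zip(expected, actual)):
        validate_tensor_metadata(f"{desc} #{i}", exp, act)
```

Calling `validate_stage_io` on the first stage's inputs and on every stage's outputs mirrors the inductive argument in the commit message: if the first input and each output match what was configured, every intermediate receive buffer matches as well.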
Edward Z. Yang	9b91c91e64	Don't add to replacements when guard is suppressed (#126210 ) Also improve logging when guards are suppressed Partially addresses https://github.com/pytorch/pytorch/issues/125641 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126210 Approved by: https://github.com/jbschlosser	2024-05-23 20:10:29 +00:00
Richard Zou	f8857cef45	[Reland] Verify types in custom op schemas (#126861 ) Summary: co-dev reland of https://github.com/pytorch/pytorch/pull/124520, which requires the removal of some executorch tests. Before this PR, we didn't check that types in a schema were valid. This is because TorchScript treats unknown types as type variables. This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this, we add an `allow_typevars` flag to parseSchema so that TorchScript can use allow_typevars=True. We also add some error messages for common mistakes (e.g. using int64_t or double in schema). Test Plan: Wait for tests Differential Revision: D57666659 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126861 Approved by: https://github.com/albanD	2024-05-23 19:53:52 +00:00
Ke Wen	c921c5cc77	[c10d] Print certain logs only on head rank of each node (#125432 ) Recently we added the following warning, which is printed on every rank and makes the log a bit verbose. This PR dedups certain logs that are identical across ranks and prints them only on head rank of each node. Resolves https://github.com/pytorch/pytorch/issues/126275 ========================================= [rank0]:[W502 14:06:55.821964708 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 [rank1]:[W502 14:06:57.994276972 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 [rank2]:[W502 14:07:00.353013116 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. 
This constraint has always been present, but this warning has only been added since PyTorch 2.4 [rank3]:[W502 14:07:02.515511670 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125432 Approved by: https://github.com/wconstab	2024-05-23 19:16:11 +00:00
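The deduplication described in the commit above (#125432) lives in the c10d C++ code; the following Python sketch only illustrates the general idea of emitting a message once per node. It assumes the launcher exports a `LOCAL_RANK` environment variable (as `torchrun` does) and treats local rank 0 as the head rank of the node. The helper names `is_node_head_rank` and `warn_once_per_node` are hypothetical.

```python
import logging
import os
from typing import Mapping, Optional


def is_node_head_rank(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when this process is local rank 0 on its node.

    Assumption: the launcher sets LOCAL_RANK; a missing variable is
    treated as a single-process run, which is its own head rank.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0")) == 0


def warn_once_per_node(
    logger: logging.Logger,
    message: str,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Log `message` only on the node's head rank; report whether it logged."""
    if is_node_head_rank(env):
        logger.warning(message)
        return True
    return False
```

With four ranks on one node, as in the log excerpt above, only the process with `LOCAL_RANK=0` would print the warning.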
Jason Ansel	0625f92993	[inductor] Run some tests on correct device (#126943 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126943 Approved by: https://github.com/yanboliang	2024-05-23 18:47:44 +00:00
Scott Wolchok	abf40320dd	remove ax/ay arrays in fp16 ARM matmul kernels (#126793 ) These shouldn't do anything as only two elements are live at once, so we can simplify the code. (I checked assembly for the inner loops in instruments and it seems to be the same.) Differential Revision: [D57732738](https://our.internmc.facebook.com/intern/diff/D57732738) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126793 Approved by: https://github.com/malfet ghstack dependencies: #126745, #126746	2024-05-23 18:42:45 +00:00
Scott Wolchok	5dcf3d0f9e	use arith-by-dot-products approach for fp32 accumulation in fp16 matmul (#126746 ) Summary: The faster fp16-native kernel is gated off by default. Let's give people better performance in the default case. Test Plan: CI benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers without FP16 reduction (which then uses this kernel): after: trans_b torch.float16 1396.11 usec (4100) trans_b torch.float16 1399.54 usec (4104) before: trans_b torch.float16 1840.79 usec (4100) trans_b torch.float16 1786.67 usec (4104) Differential Revision: [D57732736](https://our.internmc.facebook.com/intern/diff/D57732736) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126746 Approved by: https://github.com/malfet ghstack dependencies: #126745	2024-05-23 18:42:45 +00:00
Scott Wolchok	fd4fd24080	add tail fixup for fp16 gemv transposed fast path (#126745 ) Summary: We previously had restrictive gating for the fp16 kernel; now it supports arbitrary m & n. Test Plan: 1) ran test coverage added in #126700, passes 2) benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers with FP16 reduction enabled (which gates this kernel): after: trans_b torch.float16 716.42 usec (4100) trans_b torch.float16 711.10 usec (4104) Before: trans_b torch.float16 1808.66 usec (4100) trans_b torch.float16 1083.18 usec (4104) Differential Revision: [D57732737](https://our.internmc.facebook.com/intern/diff/D57732737) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126745 Approved by: https://github.com/malfet	2024-05-23 18:42:35 +00:00
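A "tail fixup" in the sense of the commit above (#126745) means handling the leftover elements that do not fill a whole vector-width block. The benchmark sizes exercise this: 4100 leaves a remainder of 4 modulo both 32 and 8, while 4104 is a multiple of 8 but leaves 8 modulo 32. The sketch below is plain Python, not the ARM kernel from the PR; `blocked_dot` and its `block` parameter are hypothetical and only show the main-loop-plus-tail structure for one dot product.

```python
from typing import Sequence


def blocked_dot(a: Sequence[float], b: Sequence[float], block: int = 8) -> float:
    """Dot product with a blocked main loop and a scalar tail.

    The per-lane accumulators stand in for a vector register of width `block`.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    main = n - (n % block)  # largest multiple of `block` not exceeding n

    # Main loop: process full blocks, one accumulator per lane.
    lanes = [0.0] * block
    for start in range(0, main, block):
        for lane in range(block):
            lanes[lane] += a[start + lane] * b[start + lane]
    total = sum(lanes)

    # Tail fixup: the remaining n % block elements, handled one at a time.
    for i in range(main, n):
        total += a[i] * b[i]
    return total
```

Without the tail loop, a kernel of this shape would be correct only when the length is a multiple of `block`, which is the kind of restrictive gating the commit message says was lifted.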
PyTorch MergeBot	b36e390b6c	Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814 )" This reverts commit eb41ed5d90e946e62dd664d7037ebbb021baf33e. Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))	2024-05-23 17:43:06 +00:00
PyTorch MergeBot	6a06d36296	Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 )" This reverts commit ab61309ab8f6452975021994a6d4a102d55feba8. Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))	2024-05-23 17:43:06 +00:00
Jiashen Cao	041e8d73fd	Separate non/strict functions in _export (#126718 ) Move non/strict _export to different functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126718 Approved by: https://github.com/angelayi	2024-05-23 17:41:23 +00:00
cyy	e5db6758c8	[BE]: Use make_unique (#126966 ) Adds make_unique in places Pull Request resolved: https://github.com/pytorch/pytorch/pull/126966 Approved by: https://github.com/Skylion007	2024-05-23 17:39:48 +00:00
wz337	264155a8d7	[DCP][AC] Add test for apply AC with FSDP1 (#126935 ) Adding test for this cherry pick. https://github.com/pytorch/pytorch/pull/126559/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/126935 Approved by: https://github.com/fegin	2024-05-23 17:35:54 +00:00
Richard Barnes	bbe68a16b9	[codemod][lowrisk] Remove extra semi colon from caffe2/caffe2/core/observer.h (#126976 ) Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D57632765 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126976 Approved by: https://github.com/Skylion007	2024-05-23 17:31:19 +00:00
Sherlock Huang	a63310eebc	TorchScript 2 ExportedProgram Converter (#126920 ) Summary: Initial commit for TorchScript 2 ExportedProgram Converter. TODO: - Improve TorchScript IR coverage - parameter and buffers should be owned by output ExportedProgram - Experiment on conditional op conversion Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestConverter Differential Revision: D57694784 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126920 Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan	2024-05-23 17:00:18 +00:00
PyTorch MergeBot	1b29c16e5e	Revert "Introduce ProcessGroupCudaP2P (#122163 )" This reverts commit 2dd269986027ea25c092f769ef8e9524920aaef6. Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))	2024-05-23 16:06:14 +00:00
Mikayla Gawarecki	ab61309ab8	Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819 Approved by: https://github.com/albanD ghstack dependencies: #126814	2024-05-23 15:43:32 +00:00
Mikayla Gawarecki	eb41ed5d90	Default XLA to use swap_tensors path in nn.Module._apply (#126814 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814 Approved by: https://github.com/JackCaoG, https://github.com/albanD	2024-05-23 15:43:32 +00:00
Animesh Jain	f0366de414	[dynamo] Support __contains__ on obj.__dict__ (#126922 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126922 Approved by: https://github.com/jansel, https://github.com/yanboliang	2024-05-23 09:01:29 +00:00
PyTorch MergeBot	25b8dbc3e4	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit 9da7efa6774777890c8e4a713f6d23ea5cfcf6a4. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
PyTorch MergeBot	45784cd229	Revert "[inductor][cpp] epilogue support for gemm template (#126019 )" This reverts commit 08f57b4bffe6edfdb016703219744482b4d03e23. Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
PyTorch MergeBot	926327e8fc	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit 31412cb2f25bda0fe31dae7b2afc88278794cad6. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
PyTorch MergeBot	30c9ca0899	Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 )" This reverts commit 7b6d036c05bd782f5e59bdb353f9e47865e9db50. Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))	2024-05-23 08:50:18 +00:00
angelayi	da7bf1d588	[export] Fix unflatten with empty nn_module_stack (#126785 ) Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1433418843962989/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/126785 Approved by: https://github.com/tugsbayasgalan	2024-05-23 08:34:25 +00:00
Oguz Ulgen	a6155d23d1	[easy] Delete dead code global (#126903 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126903 Approved by: https://github.com/aorenste ghstack dependencies: #126083	2024-05-23 08:29:29 +00:00
Oguz Ulgen	cc61d03ac9	Do not trace into triton/backends (#126083 ) Fixes #125807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126083 Approved by: https://github.com/yanboliang, https://github.com/jansel	2024-05-23 08:29:29 +00:00
laithsakka	558c4413ce	add strobelight cli function profiler (#126693 ) This is a meta only tool, this allow users to profile any python function by annotating it with strobelight using the strobelight profiler. ex ``` def fn(x, y, z): return x * y + z # use decorator with default profiler. @strobelight() @torch.compile() def work(): for i in range(100): for j in range(5): fn(torch.rand(j, j), torch.rand(j, j), torch.rand(j, j)) work() ``` test ``` python torch/utils/strobelight/examples/cli_function_profiler_example.py strobelight_cli_function_profiler, line 274, 2024-05-20 11:05:41,513, INFO: strobelight run id is: -6222660165281106 strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:08,318, INFO: strobelight profiling running strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:11,867, INFO: strobelight profiling stopped strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Total samples: 2470 strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/oiqmyltg strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/b10x92x0 strobelight_cli_function_profiler, line 274, 2024-05-20 11:06:18,476, INFO: strobelight run id is: -4112659701221677 strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:45,096, INFO: strobelight profiling running strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:52,366, INFO: strobelight profiling stopped strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: Total samples: 1260 strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/0yyx6el5 strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,223, INFO: Icicle view (python stack): 
https://fburl.com/scuba/pyperf_experimental/on_demand/8m2by4ea (base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$ python torch/profiler/strobelight_cli_function_profiler_example.py strobelight_cli_function_profiler, line 274, 2024-05-20 11:07:26,701, INFO: strobelight run id is: -2373009368202256 strobelight_cli_function_profiler, line 276, 2024-05-20 11:07:53,477, INFO: strobelight profiling running strobelight_cli_function_profiler, line 257, 2024-05-20 11:07:56,827, INFO: strobelight profiling stopped strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Total samples: 2372 strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/dk797xg9 strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/4w6c8vnm strobelight_cli_function_profiler, line 274, 2024-05-20 11:08:03,235, INFO: strobelight run id is: -1919086123693716 strobelight_cli_function_profiler, line 276, 2024-05-20 11:08:29,848, INFO: strobelight profiling running strobelight_cli_function_profiler, line 257, 2024-05-20 11:08:37,233, INFO: strobelight profiling stopped strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Total samples: 1272 strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/43r58aew strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/9g52onmw (base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126693 Approved by: https://github.com/aorenste	2024-05-23 07:42:25 +00:00
Jiong Gong	7b6d036c05	[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545 ) As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows: 1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling. 2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops. 3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen. 4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019, #126068	2024-05-23 07:39:29 +00:00
Jiong Gong	31412cb2f2	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-23 07:39:29 +00:00
Jiong Gong	08f57b4bff	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-23 07:39:29 +00:00
Jiong Gong	9da7efa677	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.

1. Cpp template infrastructure
Template abstractions similar to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.

2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with a constant weight (`B`), requires `N` to be a multiple of the register blocking, and allows static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.

3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes

| Benchmark | torchbench | huggingface | timm_models |
|---|---|---|---|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up: drq: 1.14x, soft_act: 1.12, cait_m36_384: 1.18x

Dynamic shapes

| Benchmark | torchbench | huggingface | timm_models |
|---|---|---|---|
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up: BERT_pytorch: 1.22x, pyhpc_turbulent: 1.13x, soft_actor_critic: 1.77x, BlenderbotForCausalLM: 1.09x, cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-23 07:39:29 +00:00
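The layering described in the GEMM template commit above (outer blocking that hands tiles to a micro-kernel) can be illustrated with a small pure-Python sketch. This is not the code generated by `CppPackedGemmTemplate`: the names `blocked_gemm` and `micro_gemm` and the block sizes are hypothetical, and threading, weight prepacking and vectorization are left out.

```python
def micro_gemm(A, B, C, m0, m1, n0, n1, k0, k1):
    """Accumulate C[m0:m1, n0:n1] += A[m0:m1, k0:k1] @ B[k0:k1, n0:n1]."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = C[i][j]
            for k in range(k0, k1):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, block_m=2, block_n=2, block_k=2):
    """Compute A @ B by walking tiles and delegating each tile to micro_gemm."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Outer loops stand in for the blocking levels; min() handles ragged edges.
    for m0 in range(0, M, block_m):
        for n0 in range(0, N, block_n):
            for k0 in range(0, K, block_k):
                micro_gemm(
                    A, B, C,
                    m0, min(m0 + block_m, M),
                    n0, min(n0 + block_n, N),
                    k0, min(k0 + block_k, K),
                )
    return C
```

Changing the block sizes changes only the traversal order of the tiles, not the result, which is why a template can tune them per CPU without affecting correctness.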
chilli	aa6de76181	Fix silu test for flexattention (#126641 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641 Approved by: https://github.com/ezyang, https://github.com/drisspg ghstack dependencies: #126615, #126446	2024-05-23 05:45:07 +00:00
youkaichao	36e70572d0	[Dynamo] make bytecode of resume function resemble natural bytecode (#126630 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126630 Approved by: https://github.com/williamwen42	2024-05-23 05:06:33 +00:00
PyTorch MergeBot	2c90b99267	Revert "reset dynamo cache before each test (#126586 )" This reverts commit 43f2f43eb3b6d8cbe8eb7f45acb50376092f1a16. Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 `43f2f43eb3` https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))	2024-05-23 04:54:28 +00:00
PyTorch MergeBot	b1e214ceb1	Revert "don't check memory format for empty tensors (#126593 )" This reverts commit 12dee4f2046d07db97cddc7b3c5bdf06fc304ae3. Reverted https://github.com/pytorch/pytorch/pull/126593 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 `43f2f43eb3` https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))	2024-05-23 04:54:28 +00:00
PyTorch MergeBot	df4b7cb5f7	Reapply "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970 )" (#126594 ) This reverts commit ce6e36bf8b524c3f4b07605c5b3af2b7d5ba8fd9. Reverted https://github.com/pytorch/pytorch/pull/126594 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 `43f2f43eb3` https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))	2024-05-23 04:54:28 +00:00
PyTorch MergeBot	4f14282e35	Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021 )" This reverts commit 2ac33a9f663269e6060246337c776a20c3b7c858. Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk `2ac33a9f66` ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))	2024-05-23 01:13:29 +00:00
PyTorch MergeBot	657d39e44c	Revert "[inductor][cpp] epilogue support for gemm template (#126019 )" This reverts commit 57108d9a4990f6b2ed3578cee58354ab01505dd3. Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk `2ac33a9f66` ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))	2024-05-23 01:13:29 +00:00
PyTorch MergeBot	205f08140e	Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 )" This reverts commit 57c185b4c765c522a7f2908a773d128c66def190. Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk `2ac33a9f66` ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))	2024-05-23 01:13:29 +00:00
Nikita Shulga	2b57652278	Update requests to 2.32.2 (#126805 ) To address CVE-2024-35195 (though it does not really affect PyTorch, only CI) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126805 Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere, https://github.com/Skylion007	2024-05-23 00:21:28 +00:00
eqy	ebbd431d9e	[CPU] Bump `test_complex_2d` thresholds for LBFGS on `complex64` (#126358 ) Is this supposed to be bitwise identical? Wasn't sure how to interpret the comment but it seems to be giving mismatches like: ``` Mismatched elements: 1 / 2 (50.0%) Greatest absolute difference: 4.6372413635253906e-05 at index (1,) (up to 1e-05 allowed) Greatest relative difference: 3.4600801882334054e-05 at index (1,) (up to 1.3e-06 allowed) To execute this test, run the following from the base repo dir: python test/test_optim.py -k test_complex_2d_LBFGS_cpu_complex64 ``` on Neoverse-N2 SBSA ARM CPUs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126358 Approved by: https://github.com/lezcano, https://github.com/janeyx99	2024-05-23 00:16:45 +00:00
Jiong Gong	57c185b4c7	[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068 ) As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068 Approved by: https://github.com/jansel ghstack dependencies: #124021, #126019	2024-05-23 00:12:38 +00:00
Jiong Gong	57108d9a49	[inductor][cpp] epilogue support for gemm template (#126019 ) As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019 Approved by: https://github.com/jansel ghstack dependencies: #124021	2024-05-23 00:07:52 +00:00
Jiong Gong	2ac33a9f66	[inductor][cpp] GEMM template (infra and fp32) (#124021 ) This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.

1. Cpp template infrastructure
Template abstractions similar to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.

2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with a constant weight (`B`), requires `N` to be a multiple of the register blocking, and allows static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix is prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.

3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes

| Benchmark | torchbench | huggingface | timm_models |
|---|---|---|---|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up: drq: 1.14x, soft_act: 1.12, cait_m36_384: 1.18x

Dynamic shapes

| Benchmark | torchbench | huggingface | timm_models |
|---|---|---|---|
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up: BERT_pytorch: 1.22x, pyhpc_turbulent: 1.13x, soft_actor_critic: 1.77x, BlenderbotForCausalLM: 1.09x, cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021 Approved by: https://github.com/jansel	2024-05-22 23:59:12 +00:00
Andrew Gu	e3db9ba37a	[FSDP2] Added test for manual reshard with `reshard_after_forward=False` (#126892 ) This test shows that we could always set `reshard_after_forward=False` but manually insert calls to `module.reshard()` to implement the resharding after forward. This is useful for advanced PP schedules. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126892 Approved by: https://github.com/wanchaol ghstack dependencies: #126887	2024-05-22 23:35:06 +00:00
Andrew Gu	203f2641e9	[FSDP2] Used `CommDebugMode` for comm. count test (#126887 ) simplify the test :) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126887 Approved by: https://github.com/wanchaol	2024-05-22 23:35:06 +00:00
Andrew Gu	69325e4de6	[FSDP] Warned on wrapping `ModuleList`/`ModuleDict` (#124764 ) This partially addresses https://github.com/pytorch/pytorch/issues/113794. To avoid being BC breaking, we just issue a warning when wrapping `ModuleList` or `ModuleDict`. We want to add this warning since this is a common pitfall. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124764 Approved by: https://github.com/wanchaol	2024-05-22 23:34:52 +00:00
laithsakka	b0e849870e	Change error message when nn module inlining is enabled for MiscTests.test_map_side_effects (#126444 ) #fix https://github.com/pytorch/pytorch/issues/126355 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126444 Approved by: https://github.com/anijain2305	2024-05-22 23:24:03 +00:00
Shunting Zhang	17186bd5b6	[inductor] make conv lowering work with dynamic shapes (#126823 ) Fix an issue reported by internal user that conv lowering does not work well with dynamic shapes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126823 Approved by: https://github.com/jansel	2024-05-22 23:15:29 +00:00
Shunting Zhang	14c5c753de	[inductor] use smaller RBLOCK for expensive reduction kernels (#126477 ) Triton sometimes uses fewer registers for a more expensive kernel, which results in worse perf ( https://github.com/pytorch/pytorch/issues/126463 ). This may make inductor end up with a sub-optimal config. Use a smaller max RBLOCK if the reduction potentially needs many registers. Will run perf test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126477 Approved by: https://github.com/jansel	2024-05-22 22:47:10 +00:00
Shunting Zhang	ce6e36bf8b	Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970 )" (#126594 ) This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7. enable the test since it's fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594 Approved by: https://github.com/huydhn ghstack dependencies: #126586, #126593	2024-05-22 22:43:09 +00:00
Shunting Zhang	12dee4f204	don't check memory format for empty tensors (#126593 ) Fix https://github.com/pytorch/pytorch/issues/125967 . The test actually fails for empty 4D or 5D tensors when checking the memory format. I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor (?). I just skip the check for empty tensors. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593 Approved by: https://github.com/ezyang ghstack dependencies: #126586	2024-05-22 22:43:09 +00:00
Shunting Zhang	43f2f43eb3	reset dynamo cache before each test (#126586 ) In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests. This PR clears the dynamo cache before each unit test so that unit test results are more deterministic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586 Approved by: https://github.com/jansel	2024-05-22 22:43:09 +00:00
Ke Wen	08c260bc29	[pipelining] Test schedules against manual stage (#126735 ) Added manual stage in test_schedule.py so that we can test various schedules against it. In this file we now have: - test_schedule_with_tracer - test_schedule_with_manual - test_grad_with_tracer - test_grad_with_manual Tested schedules are: - ScheduleGPipe - Schedule1F1B Pull Request resolved: https://github.com/pytorch/pytorch/pull/126735 Approved by: https://github.com/wconstab, https://github.com/H-Huang ghstack dependencies: #126812, #126721	2024-05-22 21:54:27 +00:00
jhavukainen	6a539e80dd	Update descriptor fields to resolve fft precision issue (#125328 ) Fixes #124096 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125328 Approved by: https://github.com/kulinseth, https://github.com/malfet	2024-05-22 21:48:49 +00:00
Catherine Lee	5ccc634603	[CI] Pin uv==0.1.45 for lintrunner (#126908 ) `e4623de4cf/1` ``` 2024-05-22T19:10:48.5974515Z + python3 -m pip install uv 2024-05-22T19:10:48.5975198Z Collecting uv 2024-05-22T19:10:48.5976496Z Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB) 2024-05-22T19:10:48.5977828Z Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB) 2024-05-22T19:10:48.5993443Z Installing collected packages: uv 2024-05-22T19:10:48.5993950Z Successfully installed uv-0.1.45 2024-05-22T19:10:48.5994363Z + CACHE_DIRECTORY=/tmp/.lintbin 2024-05-22T19:10:48.5994772Z + [[ -d /tmp/.lintbin ]] 2024-05-22T19:10:48.5995157Z + cp -r /tmp/.lintbin . 
2024-05-22T19:10:48.5995497Z + lintrunner init 2024-05-22T19:10:48.5995839Z + [[ 1 == \1 ]] ``` vs ``` 2024-05-22T20:33:53.5563991Z + python3 -m pip install uv 2024-05-22T20:33:53.5564921Z Collecting uv 2024-05-22T20:33:53.5566259Z Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB) 2024-05-22T20:33:53.5568142Z Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.9 MB) 2024-05-22T20:33:53.5578531Z Installing collected packages: uv 2024-05-22T20:33:53.5579316Z Successfully installed uv-0.2.1 2024-05-22T20:33:53.5580033Z + CACHE_DIRECTORY=/tmp/.lintbin 2024-05-22T20:33:53.5580640Z + [[ -d /tmp/.lintbin ]] 2024-05-22T20:33:53.5581229Z + cp -r /tmp/.lintbin . 
2024-05-22T20:33:53.5581799Z + lintrunner init 2024-05-22T20:33:53.5603302Z Traceback (most recent call last): 2024-05-22T20:33:53.5604857Z File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 101, in <module> 2024-05-22T20:33:53.5605805Z main() 2024-05-22T20:33:53.5606687Z File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 97, in main 2024-05-22T20:33:53.5607762Z run_cmd_or_die(f"docker exec -t {container_name} /exec") 2024-05-22T20:33:53.5608949Z File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 38, in run_cmd_or_die 2024-05-22T20:33:53.5610107Z raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}") 2024-05-22T20:33:53.5611328Z RuntimeError: Command docker exec -t e551764bdba0c87c2fc392fba9ea265e8821a552915b36010f18299d8035b304 /exec failed with exit code 1 2024-05-22T20:33:53.5626540Z ##[error]Process completed with exit code 1. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126908 Approved by: https://github.com/huydhn	2024-05-22 21:41:21 +00:00
lezcano	a30baec0c3	[Docs] Fix NumPy + backward example (#126872 ) We were calling backward on a tensor not a scalar... Pull Request resolved: https://github.com/pytorch/pytorch/pull/126872 Approved by: https://github.com/albanD	2024-05-22 21:29:31 +00:00
Aaron Orenstein	e4623de4cf	typing scheduler.py [2/2]: Apply types (#126656 ) Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout. We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656 Approved by: https://github.com/eellison	2024-05-22 20:33:31 +00:00
hippocookie	3591bce6c7	Add usage explanation in torch.dot document (#125908 ) Fixes #125842 Add a note on unsupported usage of `torch.dot`, to avoid misuse like: ```python >>> t1, t2 = torch.tensor([0,1]), torch.tensor([2,3]) >>> torch.dot(input=t1, other=t2) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: dot() missing 1 required positional arguments: "tensor" ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125908 Approved by: https://github.com/albanD	2024-05-22 20:33:12 +00:00
Masaki Kozuki	0939b68980	Support `dtype` kwarg in `_foreach_norm` (#125665 ) Fixes #125040 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125665 Approved by: https://github.com/janeyx99	2024-05-22 20:27:50 +00:00
Kurman Karabukaev	d62b025efc	[TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743 ) Summary: 1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store. 2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by a rdzv_handler where `RendezvousInfo` will have a `RendezvousStoreInfo` object that handlers must return. - Depending on the implementation they can either: - point to an existing store (and are expected to set `use_agent_store` to true, see point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared. - build args that `torch.distributed.init_process_group` can bootstrap by creating a new store. Additional points: - When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope namespace for other usecases. - `next_rendezvous` signature changed to return an instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes. Why: - Reduce moving parts - easier to swap implementation - improve tractability - addressing perf/debug-ability will benefit all usecases Test Plan: CI Differential Revision: D57055235 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743 Approved by: https://github.com/d4l3k	2024-05-22 18:24:11 +00:00
Wanchao Liang	fde1e8af7a	[dtensor] implement distributed topk operator (#126711 ) as titled. Implemented the topk operator in DTensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126711 Approved by: https://github.com/wz337 ghstack dependencies: #126710	2024-05-22 18:11:56 +00:00
Wanchao Liang	af633e4a7b	[dtensor] remove unused failed_reason (#126710 ) as titled, this field is not actively used, so removing it Pull Request resolved: https://github.com/pytorch/pytorch/pull/126710 Approved by: https://github.com/wz337	2024-05-22 18:11:56 +00:00
William Wen	a8195f257e	[custom_op] use new python custom ops API on prims ops (#124665 ) Also adds a non-decorator version of `custom_op`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124665 Approved by: https://github.com/zou3519	2024-05-22 17:48:33 +00:00
Shiyan Deng	db0b74bbc5	[CUDA Caching Allocator] Allow division of 0 (#126833 ) Summary: Division of 0 means disabling roundup. Test Plan: CI Differential Revision: D57651410 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126833 Approved by: https://github.com/banitag1	2024-05-22 17:40:39 +00:00
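As a rough illustration of what "division of 0 means disabling roundup" could look like, here is a pure-Python sketch of power-of-two-division rounding. It is an assumption-laden model, not the allocator's actual code: the function name `round_size`, the 512-byte `MIN_BLOCK` granularity, and the handling of small sizes are all hypothetical.

```python
MIN_BLOCK = 512  # assumed minimum allocation granularity in bytes (hypothetical)


def round_size(size, divisions):
    """Round a positive request size up, optionally on power-of-two divisions."""
    if divisions <= 1:
        # Roundup disabled (divisions == 0): only align to MIN_BLOCK.
        return MIN_BLOCK * ((size + MIN_BLOCK - 1) // MIN_BLOCK)
    floor = 1 << (size.bit_length() - 1)  # largest power of two <= size
    if floor == size:
        return size
    # Split the interval [floor, 2 * floor) into `divisions` equal steps.
    step = floor // divisions
    if step == 0:
        return floor * 2
    return ((size + step - 1) // step) * step
```

Under these assumptions, a 4200-byte request rounds to 5120 bytes with four divisions, but only to 4608 bytes (the next 512-byte multiple) when the division count is 0.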
chilli	d4ec18bdad	Prevent partitioner from ever saving views (#126446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446 Approved by: https://github.com/anijain2305 ghstack dependencies: #126615	2024-05-22 17:28:46 +00:00
chilli	51e707650f	Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615 Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan	2024-05-22 17:28:46 +00:00
Ke Wen	3e826c477a	[pipelining] Add pipeline stage test (#126721 ) Test tracer's and manual's stage creation by using a basic schedule (GPipe). (Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py) Test command: ``` $ python test_stage.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721 Approved by: https://github.com/wconstab, https://github.com/H-Huang ghstack dependencies: #126812	2024-05-22 16:24:51 +00:00
Ke Wen	403012b50a	[pipelining] expose APIs per pytorch rule (#126812 ) Rule is enforced by #126103. The rule: - If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`. - `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported. - All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812 Approved by: https://github.com/wconstab	2024-05-22 16:21:13 +00:00
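The public-API rule quoted in the commit above can be sketched with two small helpers. These are illustrative only: `is_public_path` and `exported_names` are hypothetical names, and this is not the check that #126103 actually enforces.

```python
import types


def is_public_path(module_path):
    """A path such as `torch.a.b` is public if no component starts with `_`."""
    return not any(part.startswith("_") for part in module_path.split("."))


def exported_names(module):
    """Names a star-import would expose: `__all__` if set, else non-underscore names."""
    if hasattr(module, "__all__"):
        return list(module.__all__)
    return [name for name in vars(module) if not name.startswith("_")]


# A stand-in module that follows the rule: public class `C`, private helper.
demo = types.ModuleType("pkg.a.b")
demo.C = type("C", (), {})
demo._helper = lambda: None
demo.__all__ = ["C"]
```

Here `exported_names(demo)` yields only `C`, and `is_public_path("pkg.a._b")` is false, so a public class defined under that path would violate the first bullet of the rule.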
Bin Bao	599e684ad6	[AOTI] Disable stack allocation for OSS (#125732 ) Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720, #126801	2024-05-22 15:33:24 +00:00
Bin Bao	ff617ab6c8	[AOTI] Fix an int array codegen issue (#126801 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains a symbolic expression, we can't declare it with constexpr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801 Approved by: https://github.com/chenyang78 ghstack dependencies: #126720	2024-05-22 15:33:24 +00:00
Bin Bao	19cd4484ec	[AOTI] Add more fallback ops (#126720 ) Summary: These ops are in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720 Approved by: https://github.com/chenyang78	2024-05-22 15:33:24 +00:00
Edward Z. Yang	0d17aae242	Teach FakeTensor to fill in item_memo when converting scalar CPU tensor (#126245 ) This PR requires a little justification, but let's start with what it does first: 1. When you have a 0d CPU scalar int64/float64 tensor input to a graph, we will preallocate a backed SymInt/SymFloat corresponding to what you would get if you call item() on this tensor. This means you can freely change your input to be a Python int/float or a Tensor with an item() call and end up with exactly the same level of expressivity (specifically, you can guard on the internal SymInt/SymFloat no matter what). By default, the source of the backed SymInt/SymFloat is `L['tensor'].item()`, but if you have promoted a float input into a Tensor, we will cancel out `torch.as_tensor(L['float']).item()` into just `L['float']`. 2. We switch wrap_symfloat to use this, instead of hand crafting the new SymNodeVariable. Everything works out, except that we carefully pass the item() result to tracked fakes (and not the fake Tensor argument). OK, so why do this at all? There is some marginal benefit where now some item() calls on scalar inputs can be guarded on, but IMO this is a pretty marginal benefit, and if it were the only reason, I wouldn't do this. The real reason for this is that I need to be able to propagate fake tensors through the graphs that are produced by Dynamo, and if I am doing the old custom wrap_symfloat logic, there's no way I can do this, because ordinarily an item() call will cause an unbacked SymInt when I reallocate. The other obvious way to solve the problem above is to make a HOP alternative to item() that "bakes in" the backed SymInt it's supposed to return. But this strategy seems more parsimonious, and it does have the marginal benefit I mentioned above. The main downside is that what I have to do next is make it so that when I run tensor computation, I also apply the equivalent operations to the SymInt/SymFloat as well. That's the next PR. 
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126245 Approved by: https://github.com/eellison ghstack dependencies: #126637	2024-05-22 15:25:38 +00:00
Matthew Hoffman	86ad101370	Enable pickling `torch._C.Generator` (#126271 ) Fixes #71398 Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`. `__reduce__` returns a tuple of 3 values: 1. `torch.Generator` itself. 2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created. 3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor. `__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state. Added test demonstrating successful reserialization with cpu and cuda `Generator`s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271 Approved by: https://github.com/ezyang	2024-05-22 14:38:47 +00:00
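The `__reduce__`/`__setstate__` contract described in the commit above can be sketched in plain Python. `ToyGenerator` is a hypothetical stand-in, not `torch.Generator`: it only mirrors the three-part reduce tuple from the message (constructor, one-element device tuple, state), and it stores a `bytes` value where the real state holds an RNG state tensor. `copy.copy` is used here because it drives the same reduce protocol that `pickle` does.

```python
import copy


class ToyGenerator:
    """Hypothetical stand-in for a generator whose device is fixed at construction."""

    def __init__(self, device="cpu"):
        self.device = device
        self.seed = 0
        self.offset = None  # None models a CPU generator, which has no offset
        self.rng_state = b""

    def manual_seed(self, seed):
        self.seed = seed
        return self

    def set_offset(self, offset):
        self.offset = offset

    def set_state(self, rng_state):
        self.rng_state = rng_state

    def __reduce__(self):
        # (callable, constructor args, state): the device travels in the args
        # because it cannot be changed once the object exists.
        state = (self.seed, self.offset, self.rng_state)
        return (type(self), (self.device,), state)

    def __setstate__(self, state):
        seed, offset, rng_state = state
        self.manual_seed(seed)
        if offset is not None:
            self.set_offset(offset)
        self.set_state(rng_state)


original = ToyGenerator("cuda")
original.manual_seed(1234)
original.set_offset(8)
original.set_state(b"\x01\x02")

# copy.copy calls __reduce_ex__, which defers to the __reduce__ above, then
# rebuilds the object from the constructor args and applies __setstate__.
clone = copy.copy(original)
```

Keeping the device in the constructor arguments rather than in the state is what lets the rebuilt object be created on the right device before any state is applied.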
rzou	ed734178ab	Refresh OpOverloadPacket if a new OpOverload gets added (#126863 ) If a user accesses an OpOverloadPacket, then creates a new OpOverload, then uses the OpOverloadPacket, the new OpOverload never gets hit. This is because OpOverloadPacket caches OpOverloads when it is constructed. This PR fixes the problem by "refreshing" the OpOverloadPacket if a new OpOverload gets constructed and the OpOverloadPacket exists. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/126863 Approved by: https://github.com/albanD	2024-05-22 14:13:27 +00:00
Tuan Trieu	082251e76b	fix invalid call to aoti_torch_tensor_copy_ (#126668 ) Fixes #123039 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126668 Approved by: https://github.com/desertfire	2024-05-22 13:02:02 +00:00
Yifu Wang	2dd2699860	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 <img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1"> Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn <img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2"> ## This PR `ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it. Usage: ``` # Using ProcessGroupCudaP2P dist.init_process_group(backend="cuda_p2p", ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options pg_options = ProcessGroupCudaP2P.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options pg_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) 
# Using ProcessGroupCudaP2P while specifying both # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options pg_options = ProcessGroupCudaP2P.Options() pg_options.nccl_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Down-casting the backend to access p2p buffers for cuda_p2p specific # optimizations if is_cuda_p2p_group(group): backend = get_cuda_p2p_backend(group) if required_p2p_buffer_size > backend.get_buffer_size(): # fallback p2p_buffer = backend.get_p2p_buffer(...) else: # fallback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-22 09:33:05 +00:00
PyTorch MergeBot	8a4597980c	Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615 )" This reverts commit 831efeeadf5fa8d9e7f973057e634a57e3bcf04b. Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))	2024-05-22 08:23:40 +00:00
PyTorch MergeBot	0f37fd06d9	Revert "Prevent partitioner from ever saving views (#126446 )" This reverts commit da2292ce6b37028746bf5beeae04442eef1e803d. Reverted https://github.com/pytorch/pytorch/pull/126446 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))	2024-05-22 08:23:40 +00:00
PyTorch MergeBot	d2cbbdee31	Revert "Fix silu test for flexattention (#126641 )" This reverts commit cd3a71f754a2248bcfe500de7c9860bd7d2002bf. Reverted https://github.com/pytorch/pytorch/pull/126641 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))	2024-05-22 08:23:40 +00:00
Xia, Weiwen	4575d3be83	[Quant][onednn] fix performance regression of depth-wise qconv (#126761 ) Fixes #125663 It did not handle groups correctly in the original implementation. Test plan: Functionality is covered by UT. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126761 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5	2024-05-22 07:53:11 +00:00
Jez Ng	aede940975	[inductor] Fix cuda compilation under fbcode remote execution (#126408 ) Differential Revision: D57390072 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126408 Approved by: https://github.com/desertfire	2024-05-22 07:51:35 +00:00
zjgarvey	edea2b81b5	[ONNX] Adds Support for Some Bitwise Ops in Onnx Exporter (#126229 ) Addresses #126194 Adds support for - "aten::bitwise_right_shift" - "aten::bitwise_left_shift" - "aten::bitwise_and" Pull Request resolved: https://github.com/pytorch/pytorch/pull/126229 Approved by: https://github.com/justinchuby	2024-05-22 07:47:43 +00:00
Jason Ansel	b516de8cac	[halide-backend] Add HalideCodeCache (#126416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126416 Approved by: https://github.com/shunting314 ghstack dependencies: #126631, #126655	2024-05-22 06:52:50 +00:00
Wanchao Liang	d937d0db0f	[SAC] fix ignored ops in eager mode to recompute (#126751 ) as titled. I found that there're some issues in the eager mode SAC where sometimes we would have recompute pop from storage of ops that are missing, these ops are detach ops. So this PR refactors the two modes, so that they would always recompute ignored ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/126751 Approved by: https://github.com/yf225	2024-05-22 06:47:22 +00:00
Xuehai Pan	3b0f6cce5c	[pytree] freeze attributes of `TreeSpec` (#124011 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124011 Approved by: https://github.com/zou3519	2024-05-22 05:57:00 +00:00
Banit Agrawal	6edf989e2f	[CUDA Caching Allocator] Round to nearest 512 bytes boundary if number of divisions=1 (#126830 ) Summary: This diff fixes an issue when the number of divisions=1, resulting in unaligned memory accesses. Reviewed By: 842974287 Differential Revision: D57648763 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126830 Approved by: https://github.com/842974287	2024-05-22 04:57:24 +00:00
Chirag Pandya	ae66c94eaa	Capture dtype in Flight Recorder (#126581 ) Summary: Capture dtype in flight recorder. Mismatched dtypes can lead to hangs. Newly added logs to job show mismatching DTYPE of op, which affects data size. Even though the sizes match and we don't see the dtype on the FR log. We end up capturing the type as follows: ``` {'entries': [{'record_id': 0, 'pg_id': 0, 'process_group': ('0', 'default_pg'), 'collective_seq_id': 1, 'p2p_seq_id': 0, 'op_id': 1, 'profiling_name': 'nccl:all_reduce', 'time_created_ns': 1715989097552775261, 'duration_ms': 6.697696208953857, 'input_sizes': [[3, 4]], 'input_dtypes': [6], 'output_sizes': [[3, 4]], 'output_dtypes': [6], 'state': 'completed', 'time_discovered_started_ns': 1715989097593778240, 'time_discovered_completed_ns': 1715989097593778461, 'retired': True, ``` Notice the new fields: input_dtypes: [6] output_dtypes: [6] Test Plan: unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/issues/126554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126581 Approved by: https://github.com/wconstab	2024-05-22 03:38:09 +00:00
Simon Fan	7530cfe7e4	[dynamo][flaky tests] test_conv_empty_input_* (#126790 ) Run CI, maybe fixes https://github.com/pytorch/pytorch/issues/126178 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126790 Approved by: https://github.com/mikaylagawarecki	2024-05-22 03:14:21 +00:00
Jiashen Cao	ac1f0befcf	Remove redundant serialization code (#126803 ) After https://github.com/pytorch/pytorch/pull/123308, we no longer need separate serialization path to handle different types that exist in the nn_module metadata. This PR cleans up the redundant code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126803 Approved by: https://github.com/angelayi	2024-05-22 03:14:17 +00:00
Ke Wen	608a11c496	[pipelining] Retire PIPPY_VERBOSITY in favor of TORCH_LOGS=pp (#126828 ) https://github.com/pytorch/pytorch/pull/126499/ established: `TORCH_LOGS=pp` --> info `TORCH_LOGS=-pp` --> warn `TORCH_LOGS=+pp` --> debug Pull Request resolved: https://github.com/pytorch/pytorch/pull/126828 Approved by: https://github.com/wconstab	2024-05-22 02:52:58 +00:00
Isuru Fernando	e3c96935c2	Support CUDA_INC_PATH env variable when compiling extensions (#126808 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126808 Approved by: https://github.com/amjames, https://github.com/ezyang	2024-05-22 02:44:32 +00:00
Ke Wen	5fa7aefb49	[pipelining] Do not print loss (#126829 ) `loss` is a tensor, printing it would induce a GPU-CPU sync, which would slow down the program more than regular debug overhead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126829 Approved by: https://github.com/wconstab	2024-05-22 02:32:04 +00:00
Yueming Hao	e6f655697b	[AOTI] Fix unsupported type of output=s1 (#126797 ) Fixes #123036 In unit test `DynamicShapesCudaWrapperCudaTests.test_scaled_dot_product_attention_cuda_dynamic_shapes_cuda_wrapper`, computed buffer buf3 is compiled to a fallback kernel `aoti_torch_cuda__scaled_dot_product_flash_attention`. It has 9 outputs whose types are `[MultiOutput, MultiOutput, None, None, s1, s1, MultiOutput, MultiOutput,MultiOutput]`. The type `s1` here is passed from [generate_output](`acfe237a71/torch/_inductor/ir.py (L5658)`). The type check for Symbol is missing for fallback kernel output generation. This PR fixes this issue by checking `output.is_Symbol`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126797 Approved by: https://github.com/desertfire	2024-05-22 02:15:43 +00:00
Nikita Shulga	a379ed6e98	Fix SobolEngine default dtype handling (#126781 ) - Change default dtype argument to `None` and fetch its value via `torch.get_default_dtype()` call if not defined - Fix bug in first draw handling logic, that would ignore dtype in favor of default one due to type promotion - Add regression tests Fixes https://github.com/pytorch/pytorch/issues/126478 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126781 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-05-22 01:55:48 +00:00
eellison	28f29e074b	Dont mutate tensor stride in place in cudnn conv (#126786 ) Fix for https://github.com/pytorch/pytorch/issues/126241. Within the cudnn convolution, we were in-place updating the strides of the tensor to disambiguate for size-1 dims and contiguous and channels last tensors. Instead of mutating the tensors stride, just use a temporary. Inside cudnn it is then copied: `d7ccb5b3c4/include/cudnn_frontend_Tensor.h (L201-L203)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126786 Approved by: https://github.com/ezyang, https://github.com/shunting314, https://github.com/eqy	2024-05-22 01:53:44 +00:00
Yanbo Liang	66c23cb021	Add micro-benchmark framework and multi_layer_norm as an example (#126754 ) ```micro_benchmark.py``` output csv example (all numbers are fake, just for demo) ``` name,metric,target,actual multi_layer_norm,inference_time(s),20,19.87 multi_layer_norm,memory_bandwidth(GB/s),108,108.04 llama2-int8, token_per_sec,155,156 llama2-int8,memory_bandwidth(GB/s),92,92.7 ``` Expected dashboard looks like: ``` \| name \| metric \| target \| actual \| change \| \|------------------\|------------------------\|--------\|--------\|--------\| \| multi_layer_norm \| inference_time(s) \| 20 \| 19.87 \| 99% \| \| \| memory_bandwidth(GB/s) \| 108 \| 108.04 \| 101% \| \| llama2-int8 \| token_per_sec \| 155 \| 156 \| 100% \| \| \| memory_bandwidth(GB/s) \| 92 \| 92.7 \| 101% \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126754 Approved by: https://github.com/Chillee	2024-05-22 01:27:37 +00:00
Andrew Gu	636e79991c	[FSDP2] Fixed 2D clip grad norm test (#126497 ) This fixes https://github.com/pytorch/pytorch/issues/126484. We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497 Approved by: https://github.com/weifengpy, https://github.com/wz337	2024-05-22 00:29:13 +00:00
Klein Shen	25ea32567e	[caffe2][1/n] migrate global Static Initializer (#126688 ) Summary: The Caffe2 lib has 200+ usages of global static initializers, which are paper cuts for startup perf. Detail in this post https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154. Kick off a stack to migrate all usages of global static initializers in caffe2. Test Plan: TODO: Please advise how I can test this change. Differential Revision: D57531083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126688 Approved by: https://github.com/ezyang	2024-05-22 00:16:06 +00:00
Masahiro Hiramori	10a5c1b26c	[Dynamo][TVM] Fix tvm backend interface (#126529 ) Fixes #126528 The repro in the above issue works fine with this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126529 Approved by: https://github.com/xmfan	2024-05-21 23:31:15 +00:00
Xu Zhao	1e818db547	[torchbench] Fix torchao benchmarking script (#126736 ) As the title says. Test Plan: ``` python benchmarks/dynamo/torchbench.py --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory cuda eval BERT_pytorch [XZ Debug] Torch grad status: False memory: eager: 0.82 GB, dynamo: 0.92 GB, ratio: 0.89 running benchmark: 100% 1.001x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126736 Approved by: https://github.com/jerryzh168, https://github.com/huydhn	2024-05-21 23:15:12 +00:00
Jason Ansel	9dba1aca0e	[inductor] Relax type annotations for statically_known_* (#126655 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126655 Approved by: https://github.com/Skylion007, https://github.com/shunting314 ghstack dependencies: #126631	2024-05-21 23:12:42 +00:00
Jason Ansel	c08afbb3da	[inductor] Add kernel_code logging artifact (#126631 ) This is useful for some compile errors where we don't finish outputting the full graph. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126631 Approved by: https://github.com/shunting314	2024-05-21 23:12:42 +00:00
Shuqiang Zhang	4e921593a4	[c10d]skip nan tests for lower versions of CUDA (#126701 ) Summary: We found that the UNIT tests would hang only in one test, linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1, linux.g5.12xlarge.nvidia.gpu), in which DSA would still be raised, but somehow the process would cause errors like: P1369649418 Test Plan: Run CI tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/126701 Approved by: https://github.com/wconstab ghstack dependencies: #126409	2024-05-21 22:25:29 +00:00
Mu-Chu Lee	f6ffe32a9d	[AOTInductor] Automatic detection for buffer mutation and binary linking (#126706 ) Summary: Instead of an explicit config for users to determine buffer mutation, we automatically detect whether there's buffer mutation in the model and determine which section constants would be placed in. If constants are too large and don't fit within the section, we error out directly. Test Plan: Existing tests for buffer mutation and large weight linking Differential Revision: D57579800 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126706 Approved by: https://github.com/desertfire	2024-05-21 21:49:13 +00:00
wz337	fed536dbcf	[DTensor][Optim] Add support for fused_adam and fused_adamw when lr is a tensor (#126750 ) Fixes #126670 In this PR, we update the following: 1. lr is a kwarg. Add support to automatically turn on implicit replication for kwarg. We only did this for arg previously. 2. add associated tensor_lr ops in pointwises.py 3. add associated unit test in test_optimizers.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/126750 Approved by: https://github.com/wanchaol, https://github.com/msaroufim	2024-05-21 21:38:05 +00:00
hippocookie	7ee74d986a	Enable UFMT format on test/typing files (#126038 ) Fixes some files in #123062 Run lintrunner on files: test/typing/*/ ``` $ lintrunner -a --take UFMT --all-files ok No lint issues. Successfully applied all patches. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126038 Approved by: https://github.com/shink, https://github.com/ezyang	2024-05-21 21:37:07 +00:00
leslie-fang-intel	1cc9354cb0	Unify the dtype to VecMask<float, N> in ops.masked (#126662 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/126449. For `ops.masked` in CPP backend, when input dtype is `bool`, we actually load it as `VecMask<float, N>`. So, we should unify the type of `other` and `mask` to the same as `VecMask<float, N>` to invoke `blendv` method. Test Plan ``` clear && python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_ops_masked_with_bool_input clear && PYTORCH_ALL_SAMPLES=1 python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive__chunk_cat_cpu_bool ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126662 Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10	2024-05-21 20:52:25 +00:00
dependabot[bot]	fd7293db71	Bump rexml from 3.2.5 to 3.2.8 in /ios/TestApp (#126455 ) Bumps [rexml](https://github.com/ruby/rexml) from 3.2.5 to 3.2.8. - [Release notes](https://github.com/ruby/rexml/releases) - [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md) - [Commits](https://github.com/ruby/rexml/compare/v3.2.5...v3.2.8) --- updated-dependencies: - dependency-name: rexml dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-05-21 13:47:12 -07:00
Sahdev Zala	fe0a36fd7c	Fix a link in the compiler backend doc (#126079 ) The core aten is the core subset of aten and seems to be the correct link to replace the broken link. Fixes #125961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126079 Approved by: https://github.com/svekars	2024-05-21 20:16:04 +00:00
Tianyu Liu	5325a6de64	[dtensor] remove `output_` prefix from OpStrategy properties (#126359 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126359 Approved by: https://github.com/wanchaol, https://github.com/XilunWu	2024-05-21 19:54:29 +00:00
Edward Z. Yang	c73c9457aa	Add guard_size_oblivious to vector_norm (#126772 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126772 Approved by: https://github.com/lezcano, https://github.com/Skylion007 ghstack dependencies: #126771	2024-05-21 19:53:21 +00:00
Edward Z. Yang	97eef61474	Don't assume compare_arg is fx.Node (#126771 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126771 Approved by: https://github.com/Skylion007	2024-05-21 19:53:21 +00:00
Sergii Dymchenko	fc594ed219	Remove lint from retryable_workflows (#126806 ) Related to https://github.com/pytorch/test-infra/pull/4934 Lint workflow now uses Docker, so there should not be network-related errors for pip installing stuff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126806 Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/huydhn	2024-05-21 19:47:23 +00:00
Shivam Raikundalia	4e6673e244	Remove MAX_STACK_ENTRY from _build_table (#126583 ) Summary: As reported by this issue: https://github.com/pytorch/pytorch/issues/83584 We already store the entries in evt.stack so there is no need to cap the limit when we output the table to 5 Test Plan: Regression testing should cover this. We have unit tests to check the stack already. Differential Revision: D57513565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126583 Approved by: https://github.com/nmacchioni	2024-05-21 18:52:04 +00:00
Andrew M. James	0c76018714	[inductor] Don't inherit `__future__` flags from the calling scope when `compile` -ing generated modules (#126454 ) This file includes `from __future__ import annotations` which interacts with `compile` by causing type annotations to be populated as strings. Triton does not parse the string annotation correctly. Avoid this behavior by passing `dont_inherit=True` to `compile`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454 Approved by: https://github.com/peterbell10	2024-05-21 18:51:13 +00:00
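The interaction between `__future__` flags and the built-in `compile` in the commit above can be reproduced with the standard library alone. This is a minimal sketch, not the Inductor code: the `SOURCE` string and the helper name are made up for illustration, and the "inherited" case is simulated by passing the `annotations` compiler flag explicitly so that the snippet does not depend on the module it is pasted into.

```python
import __future__

# Hypothetical generated module source with annotated parameters.
SOURCE = "def kernel(x: int) -> int:\n    return x\n"


def compile_and_get_annotations(flags):
    # dont_inherit=True tells compile() to ignore any __future__ flags active
    # in the calling code, so only the flags passed here take effect.
    code = compile(SOURCE, "<generated>", "exec", flags=flags, dont_inherit=True)
    namespace = {}
    exec(code, namespace)
    return namespace["kernel"].__annotations__


# What the generated code sees if the annotations flag leaks in from the caller.
stringified = compile_and_get_annotations(__future__.annotations.compiler_flag)

# What it sees when nothing is inherited.
evaluated = compile_and_get_annotations(0)

print(stringified)
print(evaluated)
```

With the flag applied, the annotations come back as the strings `'int'`; without it, they are the `int` type itself, which is the form a consumer that inspects annotations directly would expect.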
cyy	7428fd19fe	Remove outdated options from setup.py (#125988 ) Since the recent removal of Caffe2 files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125988 Approved by: https://github.com/ezyang	2024-05-21 18:48:23 +00:00
Bin Bao	b40fb2de59	[AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575 ) Summary: fixes https://github.com/pytorch/pytorch/issues/126574 . Pass kernel argument type information into generate_args_decl, so it can generate the argument declaration instead of relying on string matching. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575 Approved by: https://github.com/chenyang78 ghstack dependencies: #126369	2024-05-21 18:20:20 +00:00
Bin Bao	5e2de16a6f	[AOTI] Codegen None as empty tensor (#126369 ) Summary: When None denotes an optional tensor, we codegen NULL to represent it; but when None is for actual tensor type, we need to codegen an empty tensor for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126369 Approved by: https://github.com/chenyang78	2024-05-21 18:20:20 +00:00
Tristan Rice	ac51920656	Reapply "c10d: add Collectives abstraction (#125978 )" (#126695 ) This reverts commit d9c3485146913324ab4b3e211d2a4517e138f4af. Reapplies #125978. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126695 Approved by: https://github.com/c-p-i-o	2024-05-21 18:00:09 +00:00
eellison	d8f5627a88	prune back configs (#126570 ) We had a previous PR that added configs for an internal model. Running the below script on output from autotuning, we can prune back the added configs with negligible perf loss: P1365917790. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126570 Approved by: https://github.com/nmacchioni	2024-05-21 17:44:32 +00:00
Scott Wolchok	85fd76f76d	Add test coverage for fp16 matrix-vector specialized kernel (#126700 ) Summary: This kernel is special-cased on ARM because it's important for LLMs, so let's have test coverage. Test Plan: Ran locally and it passes. Intentionally broke fp16_gemv_trans and saw it fail, confirming it provides coverage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126700 Approved by: https://github.com/malfet	2024-05-21 17:23:16 +00:00
Tom Ritchford	bae3b17fd9	Tweak a comment and fix spelling (#126681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126681 Approved by: https://github.com/Skylion007	2024-05-21 17:19:06 +00:00
Yanbo Liang	0756f9f5fd	Remove debug breakpoint (#126756 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/126756 Approved by: https://github.com/BowenBao, https://github.com/Skylion007	2024-05-21 17:04:50 +00:00
Aaron Orenstein	140ab89c02	typing scheduler.py [1/2]: Bug fix (#126610 ) Found while getting scheduler.py to typecheck - split off to make reviewing easier. 1. is_template: I'm pretty sure this is a bug. Based on the definition of `is_template` I'm pretty sure we want to return the node's `get_template_node()`, not the node itself. 2. can_free: It seems that this was intended to be a raise, not a return. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126610 Approved by: https://github.com/eellison	2024-05-21 16:59:37 +00:00
Catherine Lee	ac2c547838	[TD] Upload names of failures to s3 for pytest cache (#126315 ) Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both cases, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205). Instead, manually upload/download an extra file that lists the failing test files. Technically this would be more general than the pytest cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315 Approved by: https://github.com/ZainRizvi	2024-05-21 16:29:31 +00:00
eellison	4a7b46be3d	small changes to padding (#126716 ) Add cost of writing padding 0s to benchmark, skip dimension that can be squeezed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126716 Approved by: https://github.com/shunting314	2024-05-21 16:09:32 +00:00
PyTorch MergeBot	980f5ac049	Revert "[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667 )" This reverts commit 3642e51ea527e23ded10afc266f298b0cb5350c8. Reverted https://github.com/pytorch/pytorch/pull/122667 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122667#issuecomment-2122642317))	2024-05-21 13:45:07 +00:00
William Wen	b36e01801b	[3.12, inductor] re-enable AsyncCompile.warm_pool for 3.12 (#126724 ) Somehow working now? Fixes https://github.com/pytorch/pytorch/issues/124192 and https://github.com/pytorch/pytorch/issues/125979. Still getting the warning ``` /home/williamwen/local/installs/python3.12/debug/install/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2360707) is multi-threaded, use of fork() may lead to deadlocks in the child. self.pid = os.fork() ``` though Pull Request resolved: https://github.com/pytorch/pytorch/pull/126724 Approved by: https://github.com/masnesral, https://github.com/jansel	2024-05-21 08:50:13 +00:00
cyy	faa72dca41	Remove QNNPACK submodule (#126657 ) QNNPACK has been integrated into ATen for a long time, and removing it from third party causes no build issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126657 Approved by: https://github.com/ezyang	2024-05-21 07:25:24 +00:00
Feng Yuan	7d34cfd28a	Update torch-xpu-ops pin (ATen XPU implementation) (#126744 ) Regular bi-weekly pin update. 85 new ATen operators are implemented in the XPU backend. https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml Pull Request resolved: https://github.com/pytorch/pytorch/pull/126744 Approved by: https://github.com/EikanWang	2024-05-21 07:21:52 +00:00
Will Constable	4b23c4fc5d	[Pipelining] Clean up function names in 1f1b schedule (#126582 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126582 Approved by: https://github.com/kwen2501 ghstack dependencies: #126539	2024-05-21 06:50:02 +00:00
Will Constable	8c9d332953	[c10d] fix excepthook crash on exc after destroy_process_group (#126739 ) fixes #126379 This is the easy fix. An additional fix that I did not do is to deregister the excepthook (or rather, restore the original one) when calling dist.destroy_process_group. This might be a bit complicated in practice, so landing as is for now. Also, couldn't figure out a clean way to test this. assertRaisesRegex wasn't getting a string value, probably because of the stderr redirection done via the excepthook in the first place. Output from the original repro is cleaned up with the fix: ``` [rank0]: Traceback (most recent call last): [rank0]: File "/data/users/whc/pytorch/except.py", line 6, in <module> [rank0]: raise ZeroDivisionError [rank0]: ZeroDivisionError ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126739 Approved by: https://github.com/yf225	2024-05-21 06:39:18 +00:00
PyTorch MergeBot	e363a8a222	Revert "[pipelining] Add pipeline stage test (#126721 )" This reverts commit b948b1ad7a9cf61c9692506c60c295fd40e00f43. Reverted https://github.com/pytorch/pytorch/pull/126721 on behalf of https://github.com/clee2000 due to The test_public_bindings failure is real, you just got unlucky since it was also broken on trunk for a different reason ([comment](https://github.com/pytorch/pytorch/pull/126721#issuecomment-2121725408))	2024-05-21 04:40:05 +00:00
Will Constable	dc2560f073	[Pipelining] Add debug logs for batch p2p ops (#126539 ) logs from torchtitan: <img width="2878" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/4039c85f-0bf1-4924-92fa-2c55e8e4da2a"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126539 Approved by: https://github.com/kwen2501, https://github.com/H-Huang	2024-05-21 03:54:46 +00:00
Will Constable	b96d9090d2	[C10D] make get_node_local_rank() accept fallback_rank (#126737 ) Addresses follow up comments on #123992 and allows the use case of writing code that checks `get_node_local_rank(fallback_rank=0)` and runs correctly whether in the presence of a launcher (e.g. torchrun), or run locally on a single device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126737 Approved by: https://github.com/shuqiangzhang	2024-05-21 03:38:02 +00:00
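The fallback pattern in the commit above can be illustrated without a distributed setup. The helper below is a hypothetical stand-in, not the `torch.distributed` implementation: it assumes the launcher exposes the node-local rank through the `LOCAL_RANK` environment variable, as torchrun does, and falls back to the caller's value when that variable is absent.

```python
import os


def node_local_rank(fallback_rank=None):
    """Hypothetical stand-in: read LOCAL_RANK, else use the fallback, else fail."""
    if "LOCAL_RANK" in os.environ:
        return int(os.environ["LOCAL_RANK"])
    if fallback_rank is not None:
        return int(fallback_rank)
    raise RuntimeError("LOCAL_RANK is not set and no fallback_rank was given")


# Simulate a plain single-process run with no launcher.
saved = os.environ.pop("LOCAL_RANK", None)
rank_without_launcher = node_local_rank(fallback_rank=0)

# Simulate the environment a launcher such as torchrun would provide.
os.environ["LOCAL_RANK"] = "3"
rank_with_launcher = node_local_rank(fallback_rank=0)

# Restore the environment to what it was before the demonstration.
if saved is None:
    del os.environ["LOCAL_RANK"]
else:
    os.environ["LOCAL_RANK"] = saved
```

The same call site works in both situations, which is the use case the commit message describes: code written once that runs under a launcher or locally on a single device.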
Yanbo Liang	c1b90a4e8a	[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466 ) Fixes #115711 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126466 Approved by: https://github.com/jansel	2024-05-21 03:31:20 +00:00
Chirag Pandya	a83e745356	[BE] split seq_id to collective_seq_id and p2p_seq_id (#125727 ) Summary: Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id`, which makes it easier to spot if one of the machines isn't handling a collective operation. Next, we can attempt to match up p2p operations to ensure that the sender(s)/receiver(s) are in sync. Resolves issue: https://github.com/pytorch/pytorch/issues/125173 Test Plan: Unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727 Approved by: https://github.com/zdevito	2024-05-21 03:26:49 +00:00
eqy	5f64086d08	[NT][SDPA] Bump tolerances for `test_sdpa_with_packed_in_proj_cuda_bfloat16` (#126356 ) Current tolerances fail on RTX 6000 (Ada) with `Mismatched elements: 2 / 144 (1.4%)` ``` AssertionError: Tensor-likes are not close! Mismatched elements: 2 / 144 (1.4%) Greatest absolute difference: 0.002197265625 at index (5, 0, 0) (up to 0.001 allowed) Greatest relative difference: 0.08203125 at index (3, 0, 0) (up to 0.016 allowed) To execute this test, run the following from the base repo dir: python test/test_nestedtensor.py -k test_sdpa_with_packed_in_proj_cuda_bfloat16 This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 ---------------------------------------------------------------------- ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126356 Approved by: https://github.com/drisspg	2024-05-21 03:25:30 +00:00
cdzhan	40cc616909	Fix caching allocator of out-of-tree device is destructed before the … (#126677 ) …destruction of tensors cached by autocast ## Root Cause For out-of-tree device extension it is loaded after torch (different .so), so the global variable `cached_casts` may be constructed before caching allocator and then destructed in reversed order when exit. ## Fix Lazily initialize `cached_casts` to correct the order. ## How to Reproduce && Test Modify the testcase `TestAutocastGPU.test_cast_cache_is_global` in test/test_autocast.py to run on your out-of-tree device. You will see following failure in the end of test. ```bash ---------------------------------------------------------------------- Ran 1 test in 4.812s OK free: 0x30080ff44000400 terminate called after throwing an instance of 'c10::Error' what(): invalid device pointer: 0x30080ff44000400 Exception raised from free at /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/framework/core/caching_allocator.cpp:1609 (most recent call first): frame #0: <unknown function> + 0x118fe1 (0x7ffaef4d3fe1 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #1: <unknown function> + 0x11b1c4 (0x7ffaef4d61c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #2: <unknown function> + 0x117677 (0x7ffaef4d2677 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #3: <unknown function> + 0x11a2bf (0x7ffaef4d52bf in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #4: <unknown function> + 0x11a186 (0x7ffaef4d5186 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #5: <unknown function> + 0x119fde (0x7ffaef4d4fde in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #6: <unknown function> + 0x119d2e (0x7ffaef4d4d2e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #7: <unknown function> + 0x119be0 (0x7ffaef4d4be0 in 
/projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #8: <unknown function> + 0x119977 (0x7ffaef4d4977 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #9: <unknown function> + 0x119313 (0x7ffaef4d4313 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #10: <unknown function> + 0x118b4c (0x7ffaef4d3b4c in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #11: c10::Error::Error(c10::SourceLocation, std::string) + 0x34 (0x7ffaef4d27c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #12: c10::detail::torchCheckFail(char const, char const, unsigned int, std::string const&) + 0x7f (0x7ffaef4d04ed in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #13: torch_mlu::MLUCachingAllocator::Native::NativeCachingAllocator::free(void) + 0xe6 (0x7ff9a8eeb112 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so) frame #14: torch_mlu::MLUCachingAllocator::Native::local_raw_delete(void) + 0x3b (0x7ff9a8ed9480 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so) frame #15: std::unique_ptr<void, void ()(void)>::~unique_ptr() + 0x50 (0x7ffb0a5ea322 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #16: <unknown function> + 0x1269890 (0x7ffb0a5e4890 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #17: <unknown function> + 0x1269928 (0x7ffb0a5e4928 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #18: <unknown function> + 0x127572c (0x7ffb0a5f072c in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #19: <unknown function> + 0x1275758 (0x7ffb0a5f0758 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so) frame #20: <unknown function> + 0xb9bc7 (0x7ffaef474bc7 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame 
#21: <unknown function> + 0xb97bc (0x7ffaef4747bc in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #22: <unknown function> + 0xdbc50 (0x7ffaef496c50 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #23: c10::TensorImpl::~TensorImpl() + 0x82 (0x7ffaef49157e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #24: c10::TensorImpl::~TensorImpl() + 0x1c (0x7ffaef4915aa in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so) frame #25: <unknown function> + 0x2f596d9 (0x7ffaf24fc6d9 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #26: <unknown function> + 0x2f589c2 (0x7ffaf24fb9c2 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #27: <unknown function> + 0x2f57b92 (0x7ffaf24fab92 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #28: <unknown function> + 0x2f5c228 (0x7ffaf24ff228 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #29: <unknown function> + 0x30f3f70 (0x7ffaf2696f70 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #30: <unknown function> + 0x30f3f90 (0x7ffaf2696f90 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #31: <unknown function> + 0x30f5004 (0x7ffaf2698004 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #32: <unknown function> + 0x30f5024 (0x7ffaf2698024 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #33: <unknown function> + 0x31207f0 (0x7ffaf26c37f0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #34: <unknown function> + 0x3120814 (0x7ffaf26c3814 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #35: <unknown function> + 0x30f51e8 (0x7ffaf26981e8 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #36: <unknown function> + 
0x30f5148 (0x7ffaf2698148 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #37: <unknown function> + 0x316ecea (0x7ffaf2711cea in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so) frame #38: <unknown function> + 0x468a7 (0x7ffb0c9ed8a7 in /lib/x86_64-linux-gnu/libc.so.6) frame #39: on_exit + 0 (0x7ffb0c9eda60 in /lib/x86_64-linux-gnu/libc.so.6) <omitting python frames> frame #47: __libc_start_main + 0xf3 (0x7ffb0c9cb083 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126677 Approved by: https://github.com/ezyang	2024-05-21 03:20:17 +00:00
Peter Bell	51c07f9f69	[dynamo] Allow asserts to fail (#126661 ) Currently if an assertion is statically known to be false, dynamo converts it to `_assert_async` which inductor currently ignores. Instead this graph breaks to raise the original assertion. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126661 Approved by: https://github.com/ezyang	2024-05-21 02:42:13 +00:00
eellison	d777685ef9	Script for choosing template configurations (#126560 ) This adds logging that will mark any invocation of a matmul for a particular input shape, and record every template config's performance on it. Then, we can parse that into a script which will minimize the total mm execution time given N allowed templates. And in future, other experiments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126560 Approved by: https://github.com/nmacchioni, https://github.com/jansel	2024-05-21 02:28:39 +00:00
Jack Taylor	d30cdc4321	[ROCm] amdsmi library integration (#119182 ) Adds monitoring support for ROCm using amdsmi in place of pynvml. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182 Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell	2024-05-21 01:59:26 +00:00
Ke Wen	b948b1ad7a	[pipelining] Add pipeline stage test (#126721 ) Test tracer's and manual's stage creation by using a basic schedule (GPipe). (Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py) Test command: ``` $ python test_stage.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-05-21 01:22:10 +00:00
Joel Schlosser	31ba6ee49b	Traceable wrapper subclass support for deferred runtime asserts (#126198 ) The padded dense -> jagged conversion op has the signature: ``` _fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor ``` when `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but traceable wrapper subclass support is missing in later code to handle deferred runtime asserts. This PR fixes this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198 Approved by: https://github.com/ezyang	2024-05-21 01:21:46 +00:00
youkaichao	82b4528788	[cudagraph] fix verbose graph logging (#126694 ) According to the [doc](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g0907ca7a1e7d0211b71ee49c5403072b): > enum cudaGraphDebugDotFlags > CUDA Graph debug write options > > Values > cudaGraphDebugDotFlagsVerbose = 1<<0 > Output all debug data as if every debug flag is enabled > cudaGraphDebugDotFlagsKernelNodeParams = 1<<2 > Adds cudaKernelNodeParams to output > cudaGraphDebugDotFlagsMemcpyNodeParams = 1<<3 > Adds cudaMemcpy3DParms to output > cudaGraphDebugDotFlagsMemsetNodeParams = 1<<4 > Adds cudaMemsetParams to output > cudaGraphDebugDotFlagsHostNodeParams = 1<<5 > Adds cudaHostNodeParams to output > cudaGraphDebugDotFlagsEventNodeParams = 1<<6 > Adds cudaEvent_t handle from record and wait nodes to output > cudaGraphDebugDotFlagsExtSemasSignalNodeParams = 1<<7 > Adds cudaExternalSemaphoreSignalNodeParams values to output > cudaGraphDebugDotFlagsExtSemasWaitNodeParams = 1<<8 > Adds cudaExternalSemaphoreWaitNodeParams to output > cudaGraphDebugDotFlagsKernelNodeAttributes = 1<<9 > Adds cudaKernelNodeAttrID values to output > cudaGraphDebugDotFlagsHandles = 1<<10 > Adds node handles and every kernel function handle to output > cudaGraphDebugDotFlagsConditionalNodeParams = 1<<15 > Adds cudaConditionalNodeParams to output > `1 << 10` is not the most verbose flag. it is just one flag to add node handles and every kernel function handle to output. `1 << 0` is the most verbose flag, under the name `cudaGraphDebugDotFlagsVerbose`. 
Here is an example of graph, dumped with `1 << 10`: ```dot digraph dot { subgraph cluster_1 { label="graph_1" graph[style="dashed"]; "graph_1_node_0"[style="solid" shape="rectangle" label="0 MEM_ALLOC node handle: 0x000055D2889750F0 "]; "graph_1_node_1"[style="bold" shape="octagon" label="1 _Z3addPhS_S_m node handle: 0x000055D288979A20 func handle: 0x000055D288978D40 "]; "graph_1_node_2"[style="solid" shape="trapezium"label="2 MEMCPY node handle: 0x000055D28897A130 (DtoH,1024) "]; "graph_1_node_3"[style="solid" shape="rectangle" label="3 MEM_FREE node handle: 0x000055D2889890C0 "]; "graph_1_node_0" -> "graph_1_node_1"; "graph_1_node_1" -> "graph_1_node_2"; "graph_1_node_2" -> "graph_1_node_3"; } } ``` The same graph dumped with `1 << 0`: ```dot digraph dot { subgraph cluster_1 { label="graph_1" graph[style="dashed"]; "graph_1_node_0"[style="solid" shape="record" label="{ MEM_ALLOC \| {{ID \| node handle} \| {0 (topoId: 3) \| 0x000055D2889750F0}} \| {{{poolProps \| {allocType \| handleTypes \| {location \| {type \| id}}} \| {PINNED \| NONE \| DEVICE \| 0}}}} \| {{bytesize \| dptr} \| {1024 \| 0x0000000A02000000}} }"]; "graph_1_node_1"[style="bold" shape="record" label="{KERNEL \| {ID \| 1 (topoId: 2) \| _Z3addPhS_S_m\<\<\<4,256,0\>\>\>} \| {{node handle \| func handle} \| {0x000055D288979A20 \| 0x000055D288978D40}} \| {accessPolicyWindow \| {base_ptr \| num_bytes \| hitRatio \| hitProp \| missProp} \| {0x0000000000000000 \| 0 \| 0.000000 \| N \| N}} \| {cooperative \| 0} \| {priority \| 0} }"]; "graph_1_node_2"[style="solid" shape="record" label="{ MEMCPY \| {{ID \| node handle} \| {2 (topoId: 1) \| 0x000055D28897A130}} \| {kind \| DtoH (DEVICE to HOST PAGEABLE)} \| {{srcPtr \| dstPtr} \| {pitch \| ptr \| xsize \| ysize \| pitch \| ptr \| xsize \| ysize} \| {0 \| 0x0000000A02000000 \| 0 \| 0 \| 0 \| 0x000055D287CA6DB0 \| 0 \| 0}} \| {{srcPos \| {{x \| 0} \| {y \| 0} \| {z \| 0}}} \| {dstPos \| {{x \| 0} \| {y \| 0} \| {z \| 0}}} \| {Extent \| {{Width \| 1024} \| 
{Height \| 1} \| {Depth \| 1}}}} }"]; "graph_1_node_3"[style="solid" shape="record" label="{ MEM_FREE \| {{ID \| node handle} \| {3 (topoId: 0) \| 0x000055D2889890C0}} \| {{dptr} \| {0x0000000A02000000}} }"]; "graph_1_node_0" -> "graph_1_node_1" [headlabel=0]; "graph_1_node_1" -> "graph_1_node_2" [headlabel=0]; "graph_1_node_2" -> "graph_1_node_3" [headlabel=0]; } } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126694 Approved by: https://github.com/eqy, https://github.com/eellison	2024-05-21 00:55:15 +00:00
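To make the flag values quoted in the entry above concrete, here is a minimal sketch that mirrors them as a Python `IntFlag`. The class name `GraphDebugDotFlags`, the member names, and the `describe` helper are illustrative only and are not part of any CUDA or PyTorch API; the bit positions are copied from the quoted documentation.

```python
from enum import IntFlag


class GraphDebugDotFlags(IntFlag):
    """Illustrative mirror of the cudaGraphDebugDotFlags values quoted above."""
    VERBOSE = 1 << 0
    KERNEL_NODE_PARAMS = 1 << 2
    MEMCPY_NODE_PARAMS = 1 << 3
    MEMSET_NODE_PARAMS = 1 << 4
    HOST_NODE_PARAMS = 1 << 5
    EVENT_NODE_PARAMS = 1 << 6
    EXT_SEMAS_SIGNAL_NODE_PARAMS = 1 << 7
    EXT_SEMAS_WAIT_NODE_PARAMS = 1 << 8
    KERNEL_NODE_ATTRIBUTES = 1 << 9
    HANDLES = 1 << 10
    CONDITIONAL_NODE_PARAMS = 1 << 15


def describe(flags):
    """Return the names of the individual flags set in an integer bitmask."""
    return [f.name for f in GraphDebugDotFlags if f & flags]


old_value = describe(1 << 10)  # selects only the handles flag
new_value = describe(1 << 0)   # selects the verbose flag
```

Decoding the mask this way shows why `1 << 10` was not the most verbose setting: it enables a single option, whereas bit 0 is the one documented as enabling all debug output.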
dshi7	4644611b14	[cprofile] log manifold link instead of raw data to trace_structured (#126451 ) Internal D57459752 returns manifold URL and this PR adds to tlparse payload Pull Request resolved: https://github.com/pytorch/pytorch/pull/126451 Approved by: https://github.com/jamesjwu	2024-05-21 00:44:55 +00:00
Edward Z. Yang	b85f9d7fa2	Add symbolic_shape_specialization structured trace (#126450 ) This is typically the information you want when diagnosing why something overspecialized in dynamic shapes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126450 Approved by: https://github.com/albanD	2024-05-21 00:34:05 +00:00
chilli	cd3a71f754	Fix silu test for flexattention (#126641 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641 Approved by: https://github.com/ezyang, https://github.com/drisspg ghstack dependencies: #126615, #126446	2024-05-20 23:40:56 +00:00
chilli	da2292ce6b	Prevent partitioner from ever saving views (#126446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446 Approved by: https://github.com/anijain2305 ghstack dependencies: #126615	2024-05-20 23:40:56 +00:00
chilli	831efeeadf	Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615 Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan	2024-05-20 23:40:56 +00:00
Oguz Ulgen	14dc8d4f63	Protect codecache against cache failures (#126696 ) When there are Manifold, memcache, or filesystem-related issues or network outages, we should not completely fail to compile but instead fall back to a cold start. Differential Revision: [D57573835](https://our.internmc.facebook.com/intern/diff/D57573835/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126696 Approved by: https://github.com/aorenste	2024-05-20 22:22:41 +00:00
Alexander Kurakin	6f1935b0b5	doc: `torch.utils.data.Sampler`: `__len__` is optional (#125938 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/125938 Approved by: https://github.com/andrewkho, https://github.com/xmfan	2024-05-20 22:20:36 +00:00
Wei Han	74b053d7c4	Pass model path to observer (#126503 ) Summary: Passing model path to observer so that they can get additional info if needed. Test Plan: contbuild & OSS CI Differential Revision: D57475129 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126503 Approved by: https://github.com/kirklandsign	2024-05-20 22:17:56 +00:00
Yueming Hao	acfe237a71	Fix C++ compilation error for tensor array in abi_compatible mode (#126412 ) Fixes #122048 There is a compilation error https://github.com/pytorch/pytorch/issues/122048 when the element type in an array is tensor. It is because `val_to_arg_str` does not take the arg type as input, and always generates an int array. This PR changes the underlying `codegen_int_array_var` to `codegen_var_array` by adding type checks and the corresponding code generation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126412 Approved by: https://github.com/desertfire	2024-05-20 20:57:50 +00:00
angelayi	3d4f1c3083	[export] Make error name private (#126715 ) Fixes CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/126715 Approved by: https://github.com/clee2000	2024-05-20 20:50:11 +00:00
jhavukainen	d28868c7e8	Change skipIfs to xfails in test_mps.py for test_isin (#125412 ) Follow-up to #124896 to move the added test to use expectedFailure instead of skip. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412 Approved by: https://github.com/kulinseth	2024-05-20 20:23:53 +00:00
PyTorch MergeBot	8bca0847c2	Revert "[TD] Upload names of failures to s3 for pytest cache (#126315 )" This reverts commit 655038687afd19a4a4c9371b77ff046fd6c84be1. Reverted https://github.com/pytorch/pytorch/pull/126315 on behalf of https://github.com/clee2000 due to broke inductor ([comment](https://github.com/pytorch/pytorch/pull/126315#issuecomment-2121133045))	2024-05-20 20:15:08 +00:00
Yueming Hao	2813f0672a	fix huggingface models input issue in torchbench (#126579 ) Fixes https://github.com/pytorch/benchmark/issues/2263. According to https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L509, example_inputs are formatted as dictionaries for HuggingFace models. However, this forward_pass function passes all inputs to mod with *, which may only pass the input_ids key in HuggingFace model's example inputs. To reproduce, run the following command. ```bash python pytorch/benchmarks/dynamo/torchbench.py --performance --inference -dcuda --only=hf_Bert --output=torchbench_inference.csv ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126579 Approved by: https://github.com/xuzhao9	2024-05-20 19:10:46 +00:00
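The input issue in the entry above comes down to how Python unpacks a dictionary. The following is a minimal sketch, not the benchmark runner's code: `toy_model` is a hypothetical stand-in for a HuggingFace-style module, and this `forward_pass` only illustrates the dict-versus-sequence dispatch.

```python
def toy_model(input_ids, attention_mask=None):
    """Hypothetical model that simply echoes what it received."""
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def forward_pass(mod, inputs):
    """Call mod with dict inputs as keyword arguments, otherwise positionally."""
    if isinstance(inputs, dict):
        return mod(**inputs)
    return mod(*inputs)


example_inputs = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}

# Unpacking a dict with a single * iterates over its keys, so the model
# receives the key strings instead of the tensors stored under them.
wrong = toy_model(*example_inputs)

# Unpacking with ** passes each value under its own keyword.
right = forward_pass(toy_model, example_inputs)
```

Checking the input type before the call keeps tuple and list inputs working while avoiding the silent key-unpacking behaviour for dictionaries.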
Mu-Chu Lee	11c2d127ec	[AOTInductor] Add config to allow buffer mutation (#126584 ) Summary: Add an additional config to allow buffer mutation. For data that's greater than 2GB, we would need to set it as read-only, otherwise overflow would occur. This is a temporary solution since it won't handle cases that require mutable data greater than 2GB. Test Plan: Included in commit. Differential Revision: D57514729 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126584 Approved by: https://github.com/chenyang78	2024-05-20 18:16:00 +00:00
Xu Zhao	2068dadbe8	[torchbench] Add torchao to PT2 Benchmark Runner (#126469 ) Summary: X-link: https://github.com/pytorch/benchmark/pull/2268 Support torchao performance and accuracy tests in PT2 Benchmark Runner, using the inductor backend as the baseline. Test Plan: ``` $ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory loading model: 0it [00:50, ?it/s] cuda eval BERT_pytorch memory: eager: 0.75 GB, dynamo: 0.75 GB, ratio: 1.00 running benchmark: 100% 1.003x ``` Reviewed By: jerryzh168 Differential Revision: D57463273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126469 Approved by: https://github.com/huydhn	2024-05-20 17:53:44 +00:00
Edward Z. Yang	022adf8c5e	Fix bug for comptime.get_local for cells/closures (#126637 ) I wasn't paying enough attention and didn't notice that LOAD_DEREF is defined differently for InliningInstructionTranslator. Match it up with the code there. This also fixes comptime.print(), which was broken, because closing over an argument turned it into a cell rather than a regular local. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126637 Approved by: https://github.com/yanboliang	2024-05-20 17:51:28 +00:00
Jason Ansel	f9de510121	[dynamo] Graph break on set_num_threads (#126623 ) Fixes #125364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126623 Approved by: https://github.com/yanboliang	2024-05-20 17:44:32 +00:00
angelayi	89c1cfe144	[export] Allow modules to be created in the forward (#125725 ) Fixes the error in non-strict export when we're tracing a module that initializes another module in its forward function. This appears in [many huggingface models](https://github.com/search?q=repo%3Ahuggingface%2Ftransformers+CrossEntropyLoss%28%29&type=code&fbclid=IwAR285uKvSevJM6SDbXmb4-monj4iH7wf8opkvnec-li7sKpn4lUMjIvbGKc). It's probably not good practice to do this, but since it appears in so many places, and strict-export supports this, we will also support this. The approach we'll take for these cases is that we will inline the call to the module. Parameters and buffers initialized as constants (with `torch.tensor`) will be represented as constant tensors, and those initialized with tensor factory functions (`torch.ones`) will show up as an operator in the graph. The module stack for the ops in the inlined module will reflect the toplevel's module stack. One issue is that strict-export seems to segfault when there is an `nn.Parameter` call in the constructor (https://github.com/pytorch/pytorch/issues/126109). Non-strict export will succeed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125725 Approved by: https://github.com/ydwu4	2024-05-20 17:42:20 +00:00
Catherine Lee	655038687a	[TD] Upload names of failures to s3 for pytest cache (#126315 ) Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both cases, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205). Instead, manually upload/download an extra file that lists the failing test files. Technically this would be more general than the pytest cache. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315 Approved by: https://github.com/ZainRizvi	2024-05-20 17:36:30 +00:00
Colin Peppler	8c38d0cd64	[inductor] Fix edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer (#126622 ) # Context Here's a peripheral scenario causing the JIT-pass and AOT-pass to pick different fusions. ```py # JIT -- buf3 is a MultiTemplateBuffer V.graph.buffers = [buf0, buf1, buf2, buf3, buf4] ^ ^ # JIT pass calls finalize_multi_template_buffers() V.graph.buffers = [buf0, buf1, buf2, buf4, buf3] # AOT, note proximity_score(buf2, buf4) is "better" for fusion than JIT V.graph.buffers = [buf0, buf1, buf2, buf4, buf3] ^ ^ ``` It happens like this: * JIT starts with the original set of nodes using V.graph.buffers. * In JIT, finalize_multi_template_buffers() is called, which can change the order of the buffers. * This makes the order of buffers/scheduler nodes different. * Now, each node's min/max-order is different than before. * As a result, the proximity between two nodes is different. `ad67553c5c/torch/_inductor/scheduler.py (L2316-L2335)` # Error ``` $ TORCH_LOGS="+fusion" python test/inductor/test_max_autotune.py -k test_jit_fusion_matches_aot_fusion ====================================================================== FAIL: test_jit_fusion_matches_aot_fusion (__main__.TestMaxAutotune) ---------------------------------------------------------------------- Traceback (most recent call last): ... 
File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1718, in compile_to_fn code, linemap = self.codegen_with_cpp_wrapper() File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1618, in codegen_with_cpp_wrapper return self.codegen() File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1636, in codegen self.scheduler.codegen() File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper r = func(*args, **kwargs) File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2602, in codegen self.get_backend(device).codegen_node(node) # type: ignore[possibly-undefined] File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 66, in codegen_node return self._triton_scheduling.codegen_node(node) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3377, in codegen_node return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3602, in codegen_node_schedule final_kernel.call_kernel(final_kernel.kernel_name) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3055, in call_kernel grid = wrapper.generate_default_grid(name, grid) File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cpp_wrapper_cuda.py", line 174, in generate_default_grid params is not None AssertionError: cuda kernel parameters for triton_poi_fused_add_0 should already exist at this moment, only found dict_keys(['Placeholder.DESCRIPTIVE_NAME', 'triton_poi_fused_add_mul_0', 'triton_poi_fused_pow_1']) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126622 Approved by: https://github.com/chenyang78 ghstack dependencies: #125982	2024-05-20 16:58:08 +00:00
Catherine Lee	7aa853a54e	[CI] Install sccache on XLA build job (#126117 ) XLA build job uses a docker image from XLA, which doesn't have sccache installed. The XLA build job just builds pytorch, XLA gets built during the test job. The pytorch build was taking 1+hrs, with a warm cache it takes <30min Pull Request resolved: https://github.com/pytorch/pytorch/pull/126117 Approved by: https://github.com/malfet	2024-05-20 16:39:14 +00:00
Xia, Weiwen	3642e51ea5	[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667 ) Description Add fusion path for dynamic quant and for QAT. The following patterns can be matched for static quant with QAT cases: `qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant` The following patterns can be matched for dynamic quant cases: `qx -> qlinear -> add -> optional relu` Test plan python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear python test/test_quantization.py -k test_linear_unary python test/test_quantization.py -k test_linear_binary Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667 Approved by: https://github.com/jgong5	2024-05-20 15:55:18 +00:00
Nikita Shulga	2f53747ec6	Speedup bf16 gemm fallback on ARM (#126592 ) By dispatching it to multiple threads and using vectorized dot operation (with bf16 to fp32 upcasts via left shift). This bumps stories110M eval from 22 to 55 tokens/sec using bfloat16. TODO: - Refactor tinygemm template and use it here Pull Request resolved: https://github.com/pytorch/pytorch/pull/126592 Approved by: https://github.com/mikekgfb	2024-05-20 12:39:51 +00:00
PyTorch MergeBot	cb69c51b6f	Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 )" This reverts commit cf35a591b95220aa1bfcc04ff8a943efd1d6d6eb. Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))	2024-05-20 12:14:22 +00:00
Peter Bell	7100a72950	[inductor] Fix ops.scan for non-commutative operators (#126633 ) `tl.associative_scan` supports non-commutative combine functions but `tl.reduce` doesn't. This affects non-persistent scans, where we use the reduction from the previous loop iterations as the base for future iterations. Here I work around this by taking the last element of the scan output and using that as the reduced value. This is done using a trick where we create a mask that is 1 at the desired element and 0 elsewhere, then sum over it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126633 Approved by: https://github.com/Chillee, https://github.com/lezcano	2024-05-20 10:27:17 +00:00
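The masking trick from the entry above can be sketched in plain Python. This is not Triton or Inductor code: `combine`, `last_via_mask`, and `scan_in_blocks` are hypothetical names, and composition of affine maps is used as an assumed example of an associative but non-commutative operator.

```python
from itertools import accumulate


def combine(f, g):
    """Compose affine maps x -> a*x + b: apply f first, then g (order matters)."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)


def last_via_mask(values):
    """Pick the last element with a 0/1 mask and a commutative sum."""
    n = len(values)
    mask = [1 if i == n - 1 else 0 for i in range(n)]
    return sum(v * m for v, m in zip(values, mask))


def scan_in_blocks(items, block_size):
    """Inclusive scan done block by block, carrying the running value forward."""
    out, carry = [], None
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        if carry is None:
            scanned = list(accumulate(block, combine))
        else:
            # Seed the block with the carry, then drop the seed from the output.
            scanned = list(accumulate(block, combine, initial=carry))[1:]
        out.extend(scanned)
        # The carry is the last scan element, extracted per component via the
        # mask-and-sum trick rather than a reduction with `combine`.
        carry = (last_via_mask([a for a, _ in scanned]),
                 last_via_mask([b for _, b in scanned]))
    return out


items = [(2, 1), (3, 0), (1, 4), (5, 2), (2, 2)]
blocked = scan_in_blocks(items, block_size=2)
reference = list(accumulate(items, combine))
```

Because the carry is read off the scan output instead of being recomputed with a reduction, only the order-preserving scan ever applies `combine`, and the plain sum is safe since exactly one mask entry is nonzero.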
PyTorch MergeBot	d9c3485146	Revert "c10d: add Collectives abstraction (#125978 )" This reverts commit 4b2ae2ac338f3a0de340c9711b03989b8ce66fc6. Reverted https://github.com/pytorch/pytorch/pull/125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/125978#issuecomment-2119858015))	2024-05-20 07:40:41 +00:00
PyTorch MergeBot	53f73cdeb6	Revert "Add symbolic_shape_specialization structured trace (#126450 )" This reverts commit da1fc85d60fcf0bd1e8638d643a7c0c6560c3a5f. Reverted https://github.com/pytorch/pytorch/pull/126450 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126450#issuecomment-2119798075))	2024-05-20 06:59:58 +00:00
PyTorch MergeBot	5ad2f10034	Revert "[inductor] Load python modules using importlib (#126454 )" This reverts commit faa26df72e2a3ff08f9dd564bb50756916826854. Reverted https://github.com/pytorch/pytorch/pull/126454 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126454#issuecomment-2119771267))	2024-05-20 06:41:11 +00:00
jayanth domalapalli	cf35a591b9	Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 ) This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure. Lintrunner passed: ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Tests passed: ``` >python test_cuda.py -k test_graph_optims Ran 19 tests in 7.463s OK (skipped=9) >python test_cuda.py -k test_graph_scaling_fused_optimizers Ran 6 tests in 2.800s OK (skipped=3) ``` Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers. 
I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called: ``` {'foreach': False, 'maximize': True, 'weight_decay': 0} { 'foreach': True, 'maximize': True, 'weight_decay': 0} ``` I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned: ``` {'amsgrad': True} ``` Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127 Approved by: https://github.com/janeyx99	2024-05-20 06:20:45 +00:00
Simon Fan	5fb11cda4f	[compiled autograd] Better cache miss logging (#126602 ) - log only first node key cache miss - log existing node key sizes - log which node's collected sizes became dynamic e.g. ``` DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[] ... DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::AccumulateGrad (NodeCall 5) with key size 32, previous key sizes=[21] ... DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 0 of torch::autograd::GraphRoot (NodeCall 0) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of SumBackward0 (NodeCall 1) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 4 of SumBackward0 (NodeCall 1) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 2) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 9 of AddmmBackward0 (NodeCall 3) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of torch::autograd::AccumulateGrad (NodeCall 5) DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 6) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126602 Approved by: https://github.com/jansel ghstack dependencies: #126144, #126146, #126148, #126483	2024-05-19 23:49:52 +00:00
Simon Fan	be67985bd7	[compiled autograd] log in cpp using python logger (#126483 ) Internal infra may not preserve python and c++ log ordering e.g. MAST logs: https://fburl.com/mlhub/38576cxn, all the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs of the entire run are at the beginning of the file Pull Request resolved: https://github.com/pytorch/pytorch/pull/126483 Approved by: https://github.com/jansel ghstack dependencies: #126144, #126146, #126148	2024-05-19 23:49:52 +00:00
cyy	574ae9afb8	[Submodule] Remove third-party onnx-tensorrt (#126542 ) It seems that tensorrt is not used by the C++ code, may be due to the removal of Caffe2. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126542 Approved by: https://github.com/ezyang	2024-05-19 22:34:24 +00:00
cyy	853081a8e7	Replace torch.library.impl_abstract with torch.library.register_fake (#126606 ) To remove the disrupting warning ``` warnings.warn("torch.library.impl_abstract was renamed to " "torch.library.register_fake. Please use that instead; " "we will remove torch.library.impl_abstract in a future " "version of PyTorch.", DeprecationWarning, stacklevel=2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126606 Approved by: https://github.com/ezyang	2024-05-19 13:21:39 +00:00

4748 changed files with 167698 additions and 153881 deletions

5

.ci/docker/aotriton_version.txt Normal file


 @ -0,0 +1,5 @@
 .6b
 manylinux_2_17
 rocm6.1
 f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
 c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1
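The five lines of this file are consumed positionally by `install_aotriton.sh` further down (version, manylinux tag, ROCm base, pinned commit, SHA256). That script uses `read -d "\n"`; in bash, `-d` takes only the first character of its argument, which here is a backslash, so the read runs to end of file and reports failure, hence the `|| true`. A minimal sketch of a more explicit line-by-line read follows. The helper name `read_aotriton_pins` and the sample values are hypothetical and are not part of the commit.

```shell
# Hypothetical helper: read five newline-separated fields from the file in $1.
read_aotriton_pins() {
  {
    read -r VER
    read -r MANYLINUX
    read -r ROCMBASE
    read -r PINNED_COMMIT
    # The last line may lack a trailing newline; read still assigns the
    # variable in that case but reports failure, so the status is ignored.
    read -r SHA256 || true
  } < "$1"
}

# Usage with made-up values (these are NOT the real pins).
pins_file=$(mktemp)
printf '%s\n' 1.0 manylinux_x rocmX deadbeef 0123abcd > "$pins_file"
read_aotriton_pins "$pins_file"
echo "version=${VER} platform=${MANYLINUX} rocm=${ROCMBASE}"
rm -f "$pins_file"
```

Reading one field per `read` makes a short or reordered file easier to spot than a single multi-variable read.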

									
										103

.ci/docker/build.sh
									
												
				@ -91,9 +91,9 @@ _UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b

				# configuration, so we hardcode everything here rather than do it

				# from scratch

				case "$image" in

				  pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9)

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -105,9 +105,9 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -119,9 +119,9 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9-inductor-benchmarks)

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -134,9 +134,9 @@ case "$image" in

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks)

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -149,9 +149,9 @@ case "$image" in

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn8-py3.12-gcc9-inductor-benchmarks)

				  pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -164,23 +164,24 @@ case "$image" in

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9)

				    CUDA_VERSION=11.8.0

				    CUDNN_VERSION=8

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn8-py3-gcc9)

				  pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)

				    CUDA_VERSION=11.8.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -192,9 +193,37 @@ case "$image" in

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.1.1

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				    DB=yes

				    VISION=yes

				    KATEX=yes

				    UCX_COMMIT=${_UCX_COMMIT}

				    UCC_COMMIT=${_UCC_COMMIT}

				    CONDA_CMAKE=yes

				    TRITON=yes

				    ;;

				  pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)

				    CUDA_VERSION=12.4.0

				    CUDNN_VERSION=9

				    ANACONDA_PYTHON_VERSION=3.10

				    GCC_VERSION=9

				    PROTOBUF=yes

				@ -301,10 +330,10 @@ case "$image" in

				    DOCS=yes

				    INDUCTOR_BENCHMARKS=yes

				    ;;

				  pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12)

				  pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12)

				    ANACONDA_PYTHON_VERSION=3.8

				    CUDA_VERSION=11.8

				    CUDNN_VERSION=8

				    CUDNN_VERSION=9

				    CLANG_VERSION=12

				    PROTOBUF=yes

				    DB=yes

				@ -344,6 +373,13 @@ case "$image" in

				    CONDA_CMAKE=yes

				    EXECUTORCH=yes

				    ;;

				  pytorch-linux-jammy-py3.12-halide)

				    CUDA_VERSION=12.4

				    ANACONDA_PYTHON_VERSION=3.12

				    GCC_VERSION=11

				    CONDA_CMAKE=yes

				    HALIDE=yes

				    ;;

				  pytorch-linux-focal-linter)

				    # TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

				    # We will need to update mypy version eventually, but that's for another day. The task

				@ -351,7 +387,7 @@ case "$image" in

				    ANACONDA_PYTHON_VERSION=3.9

				    CONDA_CMAKE=yes

				    ;;

				  pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter)

				  pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)

				    ANACONDA_PYTHON_VERSION=3.9

				    CUDA_VERSION=11.8

				    CONDA_CMAKE=yes

				@ -418,7 +454,7 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')

				#when using cudnn version 8 install it separately from cuda

				if [[ "$image" == *cuda*  && ${OS} == "ubuntu" ]]; then

				  IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"

				  if [[ ${CUDNN_VERSION} == 8 ]]; then

				  if [[ ${CUDNN_VERSION} == 9 ]]; then

				    IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"

				  fi

				fi
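The hunk above flips the special case from cuDNN 8 to cuDNN 9: when `CUDNN_VERSION` is 9 the base image tag omits the `-cudnn…` component and cuDNN is installed separately (the unchanged comment still says "version 8"). A minimal sketch of that selection, written as a standalone function; `base_image_name` and the sample arguments are hypothetical, and the `*cuda*`/`ubuntu` guard from the script is left out.

```shell
# Hypothetical helper mirroring the conditional in the hunk above.
# Arguments: CUDA version, cuDNN major version, Ubuntu version.
base_image_name() {
  local cuda_version="$1" cudnn_version="$2" ubuntu_version="$3"
  if [ "$cudnn_version" = "9" ]; then
    # cuDNN 9 is installed by a separate script, so use the plain CUDA image.
    echo "nvidia/cuda:${cuda_version}-devel-ubuntu${ubuntu_version}"
  else
    echo "nvidia/cuda:${cuda_version}-cudnn${cudnn_version}-devel-ubuntu${ubuntu_version}"
  fi
}

# Sample calls with illustrative versions.
base_image_name 12.4.0 9 20.04
base_image_name 11.8.0 8 20.04
```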

				@ -461,6 +497,7 @@ docker build \

				       --build-arg "DOCS=${DOCS}" \

				       --build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

				       --build-arg "EXECUTORCH=${EXECUTORCH}" \

				       --build-arg "HALIDE=${HALIDE}" \

				       --build-arg "XPU_VERSION=${XPU_VERSION}" \

				       --build-arg "ACL=${ACL:-}" \

				       --build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \

				@ -470,7 +507,7 @@ docker build \

				       "$@" \

				       .

				# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,

				# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,

				# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could

				# find the correct image. As a result, here we have to replace the

				#   "$UBUNTU_VERSION" == "18.04-rc"

									
										10

.ci/docker/centos-rocm/Dockerfile
									
												
				@ -77,6 +77,9 @@ RUN rm install_rocm.sh

				COPY ./common/install_rocm_magma.sh install_rocm_magma.sh

				RUN bash ./install_rocm_magma.sh

				RUN rm install_rocm_magma.sh

				COPY ./common/install_amdsmi.sh install_amdsmi.sh

				RUN bash ./install_amdsmi.sh

				RUN rm install_amdsmi.sh

				ENV PATH /opt/rocm/bin:$PATH

				ENV PATH /opt/rocm/hcc/bin:$PATH

				ENV PATH /opt/rocm/hip/bin:$PATH

				@ -110,6 +113,13 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt

				# Install AOTriton (Early fail)

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

2

.ci/docker/ci_commit_pins/executorch.txt


 @ -1 +1 @@
 d4b3e5cc607e97afdba79dc90f8ef968142f347c
 d859653ae916d0a72f6b2b5c5925bed38832140

1

.ci/docker/ci_commit_pins/halide.txt Normal file


 @ -0,0 +1 @@
 340136fec6d3ebc73e7a19eba1663e9b0ba8ab2d

2

.ci/docker/ci_commit_pins/triton-rocm.txt


 @ -1 +1 @@
 bbe6246e37d8aa791c67daaf9d9d61b26c9ccfdc
 eae954efa5bf584da70324b640288c3ee7aede

2

.ci/docker/ci_commit_pins/triton-xpu.txt


 @ -1 +1 @@
 b8c64f64c18d8cac598b3adb355c21e7439c21de
 b2f15840e0d70eec50d84c7a0575cb835524def

2

.ci/docker/ci_commit_pins/triton.txt


 @ -1 +1 @@
 fff310c891f5a92d55445adf8cc9d29df5841e
 dedb7bdf339a3546896d4820366ca562c586bfa0

									
										5

.ci/docker/common/install_amdsmi.sh
									
										Normal file
									
												
				@ -0,0 +1,5 @@

				#!/bin/bash

				set -ex

				cd /opt/rocm/share/amd_smi && pip install .

									
										23

.ci/docker/common/install_aotriton.sh
									
										Executable file
									
												
				@ -0,0 +1,23 @@

				#!/bin/bash

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				TARBALL='aotriton.tar.bz2'

				# This read command always returns with exit code 1

				read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true

				ARCH=$(uname -m)

				AOTRITON_INSTALL_PREFIX="$1"

				AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"

				cd "${AOTRITON_INSTALL_PREFIX}"

				# Must use -L to follow redirects

				curl -L --retry 3 -o "${TARBALL}" "${AOTRITON_URL}"

				ACTUAL_SHA256=$(sha256sum "${TARBALL}" | cut -d " " -f 1)

				if [ "${SHA256}" != "${ACTUAL_SHA256}" ]; then

				  echo -n "Error: The SHA256 of downloaded tarball is ${ACTUAL_SHA256},"

				  echo " which does not match the expected value ${SHA256}."

				  exit

				fi

				tar xf "${TARBALL}" && rm -rf "${TARBALL}"
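One detail in the checksum branch above is worth noting: a bare `exit` in a shell script exits with the status of the last command executed, which after a successful `echo` is 0, so a mismatch may not fail the Docker build on its own. The following is a minimal sketch of the same check with an explicit non-zero status. The helper name `verify_sha256` is hypothetical and the sketch is not the script's actual code.

```shell
# Hypothetical helper: succeed only if the file in $1 hashes to the digest in $2.
verify_sha256() {
  local actual
  actual=$(sha256sum "$1" | cut -d " " -f 1)
  if [ "$2" != "$actual" ]; then
    echo "Error: SHA256 of $1 is ${actual}, expected $2." >&2
    return 1  # explicit failure so callers (or set -e) can stop
  fi
}

# Usage on a throwaway file.
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
expected=$(sha256sum "$tmp" | cut -d " " -f 1)
verify_sha256 "$tmp" "$expected" && echo "checksum ok"
verify_sha256 "$tmp" "not-the-digest" 2>/dev/null || echo "mismatch detected"
rm -f "$tmp"
```

Under `set -e`, as used in the script above, a failing `verify_sha256 …` call placed on its own line would abort the script.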

									
										2

.ci/docker/common/install_base.sh
									
												
				@ -3,7 +3,7 @@

				set -ex

				install_ubuntu() {

				  # NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,

				  # NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,

				  # for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could

				  # find the correct image. As a result, here we have to check for

				  #   "$UBUNTU_VERSION" == "18.04"*

									
										2

.ci/docker/common/install_conda.sh
									
												
				@ -85,7 +85,7 @@ fi

				  else

				    CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"

				    if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then

				    if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then

				      conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}

				    else

				      conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}

									
										17

.ci/docker/common/install_cudnn.sh
									
												
				@ -1,23 +1,18 @@

				#!/bin/bash

				if [[ ${CUDNN_VERSION} == 8 ]]; then

				if [[ -n "${CUDNN_VERSION}" ]]; then

				    # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

				    mkdir tmp_cudnn

				    pushd tmp_cudnn

				    if [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-8.9.7.29_cuda12-archive"

				        curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz

				    elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"

				        curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz

				    elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"

				        curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz

				    if [[ ${CUDA_VERSION:0:2} == "12" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"

				    elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then

				        CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"

				    else

				        print "Unsupported CUDA version ${CUDA_VERSION}"

				        exit 1

				    fi

				    curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz

				    tar xf ${CUDNN_NAME}.tar.xz

				    cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/

				    cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
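The rewritten script keys the cuDNN archive name on the CUDA major version only and downloads from a single URL. A minimal sketch of that selection as a function is shown below; `cudnn_archive_name` is a hypothetical name, and the sketch uses a `case` pattern in place of the `${CUDA_VERSION:0:2}` substring test. Note that `print` is not a bash builtin, so the sketch reports the unsupported case with `echo` to stderr.

```shell
# Hypothetical helper: map a CUDA version string to the cuDNN 9.1.0.70 archive name.
cudnn_archive_name() {
  case "$1" in
    12*) echo "cudnn-linux-x86_64-9.1.0.70_cuda12-archive" ;;
    11*) echo "cudnn-linux-x86_64-9.1.0.70_cuda11-archive" ;;
    *)
      echo "Unsupported CUDA version $1" >&2
      return 1
      ;;
  esac
}

# Sample calls with the CUDA versions that appear in build.sh.
cudnn_archive_name 12.4.0
cudnn_archive_name 11.8.0
```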

									
										14

.ci/docker/common/install_executorch.sh
									
												
				@ -37,6 +37,9 @@ install_conda_dependencies() {

				install_pip_dependencies() {

				  pushd executorch/.ci/docker

				  # Install PyTorch CPU build beforehand to avoid installing the much bigger CUDA

				  # binaries later, ExecuTorch only needs CPU

				  pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

				  # Install all Python dependencies

				  pip_install -r requirements-ci.txt

				  popd

				@ -44,13 +47,14 @@ install_pip_dependencies() {

				setup_executorch() {

				  pushd executorch

				  source .ci/scripts/utils.sh

				  # Setup swiftshader and Vulkan SDK which are required to build the Vulkan delegate

				  as_jenkins bash .ci/scripts/setup-vulkan-linux-deps.sh

				  install_flatc_from_source

				  pip_install .

				  export PYTHON_EXECUTABLE=python

				  export EXECUTORCH_BUILD_PYBIND=ON

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  # Make sure that all the newly generate files are owned by Jenkins

				  chown -R jenkins .

				  as_jenkins .ci/scripts/setup-linux.sh cmake

				  popd

				}

									
										46

.ci/docker/common/install_halide.sh
									
										Normal file
									
												
				@ -0,0 +1,46 @@

				#!/bin/bash

				set -ex

				source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"

				COMMIT=$(get_pinned_commit halide)

				test -n "$COMMIT"

				# activate conda to populate CONDA_PREFIX

				test -n "$ANACONDA_PYTHON_VERSION"

				eval "$(conda shell.bash hook)"

				conda activate py_$ANACONDA_PYTHON_VERSION

				if [ -n "${UBUNTU_VERSION}" ];then

				    apt update

				    apt-get install -y lld liblld-15-dev libpng-dev libjpeg-dev libgl-dev \

				                  libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev

				fi

				conda_install numpy scipy imageio cmake ninja

				git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git

				cmake -DCMAKE_BUILD_TYPE=Release \

				        -DLLVM_ENABLE_PROJECTS="clang" \

				        -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \

				        -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_ENABLE_ASSERTIONS=ON \

				        -DLLVM_ENABLE_EH=ON -DLLVM_ENABLE_RTTI=ON -DLLVM_BUILD_32_BITS=OFF \

				        -S llvm-project/llvm -B llvm-build -G Ninja

				cmake --build llvm-build

				cmake --install llvm-build --prefix llvm-install

				export LLVM_ROOT=`pwd`/llvm-install

				export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config

				git clone https://github.com/halide/Halide.git

				pushd Halide

				git checkout ${COMMIT} && git submodule update --init --recursive

				pip_install -r requirements.txt

				cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build

				cmake --build build

				test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3

				cmake --install build --prefix ${CONDA_PREFIX}

				chown -R jenkins ${CONDA_PREFIX}

				popd

				rm -rf Halide llvm-build llvm-project llvm-install

				python -c "import halide"  # check for errors

									
										8

.ci/docker/common/install_onnx.sh
									
												
				@ -30,10 +30,12 @@ pip_install \

				pip_install coloredlogs packaging

				pip_install onnxruntime==1.17.0

				pip_install onnx==1.15.0

				pip_install onnxruntime==1.18

				pip_install onnx==1.16.0

				# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps

				pip_install onnxscript==0.1.0.dev20240315 --no-deps

				pip_install onnxscript==0.1.0.dev20240613 --no-deps

				# required by onnxscript

				pip_install ml_dtypes

				# Cache the transformers model to be used later by ONNX tests. We need to run the transformers

				# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

									
										6

.ci/docker/common/install_rocm.sh
									
												
				@ -39,7 +39,8 @@ install_ubuntu() {

				                   rocm-libs \

				                   rccl \

				                   rocprofiler-dev \

				                   roctracer-dev

				                   roctracer-dev \

				                   amd-smi-lib

				    if [[ $(ver $ROCM_VERSION) -ge $(ver 6.1) ]]; then

				        DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated rocm-llvm-dev

				@ -106,7 +107,8 @@ install_centos() {

				                   rocm-libs \

				                   rccl \

				                   rocprofiler-dev \

				                   roctracer-dev

				                   roctracer-dev \

				                   amd-smi-lib

				  # precompiled miopen kernels; search for all unversioned packages

				  # if search fails it will abort this script; use true to avoid case where search fails
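The first hunk above gates `rocm-llvm-dev` on `$(ver $ROCM_VERSION) -ge $(ver 6.1)`. The `ver` helper is defined outside the lines shown here. The sketch below is one plausible way to turn a dotted version into an integer that `-ge` can compare; `version_key` is a hypothetical stand-in and is not the script's actual definition.

```shell
# Hypothetical helper: turn "MAJOR.MINOR.PATCH" into one integer for numeric comparison.
# Assumes at most three components, each below 1000 and without leading zeros.
version_key() {
  local IFS=.
  # Split the version on dots into positional parameters.
  set -- $1
  # Missing components default to 0, so "6.1" is treated as "6.1.0".
  printf '%d%03d%03d\n' "${1:-0}" "${2:-0}" "${3:-0}"
}

# Usage with an illustrative version.
ROCM_VERSION=6.1.2
if [ "$(version_key "$ROCM_VERSION")" -ge "$(version_key 6.1)" ]; then
  echo "ROCm ${ROCM_VERSION} is at least 6.1"
fi
```

Comparing padded integers avoids the pitfall of string comparison, where "6.10" would sort before "6.9".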

10

.ci/docker/requirements-ci.txt


 @ -85,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
 #Pinned versions:
 #test that import:
 mypy==1.9.0
 mypy==1.10.0
 # Pin MyPy version because new errors are likely to appear with each release
 #Description: linter
 #Pinned versions: 1.9.0
 #Pinned versions: 1.10.0
 #test that import: test_typing.py, test_type_hints.py
 networkx==2.8.8
 @ -134,9 +134,9 @@ opt-einsum==3.3
 #Pinned versions: 3.3
 #test that import: test_linalg.py
 optree==0.11.0
 optree==0.12.1
 #Description: A library for tree manipulation
 #Pinned versions: 0.11.0
 #Pinned versions: 0.12.1
 #test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
 #test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
 #common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
 @ -306,7 +306,7 @@ pywavelets==1.5.0 ; python_version >= "3.12"
 #Pinned versions: 1.4.1
 #test that import:
 lxml==5.0.0.
 lxml==5.0.0
 #Description: This is a requirement of unittest-xml-reporting
 # Python-3.9 binaries

									
										12

.ci/docker/ubuntu-cuda/Dockerfile
									
												
				@ -103,6 +103,14 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt

				ARG HALIDE

				# Build and install halide

				COPY ./common/install_halide.sh install_halide.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/halide.txt halide.txt

				RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi

				RUN rm install_halide.sh common_utils.sh halide.txt

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

				@ -139,7 +147,7 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm

				ARG CUDNN_VERSION

				ARG CUDA_VERSION

				COPY ./common/install_cudnn.sh install_cudnn.sh

				RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi

				RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi

				RUN rm install_cudnn.sh

				# Install CUSPARSELT

				@ -152,7 +160,7 @@ RUN rm install_cusparselt.sh

				RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi

				RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

				RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi

				RUN if [ -h /usr/local/cuda-12.1/cuda-12.4 ]; then rm /usr/local/cuda-12.1/cuda-12.4; fi

				RUN if [ -h /usr/local/cuda-12.4/cuda-12.4 ]; then rm /usr/local/cuda-12.4/cuda-12.4; fi

				USER jenkins

				CMD ["bash"]

									
										12

.ci/docker/ubuntu-rocm/Dockerfile
									
												
				@ -78,6 +78,11 @@ ENV MAGMA_HOME /opt/rocm/magma

				ENV LANG C.UTF-8

				ENV LC_ALL C.UTF-8

				# Install amdsmi

				COPY ./common/install_amdsmi.sh install_amdsmi.sh

				RUN bash ./install_amdsmi.sh

				RUN rm install_amdsmi.sh

				# (optional) Install non-default CMake version

				ARG CMAKE_VERSION

				COPY ./common/install_cmake.sh install_cmake.sh

				@ -100,6 +105,13 @@ COPY triton_version.txt triton_version.txt

				RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi

				RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt

				# Install AOTriton

				COPY ./aotriton_version.txt aotriton_version.txt

				COPY ./common/common_utils.sh common_utils.sh

				COPY ./common/install_aotriton.sh install_aotriton.sh

				RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]

				ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

				# Install ccache/sccache (do this last, so we get priority in PATH)

				COPY ./common/install_cache.sh install_cache.sh

				ENV PATH /opt/cache/bin:$PATH

									
										8

.ci/docker/ubuntu/Dockerfile
									
												
				@ -155,6 +155,14 @@ COPY ci_commit_pins/executorch.txt executorch.txt

				RUN if [ -n "${EXECUTORCH}" ]; then bash ./install_executorch.sh; fi

				RUN rm install_executorch.sh common_utils.sh executorch.txt

				ARG HALIDE

				# Build and install halide

				COPY ./common/install_halide.sh install_halide.sh

				COPY ./common/common_utils.sh common_utils.sh

				COPY ci_commit_pins/halide.txt halide.txt

				RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi

				RUN rm install_halide.sh common_utils.sh halide.txt

				ARG ONNX

				# Install ONNX dependencies

				COPY ./common/install_onnx.sh ./common/common_utils.sh ./

									
										37

.ci/pytorch/build.sh
									
												
				@ -44,10 +44,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then

				  fi

				fi

				if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then

				  export ATEN_THREADING=TBB

				  export USE_TBB=1

				elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then

				  export ATEN_THREADING=NATIVE

				fi

				@ -233,6 +230,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]

				  export BUILD_STATIC_RUNTIME_BENCHMARK=ON

				fi

				if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then

				  export CMAKE_BUILD_TYPE=RelWithAssert

				fi

				# Do not change workspace permissions for ROCm CI jobs

				# as it can leave workspace with bad permissions for cancelled jobs

				if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				@ -287,9 +288,26 @@ else

				        # Which should be backward compatible with Numpy-1.X

				        python -mpip install --pre numpy==2.0.0rc1

				      fi

				      WERROR=1 python setup.py bdist_wheel

				      WERROR=1 python setup.py clean

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel

				        BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 python setup.py bdist_wheel --cmake

				      else

				        WERROR=1 python setup.py bdist_wheel

				      fi

				    else

				      python setup.py bdist_wheel

				      python setup.py clean

				      if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then

				        source .ci/pytorch/install_cache_xla.sh

				      fi

				      if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				        echo "USE_SPLIT_BUILD cannot be used with xla or rocm"

				        exit 1

				      else

				        python setup.py bdist_wheel

				      fi

				    fi

				    pip_install_whl "$(echo dist/*.whl)"

				@ -328,9 +346,10 @@ else

				    CUSTOM_OP_TEST="$PWD/test/custom_operator"

				    python --version

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    mkdir -p "$CUSTOM_OP_BUILD"

				    pushd "$CUSTOM_OP_BUILD"

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -343,7 +362,7 @@ else

				    SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"

				    mkdir -p "$JIT_HOOK_BUILD"

				    pushd "$JIT_HOOK_BUILD"

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \

				    cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

				@ -355,7 +374,7 @@ else

				    python --version

				    mkdir -p "$CUSTOM_BACKEND_BUILD"

				    pushd "$CUSTOM_BACKEND_BUILD"

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \

				    cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \

				          -DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"

				    make VERBOSE=1

				    popd

									
										46

.ci/pytorch/common_utils.sh
									
												
				@ -56,9 +56,29 @@ function assert_git_not_dirty() {

				function pip_install_whl() {

				  # This is used to install PyTorch and other build artifacts wheel locally

				  # without using any network connection

				  python3 -mpip install --no-index --no-deps "$@"

				  # Convert the input arguments into an array

				  local args=("$@")

				  # Check if the first argument contains multiple paths separated by spaces

				  if [[ "${args[0]}" == *" "* ]]; then

				    # Split the string by spaces into an array

				    IFS=' ' read -r -a paths <<< "${args[0]}"

				    # Loop through each path and install individually

				    for path in "${paths[@]}"; do

				      echo "Installing $path"

				      python3 -mpip install --no-index --no-deps "$path"

				    done

				  else

				    # Loop through each argument and install individually

				    for path in "${args[@]}"; do

				      echo "Installing $path"

				      python3 -mpip install --no-index --no-deps "$path"

				    done

				  fi

				}

				function pip_install() {

				  # retry 3 times

				  # old versions of pip don't have the "--progress-bar" flag

				@ -188,28 +208,6 @@ function clone_pytorch_xla() {

				  fi

				}

				function checkout_install_torchdeploy() {

				  local commit

				  commit=$(get_pinned_commit multipy)

				  pushd ..

				  git clone --recurse-submodules https://github.com/pytorch/multipy.git

				  pushd multipy

				  git checkout "${commit}"

				  python multipy/runtime/example/generate_examples.py

				  BUILD_CUDA_TESTS=1 pip install -e .

				  popd

				  popd

				}

				function test_torch_deploy(){

				 pushd ..

				 pushd multipy

				 ./multipy/runtime/build/test_deploy

				 ./multipy/runtime/build/test_deploy_gpu

				 popd

				 popd

				}

				function checkout_install_torchbench() {

				  local commit

				  commit=$(get_pinned_commit torchbench)

				@ -224,6 +222,8 @@ function checkout_install_torchbench() {

				    # to install and test other models

				    python install.py --continue_on_fail

				  fi

				  echo "Print all dependencies after TorchBench is installed"

				  python -mpip freeze

				  popd

				}
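The new `pip_install_whl` body accepts either one string holding several space-separated wheel paths or several separate arguments, and installs each path individually. Below is a minimal sketch of just that splitting logic. `split_wheel_args` and the wheel names are hypothetical, and the `pip` call is replaced by `echo` so the behaviour can be inspected offline. As in the diff, when the first argument contains spaces, any further arguments are ignored.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the argument handling of pip_install_whl.
# It prints one path per line instead of invoking pip.
split_wheel_args() {
  local args=("$@")
  local paths=()
  if [[ "${args[0]}" == *" "* ]]; then
    # One argument that holds several space-separated paths
    IFS=' ' read -r -a paths <<< "${args[0]}"
  else
    # Several arguments, one path each
    paths=("${args[@]}")
  fi
  local path
  for path in "${paths[@]}"; do
    echo "$path"
  done
}

# Both call styles yield one line per wheel (file names are made up).
split_wheel_args "dist/a.whl dist/b.whl"
split_wheel_args dist/c.whl dist/d.whl
```

Installing wheels one at a time makes it easier to see in the CI log which wheel failed, at the cost of one `pip` invocation per file.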

									
										1

.ci/pytorch/create_test_cert.py
									
												
				@ -6,6 +6,7 @@ from cryptography.hazmat.primitives import hashes, serialization

				from cryptography.hazmat.primitives.asymmetric import rsa

				from cryptography.x509.oid import NameOID

				temp_dir = mkdtemp()

				print(temp_dir)

									
										37

.ci/pytorch/install_cache_xla.sh
									
										Executable file
									
												
				@ -0,0 +1,37 @@

				#!/bin/bash

				# Script for installing sccache on the xla build job, which uses xla's docker

				# image and doesn't have sccache installed on it.  This is mostly copied from

				# .ci/docker/install_cache.sh.  Changes are: removing checks that will always

				# return the same thing, e.g. checks for rocm, CUDA, and changing the path

				# where sccache is installed, and not changing /etc/environment.

				set -ex

				install_binary() {

				  echo "Downloading sccache binary from S3 repo"

				  curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /tmp/cache/bin/sccache

				}

				mkdir -p /tmp/cache/bin

				mkdir -p /tmp/cache/lib

				export PATH="/tmp/cache/bin:$PATH"

				install_binary

				chmod a+x /tmp/cache/bin/sccache

				function write_sccache_stub() {

				  # Unset LD_PRELOAD for ps because of asan + ps issues

				  # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589

				  # shellcheck disable=SC2086

				  # shellcheck disable=SC2059

				  printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n  exec sccache $(which $1) \"\$@\"\nelse\n  exec $(which $1) \"\$@\"\nfi" > "/tmp/cache/bin/$1"

				  chmod a+x "/tmp/cache/bin/$1"

				}

				write_sccache_stub cc

				write_sccache_stub c++

				write_sccache_stub gcc

				write_sccache_stub g++

				write_sccache_stub clang

				write_sccache_stub clang++
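The `printf` one-liner in `write_sccache_stub` is hard to read, so here is a sketch of the wrapper it generates, written with a here-document instead. This is an illustration under assumptions, not the script's actual code: `STUB_DIR` is a hypothetical scratch directory rather than `/tmp/cache/bin`, `write_stub` is a made-up name, `command -v` stands in for `which`, and `sh` is wrapped only because it exists on every system. The stub is generated and printed but never executed, so `sccache` does not need to be installed.

```bash
#!/usr/bin/env bash
# Assumption: a throwaway directory instead of the /tmp/cache/bin used in the diff.
STUB_DIR="$(mktemp -d)"

# Hypothetical equivalent of write_sccache_stub: write a small /bin/sh wrapper
# that routes the tool through sccache unless sccache is already the parent process.
write_stub() {
  local tool="$1"
  local real
  real="$(command -v "$tool")"
  cat > "$STUB_DIR/$tool" <<EOF
#!/bin/sh
if [ "\$(env -u LD_PRELOAD ps -p \$PPID -o comm=)" != sccache ]; then
  exec sccache $real "\$@"
else
  exec $real "\$@"
fi
EOF
  chmod a+x "$STUB_DIR/$tool"
}

write_stub sh
cat "$STUB_DIR/sh"
```

The parent-process check is what prevents infinite recursion: when sccache itself invokes the compiler name, the stub sees `sccache` as its parent and calls the real binary directly.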

									
										5

.ci/pytorch/multigpu-test.sh
									
												
				@ -18,7 +18,9 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_nccl

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo

				time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl

				time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering

				time python test/run_test.py --verbose -i distributed/test_store

				time python test/run_test.py --verbose -i distributed/test_symmetric_memory

				time python test/run_test.py --verbose -i distributed/test_pg_wrapper

				time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent

				# FSDP tests

				@ -50,6 +52,9 @@ time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_ra

				# FSDP2 tests

				time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh

				# Pipelining composability tests

				time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py

				# Other tests

				time python test/run_test.py --verbose -i test_cuda_primary_ctx

				time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu

									
										1

.ci/pytorch/perf_test/compare_with_baseline.py
									
												
				@ -3,6 +3,7 @@ import json

				import math

				import sys

				parser = argparse.ArgumentParser()

				parser.add_argument(

				    "--test-name", dest="test_name", action="store", required=True, help="test name"

									
										1

.ci/pytorch/perf_test/get_stats.py
									
												
				@ -3,6 +3,7 @@ import sys

				import numpy

				sample_data_list = sys.argv[1:]

				sample_data_list = [float(v.strip()) for v in sample_data_list]

									
										1

.ci/pytorch/perf_test/update_commit_hash.py
									
												
				@ -1,6 +1,7 @@

				import json

				import sys

				data_file_path = sys.argv[1]

				commit_hash = sys.argv[2]

									
										1

.ci/pytorch/print_sccache_log.py
									
												
				@ -1,5 +1,6 @@

				import sys

				log_file_path = sys.argv[1]

				with open(log_file_path) as f:

									
										152

.ci/pytorch/test.sh
									
												
				@ -249,9 +249,7 @@ fi

				# This tests that the debug asserts are working correctly.

				if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then

				    echo "We are in debug mode: $BUILD_ENVIRONMENT. Expect the python assertion to fail"

				    # TODO: Enable the check after we setup the build to run debug asserts without having

				    #       to do a full (and slow) debug build

				    # (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")

				    (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")

				elif [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then

				    # Noop when debug is disabled. Skip bazel jobs because torch isn't available there yet.

				    echo "We are not in debug mode: $BUILD_ENVIRONMENT. Expect the assertion to pass"

				@ -277,6 +275,9 @@ test_python_shard() {

				  # Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly

				  # shellcheck disable=SC2086

				  # modify LD_LIBRARY_PATH to ensure it has the conda env.

				  # This set of tests has been shown to be buggy without it for the split-build

				  time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION

				  assert_git_not_dirty

				@ -323,9 +324,11 @@ test_inductor_distributed() {

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_save_load --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose

				  python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose

				  python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose

				  # this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported

				@ -334,26 +337,50 @@ test_inductor_distributed() {

				  assert_git_not_dirty

				}

				test_inductor() {

				  python tools/dynamo/verify_dynamo.py

				  python test/run_test.py --inductor --include test_modules test_ops test_ops_gradients test_torch --verbose

				  # Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state

				  python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor --verbose

				test_inductor_shard() {

				  if [[ -z "$NUM_TEST_SHARDS" ]]; then

				    echo "NUM_TEST_SHARDS must be defined to run a Python test shard"

				    exit 1

				  fi

				  python tools/dynamo/verify_dynamo.py

				  python test/run_test.py --inductor \

				    --include test_modules test_ops test_ops_gradients test_torch \

				    --shard "$1" "$NUM_TEST_SHARDS" \

				    --verbose

				  # Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state

				  python test/run_test.py \

				    --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor \

				    --shard "$1" "$NUM_TEST_SHARDS" \

				    --verbose

				}

				test_inductor_aoti() {

				  # docker build uses bdist_wheel which does not work with test_aot_inductor

				  # TODO: need a faster way to build

				  if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then

				      BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop

				      CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference

				    BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop

				    CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference

				  fi

				}

				test_inductor_cpp_wrapper_abi_compatible() {

				  export TORCHINDUCTOR_ABI_COMPATIBLE=1

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"

				  # cpu stack allocation causes segfault and needs more investigation

				  TORCHINDUCTOR_STACK_ALLOCATION=0 python test/run_test.py --include inductor/test_cpu_cpp_wrapper

				  PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper

				  python test/run_test.py --include inductor/test_cuda_cpp_wrapper

				  TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \

				    --training --inductor --disable-cudagraphs --only vit_base_patch16_224 \

				    --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"

				}

				# "Global" flags for inductor benchmarking controlled by TEST_CONFIG

				@ -378,7 +405,7 @@ if [[ "${TEST_CONFIG}" == *dynamic* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--dynamic-shapes --dynamic-batch-only)

				fi

				if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then

				if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then

				  DYNAMO_BENCHMARK_FLAGS+=(--device cpu)

				else

				  DYNAMO_BENCHMARK_FLAGS+=(--device cuda)

				@ -503,9 +530,10 @@ test_single_dynamo_benchmark() {

				    test_perf_for_dashboard "$suite" \

				      "${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"

				  else

				    if [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then

				    if [[ "${TEST_CONFIG}" == *aot_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then

				      # Test AOTInductor with the ABI-compatible mode on CI

				      # This can be removed once the ABI-compatible mode becomes default.

				      # For CPU device, we prefer non ABI-compatible mode on CI when testing AOTInductor.

				      export TORCHINDUCTOR_ABI_COMPATIBLE=1

				    fi

				    python "benchmarks/dynamo/$suite.py" \

				@ -527,6 +555,11 @@ test_inductor_micro_benchmark() {

				  python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"

				}

				test_inductor_halide() {

				  python test/run_test.py --include inductor/test_halide.py --verbose

				  assert_git_not_dirty

				}

				test_dynamo_benchmark() {

				  # Usage: test_dynamo_benchmark huggingface 0

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				@ -541,8 +574,16 @@ test_dynamo_benchmark() {

				  elif [[ "${TEST_CONFIG}" == *perf* ]]; then

				    test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"

				  else

				    if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"

				    if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then

				      local dt="float32"

				      if [[ "${TEST_CONFIG}" == *amp* ]]; then

				        dt="amp"

				      fi

				      if [[ "${TEST_CONFIG}" == *freezing* ]]; then

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" --freezing "$@"

				      else

				        test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" "$@"

				      fi

				    elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then

				      test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"

				    else

				@ -556,12 +597,16 @@ test_inductor_torchbench_smoketest_perf() {

				  TEST_REPORTS_DIR=$(pwd)/test/test-reports

				  mkdir -p "$TEST_REPORTS_DIR"

				  # smoke test the cpp_wrapper mode

				  TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy --bfloat16 \

				    --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv"

				  # Test some models in the cpp wrapper mode

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \

				    --bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"

				  python benchmarks/dynamo/check_accuracy.py \

				      --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv" \

				      --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				    --actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \

				    --expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"

				  python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \

				    --batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \

				@ -576,7 +621,8 @@ test_inductor_torchbench_smoketest_perf() {

				  # https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,

				  # and thus we lower its threshold to reduce flakiness. If this continues to be a problem,

				  # we switch to use some other model.

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9

				  # lowering threshold from 4.9 to 4.7 for cu124. Will bump it up after cuda 12.4.0->12.4.1 update

				  python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.7

				  # Check memory compression ratio for a few models

				  for test in hf_Albert timm_vision_transformer; do

				@ -680,7 +726,6 @@ test_aten() {

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libmkldnn* "$TEST_BASE_DIR"

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libnccl* "$TEST_BASE_DIR"

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libtorch* "$TEST_BASE_DIR"

				  ${SUDO} ln -sf "$TORCH_LIB_DIR"/libtbb* "$TEST_BASE_DIR"

				  ls "$TEST_BASE_DIR"

				  aten/tools/run_tests.sh "$TEST_BASE_DIR"

				@ -705,21 +750,6 @@ test_without_numpy() {

				  popd

				}

				# pytorch extensions require including torch/extension.h which includes all.h

				# which includes utils.h which includes Parallel.h.

				# So you can call for instance parallel_for() from your extension,

				# but the compilation will fail because of Parallel.h has only declarations

				# and definitions are conditionally included Parallel.h(see last lines of Parallel.h).

				# I tried to solve it #39612 and #39881 by including Config.h into Parallel.h

				# But if Pytorch is built with TBB it provides Config.h

				# that has AT_PARALLEL_NATIVE_TBB=1(see #3961 or #39881) and it means that if you include

				# torch/extension.h which transitively includes Parallel.h

				# which transitively includes tbb.h which is not available!

				if [[ "${BUILD_ENVIRONMENT}" == *tbb* ]]; then

				  sudo mkdir -p /usr/include/tbb

				  sudo cp -r "$PWD"/third_party/tbb/include/tbb/* /usr/include/tbb

				fi

				test_libtorch() {

				  local SHARD="$1"

				@ -733,7 +763,6 @@ test_libtorch() {

				    ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"

				    ln -sf "$TORCH_LIB_DIR"/libnvfuser* "$TORCH_BIN_DIR"

				    export CPP_TESTS_DIR="${TORCH_BIN_DIR}"

				@ -870,7 +899,6 @@ test_rpc() {

				  # test reporting process to function as expected.

				  ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"

				  ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"

				  ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"

				  CPP_TESTS_DIR="${TORCH_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_cpp_rpc

				}

				@ -1150,15 +1178,21 @@ test_executorch() {

				  pushd /executorch

				  # NB: We need to build ExecuTorch runner here and not inside the Docker image

				  # because it depends on PyTorch

				  export PYTHON_EXECUTABLE=python

				  export EXECUTORCH_BUILD_PYBIND=ON

				  export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"

				  # NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch

				  # from the PR

				  # shellcheck disable=SC1091

				  source .ci/scripts/utils.sh

				  build_executorch_runner "cmake"

				  source .ci/scripts/setup-linux.sh cmake

				  echo "Run ExecuTorch unit tests"

				  pytest -v -n auto

				  # shellcheck disable=SC1091

				  LLVM_PROFDATA=llvm-profdata-12 LLVM_COV=llvm-cov-12 bash test/run_oss_cpp_tests.sh

				  echo "Run ExecuTorch regression tests for some models"

				  # NB: This is a sample model, more can be added here

				  export PYTHON_EXECUTABLE=python

				  # TODO(huydhn): Add more coverage here using ExecuTorch's gather models script

				  # shellcheck disable=SC1091

				  source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''

				@ -1218,11 +1252,10 @@ elif [[ "$TEST_CONFIG" == distributed ]]; then

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    test_rpc

				  fi

				elif [[ "$TEST_CONFIG" == deploy ]]; then

				  checkout_install_torchdeploy

				  test_torch_deploy

				elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then

				  test_inductor_halide

				elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then

				  test_inductor_micro_benchmark

				elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then

				@ -1234,13 +1267,14 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then

				  id=$((SHARD_NUMBER-1))

				  test_dynamo_benchmark timm_models "$id"

				elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				  if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then

				  if [[ "${TEST_CONFIG}" == *cpu_inductor* || "${TEST_CONFIG}" == *cpu_aot_inductor* ]]; then

				    install_torchaudio cpu

				  else

				    install_torchaudio cuda

				  fi

				  install_torchtext

				  install_torchvision

				  TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git

				  id=$((SHARD_NUMBER-1))

				  # https://github.com/opencv/opencv-python/issues/885

				  pip_install opencv-python==4.8.0.74

				@ -1259,7 +1293,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				    checkout_install_torchbench

				    # Do this after checkout_install_torchbench to ensure we clobber any

				    # nightlies that torchbench may pull in

				    if [[ "${TEST_CONFIG}" != *cpu_inductor* ]]; then

				    if [[ "${TEST_CONFIG}" != *cpu_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then

				      install_torchrec_and_fbgemm

				    fi

				    PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"

				@ -1267,17 +1301,19 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then

				elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then

				  install_torchvision

				  test_inductor_cpp_wrapper_abi_compatible

				elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then

				elif [[ "${TEST_CONFIG}" == *inductor* ]]; then

				  install_torchvision

				  test_inductor

				  test_inductor_distributed

				elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then

				  install_torchvision

				  test_dynamo_shard 1

				  test_aten

				elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then

				  test_inductor_shard "${SHARD_NUMBER}"

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    test_inductor_aoti

				    test_inductor_distributed

				  fi

				elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then

				  install_torchvision

				  test_dynamo_shard "${SHARD_NUMBER}"

				  if [[ "${SHARD_NUMBER}" == 1 ]]; then

				    test_aten

				  fi

				elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then

				  install_torchvision

				  test_python_shard "$SHARD_NUMBER"
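In the `test_dynamo_benchmark` hunk above, the CPU inductor branch now derives its precision flag and the optional `--freezing` flag from substrings of `TEST_CONFIG`. A minimal sketch of that selection, assuming a hypothetical helper `select_cpu_inference_flags` and made-up config names, might look like this:

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a TEST_CONFIG string to the benchmark flags that the
# cpu_inductor / cpu_aot_inductor branch passes to test_single_dynamo_benchmark.
select_cpu_inference_flags() {
  local test_config="$1"
  local dt="float32"
  if [[ "$test_config" == *amp* ]]; then
    dt="amp"
  fi
  local flags=(--inference "--$dt")
  if [[ "$test_config" == *freezing* ]]; then
    flags+=(--freezing)
  fi
  echo "${flags[*]}"
}

# Config names below are illustrative only.
select_cpu_inference_flags "cpu_inductor_torchbench"               # --inference --float32
select_cpu_inference_flags "cpu_inductor_amp_freezing_torchbench"  # --inference --amp --freezing
```

Building the flags in an array, as sketched here, would avoid the duplicated `test_single_dynamo_benchmark` call in the two `freezing` branches. The diff itself keeps the two explicit calls.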

									
										1

.ci/pytorch/win-test-helpers/run_python_nn_smoketests.py
									
												
				@ -4,6 +4,7 @@ import os

				import subprocess

				import sys

				COMMON_TESTS = [

				    (

				        "Checking that torch is available",

									
										1

.circleci/codegen_validation/normalize_yaml_fragment.py
									
												
				@ -5,6 +5,7 @@ import sys

				import yaml

				# Need to import modules that lie on an upward-relative path

				sys.path.append(os.path.join(sys.path[0], ".."))

									
										32

.circleci/scripts/binary_linux_test.sh
									
												
				@ -46,13 +46,18 @@ if [[ "\$python_nodot" = *310* ]]; then

				  PROTOBUF_PACKAGE="protobuf>=3.19.0"

				fi

				if [[ "\$python_nodot" = *39*  ]]; then

				if [[ "\$python_nodot" = *39* ]]; then

				  # There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20

				  # we set a lower boundary here just to be safe

				  NUMPY_PIN=">=1.20"

				fi

				if [[ "\$python_nodot" = *38* ]]; then

				  # sympy 1.12.1 is the last version that supports Python 3.8

				  SYMPY_PIN="==1.12.1"

				else

				  SYMPY_PIN=">=1.13.0"

				fi

				# Move debug wheels out of the package dir so they don't get installed

				mkdir -p /tmp/debug_final_pkgs

				@ -83,7 +88,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then

				      "numpy\${NUMPY_PIN}" \

				      mkl>=2018 \

				      ninja \

				      sympy \

				      "sympy\${SYMPY_PIN}" \

				      typing-extensions \

				      ${PROTOBUF_PACKAGE}

				    if [[ "$DESIRED_CUDA" == 'cpu' ]]; then

				@ -97,8 +102,16 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then

				  )

				elif [[ "$PACKAGE_TYPE" != libtorch ]]; then

				  if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then

				    pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				    retry pip install -q numpy protobuf typing-extensions

				    if [[ "$USE_SPLIT_BUILD" == "true" ]]; then

				      pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"

				      pkg_torch="$(ls -1 /final_pkgs/torch-* | sort |tail -1)"

				      # todo: after folder is populated use the pypi_pkg channel instead

				      pip install "\$pkg_no_python" "\$pkg_torch" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}_pypi_pkg"

				      retry pip install -q numpy protobuf typing-extensions

				    else

				      pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"

				      retry pip install -q numpy protobuf typing-extensions

				    fi

				  else

				    pip install "\$pkg"

				    retry pip install -q numpy protobuf typing-extensions

				@ -110,9 +123,18 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then

				  cd /tmp/libtorch

				fi

				if [[ "$GPU_ARCH_TYPE" == xpu ]]; then

				  # Workaround for __mkl_tmp_MOD unbound variable issue, refer https://github.com/pytorch/pytorch/issues/130543

				  set +u

				  source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh

				fi

				# Test the package

				/builder/check_binary.sh

				# Clean temp files

				cd /builder && git clean -ffdx

				# =================== The above code will be executed inside Docker container ===================

				EOL

				echo
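The `SYMPY_PIN` change above pins sympy to 1.12.1 on Python 3.8 and requires 1.13.0 or newer elsewhere. The sketch below isolates that decision in a hypothetical function, `sympy_requirement`, which takes the dot-less Python version (the `python_nodot` value in the script) and prints the resulting requirement string.

```bash
#!/usr/bin/env bash
# Hypothetical helper reproducing the SYMPY_PIN selection from the diff.
sympy_requirement() {
  local python_nodot="$1"
  local pin
  if [[ "$python_nodot" == *38* ]]; then
    # sympy 1.12.1 is the last version that supports Python 3.8
    pin="==1.12.1"
  else
    pin=">=1.13.0"
  fi
  echo "sympy${pin}"
}

sympy_requirement 38    # sympy==1.12.1
sympy_requirement 311   # sympy>=1.13.0
```

When the result is passed to `conda install` or `pip install`, it needs to stay quoted (as the diff does with `"sympy\${SYMPY_PIN}"`), otherwise the shell treats `>` as a redirection.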

									
										51

.circleci/scripts/binary_populate_env.sh
									
												
				@ -33,9 +33,9 @@ if [[ -z "$DOCKER_IMAGE" ]]; then

				  if [[ "$PACKAGE_TYPE" == conda ]]; then

				    export DOCKER_IMAGE="pytorch/conda-cuda"

				  elif [[ "$DESIRED_CUDA" == cpu ]]; then

				    export DOCKER_IMAGE="pytorch/manylinux-cpu"

				    export DOCKER_IMAGE="pytorch/manylinux:cpu"

				  else

				    export DOCKER_IMAGE="pytorch/manylinux-cuda${DESIRED_CUDA:2}"

				    export DOCKER_IMAGE="pytorch/manylinux-builder:${DESIRED_CUDA:2}"

				  fi

				fi

				@ -75,9 +75,9 @@ export PYTORCH_BUILD_NUMBER=1

				TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)

				# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT

				TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				  # Only linux Python < 3.12 are supported wheels for triton

				  TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.12'"

				  # Triton wheels are only supported on Linux with Python < 3.13

				  TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				  if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				      TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)

				@ -87,11 +87,11 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* &&  -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:

				fi

				# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton rocm package

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" && "$DESIRED_PYTHON" != "3.12" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}"

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

				    if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				        TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)

				        TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"

				        TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"

				    fi

				    if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				        export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"

				@ -100,30 +100,18 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B

				    fi

				fi

				JAVA_HOME=

				BUILD_JNI=OFF

				if [[ "$PACKAGE_TYPE" == libtorch ]]; then

				  POSSIBLE_JAVA_HOMES=()

				  POSSIBLE_JAVA_HOMES+=(/usr/local)

				  POSSIBLE_JAVA_HOMES+=(/usr/lib/jvm/java-8-openjdk-amd64)

				  POSSIBLE_JAVA_HOMES+=(/Library/Java/JavaVirtualMachines/*.jdk/Contents/Home)

				  # Add the Windows-specific JNI path

				  POSSIBLE_JAVA_HOMES+=("$PWD/pytorch/.circleci/windows-jni/")

				  for JH in "${POSSIBLE_JAVA_HOMES[@]}" ; do

				    if [[ -e "$JH/include/jni.h" ]] ; then

				      # Skip if we're not on Windows but haven't found a JAVA_HOME

				      if [[ "$JH" == "$PWD/pytorch/.circleci/windows-jni/" && "$OSTYPE" != "msys" ]] ; then

				        break

				      fi

				      echo "Found jni.h under $JH"

				      JAVA_HOME="$JH"

				      BUILD_JNI=ON

				      break

				# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package

				if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then

				    TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"

				    if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then

				        TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)

				        TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}"

				    fi

				    if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then

				        export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"

				    else

				        export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"

				    fi

				  done

				  if [ -z "$JAVA_HOME" ]; then

				    echo "Did not find jni.h"

				  fi

				fi

				cat >"$envfile" <<EOL

				@ -136,6 +124,7 @@ export DESIRED_PYTHON="${DESIRED_PYTHON:-}"

				export DESIRED_CUDA="$DESIRED_CUDA"

				export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"

				export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"

				export USE_SPLIT_BUILD="${USE_SPLIT_BUILD:-}"

				if [[ "${OSTYPE}" == "msys" ]]; then

				  export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"

				  if [[ "${LIBTORCH_CONFIG:-}" == 'debug' ]]; then

				@ -159,8 +148,6 @@ export TORCH_CONDA_BUILD_FOLDER='pytorch-nightly'

				export ANACONDA_USER='pytorch'

				export USE_FBGEMM=1

				export JAVA_HOME=$JAVA_HOME

				export BUILD_JNI=$BUILD_JNI

				export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER"

				export DOCKER_IMAGE="$DOCKER_IMAGE"

									
										9

.circleci/scripts/binary_upload.sh
									
												
				@ -25,6 +25,15 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then

				  AWS_S3_CP="aws s3 cp"

				fi

				if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"

				fi

				# This is a special build with all dependencies packaged

				if [[ ${BUILD_NAME} == *-full* ]]; then

				  UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"

				fi

				# Sleep 5 minutes between retries for conda upload

				retry () {

				  "$@"  || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@")

									
										1

.circleci/scripts/trigger_azure_pipeline.py
									
												
				@ -8,6 +8,7 @@ import time

				import requests

				AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"

				AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")

				PIPELINE_ID = "911"

2

.clang-tidy


 @ -62,4 +62,6 @@ readability-string-compare,
 '
 HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
 WarningsAsErrors: '*'
 CheckOptions:
   misc-header-include-cycle.IgnoredFilesList: 'format.h;ivalue.h;custom_class.h;Dict.h;List.h'
 ...

2

.flake8


 @ -2,7 +2,7 @@
 # NOTE: **Mirror any changes** to this file to the [tool.ruff] config in pyproject.toml
 # before we can fully move to use ruff
 enable-extensions = G
 select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2,TOR9
 select = B,C,E,F,G,P,SIM1,SIM911,T4,W,B9,TOR0,TOR1,TOR2,TOR9
 max-line-length = 120
 # C408 ignored because we like the dict keyword argument syntax
 # E501 is not flexible enough, we're using B950 instead

4

.git-blame-ignore-revs


 @ -40,3 +40,7 @@ e6ec0efaf87703c5f889cfc20b29be455885d58d
 a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
 # 2024-01-02 clangformat: fused adam #116583
 dc68d1aa9e554d09344a10fff69f7b50b2d23a0
 # 2024-06-28 enable UFMT in `torch/storage.py`
 d80939e5e9337e8078f11489afefec59fd42f93b
 # 2024-06-28 enable UFMT in `torch.utils.data`
 cf0b90e49689d45be91aa539fdf54cf2ea8a9a3

									
										31

.github/actionlint.yaml
									
										vendored
									
												
				@ -1,9 +1,12 @@

				self-hosted-runner:

				  labels:

				    # GitHub hosted x86 Linux runners

				    - linux.20_04.4x

				    - linux.20_04.16x

				    - linux.large

				    # Repo-specific LF hosted ARC runners

				    - linux.large.arc

				    # Organization-wide AWS Linux Runners

				    - linux.large

				    - linux.2xlarge

				    - linux.4xlarge

				    - linux.12xlarge

				@ -13,18 +16,36 @@ self-hosted-runner:

				    - linux.8xlarge.nvidia.gpu

				    - linux.16xlarge.nvidia.gpu

				    - linux.g5.4xlarge.nvidia.gpu

				    # Organization-wide AWS Linux Runners on Linux Foundation account

				    - lf.linux.large

				    - lf.linux.2xlarge

				    - lf.linux.4xlarge

				    - lf.linux.12xlarge

				    - lf.linux.24xlarge

				    - lf.linux.arm64.2xlarge

				    - lf.linux.4xlarge.nvidia.gpu

				    - lf.linux.8xlarge.nvidia.gpu

				    - lf.linux.16xlarge.nvidia.gpu

				    - lf.linux.g5.4xlarge.nvidia.gpu

				    # Repo-specific IBM hosted S390x runner

				    - linux.s390x

				    # Organization wide AWS Windows runners

				    - windows.4xlarge.nonephemeral

				    - windows.8xlarge.nvidia.gpu

				    - windows.8xlarge.nvidia.gpu.nonephemeral

				    - windows.g5.4xlarge.nvidia.gpu

				    - bm-runner

				    # Organization-wide AMD hosted MI300 runners

				    - linux.rocm.gpu

				    # Repo-specific Apple hosted runners

				    - macos-m1-ultra

				    - macos-m2-14

				    # Org-wide AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)

				    - macos-m1-stable

				    - macos-m1-13

				    - macos-m1-14

				    - macos-12-xl

				    - macos-12

				    - macos12.3-m1

				    # GitHub-hosted MacOS runners

				    - macos-latest-xlarge

				    - macos-13-xlarge

				    - macos-14-xlarge

				    # Organization-wide Intel hosted XPU runners

				    - linux.idc.xpu

									
										6

.github/actions/diskspace-cleanup/action.yml
									
										vendored
									
												
				@ -14,12 +14,14 @@ runs:

				    - name: Cleans up diskspace

				      shell: bash

				      run: |

				        set -ex

				        diskspace_cutoff=${{ inputs.diskspace-cutoff }}

				        diskspace=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')

				        docker_root_dir=$(docker info -f '{{.DockerRootDir}}')

				        diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')

				        msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"

				        if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then

				            docker system prune -af

				            diskspace_new=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')

				            diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')

				            if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then

				                echo "Error: Available diskspace is less than $diskspace_cutoff percent. Not enough diskspace."

				                echo "$msg"

									
										3

.github/actions/filter-test-configs/action.yml
									
										vendored
									
												
				@ -66,7 +66,8 @@ runs:

				        command: |

				          set -eux

				          # PyYAML 6.0 doesn't work with MacOS x86 anymore

				          python3 -m pip install requests==2.26.0 pyyaml==6.0.1

				          # This must run on Python 3.7 (AmazonLinux2), so we can't use requests==2.32.2

				          python3 -m pip install requests==2.27.1 pyyaml==6.0.1

				    - name: Parse ref

				      id: parse-ref

									
										21

.github/actions/linux-build/action.yml
									
										vendored
									
												
				@ -52,6 +52,13 @@ inputs:

				    description: Hugging Face Hub token

				    required: false

				    default: ""

				  use_split_build:

				    description: |

				      [Experimental] Build a libtorch-only wheel, then build PyTorch such that

				      its wheels are built from the libtorch wheel.

				    required: false

				    type: boolean

				    default: false

				outputs:

				  docker-image:

				    value: ${{ steps.calculate-docker-image.outputs.docker-image }}

				@ -144,6 +151,7 @@ runs:

				        DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}

				        OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}

				        HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}

				        USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				      shell: bash

				      run: |

				        # detached container should get cleaned up by teardown_ec2_linux

				@ -163,6 +171,7 @@ runs:

				          -e PR_LABELS \

				          -e OUR_GITHUB_JOB_ID \

				          -e HUGGING_FACE_HUB_TOKEN \

				          -e USE_SPLIT_BUILD \

				          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

				          --security-opt seccomp=unconfined \

				          --cap-add=SYS_PTRACE \

				@ -183,7 +192,7 @@ runs:

				    - name: Store PyTorch Build Artifacts on S3

				      uses: seemethere/upload-artifact-s3@v5

				      if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'

				      if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build != 'true'

				      with:

				        name: ${{ inputs.build-environment }}

				        retention-days: 14

				@ -191,6 +200,16 @@ runs:

				        path: artifacts.zip

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Store PyTorch Build Artifacts on S3 for split build

				      uses: seemethere/upload-artifact-s3@v5

				      if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build == 'true'

				      with:

				        name: ${{ inputs.build-environment }}-experimental-split-build

				        retention-days: 14

				        if-no-files-found: error

				        path: artifacts.zip

				        s3-bucket: ${{ inputs.s3-bucket }}

				    - name: Upload sccache stats

				      if: steps.build.outcome != 'skipped'

				      uses: seemethere/upload-artifact-s3@v5

									
										11

.github/actions/test-pytorch-binary/action.yml
									
										vendored
									
												
				@ -26,6 +26,7 @@ runs:

				          -e PYTORCH_FINAL_PACKAGE_DIR \

				          -e PYTORCH_ROOT \

				          -e SKIP_ALL_TESTS \

				          -e USE_SPLIT_BUILD \

				          --tty \

				          --detach \

				          -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \

				@ -35,7 +36,8 @@ runs:

				          "${DOCKER_IMAGE}"

				        )

				        if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" ]]; then

				        echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV"

				        if [[ "${GPU_ARCH_TYPE}" != "rocm" && "${BUILD_ENVIRONMENT}" != "linux-aarch64-binary-manywheel" && "${BUILD_ENVIRONMENT}" != "linux-s390x-binary-manywheel" && "${GPU_ARCH_TYPE}" != "xpu" ]]; then

				          # Propagate the download.pytorch.org IP to the container. This is only needed on non-aarch64 Linux runners

				          grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" bash -c "/bin/cat >> /etc/hosts"

				        fi

				@ -46,10 +48,9 @@ runs:

				        docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash -x /run.sh"

				    - name: Cleanup docker

				      if: always() && env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel'

				      if: always() && (env.BUILD_ENVIRONMENT == 'linux-s390x-binary-manywheel' || env.GPU_ARCH_TYPE == 'xpu')

				      shell: bash

				      run: |

				        # on s390x stop the container for clean worker stop

				        # ignore expansion of "docker ps -q" since it could be empty

				        # on s390x or xpu stop the container for clean worker stop

				        # shellcheck disable=SC2046

				        docker stop $(docker ps -q) || true

				        docker stop "${{ env.CONTAINER_NAME }}" || true

2

.github/ci_commit_pins/audio.txt vendored


 @ -1 +1 @@
 f8af5bcd0bb2ce51965cf79d8d4c25dad8a0
 b2a0adc2ec03ab99990d7e8be3d4510438c148

2

.github/ci_commit_pins/torchbench.txt vendored


 @ -1 +1 @@
 d6015d42d9a1834bc7595c4bd6852562fb80b30b
 dbebd44a11eb84afbf53c3c071dd105297e

2

.github/ci_commit_pins/xla.txt vendored


 @ -1 +1 @@
 f0b61e5d782913a0fc7743812f2a8e522189111
 ea4535f0699f366adb554183a65ebf7dc34a8be

									
										281

.github/lf-canary-scale-config.yml
									
										vendored
									
										Normal file
									
												
				@ -0,0 +1,281 @@

				# Defines runner types that will be provisioned by LF Self-hosted

				# runners for pytorch/pytorch-canary and their labels.

				#

				# Runners listed here will be available as self hosted runners.

				# Configuration is directly pulled from the main branch.

				#

				# Default values:

				#

				# runner_types:

				#   runner_label: # label to specify in the GitHub Actions workflow

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				#     disk_size: 50

				#     is_ephemeral: true

				runner_types:

				  lf.c.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.c.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				  lf.c.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				  lf.c.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				  lf.c.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.c.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    os: linux

				  lf.c.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				  lf.c.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				  lf.c.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.c.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.c.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				  lf.c.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    os: linux

				  lf.c.linux.large:

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				  lf.c.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				  lf.c.linux.arm64.m7g.2xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.c.windows.4xlarge:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: true

				    max_available: 420

				    os: windows

				  lf.c.windows.4xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: false

				    max_available: 420

				    os: windows

				  lf.c.windows.8xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 150

				    os: windows

				  lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: windows

				  lf.c.windows.g5.4xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: windows

				  ### Setup runner types to test the Amazon Linux 2023 AMI

				  lf.c.amz2023.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.large:

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.c.amz2023.linux.arm64.m7g.2xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
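
The header comment of this file lists default values for fields that a runner type omits. As a rough illustration only, the sketch below shows how such defaults could be filled in for an entry like `lf.c.linux.large`, which sets no `max_available`. The `apply_defaults` helper and the merge behaviour are assumptions made for this example; the diff does not show how the scale-config consumer actually resolves defaults.

```python
# Hypothetical sketch: fill in the documented defaults for a runner type.
# The default values are copied from the header comment of the config file;
# the helper name and the merge strategy are assumptions for illustration.
DEFAULTS = {
    "instance_type": "m4.large",
    "os": "linux",
    "max_available": 20,
    "disk_size": 50,
    "is_ephemeral": True,
}


def apply_defaults(runner: dict) -> dict:
    """Return a new dict in which explicit settings override the defaults."""
    return {**DEFAULTS, **runner}


# Entry copied from the config above; it does not set max_available.
linux_large = {
    "disk_size": 15,
    "instance_type": "c5.large",
    "is_ephemeral": False,
    "os": "linux",
}

resolved = apply_defaults(linux_large)
print(resolved["max_available"])
```

Under this assumed merge, explicit keys such as `disk_size: 15` are kept and only the missing `max_available` falls back to the documented default.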

									
										281

.github/lf-scale-config.yml
									
										vendored
									
										Normal file
									
												
				@ -0,0 +1,281 @@

				# Defines runner types that will be provisioned by LF Self-hosted

				# runners for pytorch/pytorch and their labels.

				#

				# Runners listed here will be available as self hosted runners.

				# Configuration is directly pulled from the main branch.

				#

				# Default values:

				#

				# runner_types:

				#   runner_label: # label to specify in the GitHub Actions workflow

				#     instance_type: m4.large

				#     os: linux

				#     max_available: 20

				#     disk_size: 50

				#     is_ephemeral: true

				runner_types:

				  lf.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				  lf.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				  lf.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				  lf.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				  lf.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    os: linux

				  lf.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				  lf.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				  lf.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				  lf.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				  lf.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    os: linux

				  lf.linux.large:

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				  lf.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				  lf.linux.arm64.m7g.2xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				  lf.windows.4xlarge:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: true

				    max_available: 420

				    os: windows

				  lf.windows.4xlarge.nonephemeral:

				    disk_size: 256

				    instance_type: c5d.4xlarge

				    is_ephemeral: false

				    max_available: 420

				    os: windows

				  lf.windows.8xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: true

				    max_available: 150

				    os: windows

				  lf.windows.8xlarge.nvidia.gpu.nonephemeral:

				    disk_size: 256

				    instance_type: p3.2xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: windows

				  lf.windows.g5.4xlarge.nvidia.gpu:

				    disk_size: 256

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: windows

				  ### Setup runner types to test the Amazon Linux 2023 AMI

				  lf.amz2023.linux.12xlarge:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.24xl.spr-metal:

				    disk_size: 200

				    instance_type: c7i.metal-24xl

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.16xlarge.spr:

				    disk_size: 200

				    instance_type: c7i.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.12xlarge.ephemeral:

				    disk_size: 200

				    instance_type: c5.12xlarge

				    is_ephemeral: true

				    max_available: 300

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.16xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.16xlarge

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.24xlarge:

				    disk_size: 150

				    instance_type: c5.24xlarge

				    is_ephemeral: false

				    max_available: 250

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.2xlarge:

				    disk_size: 150

				    instance_type: c5.2xlarge

				    is_ephemeral: false

				    max_available: 3120

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.4xlarge:

				    disk_size: 150

				    instance_type: c5.4xlarge

				    is_ephemeral: false

				    max_available: 1000

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.4xlarge

				    is_ephemeral: false

				    max_available: 520

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.8xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g3.8xlarge

				    is_ephemeral: false

				    max_available: 400

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g4dn.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.12xlarge

				    is_ephemeral: false

				    max_available: 50

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g4dn.metal.nvidia.gpu:

				    disk_size: 150

				    instance_type: g4dn.metal

				    is_ephemeral: false

				    max_available: 30

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g5.48xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.48xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g5.12xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.12xlarge

				    is_ephemeral: false

				    max_available: 150

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.g5.4xlarge.nvidia.gpu:

				    disk_size: 150

				    instance_type: g5.4xlarge

				    is_ephemeral: false

				    max_available: 1200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.large:

				    disk_size: 15

				    instance_type: c5.large

				    is_ephemeral: false

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.arm64.2xlarge:

				    disk_size: 256

				    instance_type: t4g.2xlarge

				    is_ephemeral: false

				    max_available: 200

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

				  lf.amz2023.linux.arm64.m7g.2xlarge:

				    disk_size: 256

				    instance_type: m7g.2xlarge

				    is_ephemeral: false

				    max_available: 20

				    os: linux

				    ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
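
In this diff, the canary config and this config appear to define the same runner types, differing only in the `lf.c.` versus `lf.` label prefix. Assuming that mirroring is intended (the diff does not say so), a minimal consistency check could look like the sketch below. The function names are hypothetical and not part of the repository.

```python
# Hypothetical sketch: check that every canary label has a production twin.
# Assumes canary labels use the "lf.c." prefix and production labels "lf.".
CANARY_PREFIX = "lf.c."
PROD_PREFIX = "lf."


def canary_to_prod(label: str) -> str:
    """Map a canary label such as 'lf.c.linux.large' to 'lf.linux.large'."""
    if not label.startswith(CANARY_PREFIX):
        raise ValueError(f"not a canary label: {label}")
    return PROD_PREFIX + label[len(CANARY_PREFIX):]


def missing_in_prod(canary_labels, prod_labels):
    """Return canary labels whose production counterpart is not defined."""
    prod = set(prod_labels)
    return sorted(
        label for label in canary_labels if canary_to_prod(label) not in prod
    )


# Small subsets of the labels shown in the two config files.
canary = ["lf.c.linux.large", "lf.c.linux.2xlarge"]
prod = ["lf.linux.large", "lf.linux.2xlarge"]
print(missing_in_prod(canary, prod))
```

An empty result means each canary label in the sample has a matching production label.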

									
										34

.github/merge_rules.yaml
									
										vendored
									
												
				@ -27,11 +27,9 @@

				  - third_party/onnx

				  - caffe2/python/onnx/**

				  approved_by:

				  - BowenBao

				  - justinchuby

				  - liqunfu

				  - shubhambhokare1

				  - thiagocrepaldi

				  - titaiwangms

				  - wschin

				  - xadupre

				@ -244,7 +242,9 @@

				  - torch/csrc/xpu/**

				  - torch/xpu/**

				  - test/xpu/**

				  - test/test_xpu.py

				  - third_party/xpu.txt

				  - .ci/docker/ci_commit_pins/triton-xpu.txt

				  approved_by:

				  - EikanWang

				  - jgong5

				@ -286,6 +286,7 @@

				  - test/cpp/dist_autograd/**

				  - test/cpp/rpc/**

				  approved_by:

				  - wconstab

				  - mrshenli

				  - pritamdamania87

				  - zhaojuanmao

				@ -312,6 +313,25 @@

				  - Lint

				  - pull

				- name: DCP

				  patterns:

				  - torch/distributed/checkpoint/**

				  approved_by:

				  - LucasLLC

				  - fegin

				  - wz337

				  - saumishr

				  - daulet-askarov

				  - pradeepdfb

				  - kirtiteja

				  - mhorowitz

				  - saiteja64

				  mandatory_checks_name:

				  - EasyCLA

				  - Lint

				  - pull

				- name: IDEEP

				  patterns:

				  - third_party/ideep

				@ -375,13 +395,21 @@

				- name: CPU inductor

				  patterns:

				  - torch/_inductor/mkldnn_ir.py

				  - torch/_inductor/mkldnn_lowerings.py

				  - torch/_inductor/fx_passes/mkldnn_fusion.py

				  - torch/_inductor/fx_passes/quantization.py

				  - torch/_inductor/codegen/cpp_prefix.h

				  - torch/_inductor/codegen/cpp.py

				  - torch/_inductor/codegen/cpp_utils.py

				  - torch/_inductor/codegen/cpp_micro_gemm.py

				  - torch/_inductor/codegen/cpp_template_kernel.py

				  - torch/_inductor/codegen/cpp_template.py

				  - torch/_inductor/codegen/cpp_gemm_template.py

				  - test/inductor/test_mkldnn_pattern_matcher.py

				  - test/inductor/test_cpu_repo.py

				  - test/inductor/test_cpu_repro.py

				  - test/inductor/test_cpu_cpp_wrapper.py

				  - test/inductor/test_cpu_select_algorithm.py

				  - aten/src/ATen/cpu/**

				  - aten/src/ATen/native/quantized/cpu/**

				  - test/quantization/core/test_quantized_op.py

									
										3

.github/pytorch-probot.yml
									
										vendored
									
												
				@ -8,6 +8,7 @@ ciflow_push_tags:

				- ciflow/inductor

				- ciflow/inductor-perf-compare

				- ciflow/inductor-micro-benchmark

				- ciflow/inductor-cu124

				- ciflow/linux-aarch64

				- ciflow/mps

				- ciflow/nightly

				@ -19,10 +20,10 @@ ciflow_push_tags:

				- ciflow/xpu

				- ciflow/torchbench

				retryable_workflows:

				- lint

				- pull

				- trunk

				- linux-binary

				- windows-binary

				labeler_config: labeler.yml

				label_to_label_config: label_to_label.yml

				mergebot: True

2

.github/requirements-gha-cache.txt

 @ -10,6 +10,6 @@ lintrunner==0.10.7
 ninja==1.10.0.post1
 nvidia-ml-py==11.525.84
 pyyaml==6.0
 requests==2.31.0
 requests==2.32.2
 rich==10.9.0
 rockset==1.0.3

3

.github/requirements/conda-env-Linux-X64.txt

 @ -4,6 +4,5 @@ mkl-include=2022.1.0
 ninja=1.10.2
 numpy=1.23.3
 pyyaml=6.0
 requests=2.31.0
 setuptools=68.2.2
 typing-extensions=4.3.0
 typing-extensions=4.9.0

3

.github/requirements/conda-env-iOS.txt

 @ -3,6 +3,5 @@ cmake=3.22.1
 ninja=1.10.2
 numpy=1.23.3
 pyyaml=6.0
 requests=2.31.0
 setuptools=68.2.2
 typing-extensions=4.3.0
 typing-extensions=4.9.0

2

.github/requirements/conda-env-macOS-ARM64

 @ -2,7 +2,7 @@ numpy=1.22.3
 pyyaml=6.0
 setuptools=61.2.0
 cmake=3.22.*
 typing-extensions=4.3.0
 typing-extensions=4.9.0
 dataclasses=0.8
 pip=22.2.2
 pillow=10.0.1

2

.github/requirements/conda-env-macOS-X64

 @ -4,7 +4,7 @@ numpy=1.21.2
 pyyaml=5.3
 setuptools=46.0.0
 cmake=3.22.*
 typing-extensions=4.3.0
 typing-extensions=4.9.0
 dataclasses=0.8
 pip=22.2.2
 pillow=10.0.1

2

.github/requirements/pip-requirements-iOS.txt

 @ -1,4 +1,4 @@
 # iOS simulator requirements
 coremltools==5.0b5
 protobuf==3.20.2
 optree==0.11.0
 optree==0.12.1

6

.github/requirements/pip-requirements-macOS.txt

 @ -17,16 +17,16 @@ pytest-xdist==3.3.1
 pytest-rerunfailures==10.3
 pytest-flakefinder==1.1.0
 scipy==1.10.1
 sympy==1.11.1
 sympy==1.12.1 ; python_version == "3.8"
 sympy>=1.13.0 ; python_version >= "3.9"
 unittest-xml-reporting<=3.2.0,>=2.0.0
 xdoctest==1.1.0
 filelock==3.6.0
 sympy==1.11.1
 pytest-cpp==2.3.0
 rockset==1.0.3
 z3-solver==4.12.2.0
 tensorboard==2.13.0
 optree==0.11.0
 optree==0.12.1
 # NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
 # which the stringify metadata is wrong when escaping double quote
 protobuf==3.20.2

									
										2

.github/scripts/amd/package_triton_wheel.sh
									
				@ -93,6 +93,8 @@ done

				# Copy Include Files

				cp -r $ROCM_HOME/include/hip $TRITON_ROCM_DIR/include

				cp -r $ROCM_HOME/include/roctracer $TRITON_ROCM_DIR/include

				cp -r $ROCM_HOME/include/hsa $TRITON_ROCM_DIR/include

				# Copy linker

				mkdir -p $TRITON_ROCM_DIR/llvm/bin

									
										31

.github/scripts/build_triton_wheel.py
									
				@ -1,4 +1,5 @@

				#!/usr/bin/env python3

				import os

				import shutil

				import sys

				@ -7,12 +8,17 @@ from subprocess import check_call

				from tempfile import TemporaryDirectory

				from typing import Optional

				SCRIPT_DIR = Path(__file__).parent

				REPO_DIR = SCRIPT_DIR.parent.parent

				def read_triton_pin(rocm_hash: bool = False) -> str:

				    triton_file = "triton.txt" if not rocm_hash else "triton-rocm.txt"

				def read_triton_pin(device: str = "cuda") -> str:

				    triton_file = "triton.txt"

				    if device == "rocm":

				        triton_file = "triton-rocm.txt"

				    elif device == "xpu":

				        triton_file = "triton-xpu.txt"

				    with open(REPO_DIR / ".ci" / "docker" / "ci_commit_pins" / triton_file) as f:

				        return f.read().strip()
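
The device-based selection in the new `read_triton_pin` can also be read as a table lookup. The following is a minimal sketch of that selection only, not the script's implementation: `PIN_FILES` and `pin_file_for` are hypothetical names, and any device without a dedicated entry falls back to `triton.txt`, as in the diff.

```python
# Hypothetical helper mirroring the device -> pin-file selection in the diff.
# Only "rocm" and "xpu" have dedicated pin files; everything else uses triton.txt.
PIN_FILES = {"rocm": "triton-rocm.txt", "xpu": "triton-xpu.txt"}


def pin_file_for(device: str = "cuda") -> str:
    """Return the pin file name that would be read for the given device."""
    return PIN_FILES.get(device, "triton.txt")


if __name__ == "__main__":
    for device in ("cuda", "rocm", "xpu"):
        print(device, "->", pin_file_for(device))
```

Compared with the old `rocm_hash: bool` parameter, a string-keyed lookup like this extends to further backends without changing the function signature.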

				@ -49,7 +55,7 @@ def build_triton(

				    version: str,

				    commit_hash: str,

				    build_conda: bool = False,

				    build_rocm: bool = False,

				    device: str = "cuda",

				    py_version: Optional[str] = None,

				    release: bool = False,

				) -> Path:

				@ -69,11 +75,14 @@ def build_triton(

				        triton_basedir = Path(tmpdir) / "triton"

				        triton_pythondir = triton_basedir / "python"

				        triton_repo = "https://github.com/openai/triton"

				        if build_rocm:

				        if device == "rocm":

				            triton_pkg_name = "pytorch-triton-rocm"

				        elif device == "xpu":

				            triton_pkg_name = "pytorch-triton-xpu"

				            triton_repo = "https://github.com/intel/intel-xpu-backend-for-triton"

				        else:

				            triton_pkg_name = "pytorch-triton"

				        check_call(["git", "clone", triton_repo], cwd=tmpdir)

				        check_call(["git", "clone", triton_repo, "triton"], cwd=tmpdir)

				        if release:

				            ver, rev, patch = version.split(".")

				            check_call(

				@ -140,7 +149,7 @@ def build_triton(

				            expected_version=None,

				        )

				        if build_rocm:

				        if device == "rocm":

				            check_call(

				                [f"{SCRIPT_DIR}/amd/package_triton_wheel.sh"],

				                cwd=triton_basedir,

				@ -155,7 +164,7 @@ def build_triton(

				        whl_path = next(iter((triton_pythondir / "dist").glob("*.whl")))

				        shutil.copy(whl_path, Path.cwd())

				        if build_rocm:

				        if device == "rocm":

				            check_call(

				                [f"{SCRIPT_DIR}/amd/patch_triton_wheel.sh", Path.cwd()],

				                cwd=triton_basedir,

				@ -170,17 +179,19 @@ def main() -> None:

				    parser = ArgumentParser("Build Triton binaries")

				    parser.add_argument("--release", action="store_true")

				    parser.add_argument("--build-conda", action="store_true")

				    parser.add_argument("--build-rocm", action="store_true")

				    parser.add_argument(

				        "--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu"]

				    )

				    parser.add_argument("--py-version", type=str)

				    parser.add_argument("--commit-hash", type=str)

				    parser.add_argument("--triton-version", type=str, default=read_triton_version())

				    args = parser.parse_args()

				    build_triton(

				        build_rocm=args.build_rocm,

				        device=args.device,

				        commit_hash=args.commit_hash

				        if args.commit_hash

				        else read_triton_pin(args.build_rocm),

				        else read_triton_pin(args.device),

				        version=args.triton_version,

				        build_conda=args.build_conda,

				        py_version=args.py_version,
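
The switch from the boolean `--build-rocm` flag to `--device` means argparse itself rejects unknown backends. A minimal sketch of just that option, assuming the same choices as in the diff (`make_parser` is a hypothetical wrapper, not part of the script):

```python
from argparse import ArgumentParser


def make_parser() -> ArgumentParser:
    """Build a parser with only the --device option from the diff."""
    parser = ArgumentParser("Build Triton binaries")
    parser.add_argument(
        "--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu"]
    )
    return parser


if __name__ == "__main__":
    # With no arguments the default backend is used.
    print(make_parser().parse_args([]).device)
    # An explicit backend must be one of the declared choices.
    print(make_parser().parse_args(["--device", "xpu"]).device)
```

Passing a value outside `choices` makes argparse print a usage error and exit, so `build_triton` never sees an unsupported device string from the command line.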

									
										1

.github/scripts/check_labels.py
									
				@ -5,7 +5,6 @@ import sys

				from typing import Any

				from github_utils import gh_delete_comment, gh_post_pr_comment

				from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo

				from label_utils import has_required_labels, is_label_err_comment, LABEL_ERR_MSG

				from trymerge import GitHubPR

									
										116

.github/scripts/cherry_pick.py
									
				@ -3,12 +3,10 @@

				import json

				import os

				import re

				from typing import Any, Optional

				from typing import Any, cast, Dict, List, Optional

				from urllib.error import HTTPError

				from github_utils import gh_fetch_url, gh_post_pr_comment

				from github_utils import gh_fetch_url, gh_post_pr_comment, gh_query_issues_by_labels

				from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo

				from trymerge import get_pr_commit_sha, GitHubPR

				@ -19,6 +17,7 @@ REQUIRES_ISSUE = {

				    "critical",

				    "fixnewfeature",

				}

				RELEASE_BRANCH_REGEX = re.compile(r"release/(?P<version>.+)")

				def parse_args() -> Any:

				@ -58,6 +57,33 @@ def get_merge_commit_sha(repo: GitRepo, pr: GitHubPR) -> Optional[str]:

				    return commit_sha if pr.is_closed() else None

				def get_release_version(onto_branch: str) -> Optional[str]:

				    """

				    Return the release version if the target branch is a release branch

				    """

				    m = re.match(RELEASE_BRANCH_REGEX, onto_branch)

				    return m.group("version") if m else ""

				def get_tracker_issues(

				    org: str, project: str, onto_branch: str

				) -> List[Dict[str, Any]]:

				    """

				    Find the tracker issue from the repo. The tracker issue needs to have the title

				    like [VERSION] Release Tracker following the convention on PyTorch

				    """

				    version = get_release_version(onto_branch)

				    if not version:

				        return []

				    tracker_issues = gh_query_issues_by_labels(org, project, labels=["release tracker"])

				    if not tracker_issues:

				        return []

				    # Figure out the tracker issue from the list by looking at the title

				    return [issue for issue in tracker_issues if version in issue.get("title", "")]
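
How the release-branch regex and the title filter interact can be tried without any GitHub access. This is a sketch under stated assumptions: the issue dictionaries are made-up sample data, and `release_version` and `pick_trackers` are hypothetical stand-ins for `get_release_version` and the filtering step of `get_tracker_issues` (the real function fetches issues through `gh_query_issues_by_labels`).

```python
import re
from typing import Any, Dict, List

RELEASE_BRANCH_REGEX = re.compile(r"release/(?P<version>.+)")


def release_version(onto_branch: str) -> str:
    """Return the version part of a release branch name, or "" otherwise."""
    m = RELEASE_BRANCH_REGEX.match(onto_branch)
    return m.group("version") if m else ""


def pick_trackers(
    issues: List[Dict[str, Any]], onto_branch: str
) -> List[Dict[str, Any]]:
    """Keep only issues whose title mentions the release version."""
    version = release_version(onto_branch)
    if not version:
        return []
    return [issue for issue in issues if version in issue.get("title", "")]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        {"number": 1, "title": "[v2.4.0] Release Tracker"},
        {"number": 2, "title": "[v2.3.0] Release Tracker"},
    ]
    print(pick_trackers(sample, "release/2.4"))
    print(pick_trackers(sample, "main"))
```

Because the filter is a plain substring test, a branch such as `release/2.4` would also select a tracker titled for a later patch release of the same minor version.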

				def cherry_pick(

				    github_actor: str,

				    repo: GitRepo,

				@ -77,17 +103,49 @@ def cherry_pick(

				    )

				    try:

				        org, project = repo.gh_owner_and_name()

				        cherry_pick_pr = ""

				        if not dry_run:

				            org, project = repo.gh_owner_and_name()

				            cherry_pick_pr = submit_pr(repo, pr, cherry_pick_branch, onto_branch)

				            msg = f"The cherry pick PR is at {cherry_pick_pr}"

				            if fixes:

				                msg += f" and it is linked with issue {fixes}"

				            elif classification in REQUIRES_ISSUE:

				                msg += f" and it is recommended to link a {classification} cherry pick PR with an issue"

				        tracker_issues_comments = []

				        tracker_issues = get_tracker_issues(org, project, onto_branch)

				        for issue in tracker_issues:

				            issue_number = int(str(issue.get("number", "0")))

				            if not issue_number:

				                continue

				            post_comment(org, project, pr.pr_num, msg)

				            res = cast(

				                Dict[str, Any],

				                post_tracker_issue_comment(

				                    org,

				                    project,

				                    issue_number,

				                    pr.pr_num,

				                    cherry_pick_pr,

				                    classification,

				                    fixes,

				                    dry_run,

				                ),

				            )

				            comment_url = res.get("html_url", "")

				            if comment_url:

				                tracker_issues_comments.append(comment_url)

				        msg = f"The cherry pick PR is at {cherry_pick_pr}"

				        if fixes:

				            msg += f" and it is linked with issue {fixes}."

				        elif classification in REQUIRES_ISSUE:

				            msg += f" and it is recommended to link a {classification} cherry pick PR with an issue."

				        if tracker_issues_comments:

				            msg += " The following tracker issues are updated:\n"

				            for tracker_issues_comment in tracker_issues_comments:

				                msg += f"* {tracker_issues_comment}\n"

				        post_pr_comment(org, project, pr.pr_num, msg, dry_run)

				    finally:

				        if current_branch:

				@ -159,7 +217,9 @@ def submit_pr(

				        raise RuntimeError(msg) from error

				def post_comment(org: str, project: str, pr_num: int, msg: str) -> None:

				def post_pr_comment(

				    org: str, project: str, pr_num: int, msg: str, dry_run: bool = False

				) -> List[Dict[str, Any]]:

				    """

				    Post a comment on the PR itself to point to the cherry picking PR when success

				    or print the error when failure

				@ -182,7 +242,35 @@ def post_comment(org: str, project: str, pr_num: int, msg: str) -> None:

				    comment = "\n".join(

				        (f"### Cherry picking #{pr_num}", f"{msg}", "", f"{internal_debugging}")

				    )

				    gh_post_pr_comment(org, project, pr_num, comment)

				    return gh_post_pr_comment(org, project, pr_num, comment, dry_run)

				def post_tracker_issue_comment(

				    org: str,

				    project: str,

				    issue_num: int,

				    pr_num: int,

				    cherry_pick_pr: str,

				    classification: str,

				    fixes: str,

				    dry_run: bool = False,

				) -> List[Dict[str, Any]]:

				    """

				    Post a comment on the tracker issue (if any) to record the cherry pick

				    """

				    comment = "\n".join(

				        (

				            "Link to landed trunk PR (if applicable):",

				            f"* https://github.com/{org}/{project}/pull/{pr_num}",

				            "",

				            "Link to release branch PR:",

				            f"* {cherry_pick_pr}",

				            "",

				            "Criteria Category:",

				            " - ".join((classification.capitalize(), fixes.capitalize())),

				        )

				    )

				    return gh_post_pr_comment(org, project, issue_num, comment, dry_run)

				def main() -> None:

				@ -214,7 +302,7 @@ def main() -> None:

				    except RuntimeError as error:

				        if not args.dry_run:

				            post_comment(org, project, pr_num, str(error))

				            post_pr_comment(org, project, pr_num, str(error))

				        else:

				            raise error

									
										1

.github/scripts/close_nonexistent_disable_issues.py
									
				@ -10,6 +10,7 @@ import requests

				import rockset  # type: ignore[import]

				from gitutils import retries_decorator

				LOGS_QUERY = """

				with

				    shas as (

									
										2

.github/scripts/collect_ciflow_labels.py
									
				@ -1,10 +1,12 @@

				#!/usr/bin/env python3

				import sys

				from pathlib import Path

				from typing import Any, cast, Dict, List, Set

				import yaml

				GITHUB_DIR = Path(__file__).parent.parent

									
										1

.github/scripts/convert_lintrunner_annotations_to_github.py
									
				@ -1,7 +1,6 @@

				import json

				import subprocess

				import sys

				from enum import Enum

				from pathlib import Path

				from typing import NamedTuple, Optional

									
										59

.github/scripts/delete_old_branches.py
									
				@ -2,12 +2,14 @@

				import os

				import re

				from datetime import datetime

				from functools import lru_cache

				from pathlib import Path

				from typing import Any, Callable, Dict, List, Set

				from github_utils import gh_fetch_json_dict, gh_graphql

				from gitutils import GitRepo

				SEC_IN_DAY = 24 * 60 * 60

				CLOSED_PR_RETENTION = 30 * SEC_IN_DAY

				NO_PR_RETENTION = 1.5 * 365 * SEC_IN_DAY

				@ -187,6 +189,17 @@ def get_recent_prs() -> Dict[str, Any]:

				    return prs_by_branch_base

				@lru_cache(maxsize=1)

				def get_open_prs() -> List[Dict[str, Any]]:

				    return paginate_graphql(

				        GRAPHQL_OPEN_PRS,

				        {"owner": "pytorch", "repo": "pytorch"},

				        lambda data: False,

				        lambda res: res["data"]["repository"]["pullRequests"]["nodes"],

				        lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],

				    )

				def get_branches_with_magic_label_or_open_pr() -> Set[str]:

				    pr_infos: List[Dict[str, Any]] = paginate_graphql(

				        GRAPHQL_NO_DELETE_BRANCH_LABEL,

				@ -196,15 +209,7 @@ def get_branches_with_magic_label_or_open_pr() -> Set[str]:

				        lambda res: res["data"]["repository"]["label"]["pullRequests"]["pageInfo"],

				    )

				    pr_infos.extend(

				        paginate_graphql(

				            GRAPHQL_OPEN_PRS,

				            {"owner": "pytorch", "repo": "pytorch"},

				            lambda data: False,

				            lambda res: res["data"]["repository"]["pullRequests"]["nodes"],

				            lambda res: res["data"]["repository"]["pullRequests"]["pageInfo"],

				        )

				    )

				    pr_infos.extend(get_open_prs())

				    # Get the most recent PR for each branch base (group gh together)

				    branch_bases = set()

				@ -270,5 +275,41 @@ def delete_branches() -> None:

				        delete_branch(git_repo, branch)

				def delete_old_ciflow_tags() -> None:

				    # Deletes ciflow tags if they are associated with a closed PR or a specific

				    # commit.  Lightweight tags don't have information about the date they were

				    # created, so we can't check how old they are.  The script just assumes that

				    # ciflow tags should be deleted regardless of creation date.

				    git_repo = GitRepo(str(REPO_ROOT), "origin", debug=True)

				    def delete_tag(tag: str) -> None:

				        print(f"Deleting tag {tag}")

				        ESTIMATED_TOKENS[0] += 1

				        delete_branch(git_repo, f"refs/tags/{tag}")

				    tags = git_repo._run_git("tag").splitlines()

				    open_pr_numbers = [x["number"] for x in get_open_prs()]

				    for tag in tags:

				        try:

				            if ESTIMATED_TOKENS[0] > 400:

				                print("Estimated tokens exceeded, exiting")

				                break

				            if not tag.startswith("ciflow/"):

				                continue

				            re_match_pr = re.match(r"^ciflow\/.*\/(\d{5,6})$", tag)

				            re_match_sha = re.match(r"^ciflow\/.*\/([0-9a-f]{40})$", tag)

				            if re_match_pr:

				                pr_number = int(re_match_pr.group(1))

				                if pr_number in open_pr_numbers:

				                    continue

				                delete_tag(tag)

				            elif re_match_sha:

				                delete_tag(tag)

				        except Exception as e:

				            print(f"Failed to check tag {tag}: {e}")

				if __name__ == "__main__":

				    delete_branches()

				    delete_old_ciflow_tags()
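
The tag-selection rule in `delete_old_ciflow_tags` can be isolated from the git and GitHub calls. The sketch below reuses the two regular expressions from the diff; `should_delete` is a hypothetical helper, and the token budget check (`ESTIMATED_TOKENS`) is deliberately left out.

```python
import re
from typing import Set

# Same patterns as in the diff: a trailing 5-6 digit PR number or a 40-char SHA.
PR_TAG = re.compile(r"^ciflow\/.*\/(\d{5,6})$")
SHA_TAG = re.compile(r"^ciflow\/.*\/([0-9a-f]{40})$")


def should_delete(tag: str, open_prs: Set[int]) -> bool:
    """Decide whether a tag would be deleted under the rule shown in the diff."""
    if not tag.startswith("ciflow/"):
        return False
    pr_match = PR_TAG.match(tag)
    if pr_match:
        # PR-based tags survive only while their PR is still open.
        return int(pr_match.group(1)) not in open_prs
    # SHA-based tags are always deleted; anything else is left alone.
    return SHA_TAG.match(tag) is not None


if __name__ == "__main__":
    open_prs = {131320}
    for tag in ("ciflow/trunk/131320", "ciflow/trunk/131259", "v2.4.0"):
        print(tag, should_delete(tag, open_prs))
```

Tags under `ciflow/` that end in neither a 5-6 digit number nor a full SHA match neither pattern and are kept.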

									
										52

.github/scripts/docathon-label-sync.py
									
										Normal file
				@ -0,0 +1,52 @@

				import os

				import re

				import sys

				from github import Github

				def main() -> None:

				    token = os.environ.get("GITHUB_TOKEN")

				    repo_owner = "pytorch"

				    repo_name = "pytorch"

				    pull_request_number = int(sys.argv[1])

				    g = Github(token)

				    repo = g.get_repo(f"{repo_owner}/{repo_name}")

				    pull_request = repo.get_pull(pull_request_number)

				    pull_request_body = pull_request.body

				    # PR without description

				    if pull_request_body is None:

				        return

				    # get issue number from the PR body

				    if not re.search(r"#\d{1,6}", pull_request_body):

				        print("The pull request does not mention an issue.")

				        return

				    issue_number = int(re.findall(r"#(\d{1,6})", pull_request_body)[0])

				    issue = repo.get_issue(issue_number)

				    issue_labels = issue.labels

				    docathon_label_present = any(

				        label.name == "docathon-h1-2024" for label in issue_labels

				    )

				    # if the issue has a docathon label, add all labels from the issue to the PR.

				    if not docathon_label_present:

				        print("The 'docathon-h1-2024' label is not present in the issue.")

				        return

				    pull_request_labels = pull_request.get_labels()

				    pull_request_label_names = [label.name for label in pull_request_labels]

				    issue_label_names = [label.name for label in issue_labels]

				    labels_to_add = [

				        label for label in issue_label_names if label not in pull_request_label_names

				    ]

				    if not labels_to_add:

				        print("The pull request already has the same labels.")

				        return

				    pull_request.add_to_labels(*labels_to_add)

				    print("Labels added to the pull request!")

				if __name__ == "__main__":

				    main()
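
The issue lookup in this script depends on the first `#NNN` reference in the PR body. A minimal sketch of that extraction step, with no PyGithub calls involved (`first_issue_number` is a hypothetical helper name):

```python
import re
from typing import Optional


def first_issue_number(body: Optional[str]) -> Optional[int]:
    """Return the first '#<digits>' reference in a PR body, or None."""
    if body is None:
        return None
    found = re.findall(r"#(\d{1,6})", body)
    return int(found[0]) if found else None


if __name__ == "__main__":
    print(first_issue_number("Fixes #12345 and relates to #67"))
    print(first_issue_number("No issue mentioned here"))
    print(first_issue_number(None))
```

Only the first reference is used, so a PR body that mentions an unrelated issue before the docathon issue would resolve to the unrelated one.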

BIN
.github/scripts/drci_mocks.json.gz

Binary file not shown.

									
										1

.github/scripts/ensure_actions_will_cancel.py
									
				@ -1,7 +1,6 @@

				#!/usr/bin/env python3

				import sys

				from pathlib import Path

				import yaml

									
										1

.github/scripts/export_pytorch_labels.py
									
				@ -14,7 +14,6 @@ import json

				from typing import Any

				import boto3  # type: ignore[import]

				from label_utils import gh_get_labels

									
										1

.github/scripts/filter_test_configs.py
									
				@ -15,6 +15,7 @@ from urllib.request import Request, urlopen

				import yaml

				REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://github.com/pytorch/pytorch/issues/)([0-9]+)"

				PREFIX = "test-config/"
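
What `REENABLE_TEST_REGEX` accepts is easier to see with a quick experiment. The sketch below only exercises the pattern itself; `reenabled_issues` is a hypothetical helper, and how the script consumes the matches is not shown in this hunk.

```python
import re
from typing import List

REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://github.com/pytorch/pytorch/issues/)([0-9]+)"


def reenabled_issues(text: str) -> List[str]:
    """Return issue numbers referenced by a closing keyword in the text."""
    # Group 6 is the trailing ([0-9]+) capture holding the issue number.
    return [m.group(6) for m in re.finditer(REENABLE_TEST_REGEX, text)]


if __name__ == "__main__":
    body = "Fixes #123 and closes https://github.com/pytorch/pytorch/issues/456"
    print(reenabled_issues(body))
```

The pattern is case-insensitive and requires exactly one space between the keyword and the reference, so text such as `Fixing #123` does not match.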

									
										127

.github/scripts/generate_binary_build_matrix.py
									
				@ -8,22 +8,25 @@ architectures:

				    * CPU

				    * Latest CUDA

				    * Latest ROCM

				    * Latest XPU

				"""

				import os

				from typing import Dict, List, Optional, Tuple

				CUDA_ARCHES = ["11.8", "12.1", "12.4"]

				CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}

				CUDA_ARCHES_CUDNN_VERSION = {"11.8": "8", "12.1": "8", "12.4": "8"}

				CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}

				ROCM_ARCHES = ["6.0", "6.1"]

				XPU_ARCHES = ["xpu"]

				CPU_CXX11_ABI_ARCH = ["cpu-cxx11-abi"]

				@ -34,44 +37,47 @@ CPU_AARCH64_ARCH = ["cpu-aarch64"]

				CPU_S390X_ARCH = ["cpu-s390x"]

				CUDA_AARCH64_ARCH = ["cuda-aarch64"]

				PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {

				    "11.8": (

				        "nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "  # noqa: B950

				        "nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu11==8.7.0.84; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu11==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				    "12.1": (

				        "nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "  # noqa: B950

				        "nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==8.9.2.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				    "12.4": (

				        "nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==8.9.7.29; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.20.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "

				        "nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"

				    ),

				@ -129,12 +135,16 @@ def arch_type(arch_version: str) -> str:

				        return "cuda"

				    elif arch_version in ROCM_ARCHES:

				        return "rocm"

				    elif arch_version in XPU_ARCHES:

				        return "xpu"

				    elif arch_version in CPU_CXX11_ABI_ARCH:

				        return "cpu-cxx11-abi"

				    elif arch_version in CPU_AARCH64_ARCH:

				        return "cpu-aarch64"

				    elif arch_version in CPU_S390X_ARCH:

				        return "cpu-s390x"

				    elif arch_version in CUDA_AARCH64_ARCH:

				        return "cuda-aarch64"

				    else:  # arch_version should always be "cpu" in this case

				        return "cpu"

				@ -151,10 +161,12 @@ WHEEL_CONTAINER_IMAGES = {

				        gpu_arch: f"pytorch/manylinux-builder:rocm{gpu_arch}-{DEFAULT_TAG}"

				        for gpu_arch in ROCM_ARCHES

				    },

				    "xpu": f"pytorch/manylinux2_28-builder:xpu-{DEFAULT_TAG}",

				    "cpu": f"pytorch/manylinux-builder:cpu-{DEFAULT_TAG}",

				    "cpu-cxx11-abi": f"pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-{DEFAULT_TAG}",

				    "cpu-aarch64": f"pytorch/manylinuxaarch64-builder:cpu-aarch64-{DEFAULT_TAG}",

				    "cpu-s390x": f"pytorch/manylinuxs390x-builder:cpu-s390x-{DEFAULT_TAG}",

				    "cuda-aarch64": f"pytorch/manylinuxaarch64-builder:cuda12.4-{DEFAULT_TAG}",

				}

				CONDA_CONTAINER_IMAGES = {

				@ -213,7 +225,9 @@ def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:

				        "cpu-cxx11-abi": "cpu-cxx11-abi",

				        "cpu-s390x": "cpu",

				        "cuda": f"cu{gpu_arch_version.replace('.', '')}",

				        "cuda-aarch64": "cu124",

				        "rocm": f"rocm{gpu_arch_version}",

				        "xpu": "xpu",

				    }.get(gpu_arch_type, gpu_arch_version)
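
The effect of the two new entries can be checked with a cut-down copy of the lookup. This sketch contains only the dictionary entries visible in the hunk above (earlier entries of the real table are outside the hunk and omitted), and `translate_desired_cuda_sketch` is a hypothetical name.

```python
def translate_desired_cuda_sketch(gpu_arch_type: str, gpu_arch_version: str) -> str:
    """Map an arch type/version pair to a DESIRED_CUDA-style string.

    Partial copy: only the entries shown in the hunk are reproduced here.
    """
    return {
        "cpu-cxx11-abi": "cpu-cxx11-abi",
        "cpu-s390x": "cpu",
        "cuda": f"cu{gpu_arch_version.replace('.', '')}",
        "cuda-aarch64": "cu124",
        "rocm": f"rocm{gpu_arch_version}",
        "xpu": "xpu",
    }.get(gpu_arch_type, gpu_arch_version)


if __name__ == "__main__":
    print(translate_desired_cuda_sketch("cuda", "12.4"))
    print(translate_desired_cuda_sketch("cuda-aarch64", ""))
    print(translate_desired_cuda_sketch("xpu", ""))
```

Note that `cuda-aarch64` is pinned to `cu124` regardless of the version argument, which matches the single `cuda12.4` aarch64 container image added to `WHEEL_CONTAINER_IMAGES`.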

				@ -293,11 +307,11 @@ def generate_libtorch_matrix(

				                    "libtorch_variant": libtorch_variant,

				                    "libtorch_config": abi_version if os == "windows" else "",

				                    "devtoolset": abi_version if os != "windows" else "",

				                    "container_image": LIBTORCH_CONTAINER_IMAGES[

				                        (arch_version, abi_version)

				                    ]

				                    if os != "windows"

				                    else "",

				                    "container_image": (

				                        LIBTORCH_CONTAINER_IMAGES[(arch_version, abi_version)]

				                        if os != "windows"

				                        else ""

				                    ),

				                    "package_type": "libtorch",

				                    "build_name": f"libtorch-{gpu_arch_type}{gpu_arch_version}-{libtorch_variant}-{abi_version}".replace(

				                        ".", "_"

				@ -318,19 +332,19 @@ def generate_wheels_matrix(

				        package_type = "manywheel"

				    if python_versions is None:

				        python_versions = FULL_PYTHON_VERSIONS

				        python_versions = FULL_PYTHON_VERSIONS + ["3.13"]

				    if arches is None:

				        # Define default compute architectures

				        arches = ["cpu"]

				        if os == "linux":

				            arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES

				            arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES

				        elif os == "windows":

				            arches += CUDA_ARCHES

				        elif os == "linux-aarch64":

				            # Only want the one arch as the CPU type is different and

				            # uses different build/test scripts

				            arches = ["cpu-aarch64"]

				            arches = ["cpu-aarch64", "cuda-aarch64"]

				        elif os == "linux-s390x":

				            # Only want the one arch as the CPU type is different and

				            # uses different build/test scripts

				@ -346,11 +360,23 @@ def generate_wheels_matrix(

				                or arch_version == "cpu-cxx11-abi"

				                or arch_version == "cpu-aarch64"

				                or arch_version == "cpu-s390x"

				                or arch_version == "cuda-aarch64"

				                or arch_version == "xpu"

				                else arch_version

				            )

				            # TODO: Enable python 3.13 on rocm, xpu, aarch64, windows

				            if (

				                gpu_arch_type in ["rocm", "xpu"] or os != "linux"

				            ) and python_version == "3.13":

				                continue

				            # 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install

				            if arch_version in ["12.4", "12.1", "11.8"] and os == "linux":

				            if (

				                arch_version in ["12.4", "12.1", "11.8"]

				                and os == "linux"

				                or arch_version == "cuda-aarch64"

				            ):

				                ret.append(

				                    {

				                        "python_version": python_version,

				@ -359,15 +385,64 @@ def generate_wheels_matrix(

				                        "desired_cuda": translate_desired_cuda(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "devtoolset": "",

				                        "devtoolset": (

				                            "cxx11-abi" if arch_version == "cuda-aarch64" else ""

				                        ),

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                        "package_type": package_type,

				                        "pytorch_extra_install_requirements": PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version],  # fmt: skip

				                        "pytorch_extra_install_requirements": (

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]  # fmt: skip

				                            if os != "linux-aarch64"

				                            else ""

				                        ),

				                        "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace(  # noqa: B950

				                            ".", "_"

				                        ),

				                    }

				                )

				                if arch_version != "cuda-aarch64":

				                    ret.append(

				                        {

				                            "python_version": python_version,

				                            "gpu_arch_type": gpu_arch_type,

				                            "gpu_arch_version": gpu_arch_version,

				                            "desired_cuda": translate_desired_cuda(

				                                gpu_arch_type, gpu_arch_version

				                            ),

				                            "use_split_build": "True",

				                            "devtoolset": "",

				                            "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                            "package_type": package_type,

				                            "pytorch_extra_install_requirements": (

				                                PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]  # fmt: skip

				                                if os != "linux-aarch64"

				                                else ""

				                            ),

				                            "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-split".replace(  # noqa: B950

				                                ".", "_"

				                            ),

				                        }

				                    )

				                    # Special build for use on Colab: Python 3.10 with CUDA 12.1

				                    if python_version == "3.10" and arch_version == "12.1":

				                        ret.append(

				                            {

				                                "python_version": python_version,

				                                "gpu_arch_type": gpu_arch_type,

				                                "gpu_arch_version": gpu_arch_version,

				                                "desired_cuda": translate_desired_cuda(

				                                    gpu_arch_type, gpu_arch_version

				                                ),

				                                "use_split_build": "False",

				                                "devtoolset": "",

				                                "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                                "package_type": package_type,

				                                "pytorch_extra_install_requirements": "",

				                                "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace(  # noqa: B950

				                                    ".", "_"

				                                ),

				                            }

				                        )

				            else:

				                ret.append(

				                    {

				@ -377,17 +452,21 @@ def generate_wheels_matrix(

				                        "desired_cuda": translate_desired_cuda(

				                            gpu_arch_type, gpu_arch_version

				                        ),

				                        "devtoolset": "cxx11-abi"

				                        if arch_version == "cpu-cxx11-abi"

				                        else "",

				                        "devtoolset": (

				                            "cxx11-abi"

				                            if arch_version in ["cpu-cxx11-abi", "xpu"]

				                            else ""

				                        ),

				                        "container_image": WHEEL_CONTAINER_IMAGES[arch_version],

				                        "package_type": package_type,

				                        "build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}".replace(

				                            ".", "_"

				                        ),

				                        "pytorch_extra_install_requirements":

				                        PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"]  # fmt: skip

				                        if os != "linux" else "",

				                        "pytorch_extra_install_requirements": (

				                            PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"]  # fmt: skip

				                            if os != "linux"

				                            else ""

				                        ),

				                    }

				                )

				    return ret
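Note on the new condition in this hunk: `A and B or C` parses as `(A and B) or C` in Python, so the `cuda-aarch64` clause applies regardless of `os`. The sketch below is only an illustration; the helper names and the `CUDA_ARCHES` constant are hypothetical and not part of `generate_binary_build_matrix.py`.

```python
# Hypothetical stand-in for the list literal used in the diff.
CUDA_ARCHES = ["12.4", "12.1", "11.8"]


def takes_cuda_branch(arch_version: str, os: str) -> bool:
    # Same shape as the condition in the diff: `A and B or C`.
    return (
        arch_version in CUDA_ARCHES
        and os == "linux"
        or arch_version == "cuda-aarch64"
    )


def takes_cuda_branch_explicit(arch_version: str, os: str) -> bool:
    # Equivalent form with the grouping spelled out.
    return (arch_version in CUDA_ARCHES and os == "linux") or (
        arch_version == "cuda-aarch64"
    )


if __name__ == "__main__":
    for arch in ["12.1", "cuda-aarch64", "cpu"]:
        for os_name in ["linux", "linux-aarch64", "windows"]:
            print(arch, os_name, takes_cuda_branch(arch, os_name))
```

Writing the parentheses explicitly does not change behavior; it only makes the intended grouping visible to readers.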

									
										14

.github/scripts/generate_ci_workflows.py
									
												
				@ -5,11 +5,11 @@ import sys

				from dataclasses import asdict, dataclass, field

				from pathlib import Path

				from typing import Dict, Iterable, List, Literal, Set

				from typing_extensions import TypedDict  # Python 3.11+

				import generate_binary_build_matrix  # type: ignore[import]

				import jinja2

				from typing_extensions import TypedDict  # Python 3.11+

				Arch = Literal["windows", "linux", "macos"]

				@ -60,7 +60,7 @@ class BinaryBuildWorkflow:

				    branches: str = "nightly"

				    # Mainly for macos

				    cross_compile_arm64: bool = False

				    macos_runner: str = "macos-12-xl"

				    macos_runner: str = "macos-14-xlarge"

				    def __post_init__(self) -> None:

				        if self.abi_version:

				@ -157,7 +157,7 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [

				        package_type="manywheel",

				        build_configs=generate_binary_build_matrix.generate_wheels_matrix(

				            OperatingSystem.LINUX,

				            arches=["11.8", "12.1"],

				            arches=["11.8", "12.1", "12.4"],

				            python_versions=["3.8"],

				        ),

				        branches="main",

				@ -285,7 +285,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				            libtorch_variants=["shared-with-deps"],

				        ),

				        cross_compile_arm64=False,

				        macos_runner="macos-13-xlarge",

				        macos_runner="macos-14-xlarge",

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},

				            isolated_workflow=True,

				@ -298,7 +298,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				            OperatingSystem.MACOS_ARM64

				        ),

				        cross_compile_arm64=False,

				        macos_runner="macos-13-xlarge",

				        macos_runner="macos-14-xlarge",

				        ciflow_config=CIFlowConfig(

				            labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},

				            isolated_workflow=True,

				@ -308,7 +308,7 @@ MACOS_BINARY_BUILD_WORKFLOWS = [

				        os=OperatingSystem.MACOS_ARM64,

				        package_type="conda",

				        cross_compile_arm64=False,

				        macos_runner="macos-13-xlarge",

				        macos_runner="macos-14-xlarge",

				        build_configs=generate_binary_build_matrix.generate_conda_matrix(

				            OperatingSystem.MACOS_ARM64

				        ),

									
										1

.github/scripts/generate_docker_release_matrix.py
									
												
				@ -16,6 +16,7 @@ from typing import Dict, List

				import generate_binary_build_matrix

				DOCKER_IMAGE_TYPES = ["runtime", "devel"]

									
										2

.github/scripts/generate_pytorch_version.py
									
												
				@ -4,11 +4,11 @@ import argparse

				import os

				import re

				import subprocess

				from datetime import datetime

				from distutils.util import strtobool

				from pathlib import Path

				LEADING_V_PATTERN = re.compile("^v")

				TRAILING_RC_PATTERN = re.compile("-rc[0-9]*$")

				LEGACY_BASE_VERSION_SUFFIX_PATTERN = re.compile("a0$")

									
										1

.github/scripts/get_workflow_job_id.py
									
												
				@ -11,7 +11,6 @@ import sys

				import time

				import urllib

				import urllib.parse

				from typing import Any, Callable, Dict, List, Optional, Tuple

				from urllib.request import Request, urlopen

									
										99

.github/scripts/get_workflow_type.py
									
											
				@ -1,99 +0,0 @@

				import json

				from argparse import ArgumentParser

				from typing import Any

				from github import Auth, Github

				from github.Issue import Issue

				WORKFLOW_TYPE_LABEL = "label"

				WORKFLOW_TYPE_RG = "rg"

				WORKFLOW_TYPE_BOTH = "both"

				def parse_args() -> Any:

				    parser = ArgumentParser("Get dynamic rollout settings")

				    parser.add_argument("--github-token", type=str, required=True, help="GitHub token")

				    parser.add_argument(

				        "--github-repo",

				        type=str,

				        required=False,

				        default="pytorch/test-infra",

				        help="GitHub repo to get the issue",

				    )

				    parser.add_argument(

				        "--github-issue", type=int, required=True, help="GitHub issue number"

				    )

				    parser.add_argument(

				        "--github-user", type=str, required=True, help="GitHub username"

				    )

				    parser.add_argument(

				        "--github-branch", type=str, required=True, help="Current GitHub branch"

				    )

				    return parser.parse_args()

				def get_gh_client(github_token: str) -> Github:

				    auth = Auth.Token(github_token)

				    return Github(auth=auth)

				def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:

				    repo = gh.get_repo(repo)

				    return repo.get_issue(number=issue_num)

				def is_exception_branch(branch: str) -> bool:

				    return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}

				def get_workflow_type(issue: Issue, username: str) -> str:

				    user_list = issue.get_comments()[0].body.split("\r\n")

				    try:

				        run_option = issue.get_comments()[1].body.split("\r\n")[0]

				    except Exception as e:

				        run_option = "single"

				    if user_list[0] == "!":

				        # Use old runners for everyone

				        return WORKFLOW_TYPE_LABEL

				    elif user_list[1] == "*":

				        if run_option == WORKFLOW_TYPE_BOTH:

				            # Use ARC runners and old runners for everyone

				            return WORKFLOW_TYPE_BOTH

				        else:

				            # Use only ARC runners for everyone

				            return WORKFLOW_TYPE_RG

				    elif username in user_list:

				        if run_option == WORKFLOW_TYPE_BOTH:

				            # Use ARC runners and old runners for a specific user

				            return WORKFLOW_TYPE_BOTH

				        else:

				            # Use only ARC runners for a specific user

				            return WORKFLOW_TYPE_RG

				    else:

				        # Use old runners by default

				        return WORKFLOW_TYPE_LABEL

				def main() -> None:

				    args = parse_args()

				    if is_exception_branch(args.github_branch):

				        output = {"workflow_type": WORKFLOW_TYPE_LABEL}

				    else:

				        try:

				            gh = get_gh_client(args.github_token)

				            issue = get_issue(gh, args.github_repo, args.github_issue)

				            output = {"workflow_type": get_workflow_type(issue, args.github_user)}

				        except Exception as e:

				            output = {"workflow_type": WORKFLOW_TYPE_LABEL}

				    json_output = json.dumps(output)

				    print(json_output)

				if __name__ == "__main__":

				    main()

									
										10

.github/scripts/github_utils.py
									
												
				@ -3,7 +3,6 @@

				import json

				import os

				import warnings

				from dataclasses import dataclass

				from typing import Any, Callable, cast, Dict, List, Optional, Tuple, Union

				from urllib.error import HTTPError

				@ -202,3 +201,12 @@ def gh_update_pr_state(org: str, repo: str, pr_num: int, state: str = "open") ->

				            )

				        else:

				            raise

				def gh_query_issues_by_labels(

				    org: str, repo: str, labels: List[str], state: str = "open"

				) -> List[Dict[str, Any]]:

				    url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues"

				    return gh_fetch_json(

				        url, method="GET", params={"labels": ",".join(labels), "state": state}

				    )
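Note on `gh_query_issues_by_labels`: the labels are joined with commas into a single `labels` parameter. The sketch below shows what the resulting request URL could look like, assuming the params end up URL-encoded in the query string; how `gh_fetch_json` actually encodes them is not shown in this hunk. `build_issues_url` is a hypothetical helper, and the value of `GITHUB_API_URL` here is an assumption.

```python
from typing import List
from urllib.parse import urlencode

# Assumed value; the constant's definition is not part of this excerpt.
GITHUB_API_URL = "https://api.github.com"


def build_issues_url(
    org: str, repo: str, labels: List[str], state: str = "open"
) -> str:
    """Hypothetical helper: build the issues URL with an encoded query string."""
    params = {"labels": ",".join(labels), "state": state}
    # urlencode percent-encodes the comma and turns spaces into '+'.
    return f"{GITHUB_API_URL}/repos/{org}/{repo}/issues?{urlencode(params)}"


if __name__ == "__main__":
    print(build_issues_url("pytorch", "pytorch", ["bug", "high priority"]))
```

Label names containing spaces or commas are the cases worth checking against the real helper.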

									
										1

.github/scripts/gitutils.py
									
												
				@ -19,6 +19,7 @@ from typing import (

				    Union,

				)

				T = TypeVar("T")

				RE_GITHUB_URL_MATCH = re.compile("^https://.*@?github.com/(.+)/(.+)$")

BIN
.github/scripts/gql_mocks.json.gz

Binary file not shown.

									
										2

.github/scripts/label_utils.py
									
												
				@ -1,12 +1,12 @@

				"""GitHub Label Utilities."""

				import json

				from functools import lru_cache

				from typing import Any, List, Tuple, TYPE_CHECKING, Union

				from github_utils import gh_fetch_url_and_headers, GitHubComment

				# TODO: this is a temp workaround to avoid circular dependencies,

				#       and should be removed once GitHubPR is refactored out of trymerge script.

				if TYPE_CHECKING:

									
										3

.github/scripts/lintrunner.sh
									
												
				@ -7,7 +7,7 @@ eval "$(command conda 'shell.bash' 'hook' 2> /dev/null)"

				conda activate "${CONDA_ENV}"

				# Use uv to speed up lintrunner init

				python3 -m pip install uv

				python3 -m pip install uv==0.1.45

				CACHE_DIRECTORY="/tmp/.lintbin"

				# Try to recover the cached binaries

				@ -29,6 +29,7 @@ python3 -m tools.pyi.gen_pyi \

				    --native-functions-path aten/src/ATen/native/native_functions.yaml \

				    --tags-path aten/src/ATen/native/tags.yaml \

				    --deprecated-functions-path "tools/autograd/deprecated.yaml"

				python3 torch/utils/data/datapipes/gen_pyi.py

				RC=0

				# Run lintrunner on all files

									
										1

.github/scripts/pytest_cache.py
									
												
				@ -9,6 +9,7 @@ from pytest_caching_utils import (

				    upload_pytest_cache,

				)

				TEMP_DIR = "./tmp"  # a backup location in case one isn't provided

									
										30

.github/scripts/pytest_caching_utils.py
									
												
				@ -14,10 +14,12 @@ from file_io_utils import (

				    zip_folder,

				)

				PYTEST_CACHE_KEY_PREFIX = "pytest_cache"

				PYTEST_CACHE_DIR_NAME = ".pytest_cache"

				BUCKET = "gha-artifacts"

				LASTFAILED_FILE_PATH = Path("v/cache/lastfailed")

				TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL = "previous_failures_additional.json"

				# Temp folders

				ZIP_UPLOAD = "zip-upload"

				@ -191,6 +193,10 @@ def _merge_pytest_caches(

				        pytest_cache_dir_to_merge_from, pytest_cache_dir_to_merge_into

				    )

				    _merge_additional_failures_files(

				        pytest_cache_dir_to_merge_from, pytest_cache_dir_to_merge_into

				    )

				def _merge_lastfailed_files(source_pytest_cache: Path, dest_pytest_cache: Path) -> None:

				    # Simple cases where one of the files doesn't exist

				@ -232,3 +238,27 @@ def _merged_lastfailed_content(

				            del to_lastfailed[""]

				    return to_lastfailed

				def _merge_additional_failures_files(

				    source_pytest_cache: Path, dest_pytest_cache: Path

				) -> None:

				    # Simple cases where one of the files doesn't exist

				    source_lastfailed_file = (

				        source_pytest_cache / TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL

				    )

				    dest_lastfailed_file = dest_pytest_cache / TD_HEURISTIC_PREVIOUSLY_FAILED_ADDITIONAL

				    if not source_lastfailed_file.exists():

				        return

				    if not dest_lastfailed_file.exists():

				        copy_file(source_lastfailed_file, dest_lastfailed_file)

				        return

				    # Both files exist, so we need to merge them

				    from_lastfailed = load_json_file(source_lastfailed_file)

				    to_lastfailed = load_json_file(dest_lastfailed_file)

				    merged_content = list(set(from_lastfailed + to_lastfailed))

				    # Save the results

				    write_json_file(dest_lastfailed_file, merged_content)
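Note on `_merge_additional_failures_files`: the merge is a set union of two JSON lists, so duplicates are dropped and `list(set(...))` leaves the output order unspecified. The sketch below reproduces the same three cases with the standard library only. `merge_json_lists` is a hypothetical name, it does not use the repo's `copy_file`, `load_json_file` or `write_json_file` helpers, and the `sorted` call is an addition made here to get stable output.

```python
import json
from pathlib import Path


def merge_json_lists(source: Path, dest: Path) -> None:
    """Hypothetical stand-in: union two JSON list files into dest."""
    if not source.exists():
        # Nothing to merge from.
        return
    if not dest.exists():
        # Only the source exists: copy its content as-is.
        dest.write_text(source.read_text())
        return
    # Both files exist: take the union. Entries must be hashable (e.g. strings).
    merged = set(json.loads(source.read_text())) | set(json.loads(dest.read_text()))
    # Sorting is an assumption for determinism; the diff does not sort.
    dest.write_text(json.dumps(sorted(merged)))
```

If the stored entries were ever dicts or lists rather than strings, the set-based union would raise `TypeError`, which applies to the original code as well.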

									
										215

.github/scripts/runner_determinator.py
									
									
										Normal file
									
												
				@ -0,0 +1,215 @@

				# flake8: noqa: G004

				import logging

				import os

				from argparse import ArgumentParser

				from logging import LogRecord

				from typing import Any, Iterable

				from github import Auth, Github

				from github.Issue import Issue

				WORKFLOW_LABEL_META = ""  # use meta runners

				WORKFLOW_LABEL_LF = "lf."  # use runners from the linux foundation

				WORKFLOW_LABEL_LF_CANARY = "lf.c."  # use canary runners from the linux foundation

				GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")

				GH_OUTPUT_KEY_LABEL_TYPE = "label-type"

				class ColorFormatter(logging.Formatter):

				    """Color codes the log messages based on the log level"""

				    COLORS = {

				        "WARNING": "\033[33m",  # Yellow

				        "ERROR": "\033[31m",  # Red

				        "CRITICAL": "\033[31m",  # Red

				        "INFO": "\033[0m",  # Reset

				        "DEBUG": "\033[0m",  # Reset

				    }

				    def format(self, record: LogRecord) -> str:

				        log_color = self.COLORS.get(record.levelname, "\033[0m")  # Default to reset

				        record.msg = f"{log_color}{record.msg}\033[0m"

				        return super().format(record)

				handler = logging.StreamHandler()

				handler.setFormatter(ColorFormatter(fmt="%(levelname)-8s: %(message)s"))

				log = logging.getLogger(os.path.basename(__file__))

				log.addHandler(handler)

				log.setLevel(logging.INFO)

				def set_github_output(key: str, value: str) -> None:

				    """

				    Defines outputs of the github action that invokes this script

				    """

				    if not GITHUB_OUTPUT:

				        # See https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/ for deprecation notice

				        log.warning(

				            "No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method."

				        )

				        print(f"::set-output name={key}::{value}")

				        return

				    with open(GITHUB_OUTPUT, "a") as f:

				        log.info(f"Setting output: {key}='{value}'")

				        f.write(f"{key}={value}\n")

				def parse_args() -> Any:

				    parser = ArgumentParser("Get dynamic rollout settings")

				    parser.add_argument("--github-token", type=str, required=True, help="GitHub token")

				    parser.add_argument(

				        "--github-issue-repo",

				        type=str,

				        required=False,

				        default="pytorch/test-infra",

				        help="GitHub repo to get the issue",

				    )

				    parser.add_argument(

				        "--github-repo",

				        type=str,

				        required=True,

				        help="GitHub repo where CI is running",

				    )

				    parser.add_argument(

				        "--github-issue", type=int, required=True, help="GitHub issue number"

				    )

				    parser.add_argument(

				        "--github-actor", type=str, required=True, help="GitHub triggering_actor"

				    )

				    parser.add_argument(

				        "--github-issue-owner", type=str, required=True, help="GitHub issue owner"

				    )

				    parser.add_argument(

				        "--github-branch", type=str, required=True, help="Current GitHub branch or tag"

				    )

				    parser.add_argument(

				        "--github-ref-type",

				        type=str,

				        required=True,

				        help="Current GitHub ref type, branch or tag",

				    )

				    return parser.parse_args()

				def get_gh_client(github_token: str) -> Github:

				    auth = Auth.Token(github_token)

				    return Github(auth=auth)

				def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:

				    repo = gh.get_repo(repo)

				    return repo.get_issue(number=issue_num)

				def get_potential_pr_author(

				    gh: Github, repo: str, username: str, ref_type: str, ref_name: str

				) -> str:

				    # If the trigger was a new tag added by a bot, this is a ciflow case

				    # Fetch the actual username from the original PR. The PR number is

				    # embedded in the tag name: ciflow/<name>/<pr-number>

				    if username == "pytorch-bot[bot]" and ref_type == "tag":

				        split_tag = ref_name.split("/")

				        if (

				            len(split_tag) == 3

				            and split_tag[0] == "ciflow"

				            and split_tag[2].isnumeric()

				        ):

				            pr_number = split_tag[2]

				            try:

				                repository = gh.get_repo(repo)

				                pull = repository.get_pull(number=int(pr_number))

				            except Exception as e:

				                raise Exception(  # noqa: TRY002

				                    f"issue with pull request {pr_number} from repo {repository}"

				                ) from e

				            return pull.user.login

				    # In all other cases, return the original input username

				    return username

				def is_exception_branch(branch: str) -> bool:

				    return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}

				def get_workflow_type(issue: Issue, workflow_requestors: Iterable[str]) -> str:

				    try:

				        first_comment = issue.get_comments()[0].body.strip("\n\t ")

				        if first_comment[0] == "!":

				            log.info("LF Workflows are disabled for everyone. Using meta runners.")

				            return WORKFLOW_LABEL_META

				        elif first_comment[0] == "*":

				            log.info("LF Workflows are enabled for everyone. Using LF runners.")

				            return WORKFLOW_LABEL_LF

				        else:

				            all_opted_in_users = {

				                usr_raw.strip("\n\t@ ") for usr_raw in first_comment.split()

				            }

				            opted_in_requestors = {

				                usr for usr in workflow_requestors if usr in all_opted_in_users

				            }

				            if opted_in_requestors:

				                log.info(

				                    f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."

				                )

				                return WORKFLOW_LABEL_LF

				            else:

				                log.info(

				                    f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."

				                )

				                return WORKFLOW_LABEL_META

				    except Exception as e:

				        log.error(

				            f"Failed to determine workflow type. Falling back to meta runners. Exception: {e}"

				        )

				        return WORKFLOW_LABEL_META

				def main() -> None:

				    args = parse_args()

				    if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):

				        log.info(f"Exception branch: '{args.github_branch}', using meta runners")

				        label_type = WORKFLOW_LABEL_META

				    else:

				        try:

				            gh = get_gh_client(args.github_token)

				            # The default issue we use - https://github.com/pytorch/test-infra/issues/5132

				            issue = get_issue(gh, args.github_issue_repo, args.github_issue)

				            username = get_potential_pr_author(

				                gh,

				                args.github_repo,

				                args.github_actor,

				                args.github_ref_type,

				                args.github_branch,

				            )

				            label_type = get_workflow_type(

				                issue,

				                (

				                    args.github_issue_owner,

				                    username,

				                ),

				            )

				        except Exception as e:

				            log.error(

				                f"Failed to get issue. Falling back to meta runners. Exception: {e}"

				            )

				            label_type = WORKFLOW_LABEL_META

				    # For Canary builds use canary runners

				    if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:

				        label_type = WORKFLOW_LABEL_LF_CANARY

				    set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)

				if __name__ == "__main__":

				    main()
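Note on the rollout rules in `get_workflow_type`: the decision depends only on the first comment of the tracking issue and the set of requestors. The sketch below restates that logic on plain strings so it can be tried without a GitHub token. `label_for` and the two constants are hypothetical names, and the empty-comment branch stands in for the `except` fallback in the original.

```python
from typing import Iterable

LABEL_META = ""    # mirrors WORKFLOW_LABEL_META in the diff
LABEL_LF = "lf."   # mirrors WORKFLOW_LABEL_LF in the diff


def label_for(first_comment: str, requestors: Iterable[str]) -> str:
    """Hypothetical pure-function version of the opt-in check."""
    body = first_comment.strip("\n\t ")
    if not body:
        # The original would hit IndexError here and fall back via `except`.
        return LABEL_META
    if body[0] == "!":
        # Rollout disabled for everyone.
        return LABEL_META
    if body[0] == "*":
        # Rollout enabled for everyone.
        return LABEL_LF
    # Otherwise the comment is a whitespace-separated list of @usernames.
    opted_in = {raw.strip("\n\t@ ") for raw in body.split()}
    if any(user in opted_in for user in requestors):
        return LABEL_LF
    return LABEL_META


if __name__ == "__main__":
    print(repr(label_for("@alice\n@bob", ["bob", "pytorch-bot[bot]"])))
```

Keeping the parsing separate from the API calls makes the three cases (`!`, `*`, explicit user list) easy to unit test.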

									
										35

.github/scripts/sync_distributed_folder_prototype.sh
									
									
										Executable file
									
												
				@ -0,0 +1,35 @@

				#!/bin/bash

				set -eoux pipefail

				SYNC_BRANCH=pytorch-stable-prototype

				git config user.email "fake@example.com"

				git config user.name  "PyTorch Stable Bot"

				git fetch origin main

				git fetch origin "$SYNC_BRANCH"

				git checkout "$SYNC_BRANCH"

				# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.

				# This specific SHA was chosen as it was before the "branch point" of the stable branch

				for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" --reverse -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)

				do

				    # `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise

				    if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]

				    then

				        echo "Skipping $SHA"

				        continue

				    fi

				    echo "Copying $SHA"

				    git cherry-pick -x "$SHA" -X theirs

				    git reset --soft HEAD~1

				    git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed

				    git checkout .

				    git commit --reuse-message=HEAD@{1}

				    git clean -f

				done

				if [[ "${WITH_PUSH}" == true ]]; then

				  git push

				fi

									
										2

.github/scripts/tag_docker_images_for_release.py
									
												
				@ -41,7 +41,7 @@ def main() -> None:

				    )

				    options = parser.parse_args()

				    tagged_images: Dict[str, bool] = dict()

				    tagged_images: Dict[str, bool] = {}

				    platform_images = [

				        generate_binary_build_matrix.WHEEL_CONTAINER_IMAGES,

				        generate_binary_build_matrix.LIBTORCH_CONTAINER_IMAGES,

									
										1

.github/scripts/td_llm_indexer.sh
									
												
				@ -7,6 +7,7 @@ cd llm-target-determinator

				pip install -q -r requirements.txt

				cd ../codellama

				pip install -e .

				pip install numpy==1.26.0

				# Run indexer

				cd ../llm-target-determinator

									
										39

.github/scripts/test_trymerge.py
									
												
				@ -17,9 +17,7 @@ from unittest import main, mock, skip, TestCase

				from urllib.error import HTTPError

				from github_utils import gh_graphql

				from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo

				from trymerge import (

				    categorize_checks,

				    DRCI_CHECKRUN_NAME,

				@ -39,6 +37,7 @@ from trymerge import (

				    validate_revert,

				)

				if "GIT_REMOTE_URL" not in os.environ:

				    os.environ["GIT_REMOTE_URL"] = "https://github.com/pytorch/pytorch"

				@ -180,6 +179,9 @@ def mock_gh_get_info() -> Any:

				    return {

				        "closed": False,

				        "isCrossRepository": False,

				        "headRefName": "foo",

				        "baseRefName": "bar",

				        "baseRepository": {"defaultBranchRef": {"name": "bar"}},

				        "files": {"nodes": [], "pageInfo": {"hasNextPage": False}},

				        "changedFiles": 0,

				    }

				@ -394,6 +396,7 @@ class TestTryMerge(TestCase):

				        # self.assertGreater(len(pr.get_checkrun_conclusions()), 3)

				        self.assertGreater(pr.get_commit_count(), 60)

				    @skip("GitHub doesn't keep this data anymore")

				    def test_gql_retrieve_checksuites(self, *args: Any) -> None:

				        "Fetch comments and conclusions for PR with 60 commits"

				        pr = GitHubPR("pytorch", "pytorch", 94787)

				@ -773,13 +776,13 @@ class TestBypassFailures(TestCase):

				                # than the one on the base commit. This should still count as broken trunk

				                "pr_num": 104214,

				                "related_failure_count": 0,

				                "unrelated_failure_count": 1,

				                "flaky_or_broken_trunk": 1,

				            },

				            {

				                # This PR had one broken trunk failure and it used ghstack

				                "pr_num": 105145,

				                "related_failure_count": 0,

				                "unrelated_failure_count": 1,

				                "flaky_or_broken_trunk": 1,

				            },

				            {

				                # The failure on the merge base was retried successfully and

				@ -788,20 +791,20 @@ class TestBypassFailures(TestCase):

				                # be used to detect broken trunk

				                "pr_num": 107160,

				                "related_failure_count": 0,

				                "unrelated_failure_count": 4,

				                "flaky_or_broken_trunk": 1,

				            },

				            {

				                # This PR used Dr.CI broken trunk classification

				                "pr_num": 111253,

				                "related_failure_count": 1,

				                "unrelated_failure_count": 2,

				                "flaky_or_broken_trunk": 1,

				            },

				        ]

				        for case in test_cases:

				            pr_num = case["pr_num"]

				            related_failure_count = case["related_failure_count"]

				            unrelated_failure_count = case["unrelated_failure_count"]

				            flaky_or_broken_trunk = case["flaky_or_broken_trunk"]

				            pr = GitHubPR("pytorch", "pytorch", pr_num)

				            checks = pr.get_checkrun_conclusions()

				@ -823,7 +826,7 @@ class TestBypassFailures(TestCase):

				            )

				            self.assertTrue(len(pending) == 0)

				            self.assertTrue(

				                len(failed) == unrelated_failure_count + related_failure_count

				                len(failed) == flaky_or_broken_trunk + related_failure_count

				            )

				    def test_ignore_current(self, *args: Any) -> None:

				@ -891,6 +894,24 @@ class TestBypassFailures(TestCase):

				        self.assertTrue(len(ignorable["FLAKY"]) == 1)

				        self.assertTrue(len(ignorable["BROKEN_TRUNK"]) == 0)

				    def test_ignore_failures_older_run_same_workflow(self, *args: Any) -> None:

				        pr = GitHubPR("pytorch", "pytorch", 129013)

				        checks = pr.get_checkrun_conclusions()

				        checks = get_classifications(

				            pr.pr_num,

				            pr.project,

				            checks,

				            [],

				        )

				        pending, failed, ignorable = categorize_checks(

				            checks,

				            list(checks.keys()),

				        )

				        self.assertTrue(len(pending) == 0)

				        self.assertTrue(len(failed) == 0)

				        self.assertTrue(len(ignorable["FLAKY"]) == 2)

				        self.assertTrue(len(ignorable["UNSTABLE"]) == 13)

				    @mock.patch("trymerge.read_merge_rules", side_effect=xla_merge_rules)

				    def test_dont_ignore_flaky_failures(self, *args: Any) -> None:

				        """

				@ -1019,7 +1040,7 @@ class TestGitHubPRGhstackDependencies(TestCase):

				        )

				    @skip(

				        reason="This test is run against a mutalbe PR that has changed, so it no longer works. The test should be changed"

				        reason="This test is run against a mutable PR that has changed, so it no longer works. The test should be changed"

				    )

				    @mock.patch("trymerge.read_merge_rules")

				    @mock.patch("trymerge.GitRepo")
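
The updated assertion in the test hunk above compares `len(failed)` against `flaky_or_broken_trunk + related_failure_count`, which suggests that only flaky and broken-trunk failures are promoted to real failures once the threshold is exceeded, while unstable ones stay ignorable. The following is a minimal sketch of that rule under those assumptions. `split_failures` and its simplified `(name, classification)` input are hypothetical and are not the real `categorize_checks` signature.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Classifications that are kept out of the hard-failure list by default.
IGNORABLE = ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")


def split_failures(
    failures: List[Tuple[str, Optional[str]]],
    threshold: Optional[int],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split failing checks into hard failures and categorized ignorable ones.

    `failures` holds (check name, classification) pairs for failing checks.
    """
    failed: List[str] = []
    categorized: Dict[str, List[str]] = defaultdict(list)
    for name, classification in failures:
        if classification in IGNORABLE:
            categorized[classification].append(name)
        else:
            failed.append(name)

    # Only these two categories count against the threshold.
    flaky_or_broken_trunk = categorized["BROKEN_TRUNK"] + categorized["FLAKY"]
    if threshold is not None and len(flaky_or_broken_trunk) > threshold:
        failed = failed + flaky_or_broken_trunk
    return failed, categorized


# Illustrative data only: one related failure, one flaky, one unstable.
failed, categorized = split_failures(
    [("build", None), ("test-a", "FLAKY"), ("test-b", "UNSTABLE")],
    threshold=0,
)
```

With a threshold of 0, the flaky check is added to the hard failures while the unstable check stays in its own category.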

									
151 .github/scripts/trymerge.py
				@ -45,7 +45,6 @@ from github_utils import (

				    gh_update_pr_state,

				    GitHubComment,

				)

				from gitutils import (

				    are_ghstack_branches_in_sync,

				    get_git_remote_name,

				@ -62,6 +61,7 @@ from label_utils import (

				)

				from trymerge_explainer import get_revert_message, TryMergeExplainer

				# labels

				MERGE_IN_PROGRESS_LABEL = "merging"

				MERGE_COMPLETE_LABEL = "merged"

				@ -81,9 +81,10 @@ JobNameToStateDict = Dict[str, JobCheckState]

				class WorkflowCheckState:

				    def __init__(self, name: str, url: str, status: Optional[str]):

				    def __init__(self, name: str, url: str, run_id: int, status: Optional[str]):

				        self.name: str = name

				        self.url: str = url

				        self.run_id: int = run_id

				        self.status: Optional[str] = status

				        self.jobs: JobNameToStateDict = {}

				@ -122,6 +123,7 @@ fragment PRCheckSuites on CheckSuiteConnection {

				      workflowRun {

				        workflow {

				          name

				          databaseId

				        }

				        databaseId

				        url

				@ -512,7 +514,7 @@ def add_workflow_conclusions(

				    workflows: Dict[str, WorkflowCheckState] = {}

				    # for the jobs that don't have a workflow

				    no_workflow_obj: WorkflowCheckState = WorkflowCheckState("", "", None)

				    no_workflow_obj: WorkflowCheckState = WorkflowCheckState("", "", 0, None)

				    def add_conclusions(edges: Any) -> None:

				        for edge_idx, edge in enumerate(edges):

				@ -523,18 +525,30 @@ def add_workflow_conclusions(

				            workflow_obj: WorkflowCheckState = no_workflow_obj

				            if workflow_run is not None:

				                # This is the usual workflow run ID we see on GitHub

				                workflow_run_id = workflow_run["databaseId"]

				                # While this is the metadata name and ID of the workflow itself

				                workflow_name = workflow_run["workflow"]["name"]

				                workflow_id = workflow_run["workflow"]["databaseId"]

				                workflow_conclusion = node["conclusion"]

				                # Do not override existing status with cancelled

				                if workflow_conclusion == "CANCELLED" and workflow_name in workflows:

				                    continue

				                if workflow_name not in workflows:

				                    workflows[workflow_name] = WorkflowCheckState(

				                # Only keep the latest workflow run for each workflow, heuristically,

				                # it's the run with largest run ID

				                if (

				                    workflow_id not in workflows

				                    or workflows[workflow_id].run_id < workflow_run_id

				                ):

				                    workflows[workflow_id] = WorkflowCheckState(

				                        name=workflow_name,

				                        status=workflow_conclusion,

				                        url=workflow_run["url"],

				                        run_id=workflow_run_id,

				                    )

				                workflow_obj = workflows[workflow_name]

				                workflow_obj = workflows[workflow_id]

				            while checkruns is not None:

				                for checkrun_node in checkruns["nodes"]:

				@ -572,12 +586,12 @@ def add_workflow_conclusions(

				    # the jobs in but don't put the workflow in.  We care more about the jobs in

				    # the workflow that ran than the container workflow.

				    res: JobNameToStateDict = {}

				    for workflow_name, workflow in workflows.items():

				    for workflow in workflows.values():

				        if len(workflow.jobs) > 0:

				            for job_name, job in workflow.jobs.items():

				                res[job_name] = job

				        else:

				            res[workflow_name] = JobCheckState(

				            res[workflow.name] = JobCheckState(

				                workflow.name,

				                workflow.url,

				                workflow.status,

				@ -1163,7 +1177,6 @@ class GitHubPR:

				            # Finally, upload the record to Rockset. The list of pending and failed

				            # checks are at the time of the merge

				            save_merge_record(

				                collection=ROCKSET_MERGES_COLLECTION,

				                comment_id=comment_id,

				                pr_num=self.pr_num,

				                owner=self.org,

				@ -1179,10 +1192,8 @@ class GitHubPR:

				                merge_base_sha=self.get_merge_base(),

				                merge_commit_sha=merge_commit_sha,

				                is_failed=False,

				                dry_run=dry_run,

				                skip_mandatory_checks=skip_mandatory_checks,

				                ignore_current=bool(ignore_current_checks),

				                workspace=ROCKSET_MERGES_WORKSPACE,

				            )

				        else:

				            print("Missing comment ID or PR number, couldn't upload to Rockset")

				@ -1489,7 +1500,6 @@ def checks_to_markdown_bullets(

				@retries_decorator()

				def save_merge_record(

				    collection: str,

				    comment_id: int,

				    pr_num: int,

				    owner: str,

				@ -1505,59 +1515,44 @@ def save_merge_record(

				    merge_base_sha: str,

				    merge_commit_sha: str = "",

				    is_failed: bool = False,

				    dry_run: bool = False,

				    skip_mandatory_checks: bool = False,

				    ignore_current: bool = False,

				    error: str = "",

				    workspace: str = "commons",

				) -> None:

				    """

				    This saves the merge records into Rockset, so we can query them (for fun and profit)

				    This saves the merge records as a json, which can later be uploaded to s3

				    """

				    if dry_run:

				        # Decide not to save the record to Rockset if dry-run is set to not pollute

				        # the collection

				        return

				    try:

				        import rockset  # type: ignore[import]

				    # Prepare the record to be written into Rockset

				    data = [

				        {

				            "comment_id": comment_id,

				            "pr_num": pr_num,

				            "owner": owner,

				            "project": project,

				            "author": author,

				            "pending_checks": pending_checks,

				            "failed_checks": failed_checks,

				            "ignore_current_checks": ignore_current_checks,

				            "broken_trunk_checks": broken_trunk_checks,

				            "flaky_checks": flaky_checks,

				            "unstable_checks": unstable_checks,

				            "last_commit_sha": last_commit_sha,

				            "merge_base_sha": merge_base_sha,

				            "merge_commit_sha": merge_commit_sha,

				            "is_failed": is_failed,

				            "skip_mandatory_checks": skip_mandatory_checks,

				            "ignore_current": ignore_current,

				            "error": error,

				            # This is a unique identifier for the record for deduping purposes

				            # in rockset.  Any unique string would work

				            "_id": f"{project}-{pr_num}-{comment_id}-{os.environ.get('GITHUB_RUN_ID')}",

				        }

				    ]

				    repo_root = Path(__file__).resolve().parent.parent.parent

				        # Prepare the record to be written into Rockset

				        data = [

				            {

				                "comment_id": comment_id,

				                "pr_num": pr_num,

				                "owner": owner,

				                "project": project,

				                "author": author,

				                "pending_checks": pending_checks,

				                "failed_checks": failed_checks,

				                "ignore_current_checks": ignore_current_checks,

				                "broken_trunk_checks": broken_trunk_checks,

				                "flaky_checks": flaky_checks,

				                "unstable_checks": unstable_checks,

				                "last_commit_sha": last_commit_sha,

				                "merge_base_sha": merge_base_sha,

				                "merge_commit_sha": merge_commit_sha,

				                "is_failed": is_failed,

				                "skip_mandatory_checks": skip_mandatory_checks,

				                "ignore_current": ignore_current,

				                "error": error,

				            }

				        ]

				        client = rockset.RocksetClient(

				            host="api.usw2a1.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]

				        )

				        client.Documents.add_documents(

				            collection=collection,

				            data=data,

				            workspace=workspace,

				        )

				    except ModuleNotFoundError:

				        print("Rockset is missing, no record will be saved")

				        return

				    with open(repo_root / "merge_record.json", "w") as f:

				        json.dump(data, f)

				@retries_decorator(rc=[])
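
The `save_merge_record` hunk above replaces the Rockset upload with a JSON file written under the repository root, and skips the write on a dry run. A minimal sketch of that pattern is shown below. `write_merge_record` and the `out_dir` parameter are hypothetical, and a temporary directory stands in for the repository root.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_merge_record(
    record: Dict[str, Any], out_dir: Path, dry_run: bool = False
) -> None:
    """Write a single-record list to merge_record.json unless dry_run is set."""
    if dry_run:
        # Mirror the diff: a dry run leaves no record behind.
        return
    data: List[Dict[str, Any]] = [record]
    with open(out_dir / "merge_record.json", "w") as f:
        json.dump(data, f)


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    write_merge_record({"pr_num": 1, "is_failed": False}, out_dir)
    saved = json.loads((out_dir / "merge_record.json").read_text())

    # A dry run returns before touching the filesystem.
    dry_dir = out_dir / "dry"
    write_merge_record({"pr_num": 2}, dry_dir, dry_run=True)
    dry_run_wrote_file = (dry_dir / "merge_record.json").exists()
```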

				@ -2027,10 +2022,8 @@ def categorize_checks(

				    pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    # ok_failed_checks is used with ok_failed_checks_threshold while ignorable_failed_checks

				    # is used to keep track of all ignorable failures when saving the merge record on Rockset

				    ok_failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []

				    ignorable_failed_checks: Dict[str, List[Any]] = defaultdict(list)

				    # failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset

				    failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)

				    # If required_checks is not set or empty, consider all names are relevant

				    relevant_checknames = [

				@ -2058,36 +2051,38 @@ def categorize_checks(

				            continue

				        elif not is_passing_status(check_runs[checkname].status):

				            target = (

				                ignorable_failed_checks[classification]

				                failed_checks_categorization[classification]

				                if classification

				                in ("IGNORE_CURRENT_CHECK", "BROKEN_TRUNK", "FLAKY", "UNSTABLE")

				                else failed_checks

				            )

				            target.append((checkname, url, job_id))

				            if classification in ("BROKEN_TRUNK", "FLAKY", "UNSTABLE"):

				                ok_failed_checks.append((checkname, url, job_id))

				    flaky_or_broken_trunk = (

				        failed_checks_categorization["BROKEN_TRUNK"]

				        + failed_checks_categorization["FLAKY"]

				    )

				    if ok_failed_checks:

				    if flaky_or_broken_trunk:

				        warn(

				            f"The following {len(ok_failed_checks)} checks failed but were likely due flakiness or broken trunk: "

				            + ", ".join([x[0] for x in ok_failed_checks])

				            f"The following {len(flaky_or_broken_trunk)} checks failed but were likely due flakiness or broken trunk: "

				            + ", ".join([x[0] for x in flaky_or_broken_trunk])

				            + (

				                f" but this is greater than the threshold of {ok_failed_checks_threshold} so merge will fail"

				                if ok_failed_checks_threshold is not None

				                and len(ok_failed_checks) > ok_failed_checks_threshold

				                and len(flaky_or_broken_trunk) > ok_failed_checks_threshold

				                else ""

				            )

				        )

				    if (

				        ok_failed_checks_threshold is not None

				        and len(ok_failed_checks) > ok_failed_checks_threshold

				        and len(flaky_or_broken_trunk) > ok_failed_checks_threshold

				    ):

				        failed_checks = failed_checks + ok_failed_checks

				        failed_checks = failed_checks + flaky_or_broken_trunk

				    # The list of ignorable_failed_checks is returned so that it can be saved into the Rockset merge record

				    return (pending_checks, failed_checks, ignorable_failed_checks)

				    # The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record

				    return (pending_checks, failed_checks, failed_checks_categorization)

				def merge(

				@ -2330,6 +2325,15 @@ def main() -> None:

				            dry_run=args.dry_run,

				        )

				        return

				    if not pr.is_ghstack_pr() and pr.base_ref() != pr.default_branch():

				        gh_post_pr_comment(

				            org,

				            project,

				            args.pr_num,

				            f"PR targets {pr.base_ref()} rather than {pr.default_branch()}, refusing merge request",

				            dry_run=args.dry_run,

				        )

				        return

				    if args.check_mergeability:

				        if pr.is_ghstack_pr():

				@ -2365,7 +2369,6 @@ def main() -> None:

				            # list of pending and failed checks here, but they are not really

				            # needed at the moment

				            save_merge_record(

				                collection=ROCKSET_MERGES_COLLECTION,

				                comment_id=args.comment_id,

				                pr_num=args.pr_num,

				                owner=org,

				@ -2380,11 +2383,9 @@ def main() -> None:

				                last_commit_sha=pr.last_commit().get("oid", ""),

				                merge_base_sha=pr.get_merge_base(),

				                is_failed=True,

				                dry_run=args.dry_run,

				                skip_mandatory_checks=args.force,

				                ignore_current=args.ignore_current,

				                error=str(e),

				                workspace=ROCKSET_MERGES_WORKSPACE,

				            )

				        else:

				            print("Missing comment ID or PR number, couldn't upload to Rockset")
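
The `add_workflow_conclusions` hunks above key workflows by the workflow's `databaseId` instead of its name and keep only the run with the largest run ID. A minimal, self-contained sketch of that heuristic follows. `WorkflowRun` and `keep_latest_runs` are hypothetical names used only for illustration and are not part of `trymerge.py`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class WorkflowRun:
    """Simplified, hypothetical stand-in for WorkflowCheckState."""

    name: str
    run_id: int
    status: Optional[str]


def keep_latest_runs(
    runs: Iterable[Tuple[int, WorkflowRun]],
) -> Dict[int, WorkflowRun]:
    """Keep one run per workflow ID, preferring the largest run ID."""
    latest: Dict[int, WorkflowRun] = {}
    for workflow_id, run in runs:
        current = latest.get(workflow_id)
        if current is None or current.run_id < run.run_id:
            latest[workflow_id] = run
    return latest


# Illustrative data: workflow 7 ran three times, workflow 9 once.
runs = [
    (7, WorkflowRun("pull", 100, "FAILURE")),
    (7, WorkflowRun("pull", 250, "SUCCESS")),
    (9, WorkflowRun("trunk", 180, "SUCCESS")),
    (7, WorkflowRun("pull", 120, "CANCELLED")),
]
latest = keep_latest_runs(runs)
```

Because the comparison uses the run ID rather than arrival order, an older cancelled run that is seen later does not replace the newest run.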

									
1 .github/scripts/tryrebase.py
				@ -11,6 +11,7 @@ from github_utils import gh_post_pr_comment as gh_post_comment

				from gitutils import get_git_remote_name, get_git_repo_dir, GitRepo

				from trymerge import GitHubPR

				SAME_SHA_ERROR = (

				    "\n```\nAborting rebase because rebasing the branch resulted in the same sha as the target branch.\n"

				    + "This usually happens because the PR has already been merged.  Please rebase locally and push.\n```"

48 .github/templates/linux_binary_build_workflow.yml.j2

 @ -58,7 +58,7 @@ jobs:
     uses: ./.github/workflows/_binary-build-linux.yml
     with:!{{ upload.binary_env_as_input(config) }}
       {%- if "aarch64" in build_environment %}
       runs_on: linux.arm64.2xlarge
       runs_on: linux.arm64.m7g.4xlarge
       ALPINE_IMAGE: "arm64v8/alpine"
       {%- elif "s390x" in build_environment %}
       runs_on: linux.s390x
 @ -71,12 +71,17 @@ jobs:
       {%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0  %}
       PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
       {%- endif %}
       {%- if config["gpu_arch_type"] == "cuda-aarch64" %}
       timeout-minutes: 420
       {%- endif %}
     secrets:
       github-token: ${{ secrets.GITHUB_TOKEN }}
   {%- if config["gpu_arch_type"] != "cuda-aarch64" %}
   !{{ config["build_name"] }}-test:  # Testing
     if: ${{ github.repository_owner == 'pytorch' }}
     needs: !{{ config["build_name"] }}-build
 {%- if config["gpu_arch_type"] != "rocm" %}
     {%- if config["gpu_arch_type"] not in ["rocm", "xpu"] %}
     uses: ./.github/workflows/_binary-test-linux.yml
     with:!{{ upload.binary_env_as_input(config) }}
       build_name: !{{ config["build_name"] }}
 @ -96,7 +101,41 @@ jobs:
       {%- endif %}
     secrets:
       github-token: ${{ secrets.GITHUB_TOKEN }}
 {%- else %}
     {%- elif config["gpu_arch_type"] == "xpu" %}
     runs-on: linux.idc.xpu
     timeout-minutes: !{{ common.timeout_minutes }}
     !{{ upload.binary_env(config) }}
     permissions:
       id-token: write
       contents: read
     steps:
       - name: Setup XPU
         uses: ./.github/actions/setup-xpu
       - name: configure aws credentials
         id: aws_creds
         uses: aws-actions/configure-aws-credentials@v1.7.0
         with:
           role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
           aws-region: us-east-1
       - name: Login to Amazon ECR
         id: login-ecr
         uses: aws-actions/amazon-ecr-login@v2
       - uses: !{{ common.download_artifact_action }}
         name: Download Build Artifacts
         with:
           name: !{{ config["build_name"] }}
           path: "${{ runner.temp }}/artifacts/"
       !{{ common.checkout(deep_clone=False, directory="pytorch") }}
       !{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
       - name: Pull Docker image
         uses: pytorch/test-infra/.github/actions/pull-docker-image@main
         with:
           docker-image: !{{ config["container_image"] }}
       - name: Test Pytorch binary
         uses: ./pytorch/.github/actions/test-pytorch-binary
       - name: Teardown XPU
         uses: ./.github/actions/teardown-xpu
     {%- else %}
     runs-on: linux.rocm.gpu
     timeout-minutes: !{{ common.timeout_minutes }}
     !{{ upload.binary_env(config) }}
 @ -121,7 +160,8 @@ jobs:
         uses: ./pytorch/.github/actions/test-pytorch-binary
       - name: Teardown ROCm
         uses: ./.github/actions/teardown-rocm
 {%- endif %}
     {%- endif %}
   {%- endif %}
 {%- if branches == "nightly" %}
   !{{ upload.upload_binaries(config) }}

8 .github/templates/upload.yml.j2

 @ -30,6 +30,9 @@
   {%- if config["devtoolset"] %}
       DESIRED_DEVTOOLSET: !{{ config["devtoolset"] }}
   {%- endif %}
   {%- if config.use_split_build is defined %}
       use_split_build: !{{ config["use_split_build"] }}
   {%- endif %}
 {%- endif %}
 {%- if config["package_type"] == "libtorch" %}
   {%- if config["libtorch_config"] %}
 @ -44,6 +47,7 @@
       # without this value pip does not get installed for some reason
       DESIRED_PYTHON: "3.8"
   {%- endif %}
 {%- else %}
       DESIRED_PYTHON: "!{{ config["python_version"] }}"
 {%- endif %}
 @ -57,7 +61,11 @@
       id-token: write
       contents: read
 {%- if has_test %}
     {%- if config["gpu_arch_type"] == "cuda-aarch64" %}
     needs: !{{ config["build_name"] }}-build
     {%- else %}
     needs: !{{ config["build_name"] }}-test
     {%- endif %}
 {%- else %}
     needs: !{{ config["build_name"] }}-build
 {%- endif %}

									
7 .github/workflows/_bazel-build-test.yml
				@ -27,6 +27,11 @@ on:

				        type: string

				        description: |

				          A JSON description of what configs to run later on.

				      runner:

				        required: false

				        type: string

				        default: "linux.large"

				        description: Runner type

				env:

				  GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}

				@ -34,7 +39,7 @@ env:

				jobs:

				  filter:

				    if: github.repository_owner == 'pytorch'

				    runs-on: [self-hosted, linux.large]

				    runs-on: ${{ inputs.runner }}

				    outputs:

				      test-matrix: ${{ steps.filter.outputs.test-matrix }}

				      is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}

									
25 .github/workflows/_binary-build-linux.yml
				@ -12,10 +12,22 @@ on:

				        type: string

				        description: The build environment

				      runs_on:

				          required: false

				          default: linux.12xlarge

				          type: string

				          description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.

				        required: false

				        default: linux.12xlarge

				        type: string

				        description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.

				      timeout-minutes:

				        required: false

				        default: 210

				        type: number

				        description: timeout for the job

				      use_split_build:

				        description: |

				          [Experimental] Build a libtorch only wheel and build pytorch such that

				          are built from the libtorch wheel.

				        required: false

				        type: boolean

				        default: false

				      ALPINE_IMAGE:

				        required: false

				        type: string

				@ -78,7 +90,7 @@ on:

				jobs:

				  build:

				    runs-on: ${{ inputs.runs_on }}

				    timeout-minutes: 210

				    timeout-minutes: ${{ inputs.timeout-minutes }}

				    env:

				      PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}

				      BUILDER_ROOT: ${{ inputs.BUILDER_ROOT }}

				@ -105,6 +117,7 @@ jobs:

				      PR_NUMBER: ${{ github.event.pull_request.number }}

				      PYTORCH_FINAL_PACKAGE_DIR: /artifacts

				      SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				      USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				    steps:

				      - name: Make the env permanent during this workflow (but not the secrets)

				        shell: bash

				@ -132,6 +145,7 @@ jobs:

				            echo "PR_NUMBER=${{ env.PR_NUMBER }}"

				            echo "PYTORCH_FINAL_PACKAGE_DIR=${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"

				            echo "SHA1=${{ env.SHA1 }}"

				            echo "USE_SPLIT_BUILD=${{ env.use_split_build }}"

				          } >> "${GITHUB_ENV} }}"

				      - name: List the env

				@ -241,6 +255,7 @@ jobs:

				            -e PYTORCH_ROOT \

				            -e SKIP_ALL_TESTS \

				            -e PYTORCH_EXTRA_INSTALL_REQUIREMENTS \

				            -e USE_SPLIT_BUILD \

				            --tty \

				            --detach \

				            -v "${GITHUB_WORKSPACE}/pytorch:/pytorch" \

									
9 .github/workflows/_binary-test-linux.yml
				@ -63,6 +63,13 @@ on:

				        required: true

				        type: string

				        description: Hardware to run this job on. Valid values are linux.4xlarge, linux.4xlarge.nvidia.gpu, linux.arm64.2xlarge, and linux.rocm.gpu

				      use_split_build:

				        description: |

				          [Experimental] Build a libtorch only wheel and build pytorch such that

				          are built from the libtorch wheel.

				        required: false

				        type: boolean

				        default: false

				    secrets:

				      github-token:

				        required: true

				@ -97,6 +104,7 @@ jobs:

				      PR_NUMBER: ${{ github.event.pull_request.number }}

				      PYTORCH_FINAL_PACKAGE_DIR: /artifacts

				      SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				      USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				    steps:

				      - name: Make the env permanent during this workflow (but not the secrets)

				        shell: bash

				@ -124,6 +132,7 @@ jobs:

				            echo "PR_NUMBER=${{ env.PR_NUMBER }}"

				            echo "PYTORCH_FINAL_PACKAGE_DIR=${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"

				            echo "SHA1=${{ env.SHA1 }}"

				            echo "USE_SPLIT_BUILD=${{ env.USE_SPLIT_BUILD }}"

				          } >> "${GITHUB_ENV} }}"

				      - name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"

									
8 .github/workflows/_binary-upload.yml
				@ -55,6 +55,13 @@ on:

				        required: false

				        type: string

				        description: Desired python version

				      use_split_build:

				        description: |

				          [Experimental] Build a libtorch only wheel and build pytorch such that

				          are built from the libtorch wheel.

				        required: false

				        type: boolean

				        default: false

				    secrets:

				      github-token:

				        required: true

				@ -93,6 +100,7 @@ jobs:

				      PR_NUMBER: ${{ github.event.pull_request.number }}

				      PYTORCH_FINAL_PACKAGE_DIR: /artifacts

				      SHA1: ${{ github.event.pull_request.head.sha || github.sha }}

				      USE_SPLIT_BUILD: ${{ inputs.use_split_build }}

				    steps:

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

									
8 .github/workflows/_linux-build-label.yml
				@ -56,6 +56,13 @@ on:

				        required: false

				        type: string

				        default: ""

				      use_split_build:

				        description: |

				          [Experimental] Build a libtorch only wheel and build pytorch such that

				          are built from the libtorch wheel.

				        required: false

				        type: boolean

				        default: false

				    secrets:

				      HUGGING_FACE_HUB_TOKEN:

				        required: false

				@ -107,3 +114,4 @@ jobs:

				          aws-role-to-assume: ${{ inputs.aws-role-to-assume }}

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

				          use_split_build: ${{ inputs.use_split_build }}

									
105 .github/workflows/_linux-build-rg.yml
				@ -1,105 +0,0 @@

				name: linux-build-rg

				on:

				  workflow_call:

				    inputs:

				      build-environment:

				        required: true

				        type: string

				        description: Top-level label for what's being built/tested.

				      docker-image-name:

				        required: true

				        type: string

				        description: Name of the base docker image to build with.

				      build-generates-artifacts:

				        required: false

				        type: boolean

				        default: true

				        description: If set, upload generated build artifacts.

				      build-with-debug:

				        required: false

				        type: boolean

				        default: false

				        description: If set, build in debug mode.

				      sync-tag:

				        required: false

				        type: string

				        default: ""

				        description: |

				          If this is set, our linter will use this to make sure that every other

				          job with the same `sync-tag` is identical.

				      cuda-arch-list:

				        required: false

				        type: string

				        default: "5.2"

				        description: |

				          List of CUDA architectures CI build should target.

				      runner-group:

				        required: false

				        type: string

				        default: "arc-lf-linux.2xlarge"

				        description: Runner group to select group type

				      test-matrix:

				        required: false

				        type: string

				        description: |

				          An option JSON description of what test configs to run later on. This

				          is moved here from the Linux test workflow so that we can apply filter

				          logic using test-config labels earlier and skip unnecessary builds

				      s3-bucket:

				        description: S3 bucket to download artifact

				        required: false

				        type: string

				        default: "gha-artifacts"

				      aws-role-to-assume:

				        description: role to assume for downloading artifacts

				        required: false

				        type: string

				        default: ""

				    secrets:

				      HUGGING_FACE_HUB_TOKEN:

				        required: false

				        description: |

				          HF Auth token to avoid rate limits when downloading models or datasets from hub

				    outputs:

				      docker-image:

				        value: ${{ jobs.build.outputs.docker-image }}

				        description: The docker image containing the built PyTorch.

				      test-matrix:

				        value: ${{ jobs.build.outputs.test-matrix }}

				        description: An optional JSON description of what test configs to run later on.

				jobs:

				  build:

				    # Don't run on forked repos

				    if: github.repository_owner == 'pytorch'

				    runs-on:

				      group: ${{ inputs.runner-group }}

				    timeout-minutes: 240

				    outputs:

				      docker-image: ${{ steps.linux-build.outputs.docker-image }}

				      test-matrix: ${{ steps.linux-build.outputs.test-matrix }}

				    steps:

				      # [pytorch repo ref]

				      # Use a pytorch/pytorch reference instead of a reference to the local

				      # checkout because when we run this action we don't *have* a local

				      # checkout. In other cases you should prefer a local checkout.

				      - name: Checkout PyTorch

				        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main

				      - name: Linux Build

				        id: linux-build

				        uses: ./.github/actions/linux-build

				        with:

				          build-environment: ${{ inputs.build-environment }}

				          docker-image-name: ${{ inputs.docker-image-name }}

				          build-generates-artifacts: ${{ inputs.build-generates-artifacts }}

				          build-with-debug: ${{ inputs.build-with-debug }}

				          sync-tag: ${{ inputs.sync-tag }}

				          cuda-arch-list: ${{ inputs.cuda-arch-list }}

				          test-matrix: ${{ inputs.test-matrix }}

				          s3-bucket: ${{ inputs.s3-bucket }}

				          aws-role-to-assume: ${{ inputs.aws-role-to-assume }}

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

Compare commits

2642 Commits whc/stage2 ... switch-bn

5 .ci/docker/aotriton_version.txt Normal file

103 .ci/docker/build.sh

10 .ci/docker/centos-rocm/Dockerfile

2 .ci/docker/ci_commit_pins/executorch.txt

1 .ci/docker/ci_commit_pins/halide.txt Normal file

2 .ci/docker/ci_commit_pins/triton-rocm.txt

2 .ci/docker/ci_commit_pins/triton-xpu.txt

2 .ci/docker/ci_commit_pins/triton.txt

5 .ci/docker/common/install_amdsmi.sh Normal file

23 .ci/docker/common/install_aotriton.sh Executable file

2 .ci/docker/common/install_base.sh

2 .ci/docker/common/install_conda.sh

17 .ci/docker/common/install_cudnn.sh

14 .ci/docker/common/install_executorch.sh

46 .ci/docker/common/install_halide.sh Normal file

8 .ci/docker/common/install_onnx.sh

6 .ci/docker/common/install_rocm.sh

10 .ci/docker/requirements-ci.txt

12 .ci/docker/ubuntu-cuda/Dockerfile

12 .ci/docker/ubuntu-rocm/Dockerfile

8 .ci/docker/ubuntu/Dockerfile

37 .ci/pytorch/build.sh

46 .ci/pytorch/common_utils.sh

1 .ci/pytorch/create_test_cert.py

37 .ci/pytorch/install_cache_xla.sh Executable file

5 .ci/pytorch/multigpu-test.sh

1 .ci/pytorch/perf_test/compare_with_baseline.py

1 .ci/pytorch/perf_test/get_stats.py

1 .ci/pytorch/perf_test/update_commit_hash.py

1 .ci/pytorch/print_sccache_log.py

152 .ci/pytorch/test.sh

1 .ci/pytorch/win-test-helpers/run_python_nn_smoketests.py

1 .circleci/codegen_validation/normalize_yaml_fragment.py

32 .circleci/scripts/binary_linux_test.sh

51 .circleci/scripts/binary_populate_env.sh

9 .circleci/scripts/binary_upload.sh

1 .circleci/scripts/trigger_azure_pipeline.py

2 .clang-tidy

2 .flake8

4 .git-blame-ignore-revs

31 .github/actionlint.yaml vendored

6 .github/actions/diskspace-cleanup/action.yml vendored

3 .github/actions/filter-test-configs/action.yml vendored

21 .github/actions/linux-build/action.yml vendored

11 .github/actions/test-pytorch-binary/action.yml vendored

2 .github/ci_commit_pins/audio.txt vendored

2 .github/ci_commit_pins/torchbench.txt vendored

2 .github/ci_commit_pins/xla.txt vendored

281 .github/lf-canary-scale-config.yml vendored Normal file

281 .github/lf-scale-config.yml vendored Normal file

34 .github/merge_rules.yaml vendored

3 .github/pytorch-probot.yml vendored

2 .github/requirements-gha-cache.txt vendored

3 .github/requirements/conda-env-Linux-X64.txt vendored

3 .github/requirements/conda-env-iOS.txt vendored

2 .github/requirements/conda-env-macOS-ARM64 vendored

2 .github/requirements/conda-env-macOS-X64 vendored

2 .github/requirements/pip-requirements-iOS.txt vendored

6 .github/requirements/pip-requirements-macOS.txt vendored

2 .github/scripts/amd/package_triton_wheel.sh vendored

31 .github/scripts/build_triton_wheel.py vendored

1 .github/scripts/check_labels.py vendored

116 .github/scripts/cherry_pick.py vendored

1 .github/scripts/close_nonexistent_disable_issues.py vendored

2 .github/scripts/collect_ciflow_labels.py vendored

1 .github/scripts/convert_lintrunner_annotations_to_github.py vendored

59 .github/scripts/delete_old_branches.py vendored

52 .github/scripts/docathon-label-sync.py vendored Normal file

BIN .github/scripts/drci_mocks.json.gz vendored

1 .github/scripts/ensure_actions_will_cancel.py vendored

1 .github/scripts/export_pytorch_labels.py vendored

1 .github/scripts/filter_test_configs.py vendored

127 .github/scripts/generate_binary_build_matrix.py vendored

14 .github/scripts/generate_ci_workflows.py vendored

1 .github/scripts/generate_docker_release_matrix.py vendored

2 .github/scripts/generate_pytorch_version.py vendored

1 .github/scripts/get_workflow_job_id.py vendored

99 .github/scripts/get_workflow_type.py vendored

10 .github/scripts/github_utils.py vendored