pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
Yuanyuan Chen	b63bbe1661	Remove old ROCm version check in tests (#164245 ) This PR removes ROCm<6 version checks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245 Approved by: https://github.com/jeffdaily	2025-10-06 22:42:01 +00:00
ghostspiders	af10f1f86c	Fix requires_cuda to requires_cuda_and_triton (#160222 ) Fixes #159399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222 Approved by: https://github.com/janeyx99	2025-08-10 07:05:52 +00:00
Jane Xu	5a0926a26e	Stop skipping entire foreach tests, just skip the profiler portion (#156871 ) Instead of skipping the whole test as the CUPTI team figures out what is wrong, let's temporarily skip the profiler check portion. It is high pri to add it back to ensure foreach ops are actually performant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156871 Approved by: https://github.com/albanD ghstack dependencies: #156876	2025-06-27 22:35:34 +00:00
Jane Xu	50b2069b61	Move out super large one off foreach_copy test (#156876 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156876 Approved by: https://github.com/albanD, https://github.com/jeffdaily	2025-06-26 06:02:38 +00:00
Jane Xu	4ee4863232	Fix #156261 _foreach_copy indexing (#156719 ) Fixes #156261 Thanks to @ngimel's fast eyes For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 size was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with new changes and fails without. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719 Approved by: https://github.com/albanD	2025-06-24 21:58:44 +00:00
atalman	c199a4d0fd	Move non inductor workflows cuda 12.6->cuda 12.8 (#155234 ) Move non inductor workflows cuda 12.6->cuda 12.8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234 Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet	2025-06-12 12:42:34 +00:00
Jane Xu	94da4523ec	Disable foreach tests that depend on profiler for CUDA 12.6 (#155596 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155596 Approved by: https://github.com/clee2000, https://github.com/malfet	2025-06-10 22:21:06 +00:00
Jane Xu	4979ca5ffa	Synchronize in foreach tests after profiling (#152857 ) After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try to add a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857 Approved by: https://github.com/davidberard98	2025-05-06 00:56:48 +00:00
pralay	a9ee797e41	added fake tensor support for foreach_copy (#149127 ) Fixes #149111 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149127 Approved by: https://github.com/jansel, https://github.com/jeromean	2025-03-27 09:26:23 +00:00
Ting Lu	a0bc6d81bb	[CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests (#148602 ) https://github.com/pytorch/pytorch/issues/145570 breaking https://github.com/pytorch/pytorch/pull/140793/ into eager and inductor benchmarks to unblock Pull Request resolved: https://github.com/pytorch/pytorch/pull/148602 Approved by: https://github.com/atalman, https://github.com/malfet Co-authored-by: atalman <atalman@fb.com>	2025-03-07 00:15:04 +00:00
Xuehai Pan	c73a92fbf5	[BE][CI] bump `ruff` to 0.9.2: multiline `assert` statements (#144546 ) Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements > Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target: > > ```python > # Input > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > > # Black > assert ( > len(policy_types) >= priority + num_duplicates > ), f"This tests needs at least {priority+num_duplicates} many types." > > # Ruff > assert len(policy_types) >= priority + num_duplicates, ( > f"This tests needs at least {priority + num_duplicates} many types." > ) > ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546 Approved by: https://github.com/malfet	2025-02-27 20:46:16 +00:00
Aaron Orenstein	086d146f6f	Update ruff linter for PEP585 (#147540 ) This turns on PEP585 enforcement in RUFF. - Updates the target python version - Stops ignoring UP006 warnings (PEP585) - Fixes a few issues which crept into the tree in the last day Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540 Approved by: https://github.com/justinchuby, https://github.com/Skylion007	2025-02-22 04:45:17 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
Yukio Siraichi	446ea2aea5	`pow`: fix meta function output argument dtype check. (#140287 ) Tracking issue: #138399 This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with its Python ref implementation. The following example shows the inconsistency between the two: ```python def run(device): S = (5,) a = torch.rand(S, device=device, dtype=torch.float32) b = 2 out = torch.empty(S, device=device, dtype=torch.float64) return torch.pow(a, b, out=out) >>> run("cpu") Traceback (most recent call last): File "test.py", line 34, in run return torch.pow(a, b, out=out) RuntimeError: Found dtype Double but expected Float >>> run("meta") tensor(..., device='meta', size=(5,), dtype=torch.float64) ``` ~Update:~ ~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed further 2 modifications:~ - ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching output dtypes~ - ~Create a dispatch for `pow` when `_refs.pow()` is called~ Update: Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and, after the discussion below, more correct. The solution was to change the `TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function, setting: - `cast_common_dtype_to_outputs`; and - `enforce_safe_casting_to_output`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287 Approved by: https://github.com/ezyang	2024-11-20 13:28:47 +00:00
zeshengzong	cb71bcc542	Replace clone.detach with detach.clone (#140264 ) Fixes #64532 As stated in the issue, replace `clone.detach` with `detach.clone`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264 Approved by: https://github.com/soulitzer	2024-11-13 07:01:02 +00:00
Masaki Kozuki	6a368b3fc5	Add ScalarList overload to `_foreach_lerp` (#134482 ) Related: - https://github.com/pytorch/pytorch/issues/133367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482 Approved by: https://github.com/janeyx99	2024-11-12 19:03:41 +00:00
Jane Xu	92fb1f79b8	[BE] Test interspersed empty tensors for _foreach_norm test parity (#140191 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140191 Approved by: https://github.com/jbschlosser	2024-11-12 15:35:06 +00:00
Masaki Kozuki	71d8bb7ede	implement `torch._foreach_rsqrt` (#134574 ) Related: - #133367 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574 Approved by: https://github.com/eqy, https://github.com/janeyx99	2024-11-12 15:34:35 +00:00
Natalia Gimelshein	1cdaf1d85f	correctly keep track of processed tensors for foreach reductions (#140103 ) Fixes #140066 Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103 Approved by: https://github.com/janeyx99 Co-authored-by: Jane Xu <janeyx@meta.com>	2024-11-08 23:04:53 +00:00
Shan19900305	49723a8ff3	fix stride compare failed when size value equal to one in ForeachUtils.h (#134546 ) When a size value equals one, the corresponding tensor stride value should be skipped in the comparison. @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/134546 Approved by: https://github.com/janeyx99	2024-09-19 18:43:41 +00:00
Andrew Gu	a0d0c6b7e6	Used `torch.equal` in `test_foreach_copy_with_multi_dtypes` (#134861 ) `self.assertEqual` allows some tolerance, but here, we want to show that `_foreach_copy_` gives bitwise equivalent results. Let us use `torch.equal` then. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134861 Approved by: https://github.com/Skylion007, https://github.com/janeyx99, https://github.com/crcrpar	2024-08-30 18:04:41 +00:00
Masaki Kozuki	e21d7b77ce	Update `ForeachfuncInfo.sample_inputs_func` to yield scalars & scalarlists that are more friendly to test_meta (#134552 ) for `test_meta.py` to see more "PASSED" instead of "XFAIL". `pytest test_meta.py -k "_foreach_"` ran 6400 test cases and: - This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed - main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552 Approved by: https://github.com/janeyx99	2024-08-30 17:30:50 +00:00
Benjamin Glass	55236d0cb7	TestForeach::test_parity: Remove check for error message text (#134251 ) Previously, error messages were expected to be string equivalent to error messages thrown by the ref function. This check fails for dozens of torch functions, and doesn't appear to add much value for the end user. This commit removes this check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251 Approved by: https://github.com/amjames, https://github.com/janeyx99 ghstack dependencies: #134253, #134344	2024-08-26 22:40:54 +00:00
Xuehai Pan	4226ed1585	[BE] Format uncategorized Python files with `ruff format` (#132576 ) Remove patterns ``, `test/`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #132574	2024-08-04 17:13:31 +00:00
Xuehai Pan	ba48cf6535	[BE][Easy][6/19] enforce style for empty lines in import segments in `test/` (#129757 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757 Approved by: https://github.com/ezyang	2024-07-17 06:42:37 +00:00
Jane Xu	abe81d5d05	Fix the rest of foreach flakers (#130277 ) Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect. Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to https://github.com/pytorch/pytorch/pull/129003 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277 Approved by: https://github.com/soulitzer	2024-07-09 02:08:21 +00:00
Jane Xu	adc14adb88	Fix flakiness with test_binary_op_list_error_cases (#129003 ) So how come this PR fixes any flakiness? Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky. Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR https://github.com/pytorch/pytorch/pull/119408. And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach. So we improve the test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003 Approved by: https://github.com/soulitzer	2024-06-20 21:48:22 +00:00
Jane Xu	35c78668b4	Improve the debugging message for when foreach mta_called (#128991 ) The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern: - a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called. - then, a later test fails deterministically, usually failing to compare two results. ``` ================== 1 failed, 241 deselected, 2 rerun in 1.76s ================== Got exit code 1 Stopping at first consistent failure The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16'] The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16'] ``` So my suspicion is that the first causes the second, but what causes the first? Idk! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to repro this flakiness locally. Also undo the useless changes in #128220 which are actually redundant as Joel and I realized that we set the seed during the setUp of every test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991 Approved by: https://github.com/clee2000	2024-06-19 00:25:09 +00:00
Jane Xu	8c20f53a5e	Try seeding individual foreach tests (#128220 ) A first easy attempt to deflake foreach Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220 Approved by: https://github.com/ZainRizvi, https://github.com/crcrpar, https://github.com/huydhn	2024-06-13 22:42:16 +00:00
Edward Z. Yang	2fa6f80b13	Perform reciprocal optimization with foreach_div (#128433 ) Fixes https://github.com/pytorch/pytorch/issues/114165 Internal xref https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/ Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433 Approved by: https://github.com/awgu	2024-06-12 22:57:03 +00:00
Jane Xu	ac60bdaf01	Allow slow foreach to run for any backend, not just CPU (#127412 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412 Approved by: https://github.com/albanD	2024-06-01 13:58:18 +00:00
Masaki Kozuki	df53cc7114	[reland] "[reland] `_foreach_copy` with different src/dst dtypes" (#127186 ) Fixes #115171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127186 Approved by: https://github.com/ezyang	2024-06-01 01:25:10 +00:00
Jane Xu	05e99154ee	Allow int vals to go down the fastpath for _foreach_max (#127303 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303 Approved by: https://github.com/albanD ghstack dependencies: #127187	2024-05-29 19:08:58 +00:00
Jane Xu	601c5e085d	Add _foreach_max (#127187 ) This PR adds _foreach_max support, the second reduction foreach op we have :D I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first. Caveats! - We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath! - MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later. - This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187 Approved by: https://github.com/albanD	2024-05-29 19:08:58 +00:00
Masaki Kozuki	96bdb7a0fb	in `test_foreach.py` patch `KINETO_LOG_LEVEL` to silence profiler log (#126048 ) As per title, `patch.dict` the env var in favor of cleaner logs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126048 Approved by: https://github.com/janeyx99	2024-05-13 15:31:56 +00:00
Catherine Lee	98821b3d92	Disable various flaky tests in test_foreach (#125783 ) * Similar to #125046 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125783 Approved by: https://github.com/huydhn	2024-05-09 18:08:39 +00:00
Masaki Kozuki	aa7be72cc5	Convert `ForeachFuncInfo` to `dataclass` (#125001 ) - `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo` - `skips` to `decorators` and `skip` to `xfail` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001 Approved by: https://github.com/janeyx99, https://github.com/jeffdaily	2024-05-02 04:19:09 +00:00
PyTorch MergeBot	75fa54a9d1	Revert "Convert `ForeachFuncInfo` to `dataclass` (#125001 )" This reverts commit 9466335ae4cb049efd3f4c2b32b2115ba00694f3. Reverted https://github.com/pytorch/pytorch/pull/125001 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is breaking on ROCm `9466335ae4` ([comment](https://github.com/pytorch/pytorch/pull/125001#issuecomment-2086640674))	2024-04-30 19:05:53 +00:00
Masaki Kozuki	9466335ae4	Convert `ForeachFuncInfo` to `dataclass` (#125001 ) - `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo` - `skips` to `decorators` and `skip` to `xfail` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001 Approved by: https://github.com/janeyx99	2024-04-30 16:19:42 +00:00
Catherine Lee	a68a8c0f6b	Disable test_binary_op_list_error_cases in test_foreach (#125046 ) It's really flaky ex * https://github.com/pytorch/pytorch/issues/124636 * https://github.com/pytorch/pytorch/issues/124529 there are more Pull Request resolved: https://github.com/pytorch/pytorch/pull/125046 Approved by: https://github.com/huydhn	2024-04-26 21:25:38 +00:00
PyTorch MergeBot	1f89bf4188	Revert "[reland] `_foreach_copy` with different src/dst dtypes (#123844 )" This reverts commit ff1e3ff5a503a520c1a310c8e72a383657f9a4bc. Reverted https://github.com/pytorch/pytorch/pull/123844 on behalf of https://github.com/malfet due to Perhaps it enabled it for different dtype, but broke for the same ([comment](https://github.com/pytorch/pytorch/pull/123844#issuecomment-2059861767))	2024-04-16 20:23:14 +00:00
Masaki Kozuki	ff1e3ff5a5	[reland] `_foreach_copy` with different src/dst dtypes (#123844 ) Attempt to reland https://github.com/pytorch/pytorch/pull/121717. The change is the array bounds check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123844 Approved by: https://github.com/janeyx99	2024-04-16 02:20:58 +00:00
Aaron Gokaslan	1d6c5972c1	[BE]: Optimize min/max/sum comprehensions C419 (#123960 ) Automatic fixes that replaces certain list comprehensions with generator ones where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419 and it was automatically applied. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960 Approved by: https://github.com/malfet	2024-04-12 23:54:15 +00:00
statelesshz	c3de2cc154	Enable UFMT on test/test_foreach.py (#123718 ) Part of https://github.com/pytorch/pytorch/issues/123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123718 Approved by: https://github.com/ezyang	2024-04-10 18:22:12 +00:00
Yifu Wang	eb3a34d280	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far is on `_foreach_copy_`, with a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-04-03 05:54:49 +00:00
PyTorch MergeBot	958dbb876c	Revert "`_foreach_copy` with different src/dst dtypes (#121717 )" This reverts commit da2a9a05127c2b44e447e734d99e727d856cb36f. Reverted https://github.com/pytorch/pytorch/pull/121717 on behalf of https://github.com/janeyx99 due to Causing IMAs on V100s internally :C ([comment](https://github.com/pytorch/pytorch/pull/121717#issuecomment-2025553295))	2024-03-28 15:54:40 +00:00
PyTorch MergeBot	bef01c7c2b	Revert "Optimize multi_tensor_apply (take 2) (#119764 )" This reverts commit fe41ba47652ca73569453bddb43605c77bb85184. Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2024105399))	2024-03-27 22:42:07 +00:00
Yifu Wang	fe41ba4765	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far is on `_foreach_copy_`, with a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-03-27 00:51:30 +00:00
PyTorch MergeBot	5e0440edb4	Revert "Optimize multi_tensor_apply (take 2) (#119764 )" This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0. Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk `0b68a28c87`. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))	2024-03-22 02:18:28 +00:00
Yifu Wang	0b68a28c87	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far is on `_foreach_copy_`, with a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-03-21 11:53:31 +00:00
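
As a quick sanity check on the benchmark figures quoted in the #119764 message, the sketch below works out the speedups they imply. Only the six numbers come from the commit message. The variable names and the "data moved" estimate (bandwidth multiplied by device time) are assumptions made for illustration, and the ratios apply to that single `_foreach_copy_` workload only.

```python
# Figures quoted in the #119764 commit message (baseline vs. the PR).
baseline = {"device_ms": 1.111, "cpu_ms": 7.151, "bandwidth_gb_per_s": 1169.825}
with_pr = {"device_ms": 0.892, "cpu_ms": 0.810, "bandwidth_gb_per_s": 1456.744}

# Time ratios: how many times faster the PR run is than the baseline.
device_speedup = baseline["device_ms"] / with_pr["device_ms"]  # about 1.25x
cpu_speedup = baseline["cpu_ms"] / with_pr["cpu_ms"]           # about 8.8x

# Bandwidth ratio: should roughly mirror the device-time ratio.
bandwidth_gain = with_pr["bandwidth_gb_per_s"] / baseline["bandwidth_gb_per_s"]


def gigabytes_moved(run):
    """Assumed relation: data moved = bandwidth * device time (ms -> s)."""
    return run["bandwidth_gb_per_s"] * run["device_ms"] / 1000.0


print(f"device speedup:  {device_speedup:.3f}x")
print(f"cpu speedup:     {cpu_speedup:.3f}x")
print(f"bandwidth gain:  {bandwidth_gain:.3f}x")
print(f"data moved (GB): {gigabytes_moved(baseline):.4f} vs {gigabytes_moved(with_pr):.4f}")
```

Under that assumed relation both runs move about the same amount of data, which is consistent with the two traces measuring the same workload. The much larger CPU-side ratio fits the message's description of replacing multiple kernel launches with a single one.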

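The flakiness described in #129003 came from a test that required an exact error-message match, which breaks once TORCH_SHOW_CPP_STACKTRACES=1 appends a stack trace to the message. The plain-Python sketch below shows one possible way to make such a comparison tolerant of trailing trace text. The commit messages do not say this is what the PRs do (#134251 removes the text check altogether), and the helper names and sample messages are hypothetical rather than taken from `test_foreach.py`.

```python
def first_line(message: str) -> str:
    """Return the first non-empty line of an error message, stripped."""
    for line in message.splitlines():
        if line.strip():
            return line.strip()
    return ""


def same_error_summary(expected: Exception, actual: Exception) -> bool:
    """Compare the exception type and the first message line only."""
    return type(expected) is type(actual) and first_line(str(expected)) == first_line(str(actual))


# Hypothetical messages: the second carries an appended stack trace.
ref_error = RuntimeError("Tensor lists must have the same number of tensors")
foreach_error = RuntimeError(
    "Tensor lists must have the same number of tensors\n"
    "Exception raised from some_function at some_file.cpp:42 (hypothetical trace)"
)

# Exact string equality fails because of the extra trace lines.
exact_match = str(ref_error) == str(foreach_error)
# Comparing only the type and first line ignores the trace.
summary_match = same_error_summary(ref_error, foreach_error)

print("exact match:  ", exact_match)
print("summary match:", summary_match)
```

Comparing only the type and first line keeps the check meaningful while ignoring whatever trace text the environment adds.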
166 Commits