b63bbe1661
Remove old ROCm version check in tests (#164245)

This PR removes ROCm<6 version checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245
Approved by: https://github.com/jeffdaily
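
A minimal sketch of the kind of stale guard this PR deletes (hypothetical test and version helper, not the actual PyTorch utilities): skipping a test on ROCm older than 6.0, a version CI no longer runs.

```python
import unittest

import torch

def rocm_version():
    # torch.version.hip is a string like "6.2.41133" on ROCm builds, None otherwise
    if torch.version.hip is None:
        return None
    major, minor = torch.version.hip.split(".")[:2]
    return (int(major), int(minor))

class MyForeachTest(unittest.TestCase):
    @unittest.skipIf(
        rocm_version() is not None and rocm_version() < (6, 0),
        "requires ROCm >= 6.0",  # obsolete once CI only runs ROCm >= 6
    )
    def test_something(self):
        self.assertTrue(True)
```
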
2025-10-06 22:42:01 +00:00

af10f1f86c
Fix requires_cuda to requires_cuda_and_triton (#160222)

Fixes #159399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222
Approved by: https://github.com/janeyx99

2025-08-10 07:05:52 +00:00

5a0926a26e
Stop skipping entire foreach tests, just skip the profiler portion (#156871)

Instead of skipping the whole test as the CUPTI team figures out what is wrong, let's temporarily skip the profiler check portion. It is high priority to add it back to ensure foreach ops are actually performant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156871
Approved by: https://github.com/albanD
ghstack dependencies: #156876
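
A minimal sketch of the shape of this change (hypothetical helper, not the actual test code): run the op unconditionally, and gate only the profiler-based fast-path assertion behind a flag.

```python
import torch

def run_and_maybe_check_fastpath(fn, tensors, check_profiler=True):
    if not check_profiler:  # temporarily disabled while CUPTI is investigated
        fn(tensors)
        return
    with torch.profiler.profile() as prof:
        fn(tensors)
    assert any(
        "multi_tensor_apply" in evt.name for evt in prof.events()
    ), "expected the fused multi_tensor_apply kernel on the fast path"
```
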
2025-06-27 22:35:34 +00:00

50b2069b61
Move out super large one-off foreach_copy test (#156876)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156876
Approved by: https://github.com/albanD, https://github.com/jeffdaily

2025-06-26 06:02:38 +00:00

4ee4863232
Fix #156261 _foreach_copy indexing (#156719)

Fixes #156261
Thanks to @ngimel's fast eyes.
For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 elements was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with the new changes and fails without.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
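
A sketch of the kind of local-only check described above (too expensive for CI): a tensor with more than 2**31 elements exercises 64-bit indexing, and with the bug the tail of the copy would be wrong.

```python
import torch

if torch.cuda.is_available():
    src = torch.ones(2**31 + 1, dtype=torch.int8, device="cuda")
    dst = torch.zeros_like(src)
    torch._foreach_copy_([dst], [src])
    # With 32-bit indexing, elements past 2**31 would be left untouched.
    assert dst[-1].item() == 1
```
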
2025-06-24 21:58:44 +00:00

c199a4d0fd
Move non-inductor workflows from CUDA 12.6 to CUDA 12.8 (#155234)

Move non-inductor workflows from CUDA 12.6 to CUDA 12.8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155234
Approved by: https://github.com/Skylion007, https://github.com/zxiiro, https://github.com/cyyever, https://github.com/malfet

2025-06-12 12:42:34 +00:00

94da4523ec
Disable foreach tests that depend on profiler for CUDA 12.6 (#155596)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155596
Approved by: https://github.com/clee2000, https://github.com/malfet

2025-06-10 22:21:06 +00:00

4979ca5ffa
Synchronize in foreach tests after profiling (#152857)

After the CI change from 12.4 -> 12.6 around mid-March, the foreach tests have been flaky and hard to repro due to nondeterminism. Per @davidberard98's suggestion, let's try adding a synchronize before checking profiler results to see whether this fixes the flake! The hope is that the 48 currently open foreach flaky issues will close from this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152857
Approved by: https://github.com/davidberard98
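
A minimal sketch of the fix, assuming a CUDA device: synchronize so all kernel activity is flushed before the profiler's record of events is inspected.

```python
import torch

tensors = [torch.ones(4, device="cuda")]
with torch.profiler.profile() as prof:
    torch._foreach_add_(tensors, 1.0)
torch.cuda.synchronize()  # drain outstanding work before reading events
names = [evt.name for evt in prof.events()]
```
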
2025-05-06 00:56:48 +00:00

a9ee797e41
Add fake tensor support for foreach_copy (#149127)

Fixes #149111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149127
Approved by: https://github.com/jansel, https://github.com/jeromean
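
A usage sketch of what this enables (assuming the post-fix behavior): tracing `_foreach_copy_` under `FakeTensorMode`, where tensors carry metadata but no real storage.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    dst = [torch.empty(8), torch.empty(8)]
    src = [torch.ones(8), torch.ones(8)]
    torch._foreach_copy_(dst, src)  # previously unsupported; see #149111
```
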
2025-03-27 09:26:23 +00:00

a0bc6d81bb
[CI][CUDA] Move away from cuda12.4, add cuda12.6 eager CI tests (#148602)

https://github.com/pytorch/pytorch/issues/145570
Breaking https://github.com/pytorch/pytorch/pull/140793/ into eager and inductor benchmarks to unblock.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148602
Approved by: https://github.com/atalman, https://github.com/malfet
Co-authored-by: atalman <atalman@fb.com>

2025-03-07 00:15:04 +00:00

c73a92fbf5
[BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)

Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet

2025-02-27 20:46:16 +00:00

086d146f6f
Update ruff linter for PEP585 (#147540)

This turns on PEP585 enforcement in RUFF.
- Updates the target Python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
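
An illustration of the style UP006 enforces: PEP 585 lets the builtin container types be used as generics, replacing their `typing` counterparts.

```python
# Before (flagged by UP006):
#   from typing import Dict, List
#   def tally(xs: List[int]) -> Dict[str, int]: ...

# After (PEP 585 builtin generics):
def tally(xs: list[int]) -> dict[str, int]:
    return {"total": sum(xs)}
```
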
2025-02-22 04:45:17 +00:00

d8c8ba2440
Fix unused Python variables in test/[e-z]* (#136964)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD

2024-12-18 23:02:30 +00:00

446ea2aea5
pow: fix meta function output argument dtype check (#140287)

Tracking issue: #138399
This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with its Python ref implementation. The following example shows the inconsistency between the two:
```python
def run(device):
    S = (5,)
    a = torch.rand(S, device=device, dtype=torch.float32)
    b = 2
    out = torch.empty(S, device=device, dtype=torch.float64)
    return torch.pow(a, b, out=out)

>>> run("cpu")
Traceback (most recent call last):
  File "test.py", line 34, in run
    return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float

>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```
**~Update:~**
~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed two further modifications:~
- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~
**Update:**
Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and, after the discussion below, more correct. The solution was to change the `TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function, setting:
- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
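
A sketch of the behavior after the fix, assuming the meta kernel now mirrors eager: the same dtype check fires on the meta device.

```python
import torch

a = torch.rand((5,), device="meta", dtype=torch.float32)
out = torch.empty((5,), device="meta", dtype=torch.float64)
try:
    torch.pow(a, 2, out=out)
except RuntimeError as e:
    # Now consistent with eager: "Found dtype Double but expected Float"
    print(e)
```
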
2024-11-20 13:28:47 +00:00

cb71bcc542
Replace clone.detach with detach.clone (#140264)

Fixes #64532
As stated in the issue, replace `clone.detach` with `detach.clone`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
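
Why the ordering matters, in brief: `detach()` first yields a view with no autograd history, so the subsequent `clone()` copies only data; `clone()` first records the copy in the autograd graph and only then detaches, doing avoidable bookkeeping.

```python
import torch

t = torch.randn(3, requires_grad=True)
snapshot = t.detach().clone()  # preferred: clone a tensor already outside autograd
assert not snapshot.requires_grad
assert torch.equal(snapshot, t.detach())
```
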
2024-11-13 07:01:02 +00:00

6a368b3fc5
Add ScalarList overload to _foreach_lerp (#134482)

Related:
- https://github.com/pytorch/pytorch/issues/133367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482
Approved by: https://github.com/janeyx99
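
A usage sketch of the new overload, assuming the post-PR signature: one scalar weight per tensor pair, rather than a single shared scalar or a list of weight tensors.

```python
import torch

starts = [torch.zeros(3), torch.zeros(3)]
ends = [torch.ones(3), torch.ones(3)]
out = torch._foreach_lerp(starts, ends, [0.25, 0.75])  # ScalarList of weights
assert out[0][0].item() == 0.25 and out[1][0].item() == 0.75
```
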
2024-11-12 19:03:41 +00:00

92fb1f79b8
[BE] Test interspersed empty tensors for _foreach_norm test parity (#140191)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140191
Approved by: https://github.com/jbschlosser

2024-11-12 15:35:06 +00:00

71d8bb7ede
Implement torch._foreach_rsqrt (#134574)

Related:
- #133367
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574
Approved by: https://github.com/eqy, https://github.com/janeyx99
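
A usage sketch: elementwise 1/sqrt(x) across a whole list of tensors, matching per-tensor `rsqrt`.

```python
import torch

xs = [torch.rand(4) + 0.1, torch.rand(8) + 0.1]  # keep inputs positive
ys = torch._foreach_rsqrt(xs)
assert all(torch.allclose(y, x.rsqrt()) for x, y in zip(xs, ys))
```
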
2024-11-12 15:34:35 +00:00

1cdaf1d85f
Correctly keep track of processed tensors for foreach reductions (#140103)

Fixes #140066
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103
Approved by: https://github.com/janeyx99
Co-authored-by: Jane Xu <janeyx@meta.com>

2024-11-08 23:04:53 +00:00

49723a8ff3
Fix stride comparison failure when a size value equals one in ForeachUtils.h (#134546)

When a dimension's size equals one, its stride is irrelevant to memory layout and needs to be skipped in the comparison.
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134546
Approved by: https://github.com/janeyx99
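
An illustration of the corner case: when a dimension has size 1, its stride carries no layout information, so two tensors like these describe the same effective memory layout and should still be treated as matching.

```python
import torch

a = torch.randn(1, 4)                          # strides (4, 1)
b = torch.empty(4).as_strided((1, 4), (7, 1))  # size-1 dim with arbitrary stride
# Same shape and effective layout; the size-1 dim's stride (7 vs 4) must be
# ignored when comparing strides for the foreach fast path.
torch._foreach_add_([a, b], 1.0)
```
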
2024-09-19 18:43:41 +00:00

a0d0c6b7e6
Use torch.equal in test_foreach_copy_with_multi_dtypes (#134861)

`self.assertEqual` allows some tolerance, but here we want to show that `_foreach_copy_` gives bitwise-equivalent results. Let us use `torch.equal` then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134861
Approved by: https://github.com/Skylion007, https://github.com/janeyx99, https://github.com/crcrpar
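
The distinction being relied on, in two lines: `torch.equal` requires identical shape and bitwise-identical values, while `assertEqual` compares within tolerances.

```python
import torch

a = torch.tensor([1.0])
b = a + 1e-7
assert not torch.equal(a, b)  # bitwise comparison: any difference fails
# self.assertEqual(a, b) would still pass under default tolerances
```
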
2024-08-30 18:04:41 +00:00

e21d7b77ce
Update ForeachFuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)

This lets `test_meta.py` see more "PASSED" instead of "XFAIL".
`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99

2024-08-30 17:30:50 +00:00

55236d0cb7
TestForeach::test_parity: Remove check for error message text (#134251)

Previously, error messages were expected to be string-equivalent to error messages thrown by the ref function. This check fails for dozens of torch functions and doesn't appear to add much value for the end user. This commit removes this check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253, #134344
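
A sketch of the relaxed parity check (hypothetical helper, not the actual test code): require the foreach op and the reference to raise the same exception type, without comparing message text.

```python
def assert_same_error_type(foreach_fn, ref_fn, *args):
    try:
        foreach_fn(*args)
    except Exception as foreach_exc:
        try:
            ref_fn(*args)
        except Exception as ref_exc:
            # Same exception class is enough; messages may legitimately differ.
            assert type(foreach_exc) is type(ref_exc)
            return
        raise AssertionError("only the foreach op raised")
    ref_fn(*args)  # neither should raise if the foreach op succeeded
```
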
2024-08-26 22:40:54 +00:00

4226ed1585
[BE] Format uncategorized Python files with ruff format (#132576)

Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #132574

2024-08-04 17:13:31 +00:00

ba48cf6535
[BE][Easy][6/19] enforce style for empty lines in import segments in test/ (#129757)

See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.
You can review these PRs via:
```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757
Approved by: https://github.com/ezyang

2024-07-17 06:42:37 +00:00

abe81d5d05
Fix the rest of foreach flakers (#130277)

Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect.
Regardless, this makes the foreach tests more robust against future disruptions anyway. The fix is similar in flavor to https://github.com/pytorch/pytorch/pull/129003.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277
Approved by: https://github.com/soulitzer

2024-07-09 02:08:21 +00:00

adc14adb88
Fix flakiness with test_binary_op_list_error_cases (#129003)

So how come this PR fixes any flakiness?
Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky.
Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following https://github.com/pytorch/pytorch/pull/119408. And yes, this test checked for exact error message matching, which would no longer match since the stacktrace for a foreach function is obviously going to be different from a non-foreach one.
So we improve the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003
Approved by: https://github.com/soulitzer

2024-06-20 21:48:22 +00:00

35c78668b4
Improve the debugging message for when foreach mta_called (#128991)

The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern:
- a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called.
- then, a later test fails deterministically, usually failing to compare two results.
```
================== 1 failed, 241 deselected, 2 rerun in 1.76s ==================
Got exit code 1
Stopping at first consistent failure
The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16']
The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16']
```
So my suspicion is that the first causes the second, but what causes the first? I don't know! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to reproduce this flakiness locally.
Also undo the useless changes in #128220, which are actually redundant, as Joel and I realized that we set the seed during the setUp of every test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991
Approved by: https://github.com/clee2000
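
A sketch of the improved assertion (hypothetical helper): include what the profiler actually recorded in the failure message.

```python
def assert_mta_called(prof):
    names = [evt.name for evt in prof.events()]
    assert any("multi_tensor_apply" in n for n in names), (
        f"multi_tensor_apply kernel not found; profiler saw: {names}"
    )
```
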
2024-06-19 00:25:09 +00:00

8c20f53a5e
Try seeding individual foreach tests (#128220)

A first easy attempt to deflake foreach.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220
Approved by: https://github.com/ZainRizvi, https://github.com/crcrpar, https://github.com/huydhn

2024-06-13 22:42:16 +00:00

2fa6f80b13
Perform reciprocal optimization with foreach_div (#128433)

Fixes https://github.com/pytorch/pytorch/issues/114165
Internal xref:
https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433
Approved by: https://github.com/awgu
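
A sketch of the rewrite, assuming a scalar divisor: dividing every tensor by `s` becomes a multiply by the precomputed reciprocal, which is cheaper than a divide on most hardware.

```python
import torch

xs = [torch.randn(4), torch.randn(8)]
s = 3.0
fast = torch._foreach_mul(xs, 1.0 / s)  # reciprocal rewrite
ref = torch._foreach_div(xs, s)
assert all(torch.allclose(f, r) for f, r in zip(fast, ref))
```
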
2024-06-12 22:57:03 +00:00

ac60bdaf01
Allow slow foreach to run for any backend, not just CPU (#127412)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412
Approved by: https://github.com/albanD

2024-06-01 13:58:18 +00:00

df53cc7114
[reland] "[reland] _foreach_copy with different src/dst dtypes" (#127186)

Fixes #115171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127186
Approved by: https://github.com/ezyang
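
A usage sketch of what the reland enables: `_foreach_copy_` casting on the fly when source and destination dtypes differ.

```python
import torch

dst = [torch.zeros(4, dtype=torch.float32)]
src = [torch.ones(4, dtype=torch.bfloat16)]
torch._foreach_copy_(dst, src)  # bf16 -> fp32 copy; see #115171
assert dst[0].dtype == torch.float32 and dst[0][0].item() == 1.0
```
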
2024-06-01 01:25:10 +00:00

05e99154ee
Allow int vals to go down the fastpath for _foreach_max (#127303)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303
Approved by: https://github.com/albanD
ghstack dependencies: #127187

2024-05-29 19:08:58 +00:00

601c5e085d
Add _foreach_max (#127187)

This PR adds _foreach_max support, the second reduction foreach op we have :D
I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed, as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first.
Caveats!
- We do not fast path if the shapes, dtypes, device (the regular shebang for foreach) are not met. We fall back to the slow path!
- MORE IMPORTANTLY, we also do not fast path for int8, int16, and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
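
A usage sketch: one max reduction per tensor in the list, returning a list of zero-dim tensors.

```python
import torch

xs = [torch.tensor([1.0, 5.0]), torch.tensor([2.0, -3.0])]
maxes = torch._foreach_max(xs)
assert [m.item() for m in maxes] == [5.0, 2.0]
```
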
2024-05-29 19:08:58 +00:00

96bdb7a0fb
In test_foreach.py, patch KINETO_LOG_LEVEL to silence profiler log (#126048)

As per the title, `patch.dict` the env var in favor of cleaner logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126048
Approved by: https://github.com/janeyx99
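
A minimal sketch of the approach (the exact level value here is an assumption): `patch.dict` scopes the env var to the block and restores the previous value afterward.

```python
import os
from unittest.mock import patch

with patch.dict(os.environ, {"KINETO_LOG_LEVEL": "5"}):
    ...  # run the profiler-heavy test body here; Kineto logs stay quiet
```
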
2024-05-13 15:31:56 +00:00

98821b3d92
Disable various flaky tests in test_foreach (#125783)

* Similar to #125046
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125783
Approved by: https://github.com/huydhn

2024-05-09 18:08:39 +00:00

aa7be72cc5
Convert ForeachFuncInfo to dataclass (#125001)

- `ForeachFuncInfo` to `dataclass` for a smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99, https://github.com/jeffdaily

2024-05-02 04:19:09 +00:00

75fa54a9d1
Revert "Convert ForeachFuncInfo to dataclass (#125001)"

This reverts commit 9466335ae4cb049efd3f4c2b32b2115ba00694f3.
Reverted https://github.com/pytorch/pytorch/pull/125001 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is breaking on ROCm 9466335ae4 ([comment](https://github.com/pytorch/pytorch/pull/125001#issuecomment-2086640674))

2024-04-30 19:05:53 +00:00

9466335ae4
Convert ForeachFuncInfo to dataclass (#125001)

- `ForeachFuncInfo` to `dataclass` for a smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99

2024-04-30 16:19:42 +00:00

a68a8c0f6b
Disable test_binary_op_list_error_cases in test_foreach (#125046)

It's really flaky; for example:
* https://github.com/pytorch/pytorch/issues/124636
* https://github.com/pytorch/pytorch/issues/124529
and there are more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125046
Approved by: https://github.com/huydhn

2024-04-26 21:25:38 +00:00

1f89bf4188
Revert "[reland] _foreach_copy with different src/dst dtypes (#123844)"

This reverts commit ff1e3ff5a503a520c1a310c8e72a383657f9a4bc.
Reverted https://github.com/pytorch/pytorch/pull/123844 on behalf of https://github.com/malfet due to Perhaps it enabled it for different dtype, but broke for the same ([comment](https://github.com/pytorch/pytorch/pull/123844#issuecomment-2059861767))

2024-04-16 20:23:14 +00:00

ff1e3ff5a5
[reland] _foreach_copy with different src/dst dtypes (#123844)

Attempt to reland https://github.com/pytorch/pytorch/pull/121717.
The change is the array bounds check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123844
Approved by: https://github.com/janeyx99

2024-04-16 02:20:58 +00:00

1d6c5972c1
[BE]: Optimize min/max/sum comprehensions C419 (#123960)

Automatic fixes that replace certain list comprehensions with generator ones where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419, and it was automatically applied.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
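
What rule C419 rewrites: a list comprehension passed to `sum`/`min`/`max`/`any`/`all` materializes an intermediate list that is consumed exactly once; a generator expression avoids it.

```python
xs = range(10)
total_list = sum([x * x for x in xs])  # flagged by C419: builds a throwaway list
total_gen = sum(x * x for x in xs)     # preferred: consumed lazily
assert total_list == total_gen
```
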
2024-04-12 23:54:15 +00:00

c3de2cc154
Enable UFMT on test/test_foreach.py (#123718)

Part of https://github.com/pytorch/pytorch/issues/123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123718
Approved by: https://github.com/ezyang

2024-04-10 18:22:12 +00:00

eb3a34d280
Optimize multi_tensor_apply (take 2) (#119764)

### Take 2
The first take (#119153) landed but was reverted because it broke CUDA graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with CUDA graph.
### Summary
Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4 KB kernel argument memory. Previously, with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory-bound ops.
Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.
This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
### Benchmark (WIP)
The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes; the speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**
The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).
**Baseline**
A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```
**This PR**
A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
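
A schematic of the dispatch described above, as Python pseudocode for the C++/CUDA logic (the 4 KB figure is CUDA's static kernel-argument limit; names are illustrative):

```python
KERNEL_ARG_LIMIT = 4096  # bytes of static kernel-argument memory

def choose_metadata_path(arg_blob: bytes) -> str:
    """Decide how the tensor-list metadata reaches the device."""
    if len(arg_blob) <= KERNEL_ARG_LIMIT:
        # Fast path: metadata rides along as ordinary kernel arguments.
        return "kernel-args"
    # Otherwise: stage metadata with a page-locked cudaMemcpyAsync and run
    # the entire workload in a single kernel, preserving occupancy.
    return "memcpy-async"
```
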
2024-04-03 05:54:49 +00:00

958dbb876c
Revert "_foreach_copy with different src/dst dtypes (#121717)"

This reverts commit da2a9a05127c2b44e447e734d99e727d856cb36f.
Reverted https://github.com/pytorch/pytorch/pull/121717 on behalf of https://github.com/janeyx99 due to Causing IMAs on V100s internally :C ([comment](https://github.com/pytorch/pytorch/pull/121717#issuecomment-2025553295))

2024-03-28 15:54:40 +00:00

bef01c7c2b
Revert "Optimize multi_tensor_apply (take 2) (#119764)"

This reverts commit fe41ba47652ca73569453bddb43605c77bb85184.
Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2024105399))

2024-03-27 22:42:07 +00:00

fe41ba4765
Optimize multi_tensor_apply (take 2) (#119764)

### Take 2
The first take (#119153) landed but was reverted because it broke CUDA graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with CUDA graph.
### Summary
Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4 KB kernel argument memory. Previously, with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory-bound ops.
Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.
This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
### Benchmark (WIP)
The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes; the speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**
The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).
**Baseline**
A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```
**This PR**
A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar

2024-03-27 00:51:30 +00:00

5e0440edb4
Revert "Optimize multi_tensor_apply (take 2) (#119764)"

This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0.
Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk 0b68a28c87. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))

2024-03-22 02:18:28 +00:00

0b68a28c87
Optimize multi_tensor_apply (take 2) (#119764)

### Take 2
The first take (#119153) landed but was reverted because it broke CUDA graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with CUDA graph.
### Summary
Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4 KB kernel argument memory. Previously, with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory-bound ops.
Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.
This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
### Benchmark (WIP)
The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes; the speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**
The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).
**Baseline**
A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```
**This PR**
A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar

2024-03-21 11:53:31 +00:00