abe81d5d05
Fix the rest of foreach flakers ( #130277 )
...
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes caused when TORCH_SHOW_CPP_STACKTRACES=1 is set, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect.
Regardless, this makes the foreach tests more robust against future disruptions. The fix is similar in flavor to https://github.com/pytorch/pytorch/pull/129003
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277
Approved by: https://github.com/soulitzer
2024-07-09 02:08:21 +00:00
adc14adb88
Fix flakiness with test_binary_op_list_error_cases ( #129003 )
...
So how come this PR fixes any flakiness?
Well, following my investigation (read pt 1 in the linked ghstack PR below), I realized that this test only consistently errors after another test was found flaky.
Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following https://github.com/pytorch/pytorch/pull/119408 . And yes, this test checked for an exact error message match, which would no longer match since the stack trace for a foreach function is obviously different from that of a non-foreach function.
So we improve the test.
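A minimal sketch (not the actual test change; the message text is illustrative) of the direction described above: assert on the stable part of the error message so an appended C++ stack trace cannot break the match.
```python
import re

# With TORCH_SHOW_CPP_STACKTRACES=1, the C++ stack trace is appended to the message,
# so an exact string comparison breaks while a partial match stays robust.
msg = ("Tensor lists must have the same number of tensors, got 2 and 1\n"
       "Exception raised from check_foreach_api_restrictions at ... (C++ stack trace)")
assert re.search(re.escape("must have the same number of tensors"), msg)            # robust
assert msg != "Tensor lists must have the same number of tensors, got 2 and 1"      # exact match fails
```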
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003
Approved by: https://github.com/soulitzer
2024-06-20 21:48:22 +00:00
35c78668b4
Improve the debugging message for when foreach mta_called ( #128991 )
...
The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern:
- a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called.
- then, a later test fails deterministically, usually failing to compare two results.
```
================== 1 failed, 241 deselected, 2 rerun in 1.76s ==================
Got exit code 1
Stopping at first consistent failure
The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16']
The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16']
```
So my suspicion is that the first causes the second, but what causes the first? I don't know yet! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change helps mostly because I have not been able to repro this flakiness locally.
Also, undo the changes from #128220, which are actually redundant, as Joel and I realized that we set the seed during the setUp of every test.
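An illustrative sketch (not the exact test code) of surfacing what the profiler recorded when the mta_called assertion fails; it assumes a CUDA device and uses the public profiler API.
```python
import torch
from torch.profiler import ProfilerActivity, profile

if torch.cuda.is_available():
    tensors = [torch.randn(3, device="cuda") for _ in range(4)]
    with profile(activities=[ProfilerActivity.CUDA]) as p:
        torch._foreach_add_(tensors, 1.0)
    # Include the observed kernel names in the failure message instead of a bare False.
    keys = [evt.key for evt in p.key_averages()]
    mta_called = any("multi_tensor_apply_kernel" in k for k in keys)
    assert mta_called, f"multi_tensor_apply_kernel not found; profiler saw: {keys}"
```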
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991
Approved by: https://github.com/clee2000
2024-06-19 00:25:09 +00:00
8c20f53a5e
Try seeding individual foreach tests ( #128220 )
...
A first easy attempt to deflake the foreach tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220
Approved by: https://github.com/ZainRizvi , https://github.com/crcrpar , https://github.com/huydhn
2024-06-13 22:42:16 +00:00
2fa6f80b13
Perform reciprocal optimization with foreach_div ( #128433 )
...
Fixes https://github.com/pytorch/pytorch/issues/114165
Internal xref
https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/
Signed-off-by: Edward Z. Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433
Approved by: https://github.com/awgu
2024-06-12 22:57:03 +00:00
ac60bdaf01
Allow slow foreach to run for any backend, not just CPU ( #127412 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412
Approved by: https://github.com/albanD
2024-06-01 13:58:18 +00:00
df53cc7114
[reland] "[reland] _foreach_copy
with different src/dst dtypes" ( #127186 )
...
Fixes #115171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127186
Approved by: https://github.com/ezyang
2024-06-01 01:25:10 +00:00
05e99154ee
Allow int vals to go down the fastpath for _foreach_max ( #127303 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303
Approved by: https://github.com/albanD
ghstack dependencies: #127187
2024-05-29 19:08:58 +00:00
601c5e085d
Add _foreach_max ( #127187 )
...
This PR adds _foreach_max support, the second reduction foreach op we have :D
I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed, as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement because there is another, wrong max overload (the one that does take a dim for reduction) that keeps getting matched first.
Caveats!
- We do not fast path if the shapes, dtypes, device, i.e. the regular requirements for foreach, are not met. We fall back to the slow path!
- MORE IMPORTANTLY, we also do not fast path for int8, int16, and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.
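A hedged usage sketch of the new op (a per-tensor reduction matching `max()` without a `dim`); the shapes here are illustrative, and CPU inputs go down the slow path.
```python
import torch

ts = [torch.randn(4), torch.randn(7), torch.randn(2, 3)]
maxes = torch._foreach_max(ts)              # one 0-dim max per input tensor
expected = [t.max() for t in ts]
assert all(torch.equal(a, b) for a, b in zip(maxes, expected))
```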
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
2024-05-29 19:08:58 +00:00
96bdb7a0fb
In test_foreach.py, patch KINETO_LOG_LEVEL to silence profiler log ( #126048 )
...
as per title, `patch.dict` the env var in favor of cleaner logs.
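A minimal sketch of the approach; the particular log-level value is an assumption, not necessarily what the PR sets.
```python
import os
from unittest.mock import patch

# Scope the env var to the test body so other tests' logs are unaffected.
with patch.dict(os.environ, {"KINETO_LOG_LEVEL": "5"}):
    ...  # run the profiler-dependent assertions here
```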
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126048
Approved by: https://github.com/janeyx99
2024-05-13 15:31:56 +00:00
98821b3d92
Disable various flaky tests in test_foreach ( #125783 )
...
* Similar to #125046
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125783
Approved by: https://github.com/huydhn
2024-05-09 18:08:39 +00:00
aa7be72cc5
Convert ForeachFuncInfo to dataclass ( #125001 )
...
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99 , https://github.com/jeffdaily
2024-05-02 04:19:09 +00:00
75fa54a9d1
Revert "Convert ForeachFuncInfo
to dataclass
( #125001 )"
...
This reverts commit 9466335ae4cb049efd3f4c2b32b2115ba00694f3.
Reverted https://github.com/pytorch/pytorch/pull/125001 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is breaking on ROCm 9466335ae4
([comment](https://github.com/pytorch/pytorch/pull/125001#issuecomment-2086640674 ))
2024-04-30 19:05:53 +00:00
9466335ae4
Convert ForeachFuncInfo to dataclass ( #125001 )
...
- `ForeachFuncInfo` to `dataclass` for smaller diff from `OpInfo`
- `skips` to `decorators` and `skip` to `xfail`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125001
Approved by: https://github.com/janeyx99
2024-04-30 16:19:42 +00:00
a68a8c0f6b
Disable test_binary_op_list_error_cases in test_foreach ( #125046 )
...
It's really flaky, e.g.
* https://github.com/pytorch/pytorch/issues/124636
* https://github.com/pytorch/pytorch/issues/124529
and there are more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125046
Approved by: https://github.com/huydhn
2024-04-26 21:25:38 +00:00
1f89bf4188
Revert "[reland] _foreach_copy
with different src/dst dtypes ( #123844 )"
...
This reverts commit ff1e3ff5a503a520c1a310c8e72a383657f9a4bc.
Reverted https://github.com/pytorch/pytorch/pull/123844 on behalf of https://github.com/malfet due to Perhaps it enabled it for different dtype, but broke for the same ([comment](https://github.com/pytorch/pytorch/pull/123844#issuecomment-2059861767 ))
2024-04-16 20:23:14 +00:00
ff1e3ff5a5
[reland] _foreach_copy with different src/dst dtypes ( #123844 )
...
Attempt to reland https://github.com/pytorch/pytorch/pull/121717 .
The change is the array bounds check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123844
Approved by: https://github.com/janeyx99
2024-04-16 02:20:58 +00:00
1d6c5972c1
[BE]: Optimize min/max/sum comprehensions C419 ( #123960 )
...
Automatic fixes that replace certain list comprehensions with generator expressions where appropriate so that they are consumed immediately. This is preview functionality in ruff for rule C419, and it was applied automatically.
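An example of the kind of rewrite the rule performs:
```python
xs = [3, 1, 2]
total_old = sum([x * x for x in xs])  # builds an intermediate list first
total_new = sum(x * x for x in xs)    # generator is consumed immediately
assert total_old == total_new == 14
```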
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960
Approved by: https://github.com/malfet
2024-04-12 23:54:15 +00:00
c3de2cc154
Enable UFMT on test/test_foreach.py ( #123718 )
...
Part of https://github.com/pytorch/pytorch/issues/123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123718
Approved by: https://github.com/ezyang
2024-04-10 18:22:12 +00:00
eb3a34d280
Optimize multi_tensor_apply (take 2) ( #119764 )
...
### Take 2
The first take (#119153 ) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153 :
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.
### Summary
Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.
Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.
This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
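A self-contained sketch (not PyTorch's actual implementation; the 4 KB limit constant and the function name are illustrative) of the dispatch decision described above:
```python
KERNEL_ARG_LIMIT_BYTES = 4 * 1024  # static kernel-argument space assumed above

def choose_arg_transfer(arg_bytes: int) -> str:
    """Decide how multi_tensor_apply's kernel arguments would travel to the device."""
    if arg_bytes <= KERNEL_ARG_LIMIT_BYTES:
        # Small enough: pass args by value in the static kernel-argument memory.
        return "static kernel-argument memory, single launch"
    # Too large: stage args in page-locked host memory, cudaMemcpyAsync them over,
    # and still perform the entire workload in a single kernel launch.
    return "page-locked buffer + cudaMemcpyAsync, single launch"

print(choose_arg_transfer(2_048))   # fits in the static argument space
print(choose_arg_transfer(64_000))  # falls back to the async copy path
```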
### Benchmark (WIP)
The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**
The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa ).
**Baseline**
A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319 ">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```
**This PR**
A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b ">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy , https://github.com/eellison , https://github.com/crcrpar
2024-04-03 05:54:49 +00:00
958dbb876c
Revert "_foreach_copy
with different src/dst dtypes ( #121717 )"
...
This reverts commit da2a9a05127c2b44e447e734d99e727d856cb36f.
Reverted https://github.com/pytorch/pytorch/pull/121717 on behalf of https://github.com/janeyx99 due to Causing IMAs on V100s internally :C ([comment](https://github.com/pytorch/pytorch/pull/121717#issuecomment-2025553295 ))
2024-03-28 15:54:40 +00:00
bef01c7c2b
Revert "Optimize multi_tensor_apply (take 2) ( #119764 )"
...
This reverts commit fe41ba47652ca73569453bddb43605c77bb85184.
Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2024105399 ))
2024-03-27 22:42:07 +00:00
fe41ba4765
Optimize multi_tensor_apply (take 2) ( #119764 )
...
### Take 2
The first take (#119153 ) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153 :
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.
### Summary
Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.
Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.
This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
### Benchmark (WIP)
The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**
The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa ).
**Baseline**
A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319 ">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```
**This PR**
A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b ">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy , https://github.com/eellison , https://github.com/crcrpar
2024-03-27 00:51:30 +00:00
5e0440edb4
Revert "Optimize multi_tensor_apply (take 2) ( #119764 )"
...
This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0.
Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk 0b68a28c87
. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124 ))
2024-03-22 02:18:28 +00:00
0b68a28c87
Optimize multi_tensor_apply (take 2) ( #119764 )
...
### Take 2
The first take (#119153 ) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153 :
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.
### Summary
Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.
Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.
This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
### Benchmark (WIP)
The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**
The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa ).
**Baseline**
A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319 ">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```
**This PR**
A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b ">
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy , https://github.com/eellison , https://github.com/crcrpar
2024-03-21 11:53:31 +00:00
057892f4be
[CPU] optimize Lp norm for 1-dimensional vector ( #122143 )
...
Fixes https://github.com/pytorch/pytorch/issues/120229
- Optimize vector norm by simplifying vector norm formula for 1-dimensional vector.
- Vector norm formula for 1-dimensional vector simplifies to `abs(x)`. See below for proof.
- Next step, we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrix.
- Additionally, avoids overflow in power, `abs(x) ** p` for large `p` or `x`, for 1-dimensional vector.
### Performance
Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for
`torch.norm(torch.randn(2**18, 1), ord, -1)`
`torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)`
Tested on 28 physical cores/socket, 1 socket on Skylake.
| **op** | **input shape** | **dim** | **ord** | **baseline (master) Avg Latency (ms)** | **optimized (7102f1ef372b248414d36cbd0c51a546b6b6a41a) Avg Latency (ms)** | **speedup ratio (baseline/optimized)** |
|-------------------------- |----------------- |--------- |--------- |----------------------- |----------------------- |---------------------------------------- |
| torch.norm | (2**18, 1) | -1 | fro | 34.3755531 | 0.0125408 | 2741.094 |
| | | | inf | 34.0952635 | 0.0122237 | 2789.271 |
| | | | -inf | 34.3674493 | 0.0120759 | 2845.953 |
| | | | 0 | 34.1004515 | 0.0175261 | 1945.69 |
| | | | 1 | 34.1688442 | 0.0121593 | 2810.089 |
| | | | -1 | 33.949492 | 0.0120282 | 2822.487 |
| | | | 2 | 34.3669581 | 0.0120401 | 2854.366 |
| | | | -2 | 33.9252067 | 0.0121069 | 2802.139 |
| torch.linalg.vector_norm | (2**18, 1) | -1 | inf | 34.090879 | 0.0095105 | 3584.545 |
| | | | -inf | 34.3708754 | 0.0099111 | 3467.931 |
| | | | 0 | 34.0880775 | 0.0141716 | 2405.38 |
| | | | 1 | 34.1392851 | 0.0093174 | 3664.036 |
| | | | -1 | 33.925395 | 0.0092483 | 3668.302 |
| | | | 2 | 34.3854165 | 0.0092459 | 3719.002 |
| | | | -2 | 33.932972 | 0.0093007 | 3648.429 |
### Proof
<details>
<summary>For those interested :)</summary>
<img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b ">
<img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322 ">
</details>
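For reference, a compact text version of the same derivation (for p ≠ 0; the images above cover the full argument):
```latex
% For a 1-dimensional vector x = (x_1):
\|x\|_p = \Big( \textstyle\sum_{i=1}^{1} |x_i|^p \Big)^{1/p} = \big(|x_1|^p\big)^{1/p} = |x_1|
  \quad (0 < p < \infty), \qquad
\|x\|_{\infty} = \max_i |x_i| = |x_1|, \qquad
\|x\|_{-\infty} = \min_i |x_i| = |x_1|.
```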
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143
Approved by: https://github.com/lezcano
2024-03-20 23:20:25 +00:00
da2a9a0512
_foreach_copy with different src/dst dtypes ( #121717 )
...
Fixes #115171
```
torch.version.git_version = '6bff6372a922fe72be5335c6844c10e2687b967d', torch.cuda.get_device_name() = 'NVIDIA RTX 6000 Ada Generation'
[------------------ foreach copy - self: torch.float32 - shape: (512, 512) -----------------]
| src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
num_tensors: 32 | 14.2 | 12.6 | 12.7
num_tensors: 256 | 688.0 | 510.3 | 514.0
num_tensors: 1024 | 2768.0 | 2053.3 | 2047.7
Times are in microseconds (us).
[------------------ foreach copy - self: torch.float16 - shape: (512, 512) -----------------]
| src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
num_tensors: 32 | 10.0 | 8.9 | 8.8
num_tensors: 256 | 497.6 | 344.3 | 348.3
num_tensors: 1024 | 1991.9 | 1392.0 | 1389.0
Times are in microseconds (us).
[----------------- foreach copy - self: torch.bfloat16 - shape: (512, 512) -----------------]
| src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
num_tensors: 32 | 10.0 | 8.8 | 8.8
num_tensors: 256 | 497.5 | 344.5 | 348.0
num_tensors: 1024 | 1993.2 | 1390.4 | 1387.5
Times are in microseconds (us).
[------------------ foreach copy - self: torch.float32 - shape: (515, 515) -----------------]
| src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
num_tensors: 32 | 19.0 | 17.9 | 18.1
num_tensors: 256 | 707.2 | 540.2 | 543.1
num_tensors: 1024 | 2900.6 | 2156.6 | 2159.2
Times are in microseconds (us).
[------------------ foreach copy - self: torch.float16 - shape: (515, 515) -----------------]
| src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
num_tensors: 32 | 13.8 | 13.7 | 13.1
num_tensors: 256 | 513.2 | 352.6 | 350.4
num_tensors: 1024 | 2047.6 | 1404.4 | 1400.4
Times are in microseconds (us).
[----------------- foreach copy - self: torch.bfloat16 - shape: (515, 515) -----------------]
| src: torch.float32 | src: torch.float16 | src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
num_tensors: 32 | 13.6 | 12.8 | 14.2
num_tensors: 256 | 511.9 | 351.8 | 350.6
num_tensors: 1024 | 2045.4 | 1402.2 | 1401.4
Times are in microseconds (us).
```
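A hedged usage sketch of what this enables: an in-place foreach copy where the source dtypes differ from the destination dtypes (values are cast into the destination dtype); the shapes and dtypes here are illustrative.
```python
import torch

dst = [torch.zeros(4, dtype=torch.float32), torch.zeros(3, dtype=torch.float32)]
src = [torch.randn(4, dtype=torch.bfloat16), torch.randn(3, dtype=torch.float16)]
torch._foreach_copy_(dst, src)                     # mixed src/dst dtypes now allowed
assert all(d.dtype == torch.float32 for d in dst)  # destinations keep their dtype
```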
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121717
Approved by: https://github.com/janeyx99
2024-03-13 05:42:28 +00:00
82bb06334d
Update python binding for in-place foreach to return List[Tensor] ( #121405 )
...
fixes #104817
taking over #118622
```c++
// _foreach_atan_
static PyObject * THPVariable__foreach_atan_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
HANDLE_TH_ERRORS
static PythonArgParser parser({
"_foreach_atan_(TensorList self)",
}, /*traceable=*/false);
ParsedArgs<1> parsed_args;
auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
if(_r.has_torch_function()) {
return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
}
// aten::_foreach_atan_(Tensor(a!)[] self) -> ()
// auto dispatch__foreach_atan_ = [](at::TensorList self) -> at::TensorList {
auto dispatch__foreach_atan_ = [](at::TensorList self) -> void {
pybind11::gil_scoped_release no_gil;
at::_foreach_atan_(self);
};
dispatch__foreach_atan_(_r.tensorlist(0));
PyObject* self_tensorlist = _r.args[0];
Py_INCREF(self_tensorlist);
return self_tensorlist;
Py_RETURN_NONE;
END_HANDLE_TH_ERRORS
}
...
// _foreach_div_
static PyObject * THPVariable__foreach_div_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
HANDLE_TH_ERRORS
static PythonArgParser parser({
"_foreach_div_(TensorList self, ScalarList scalars)",
"_foreach_div_(TensorList self, Tensor other)",
"_foreach_div_(TensorList self, TensorList other)",
"_foreach_div_(TensorList self, Scalar scalar)",
}, /*traceable=*/false);
ParsedArgs<2> parsed_args;
auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
if(_r.has_torch_function()) {
return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
}
switch (_r.idx) {
case 0: {
// aten::_foreach_div_.ScalarList(Tensor(a!)[] self, Scalar[] scalars) -> ()
// auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> at::TensorList {
auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> void {
pybind11::gil_scoped_release no_gil;
at::_foreach_div_(self, scalars);
};
dispatch__foreach_div_(_r.tensorlist(0), _r.scalarlist(1));
PyObject* self_tensorlist = _r.args[0];
Py_INCREF(self_tensorlist);
return self_tensorlist;
}
case 1: {
// aten::_foreach_div_.Tensor(Tensor(a!)[] self, Tensor other) -> ()
// auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> at::TensorList {
auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> void {
pybind11::gil_scoped_release no_gil;
at::_foreach_div_(self, other);
};
dispatch__foreach_div_(_r.tensorlist(0), _r.tensor(1));
PyObject* self_tensorlist = _r.args[0];
Py_INCREF(self_tensorlist);
return self_tensorlist;
}
case 2: {
// aten::_foreach_div_.List(Tensor(a!)[] self, Tensor[] other) -> ()
// auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> at::TensorList {
auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> void {
pybind11::gil_scoped_release no_gil;
at::_foreach_div_(self, other);
};
dispatch__foreach_div_(_r.tensorlist(0), _r.tensorlist(1));
PyObject* self_tensorlist = _r.args[0];
Py_INCREF(self_tensorlist);
return self_tensorlist;
}
case 3: {
// aten::_foreach_div_.Scalar(Tensor(a!)[] self, Scalar scalar) -> ()
// auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> at::TensorList {
auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> void {
pybind11::gil_scoped_release no_gil;
at::_foreach_div_(self, scalar);
};
dispatch__foreach_div_(_r.tensorlist(0), _r.scalar(1));
PyObject* self_tensorlist = _r.args[0];
Py_INCREF(self_tensorlist);
return self_tensorlist;
}
}
Py_RETURN_NONE;
END_HANDLE_TH_ERRORS
}
```
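A short sketch of the user-visible effect (assuming, per the generated code above, that the binding hands back the input tensor list rather than None):
```python
import torch

ts = [torch.randn(2), torch.randn(3)]
ret = torch._foreach_atan_(ts)       # previously returned None from Python
assert ret is not None and len(ret) == len(ts)
```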
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121405
Approved by: https://github.com/soulitzer
2024-03-08 21:00:01 +00:00
e0a7b024b0
[ROCm] Skip test_parity* unit tests in test_foreach only if ROCm version < 6.0 ( #117301 )
...
Skip test_parity* unit tests in test_foreach.py on ROCm only if ROCm version < 6.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117301
Approved by: https://github.com/jithunnair-amd , https://github.com/ezyang
2024-02-22 16:21:09 +00:00
4319735ace
Add meta registration for _foreach_norm (2nd try) ( #119927 )
...
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. This version instead launches multiple kernels with a simpler version of the struct (while keeping the number of launched kernels to a minimum).
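A hedged sketch of what the meta registration enables: shape and dtype propagation for `_foreach_norm` without touching real data, producing one 0-dim result per input tensor.
```python
import torch

ts = [torch.empty(3, 4, device="meta"), torch.empty(5, device="meta")]
outs = torch._foreach_norm(ts)                     # no data touched on the meta device
assert len(outs) == len(ts)
assert all(o.device.type == "meta" and o.dim() == 0 for o in outs)
```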
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
2024-02-16 00:23:23 +00:00
2a87ab4508
Refactor some tests by using TEST_CUDA & TEST_MULTIGPU instead ( #116083 )
...
As stated in https://github.com/pytorch/pytorch/pull/116014#discussion_r1430510759 , refactor some related tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116083
Approved by: https://github.com/fduwjj
2024-01-03 08:53:59 +00:00
bd10fea79a
[BE]: Enable F821 and fix bugs ( #116579 )
...
Fixes #112371
I tried to fix as many of the bugs as I could; for a few I could not figure out what the proper fix was, so I left them with noqas.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
794545c11f
[BE]: Enable RUF015 codebase wide ( #115507 )
...
Constant-time access of the first value in a collection. This is a constant-time operation, instead of converting the collection to a list to get the first item, which is linear. The rule is turned on, which automatically autofixes and enforces this.
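An example of the pattern this rule rewrites:
```python
d = {"a": 1, "b": 2}
first_old = list(d)[0]     # O(n): materializes the whole collection
first_new = next(iter(d))  # O(1): stops after the first element
assert first_old == first_new == "a"
```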
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115507
Approved by: https://github.com/malfet
2023-12-11 15:51:01 +00:00
b56b002842
Fix NULL dereference in binary CPU ops ( #115183 )
...
Targeted fix for https://github.com/pytorch/pytorch/issues/113037
A more fundamental fix, where those functions are not even called for empty tensors, is coming later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115183
Approved by: https://github.com/drisspg , https://github.com/atalman , https://github.com/huydhn
2023-12-06 03:37:47 +00:00
2f536ff92c
Refactor values kwarg in foreach tests ( #112781 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112781
Approved by: https://github.com/lezcano
ghstack dependencies: #112778
2023-11-22 22:10:54 +00:00
1f1ff629a8
Use parent class attribute supports_out for foreach_zero opinfo ( #112778 )
...
Instead of introducing a new has_no_out_of_place attribute
Also fixes foreach_copy tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112778
Approved by: https://github.com/lezcano
2023-11-22 18:00:44 +00:00
deec2380c7
Add 0dim Tensor overload for _foreach_div ( #113688 )
...
This PR is ALMOST basically just following the steps from #106677 EXCEPT! We do add one feature. Similar to fused_adam(w), for the CUDA dispatches: when the scalar tensor is on CPU, we .item and redispatch to the normal scalar overload. Otherwise, the cuda kernel will complain about mismatch in devices between the scalar and the tensors.
Why do we add this feature? Our optimizers want to allow lr as a tensor, and lr could be a CPU tensor. lr is used with foreach_div_ in Adam, so our CI will break otherwise.
After this PR, `_foreach_mul` and `_foreach_div` will accept either a CPU or a GPU tensor for the scalar tensor (vs only a GPU tensor). They join the ranks of `fused_adam(w)` in this characteristic. I did not yet do the same thing for foreach_add (the only other foreach op with a .Tensor overload) because there is no use case and it would be more involved.
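A hedged sketch of the new behavior: an lr-like 0-dim CPU tensor dividing a parameter list in place (on CUDA inputs, the CPU scalar is `.item()`'d and redispatched to the Scalar overload as described above).
```python
import torch

params = [torch.ones(3), torch.ones(2, 2)]
lr = torch.tensor(4.0)                 # 0-dim CPU "scalar tensor"
torch._foreach_div_(params, lr)
assert all(torch.allclose(p, torch.full_like(p, 0.25)) for p in params)
```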
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113688
Approved by: https://github.com/mlazos , https://github.com/albanD
2023-11-15 20:59:32 +00:00
44367c59b2
Update skip reason for failing unit tests on ROCm 5.7 ( #113286 )
...
Follow up to https://github.com/pytorch/pytorch/pull/110465 . Updated skip reason for failing unit tests on ROCm 5.7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113286
Approved by: https://github.com/malfet
2023-11-13 19:29:04 +00:00
3b915f9de0
[pt2] enable meta tests for foreach ops ( #113484 )
...
Try https://github.com/pytorch/pytorch/pull/113059 again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113484
Approved by: https://github.com/lezcano
2023-11-11 02:43:41 +00:00
d5eb9f725c
Fix test_add_scalar_with_empty_list_tensor ( #113262 )
...
By actually instantiating the test method for different types and devices rather than always creating it on CPU.
Also, remove `bool` from the list, as adding 1 to bool is not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113262
Approved by: https://github.com/jeanschmidt , https://github.com/atalman , https://github.com/lezcano
2023-11-08 20:56:37 +00:00
3a429423fc
Upgrade CI to ROCm5.7 ( #110465 )
...
This PR is to upgrade CI to ROCm5.7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110465
Approved by: https://github.com/pruthvistony , https://github.com/malfet
2023-11-08 06:11:10 +00:00
236eff9531
[BE] Refactor repeated asserts in test_foreach.py ( #112348 )
...
Tested conditions in `test_binary_op_list_error_cases` look almost identical, although they cover the method and in-place variants. Use a for loop to make the distinction a bit more explicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112348
Approved by: https://github.com/albanD
ghstack dependencies: #112349
2023-10-31 01:11:44 +00:00
80de49653a
Prevent OOB access in foreach_list variants ( #112349 )
...
By checking that lists sizes are the same before computing forward gradients.
Before the change
```cpp
::std::vector<at::Tensor> _foreach_add_List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
auto self_ = unpack(self, "self", 0);
auto other_ = unpack(other, "other", 1);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, other );
std::vector<bool> _any_has_forward_grad_result(self.size());
for (const auto& i : c10::irange(self.size())) {
_any_has_forward_grad_result[i] = isFwGradDefined(self[i]) || isFwGradDefined(other[i]);
}
...
```
after the change:
```cpp
::std::vector<at::Tensor> _foreach_add_List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
auto self_ = unpack(self, "self", 0);
auto other_ = unpack(other, "other", 1);
[[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, other );
TORCH_CHECK(
self.size() == other.size(),
"Tensor lists must have the same number of tensors, got ",
self.size(),
" and ",
other.size());
std::vector<bool> _any_has_forward_grad_result(self.size());
for (const auto& i : c10::irange(self.size())) {
_any_has_forward_grad_result[i] = isFwGradDefined(self[i]) || isFwGradDefined(other[i]);
}
```
Add regression test
Fixes https://github.com/pytorch/pytorch/issues/112305
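A hedged regression-style sketch of the scenario guarded above: with forward-mode AD involved, mismatched list lengths should now hit the TORCH_CHECK instead of reading out of bounds (the expected error text is quoted from the generated code above).
```python
import torch
import torch.autograd.forward_ad as fwAD

a = [torch.randn(2), torch.randn(3)]
b = [torch.randn(2)]                                   # mismatched length
with fwAD.dual_level():
    duals = [fwAD.make_dual(t, torch.ones_like(t)) for t in a]
    try:
        torch._foreach_add(duals, b)
    except RuntimeError as e:
        print(e)  # "Tensor lists must have the same number of tensors, got 2 and 1"
```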
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112349
Approved by: https://github.com/Chillee
2023-10-30 20:43:03 +00:00
c7dcba9276
Remove passing disable_fastpath in kwargs ( #112250 )
...
Fixes an issue that came up in https://github.com/pytorch/pytorch/pull/112030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112250
Approved by: https://github.com/lezcano
2023-10-27 18:29:20 +00:00
ca7d084ff9
Add ScalarTensor or 0dim overload for _foreach_add ( #111079 )
...
Adding a Tensor overload will allow us to:
- optimize in more cases than before
- increase coverage for scalarTensor instead of just scalars in our foreach APIs
The main complication in this PR was that add.Tensor has a scalar overload, so I've now built out support for that.
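A hedged usage sketch of the 0-dim Tensor overload this adds (shapes and values are illustrative):
```python
import torch

ts = [torch.zeros(3), torch.zeros(2)]
step = torch.tensor(1.5)               # 0-dim "scalar tensor" instead of a Python scalar
out = torch._foreach_add(ts, step)
assert all(torch.allclose(o, torch.full_like(o, 1.5)) for o in out)
```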
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111079
Approved by: https://github.com/albanD
2023-10-20 01:34:07 +00:00
0a60219fe3
[foreach] Fix 0-size handling for real for real ( #109402 )
...
@crcrpar's last attempt to fix the 0-size problem unfortunately did not pass all cases. See my comment in https://github.com/pytorch/pytorch/issues/100701 . When we have a tail tensor of size 0, the old code would mess with the chunk logic to check the previous tensor's length. This is flawed because:
1. if the previous tensor was also 0 sized, (so a tensor list of [tensor, tensor, tensor, ..., 0-sized tensor, 0-sized tensor],) chunks would still be 0 and the nested for loop would be missed.
2. the nested for-loop introduces side effects on tensorListMeta that _shouldn't_ be there! This can mess up the compute in unexpected ways that I haven't really needed to reason through.
We noticed that the problem had not been fixed due to an internal report. This PR solves the issue by:
- removing the finagling of chunks when the tail tensor is 0-sized
- adding a surefire way for the kernel to be launched in the case where the last tensor is 0-sized AND there's content in the metadata, signifying there is stuff to compute still.
## test plan
As I went through the code, I also added some comments explaining what's up and modified our tensor inputs to ensure that this case is tested in the test_parity test in test_foreach.py. Yes, I do realize there is quite a bit of duplication and that this file could be due for a refactor. That said, the primary goal of this PR is to fix the pretty egregious bug and refactoring can be a followup.
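A hedged reproduction-style sketch of the case described above: a tensor list whose trailing entries are 0-sized must still launch the kernel for the earlier, non-empty tensors (the fast path needs a CUDA device).
```python
import torch

if torch.cuda.is_available():
    ts = [torch.ones(5, device="cuda"), torch.empty(0, device="cuda"), torch.empty(0, device="cuda")]
    torch._foreach_add_(ts, 1.0)       # must not skip the non-empty leading tensor
    assert torch.equal(ts[0], torch.full((5,), 2.0, device="cuda"))
```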
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109402
Approved by: https://github.com/albanD
2023-09-26 17:38:20 +00:00
4b0281b32c
[BE][foreach] name tests correctly. noncontiguous inputs != fastpath ( #109771 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109771
Approved by: https://github.com/soulitzer
2023-09-22 19:16:14 +00:00
602413a0a0
Refactor test_foreach.py ( #107869 )
...
## Summary
- Change the default of `supports_autograd` and `supports_forward_ad` of `ForeachFuncInfo` to `True`
- Add `test_zero_size_tensor_inputs` to make sure that foreach functions can handle 0-size Tensor inputs
- Add `test_parity` to check the consistency between outputs of foreach and for-loop of native function.
- Add `test_autodiff` to check forward-mode and reverse-mode AD
- Keep the corner cases that are not covered by the newly introduced methods
rel:
- #58833
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107869
Approved by: https://github.com/janeyx99
2023-09-14 19:39:26 +00:00
5814380e7b
Revert "Revert "Reland "Add forward mode AD to out-place foreach functions ( #102409 ) ( #106043 )""" ( #106320 )
...
Fixed a typo specifying the number of tensors and elements in the test that had been failing in slow gradcheck.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106320
Approved by: https://github.com/soulitzer
2023-08-18 23:01:42 +00:00
b234b94760
Add in-place _foreach_copy ( #107226 )
...
Fixes #107162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107226
Approved by: https://github.com/janeyx99
2023-08-17 00:11:18 +00:00