Hi,
I noticed the `unfold` operator was missing on MaskedTensor.
I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I couldn't find the tests for the dispatch of such an operation. Where are they?
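For illustration, here is a rough sketch of the call this change enables, assuming the prototype `torch.masked.masked_tensor` factory (values and shapes here are made up):
```python
import torch
from torch.masked import masked_tensor

data = torch.arange(8, dtype=torch.float32)
mask = data % 2 == 0                  # keep every other element
mt = masked_tensor(data, mask)

# With unfold dispatched for MaskedTensor, this yields size-2 windows of the
# data with the corresponding mask carried along.
print(mt.unfold(0, 2, 2))
```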
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.
Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we desugar to `item()` calls and let that path take care of unbacked symints.
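As a rough illustration of that desugaring (a sketch only; `tolist_via_item` is a hypothetical helper, not the code in this PR):
```python
import torch

def tolist_via_item(t: torch.Tensor):
    # 0-d tensors become Python scalars via item(), which is the path that
    # already knows how to produce proxy-backed unbacked symints; higher
    # dimensions recurse one row at a time, mirroring tolist().
    if t.dim() == 0:
        return t.item()
    return [tolist_via_item(t[i]) for i in range(t.shape[0])]
```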
Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but wasn't really doing much; repurposed it to exercise more modes.
Differential Revision: D62197742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
See #121528 for additional context.
In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).
Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.
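To make that use case concrete, here is a hedged sketch of the kind of workflow this unblocks (shapes are illustrative, and which SDPA backend is selected on the meta device may vary):
```python
import torch
import torch.nn.functional as F
from torch.utils.flop_counter import FlopCounterMode

# Meta tensors carry only shape/dtype metadata, so no real memory is allocated.
q = torch.empty(2, 8, 128, 64, device="meta", dtype=torch.float16)
k = torch.empty(2, 8, 128, 64, device="meta", dtype=torch.float16)
v = torch.empty(2, 8, 128, 64, device="meta", dtype=torch.float16)

with FlopCounterMode(display=True):
    F.scaled_dot_product_attention(q, k, v)
```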
Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.
Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via codev but did not apply correctly.
# Summary
Changes SDPA's stance on what to do with fully masked-out rows.
## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963
These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated in https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617, can be paraphrased as follows:
When passing in fully masked out rows, attention becomes ambiguous. We have two main options:
1. Uniformly attend to all values:
```python
scores[masked_out_rows] = 1 / len(row)
out[masked_out_rows] = 1 / len(row) * value
```
2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
```python
output[fully_masked_rows] = NaN
```
We went with option 2, partially because it was easier to implement, but also because people argued that users can slice the output to remove the NaNs:
```python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
[ nan, nan, nan, nan]])
```
Those pesky NaNs are back!
## Why do we see NaNs today?
The core of the problem revolves around the use of the softmax function in SDPA:
```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
## Quick Aside: Masking in Attention
Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details ([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.
We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
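As a small illustration of this point (a sketch; shapes and names are made up):
```python
import torch

scores = torch.randn(2, 3)                      # raw query/key dot products
keep = torch.tensor([[True, True, False],
                     [True, False, False]])     # which keys each query may attend to
# "Masking" is really adding a bias: 0 where attention is allowed, -inf where it is not.
bias = torch.zeros_like(scores).masked_fill(~keep, float("-inf"))
attn_weights = torch.softmax(scores + bias, dim=-1)  # disallowed entries get weight 0
print(attn_weights)
```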
## Alternative Approaches
If we use a very large negative number instead of -inf, a fully masked row softmaxes to a uniform distribution:
```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However, if users always remembered to "slice" out their outputs before calling backward, i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564, 0.1613, -0.0486],
[ 0.0000, 0.0000, 0.0000, 0.0000]])
```
This would bring us back to a better state: the masked row's gradients are zero rather than NaN.
## A Third Option
We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.
This PR implements new semantics for attention with fully masked-out rows:
```python
out[masked_out_rows] = 0
```
**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.
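A sketch of the behavior this change is meant to produce (hedged: values are illustrative and the exact path depends on which SDPA backend is selected):
```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 1, 2, 8)
k = torch.randn(1, 1, 2, 8)
v = torch.randn(1, 1, 2, 8)
# The second query row attends to nothing, i.e. it is fully masked out.
attn_mask = torch.tensor([[True, True],
                          [False, False]])
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out[0, 0, 1])  # with the new semantics: zeros rather than NaNs
```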
## Details
This PR stack does three things:
1. Adds a PRIVATE `_safe_softmax` op
2. Updates the semantics of the flash_cpu fused kernel
3. Updates the semantics of the efficient_cuda fused kernel
`_safe_softmax` is not meant to be used generically; it is only meant to be used within the context of SDPA. Because of this, instead of decomposing softmax and checking for -inf rows, we "cheat" and use `nan_to_num`.
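A minimal sketch of that "cheat" (for illustration only; this is not the actual `_safe_softmax` kernel):
```python
import torch

def safe_softmax_sketch(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # A row that is entirely -inf makes softmax emit NaNs; map those NaNs to 0.
    return torch.nan_to_num(torch.softmax(scores, dim=dim), nan=0.0)
```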
Why do I think this is okay? (Please find a counterpoint if you can.)
There are multiple ways NaNs can emerge. For the fully masked-out rows case, nan_to_num works. But what if there were other NaNs? Wouldn't this silently remove them?
The only case where this can happen is if the input itself contains a NaN or an Inf.
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`
whereas
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`
If we don't want to allow even the possibility of "inf" or "NaN" attention scores being converted to 0, we could implement it something like this:
```Python
row_max = torch.max(a, dim=-1, keepdim=True)      # per-row max; avoid shadowing the builtin `max`
exp = torch.exp(a - row_max.values)               # shift for numerical stability
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
# Rows whose maximum is -inf were fully masked out; give them zeros instead of NaNs.
softmax = torch.where(row_max.values == float('-inf'), 0.0, softmax)
```
However, we would be paying for this in math performance.
## Why Now
I think one thing that has substantially changed where PyTorch should land on this question is that we now have fused implementations for SDPA, and these fused implementations allow us to support the new semantics easily and performantly.
Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them (see the sketch after this list)
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`
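Regarding the first point, a hedged sketch of what a `_scaled_mm` call with the now-required scales looks like (illustrative shapes; requires fp8-capable CUDA hardware, and the exact signature has shifted across PyTorch versions):
```python
import torch

# fp8 matmul: mat1 row-major, mat2 column-major, dims multiples of 16.
a = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn).t()  # (32, 16), column-major
scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")
out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
```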
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found that torch's type promotion differs from numpy's, as follows:
```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype # torch.float32
```
~~Not sure of the proper way to handle it; it causes the numpy ref tests to fail.~~
The reason is here, so I think I'm going to xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet