This PR enables all PIE rules in ruff. Some rules from this family were already enabled; the newly added rules are:
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
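For illustration, a minimal Python sketch (not taken from the PyTorch codebase) of the patterns these two rules flag:
```
from enum import Enum


class Color(Enum):
    RED = 1
    BLUE = 1  # PIE796: duplicate value (BLUE silently becomes an alias of RED)


# PIE808: the start argument 0 is redundant
total = 0
for i in range(0, 10):
    total += i

# Preferred form after the lint fix
for i in range(10):
    total += i
```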
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
Added a check in aten/src/ATen/native/EmbeddingBag.cpp that determines whether per_sample_weights requires a gradient, and uses that to decide whether at::_embedding_bag_forward_only or at::_embedding_bag should run.
Also added two tests in test_embedding.py that verify the call now works.
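For illustration, a hedged Python sketch of the scenario the tests cover (assumed setup, not the exact test code):
```
import torch
import torch.nn.functional as F

# Assumed repro: only per_sample_weights requires grad, so the dispatch must
# pick at::_embedding_bag (not the forward-only variant) for backward to
# produce per_sample_weights.grad.
weight = torch.randn(10, 3)                 # embedding table, no grad needed
indices = torch.tensor([1, 2, 4, 5, 4, 3])
offsets = torch.tensor([0, 3])
per_sample_weights = torch.rand(6, requires_grad=True)

out = F.embedding_bag(indices, weight, offsets, mode="sum",
                      per_sample_weights=per_sample_weights)
out.sum().backward()
print(per_sample_weights.grad.shape)        # torch.Size([6])
```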
Fixes #136457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142338
Approved by: https://github.com/soulitzer
Summary:
# Latest Update
This diff is no longer needed: it turned out we did need the check to exist, to make meta behave the same as other devices; see D54526190.
---------------------------------
# Background
T176105639
| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| --- | --- | --- | --- | --- |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good | failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |
Currently we are in case A: users need to add `use_fp32_embedding` in training to force the embedding bag dtype to be fp32. However, users actually want case B, using fp16 as the embedding bag weight. When they delete `use_fp32_embedding`, they fail the [check](https://fburl.com/code/k3n3h031) in meta_registration that forces `weight dtype == per_sample_weights dtype`.
The check is actually not necessary, because the fbgemm backend does support case B. Additionally, later on in `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to be the same dtype (https://fburl.com/code/q0tho05h: weight is src, per_sample_weights is scale) for `is_fast_path_index_select`.
# This diff
Therefore, this diff removes the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in the meta forward. With this, users can use fp16 as the emb bag dtype without having to force per_sample_weights to the same dtype in the meta forward (see Test Plan).
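As a hedged illustration (a minimal sketch with assumed shapes, not the Test Plan command below), case B in the meta forward looks roughly like this:
```
import torch
import torch.nn.functional as F

# Assumed minimal sketch of case B on the meta device: fp16 weight with
# fp32 per_sample_weights. With the dtype check removed from the meta
# registration, this forward is expected to return a meta tensor instead
# of raising.
weight = torch.empty(10, 4, dtype=torch.float16, device="meta")
indices = torch.empty(6, dtype=torch.long, device="meta")
offsets = torch.empty(2, dtype=torch.long, device="meta")
psw = torch.empty(6, dtype=torch.float32, device="meta")

out = F.embedding_bag(indices, weight, offsets, mode="sum",
                      per_sample_weights=psw)
print(out.shape, out.dtype, out.device)
```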
# Reference diffs to resolve this issue
Diff 1: D52591217
This passes the embedding bag dtype to feature_processor so that per_sample_weights has the same dtype as the emb bag weight. However, `is_meta` also needs to be passed because of case C: fbgemm still does not support per_sample_weights = fp16 (see the table above), so users are forced to make per_sample_weights fp16 only when it is on meta. The solution requires too many hacks.
Diff 2: D53232739
Basically does the same thing as diff 1 (D52591217), except that the hack is added in the TorchRec library. This adds an `if` in EBC and PEA: when the emb bag weight is fp16, it forces per_sample_weights to fp16 too. However, this runs into the same fbgemm issue and has broken a bunch of prod models.
Test Plan:
# APS
The following command runs icvr_launcher, which triggers ads_launcher and runs forward on the meta device:
```
buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false
```
Result:
{F1461463993}
Reviewed By: ezyang
Differential Revision: D54175438
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774
Approved by: https://github.com/ezyang
…with large index
Fixes#130806
When an output of size 2147483648 (= 131072 * 16384) is expected, as in the issue above, the following error was thrown:
RuntimeError: HIP error: invalid configuration argument
What happened was that the second parameter passed to hipLaunchKernel was an invalid {2147483648, 1, 1}.
Found two issues in Indexing.cu:
1: ptrdiff_t was used, but it is a signed int, so outTotalSize >= 2147483648 can cause an overflow when doing [this](39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)).
2: On ROCm, std::min -> ::min did not work as expected when outTotalSize >= 2147483648.
As a result, 2147483648 was passed to hipLaunchKernel as the number of threads per block, which the GPU cannot support. The original code intended to set 128 threads per block; that choice is debatable, since the performance may not be good on the latest powerful GPUs (perhaps a TODO to revisit for perf), but at least it does not cause the `invalid configuration argument` error.
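For context, a hedged Python sketch of the kind of lookup involved (the shapes are assumptions chosen to produce a 2147483648-element output; the exact snippet is in the linked issue):
```
import torch

# Assumed shapes yielding 131072 * 16384 = 2147483648 output elements;
# previously this overflowed the HIP launch configuration on ROCm.
weight = torch.randn(32, 16384, device="cuda", dtype=torch.float16)
indices = torch.randint(0, 32, (131072,), device="cuda")
out = torch.nn.functional.embedding(indices, weight)
print(out.dim(), out.numel())   # 2 2147483648
```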
[Test]
Run the same code snippet as in the [issue](https://github.com/pytorch/pytorch/issues/130806) and print the output, its dim, and its numel(); it now looks like this:
```
output=tensor([[ 0.4044, -0.0244, -0.6865, ..., -0.7800, 0.1175, 1.6726],
[-1.0866, -0.1609, 0.3538, ..., 1.9105, 0.7882, 1.1583],
[-2.2079, 0.3736, 0.3610, ..., -0.2658, -0.0459, 1.3077],
...,
[ 0.8753, -0.7482, -0.1978, ..., 0.9016, 1.1501, -0.5178],
[-1.5845, -0.6277, 1.4520, ..., 0.5733, -2.1198, -0.0915],
[-0.6310, -1.0239, -0.1910, ..., 0.4309, 0.1630, 0.3239]],
device='cuda:0'), dim=2, numel=2147483648
```
Added a large tensor unit test too.
```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard
test/nn/test_embedding.py . [100%]
=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
# Summary
Introduced a BC-breaking change in #109533 for the case where self is a view of the value: by using the copy_() op inside fill_, we were hitting `assert_no_partial_overlap` in TensorIterator.
Ideally we would be able to avoid this check when value.numel() == 1, but rather than monkeying around with TensorIterator, I just clone the input instead.
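A hedged sketch of the overlapping case (an assumed minimal repro, not the original report):
```
import torch

# Assumed minimal repro: the fill value is a 0-d view into the tensor being
# filled, so routing fill_ through copy_ tripped assert_no_partial_overlap;
# cloning the value sidesteps the partial-overlap check.
t = torch.arange(4.0)
t.fill_(t[1])   # value aliases self
print(t)        # tensor([1., 1., 1., 1.])
```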
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109835
Approved by: https://github.com/mikaylagawarecki
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.
The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.
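A hedged sketch of the intended usage (the config flag name and defaults here are assumptions based on this description and may differ across versions):
```
import torch
import torch._dynamo as dynamo

# Assumption: int unspecialization is opt-in via this flag, since the PR
# turns it off by default.
dynamo.config.specialize_int = False

@torch.compile(dynamic=True)
def scale(x, k: int):
    return x * k

x = torch.randn(4)
scale(x, 3)
scale(x, 7)   # intended to reuse the same graph, with k carried as a SymInt
scale(x, 1)   # 0 and 1 remain specialized, per the notes below
```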
* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.
Fixes https://github.com/pytorch/pytorch/issues/95469
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
This reverts commit f3bf46e801dec2637751224fd6e27fbf97453bc6.
Reverted https://github.com/pytorch/pytorch/pull/94163 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I suspect that it causes flaky SIGSEGV failure for linux-bionic-py3.8-clang9 / test (crossref) job in trunk. For example, 05397b1250
This PR fixes the segfault reported at https://github.com/pytorch/pytorch/issues/89677, a `double free` issue caused by an `invalid read`.
The reported issue breaks in the slow path for `EmbeddingBag` on float32, at [EmbeddingBag.cpp#L451](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L451).
The root cause is that, for the reported case, `add_indices` contains an index that exceeds the range of `output_data`.
The offsets are given as
```
{0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53}
```
The `indices` has 55 elements and `offsets[-1] != indices.size(0)`.
When `include_last_offset` is true, the `output` will have shape {offsets.size(0) - 1, weight.sizes()[1]}, which is {10, 5} here.
Originally, `add_indices` will be (I re-arrange the 1D tensor by rows, so here 10 rows in total):
```
### this is 55 elements
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4
5 5 5 5 5 5 5 5
6 6
7 7 7 7
8 8 8 8 8 8 8
10 10
```
The last row has an index of 10, which is out of range for the output tensor of size [10, 5].
The reason is that `make_offset2bag` at [EmbeddingBag.cpp#L66](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/EmbeddingBag.cpp#L66) gives the following `offset2bag`:
```
### this is 55 + 1 elements:
0 0 0 0 0 0 1
0 0 0 0 0 1
0 0 1
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 1
0 0 0 0 0 0 0 1
0 1
0 0 0 1
0 0 0 0 0 0 2
0 0
```
Notice that index 53 is added twice.
The fix is to ignore the last index from `offsets` when `include_last_offset` is true; this behavior also aligns with CUDA, per https://github.com/pytorch/pytorch/pull/57208#issuecomment-1021727378.
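A hedged sketch of the reported configuration (assumed shapes; using per_sample_weights to stay on the float32 slow path is also an assumption):
```
import torch
import torch.nn.functional as F

# Assumed repro shape: 55 indices, offsets[-1] == 53 != 55, and
# include_last_offset=True. Previously this produced an out-of-range bag
# index and the reported double free; with the fix it returns a [10, 5]
# output.
weight = torch.randn(60, 5)
indices = torch.randint(0, 60, (55,))
offsets = torch.tensor([0, 6, 12, 15, 25, 32, 40, 42, 46, 53, 53])
psw = torch.rand(55)   # assumed to keep the call off the fbgemm fast path

out = F.embedding_bag(indices, weight, offsets, mode="sum",
                      per_sample_weights=psw, include_last_offset=True)
print(out.shape)   # torch.Size([10, 5])
```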
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90358
Approved by: https://github.com/ezyang