Commit Graph

376 Commits

Author SHA1 Message Date
0ce945790e [NJT] Fix schema validation error in jagged functions (#165307)
Fixes #161812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165307
Approved by: https://github.com/soulitzer
2025-10-13 17:59:18 +00:00
5e58420dff LocalTensor (#164537)
A LocalTensor is a tensor subclass which simulates a tensor that is
distributed across SPMD ranks.  A LocalTensor might be size N, but in fact
there are world_size shards/replicas of it stored internally.  When you do a
plain PyTorch operation on it, we apply the operation to each shard; when you
do a collective, we do the mathematically equivalent operation on the local
shards.  A LocalTensor is associated with a list of ranks which specify
which ranks it holds local tensors for.

NB: this is NOT a DataParallel-like abstraction where you can run operations
on multiple different GPUs. It is intended purely for *debugging* purposes;
the overhead is almost certainly too high to keep eight GPUs busy (even the C++
autograd needs multithreading to keep up!). (It might potentially be possible
to trace through this with torch.compile and then compile it with CUDA graphs,
but this is currently a non-goal.)

In order to handle MPMD, we provide a helper decorator that allows you to
run a function with no side effects for each LocalTensor shard and combine
results back into LocalTensor or LocalIntNode.
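
A toy sketch of the per-shard semantics described above (names and structure here are hypothetical, purely illustrative, and not the PR's actual implementation):

```python
import torch

# Hypothetical illustration of the LocalTensor idea: one logical tensor backed
# by per-rank shards, plain ops applied shard-by-shard, and collectives replaced
# by the mathematically equivalent operation on the local shards.
class ToyLocalTensor:
    def __init__(self, shards: dict[int, torch.Tensor]):
        self.shards = shards  # rank -> local shard/replica

    def add(self, other: "ToyLocalTensor") -> "ToyLocalTensor":
        return ToyLocalTensor({r: t + other.shards[r] for r, t in self.shards.items()})

    def all_reduce_sum(self) -> "ToyLocalTensor":
        total = sum(self.shards.values())
        return ToyLocalTensor({r: total.clone() for r in self.shards})


t = ToyLocalTensor({0: torch.ones(2), 1: torch.full((2,), 2.0)})
print(t.all_reduce_sum().shards[0])  # tensor([3., 3.]) on every simulated rank
```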

Note: This PR converts all DTensor ops and some DTensor tests to illustrate the
intended usage and ensure correctness. More tests will be converted in
subsequent PRs. During test conversion we aim to share as much of the test
logic as possible between multi-process / multi-threaded and local tensor tests.
We would like developers to be able to run both flavors of the tests.

Note: This work is based on the original proposal
by @ezyang (WIP PR https://github.com/pytorch/pytorch/pull/162753).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164537
Approved by: https://github.com/ezyang
2025-10-12 20:06:41 +00:00
6a31f42da4 Fix NestedTensor max/min operations for integer dtypes. (#162273)
Fixes: https://github.com/pytorch/pytorch/issues/162049

### Summary

The `max_dim` and `min_dim` functions incorrectly used `torch.finfo()`
for all dtypes, causing a `TypeError` for integer tensors.

### Changes

- Use torch.iinfo() for integer dtypes instead of torch.finfo().
- Add CPU test: `test_jagged_max_min_dtypes` covering `int8, int16, int32, int64, uint8, float16, bfloat16, float32 and float64`
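
A minimal sketch of the dtype-dependent selection (the helper name is hypothetical; the PR's actual change lives in the jagged max_dim/min_dim implementations):

```python
import torch

def _identity_value_for_min(dtype: torch.dtype):
    # Integer dtypes have no finfo; pick the right introspection helper per dtype.
    info = torch.finfo(dtype) if dtype.is_floating_point else torch.iinfo(dtype)
    return info.min
```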

### Testing

Before Fix:

`python -m pytest test/test_nestedtensor.py -k "test_jagged_max_min_dtypes" -v`

Output:

```
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_bfloat16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float32 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float64 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0005s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int32 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0005s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int64 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0004s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int8 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0004s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_uint8 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
```

After Fix:

`python -m pytest test/test_nestedtensor.py -k "test_jagged_max_min_dtypes" -v`

Output:

```
Running 9 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_bfloat16 PASSED [0.0086s]                                                                                                                   [ 11%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float16 PASSED [0.0011s]                                                                                                                    [ 22%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float32 PASSED [0.0011s]                                                                                                                    [ 33%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float64 PASSED [0.0011s]                                                                                                                    [ 44%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int16 PASSED [0.0009s]                                                                                                                      [ 55%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int32 PASSED [0.0010s]                                                                                                                      [ 66%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int64 PASSED [0.0010s]                                                                                                                      [ 77%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int8 PASSED [0.0010s]                                                                                                                       [ 88%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_uint8 PASSED [0.0011s]                                                                                                                       [100%]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162273
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
2025-10-02 18:46:27 +00:00
fd785b1762 Add NestedTensor dispatch for _is_any_true/_is_all_true (#162096)
Fixes: https://github.com/pytorch/pytorch/issues/161818

### Summary
Add NestedTensor support for `_is_any_true` and `_is_all_true`.

### Changes
- Register dispatch for `aten._is_any_true.default` and
  `aten._is_all_true.default`
- Add CPU tests:
  - `test_is_any_true_jagged`: dispatch_matches_values_buffer,
    all_false_returns_false, one_true_returns_true
  - `test_is_all_true_jagged`: dispatch_matches_values_buffer,
    all_true_returns_true, any_false_returns_false
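
A hedged sketch of what the jagged dispatch above could look like (not the PR's exact registration code):

```python
import torch

# For a jagged-layout NestedTensor these reductions depend only on the values
# buffer, since offsets/lengths describe layout rather than data.
def _is_any_true_jagged(nt: torch.Tensor) -> torch.Tensor:
    return torch.ops.aten._is_any_true.default(nt.values())

def _is_all_true_jagged(nt: torch.Tensor) -> torch.Tensor:
    return torch.ops.aten._is_all_true.default(nt.values())
```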

### Testing

Before Fix:

`pytest -q test/test_nestedtensor.py -k "test_is_any_true_jagged or test_is_all_true_jagged" -v`

Output:
```
FAILED [0.0129s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_all_true_jagged_cpu - NotImplementedError: aten._is_all_true.default
FAILED [0.0007s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_any_true_jagged_cpu - NotImplementedError: aten._is_any_true.default
```

After Fix:

`pytest -q test/test_nestedtensor.py -k "test_is_any_true_jagged or test_is_all_true_jagged" -v`

Output:

```
Running 2 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_all_true_jagged_cpu PASSED [0.0277s]                                                                                                                               [ 50%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_any_true_jagged_cpu PASSED [0.0013s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162096
Approved by: https://github.com/jbschlosser
2025-09-22 20:22:44 +00:00
bf28990c3d Add support for NestedTensor share_memory_ (#162272)
Fixes: https://github.com/pytorch/pytorch/issues/161915

### Summary

Implements share_memory_() support for NestedTensor!

### Changes

- Added share_memory_() method to NestedTensor class.
  - Shares storage for all NestedTensor components: _values, _offsets, _lengths, and cached seqlen tensors.
  - Guard for CUDA Tensors.
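
A hedged sketch of the idea (not the PR's exact method body; component names follow the description above):

```python
import torch

def njt_share_memory_(nt: torch.Tensor) -> torch.Tensor:
    # Move each constituent buffer's storage into shared memory.
    for component in (nt._values, nt._offsets, nt._lengths):
        if component is not None:
            component.share_memory_()
    return nt
```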

### Testing

Before Fix:

`pytest -q test/test_nestedtensor.py -k "test_share_memory" -v`

Output:

```
Running 1 items in this shard

test/test_nestedtensor.py Fatal Python error: Segmentation fault
```

After Fix:

`pytest -q test/test_nestedtensor.py -k "test_share_memory" -v`

Output:

```
Running 1 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_share_memory_cpu PASSED [0.0753s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162272
Approved by: https://github.com/jbschlosser
2025-09-22 19:59:58 +00:00
d08cabe314 [BC Breaking] Remove flex + njt code paths (#161734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161734
Approved by: https://github.com/jbschlosser
2025-09-16 00:13:56 +00:00
ac9ccd0dc2 Add return-max-scores to flex-attention (#161667)
# Summary

### Update

API

```Py
from typing import NamedTuple, Optional

from torch import Tensor
from torch.nn.attention.flex_attention import flex_attention


class AuxRequest(NamedTuple):
    """Request which auxiliary outputs to compute from flex_attention.

    Each field is a boolean indicating whether that auxiliary output should be computed.
    """

    lse: bool = False
    max_scores: bool = False


class AuxOutput(NamedTuple):
    """Auxiliary outputs from flex_attention operation.

    Fields will be None if not requested, or contain the tensor if requested.
    """

    lse: Optional[Tensor] = None
    max_scores: Optional[Tensor] = None


# query, key, value, score_mod defined elsewhere
out_only = flex_attention(query, key, value, score_mod)
out_max, aux_max = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(max_scores=True),
)
out_both, aux_both = flex_attention(
    query,
    key,
    value,
    score_mod,
    return_aux=AuxRequest(lse=True, max_scores=True),
)
```

Returns the max post mod scores from flex attention.

Not being able to break BC is kind of annoying here, since we end up with a combinatorial problem: any additional return value needs a new kwarg gating whether it gets returned, and we have to support the 2**N possible return groups.

Ideally there isn't much more we need to return, but we might want to think about how best to set this up for expansion in the future. I added a keyword-only argument for now.

Maybe we make a `ExtraReturns` type kwarg that can grow and we don't need to keep adding new top level args.

We could also return a Struct that holds all the extra tensors and start deprecation cycle for logsumexp eventually returning just 1 `ExtraReturns` like struct with the tensors.

### Req Grad
I currently don't return a max_scores that supports backpropagating grads. I think this might be feasible, but since max is essentially one-hot on the inputs and a reduction, we would either need to save another `max_location` from the forward, or recompute the max score and apply the gradient only to the first occurrence when there are multiple equal scores (need to check whether that's how the vanilla max op in torch is defined).

For now no grad, we can re-visit if needed.

## Perf
I am going to disable this for flex_decode, since at least initially the motivation is for training. It is also harder than it should be to have ops return None or optional tensors. If return_max is false, we should probably just create a tensor of size zero so that we don't slow down the hot path.
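
A hedged sketch of that zero-size-placeholder idea (names and shapes are illustrative, not the kernel's actual signature):

```python
import torch

def _maybe_max_scores(requested: bool, query: torch.Tensor) -> torch.Tensor:
    # If max_scores is not requested, hand back an empty tensor instead of None
    # so the op's return arity and types stay fixed on the hot path.
    if requested:
        return query.new_empty(query.shape[:-1])  # illustrative shape only
    return query.new_empty(0)
```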

```Shell
🔝 Top 5 TFlops Deltas (by absolute %):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,     ┆ 249.514658    ┆ 243.078974   ┆ 6.435684  ┆ 2.647569  │
│                ┆                ┆ 2048, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 57.971274     ┆ 56.633641    ┆ 1.337633  ┆ 2.361905  │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 280.71254     ┆ 275.686991   ┆ 5.025549  ┆ 1.822918  │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,    ┆ 152.970031    ┆ 150.489109   ┆ 2.480923  ┆ 1.648573  │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘

🔺 Top 5 Positive TFlops Deltas (highest +%):
shape: (5, 7)
┌────────────────┬────────────────┬────────────────────────┬───────────────┬──────────────┬──────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)  ┆ TFlops (base) ┆ TFlops (max) ┆ delta    ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                    ┆ ---           ┆ ---          ┆ ---      ┆ ---       │
│ str            ┆ str            ┆ str                    ┆ f64           ┆ f64          ┆ f64      ┆ f64       │
╞════════════════╪════════════════╪════════════════════════╪═══════════════╪══════════════╪══════════╪═══════════╡
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 2048, 16,      ┆ 249.514658    ┆ 243.078974   ┆ 6.435684 ┆ 2.647569  │
│                ┆                ┆ 2048, 64)              ┆               ┆              ┆          ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 57.971274     ┆ 56.633641    ┆ 1.337633 ┆ 2.361905  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
│ noop           ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,      ┆ 280.71254     ┆ 275.686991   ┆ 5.025549 ┆ 1.822918  │
│                ┆                ┆ 1024, 128)             ┆               ┆              ┆          ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 16384, 16,     ┆ 152.970031    ┆ 150.489109   ┆ 2.480923 ┆ 1.648573  │
│                ┆                ┆ 16384, 64)             ┆               ┆              ┆          ┆           │
│ causal         ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,      ┆ 161.031318    ┆ 158.597808   ┆ 2.43351  ┆ 1.534391  │
│                ┆                ┆ 1024, 64)              ┆               ┆              ┆          ┆           │
└────────────────┴────────────────┴────────────────────────┴───────────────┴──────────────┴──────────┴───────────┘

🔻 Top 5 Negative TFlops Deltas (lowest -%):
shape: (5, 7)
┌────────────────┬────────────────┬───────────────────────┬───────────────┬──────────────┬───────────┬───────────┐
│ attn_type      ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D) ┆ TFlops (base) ┆ TFlops (max) ┆ delta     ┆ pct_delta │
│ ---            ┆ ---            ┆ ---                   ┆ ---           ┆ ---          ┆ ---       ┆ ---       │
│ str            ┆ str            ┆ str                   ┆ f64           ┆ f64          ┆ f64       ┆ f64       │
╞════════════════╪════════════════╪═══════════════════════╪═══════════════╪══════════════╪═══════════╪═══════════╡
│ noop           ┆ torch.bfloat16 ┆ (4, 16, 1024, 16,     ┆ 244.052884    ┆ 248.65129    ┆ -4.598406 ┆ -1.849339 │
│                ┆                ┆ 1024, 64)             ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 4,      ┆ 175.546923    ┆ 177.81205    ┆ -2.265127 ┆ -1.273888 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (4, 16, 16384, 4,     ┆ 156.282597    ┆ 158.209134   ┆ -1.926537 ┆ -1.217715 │
│                ┆                ┆ 16384, 64)            ┆               ┆              ┆           ┆           │
│ sliding_window ┆ torch.bfloat16 ┆ (2, 16, 2048, 16,     ┆ 232.542929    ┆ 235.140136   ┆ -2.597207 ┆ -1.104536 │
│                ┆                ┆ 2048, 128)            ┆               ┆              ┆           ┆           │
│ alibi          ┆ torch.bfloat16 ┆ (2, 16, 1024, 16,     ┆ 169.652791    ┆ 171.475986   ┆ -1.823195 ┆ -1.063236 │
│                ┆                ┆ 1024, 128)            ┆               ┆              ┆           ┆           │
└────────────────┴────────────────┴───────────────────────┴───────────────┴──────────────┴───────────┴───────────┘
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161667
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2025-09-08 22:44:48 +00:00
1128f4c2a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-08-08 22:22:48 +00:00
50580b5053 Add minimal nn.functional.log_softmax support for NestedTensor (#159662)
This only works for the jagged layout and for the non-batch and non-jagged dimensions.

I did this mostly by copy-pasting from the existing softmax implementation, but it seems fairly straightforward and I think it should work.
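
A hedged usage sketch, assuming a jagged-layout NJT with the softmax taken over the last (non-batch, non-ragged) dim:

```python
import torch
import torch.nn.functional as F

nt = torch.nested.nested_tensor(
    [torch.randn(3, 5), torch.randn(4, 5)], layout=torch.jagged
)
out = F.log_softmax(nt, dim=-1)  # last dim is neither the batch nor the ragged dim
# Should match applying the op to the flat values buffer directly.
ref = F.log_softmax(nt.values(), dim=-1)
```
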
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159662
Approved by: https://github.com/jbschlosser
2025-08-06 20:34:02 +00:00
bc4b04e058 DeviceCopy should have the same layout as input (#159615)
Summary: Fix https://github.com/pytorch/pytorch/issues/159612

- Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy

Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```

Rollback Plan:

Differential Revision: D79411407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
2025-08-04 23:56:58 +00:00
fa0355c18d Fix full_like decomposition to preserve strides (#158898)
Summary:
See original PR at: https://github.com/pytorch/pytorch/pull/144765, which landed internally but was reverted due to test failures. Addressing reviewer comments and trying again.

Rollback Plan:

Differential Revision: D78783627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158898
Approved by: https://github.com/eellison
2025-07-25 20:21:36 +00:00
7f649ed4f8 Add basic torch.hash_tensor op (#154149)
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.

- The hash is always uint64.
- Integers will be cast to uint64 before performing the xor_sum reduction
- Floats will be upcast to double and then bitcast to uint64 before performing the xor_sum reduction
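
A hedged usage sketch based on the description above:

```python
import torch

x = torch.arange(8, dtype=torch.int32)
h = torch.hash_tensor(x)   # xor_sum reduction by default per this PR's description
print(h.dtype)             # torch.uint64 (the hash is always uint64)
```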

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
bfe5674e22 Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 0797b2b6a80cf70a7accc3d5413186e7693d4451.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info  ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671))
2025-07-17 16:55:55 +00:00
0797b2b6a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 16:07:54 +00:00
ed03492238 Add check nested_tensor_from_jagged param jagged_dim >= 1 (#157770)
Fixes #157404
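
A hedged sketch of the post-fix behavior (the exact exception type raised by the check is an assumption here):

```python
import torch

values = torch.randn(7, 5)
offsets = torch.tensor([0, 3, 7])
try:
    # jagged_dim must be >= 1 after this change; 0 would alias the batch dim
    torch.nested.nested_tensor_from_jagged(values, offsets, jagged_dim=0)
except Exception as e:
    print(type(e).__name__, e)
```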

## Test Result

```bash
pytest test/test_nestedtensor.py

...............................................s..........ssssss.................................................................................................s.s..sssss..s...ss............................................................. [ 44%]
...........................................................sssss....sss...s.........ss....s....sss.........s.sss...s..s......s............s.sss.ss...............s.....................s....s......................s.s.....s....s..s..ssssssssss [ 59%]
sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss..ssssss.ssssssssssssssssssssssssssssssssssssssssssssssssssssssssss.ssssssss...............................s........................................... [ 74%]
.......sss...................................................................................................................................................................................................................................... [ 89%]
....sss..........................................................................................................................................................                                                                                [100%]

==================================================================================================== 1317 passed, 258 skipped in 2504.27s (0:41:44) ====================================================================================================
```

![image](https://github.com/user-attachments/assets/dcc8e46d-b88f-4580-b4ad-0999bad33ec9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157770
Approved by: https://github.com/soulitzer

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
2025-07-10 00:34:39 +00:00
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
c553c55be7 Revert "Fix full_like decomposition to preserve strides (#144765)"
This reverts commit 01b0f09931d47bd2716398a0c335b2807dc3074d.

Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098))
2025-07-02 13:56:03 +00:00
01b0f09931 Fix full_like decomposition to preserve strides (#144765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765
Approved by: https://github.com/amjames, https://github.com/jansel
2025-07-01 19:13:22 +00:00
06408dae49 Revert "Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)"
This reverts commit 0029259bdfeee627181df2b9f5ff6979f65090ec.

Reverted https://github.com/pytorch/pytorch/pull/154757 on behalf of https://github.com/laithsakka due to post land issue ([comment](https://github.com/pytorch/pytorch/pull/154757#issuecomment-2971385787))
2025-06-13 19:11:43 +00:00
0029259bdf Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-12 09:58:15 +00:00
a060f3d272 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-16 15:42:22 +00:00
f363a3f51a Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 9386701b51aadce951bf38daf497b0257a3f2211.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))
2025-05-14 20:53:49 +00:00
2c1912452d Revert "Rewrite autograd producer consumer stream sync logic (#151079)"
This reverts commit f78e4529a9d446deb77c6ac38184582f6ab9167a.

Reverted https://github.com/pytorch/pytorch/pull/151079 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in internal signals, see [D74648937](https://www.internalfb.com/diff/D74648937) ([comment](https://github.com/pytorch/pytorch/pull/151079#issuecomment-2880176879))
2025-05-14 13:07:12 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
f78e4529a9 Rewrite autograd producer consumer stream sync logic (#151079)
Also see previous work https://github.com/pytorch/pytorch/pull/142097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151079
Approved by: https://github.com/albanD
2025-05-12 21:07:16 +00:00
eqy
ce317cd5a8 [CUDA][SDPA] bump fudge factor in test_sdpa in test_nestedtensor (#152235)
Small mismatches on e.g., 4090, A6000/A40

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152235
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/jbschlosser
2025-04-30 20:24:49 +00:00
4e7d264cf8 Introduce UserDefinedExceptionClassVariable (#146504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504
Approved by: https://github.com/anijain2305
2025-03-11 18:55:45 +00:00
c297c09a37 Fix invalid nested int guarding in broadcast_shapes() (#145957)
Fixes #145874

This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially.

Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-03-11 00:53:13 +00:00
93e9daed54 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-03-04 23:09:09 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
9a1f720a72 Validate inputs to _nested_view_from_buffer to prevent overflows (#147356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147356
Approved by: https://github.com/albanD, https://github.com/jbschlosser
ghstack dependencies: #147352, #147354
2025-02-27 15:48:58 +00:00
fa8e3a28a7 Revert "[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)"
This reverts commit 533b884870acd951e684e0bf551eb76904dec047.

Reverted https://github.com/pytorch/pytorch/pull/141178 on behalf of https://github.com/jeanschmidt due to Broke internal arvr signals, see D69971019. @jbschlosser please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/141178#issuecomment-2676317470))
2025-02-22 17:28:12 +00:00
533b884870 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-02-21 05:22:19 +00:00
43496e9b90 [NJT] fix flop counter for SDPA & test (#147032)
Fixes 3 issues:
1. The test wasn't actually testing SDPA: both were checking cuda, and the inputs to SDPA were not transposed.
2. FlopCounterMode has been renamed _FlopCounterMode (and a wrapper named FlopCounterMode has been added)
3. offsets_to_list also needs to ignore the actual offset values if offsets is a meta tensor.
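
For reference, a hedged sketch of the counter usage the fixed test exercises (shown here with dense tensors; shapes assume q/k/v already transposed into (B, H, S, E) layout):

```python
import torch
import torch.nn.functional as F
from torch.utils.flop_counter import FlopCounterMode

q = k = v = torch.randn(2, 8, 128, 64)   # (B, H, S, E)
with FlopCounterMode(display=False) as counter:
    F.scaled_dot_product_attention(q, k, v)
print(counter.get_total_flops())
```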

Differential Revision: [D69558785](https://our.internmc.facebook.com/intern/diff/D69558785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147032
Approved by: https://github.com/jbschlosser
2025-02-13 07:14:58 +00:00
3cadce7af2 [NJT] Fix inference mode for composite implicit ops without nested-specific kernel (#146633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146633
Approved by: https://github.com/jbschlosser
2025-02-10 16:59:48 +00:00
1ba1b7b597 Support remaining *_like factory functions for NJT (#144889)
Fixes #144761

This PR adds NJT impls for those *_like functions that were previously missing:
* `full_like()`
* `rand_like()`
* `randint_like()`

It also fixes a bug in existing *_like functions when a new device is specified. Fix is to also transfer `offsets` / `lengths` to the new device.
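
A hedged usage sketch of the newly covered functions (the device-transfer lines assume a CUDA build):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)
filled = torch.full_like(nt, 1.0)
noise = torch.rand_like(nt)
ints = torch.randint_like(nt, 0, 10)
# The device fix described above: offsets should follow the new device as well.
if torch.cuda.is_available():
    nt_cuda = torch.rand_like(nt, device="cuda")
    assert nt_cuda.offsets().device.type == "cuda"
```
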
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144889
Approved by: https://github.com/soulitzer
2025-01-27 21:33:51 +00:00
46e06e1d09 Avoid data-dependent errors in NJT tests via capture_scalar_outputs=True (#144588)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

There are several xfails related to data-dependent errors in torch.compile. This PR sets `torch._dynamo.config.capture_scalar_outputs=True` to avoid these, which tends to exercise unbacked SymInt logic and will require `torch._check()`-related fixes.
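
The toggle in question, as it would be set in a test (sketch):

```python
import torch._dynamo.config as dynamo_config

# Capture .item()-style scalar outputs as unbacked SymInts instead of graph-breaking.
dynamo_config.capture_scalar_outputs = True
```
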
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144588
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586, #144587
2025-01-24 22:45:01 +00:00
128f3627b1 Implement backward for NJT matmul (#144587)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements missing backward support for NJT matmul. Notably, for dense tensors, matmul dispatches to bmm. However, due to historical reasons related to NST, NJT handles matmul directly, and thus can't rely on the CompositeImplicit impl of matmul to get the derivative formula.
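
A hedged usage sketch (NJT @ dense, jagged layout):

```python
import torch

a = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)],
    layout=torch.jagged, requires_grad=True,
)
b = torch.randn(3, 5, requires_grad=True)
out = torch.matmul(a, b)      # (B, ragged, 3) @ (3, 5) -> (B, ragged, 5)
out.sum().backward()          # full-reduction sum gives a dense scalar
print(b.grad.shape)           # torch.Size([3, 5])
```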

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144587
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586
2025-01-21 18:27:50 +00:00
af204135d8 Fix NJT fill.Scalar for contiguous inputs (#144586)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements the missing `fill.Scalar` support, which works fine for contiguous inputs, but there is still some AOTAutograd debugging required to handle non-contiguous transposed NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144586
Approved by: https://github.com/soulitzer
2025-01-21 18:22:08 +00:00
b63b81410c Fix NJT frexp() to handle both outputs (#144585)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Before this PR, `frexp()` for NJT was handled via the unary pointwise fallback. The op returns a tuple, however, and the fallback doesn't handle that. This PR defines an explicit impl for `frexp()` that wraps both returned `(mantissa, exponent)` as NJTs.
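
A hedged usage sketch:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], layout=torch.jagged
)
mantissa, exponent = torch.frexp(nt)   # both outputs are wrapped as NJTs post-fix
print(mantissa.is_nested, exponent.is_nested)
```
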
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144585
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583, #144584
2025-01-18 15:59:56 +00:00
3ee531f8b9 Support NJT chunk() backward on batch dim (#144584)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Implements `chunk()` backward on the batch dim, which was left out before. This PR unbinds the components and invokes `copy_()` on these to pass along the appropriate gradients.
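
A hedged sketch of the newly supported case (chunking over the batch dim with gradients flowing back):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3), torch.randn(1, 3)],
    layout=torch.jagged, requires_grad=True,
)
chunks = nt.chunk(3, dim=0)               # split along the batch dim
loss = sum(c.sum() for c in chunks)       # each c.sum() is a dense scalar
loss.backward()                           # gradients now reach nt on the batch dim
```
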
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144584
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583
2025-01-18 15:58:24 +00:00
a8ef423fed Fix NJT min / max backward() for non-ragged reductions (#144583)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

`value_selecting_reduction_backward()` is used in the backward for min / max, so this PR implements it for NJT. Notably, this isn't enough for reducing over the ragged dim, since that results in a dense tensor and thus NJT's torch_dispatch will not be called for this op. We need factory function support for nested ints to fix that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144583
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582
2025-01-17 20:57:11 +00:00
cac10b8190 Fix NJT OpInfo entry for nn.functional.prelu (#144582)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

The OpInfo entry for prelu was wrong before this PR; `weight` needs to be passed as well. The op isn't fully implemented yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144582
Approved by: https://github.com/soulitzer
2025-01-17 20:36:15 +00:00
c07dc64017 Update pin memory related APIs to not pass 'device' argument (#131858)
Based on https://github.com/pytorch/pytorch/pull/126376, this PR updates all PT callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass the `device` argument.
As for `storage/untyped_storage.is_pinned()/pin_memory()`, we keep the `device` argument but passing it is discouraged; if not given, the default `device` is still 'cuda' for BC.
Additionally, with device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective; if not given, the default `device` will be the current accelerator.
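
A hedged sketch of the preferred, device-agnostic calls after this change (assumes an accelerator such as CUDA is available):

```python
import torch

x = torch.randn(4)
pinned = x.pin_memory()        # no 'device' argument; uses the current accelerator
print(pinned.is_pinned())      # likewise queried without a device argument
```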

Fixes #124908
Relates https://github.com/pytorch/pytorch/pull/126376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-15 17:23:35 +00:00
cyy
df458be4e5 [4/N] Apply py39 ruff and pyupgrade fixes (#143257)
```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-01-04 10:47:51 +00:00
228b228449 Fix batch-specific attention mod for NJT + Flex (#143866)
Fixes #143788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143866
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-12-27 20:51:41 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
5dabe2d464 Fix NJT backward tests (#143072)
This PR fixes some issues with NJT backward / compile backward tests:
1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a LOT of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status. Note: the clone() / detach() stuff is for autograd; can't have two SampleInputs as part of the same autograd graph.
2. Per-sample skips weren't -fully- working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a ton of things but ultimately had to split the `subtest_ctx` into that + a `skip_xfail_ctx` to run the subtests within.
    * Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py`
3. Added the appropriate skips / xfails to get everything passing. There are a shitton of bugs to fix!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-12-12 18:06:23 +00:00
e1196dfe51 Deprecate torch._utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-08 22:55:36 +00:00
5663ad99e7 Fix per-sample xfails for NJT tests (#142243)
#140736 fixed some xfails, but these were not properly failing in CI due to #142157. This PR removes the xfails so we can land a fix to that issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142243
Approved by: https://github.com/huydhn
2024-12-06 22:39:35 +00:00