Commit Graph

72046 Commits

fd90991790 [rfc] opentelemetry in pytorch (#122999)
1. Add the current latest version of opentelemetry-cpp (v1.14.2) to the PyTorch library as a third-party submodule.
Steps:
```
$cd pytorch
$git submodule add https://github.com/open-telemetry/opentelemetry-cpp.git third_party/opentelemetry-cpp
$cd third_party/opentelemetry-cpp
$git checkout v1.14.2
$git add third_party/opentelemetry-cpp .gitmodules
$git commit
```
Expected change in checkout size:
```
(/home/cpio/local/a/pytorch-env) [cpio@devvm17556.vll0 ~/local/pytorch (gh/c-p-i-o/otel)]$ git count-objects -vH
count: 654
size: 3.59 MiB
in-pack: 1229701
packs: 17
size-pack: 1.17 GiB
prune-packable: 76
garbage: 0
size-garbage: 0 bytes
```

2.

TODO
- [x] Figure out how dynamic linking works. App builders will somehow need to `target_include` opentelemetry-cpp at runtime.
- [ ] Examples on how to use opentelemetry + pytorch
- [ ] Tests + documentation (e.g. using null opentelemetry implementation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122999
Approved by: https://github.com/ezyang
2024-04-21 15:20:21 +00:00
29cc293725 [BE]: FURB142 - Remove set mutations. Use set update (#124551)
Uses set mutation methods (e.g. `update`, `difference_update`) instead of manually reimplementing them.
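
A minimal illustrative sketch of the kind of rewrite this lint suggests (not code from the PR):
```python
items = ["read", "write", "append"]
seen = {"read"}

# Manual reimplementation that FURB142 flags:
for item in items:
    seen.add(item)

# Preferred: use the set mutation method directly.
seen.update(items)

# Likewise, prefer difference_update over looping with discard/remove.
seen.difference_update({"append"})
```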

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551
Approved by: https://github.com/ezyang
2024-04-21 14:12:33 +00:00
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and includes various other bug fixes.

Below is a before-and-after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0:

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
f34905f61d Assert that TracingContext is available when set_example_value is called (#124284)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124284
Approved by: https://github.com/Chillee
ghstack dependencies: #124105, #124059, #124176, #124283
2024-04-21 11:23:13 +00:00
0e6367dd44 Factor var_to_range assignments to _update_var_to_range helper (#124283)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
2024-04-21 11:23:13 +00:00
cbf420b67a [inductor] for UserDefinedTritonKernels don't mark all inputs as mutating (#124425)
Take this example:
```
import torch
# mul2_kernel is a user-defined Triton kernel (definition not shown here).

def _mul2(x):
    y = torch.empty_like(x)
    mul2_kernel[(10,)](
        in_ptr0=x, out_ptr=y,
        n_elements=x.numel(), BLOCK_SIZE=1,
    )
    return y

def f(x):
    for _ in range(4):
        x = _mul2(x)
    return x + 1
```

Currently, the generated code looks like this. Notice how we allocate 5 buffers of the same size.
```
# Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: []
buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: []
buf4 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: []
buf8 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf8, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: []
buf12 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf8, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf12, (10, ), (1, ), 0) ...)

# Source Nodes: [add], Original ATen: [aten.add]
buf16 = empty_strided_cuda((10, ), (1, ), torch.float32)
triton_poi_fused_add_0.run(buf12, buf16, 10, grid=grid(10), stream=stream0)...)
return (buf16, )
```

With this PR, we want to see the following instead. Notice how we only allocate 2 buffers this time; the other 3 buffers are reused.
```
# Source Nodes: [triton_kernel_wrapper_mutation], Original ATen: []
buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=arg0_1, out_ptr=reinterpret_tensor(buf0, (10, ), (1, ), 0), ...)
del arg0_1

# Source Nodes: [triton_kernel_wrapper_mutation_1], Original ATen: []
buf2 = empty_strided_cuda((10, ), (1, ), torch.float32)
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf0, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf2, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_2], Original ATen: []
buf4 = buf0; del buf0  # reuse
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf2, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf4, (10, ), (1, ), 0) ...)

# Source Nodes: [triton_kernel_wrapper_mutation_3], Original ATen: []
buf6 = buf2; del buf2  # reuse
mul2_kernel_0.run(in_ptr0=reinterpret_tensor(buf4, (10, ), (1, ), 0), out_ptr=reinterpret_tensor(buf6, (10, ), (1, ), 0) ...)
del buf4

# Source Nodes: [add], Original ATen: [aten.add]
buf8 = buf6; del buf6  # reuse
triton_poi_fused_add_0.run(buf8, 10, grid=grid(10), stream=stream0)
return (buf8, )
```

Differential Revision: [D56379307](https://our.internmc.facebook.com/intern/diff/D56379307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124425
Approved by: https://github.com/oulgen
2024-04-21 06:00:14 +00:00
0d90d4d613 [Dynamo] Fix NamedTuple hasattr bug (#124531)
Fixes #124402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124531
Approved by: https://github.com/jansel
2024-04-21 04:36:22 +00:00
a6a3f2e06b [MPS] Fixes GELU, LeakyRELU and MISH on non-contiguous tensors (#123049)
Fixes the GELU, LeakyReLU and Mish activation functions on non-contiguous tensors (for instance, when a transpose operation was applied to the tensors prior to the MPS operator), for both the forward and backward passes.

I also extended the tests on the 3 activation functions to cover full precision and half precision, contiguous and non-contiguous inputs, and several tensor dims: scalars, 1D, empty, 2D, and > 3D.

I had issues with the Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup by passing a gradient argument to .backward().
This PR also fixes an issue with LeakyReLU on empty tensors.
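
A hedged sketch of the scenario being fixed (illustrative only, not the PR's test code): apply an activation to a transposed, hence non-contiguous, tensor on the MPS device, then run backward with an explicit gradient and compare against a CPU reference.
```python
import torch
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"
dtype = torch.float16 if device == "mps" else torch.float32

# Non-contiguous input: a transpose applied before the MPS operator.
x = torch.randn(8, 16, device=device, dtype=dtype, requires_grad=True)
y = F.gelu(x.t())               # forward on a non-contiguous view
y.backward(torch.ones_like(y))  # backward with an explicit gradient

# CPU reference in full precision.
x_ref = x.detach().float().cpu().requires_grad_(True)
y_ref = F.gelu(x_ref.t())
y_ref.backward(torch.ones_like(y_ref))
print(torch.allclose(x.grad.float().cpu(), x_ref.grad, atol=1e-2))
```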

Fixes #98212 huggingface/transformers#22468 huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
2024-04-21 00:12:32 +00:00
98f3e0214b [NCCL][TEST] Synchronize proper devices (#124517)
There are multiple instances of `torch.cuda.synchronize()` calls without arguments. These calls cause device 0 to be synchronized from multiple ranks while the rest of the devices are not. I am pretty sure that was not intended.
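
A hedged sketch of the difference (illustrative, not the test code touched by this PR): without an argument, `torch.cuda.synchronize()` synchronizes the current device, which is device 0 unless each rank has set its device explicitly.
```python
import torch
import torch.distributed as dist

rank = dist.get_rank() if dist.is_initialized() else 0

# Synchronizes the *current* device, i.e. device 0 unless
# torch.cuda.set_device(rank) was called earlier on this rank.
torch.cuda.synchronize()

# Synchronize the device this rank actually uses.
torch.cuda.synchronize(device=rank)
```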

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124517
Approved by: https://github.com/wconstab, https://github.com/eqy
2024-04-20 23:42:32 +00:00
d6f88105ce Fix the problem about load_state_dict with unexpected key whose prefix matches a valid key (#124385)
Fixes https://github.com/pytorch/pytorch/issues/123510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124385
Approved by: https://github.com/mikaylagawarecki
2024-04-20 23:19:25 +00:00
afa78ad08c Call writeline from writelines (#124515)
This makes it more convenient to add a breakpoint here.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124515
Approved by: https://github.com/albanD
2024-04-20 15:45:30 +00:00
a32eac345f [dynamo] Return gm.forward for eager backend (#124109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124109
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #124445
2024-04-20 14:11:05 +00:00
febc4d8759 [dynamo][easy] forbid_in_graph check to use getattr_static (#124445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124445
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-04-20 14:11:05 +00:00
97ccfad915 Fix test_decomp test for ops with py_impl(CompositeImplicitAutograd) (#116832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116832
Approved by: https://github.com/lezcano
2024-04-20 11:10:38 +00:00
a3e3693afc [Dynamo] Fix missing bracket in ListVariable (#124532)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124532
Approved by: https://github.com/williamwen42
2024-04-20 08:26:30 +00:00
f20e3ae0c3 Use recursive glob for package data (#119257)
setup.py now supports recursive glob for package data

I only added `.cpp`, `.h`, and `.yaml` files. Not sure if you want to include BAZEL or other files in package_data.
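
A hedged sketch of what a recursive package-data glob looks like in setup.py (illustrative; the package name and patterns here are placeholders, not the PR's actual entries):
```python
from setuptools import find_packages, setup

# Recursive "**" patterns in package_data are supported by modern setuptools,
# so headers and YAML files nested at any depth get picked up.
setup(
    name="example-package",  # placeholder, not PyTorch's setup.py
    packages=find_packages(),
    package_data={
        "example_package": ["**/*.h", "**/*.cpp", "**/*.yaml"],
    },
)
```
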
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119257
Approved by: https://github.com/zou3519
2024-04-20 06:33:39 +00:00
0d0b5b2655 Enable dynamo rosenbrock sparse tests (#124542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124542
Approved by: https://github.com/yf225
ghstack dependencies: #124540, #124541
2024-04-20 05:54:41 +00:00
184f16016e Enable dynamo-traced deepcopy test for RMSprop (#124541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124541
Approved by: https://github.com/yf225
ghstack dependencies: #124540
2024-04-20 05:54:41 +00:00
6a730698e2 Enable dynamo-traced Adamax tests (#124540)
Enabling tests related to https://github.com/pytorch/pytorch/issues/121178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124540
Approved by: https://github.com/yf225
2024-04-20 05:54:41 +00:00
f1cbaf1764 Adds LSE output for templated-attention-hop if inputs require grad (#124308)
Adds LSE output for templated-attention-hop if inputs require grad

Prep PR for adding autograd support to templated-attention-hop. The kernel needs to output the LSE (logsumexp) during the forward pass, which will be used during the backward pass.
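
A hedged illustration of why the forward saves the LSE (generic softmax math, not this kernel's implementation): with the row-wise logsumexp stored, the backward can recompute the softmax probabilities without re-reducing over the key dimension.
```python
import torch

scores = torch.randn(4, 8)                           # [q_len, k_len] attention scores
lse = torch.logsumexp(scores, dim=-1, keepdim=True)  # saved by the forward

# Backward-style recomputation: softmax(scores) == exp(scores - lse)
p_recomputed = torch.exp(scores - lse)
assert torch.allclose(p_recomputed, torch.softmax(scores, dim=-1), atol=1e-6)
```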

### Output code
https://gist.github.com/drisspg/2aea3ce5db75811e7e143eeecb774d8a

## Before
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     1.159 |              |             |             |             |            |               |                |
| Max     |     1.342 |           16 |          16 |         512 |         512 |         64 | noop          | torch.bfloat16 |
| Min     |     1.016 |            1 |          16 |         512 |         512 |         64 | relative_bias | torch.bfloat16 |

## After
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     1.155 |              |             |             |             |            |             |                |
| Max     |     1.339 |           16 |          16 |         512 |         512 |         64 | noop        | torch.bfloat16 |
| Min     |     1.009 |            1 |          16 |         512 |         512 |         64 | head_bias   | torch.bfloat16 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124308
Approved by: https://github.com/Chillee
2024-04-20 05:45:56 +00:00
0d64b82f0b Make CompiledFxGraph portable between machines (#124438)
As we prepare FxGraphCache to move to a remote cache, we need to make sure the cached graph contains no data that is tied to the local disk.

Differential Revision: [D56363808](https://our.internmc.facebook.com/intern/diff/D56363808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124438
Approved by: https://github.com/jansel
2024-04-20 05:26:14 +00:00
c5a4ba2257 [inductor] consider pointwise nodes when deciding reduction hint (#124131)
In certain **rare** scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if
- the reduction kernel contains a reduction node followed by a pointwise node,
- the pointwise node uses a transposed layout,
- the reduction node is an inner reduction,
- and rnumel <= 1024,

then inductor will generate a persistent reduction kernel, and it causes really bad perf when doing the tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))`. A sketch of this pattern is shown below.
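
A hedged sketch of the kind of shape/layout pattern described above (illustrative only; the actual repro is in the gist linked below and in the added unit test):
```python
import torch

@torch.compile
def f(x):
    # Inner reduction over the last dim (rnumel = x.shape[-1] <= 1024) ...
    s = x.sum(dim=-1, keepdim=True)
    # ... followed by a pointwise node whose output is stored through a
    # transposed (non-contiguous) layout.
    return (x - s).t().contiguous()

# Requires a CUDA device for the Triton-generated reduction kernel.
x = torch.randn(4096, 512, device="cuda")
out = f(x)
```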

I've tried a few versions of the fix.
- The first version: if any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This caused an 8s compilation time increase for huggingface with no perf wins... The reason is that ReductionHint.DEFAULT does more autotuning.
- Then I changed the code to be more specific: we change the hint from INNER to DEFAULT only if we are sure that the pointwise kernel would use a >1 stride for the lowest dimension. Kernels meeting this condition should mostly have really bad perf anyway.

The situation mentioned above is rare, but it was reported by internal users. I'll also run one more perf test.

Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test.

For this specific test from the user report, we improve the mentioned reduction kernel's perf by **4.14x** (451us -> 109us).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131
Approved by: https://github.com/jansel
2024-04-20 05:07:56 +00:00
57f64197f3 Reduce warning msg in torch.profiler (#124469)
Summary: This is actually quite noisy and my logs are full of this soft assertion msg. Maybe make it log only once?

Test Plan:
On AMD GPU side, I got a lot of those warnings:
```
W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())
```
So just suppress the excessive logs.

Reviewed By: aaronenyeshi, yoyoyocmu

Differential Revision: D55602788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469
Approved by: https://github.com/aaronenyeshi
2024-04-20 04:45:12 +00:00
b79b0d3d6a Enable UFMT on test/test_legacy_vmap.py (#124381)
Part of https://github.com/pytorch/pytorch/issues/123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124381
Approved by: https://github.com/ezyang
2024-04-20 03:37:57 +00:00
3d8b903d95 [PyTorch] Remove ArrayRefTensor::numel_ (#124516)
ArrayRefTensor::numel_ is redundant with the size of the contained MiniArrayRef. Reclaiming the space entirely would break ABI compatibility, but at least we have 4-8 bytes for future expansion.

Differential Revision: [D56366829](https://our.internmc.facebook.com/intern/diff/D56366829/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D56366829/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124516
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-04-20 02:44:20 +00:00
f9fce110af [FSDP2][ez] Removed error check for swap tensors flag (#124513)
Since `DTensor` uses `swap_tensors` path automatically now, we can remove this check for the global flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124513
Approved by: https://github.com/weifengpy
ghstack dependencies: #124319, #120256
2024-04-20 00:46:36 +00:00
1c2cb36811 [FSDP2] Added CPU offloading (#120256)
#### Overview
This PR adds CPU offloading via the `offload_policy: OffloadPolicy` argument (a hedged usage sketch follows the list below).
- We incur one H2D copy for each parameter before all-gather.
- We incur one D2H copy for each gradient after reduce-scatter.
- We run the optimizer on CPU.
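
A hedged usage sketch (the import path and the `CPUOffloadPolicy` name are assumptions about the composable FSDP2 API, not something stated in this PR):
```python
import torch
# Assumed import path; CPUOffloadPolicy is an assumed concrete OffloadPolicy
# variant. Assumes torch.distributed / the device mesh is already initialized.
from torch.distributed._composable.fsdp import CPUOffloadPolicy, fully_shard

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.Linear(1024, 1024))
fully_shard(model, offload_policy=CPUOffloadPolicy())

# Parameters and gradients live on CPU between collectives, so the optimizer
# step runs on CPU as well.
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
```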

#### Example (Mixed Precision and CPU Offloading)
This example uses a small 125M numel model, which is not too representative. We can try to run with a larger model like Llama-7B. However, since the current optimizer step is already too slow, we may want to patch a faster CPU optimizer.

Forward
![Screenshot 2024-02-21 at 10 36 29 AM](https://github.com/pytorch/pytorch/assets/31054793/00ed95db-3a55-49bb-ac98-9b9162feaacd)
![Screenshot 2024-02-21 at 10 39 12 AM](https://github.com/pytorch/pytorch/assets/31054793/10e29854-1907-4001-b3dc-aab6c3bf153c)

Backward
![Screenshot 2024-02-21 at 10 37 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7039cb2e-eb78-4f53-b83f-67bae61ebddd)
![Screenshot 2024-02-21 at 10 38 44 AM](https://github.com/pytorch/pytorch/assets/31054793/e34615d6-6b6b-4995-aef1-9c7563034799)

Overall CPU (CPU optimizer step dominates)
![Screenshot 2024-02-21 at 10 39 47 AM](https://github.com/pytorch/pytorch/assets/31054793/7a2a929a-3a40-4b35-891b-016cf57e8079)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120256
Approved by: https://github.com/weifengpy
ghstack dependencies: #124319
2024-04-20 00:42:58 +00:00
cf5ca58e7f [NJT] Inline through torch.nested.nested_tensor_from_jagged instead of graph break (#124343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124343
Approved by: https://github.com/jbschlosser
2024-04-19 23:13:59 +00:00
acbf888a13 rename sl to strobelight (#124455)
Summary:
TORCH_COMPILE_SL_PROFILE -> TORCH_COMPILE_STROBELIGHT
SL_MAX_STACK_LENGTH -> COMPILE_STROBELIGHT_MAX_STACK_LENGTH
SL_MAX_PROFILE_TIME -> COMPILE_STROBELIGHT_MAX_PROFILE_TIME
profile_with_sl() -> strobelight()
compiletime_sl_profile_meta() -> compiletime_strobelight_meta()

Test Plan:
1. run and verify
```
TORCH_COMPILE_STROBELIGHT=TRUE buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```
2. run and verify
```
buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:function_profiler_example --local-only
```
3. run and verify truncated stack for
```
TORCH_COMPILE_STROBELIGHT=TRUE COMPILE_STROBELIGHT_MAX_STACK_LENGTH=1 buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```
4. add infinite loop in _verify and verify samples for
```
COMPILE_STROBELIGHT_MAX_PROFILE_TIME=30 TORCH_COMPILE_STROBELIGHT=TRUE buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compiletime_profiler_example
```

Reviewed By: oulgen

Differential Revision: D56327139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124455
Approved by: https://github.com/oulgen
2024-04-19 22:50:13 +00:00
0feab7d6c3 Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611)"
This reverts commit cb17721899d4d6a55d66d4f7188e36c20a078231.

Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
929242a15c Revert "torch.mtia module for MTIA device backend (#123612)"
This reverts commit d7e1bf9ff908d2a9c20d5354426d34c539fcb7a1.

Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
52da03edeb Revert "Add test_cpp_extensions tests for stream_and_event and mita_backend (#123614)"
This reverts commit b6f0159db08c1ad55fe57a5e92d8933e21ea543e.

Reverted https://github.com/pytorch/pytorch/pull/123614 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))
2024-04-19 22:44:26 +00:00
f8f7cfbeee Add __torch_function__ support for generated tensor methods/property of PrivateUse1 (#121723)
Support the following case:
```python
import torch
...  # setup elided: a PrivateUse1 backend (here named "foo") is registered
class CustomFooTensor(torch.Tensor):
  @classmethod
  def __torch_function__(cls, func, types, args=(), kwargs=None):
    ...
a = CustomFooTensor([3])
print(a.is_foo)  # generated PrivateUse1 tensor property, now routed through __torch_function__
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121723
Approved by: https://github.com/albanD
2024-04-19 22:34:34 +00:00
19850d770d update triton pin (#124429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124429
Approved by: https://github.com/shunting314, https://github.com/malfet
2024-04-19 22:34:28 +00:00
d8a98ddd60 Prep PR for cutlass 3.5 update (#124412)
# Summary
These changes are needed for the upgrade to cutlass 3.5
#123458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124412
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia, https://github.com/malfet
2024-04-19 22:10:37 +00:00
b3504af56e Enable UFMT on test/scripts and some files (#124137)
Part of: #123062

Ran lintrunner on:

- `test/scripts`
- `test/simulate_nccl_errors.py`
- `test/test_ao_sparsity.py`
- `test/test_autocast.py`
- `test/test_binary_ufuncs.py`
- `test/test_bundled_images.py`
- `test/test_bundled_inputs.py`
- `test/test_comparison_utils.py`
- `test/test_compile_benchmark_util.py`
- `test/test_complex.py`
- `test/test_cpp_api_parity.py`
- `test/test_cpp_extensions_aot.py`
- `test/test_cpp_extensions_jit.py`
- `test/test_cpp_extensions_open_device_registration.py`

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124137
Approved by: https://github.com/soulitzer
2024-04-19 22:01:27 +00:00
f0560f7b3b [opcheck] Stop doing test_aot_dispatch_static by default (#124495)
Motivations:
- this is pretty redundant with test_aot_dispatch_dynamic.
- The user story for opcheck is that a user should use opcheck to see
  if their operator was "registered correctly". If a user's custom op
  only supports dynamic shapes, then it's a bit awkward for
  one of the tests (e.g. `test_aot_dispatch_static`) to fail.
- We've already stopped running test_aot_dispatch_static in all of
  our opcheck tests.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495
Approved by: https://github.com/williamwen42
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414
2024-04-19 21:57:22 +00:00
37d18966ea [custom_op] set some tags when constructing the op (#124414)
- The op is automatically tagged "pt2-compliant".
- In general we want to turn on needs_fixed_stride_order for all custom ops, but this needs some more work, so we're just going to turn it on for the new custom op API (a hedged usage sketch is shown below).
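
A hedged sketch of defining an op with the new custom op API and inspecting its tags (the `torch.library.custom_op` entry point and the exact tag names are assumptions, not taken from this PR):
```python
import torch

# Assumed new-style custom op API; mutates_args=() declares a functional op.
@torch.library.custom_op("mylib::add_one", mutates_args=())
def add_one(x: torch.Tensor) -> torch.Tensor:
    return x + 1

# Per this PR, ops constructed this way should automatically carry tags such
# as pt2_compliant_tag and needs_fixed_stride_order.
print(torch.ops.mylib.add_one.default.tags)
```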

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124414
Approved by: https://github.com/albanD
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403
2024-04-19 21:57:22 +00:00
1900f79b72 [FSDP2] Added set_reshard_after_backward (#124319)
This PR adds a `set_reshard_after_backward` method to allow disabling resharding after backward. `reshard_after_backward=False` can be used with `reshard_after_forward=False` to implement "ZeRO-1", where there is only all-gather on the first microbatch forward and reduce-scatter on the last microbatch backward.

```
for microbatch_idx, microbatch in enumerate(dataloader):
    is_last_microbatch = microbatch_idx == num_microbatches - 1
    model.set_requires_gradient_sync(is_last_microbatch)
    model.set_reshard_after_backward(is_last_microbatch)
    model.set_is_last_backward(is_last_microbatch)
    microbatch_fwd_bwd(model, microbatch, microbatch_idx)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124319
Approved by: https://github.com/weifengpy
2024-04-19 21:49:35 +00:00
10b9d4d19c [export] handle Dim.lower = 0, 1 for ep.run_decompositions() (#123602)
Summary:
With pre-dispatch export and ep.run_decompositions(), range constraints are updated by looking at ShapeEnv.var_to_range. However, the lower bounds on these may be incorrect: analysis on unspecialized symbols is done with a lower bound of 2, which mismatches user-specified bounds (which may be 0 or 1).

This updates `_get_updated_range_constraints()` to use the old range constraints if possible.
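
A hedged sketch of the kind of program this affects (illustrative, not the test case from this diff): a user-specified dynamic dim with a lower bound below 2 that should survive `run_decompositions()`.
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# The user-specified lower bound of 1 is below the analysis default of 2; the
# range constraints after run_decompositions() should keep the user's bound.
batch = Dim("batch", min=1, max=1024)
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
ep = ep.run_decompositions()
print(ep.range_constraints)
```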

Test Plan: Existing pre-dispatch/dynamic shapes test case.

Differential Revision: D55899872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123602
Approved by: https://github.com/tugsbayasgalan
2024-04-19 21:29:36 +00:00
c74dfca5e7 Int4MM: Unswizzle for different dtypes (#124448)
If the dtype is not the one this platform is optimized for, it might need different unswizzling patterns. Implement ones for the non-vectorized flavor of the kernel, so that int4mm can be used with the float32 and float16 dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124448
Approved by: https://github.com/jgong5, https://github.com/mikekgfb
2024-04-19 21:17:15 +00:00
000d55870a Enable in oss (#124031)
The biggest movement is 4% for HF inference and 9% for TIMM inference. Note that this is max-autotune mode, so we are more tolerant of compilation time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```

There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
2024-04-19 20:28:55 +00:00
e6a788ac26 Fix compilation on aarch64 with gcc (#124511)
gcc is more stringent than clang when equivalently sized NEON registers are cast to each other. In particular, at one point `uint16x4_t` was cast to `int16x4_t`, which gcc does not allow. Added `vreinterpret_s16_u16` (which is a no-op) to solve this; tested in https://godbolt.org/z/sYb4ThM6M

Test plan: Build aarch64 wheels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124511
Approved by: https://github.com/mikekgfb
2024-04-19 19:53:19 +00:00
179108f14d Use separate flags for MultiTemplates from BenchmarkFusion (#122825)
Two changes:
- Make the flag for multi-template buffers independent from benchmark fusion. While benchmark fusion can be useful, the compilation time/performance trade-offs are different from those for just templates, which we'd like to enable by default.
- Don't do MultiTemplateBuffers/benchmark fusion for templates which have custom input gen fns (which currently only exist internally). Threading the custom input gen fns through to benchmark fusion is NYI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122825
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229
2024-04-19 19:50:42 +00:00
73f56e1e81 [sym_shapes][perf] Do not calculate hint in advice_is_size (#124472)
Differential Revision: [D56352412](https://our.internmc.facebook.com/intern/diff/D56352412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124472
Approved by: https://github.com/ezyang
2024-04-19 19:10:24 +00:00
661fd23640 [AMD] TunableOp take priority over DISABLE_ADDMM_HIP_LT (#124161)
Summary: It seems super confusing that if we set DISABLE_ADDMM_HIP_LT + PYTORCH_TUNABLEOP_ENABLED, the former takes priority. This is because the former goes through gemm_and_bias, while tunable op is integrated with the gemm path. Until we can integrate tunable op with gemm_and_bias, we'll just let tunable op take priority.

Test Plan: Run a simple linear program and verified.

Differential Revision: D56183954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124161
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni
2024-04-19 19:08:06 +00:00
f87c788a34 Revert "Capture triton kernel in execution trace (#124140)"
This reverts commit 89407eca3b0be3c0272b5c583f8e77b9108a71f8.

Reverted https://github.com/pytorch/pytorch/pull/124140 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/124140#issuecomment-2067137104))
2024-04-19 19:05:44 +00:00
761de37ab7 [sym_shape][perf] eval_static: guards, unbacked compute once (#124217)
Differential Revision: [D56212345](https://our.internmc.facebook.com/intern/diff/D56212345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124217
Approved by: https://github.com/ezyang
2024-04-19 19:03:04 +00:00
8869b543e8 [AMD] Remove deprecated macro from COnvUtils (#124158)
Summary:
This is not great, but our ATen-cpu is not completely GPU-agnostic. Previously we worked on D54453492 (https://github.com/pytorch/pytorch/pull/121082) and D54528255, but there are a few things we haven't resolved, and it's exploding here. So we'll continue to fix them until all are gone.

This ROCm block is for 4.3, which is very old. I don't think it should be supported any more, so let's just kill this macro.

Test Plan: CI

Differential Revision: D56172660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124158
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni
2024-04-19 19:00:31 +00:00
b0d83726bd [5/x][AMD][Lowering Enablement] Hipifying aoti code_wrapper (#124241)
Summary: as title

Test Plan:
CI & unit test

patch on top of https://www.internalfb.com/phabricator/paste/view/P1214895953 to test

Differential Revision: D56223917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124241
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-04-19 18:57:38 +00:00