Compare commits

3649 Commits

Author SHA1 Message Date
83ad8e01b1 fix the problem that cpu_fallback for aten::triu_indices on custom device crashed (#121306)
Fixes #121289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121306
Approved by: https://github.com/ezyang
2024-03-26 01:29:45 +00:00
5e66bf5f42 Avoid COW materialize in nn.functional forward ops (3) (#122443)
Affected ops:
* repeat
* unfold
* logsigmoid
* pixel_shuffle/unshuffle
* remaining norm ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122443
Approved by: https://github.com/ezyang
2024-03-26 00:56:57 +00:00
b6982bf2b2 [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
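A minimal sketch of that forwarding (simplified and hypothetical; the real `OptimizedModule` does more bookkeeping):

```python
class WrapperSketch:
    """Hypothetical stand-in for OptimizedModule's attribute forwarding."""

    def __init__(self, mod):
        # Store the wrapped module without triggering our own __setattr__.
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):
        # Reads fall through to the wrapped module.
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        # Writes are forwarded too, so wrapper.x stays an alias for module.x.
        setattr(self._orig_mod, name, value)
```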

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-26 00:52:12 +00:00
eda279c997 [CpuInductor] Implement masked_load for integral types (#122608)
Use `if constexpr` to separate float vs integral masked load for avx512
Discovered while looking at `test_comprehensive_fft_ihfft2_cpu_int64` on
non-AVX512 capable CPUs, where the (5, 6, 7) shape was big enough to start a vectorized loop

Added `test_pad_cast` regression test

Fixes https://github.com/pytorch/pytorch/issues/122606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122608
Approved by: https://github.com/jansel
ghstack dependencies: #122607
2024-03-25 22:44:54 +00:00
57a3d00b06 [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)
Summary:
During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have zero control over them.

To accommodate this, we should allow users not to pass in those internally generated constants while still being able to use the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Differential Revision: D55286634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-03-25 22:05:20 +00:00
ebde6c72cb Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as :

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
2024-03-25 21:33:36 +00:00
9b095c3fe6 [dynamo] Config to not emit runtime asserts (#122603)
Repeat of https://github.com/pytorch/pytorch/pull/122406, which was squashed & merged by mistake

Differential Revision: [D55312394](https://our.internmc.facebook.com/intern/diff/D55312394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122603
Approved by: https://github.com/ezyang
2024-03-25 21:17:44 +00:00
1f67da5105 [executorch hash update] update the pinned executorch hash (#122152)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122152
Approved by: https://github.com/pytorchbot
2024-03-25 20:56:34 +00:00
46a76cfef5 [ROCm] Fix test_trace_rule_update.py (#121524)
- Add missing torch APIs to the trace rules and ignore APIs with manual trace rules.

This PR fixes test/dynamo/test_trace_rule_update.

Possibly related to https://github.com/pytorch/pytorch/pull/121142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121524
Approved by: https://github.com/jansel, https://github.com/pruthvistony
2024-03-25 20:53:24 +00:00
bc7f3859b3 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update`_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
2024-03-25 20:50:12 +00:00
1c1268b6e9 seg-fault of "basic_string::_M_construct null not valid" fix for getNcclErrorDetailStr (#121905)
When working on testing all-reduce with an alternative rccl replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` returned null, and the code then used the return value to construct a std::string, which seg-faulted with the exception `basic_string::_M_construct null not valid`.

This pull request fixes this edge condition so that the program exits gracefully with useful information.

**Test:**
Before the fix, my test script exits like below:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid
```

After this fix, my test script exited with useful message like,
```
[rank0]:   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]:  Last error: Unknown NCCL Error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905
Approved by: https://github.com/wconstab
2024-03-25 20:49:34 +00:00
05bbcae5bb Refactor functorch meta conversion (#122202)
At a high level, the goal of this refactor was to make it so that `MetaConverter.__call__` has a straightforward code structure in three steps: (1) check if we support doing meta conversion, (2) describe the tensor into MetaTensorDesc, (3) call `meta_tensor` on MetaTensorDesc. However, this is not so easy to do, because there is a big pile of special cases for functional tensor inside `__call__`.
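A pseudocode sketch of that three-step shape (helper names here are hypothetical, not the actual attribute names):

```python
def meta_converter_call_sketch(converter, t):
    # (1) check whether meta conversion is supported for this tensor
    if not converter.is_supported(t):
        return NotImplemented
    # (2) describe the tensor into a serializable MetaTensorDesc
    desc = converter.describer.describe_tensor(t)
    # (3) reconstruct the meta/fake tensor from the description alone
    return converter.meta_tensor(desc)
```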

The primarily complication is handling the ambient functionalization state: specifically, the functorch dynamic layer stack and the Python functionalization dispatch. The old code demands that meta tensor conversion happen with this state disabled. But I discovered that when I reconstruct functorch tensors it demands that the functorch layers be active; in fact a batch tensor will have a pointer to the internal functorch layer.

I had some discussion with Richard Zou about what code structure here makes sense. In particular, one of the goals of the refactor here is that I can inflate MetaTensorDesc from an entirely different process, which may not have all of the functorch layers activated at the time we do reconstruction. So it seems to me that we should make it explicit in MetaTensorDesc that there was some functorch layer active at the time the functorch tensor was serialized, so that we could potentially know we need to reconstruct these layers on the other side. This is NOT implemented yet, but there's some notes about how potentially it could proceed. But the important thing here is we SHOULD disable everything when we run `meta_tensor`, and internally be responsible for restoring the stack. Actually, the necessary infra bits in functorch don't exist to do this, so I added some simple implementations in pyfunctorch.py.

The rest is splitting up the manipulations on tensor (we do things like sync the real tensor before describing it; Describer is responsible for this now) and I also tried to simplify the not supported condition, based on my best understanding of what the old thicket of conditions was doing. You may notice that the internal meta_tensor handling of functional tensor is inconsistent with surrounding code: this is because I *exactly* replicated the old reconstruction behavior; a further refactor would be to rationalize this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122202
Approved by: https://github.com/zou3519
2024-03-25 20:47:21 +00:00
9223b2cb31 Pop codegened parent graph from wrapper in GraphLowering (#122469)
Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done.

Fixes https://github.com/pytorch/pytorch/issues/121887.

Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469
Approved by: https://github.com/eellison
2024-03-25 20:27:59 +00:00
b2c496ba24 Revert "[TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)"
This reverts commit c1fe09dc37358d8121f119d66e9e8c8d57035158.

Reverted https://github.com/pytorch/pytorch/pull/121415 on behalf of https://github.com/ezyang due to I think this needs to be reverted to after https://github.com/pytorch/pytorch/pull/120076 revert ([comment](https://github.com/pytorch/pytorch/pull/121415#issuecomment-2018828813))
2024-03-25 20:14:40 +00:00
f84e3bf36d [ez] Fix XLA auto hash updates (#122630)
The xla pin is located in .github/ci_commit_pins not .ci/docker/ci_commit_pins
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122630
Approved by: https://github.com/huydhn
2024-03-25 19:45:56 +00:00
9d1de31634 [BE][CPUInductor] Use C++17 helper templates (#122607)
Such as `std::is_same_v` ,`std::is_integral_v` and C++14 one `std::enable_if_t`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122607
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-25 19:01:44 +00:00
2d4197c9b7 add case for creating storage on ort (#122446)
Fixes #122445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122446
Approved by: https://github.com/mikaylagawarecki
2024-03-25 18:59:20 +00:00
2db7d874a9 [inductor] Improve error message for shape errors in slice_scatter (#122543)
Fixes #122291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122543
Approved by: https://github.com/shunting314
2024-03-25 18:57:16 +00:00
db506762d1 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit a52b4e22571507abc35c2d47de138497190d2e0a.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2018680656))
2024-03-25 18:52:05 +00:00
c7bf5871ce CUDAEvent::elapsed_time could accidentally initialize a non-used GPU (#122538)
This sets the device before calling cudaEventElapsedTime to avoid the case
where the "cudaGetCurrentDevice" device would be initialized even though
neither event is on that device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122538
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-03-25 17:49:50 +00:00
198927170d Avoid COW materialize in nn.functional forward ops (2) (#121992)
Affected ops:
* dropout
* embedding
* embedding_bag
* multi_head_attention_forward
* grid_sample
* ctc_loss
* nll_loss
* pdist

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121992
Approved by: https://github.com/ezyang
ghstack dependencies: #122437, #121991
2024-03-25 17:31:19 +00:00
55becf02bc Avoid COW materialize in nn.functional forward ops (1) (#121991)
Affected ops:
* Remaining norm ops
* pad
* margin_loss ops
* fractional_max_pool
* linear
* prelu
* rrelu
* scaled_dot_product_attention
* logsigmoid
* threshold
* binary_cross_entropy
* gelu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121991
Approved by: https://github.com/ezyang
ghstack dependencies: #122437
2024-03-25 17:31:19 +00:00
4c70ab26ef [MPS] Enable index_select for complex types (#122590)
Surprisingly, as of MacOS-14.14, MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them using the `at::view_as_real` trick

Fixes https://github.com/pytorch/pytorch/issues/122427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590
Approved by: https://github.com/Skylion007
2024-03-25 16:57:35 +00:00
e6a37eeb06 run some cuda testcases on other devices if available. (#122182)
If users want to run some CUDA test cases on other devices by setting an environment variable, in order to test performance on custom devices, I think it can be done as in this PR.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122182
Approved by: https://github.com/ezyang
2024-03-25 16:40:03 +00:00
70ac13b876 [ez][TD] Hide errors in llm retrieval job (#122615)
The new ghstack no longer has a base on main, so finding the base for ghstacked PRs is harder.  Something similar to https://github.com/pytorch/pytorch/pull/122214 might be needed, but then I'm worried about tokens

Either way, this is a quick workaround to hide these errors for ghstack users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122615
Approved by: https://github.com/huydhn
2024-03-25 16:35:00 +00:00
47a9725de9 Implement prefer_deferred_runtime_asserts_over_guards (#122090)
Fixes https://github.com/pytorch/pytorch/issues/121749

As promised, it is pretty easy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122090
Approved by: https://github.com/lezcano
2024-03-25 16:31:16 +00:00
e49a38973f Update DimOrDims typing in torch.sparse (#122471)
I noticed the typing of the `torch.sparse.sum`'s `dim` parameter wasn't allowing an int tuple as input and tracked the issue to this type.
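For illustration, a widened alias along these lines would accept either form (a sketch; the actual definition lives in `torch/sparse`):

```python
from typing import List, Optional, Tuple, Union

# Accept a single dim, a tuple of dims, or a list of dims (or None for "all dims").
DimOrDims = Optional[Union[int, Tuple[int, ...], List[int]]]
```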

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122471
Approved by: https://github.com/soulitzer
2024-03-25 16:25:56 +00:00
06f22537ca [dynamo] Suppress warning about torch.autograd.Function() (#122566)
PR #120577 got reverted due to issues in fbcode.  This hides warning
that PR was trying to fix until we can debug the fbcode issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122566
Approved by: https://github.com/yanboliang
2024-03-25 16:18:43 +00:00
0465a90b00 [export][reland] Fix unflattened submodule ordering. (#122341) (#122507)
Summary:

Make sure the order of submodules is the same as the original eager module.

bypass-github-export-checks

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_unflatten_submodule_ordering

Differential Revision: D55251277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122507
Approved by: https://github.com/tugsbayasgalan
2024-03-25 15:22:01 +00:00
11dfa72153 [BE] Remove unnecessary state dict update. (#122528)
From what I can see, the following is a redundant/unnecessary setting of a dict element.

Differential Revision: [D55191396](https://our.internmc.facebook.com/intern/diff/D55191396/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122528
Approved by: https://github.com/Skylion007
2024-03-25 15:21:44 +00:00
5152945441 GPT2 SDPA inference pattern-matching for Inductor-CPU (#121866)
### Summary
With this PR, SDPA pattern of GPT2 is being mapped to `torch.nn.functional.scaled_dot_product_attention`.
While GPT2 supports both a causal mask & an attention mask, this PR considers the case of attention mask being absent.
TorchBench inference workload for GPT2 also doesn't use an attention-mask.

This pattern's replacement is being disabled for CUDA because [CUDA AOT Inductor](https://github.com/pytorch/pytorch/actions/runs/8319111885/job/22762567770) CI job's `GPT2ForSequenceClassification` accuracy test failed, although all other trunk CUDA Inductor CI checks had passed.
Created #122429 to track that particular issue.
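For reference, the replacement target with a causal mask and no explicit attention mask looks like this (illustrative shapes):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative sizes only
q = torch.randn(1, 12, 128, 64)
k = torch.randn(1, 12, 128, 64)
v = torch.randn(1, 12, 128, 64)

# Causal masking, no explicit attention mask -- the case this pattern covers.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)
```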

### CPU performance data with TorchBench
|MODEL |BATCH SIZE | DTYPE | BEFORE: Speedup over eager-mode with the default Inductor implementation | AFTER: Speedup over eager mode with SDPA op mapped| Perf boost = (AFTER - BEFORE)/BEFORE * 100|
|--------------------------|-------------|---------|-----------------------------|--------------------------|------------|
|hf_GPT2| 1 | FP32 | 1.522x | 1.791x| 17.67%|
|hf_GPT2| 1 | BF16 (AMP) | 1.795x | 2.387x| 32.98%|
|hf_GPT2| 2 | FP32 |  1.313x |1.629x | 19.3%|
|hf_GPT2|2| BF16 (AMP) | 1.556x | 1.924x | 23.65%|
|hf_GPT2_large| 1 | FP32 | 1.380x |1.585x | 12.93%|
|hf_GPT2_large| 1 | BF16 (AMP) | 1.208x | 1.567x | 22.91%|
|hf_GPT2_large| 2 | FP32 | 1.188x | 1.490x | 25.42%|
|hf_GPT2_large|2| BF16 (AMP) | 0.991x | 1.575x | 58.93%|

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen Sapphire Rapids)
48 physical cores were used. Intel OpenMP & libtcmalloc were preloaded.

Example command -
```
 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 -C 0-47 python benchmarks/dynamo/torchbench.py --performance --inference --inductor --float32 -dcpu --only hf_GPT2_large --freezing --batch-size 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121866
Approved by: https://github.com/Valentine233, https://github.com/jgong5, https://github.com/desertfire
2024-03-25 15:04:03 +00:00
4dc09d6aa4 Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)"
This reverts commit e9dcda5cba92884be6432cf65a777b8ed708e3d6.

Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))
2024-03-25 13:49:04 +00:00
cyy
b9d6f8cc18 Fix clang-tidy warnings in aten/src/ATen/core/*.cpp (#122572)
This PR fixes clang-tidy warnings in aten/src/ATen/core/*.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122572
Approved by: https://github.com/ezyang
2024-03-25 13:46:24 +00:00
1e404c9b12 Remove redundant query to tensor_to_context (#122278)
from_real_tensor will query it again, so this query is strictly
dominated.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122278
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270, #122271
2024-03-25 13:16:21 +00:00
49b81af45f Delete dead memoized_only kwarg in FakeTensor (#122271)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122271
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270
2024-03-25 13:16:21 +00:00
f32ce4e28e Delete FakeTensorConverter.__call__ in favor of from_real_tensor (#122270)
It's annoying grepping for `__call__` call-sites so they're now all explicit now. I'd do this to MetaConverter too but that one is way more public, a lot more sites.
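Illustrative before/after at a call site (a sketch; the argument order of `from_real_tensor` is assumed here, not confirmed by this log):

```python
def convert_sketch(converter, fake_mode, real_tensor):
    # before this PR: fake = converter(real_tensor)  (implicit __call__)
    # after this PR: the call site names the method explicitly, so it is easy to grep for
    return converter.from_real_tensor(fake_mode, real_tensor)
```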

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122270
Approved by: https://github.com/eellison
ghstack dependencies: #122044
2024-03-25 13:16:13 +00:00
069270db60 [dynamo] Fix list comparison ops (#122559)
Fixes #122376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122559
Approved by: https://github.com/Skylion007
2024-03-25 07:03:23 +00:00
5891c5b3a6 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR is pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
2024-03-25 06:21:17 +00:00
cf06189a2d [CPPInductor] Fix another out-of-bounds access (#122580)
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-n-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which would be the case with the original calculation if the data type is a 64-bit value such as int64).
Discovered while trying to enable aarch64 vectorized inductor.
Minimal reproducer (reproducible on ARMv8 or any  x86_64 machine that does not support AVX512):
```python
import torch
def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x=torch.ones(10, 10, dtype=torch.int64)
y=torch.tensor([ 1,  2, -8,  8,  5,  5, -7, -8,  7,  0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-03-25 04:49:20 +00:00
deeeaded1f Add metas for randint/rand factory functions out overload (#122375)
Fixes https://github.com/pytorch/pytorch/issues/121897

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122375
Approved by: https://github.com/lezcano
2024-03-25 04:01:38 +00:00
cyy
a01d35c7f6 [TorchGen] Remove unused variables (#122576)
This PR removes some unused Python variables from TorchGen scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122576
Approved by: https://github.com/Skylion007
2024-03-25 03:31:41 +00:00
e75ecd5618 [BE][veclib] Use is_same_v/enable_if_t (#122533)
`enable_if_t` helper is part of C++14
`is_same_v` helper is part of C++17

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122533
Approved by: https://github.com/Skylion007
2024-03-24 20:57:41 +00:00
14e348b7ad Handle JIT test failure when the GPU is newer than the CUDA compiler or vice versa (#122400)
The test may fail because it either uses target flags newer than the GPU, resulting in failures loading the compiled binary, or targets a GPU for which CUDA has no support yet/anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122400
Approved by: https://github.com/ezyang
2024-03-24 13:58:06 +00:00
36188360dd [dynamo] support torch.distributed.{group.WORLD, GroupMember.WORLD, distributed_c10d._get_default_group} (#120560)
Fixes https://github.com/pytorch/pytorch/issues/120431
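A minimal usage sketch of the now-traceable accessors (assumes an initialized default process group):

```python
import torch
import torch.distributed as dist
from torch.distributed import distributed_c10d

@torch.compile
def fn(x):
    # These lookups previously caused graph breaks; dynamo can now trace them.
    pg = distributed_c10d._get_default_group()
    return x + dist.get_world_size(pg) + dist.get_rank(dist.group.WORLD)
```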

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120560
Approved by: https://github.com/wconstab
2024-03-24 11:13:05 +00:00
3e4a4bea12 [dynamo] Graph break on SymNode control flow (#122546)
Fixes #111918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122546
Approved by: https://github.com/ezyang
2024-03-24 07:22:02 +00:00
adeedc060f [Inductor] Fix unbacked symbol in stride when using item() (#122298)
Fixes #122296

Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298
Approved by: https://github.com/ezyang
2024-03-24 06:27:15 +00:00
cyy
c1fe09dc37 [TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)
This PR is a follow-up of #120076; it moves the std::optional<Generator> detection logic into `valuetype_type` of api/cpp.py by adding the mutable parameter, which facilitates future value type changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121415
Approved by: https://github.com/ezyang
2024-03-24 06:11:08 +00:00
ca9606f809 Update COW OpInfo test to include kwargs and expected materialization (#122437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122437
Approved by: https://github.com/ezyang
2024-03-24 06:07:30 +00:00
9d4218c23e Handle JIT test failure when the GPU is newer than the CUDA compiler (#122402)
The test uses the CUDA compute capabilities of the current device to
compile an extension. If nvcc is older than the device, it will fail
with a message like "Unsupported gpu architecture 'compute_80'"
resulting in a `RuntimeError: Error building extension 'cudaext_archflags'`
ultimately failing the test.

This checks for this case and allows execution to continue

Fixes https://github.com/pytorch/pytorch/issues/51950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122402
Approved by: https://github.com/ezyang
2024-03-24 05:36:24 +00:00
cyy
808a035658 [Dynamo][4/N] Enable clang-tidy coverage on torch/csrc/dynamo/* (#122534)
This PR enables clang-tidy coverage on torch/csrc/dynamo/* and also contains other small improvements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122534
Approved by: https://github.com/Skylion007
2024-03-24 05:26:32 +00:00
f0d461beac [vision hash update] update the pinned vision hash (#122536)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122536
Approved by: https://github.com/pytorchbot
2024-03-24 03:42:21 +00:00
5f7e71c411 [dynamo] Add HASATTR guard for UserDefinedObject attrs (#122555)
Fixes #111522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122555
Approved by: https://github.com/Skylion007
2024-03-24 03:41:58 +00:00
07d037674f [inductor] Fix issue with randint + symbolic shapes (#122428)
Fixes #122405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122428
Approved by: https://github.com/ezyang
2024-03-24 03:41:13 +00:00
476585b190 Preserve unbacked SymInt on SymNode (#120816)
Previously, when we applied a replacement, a SymInt that was
previously an unbacked SymInt would then transmute into whatever
we replaced it into (e.g., a constant).

This has a major downside: we often look at SymInts associated with
FX nodes (e.g., the meta of x.item() return) to find out where the
unbacked SymInt was allocated.  If we replace it, we no longer can
find out where, e.g., u1 was allocated!  But we need to know this
so we can generate deferred runtime asserts like u1 == s0.

To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites:

* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.

There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.

Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.

Fixes https://github.com/pytorch/pytorch/issues/119689

Fixes https://github.com/pytorch/pytorch/issues/118385

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
2024-03-24 02:56:16 +00:00
cyy
a52b4e2257 Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use `const std::optional<Generator>&` for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-24 02:12:08 +00:00
788638fcdc Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122473
Approved by: https://github.com/lezcano
2024-03-24 01:02:20 +00:00
cdc7f0fd3b Fixed failing pyhpc_equation_of_state due to cpp nodes fusion with compatible ranges (#122420)
Fixes #122283

Description:

PR https://github.com/pytorch/pytorch/pull/120077 introduced cpp nodes fusion with compatible ranges under the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122420
Approved by: https://github.com/lezcano
2024-03-24 00:40:31 +00:00
4758837930 [BE] Do not use importlib.load_module (#122542)
To get rid of the annoying
```
<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead
```
using recipe from https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly
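That recipe boils down to the following (module name and path here are placeholders):

```python
import importlib.util

spec = importlib.util.spec_from_file_location("my_module", "/path/to/my_module.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)  # replaces the deprecated loader.load_module()
```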

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122542
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-23 17:22:26 +00:00
bf40e3f880 [EZ][BE] Add missing acosh op to vec256_float_neon.h (#122513)
As base class has it
ed15370aab/aten/src/ATen/cpu/vec/vec_base.h (L367-L369)

Discovered while attempting to enabling Inductor vectorization on ARM platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122513
Approved by: https://github.com/Skylion007
2024-03-23 14:18:02 +00:00
a39e638707 Update bsr_dense_addmm kernel parameters for sizes 3 x 2 ^ N (#122506)
As in the title. The speed-ups for a particular set of input sizes range from about 7 to 85%, depending on the BSR tensor block sizes used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122506
Approved by: https://github.com/cpuhrsch
2024-03-23 11:54:33 +00:00
8a209344c9 Fix access to unitialized memory in VSX vector functions for quantized values (#122399)
Similar to https://github.com/pytorch/pytorch/pull/89833, those functions may access uninitialized memory, leading to undefined behavior/results.
Initialize with zeros as done before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122399
Approved by: https://github.com/ezyang
2024-03-23 06:11:30 +00:00
c677221798 remove torchao dependency (#122524)
Test Plan:
CI

```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp32 --pt2e_quantize "xnnpack_dynamic" -2
```

```
buck run //executorch/backends/xnnpack/test:test_xnnpack_ops -- executorch.backends.xnnpack.test.ops.linear.TestLinear.test_qd8_fp32_per_token_weight_per_channel_group_int4
```

Differential Revision: D55263008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122524
Approved by: https://github.com/jerryzh168
2024-03-23 03:18:43 +00:00
19d27a13ea [CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511)
Discovered while debugging regressions in enabling vectorization on ARM platform

Without this change `test_div2_cpu` will fail with invalid values on non-x86 CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122511
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-03-23 01:45:07 +00:00
4d8a3f8bb3 changed aliasing checks to properly recurse for computing last usage (#122444)
Fixes https://github.com/pytorch/pytorch/issues/122457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122444
Approved by: https://github.com/yifuwang, https://github.com/jansel
ghstack dependencies: #121624, #122474
2024-03-23 01:43:21 +00:00
50036ec781 [Inductor] Add a test for creating a cpu inductor-> triton backend (#122396)
Summary: Currently there is a test for adding a backend in test/inductor/test_extension_backend.py for a cpp backend with a new device. However, there is no such test for the Triton backend; it should be possible for a user to create and register their own ExtensionWrapperCodegen and ExtensionScheduling for another non-CUDA device and be able to generate Triton code. For simplicity I have chosen to use a CPU device, as I think it's plausible someone might want to create a CPU Triton backend.
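A hedged sketch of the kind of registration being tested (class names mirror the test; the helper and base classes are inductor internals whose exact names and signatures may differ):

```python
from torch._inductor.codegen.common import register_backend_for_device
from torch._inductor.codegen.wrapper import WrapperCodeGen
from torch._inductor.scheduler import BaseScheduling

class ExtensionWrapperCodegen(WrapperCodeGen):
    pass

class ExtensionScheduling(BaseScheduling):
    pass

# Route CPU codegen through the extension's scheduling and wrapper classes.
register_backend_for_device("cpu", ExtensionScheduling, ExtensionWrapperCodegen)
```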

Unfortunately the generation and running of the code is quite tightly coupled so I've had to use a mocked function to extract the code before running. Suggestions are welcome for better ways to do this.

This is a stepping off point for some additional PRs to make the Triton code path less CUDA specific, as currently there would be no way to test this avenue.

Test plan:
```
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('intermediate_hooks', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 0.394s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122396
Approved by: https://github.com/jansel
2024-03-23 01:14:57 +00:00
41d69ff324 Add a shape inference tool (#120097)
Summary:
Add a shape inference tool that helps to infer each node shape of a given graph module.
1. Given a fx graph and an example input (it doesn't need to be an accurate input that can run forward, but it should have valid dims and data structures), `infer shape` creates an input of symbolic shape
2. Propagating shapes with this symbolic input can catch runtime or value exceptions.
3. These errors are constraints for symbol values, and the constraint solver `infer symbolic values` helps us figure out specific values for each symbol.
4. Finally, we run the shape propagation based on input tensor to get tensor shapes for all nodes in the FX traced module.

Test Plan:
### 1. Test `infer symbol values`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```

### 2. Test `infer shape`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```
Inferred shape result like: P897560514

Differential Revision: D53593702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120097
Approved by: https://github.com/yf225
2024-03-23 00:23:24 +00:00
29bca8547b Fix failing test_cpu_repro without vectorization support (#117262)
At least the following tests fail when there is no supported vector ISA:
test_lowp_fp_neg_abs
test_non_contiguous_index_with_constant_stride
test_scalar_mul_bfloat16
test_transpose_non_contiguous
test_transpose_sum2d_cpu_only
test_transpose_sum_outer
test_transpose_vertical_sum_cpu_only
test_vertical_sum_cpu_only

Those tests assert `metrics.generated_cpp_vec_kernel_count` is nonzero
which is never the case without a supported vector ISA, e.g. on PPC and
maybe on AArch.

Skip those tests with a new decorator and use the simpler one where an equivalent is already used

Some usages of `metrics.generated_cpp_vec_kernel_count` were guarded by a check instead of skipping the test. I tried to apply that instead of a skip where the test looked similar enough to where that was previously done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117262
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-23 00:03:55 +00:00
a84f1d3def [effects] Fix backwards handling (#122346)
I didn't previously test the `.backwards()` call, and when testing on #122348 I realized we were missing some token handling in some places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122346
Approved by: https://github.com/zou3519
2024-03-22 23:31:52 +00:00
e7fa3f7812 AOTDispatch: allow subclasses to correct when we guess metadata of tangents incorrectly (#118670)
This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600.

More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like:

"We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents"

Here, the problem is similar:

"We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass".

This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial).

One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by:

(1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error

(2) In the error message, provide the name of an optional method that the subclass must implement to handle this case:

`def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement.

`__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is **unable** to convert a DTensor with some placement type into another DTensor with a `_Partial` placement.

`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time.
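A hedged sketch of a subclass opting into the hooks described above (the dunder names come from this PR's description; the helpers are hypothetical no-ops for illustration):

```python
import torch


def convert_to_matching_metadata(t, metadata_tensor):
    # Hypothetical helper: convert t's subclass metadata to match metadata_tensor's.
    return t


def pick_standard_metadata(t):
    # Hypothetical helper: pick metadata that any runtime tangent can be converted to.
    return t


class MySubclassTensor(torch.Tensor):
    def __force_same_metadata__(self, metadata_tensor):
        # e.g. convert a Shard(1) runtime tangent to the Replicate() guessed at trace time.
        return convert_to_matching_metadata(self, metadata_tensor)

    def __force_standard_metadata__(self):
        # e.g. avoid generating tangents with a _Partial-like placement at trace time.
        return pick_standard_metadata(self)
```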

I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses are definitely a contentious change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670
Approved by: https://github.com/ezyang
2024-03-22 23:16:08 +00:00
f7b8d8e249 Support for sapling scm (#122072)
We can use Sapling (hg) with the pytorch repo, but there are a couple of minor issues; this teaches our scripting to be happier with having either a git or hg repo.

This change fixes some issues in:
- setup.py
- lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122072
Approved by: https://github.com/ezyang
2024-03-22 22:59:16 +00:00
cyy
482f6c4693 [Dynamo][3/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122392)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122392
Approved by: https://github.com/ezyang
2024-03-22 22:57:41 +00:00
3f99306452 [export] Remove from_export flag (#122500)
Summary: The flag from_export was incorrectly included in a previous diff (https://www.internalfb.com/diff/D54314379) - it was intended for helping with ExportedProgram verification, but was no longer needed in the final implementation.

Test Plan: Changes no functionality, test/export already covers everything

Differential Revision: D55205857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122500
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-22 22:55:14 +00:00
03184a82dd [TD] TD on ASAN PR jobs (#122332)
Low impact CPU jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332
Approved by: https://github.com/huydhn
2024-03-22 22:32:51 +00:00
271cc687de Audit retracibility errors and fix some ez ones (#122461)
Summary: Title

Test Plan: CI

Differential Revision: D55227094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122461
Approved by: https://github.com/zhxchen17
2024-03-22 21:31:51 +00:00
29132c2e47 Prevent dup initializers when ONNXProgram.save is called many times (#122435)
Fixes https://github.com/pytorch/pytorch/issues/122351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122435
Approved by: https://github.com/titaiwangms
ghstack dependencies: #122196, #122230
2024-03-22 21:03:15 +00:00
4eaa000acc Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
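A minimal usage sketch of what this enables (dynamo tracing through a `torch.func.jvp` call):

```python
import torch
from torch.func import jvp

@torch.compile
def f(x, t):
    # Returns (sin(x), cos(x) * t); previously this call was not traceable by dynamo.
    return jvp(torch.sin, (x,), (t,))

primal_out, tangent_out = f(torch.randn(3), torch.randn(3))
```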

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-22 20:25:47 +00:00
3795ebe925 Revert "[Inductor] Make codecache CUDA compilation more robust & flexible (#121490)"
This reverts commit 6bbd697306851b785b51b4d0545c1ef9365ddaa6.

Reverted https://github.com/pytorch/pytorch/pull/121490 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
97d3bf71b9 Revert "[Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)"
This reverts commit 700c92e1b9cb6fae2610d08e5a960273c4dd1697.

Reverted https://github.com/pytorch/pytorch/pull/121491 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
8013c4409f [inductor] config to control whether we assume inputs are aligned (#122158)
**Motivation**: https://github.com/pytorch/pytorch/issues/112771

**Summary**: Inductor generates triton that assumes that inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This can generate code that might be a bit slower, but this tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones.
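A minimal usage sketch, assuming the option is toggled via the inductor config module:

```python
import torch
from torch._inductor import config as inductor_config

# Opt out of the 16-byte alignment assumption: no divisible_by_16 hints, no clones.
inductor_config.assume_aligned_inputs = False

@torch.compile
def f(x):
    return x * 2

out = f(torch.randn(8, 8))
```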

Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.

**Tests** https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.

**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158
Approved by: https://github.com/ezyang
2024-03-22 20:03:38 +00:00
5790096059 [dynamo] Remove uses of raise unimplemented (#122136)
`unimplemented` is a function that raises an error, so
`raise unimplemented(...)` never reaches the `raise`.
Another related issue is that `raise unimplemented(...) from e`
doesn't attach the exception cause correctly. I fix this by adding
a `from_exc` argument to `unimplemented`.
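A simplified sketch of the corrected pattern (the exception type here is a stand-in, not dynamo's actual class):

```python
class UnsupportedSketch(RuntimeError):
    pass

def unimplemented(msg, *, from_exc=None):
    # The helper itself raises, so callers should not wrap it in another `raise`.
    if from_exc is not None:
        raise UnsupportedSketch(msg) from from_exc  # cause attached correctly
    raise UnsupportedSketch(msg)

# before: raise unimplemented("msg") from e   # outer raise/from is dead code
# after:  unimplemented("msg", from_exc=e)    # exception cause preserved
```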

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122136
Approved by: https://github.com/lezcano
2024-03-22 19:29:58 +00:00
ed15370aab [aoti] Add handling of ir.Constants in promote_constants (#122419)
This issue popped up when enabling predispatch IR on the benchmarks (https://github.com/pytorch/pytorch/pull/122225)

On the following model:
```
class M(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = device

    def forward(self, x):
        t = torch.tensor(x.size(-1), device=self.device, dtype=torch.float)
        t = torch.sqrt(t * 3)
        return x * t
```

We get the following error:
```
======================================================================
ERROR: test_constant_abi_compatible_cuda (__main__.AOTInductorTestABICompatibleCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/test/inductor/test_torchinductor.py", line 9232, in new_test
    return value(self)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 922, in test_constant
    self.check_model(M(self.device), (torch.randn(5, 5, device=self.device),))
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 91, in check_model
    actual = AOTIRunnerUtil.run(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 102, in run
    so_path = AOTIRunnerUtil.compile(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 40, in compile
    so_path = torch._inductor.aot_compile_ep(
  File "/data/users/angelayi/pytorch/torch/_inductor/__init__.py", line 150, in aot_compile_ep
    return compile_fx_aot(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1005, in compile_fx_aot
    compiled_lib_path = compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1111, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1145, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1336, in compile_fx
    return inference_compiler(unlifted_gm, example_inputs_)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1266, in fw_compiler_base
    return inner_compile(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/data/users/angelayi/pytorch/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 447, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 707, in fx_codegen_and_compile
    graph.run(*example_inputs)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 612, in run
    return super().run(*args)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 957, in run_node
    result = super().run_node(n)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 819, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 816, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 298, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 5340, in mul
    return make_pointwise(fn)(a, b)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 409, in inner
    inputs = promote_constants(inputs, override_return_dtype)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 373, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._inductor.exc.LoweringException: StopIteration:
  target: aten.mul.Tensor
  args[0]: Constant(value=5.0, dtype=torch.float32, device=device(type='cuda', index=0))
  args[1]: 3
```

So I added an additional case in `promote_constants` to handle ir.Constants, and now it works! Although please let me know if this is the wrong approach. Here's a paste of the full run with the inductor logs: P1198927007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122419
Approved by: https://github.com/eellison, https://github.com/desertfire, https://github.com/chenyang78
2024-03-22 18:39:36 +00:00
cyy
52e9049ffa Remove unused variables (#122496)
This PR removes several unused variables in the code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122496
Approved by: https://github.com/ezyang
2024-03-22 18:04:09 +00:00
bbe846f430 Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828)
Start to fix https://github.com/pytorch/pytorch/issues/114801

Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828
Approved by: https://github.com/thiagocrepaldi
2024-03-22 18:01:33 +00:00
34d33df056 [DCP] Check if pg exists in async before checking for cpu PG (#122316)
Check if the PG exists before checking for a CPU PG in the async save path.

This PR enables using async_save even if PG is not initialized.

Differential Revision: [D54868689](https://our.internmc.facebook.com/intern/diff/D54868689/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54868689/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122316
Approved by: https://github.com/shuqiangzhang, https://github.com/XilunWu
2024-03-22 18:01:11 +00:00
400cc518fc pt2 dper passes: run shape prop before each pass (#122451)
Summary: Most passes rely on shape info, so we need to run shape prop after each pass.

Reviewed By: frank-wei

Differential Revision: D55221119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122451
Approved by: https://github.com/frank-wei
2024-03-22 17:57:25 +00:00
152fa9ecc2 skip moondream for training (#122483)
The model shows up as a failing model on the dashboard for training, but the model is not implemented for training (at least for now):
2196021e9b/torchbenchmark/models/moondream/__init__.py (L6)

Skip it in the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122483
Approved by: https://github.com/eellison
2024-03-22 17:35:52 +00:00
a3d4eaf253 [inductor] device guard for max autotune benchmark (#122479)
Internal users reported failures with max-autotune when tensors are not on device 0. It turns out that we may take tensors on, say, device 6 and run the benchmarking kernel on them on device 0.

This PR enforces that we do benchmarking for max-autotune on the correct device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122479
Approved by: https://github.com/xintwfb, https://github.com/Chillee
2024-03-22 17:27:53 +00:00
3db64c1955 [NCCL PG] Enable ncclCommDevIdxMap unconditionally (#122049)
Differential Revision: D54993977

### Summary
The initial purpose of ncclCommDevIdxMap is to support NCCL zero copy algorithms. Therefore, it is only enabled (with its values filled) if useTensorRegisterAllocatorHook_ is set to true. However, now we rely on it to support dumping NCCL information in a single PG. So we need it to be always available, regardless of whether we enabled useTensorRegisterAllocatorHook_.
Move the code of filling ncclCommDevIdxMap out of if (useTensorRegisterAllocatorHook_) statement.

### Test Plan
See diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
2024-03-22 17:10:33 +00:00
f305c96cac [DCP] Add bytesIO object to test_e2e_save_and_load (#122112)
Added a TestTrainstate that includes BytesIO checkpoint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122112
Approved by: https://github.com/LucasLLC
2024-03-22 16:57:13 +00:00
86082f1fdc [aot_inductor] added runtime checks for input/output tensors in debug compile mode (#122047)
This PR added runtime checks to guard the dtypes and shapes of input/output tensors.
Currently, we enable these only for debug compilation
(i.e. aot_inductor.debug_compile is True) in abi_compatible mode.

Differential Revision: [D54993148](https://our.internmc.facebook.com/intern/diff/D54993148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122047
Approved by: https://github.com/desertfire
2024-03-22 16:40:33 +00:00
90a13c3c5b Added a check in register_lowering to avoid decomposed ops (#117632)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117632
Approved by: https://github.com/lezcano
2024-03-22 16:38:31 +00:00
9347a79f1c [Watchdog Timer] Clear timer for already terminated process (#122324)
Summary:
Handle cases where a worker process is terminated without releasing its timer request; this scenario causes the process to be reaped at expiry.

Remove the non-existent process during clear timer.

Test Plan: unit tests

Differential Revision: D55099773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324
Approved by: https://github.com/d4l3k
2024-03-22 15:48:03 +00:00
018f5e2c32 Fix unused variable warning in int4mm.cu (#122286)
Fix the following warning while compilation:
```
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_weight_int4pack_mm_cuda(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:871:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
  871 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_convert_weight_to_int4pack_cuda(const at::Tensor&, int64_t)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:1044:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
 1044 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122286
Approved by: https://github.com/soulitzer
2024-03-22 15:46:18 +00:00
7fd14ebb52 [export] Use randomized inputs to examples. (#122424)
Summary: As titled; replacing all torch.ones with torch.randn.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D55206441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122424
Approved by: https://github.com/tugsbayasgalan
2024-03-22 15:32:28 +00:00
60bc29aa0b Revert "[Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)"
This reverts commit 2c6eeb26d3f61fba352ad51fd8653120937a20f3.

Reverted https://github.com/pytorch/pytorch/pull/122267 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
b30b396d05 Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)"
This reverts commit 99f0fec7d0873d627e8c7f2dec65818d725424b0.

Reverted https://github.com/pytorch/pytorch/pull/122268 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
777ac511cc Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)"
This reverts commit 783fd89ff1cf401e484c20d14b16823abf20d87d.

Reverted https://github.com/pytorch/pytorch/pull/122373 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
dbedc6bb7c Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)"
This reverts commit 23a6d74f9352e0afb37750fee300d077c4ba9393.

Reverted https://github.com/pytorch/pytorch/pull/122374 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
02fee6caec Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit ecbe82b9cec75324b7efb58e1d9cae6b35b71bdc.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/jeanschmidt due to Reverting in order to check if this will fix XLA trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2015272644))
2024-03-22 14:53:45 +00:00
e6986e4317 Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
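
A small usage sketch of the new entrypoint (shapes chosen purely for illustration):

```
import torch

values = torch.randn(9, 4)             # 2 + 3 + 4 rows packed back to back
offsets = torch.tensor([0, 2, 5, 9])   # constituent boundaries into `values`
nt = torch.nested.nested_tensor_from_jagged(values, offsets=offsets)
# nt is a jagged-layout NestedTensor with 3 variable-length constituents and
# is a view of `values`, so gradients flow back into `values`.
```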

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113279, #113280
2024-03-22 14:48:22 +00:00
65c37fe05a AOTAutograd: ensure traced tangent subclass metadata takes non-contiguous outputs into account (#118669)
Fixes https://github.com/pytorch/pytorch/issues/118596.

The issue was as follows:

(1) Whenever AOTAutograd sees an output that is non-contiguous, that it needs a tangent for, it forces the tangent that it generates to be contiguous during tracing

(2) However: if this tangent is a subclass, we need to generate code to flatten/unflatten the subclass at runtime.

(3) To do so, we use the metadata stashed here: https://github.com/pytorch/pytorch/blob/main/torch/_functorch/_aot_autograd/schemas.py#L231

(4) However, this metadata was **wrong** - it was generated by inspecting the tangent, **before** we made the tangent contiguous.

The fix in this PR basically moves the logic make `traced_tangents` contiguous earlier, at the time that we first generate `ViewAndMutationMetadata`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118669
Approved by: https://github.com/zou3519
ghstack dependencies: #118803, #119947
2024-03-22 14:42:27 +00:00
09be5800c8 dynamo: support placement kwargs for DTensor.to_local() (#119947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119947
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #118803
2024-03-22 14:42:27 +00:00
2e44b12dd4 dynamo: handle DTensor.device_mesh.device_type (#118803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118803
Approved by: https://github.com/wanchaol, https://github.com/yanboliang
2024-03-22 14:42:22 +00:00
ea8e0c75c7 [quant][pt2] Fix create FQ with FixedQParamsQSpec (#122104)
Summary: Before we just returned a _PartialWrapper object when
using FixedQParamsQuantizationSpec in QAT. This is wrong and
we should return a FQ object instead.

Differential Revision: [D55021106](https://our.internmc.facebook.com/intern/diff/D55021106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122104
Approved by: https://github.com/jerryzh168
2024-03-22 14:23:05 +00:00
6e6891e843 [jit] Fix _batch_norm_with_update shape function (#122430)
Summary: We used `native_batch_norm`'s shape function before,
but the schemas are actually different. We need to create new
shape functions for `_batch_norm_with_update` specifically.

Test Plan:
buck2 test '@fbcode//mode/opt-tsan' fbcode//caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - TestShapeGraphLinting.Basic'

Reviewers: bdhirsh, davidberard98, eellison

Differential Revision: [D55211182](https://our.internmc.facebook.com/intern/diff/D55211182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122430
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2024-03-22 14:21:57 +00:00
23a6d74f93 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)
**Summary**
Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268, #122373
2024-03-22 13:13:14 +00:00
f65373e278 Revert "Factor meta conversion through serializable MetaTensorDesc (#122044)"
This reverts commit e2d89e970480d7e5b10a77928442d8caf94e0e84.

Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))
2024-03-22 12:46:21 +00:00
700c92e1b9 [Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)
* Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels **_inductor.config.cutlass_backend_min_gemm_size**

 * During GEMM algorithm choice generation: **if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend**, even if it is not enabled in **_inductor.config.max_autotune_gemm_backends**

Test plan:
CI
Additional unit test in test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491
Approved by: https://github.com/jansel
ghstack dependencies: #121490
2024-03-22 10:58:43 +00:00
d34514f8db Renamed mutationlayout/aliasedlayout (#122474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474
Approved by: https://github.com/jansel
ghstack dependencies: #121624
2024-03-22 08:32:14 +00:00
eca30df846 Added load_args to repro (#121624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121624
Approved by: https://github.com/ezyang
2024-03-22 08:32:14 +00:00
783fd89ff1 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)
**Summary**
Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268
2024-03-22 08:17:57 +00:00
99f0fec7d0 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)
**Summary**
Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267
2024-03-22 08:15:28 +00:00
bb75313f0a [dynamo] Optimize handling of BINARY_OP (#122465)
This saves ~0.1s on https://dev-discuss.pytorch.org/t/a-torchdynamo-trace-time-ablation-study/1961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122465
Approved by: https://github.com/oulgen
2024-03-22 08:14:58 +00:00
2c6eeb26d3 [Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)
**Summary**
Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266
2024-03-22 08:12:23 +00:00
6bbd697306 [Inductor] Make codecache CUDA compilation more robust & flexible (#121490)
Minor changes which make the CUDA compilation within _inductor/codecache.py
more robust and flexible.

Test plan:
CI
Additional test in test_codecache.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490
Approved by: https://github.com/jansel
2024-03-22 08:12:11 +00:00
a337ee0a3a [Quant] Enable QConv2d with silu post op (#122266)
**Summary**
Enable QConv2d implementation with post op `silu`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_silu_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122266
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-03-22 07:58:45 +00:00
b78e8c0d37 remove duplicate method run_subtests (#122421)
Fixes #121654

I have removed the duplicate test `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421
Approved by: https://github.com/soulitzer
2024-03-22 07:00:49 +00:00
6ba85cfc2a Fixed memory leak in Python dispatcher w.r.t. THPDevice. (#122439)
Fixes the memory leak reported in #122417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122439
Approved by: https://github.com/soulitzer
2024-03-22 06:44:12 +00:00
3600778ede Do not create a new node if no normalization is needed (#122330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122330
Approved by: https://github.com/jansel
2024-03-22 05:51:28 +00:00
e2d89e9704 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR is pretty involved, so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor (see the sketch after this list).  The stable int id is needed so that
  when we do serialization, we know when tensors/storages alias each other and
  can ensure we preserve this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.
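
A minimal sketch of the two-level memo structure mentioned in the first bullet above (names here are illustrative, not the actual implementation):

```
from torch.utils.weak import WeakIdKeyDictionary

class MemoTablesSketch:
    def __init__(self):
        self.next_id = 0
        self.tensor_to_id = WeakIdKeyDictionary()  # real Tensor -> stable int id
        self.id_to_meta = {}                       # stable int id -> meta/fake Tensor

    def stable_id(self, t):
        # the stable id survives even if the meta/fake Tensor entry goes away
        if t not in self.tensor_to_id:
            self.tensor_to_id[t] = self.next_id
            self.next_id += 1
        return self.tensor_to_id[t]
```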

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
2024-03-22 03:56:34 +00:00
cyy
ecbe82b9ce Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-22 03:49:31 +00:00
ef0d470eb3 [vision hash update] update the pinned vision hash (#122453)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122453
Approved by: https://github.com/pytorchbot
2024-03-22 03:37:11 +00:00
fb57d1699b [export] Fix handling output in remove_effect_tokens_pass (#122357)
Added handling for updating the output_spec in the graph signature if the result of a with_effects call is an output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122357
Approved by: https://github.com/zhxchen17
2024-03-22 03:35:59 +00:00
09eb07bee8 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issue for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-22 03:31:04 +00:00
e419011471 [inductor] Add torch.while_loop support to JIT Inductor (#122069)
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).

AOT Inductor support will be added in a follow-up PR.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-22 02:45:27 +00:00
5e0440edb4 Revert "Optimize multi_tensor_apply (take 2) (#119764)"
This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0.

Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk 0b68a28c87.  Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))
2024-03-22 02:18:28 +00:00
470b44c048 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
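
A hedged usage sketch (the `layout` keyword is shown for illustration; the exact defaults may differ):

```
import torch

t = torch.randn(3, 5, 4)   # treated as a batch of 3 consistently-sized (5, 4) constituents
nt = torch.nested.as_nested_tensor(t, layout=torch.jagged)
# nt is a real view over t, so gradients computed through nt propagate back into t.
```
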
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #113279
2024-03-22 02:12:37 +00:00
cd6bfc7965 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This op is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-22 02:12:36 +00:00
bde22835c6 [PT2] - Guard oblivious on meta registrations (#122216)
Summary:
```
[trainer0|0]:Potential framework code culprit (scroll up for full backtrace):
[trainer0|0]:  File "/mnt/xarfuse/uid-539346/56d4bb3d-seed-nspid4026531836_cgpid183208940-ns-4026531840/torch/_meta_registrations.py", line 5043, in scatter_gather_dtype_check
[trainer0|0]:    if index.numel() != 0:
```
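
A hedged sketch of the size-oblivious pattern such fixes typically apply (the wrapper function here is hypothetical; the actual call site is the meta registration shown above):

```
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def dtype_check_sketch(index: torch.Tensor):
    # instead of the plain data-dependent `if index.numel() != 0:`, make the
    # branch size-oblivious so unbacked symbolic sizes don't force a guard
    if guard_size_oblivious(index.numel() != 0):
        pass  # ...dtype checks would go here...
```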

Test Plan: General CI.

Reviewed By: ezyang

Differential Revision: D54689183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122216
Approved by: https://github.com/ezyang
2024-03-22 01:36:03 +00:00
4f93b3d958 [Dort] Reduce excessive warning to info (#122442)
No need to warn when an op can be exported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122442
Approved by: https://github.com/thiagocrepaldi
2024-03-22 01:09:33 +00:00
a001b4b048 Inductor: Don't clamp views when the views come from split_with_sizes (#122149)
Summary:
Fixes #122126

`split_with_sizes` don't need clamping.

Test Plan: Added test + CI

Differential Revision: D55043320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122149
Approved by: https://github.com/ezyang
2024-03-22 00:55:36 +00:00
b1fa0ce4aa [export] build the infra to rollout predispatch export. (#122326)
Test Plan:
fbcode:caffe2/test/quantization:test_quantization
fbcode:bolt/nn/executorch/backends/tests:qnn_test
fbcode:on_device_ai/helios/compiler_tests/...
fbcode:pyspeech/tests:pyspeech_utils_test_oss
fbcode:caffe2/test:quantization_pt2e_qat
fbcode:on_device_ai/Assistant/Jarvis/tests:test_custom_ops
fbcode:modai/test:test_modai
fbcode:executorch/exir/backend/test:test_partitioner

Differential Revision: D55133846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122326
Approved by: https://github.com/tugsbayasgalan
2024-03-22 00:55:10 +00:00
4b535906aa Better handle test-config labels on PR (#122155)
I have some minor fixes in the scripts to

1. Fix the bug where the empty test matrix was confusingly print as unstable https://github.com/pytorch/pytorch/pull/121381#issuecomment-2004558588
1. Replace `print` with `logging.info`
1. Remove the hardcoded `VALID_TEST_CONFIG_LABELS` list.  It's out of date and not many people use this feature besides `test-config/default`, so why bother.  The behavior here is simpler now:
    1. If the PR has some `test-config/*` labels, they will be applied
    1. If the PR has none of them, all test configs are applied
1. Add log for the previous 2 cases to avoid confusion

### Testing

```
python filter_test_configs.py --workflow "Mac MPS" --job-name "macos-12-py3-arm64 / build" --event-name "push" --schedule "" --branch "" --tag "ciflow/mps/121381" \
  --pr-number 121065 \
  --test-matrix "{ include: [
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
  ]}
 ```

Also running on this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122155
Approved by: https://github.com/clee2000
2024-03-21 23:20:52 +00:00
bce640709c Revert "Precompile triton templates (#121998)"
This reverts commit b8df2f0ca530ebe01fa079c891c170a1f4b22823.

Reverted https://github.com/pytorch/pytorch/pull/121998 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is causing all ROCm trunk job to fail b8df2f0ca5 ([comment](https://github.com/pytorch/pytorch/pull/121998#issuecomment-2014003037))
2024-03-21 23:05:59 +00:00
c4486d3e88 Allow fake models to run with ONNXProgram.__call__ (#122230)
In order for a fake model to run through the ONNXProgram.__call__
interface, we need to save the model to disk along with its external data
before executing it. This is what this PR implements.

An alternative would be for ONNXProgram.__call__ to detect that the model
was exported with fake mode and explicitly raise an exception when
ONNXProgram.__call__ is executed. The exception message would instruct
the user to call ONNXProgram.save and manually execute the model using
the ONNX runtime of choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122230
Approved by: https://github.com/BowenBao
ghstack dependencies: #122196
2024-03-21 22:28:05 +00:00
4ba51bb2c4 Add keys used for templated attention impls (#122423)
# Summary

Mypy will complain that these attributes dont exist for this PR: https://github.com/pytorch/pytorch/pull/121845/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122423
Approved by: https://github.com/bdhirsh
2024-03-21 22:16:53 +00:00
224beecee6 Revert "Proper view support for jagged layout NestedTensor (#113279)"
This reverts commit 5855c490f09a028bfdfefea8b93c9833eb55dc5c.

Reverted https://github.com/pytorch/pytorch/pull/113279 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113279#issuecomment-2013899762))
2024-03-21 22:03:01 +00:00
12e7602cf9 Revert "Support for torch.nested.as_nested_tensor(t) (#113280)"
This reverts commit 17c9c7026521be1c194cae278b76ac8e8f7d145b.

Reverted https://github.com/pytorch/pytorch/pull/113280 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113280#issuecomment-2013893099))
2024-03-21 22:00:44 +00:00
816db3bd29 Revert "Public API for NJT construction from jagged components (#121518)"
This reverts commit d4dff9cf5e7b734a8621b571e8f5a761dc43e1e0.

Reverted https://github.com/pytorch/pytorch/pull/121518 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/121518#issuecomment-2013879641))
2024-03-21 21:56:29 +00:00
48afb5c325 [inductor] Use python constants in IndexPropagation (#122031)
In the next PR I have the IR `ops.neg(ops.constant(0.0, torch.float32))`
which should be folded to `ops.constant(-0.0, torch.float32)` but it seems that
`sympy.Float(-0.0)` doesn't respect the sign of the zero and so we instead
get a positive zero constant.

Here, I work around this by doing the constant folding with python arithmetic
which does respect signed zeros.
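
A tiny illustration of the signed-zero behaviour described above (assuming sympy behaves as stated):

```
import sympy

print(-1.0 * 0.0)          # -0.0: Python arithmetic keeps the signed zero
print(sympy.Float(-0.0))   # prints a positive zero; the sign is lost
```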

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122031
Approved by: https://github.com/lezcano
2024-03-21 21:53:22 +00:00
99055ae165 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
There also seems to be a bug during tracing where AOTI mutates the buffer; I didn't take the time to investigate, so I left it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78, https://github.com/bdhirsh
2024-03-21 21:51:32 +00:00
332456c44d triton_kernel_wrap shouldn't access FakeTensor.data_ptr (#122418)
The comment suggests that we need to replace all FakeTensors with real
tensors. `torch.empty` doesn't actually return a real Tensor because
FakeTensorMode is active!

We disable torch dispatch so that torch.empty actually returns a real Tensor.
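
One way to picture the effect (the exact mechanism used in the PR may differ; `_disable_current_modes` is just one way to temporarily drop the active fake mode):

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.utils._python_dispatch import _disable_current_modes

with FakeTensorMode():
    fake = torch.empty(4)            # a FakeTensor: no real storage behind it
    with _disable_current_modes():   # temporarily drop active dispatch modes
        real = torch.empty(4)        # now an ordinary, materialized Tensor
```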

The motivation for this PR is that we're trying to ban
FakeTensor.data_ptr (or at least warn on it) in torch.compile. See the
next PR up in the stack

Test Plan:
- Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122418
Approved by: https://github.com/oulgen
2024-03-21 21:48:07 +00:00
621fdc9db8 infer_schema can add alias annotations when passed a list of mutated args (#122343)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122343
Approved by: https://github.com/ezyang
ghstack dependencies: #122319, #122320
2024-03-21 21:39:07 +00:00
639d6201b4 Expand the types infer_schema can infer (#122320)
This PR allows it to infer:
- None return as ()
- List[Tensor] as Tensor[]

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122320
Approved by: https://github.com/ezyang, https://github.com/soulitzer
ghstack dependencies: #122319
2024-03-21 21:39:07 +00:00
0dd78f1828 Add standalone tests for infer_schema (#122319)
We're gonna reuse this helper in the new python custom ops API. Given a
function with type annotations, `infer_schema(fun)` returns an inferred
schema.
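
Roughly, for functions like the following (the schema strings in the comments are an approximation; the helper itself lives in an internal module):

```
from typing import List
from torch import Tensor

def mysin(x: Tensor) -> Tensor:
    ...

def split3(x: Tensor) -> List[Tensor]:
    ...

def inplace_op(x: Tensor) -> None:
    ...

# infer_schema(mysin)      -> roughly "(Tensor x) -> Tensor"
# infer_schema(split3)     -> roughly "(Tensor x) -> Tensor[]"
# infer_schema(inplace_op) -> roughly "(Tensor x) -> ()"
```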

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122319
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2024-03-21 21:39:04 +00:00
23524710e6 [dynamo] use proxies to nn.Module in dynamo generated GraphModules (#120756)
Fixes remaining refleaks found when debugging https://github.com/pytorch/pytorch/issues/119607, tests added in https://github.com/pytorch/pytorch/pull/120657.

Also fixes some tests that xfail: https://github.com/pytorch/pytorch/issues/120631 (not entirely sure why), but introduced tests now fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120756
Approved by: https://github.com/jansel
2024-03-21 21:23:12 +00:00
2cd0a5d516 [Inductor] Fix for WrapperCodeGen.statically_known_int_or_none (#121808)
There's obviously a small typo in WrapperCodeGen.statically_known_int_or_none,
where the return value of a call to V.graph._shape_env._maybe_evaluate_static
is being discarded.

This fix changes that to work how it was likely intended to.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121808
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/aakhundov
2024-03-21 21:15:32 +00:00
968c4c4154 Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 74deacbf31d032a2659dc1633dc3e5248921d466.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:17 +00:00
13afbcfc85 Revert "Support gpu trace on XPU (#121795)"
This reverts commit 91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2.

Reverted https://github.com/pytorch/pytorch/pull/121795 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:16 +00:00
182bb0f2ca Revert "Introduce XPU implementation for PyTorch ATen operators (#120891)"
This reverts commit 148a8de6397be6e4b4ca1508b03b82d117bfb03c.

Reverted https://github.com/pytorch/pytorch/pull/120891 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert it to resolve a conflict in trunk https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013434523.  Please help reland the change after ([comment](https://github.com/pytorch/pytorch/pull/120891#issuecomment-2013668563))
2024-03-21 20:30:20 +00:00
628dcde136 [AOTI] Disable stack allocation when there is a fallback op (#122367)
Summary: Stack allocation is disabled when there is an aten fallback op, see c84f81b395/torch/_inductor/codegen/cpp_wrapper_cpu.py (L974). But we need to do the same when there is a custom op fallback.

Test Plan: CI

Reviewed By: mikekgfb

Differential Revision: D55149369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122367
Approved by: https://github.com/mikekgfb
2024-03-21 20:02:33 +00:00
af9b71c82f fix typo in while_loop_test (#122416)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122416
Approved by: https://github.com/angelayi
2024-03-21 19:42:08 +00:00
d131cbc44f Fuse the input -> p2p buffer copy into one-shot all-reduce kernel when the input is small (#121213)
This improves the gpt-fast llama2 70B 8xH100 (non-standard) TP benchmark from 86 tok/s to 88 tok/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121213
Approved by: https://github.com/Chillee
2024-03-21 18:25:57 +00:00
765c3fc138 fix breaking changes for ONNX Runtime Training (#122000)
Fixes breaking changes for ONNX Runtime Training.

PR https://github.com/pytorch/pytorch/pull/121102 introduced incompatibility with ORT training because of change in parameter type. Creating a PR to add previous parameter types and verified that it works with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000
Approved by: https://github.com/malfet
2024-03-21 18:10:22 +00:00
c2651a7f0e Make check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372)
Partially fixes https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122372
Approved by: https://github.com/lezcano
ghstack dependencies: #122370
2024-03-21 17:14:42 +00:00
780f70b728 Make expected stride test in torch._prims_common size oblivious (#122370)
Partially addresses https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122370
Approved by: https://github.com/lezcano
2024-03-21 17:14:42 +00:00
25bf5f7e61 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit aa74a8b9e5b34eaa700a64064818adc7a12942ca.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to Sorry for revert your change one more time but the hard part is that it breaks lot of internal builds ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2013043364))
2024-03-21 17:07:17 +00:00
b8df2f0ca5 Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as :

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275, #121997
2024-03-21 17:04:53 +00:00
17175cdbc7 [Docs] Add extended debugging options for troubleshooting (#122028)
Fixes #120889

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-21 17:00:45 +00:00
c20bc18d59 [export] allow static constraints in dynamic_shapes (#121860)
This PR allows users to specify int values for dimensions in dynamic_shapes as well as None, for example:

```
class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        ...

foo = Foo()
inputs = (torch.randn(4, 6), torch.randn(5, 4), torch.randn(3, 3))

for dynamic_shapes in [
    None,
    ((4, 6), (5, 4), (3, 3)),
    ((None, 6), None, {0: 3, 1: 3}),
]:
    _ = export(foo, inputs, dynamic_shapes=dynamic_shapes)
```

All of the above should produce the same ExportedProgram.

This is done by temporarily creating a static dim constraint during analysis, where vr.lower == vr.upper. These constraints are then deleted during _process_constraints(), and do not show up in the final ExportedProgram's range_constraints.

Additionally, export() will also fail if the shapes are mis-specified, for example:
```
_ = export(foo, inputs, dynamic_shapes=((5, None), None, None))
```
leads to `torch._dynamo.exc.UserError: Static shape constraint of 5 does not match input size of 4, for L['x'].size()[0]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121860
Approved by: https://github.com/avikchaudhuri
2024-03-21 16:59:59 +00:00
16935de961 Support alias for NestedTensorCPU/CUDA (#117711)
Fixes #ISSUE_NUMBER

Co-authored-by: Vincent Moens <vmoens@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117711
Approved by: https://github.com/albanD
2024-03-21 16:05:52 +00:00
148a8de639 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issue for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-21 15:42:20 +00:00
204fd69ca6 Make ONNXProgram.model_proto and disk file the same (#122196)
Currently, the in-memory ONNX program model proto does
not contain the initializers that are saved into the on-disk version.

This PR changes this behavior so that both versions are
identical. This is important for running models with fake
tensors from ONNXProgram.model_proto directly, without a file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122196
Approved by: https://github.com/BowenBao
2024-03-21 15:29:31 +00:00
f9996ed764 [BE] Enable torch inductor tests running on MacOS (#122360)
The original idea was to limit the testing to just x86 Macs, but right now it will be skipped on all Apple Silicon ones, as all of them have a Metal-capable GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122360
Approved by: https://github.com/jansel
2024-03-21 14:47:05 +00:00
456b112dca [inductor] Support non-Tensor predicate in torch.cond (#122378)
Summary: Previously, we only supported a torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicates, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.
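
A small hedged example of a non-Tensor predicate (a SymBool derived from a shape comparison when shapes are dynamic, or a plain Python bool otherwise):

```
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x, n):
    # predicate is a bool/SymBool computed from a size, not a boolean Tensor
    return torch.cond(x.shape[0] > n, true_fn, false_fn, (x,))

print(f(torch.randn(5, 4), 3))
```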

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s

OK

$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-21 14:35:01 +00:00
0b68a28c87 Optimize multi_tensor_apply (take 2) (#119764)
### Take 2

The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.

### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.

### Benchmark (WIP)

The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to benchmark more problem sizes; the speedup should vary among them. **However, I believe this PR should not be slower than the previous impl on any problem size.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
2024-03-21 11:53:31 +00:00
0d8e960f74 Revert "[Sparsity] add support for H100 compute capability 9.x (#121768)"
This reverts commit 91fdaa1b416ab8ac8be30f3c3428751e236657cd.

Reverted https://github.com/pytorch/pytorch/pull/121768 on behalf of https://github.com/jeanschmidt due to Agreed on reverting and fixing rocm tests ([comment](https://github.com/pytorch/pytorch/pull/121768#issuecomment-2011893826))
2024-03-21 10:42:08 +00:00
cyy
7f8bb1de83 [Dynamo][2/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122362)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122362
Approved by: https://github.com/ezyang
2024-03-21 09:41:41 +00:00
ea1cd31b50 [c10d] Log the target of FR dump (#122345)
Summary: It would be useful to log the destination of the trace dump (either Manifold or a local file) so users can quickly locate the dump.

Test Plan: Modified unit tests

Differential Revision: D54972069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
2024-03-21 08:03:05 +00:00
365e89a591 Add tensor step to adadelta (#122252)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Adadelta step update while compiling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252
Approved by: https://github.com/janeyx99
2024-03-21 07:28:47 +00:00
7fa1be506b Add an option to sdpa benchmark to specify backend (#122368)
# Summary
Adds the ability to specify sdpa backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122368
Approved by: https://github.com/cpuhrsch
2024-03-21 07:00:40 +00:00
18c164ef7c [Inductor] Match insignficiant strides on outputs (#122239)
Fix for https://github.com/pytorch/pytorch/issues/116433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239
Approved by: https://github.com/Chillee
2024-03-21 05:35:59 +00:00
b915877deb Support numpy array in Tensor.__eq__ (#122249)
When the `other` arg of `Tensor.__eq__` is a numpy array, it is converted to a PyTorch tensor view of the numpy array, which is then given as the `other` arg to a `Tensor.eq` call
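
A minimal example of the behaviour:

```
import numpy as np
import torch

t = torch.tensor([1, 2, 3])
a = np.array([1, 0, 3])
# `a` is wrapped as a tensor view of the numpy array, then Tensor.eq is called
print(t == a)   # tensor([ True, False,  True])
```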

Fixes #119965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122249
Approved by: https://github.com/ezyang
2024-03-21 04:55:01 +00:00
bf18e967b4 [c10d] disable compute_duration by default (#122138)
Summary:
Computing durations incurs additional CUDA overhead and can possibly
increase GPU memory usage or cause hangs, so we want to disable it by default and enable it only
when needed, or at least only when timing is enabled.

Test Plan:
Test with existing unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
2024-03-21 04:45:37 +00:00
ea6f67853e [inductor fbcode] Add python include paths for Python.h (#122363)
Summary:
We're getting errors that Python.h is not found because we didn't have
the proper include path set up for it.

bypass-github-export-checks

Test Plan: I can only get this to show up in Bento: N5106134

Reviewed By: hl475, chenyang78

Differential Revision: D55133110

Co-authored-by: Bert Maher <bertrand@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363
Approved by: https://github.com/bertmaher
2024-03-21 04:32:17 +00:00
d4dff9cf5e Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113280
2024-03-21 04:14:17 +00:00
17c9c70265 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-03-21 04:13:55 +00:00
77bed8f7f2 [ONNX] model_type flag is only supported under SKIP_XFAIL_SUBTESTS (#122336)
Fixes #120918

To address the confusion that developers usually have about which list to put xfail and skip entries in, this PR provides guidance that `model_type`- and `matcher`-specified xfail/skip entries should go in `SKIP_XFAIL_SUBTESTS`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122336
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-21 04:10:32 +00:00
cc0cadaf4c [vision hash update] update the pinned vision hash (#122154)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122154
Approved by: https://github.com/pytorchbot
2024-03-21 03:59:12 +00:00
61f69c7fc4 [audio hash update] update the pinned audio hash (#122153)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122153
Approved by: https://github.com/pytorchbot
2024-03-21 03:53:24 +00:00
885fb9742d Handle special kwargs in user-written Triton kernel calls (#122280)
Summary: Special kwargs like `num_warps`, `num_stages`, and `num_ctas` can be passed to the Triton kernel call as kwargs. These kwargs are handled in a special way, not being passed to the underlying kernel function directly. In this PR, we move those special kwargs from `kwargs` of the `TritonKernelVariable` in dynamo to `Autotuner`'s `Config` instances (either already existing or newly created for this purpose). As a result, the special kwargs can be codegened correctly as a part of `Config`, not as direct arguments to the kernel `.run`.
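
For illustration, a user-written kernel call of the kind affected might look like this (a sketch, assuming a CUDA device; the kernel itself is hypothetical):

```
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

@torch.compile
def f(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 128),)
    # num_warps/num_stages are launch options, not kernel arguments; with this
    # change they are captured in a Config rather than passed to the kernel's .run
    add_kernel[grid](x, y, out, n, BLOCK=128, num_warps=4, num_stages=2)
    return out
```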

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_kwargs
...
----------------------------------------------------------------------
Ran 6 tests in 6.783s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122280
Approved by: https://github.com/oulgen
2024-03-21 03:34:07 +00:00
3e6fdea390 [ONNX] Fix list dtype finding bug in dispatcher (#122327)
Fixes #122166

Before this PR, the dispatcher assumed that the first input provides a reasonable dtype, but `aten::index` exposes cases with `None` at the front of the inputs. This PR addresses that by taking the dtype from the first non-`None` input.
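
A minimal sketch of the selection rule described above; the helper name is hypothetical and only illustrates the behavior, it is not the dispatcher's actual code:

```python
import torch

def first_non_none_dtype(inputs):
    # aten::index can pass None placeholders before the real index tensors,
    # so skip leading Nones instead of assuming inputs[0] carries the dtype.
    for inp in inputs:
        if inp is not None and hasattr(inp, "dtype"):
            return inp.dtype
    return None

print(first_non_none_dtype([None, torch.tensor([1]), torch.tensor([2.0])]))  # torch.int64
```
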
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122327
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-03-21 02:54:58 +00:00
ae913175c3 Fix GraphModuleDeserializer (#122342)
Summary: self.constants is used in self.deserialize_signature()

Test Plan: CI

Differential Revision: D55152971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122342
Approved by: https://github.com/zhxchen17
2024-03-21 02:27:39 +00:00
e9dcda5cba Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
2024-03-21 01:57:08 +00:00
91ead3eae4 Support gpu trace on XPU (#121795)
# Motivation
Support GPU trace on the XPU backend by adding GPU trace to the XPU runtime. This will be beneficial for generalizing the device caching allocator in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #121794
2024-03-21 01:56:42 +00:00
74deacbf31 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared among device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-21 01:52:58 +00:00
57734202c6 [HSTU][TGIF] Provide a API to check whether running in torch_dispatch mode (#122339)
Summary: We provide an `is_in_torch_dispatch_mode` API that returns a `bool` indicating whether the program is running in torch dispatch mode.
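
A minimal sketch of how the new query could be used, assuming it is exposed from torch.utils._python_dispatch as the related diffs suggest:

```python
from torch.utils._python_dispatch import TorchDispatchMode, is_in_torch_dispatch_mode

class NoopMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

print(is_in_torch_dispatch_mode())      # False outside any mode
with NoopMode():
    print(is_in_torch_dispatch_mode())  # True inside the mode
```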

Test Plan:
- OSS CI
- Tested by publishing hstu models with this diff and the following diffs D54964288, D54964702, D54969677, D55025489; runtime errors are no longer raised during publish

Differential Revision: D55091453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122339
Approved by: https://github.com/jiayisuse
2024-03-21 01:37:23 +00:00
e38d60bc07 Remove some stale xla dynamo backend (#122128)
`torchxla_trace_once ` and `aot_torchxla_trivial ` should be removed.

In our internal torchbench daily runs (hopefully the dashboard can be open-sourced soon), the `openxla` backend has a much higher passing rate and similar performance to `openxla_eval` (the non-AOTAutograd backend). We still use `openxla_eval` in the llama2 example, but I think we should move users to the `openxla` backend going forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122128
Approved by: https://github.com/alanwaketan, https://github.com/jansel
2024-03-21 01:13:50 +00:00
c20cf97366 Move some cudagraphs checks into C++ (#122251)
Based off of https://github.com/pytorch/pytorch/pull/111094
This + cpp guards improves TIMM geomean optimizer performance by about 20%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122251
Approved by: https://github.com/eellison
2024-03-21 01:02:23 +00:00
be5863de39 Remove usage of deprecated volatile (#122231)
Summary:
When building our iOS app, we get a compile error about the deprecated `volatile` keyword.

This diff attempts to fix it by replacing the usage of the deprecated `volatile` keyword with `atomic` as suggested by malfet

Test Plan: Successfully built the iOS app that previously had a compile error

Differential Revision: D55090518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122231
Approved by: https://github.com/malfet
2024-03-21 00:55:16 +00:00
1686e2d1e4 [symbolic shapes][compile-time] Minor compile time optimization in has_free_symbols (#122144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122144
Approved by: https://github.com/lezcano
ghstack dependencies: #120726
2024-03-21 00:48:57 +00:00
cyy
c2eedb7f8a [Dynamo][1/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122259)
This PR begins a series of works to ensure dynamo C++ code is clang-tidy clean.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122259
Approved by: https://github.com/ezyang
2024-03-21 00:43:25 +00:00
c80601f35a Revert "Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)"
This reverts commit a2a88f39ee991f471f2a2c54571886d70f5cd2e6.

Reverted https://github.com/pytorch/pytorch/pull/121537 on behalf of https://github.com/kurtamohler due to flaky CI failures ([comment](https://github.com/pytorch/pytorch/pull/121537#issuecomment-2010937226))
2024-03-21 00:03:30 +00:00
eqy
d5b5012dc4 [CUDA] Raise softmax_forward_64bit_indexing GPU memory requirement (#116075)
printing `torch.cuda.memory_summary()` shows ~41GiB reserved at the end of this test, not sure how it was passing previously on CUDA.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116075
Approved by: https://github.com/ptrblck, https://github.com/malfet
2024-03-21 00:03:17 +00:00
5855c490f0 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This op is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
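
A small sketch of the "true view" semantics described above, assuming the public jagged-layout constructors; mutating the values buffer is visible through the nested tensor:

```python
import torch

ts = [torch.randn(2, 5), torch.randn(3, 5)]
nt = torch.nested.as_nested_tensor(ts, layout=torch.jagged)

vals = nt.values()   # a view of the underlying values buffer, not a copy
vals.zero_()         # mutating the buffer is reflected in the nested tensor
print(nt.unbind()[0].abs().sum())   # tensor(0.)
```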

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-20 23:45:34 +00:00
057892f4be [CPU] optimize Lp norm for 1-dimensional vector (#122143)
Fixes https://github.com/pytorch/pytorch/issues/120229

- Optimize vector norm by simplifying vector norm formula for 1-dimensional vector.
- Vector norm formula for 1-dimensional vector simplifies to `abs(x)`. See below for proof.
- Next step, we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrix.
- Additionally, this avoids overflow in the power `abs(x) ** p` for large `p` or `x` for 1-dimensional vectors.

### Performance
Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for
`torch.norm(torch.randn(2**18, 1), ord, -1)`
`torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)`
Tested on 28 physical cores/socket, 1 socket on Skylake.

|                          	|                 	|         	|         	| **Avg Latency (ms)**  	|                       	|                                        	|
|--------------------------	|-----------------	|---------	|---------	|-----------------------	|-----------------------	|----------------------------------------	|
| **op**                   	| **input shape** 	| **dim** 	| **ord** 	| **baseline (master)** 	| **optimized (7102f1ef372b248414d36cbd0c51a546b6b6a41a)** 	| **speedup ratio (baseline/optimized)** 	|
| torch.norm               	| (2**18, 1)      	| -1      	| fro     	| 34.3755531            	| 0.0125408             	| 2741.094                               	|
|                          	|                 	|         	| inf     	| 34.0952635            	| 0.0122237             	| 2789.271                               	|
|                          	|                 	|         	| -inf    	| 34.3674493            	| 0.0120759             	| 2845.953                               	|
|                          	|                 	|         	| 0       	| 34.1004515            	| 0.0175261             	| 1945.69                                	|
|                          	|                 	|         	| 1       	| 34.1688442            	| 0.0121593             	| 2810.089                               	|
|                          	|                 	|         	| -1      	| 33.949492             	| 0.0120282             	| 2822.487                               	|
|                          	|                 	|         	| 2       	| 34.3669581            	| 0.0120401             	| 2854.366                               	|
|                          	|                 	|         	| -2      	| 33.9252067            	| 0.0121069             	| 2802.139                               	|
|                          	|                 	|         	|         	|                       	|                       	|                                        	|
| torch.linalg.vector_norm 	| (2**18, 1)      	| -1      	| inf     	| 34.090879             	| 0.0095105             	| 3584.545                               	|
|                          	|                 	|         	| -inf    	| 34.3708754            	| 0.0099111             	| 3467.931                               	|
|                          	|                 	|         	| 0       	| 34.0880775            	| 0.0141716             	| 2405.38                                	|
|                          	|                 	|         	| 1       	| 34.1392851            	| 0.0093174             	| 3664.036                               	|
|                          	|                 	|         	| -1      	| 33.925395             	| 0.0092483             	| 3668.302                               	|
|                          	|                 	|         	| 2       	| 34.3854165            	| 0.0092459             	| 3719.002                               	|
|                          	|                 	|         	| -2      	| 33.932972             	| 0.0093007             	| 3648.429                               	|

### Proof
<details>
<summary>For those interested :)</summary>

<img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b">
<img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322">

</details>
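
A compact text restatement of the proof above, for readers who cannot view the images (single-element vector, finite nonzero `p`; the infinity norms reduce the same way):

```latex
\|(x)\|_p = \bigl(|x|^p\bigr)^{1/p} = |x| \quad (p \neq 0,\ p \text{ finite}), \qquad
\|(x)\|_{\infty} = \max\{|x|\} = |x|, \qquad
\|(x)\|_{-\infty} = \min\{|x|\} = |x|.
```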

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143
Approved by: https://github.com/lezcano
2024-03-20 23:20:25 +00:00
aa74a8b9e5 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as `[]` and `/`.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade submodule sleef lib, which fixes a build issue on Windows.
6. Fix bazel build issues.
7. Fix the test app not linking to sleef on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-20 22:41:13 +00:00
666d6291af Cast checkpoint weights to match model parameter's dtype (#122100)
Fixes #121986
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122100
Approved by: https://github.com/BowenBao
2024-03-20 22:01:40 +00:00
2289fa5f5a [while_loop] fix mode not on stack error (#122323)
Fixes https://github.com/pytorch/pytorch/issues/121453.

This is caused by missing  `with mode` in FakeTensor mode.

Test Plan:
add new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122323
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #122244
2024-03-20 21:17:33 +00:00
512251c8f3 Use tree_map to get device ids and device types for activation checkpointing (#121462)
`get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs silently produces incorrect results: `get_device_states` returns an empty result, so no RNG state is saved here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188 since `fwd_device_states` is empty.

Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests. See the sketch below.
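
An illustrative sketch (the helper name is hypothetical, not the actual checkpoint.py code) of collecting devices from arbitrarily nested inputs with tree_map:

```python
import torch
from torch.utils._pytree import tree_map

def collect_devices(*args, **kwargs):
    devices = []

    def visit(x):
        if isinstance(x, torch.Tensor) and x.device.type != "cpu":
            devices.append(x.device)
        return x

    tree_map(visit, (args, kwargs))   # traverses nested lists/dicts, unlike a flat scan
    return devices

print(collect_devices({"inputs": [torch.randn(2)]}))   # [] on CPU; device list otherwise
```
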
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462
Approved by: https://github.com/soulitzer
2024-03-20 21:09:21 +00:00
cyy
1dd1899fd6 Add missing throw of std::runtime_error in dynamo/guards.cpp (#122306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122306
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 20:50:01 +00:00
d2a8d3864c [PT2][Inductor] Change the log for the group batch fusion (#122245)
Summary: Instead of logging with the generic "batch_fusion" and "group_fusion" labels, we log with the specific pass name, which better summarizes how often each pattern hits and makes debugging easier.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```

Differential Revision: D55103303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
2024-03-20 20:45:37 +00:00
61ff41f0ca [while_loop] disable closure capturing and manually set the inputs. (#122244)
For the while_loop operator, it's important to keep the output ordering consistent with the input ordering. Previously, we were using set_graph_inputs="automatic", which doesn't respect that ordering. This PR changes it to "manual" and respects the original user inputs' ordering. We disable closures for the body and cond functions as they require some additional design; this PR is just to stop the bleeding.

 Repro:
```python
import torch
from torch._higher_order_ops.while_loop import while_loop
from torch._functorch.aot_autograd import aot_export_module

class Nested(torch.nn.Module):
    def forward(self, ci, cj, a, b):
        def cond_fn(i1, j1, x1, y1):
            return i1 > 0
        def body_fn(i1, j1, x1, y1):
            def cond_fn_nested(i2, j2, x2, y2):
                return j2 > 0
            def body_fn_nested(i2, j2, x2, y2):
                return i2.clone(), j2 - 1, x2 + 3.14, y2 - 2.71
            i1, j1, x1, y1 = while_loop(
                cond_fn_nested, body_fn_nested, [i1, j1, x1, y1]
            )
            return i1 - 1, j1.clone(), x1 * 2, y1 / 2
        return while_loop(cond_fn, body_fn, (ci, cj, a, b))

nested = Nested()
torch.compile(nested, backend="eager", fullgraph=True)(torch.tensor(2), torch.tensor(2), torch.randn(2, 2), torch.randn(2, 2))
```

Test plan:
add new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122244
Approved by: https://github.com/aakhundov
2024-03-20 20:14:35 +00:00
2f6e8e84c5 Fix _chunk_cat.out issue (#122076)
# PR
Vectors allocated inside `get_chunk_cat_metadata()` go out of scope before they are used in `_chunk_cat_out_cuda_contiguous()`. This PR fixes the issue by returning the vectors from `get_chunk_cat_metadata()`.
This PR also adds a few unit tests to cover more edge cases.

# Tests
This PR was tested with the following commands and no errors appear, so the flaky test failure should be resolved.

- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32`
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32 --repeat 1500`

Fixes #122026
Fixes #121950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122076
Approved by: https://github.com/yifuwang
2024-03-20 20:01:38 +00:00
c84f81b395 [export] add pass to remove auto functionalized hop (#122246)
Summary: Adds a pass that blindly removes the auto-functionalized HOP without checking whether doing so is safe. Useful for ExecuTorch today and other use cases that have additional logic to reason about when this pass is safe to use.

Test Plan: added unit test

Differential Revision: D55103867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246
Approved by: https://github.com/angelayi
2024-03-20 19:31:52 +00:00
d813474363 [Pytorch] auto format _python_dispatch file (#122226)
Summary: Auto format the _python_dispatch file, to make D55091453 easier to review

Test Plan: `arc lint`

Differential Revision: D55091454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122226
Approved by: https://github.com/aakhundov
2024-03-20 19:28:39 +00:00
821ad56ea6 [CI] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes to enable ARC to run the build for linux-jammy-py3.8-gcc11.

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

The old-style `runs-on` definition accepts a string; the new-style `runs-on` requires an object in the following format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group, the format of the YAML needs to be changed. Unfortunately, there is no way I am aware of to accomplish this with any trick in the book, because GitHub Actions YAML is not templatable and supports only minimal functions/replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-20 19:06:10 +00:00
91fdaa1b41 [Sparsity] add support for H100 compute capability 9.x (#121768)
Summary: as title

Test Plan: buck test mode/opt //caffe2/test/...

Differential Revision: D54792168

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121768
Approved by: https://github.com/SherlockNoMad
2024-03-20 19:00:54 +00:00
d1e8b97387 [export] Log module hierarchy. (#121970)
Summary:
We can also log the module hierarchy in the following format:
```
:ToplevelModule
sparse:SparshArch
dense:DenseArch
```
So that we can have more information recorded about the model's identity.

Test Plan: CI

Differential Revision: D54921097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121970
Approved by: https://github.com/angelayi
2024-03-20 18:59:42 +00:00
0696db8202 Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit 17489784b635187316c6c856c5fe6b6a28d8a15a.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))
2024-03-20 18:34:43 +00:00
1d13c82559 Precompile in background (#121997)
Precompile benchmarking choices in parallel, and then wait on those choices prior to benchmarking. In the case of deferred templates, we wait only on the choices needed in the scheduler, allowing multiple separate lowerings to compile in parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121997
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275
2024-03-20 18:34:12 +00:00
65eb22158e Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit afc4c9382ff8b55da848ef40b4a17a92fb3d2ad6.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/huydhn due to Broke dynamo tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2010276712))
2024-03-20 18:04:53 +00:00
072935917b Update cuda_to_hip_mappings.py (#122110)
Added one datatype mapping (cuda_bf16.h), and a number of cub/hipcub mappings. Note: the missing mappings were discovered when hipifying the Mamba model's (https://github.com/state-spaces/mamba) forward kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122110
Approved by: https://github.com/jithunnair-amd, https://github.com/Skylion007
2024-03-20 17:17:53 +00:00
334f7e43f9 [TD] Remove credentials requirement for retrieval (#122279)
Made the bucket readable by public
https://s3.console.aws.amazon.com/s3/buckets/target-determinator-assets?region=us-east-1&bucketType=general&tab=permissions

The only jobs that matter here are the retrieval and td jobs, which were both successful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122279
Approved by: https://github.com/huydhn
2024-03-20 15:55:46 +00:00
2e02e1efad Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.

Fixes https://github.com/pytorch/pytorch/issues/122127
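
A repro-style sketch of the scenario, assuming the usual dynamo config flag for data-dependent output shapes:

```python
import torch
import torch._dynamo.config as dynamo_config

dynamo_config.capture_dynamic_output_shape_ops = True   # let nonzero produce an unbacked size

@torch.compile(fullgraph=True)
def f(x):
    return torch.nonzero(x).float().sum()

with torch.inference_mode():                             # fake tensors have no _version here
    print(f(torch.tensor([0.0, 1.0, 0.0, 2.0])))
```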

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-20 14:44:55 +00:00
15a8185cd3 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit 2b060983809e5fe8706acd085fff67b6a27bfb5f.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/zou3519 due to This caused build failures for 2+ pytorch devs, so we're reverting it to be safe ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2009661069))
2024-03-20 14:10:12 +00:00
06db0a9f78 Revert "Upgrade submodule sleef to fix build warning (#122168)"
This reverts commit eec8b252b70b2489aee7281d336eb9c32dd85483.

Reverted https://github.com/pytorch/pytorch/pull/122168 on behalf of https://github.com/zou3519 due to trying to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/122168#issuecomment-2009653474))
2024-03-20 14:05:58 +00:00
8a94005d46 [dynamo][runtime_asserts] Ignore failures on sorting sympy relations (#122205)
Differential Revision: [D55075500](https://our.internmc.facebook.com/intern/diff/D55075500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122205
Approved by: https://github.com/ezyang
2024-03-20 13:25:37 +00:00
afc4c9382f Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-20 13:09:19 +00:00
17489784b6 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
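
A minimal sketch of what becomes traceable with this change (eager backend; assumes the support described above is in place):

```python
import torch

def f(x):
    return x.sin().sum()

@torch.compile(backend="eager", fullgraph=True)
def g(x, t):
    # forward-mode JVP of f at x along tangent t, now captured by dynamo
    return torch.func.jvp(f, (x,), (t,))

x, t = torch.randn(3), torch.randn(3)
out, tangent_out = g(x, t)
print(out, tangent_out)
```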

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-20 13:09:19 +00:00
eb1d6ed9f9 [Inductor] fix addmm fusion check (#121953)
Fixes #121253.

To avoid functional issues, disable the pattern match for `addmm` when `beta` is not 1 or 0, or when `alpha != 1`, since neither `mkl_linear` nor `mkldnn_linear` accepts `beta` or `alpha` as parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121953
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-20 09:22:51 +00:00
ee6ce31b1d [BE][fix] fix test_tp_random_state and add it to periodic test list (#122248)
Fixes #122184. Add the test to the periodic test list so that we can catch this error in CI in the future.

**Test**:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122248
Approved by: https://github.com/wanchaol
2024-03-20 08:24:14 +00:00
a1d02b423c XFAIL detectron2_maskrcnn_r_101_c4 CPU inductor accuracy (#122263)
This starts to fail in trunk after the stack https://github.com/pytorch/pytorch/pull/122066 lands

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122263
Approved by: https://github.com/jansel
2024-03-20 08:03:34 +00:00
477d154ffd [dynamo] Add missing _nonvar_fields annotations (#122219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122219
Approved by: https://github.com/anijain2305
ghstack dependencies: #122218
2024-03-20 07:53:18 +00:00
46bf37b3f7 [dynamo] Replace VariableTracker.apply with visit/realize_all (#122218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122218
Approved by: https://github.com/anijain2305
2024-03-20 07:53:18 +00:00
a0db2e4237 [dynamo] Fixed handling of ImportError (#122222)
Fixes #122088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122222
Approved by: https://github.com/anijain2305
2024-03-20 07:52:01 +00:00
7832efb242 [export] skip nn_module_stack verifier for non-fx.GraphModule modules (#122210)
Downstream users of torch.export may have different module classes (e.g. LoweredBackendModule), which cannot be checked for metadata in the same way. Add lines to skip this for non-fx.GraphModule modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122210
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-20 07:40:48 +00:00
7d2b2dec4b [Pytoch][Vulkan] Register run_conv1d_context (#122172)
Summary: We have rewritten `conv1d` as `create_conv1d_context` and `run_conv1d_context` to enable prepack of `weight` and `bias`. We have registered `create_conv1d_context` but not `run_conv1d_context`. We add the registration in this diff.

Test Plan:
```
[luwei@devbig439.ftw3 /data/users/luwei/fbsource (f89a7de33)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Using additional configuration options from /home/luwei/.buckconfig.d/experiments_from_buck_start
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

Building: finished in 0.1 sec (100%) 394/394 jobs, 0/394 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (208 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (81 ms)
[----------] 2 tests from VulkanAPITest (289 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (289 ms total)
[  PASSED  ] 2 tests.
```

full test result
```
...
[----------] 427 tests from VulkanAPITest (22583 ms total)

[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (22583 ms total)
[  PASSED  ] 426 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 11 DISABLED TESTS
```

Differential Revision: D55052816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122172
Approved by: https://github.com/nathanaelsee
2024-03-20 07:36:23 +00:00
e7141d117f [IntraNodeComm] refactor rendezvous into a separate method for better code organization and error handling (#120968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120968
Approved by: https://github.com/wanchaol
2024-03-20 06:54:25 +00:00
cyy
9f572b99a6 [Clang-tidy header][29/N] Enable clang-tidy warnings in aten/src/ATen/core/*.h (#122190)
This PR enables clang-tidy in `aten/src/ATen/core/*.h`, which ends the series of patches beginning from #122015.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122190
Approved by: https://github.com/Skylion007
2024-03-20 06:17:37 +00:00
11e64b4ba8 [dtensor] aten.cat to use stack strategy approach (#122209)
This PR switch aten.cat to use the strategy approach that is similar to
aten.stack, as these two ops share similar semantics

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122209
Approved by: https://github.com/wz337
2024-03-20 04:19:25 +00:00
5b7ceab650 Support auto_functionalize in pre-dispatch (#122177)
Summary: Title

Test Plan: CI

Differential Revision: D55042061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122177
Approved by: https://github.com/zou3519
2024-03-20 04:17:58 +00:00
dc89d8b74a Fix broken lint after #116876 (#122253)
Trivial fixes, so let's do this instead of reverting the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122253
Approved by: https://github.com/clee2000
2024-03-20 04:09:00 +00:00
de950039fc Use .get in xml parsing (#122103)
Check that the `classname` attribute actually exists.
#122017
I expect this route to happen very rarely

At a certain point, we should just remove this parsing altogether since everything uses pytest now...
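
A minimal illustration of the defensive `.get` pattern (element and attribute names mirror typical JUnit XML, not the CI script itself):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<testsuite><testcase name="t1"/></testsuite>')
for case in root.iter("testcase"):
    # .get returns None when the attribute is missing instead of raising KeyError
    print(case.attrib.get("classname"))
```
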
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122103
Approved by: https://github.com/huydhn
2024-03-20 04:07:49 +00:00
6662627c89 Add APIs for custom device using TensorIteratorBase. (#120792)
1) add operand and get_dim_names API;
2) set will_resize to true when output tensor is undefined;
3) add abs_stub for dummy device and calculate on cpu device;
4) support dummy device copy with stride;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120792
Approved by: https://github.com/ezyang
2024-03-20 03:51:09 +00:00
f8565c4a28 [sigmoid] Clean up serialization API. (#122102)
Summary: Entirely remove the old serializer code to avoid further confusion and code bloat.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D54857118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122102
Approved by: https://github.com/tugsbayasgalan
2024-03-20 03:45:36 +00:00
1f8177dedf [Inductor][CPU] fix flash attention last_stride!=1 issue (#122083)
Fixes #121174.

Conv converts the input of SDPA to channels-last, resulting in an accuracy issue. Ensure the correct layout in lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122083
Approved by: https://github.com/eellison, https://github.com/jgong5
2024-03-20 02:22:33 +00:00
cyy
55310e58a9 Use constexpr for index variables (#122178)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122178
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 02:20:17 +00:00
eec8b252b7 Upgrade submodule sleef to fix build warning (#122168)
Subsequent PR to https://github.com/pytorch/pytorch/pull/118980, fix sleef build warning.

submodule sleef, include this sleef PR: https://github.com/shibatch/sleef/pull/514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122168
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-03-20 02:14:56 +00:00
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:

- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.

In this PR we defer selecting either the Triton template or the extern kernel until we have benchmarking results for the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton template + epilogue is faster than the unfused choice, we finalize the MultiTemplateBuffer as that specific template. If no fusion occurs, we finalize the MultiTemplateBuffer after the fusion pass completes.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time.

Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
e5e0685f61 Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)"
This reverts commit 88ebdbc97c103271766203df6662240e95a09b42.

Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the distributed failure looks legit as it is also failing in trunk 88ebdbc97c ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2008483316))
2024-03-20 01:12:24 +00:00
19d6004b97 add int8 woq mm pattern matcher (#120985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120985
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/eellison
2024-03-20 00:23:41 +00:00
6fefc52a2b Set py3.x build-environment name consistently (#122247)
https://github.com/pytorch/pytorch/pull/122157 checks for the Python version using `"$BUILD_ENVIRONMENT" != *py3.8*`, but some build environments use a different style with `py3_8` instead, causing numpy 2.x to be installed there incorrectly, e.g. 03b987fe3f
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122247
Approved by: https://github.com/malfet
2024-03-19 23:56:38 +00:00
6c659bbc36 [codemod][lowrisk] Remove unused exception parameter from caffe2/c10/mobile/CPUCachingAllocator.cpp (#116875)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: kimishpatel, palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116875
Approved by: https://github.com/Skylion007
2024-03-19 23:52:09 +00:00
6b95dc8884 [codemod][lowrisk] Remove unused exception parameter from caffe2/torch/csrc/jit/frontend/lexer.cpp (#116876)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116876
Approved by: https://github.com/Skylion007
2024-03-19 23:51:26 +00:00
d0153ca755 use make_storage_impl to create storages for COWStorage. (#121896)
Thanks to https://github.com/pytorch/pytorch/pull/118459, `make_storage_impl` will use the function registered for other backends to create a StorageImpl.

`make_storage_impl` completely covers `make_intrusive<StorageImpl>`, so it makes sense to replace `make_intrusive<StorageImpl>` with `make_storage_impl` for creating storage in COW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121896
Approved by: https://github.com/ezyang
2024-03-19 23:40:15 +00:00
4aaf25bc38 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
The cast_outputs function is only used for the CPU device, and it is already called in the cpu_xxx_vec helpers, like cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-19 23:37:06 +00:00
2980779d0b [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/tt_pad_op.h (#120177)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120177
Approved by: https://github.com/Skylion007
2024-03-19 23:36:52 +00:00
2239b55cd1 Add some more sanity asserts to checkPoolLiveAllocations (#122223)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122223
Approved by: https://github.com/eellison
2024-03-19 23:26:19 +00:00
139647d317 Fix #83241: torch.nn.TripletMarginLoss allowed margin less or equal to 0 (#121978)
Documentation states that the margin parameter of torch.nn.TripletMarginLoss must be greater than 0; however, any value was being accepted. Also fixed torch.nn.TripletMarginWithDistanceLoss, which had the same problem, and added error test inputs for the new ValueError.
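
A short sketch of the post-fix behavior (exact error messages may differ):

```python
import torch

try:
    torch.nn.TripletMarginLoss(margin=0.0)               # now rejected: margin must be > 0
except ValueError as e:
    print("TripletMarginLoss:", e)

try:
    torch.nn.TripletMarginWithDistanceLoss(margin=-1.0)  # same validation applies here
except ValueError as e:
    print("TripletMarginWithDistanceLoss:", e)
```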

Fixes #83241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121978
Approved by: https://github.com/mikaylagawarecki
2024-03-19 23:19:11 +00:00
a843bbdb21 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/graphmatcher.cc (#118116)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: malfet, dmm-fb

Differential Revision: D52981072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118116
Approved by: https://github.com/Skylion007
2024-03-19 22:45:43 +00:00
f05af9e377 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/ast.h (#120176)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120176
Approved by: https://github.com/Skylion007
2024-03-19 22:44:51 +00:00
03b987fe3f [CI] Test that NumPy-2.X builds are backward compatible with 1.X (#122157)
By compiling PyTorch against 2.x RC, but running all the tests with Numpy-1.X

This has no effect on binary builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122157
Approved by: https://github.com/atalman
2024-03-19 22:40:26 +00:00
f8becb626f [codemod] Remove unused variables in caffe2/caffe2/contrib/fakelowp/spatial_batch_norm_fp16_fake_op.h (#120178)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120178
Approved by: https://github.com/Skylion007
2024-03-19 22:36:38 +00:00
94eb940a02 [codemod] Remove unused variables in caffe2/caffe2/operators/softmax_op_cudnn.cc (#121995)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D54931224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995
Approved by: https://github.com/Skylion007
2024-03-19 22:35:58 +00:00
a6aa3afa77 [codemod] Remove unused variables in caffe2/caffe2/video/video_decoder.cc (#122151)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D54378401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122151
Approved by: https://github.com/Skylion007
2024-03-19 22:34:17 +00:00
a80c60ad8f [codemod] Remove unused variables in caffe2/caffe2/operators/conv_op_cudnn.cc (#122161)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122161
Approved by: https://github.com/Skylion007
2024-03-19 22:33:19 +00:00
02f436da6d [codemod][bugfix] Fix addressing bug in caffe2/caffe2/video/video_input_op.h (#121856)
Summary:
# Diff Specific

The signature of `copyFrom` is
```
void Tensor::CopyFrom(const Tensor& src, bool async) {
```
so the `&context` always evaluated to true.

I could dig around to see if anyone cares about what the flag should actually be, but this is old code in caffe2, so I've just used `true` and we'll keep using whatever behaviour we've been using since 2019 or so when this was written.

# General

A bug in this code was identified by `-Waddress`, which we are working to enable globally.

This diff fixes the bug. There are a few types of fixes it might employ:

The bug could be `const_char_array == "hello"` which compares two addresses and therefore is almost always false. This is fixed with `const_char_array == std::string_view("hello")` because `string_view` has an `==` operator that makes an appropriate comparison.

The bug could be `if(name_of_func)` which always returns true because the function always has an address. Likely you meant to call the function here!

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121856
Approved by: https://github.com/Skylion007
2024-03-19 22:28:06 +00:00
1c4887d52b fix dlrm accuracy test in max-autotune (#122012)
torchrec_dlrm training fails the accuracy check when max-autotune is enabled.

I found there is no real issue in PT2; we simply fail to get fp64 reference results for the accuracy check. In max-autotune mode, numerics may change a bit and cause the cosine-similarity check to fail. Using an fp64 baseline is more reliable and makes the test pass.

The reason we were not using an fp64 baseline earlier is that torchrec uses a dataclass [Batch](99e6e669b5/torchrec/datasets/utils.py (L28)) to represent the input. We use pytree to cast the model and inputs to fp64, but pytree cannot look into a dataclass. My fix is to convert the dataclass to a namedtuple to be more pytree-friendly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-19 22:23:42 +00:00
c71554b944 Revert "[aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)"
This reverts commit 206da97b8b61f51041f67de68e68e9a1875589ab.

Reverted https://github.com/pytorch/pytorch/pull/122052 on behalf of https://github.com/huydhn due to Although this look fixed on OSS, it is still failing internally.  I have added the reproducible buck command in the diff D55046262 ([comment](https://github.com/pytorch/pytorch/pull/122052#issuecomment-2008253185))
2024-03-19 22:22:12 +00:00
7678be4667 Replace numel with sym_numel in is_int_or_symint (#122145)
Fixes https://github.com/pytorch/pytorch/issues/122124

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122145
Approved by: https://github.com/Skylion007
2024-03-19 21:58:43 +00:00
6915a5be70 Increase numel limit to 2^63 for replicatepad1d (#122199)
Summary: As title

Test Plan:
```
CUDA_VISIBLE_DEVICES=5 buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_replicatepad_64bit_indexing
```

Also benchmarked in N5106027
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
11.058772478103638 18.912256770000006 735.4118906278957
# after changes
10.621162576675415 18.58972748 765.7121070725207
```

Differential Revision: D55030372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122199
Approved by: https://github.com/ezyang
2024-03-19 21:55:34 +00:00
b12d297b44 [AARCH64] Hide FP16 scalar arithmetic behind proper feature flag (#122204)
On Apple Silicon:
```
% sysctl machdep.cpu.brand_string; clang -dM -E - < /dev/null|grep __ARM_FEATURE_FP16
machdep.cpu.brand_string: Apple M1
#define __ARM_FEATURE_FP16_FML 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
On Graviton2 with respective `-march` flag:
```
# ./cpuinfo/build/cpu-info |grep Microarch -A1; gcc -dM -E - -march=armv8.2-a+fp16 </dev/null | grep __ARM_FEATURE_FP16
Microarchitectures:
	8x Neoverse N1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
Test Plan: CI

Reviewed By: dimitribouche

Differential Revision: D55033347

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122204
Approved by: https://github.com/huydhn
2024-03-19 21:18:09 +00:00
901ba2be86 [quant][pt2e] Add support for conv transpose + bn + {relu} weights fusion in PTQ (#122046)
Summary:

also added some utils in xnnpack_quantizer_utils.py
* annotate_conv_transpose_bn_relu and annotate_conv_transpose_bn -> this is for QAT
* annotate_conv_transpose_relu

conv_transpose + bn weight fusion is performed automatically and cannot currently be disabled;
we can add support for disabling this fusion later if needed

Test Plan:
python test/test_quantization.py -k test_conv_transpose_bn_fusion

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122046
Approved by: https://github.com/andrewor14
2024-03-19 21:00:57 +00:00
bc1fef113d Respect TORCH_DISABLE_ADDR2LINE in symbolizer (#121359)
If TORCH_DISABLE_ADDR2LINE is set, the symbolizer will instead give the
filename of the shared library as the filename, the offset in that library as the linenumber,
and use dladdr to get the function name if possible. This is much faster than using addr2line,
and the symbols can be later resolved offline using addr2line if desired.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121359
Approved by: https://github.com/aaronenyeshi
2024-03-19 20:50:26 +00:00
7718a1cd4f T159183991: Error: EXC_SOFTWARE / SIGABRT at IGPyTorchFramework:-[MPSImageWrapperTrampoline endSynchronization:] (MPSImageWrapper.mm<line_num>):cpp_exception_clas (#122132)
Summary: Prevent crash by not throwing a C++ exception.

Test Plan: spongebobsandcastle

Reviewed By: SS-JIA

Differential Revision: D55036050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122132
Approved by: https://github.com/SS-JIA
2024-03-19 20:01:33 +00:00
c0b2e56c8f Support triton.language.dtype with torch.compile -- Second Attempt (#122141)
This PR is the second attempt at supporting `triton.language.dtype`: instead of putting it on the graph, we now put it in the side table, since it is a constant.
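
A hedged sketch of the pattern this enables: passing a `triton.language.dtype` constant to a user-written kernel from compiled code (requires a CUDA GPU with Triton; the kernel is illustrative, not from this PR):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def cast_kernel(x_ptr, out_ptr, n_elements, DST_DTYPE: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x.to(DST_DTYPE), mask=mask)

@torch.compile
def to_half(x):
    out = torch.empty(x.shape, dtype=torch.float16, device=x.device)
    n = x.numel()
    # tl.float16 is a constant; per this PR it goes to a side table rather than the fx graph
    cast_kernel[(triton.cdiv(n, 1024),)](x, out, n, DST_DTYPE=tl.float16, BLOCK=1024)
    return out

print(to_half(torch.randn(4096, device="cuda")).dtype)   # torch.float16
```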

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122141
Approved by: https://github.com/jansel
ghstack dependencies: #122140
2024-03-19 19:40:52 +00:00
58a805da71 [UserDefinedTriton] Move constant args out of the fx graph (#122140)
@ezyang mentioned that we should not put constant args on the graph, especially when there are args that would be trickier to put on the graph; e.g., the next PR needs `triton.language.dtype` as an argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122140
Approved by: https://github.com/jansel
2024-03-19 19:40:52 +00:00
c5ffebebab [export] allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910)
Creating this after [PR](https://github.com/pytorch/pytorch/pull/121642) got reverted.

The current dynamic shapes implementation fixes the lower bound of Dims to 2 for analysis, but allows 0/1 shapes at runtime. This leads to failures when initializing Dim(1, 2). This PR sets the lower bound to 0 and avoids erroring out when it conflicts with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121910
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-19 19:08:05 +00:00
d56ab7b020 Revert "[torch export][serialize] create a more compact stacktrace format for serialization (#121675)"
This reverts commit eae89138d891d0310483c4d86dcb69b16de0a6b5.

Reverted https://github.com/pytorch/pytorch/pull/121675 on behalf of https://github.com/jeanschmidt due to It seems that this PR broke lint jobs, I am reverting to confirm if this is the case ([comment](https://github.com/pytorch/pytorch/pull/121675#issuecomment-2007919486))
2024-03-19 19:02:09 +00:00
36e5c1dcab Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit edd04b7c16cc6715411119bb7db234a9df59065f.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))
2024-03-19 18:59:46 +00:00
88999674a0 Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit 39877abee2c3ad1956013d467b0f6e86cd20acfb.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2007898831))
2024-03-19 18:50:12 +00:00
e0d57001ef [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/fully_connected_op_prune.h (#122165)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D54380402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122165
Approved by: https://github.com/Skylion007
2024-03-19 18:41:16 +00:00
6bd2d12bc7 release gil in prepareProfiler (#121949)
Initializing the profiler while holding the GIL can lead to deadlocks, as it makes some presumably synchronizing CUDA calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121949
Approved by: https://github.com/aaronenyeshi
2024-03-19 18:05:21 +00:00
7fb2d69282 [PT2] - Fix cat backwards wrapping on symints (#121527)
Summary:
Wrapping was comparing SymInts and ints, forcing a guard. Rewrite it with TORCH_GUARD_SIZE_OBLIVIOUS.
```
[trainer0|0]:  File "<invalid>", line 0, in THPEngine_run_backward(_object*, _object*, _object*)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::CatBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::details::cat_tensors_backward(at::Tensor const&, std::vector<std::vector<c10::SymInt, std::allocator<c10::SymInt>>, std::allocator<std::vector<c10::SymInt, std::allocator<c10::SymInt>>>> const&, std::vector<c10::ScalarType, std::allocator<c10::ScalarType>> const&, long)
[trainer0|0]:  File "<invalid>", line 0, in c10::operator==(c10::SymInt const&, int)
[trainer0|0]:  File "<invalid>", line 0, in c10::SymBool::guard_bool(char const*, long) const
[trainer0|0]:  File "<invalid>", line 0, in torch::impl::PythonSymNodeImpl::guard_bool(char const*, long)
```

Test Plan: Regular CI

Differential Revision: D54667300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121527
Approved by: https://github.com/ezyang
2024-03-19 18:03:02 +00:00
8de4d86479 Back out "[fx] Preserve Fx graph node order in partitioner across runs (#115621)" (#122113)
Summary:
Original commit changeset: 6578f47abfdb

Original Phabricator Diff: D54913931

Differential Revision: D55027171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122113
Approved by: https://github.com/osalpekar
2024-03-19 18:00:37 +00:00
eae89138d8 [torch export][serialize] create a more compact stacktrace format for serialization (#121675)
Summary:
- we want fx nodes' stack trace format in the exported program to stay backward compatible and the same as before
- however, in the serialized format we want a more compact stack_trace representation, otherwise the node attributes are dominated by stack traces
- this diff implements the minimal change in the serialization process to dedupe node stack traces by introducing a fileinfo_list and a filename_to_abbrev map, so an index can represent each filename and a lineno can represent each line (see the sketch below)
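
A minimal sketch of the deduplication idea (helper name and data layout are illustrative, not the actual serialization schema):

```python
def compact_stack_traces(stack_traces):
    # stack_traces: iterable of (filename, lineno) pairs
    fileinfo_list = []       # ordered list of unique filenames
    filename_to_abbrev = {}  # filename -> index into fileinfo_list
    compact = []
    for filename, lineno in stack_traces:
        if filename not in filename_to_abbrev:
            filename_to_abbrev[filename] = len(fileinfo_list)
            fileinfo_list.append(filename)
        compact.append((filename_to_abbrev[filename], lineno))
    return fileinfo_list, compact
```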

Test Plan:
# llm
base on D54497918
```
buck2 run @//mode/dev-nosan fbcode//executorch/examples/models/llama2:export_llama -- -c ~/stories110M.pt -p ~/params.json
```
set up breakpoint after serialization/deserialization
- serialize
```
(Pdb) v_meta = [n.meta for n in exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193647450
(Pdb) json_program = json.dumps(_dataclass_to_dict(serialized_graph.co_fileinfo_ordered_list),cls=EnumEncoder)
(Pdb) json_bytes = json_program.encode('utf-8')
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(json_bytes)).number
1193604333
(Pdb) sys.getsizeof(json_bytes)
3846
(Pdb) compressed_bytes = zstd.ZstdCompressor().compress(json_bytes)
(Pdb) sys.getsizeof(compressed_bytes)
1139
```
in P1193647450 (before serialization), search for `stack_trace`
in P1193604333 (after serialization), search for `stack_trace` and `co_fileinfo_ordered_list`

[note: didn't do compression in this diff since the size is pretty small and it adds complexity if we do compression]
- deserialize
```
(Pdb) v_meta = [n.meta for n in deserialized_exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193629435
```
in P1193629435, search for `stack_trace`

# ads

Differential Revision: D54654443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121675
Approved by: https://github.com/angelayi
2024-03-19 17:58:12 +00:00
eqy
271b12c790 [Functorch] Bump tolerances for test_per_sample_grads_embeddingnet_mechanism_functional_call_cuda (#122014)
the `rtol` was indeed a problem on Grace Hopper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122014
Approved by: https://github.com/zou3519
2024-03-19 17:52:39 +00:00
ba9a1d96a4 Add scuba logging for TorchScript usage (#121936)
Summary: Infra to log live usage of TorchScript internally

Test Plan: manually tested

Differential Revision: D54923510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121936
Approved by: https://github.com/zhxchen17
2024-03-19 17:38:27 +00:00
4819da60ab [TD] Add LLM retrieval + heuristic (#121836)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121836
Approved by: https://github.com/osalpekar
2024-03-19 17:31:47 +00:00
cec0fd6f2f [pt2] add symbolic shape support for decompose mm and expose max_block to user config (#121440)
Summary:
1) As described in https://fb.workplace.com/groups/1075192433118967/permalink/1381918665779674/
As a follow-up, we can increase max_block["y"] to solve the issue
2) Add symbolic shape support for the decompose mm pass. I did not find a good way to compare a SymInt with an int, so when there is a symbolic shape, I assume it is a "large" dim (see the sketch below).
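
A minimal sketch of the heuristic in 2) (not the actual inductor pass; the helper name and threshold are illustrative):

```python
import torch

def is_large_dim(dim, threshold: int = 1024) -> bool:
    # Comparing a SymInt with an int may force a guard, so treat any
    # symbolic dimension as "large" for the decompose-mm decision.
    if isinstance(dim, torch.SymInt):
        return True
    return dim >= threshold
```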

Test Plan:
Without change block: aps-pt2-7c23cea900

increase y_block: aps-pt2_dynamic_shape-25a027423c

Differential Revision: D54525453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121440
Approved by: https://github.com/mengluy0125, https://github.com/Yuzhen11
2024-03-19 17:31:16 +00:00
764eae9c4e Revert "Add Flash Attention support on ROCM (#121561)"
This reverts commit a37e22de7059d06b75e4602f0568c3154076718a.

Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm.  We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))
2024-03-19 17:14:28 +00:00
88ebdbc97c [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Where since the module mutates `self.x` you would expect `compiled_module.x`
to be updated but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
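
A minimal, simplified sketch of the forwarding idea (illustrative only, not the actual `torch._dynamo.OptimizedModule` implementation):

```python
import torch

class Wrapper:
    def __init__(self, mod):
        # bypass our own __setattr__ while storing the wrapped module
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        # forward writes to the wrapped module so wrapper.x aliases mod.x
        setattr(self._orig_mod, name, value)

mod = torch.nn.Linear(2, 2)
w = Wrapper(mod)
w.x = 1
assert mod.x == 1
```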

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-19 16:51:43 +00:00
2164b7f746 Flatten/Unflatten micro optimization in proxy_tensor.py (#121993)
Lowers compile time by 1s across all suites on average
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121993
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/zou3519
2024-03-19 16:49:28 +00:00
42624bceb6 Fixes nan with large bf16 values (#122135)
Fixes #121558

Performance on main:
``` Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.608132004970683 | 65.90210803551601  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.75877740024589  | 64.83824399765581  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.465420153690506 |  67.6770955324173  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.398148600477725 | 68.19829455344006  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 29.053532000398263 | 99.58901099162175  |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 |  27.826815698063   | 98.05690299253911  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 49.89655229728669  | 178.24282555375248 |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 48.840098950313404 | 174.5950729819015  |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 505.66218036692584 | 1865.9265094902366 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 295.0534054543823  | 967.3831606050952  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.496030446141958 | 55.11070846114308  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.47399884648621  | 55.452342028729625 |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.216444296995178 | 55.14447903260589  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 12.763233599252999 | 55.142355500720434 |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 19.409965351223946 |  74.9107634765096  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.02470579952933  | 74.84168506925926  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 46.37695319834165  | 172.19150450546294 |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 45.225963747361675 | 185.19691249821335 |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 634.3090848531574  | 2249.057865119539  |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 320.47313248040155 | 1053.0515247955916 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.448987301671878 | 63.63581650657579  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.509283400140703 | 63.059300999157124 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.71098779467866  | 105.55780201684684 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.264925852417946 | 105.12311349157244 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.218703348655254 | 222.87272597895935 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 43.55393464793451  | 230.63290398567915 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.02968645095825 | 514.6893998607993  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 157.13709802366793 | 624.5892751030624  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1776.7079547047617 | 6353.551096981391  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1143.6000745743513 | 3811.8767354171723 |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.717129248427227 | 55.35991647047922  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.746983398916198 | 55.76716404175386  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.255573300644752 | 106.47456656442955 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.46409669774584  | 108.07770595420152 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 46.63354124641045  | 213.74862996162847 |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 47.01801469782367  | 240.78139301855117 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 127.76448752265424 | 508.08745552785695 |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 168.6308984644711  | 667.2996102133766  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2268.1598202325404 | 7727.2648515645415 |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1242.8469699807465 | 4161.965740495361  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.340955897932872 | 93.72280450770633  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.25262250029482  |  93.2030284893699  |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.598425600444898 | 183.23776399483904 |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.362583553418514 | 183.51862096460536 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.52303148806094  | 383.50319798337296 |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.41743348259479  | 432.5502900755964  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 217.76640450116247 | 943.9354750793427  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 303.0781910638325  | 1225.4394043702632 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.8542854059488 | 12194.579601055011 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2268.1174043100327 | 7608.0941944383085 |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.289720651460811 | 95.88620596332476  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.618648946750909 | 95.56685149436818  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 31.567946751601994 | 180.62468653079122 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 28.611703700153157 | 189.4215695792809  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.11306998459621  | 385.25596749968827 |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 93.82540901424363  | 455.77428903197875 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 226.80530551588163 | 965.8026450779289  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 327.4116570246406  | 1312.5067745568228 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.5064804060385 | 15020.768146496266 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2433.0302356975153 | 8300.016750581563  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

Performance on this branch:
```Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.783618393586949 | 65.59692794689909  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.064015300711617 | 56.99719698168337  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.629025398287922 | 68.65267595276237  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.462356004398313 | 68.35797848179936  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 |  29.5476081490051  | 101.22994752600789 |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 28.395320149138573 | 98.62275794148445  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 50.50016101449728  | 181.4357690163888  |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 49.450615647947416 | 175.86063902126625 |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 506.06461532879626 | 1866.0613044630736 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 299.9336270149797  | 976.4662646921353  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.45752210286446  | 58.79682704107836  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.407129396684468 | 58.14061599085107  |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.822759891627355 | 56.56979401828722  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 13.39154909946956  |  56.7130644340068  |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 20.282494352431968 | 77.29688903782517  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.899454596452415 |  75.4446149803698  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 48.494275606935844 | 177.5322465109639  |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 46.84524350450374  | 189.1778860008344  |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 635.1026654010639  | 2248.0451600858937 |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 335.1591735263355  | 1080.4320796160027 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.63953539985232  | 65.50709309522063  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.858113402035087 | 63.021871959790595 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.98318645055406  | 105.87883047992364 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.619045056402683 | 104.90188701078296 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.91175540117546  | 226.00732848513871 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 44.39614630537107  | 232.39317198749632 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 135.5409600073472  | 522.7949097752571  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 158.79383607534692 | 628.5856699105352  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1775.9978299727663 | 6343.203847063706  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1160.680354805663  | 3842.235009651631  |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.553713708417488 | 65.50691701704638  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.486379051348194 |  56.9980075233616  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.56585600087419  | 107.89892700267956 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.828144202008843 | 109.05519902007653 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 48.23235589428805  | 217.8974545095116  |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 49.09284680034033  | 244.73925953498107 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.77827049791813 | 522.7259948151186  |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 176.60772847011688 | 681.5171707421541  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2267.821540008299  | 7720.425300067291  |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1295.3941145678982 | 4272.425139788538  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.514714101096615 |  94.2192979855463  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.553097198018804 |  93.244242540095   |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.95821905019693  | 185.0469880155288  |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.709681446664035 | 184.22623950755226 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 85.85420495364815  | 388.3417735341937  |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.97473795898259  | 434.4228169647977  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 220.6919804448262  | 958.9654899900779  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 306.55586952343583 | 1233.2170095760375 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.7326447824016 | 12183.611298678443 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2299.064100370742  | 7669.618452200666  |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.427107692928985 | 96.96270158747211  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.856995843118057 | 96.38117247959599  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 |  32.9956392000895  | 182.52741603646427 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 29.397601098753512 | 191.0755339777097  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 89.06024845782667  | 392.2585004474967  |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 97.78487798757851  | 462.07307645818213 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 |  240.521906001959  | 992.4693452194335  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 341.98952303268015 | 1339.2950996058062 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.311005110853  | 15001.030603889374 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2535.9767401823774 | 8528.990152990447  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

```
{'avg_forward_time_nan_fix': 399.7900972732653,
 'avg_backward_time_nan_fix': 1409.652114014413,
 'avg_forward_time_main_branch': 394.6807206988645,
 'avg_backward_time_main_branch': 1399.4055472857629,
 'geo_mean_nan_fix': 150.95049601244946,
 'geo_mean_main_branch': 148.3381648508822}
 ```

The y axis label is wrong (the values are in microseconds), but the relative comparison still holds.
<img width="790" alt="Screenshot 2024-03-18 at 3 34 15 PM" src="https://github.com/pytorch/pytorch/assets/32754868/ca278c15-b815-4535-bdcd-07e522055466">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122135
Approved by: https://github.com/cpuhrsch
2024-03-19 16:32:00 +00:00
e26280ad8b Fix typing for autograd.Function with ctx-less forward (#122167)
Previously, typing an autograd.Function like the following would lead to
a mypy error (which expects the first arg to forward to be named `ctx`).

This PR fixes that by deleting the ctx arg.

```py
class MySin(torch.autograd.Function):
    @staticmethod
    def forward(x: torch.Tensor) -> torch.Tensor:
        return x.sin()

    @staticmethod
    def setup_context(*args, **kwargs):
        pass

    @staticmethod
    def backward(ctx, grad):
        if grad.stride(0) > 1:
            return grad.sin()
        return grad.cos()
```

Test Plan:
- tested locally (I don't know how to put up a test in CI for this).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122167
Approved by: https://github.com/soulitzer
2024-03-19 16:15:23 +00:00
f9ed1c432d Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 0ff1109e2688b8c841c9dd0eeecfba16f027b049.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))
2024-03-19 15:40:26 +00:00
c05bf0037d [dynamo] Remove copy_graphstate/restore_graphstate (#122067)
Some dead code cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122067
Approved by: https://github.com/oulgen
2024-03-19 15:37:53 +00:00
7673cb534a Revert "Skip nonzero unbacked SymInt memo in inference mode (#122147)"
This reverts commit 5e2687391229cee6e4dc0214f9208b4ecbe058c1.

Reverted https://github.com/pytorch/pytorch/pull/122147 on behalf of https://github.com/jeanschmidt due to Reverting to see if trunk error in inductor are related ([comment](https://github.com/pytorch/pytorch/pull/122147#issuecomment-2007513000))
2024-03-19 15:37:24 +00:00
cyy
6c01c25319 [Clang-tidy header][28/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122175)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following https://github.com/pytorch/pytorch/pull/122023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122175
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-19 14:08:54 +00:00
6c50308801 [ATen-Vulkan][EZ] Small fixes: fix gpu size calculation and Half scalartype ctype mapping (#122096)
Summary:
## Context

Some small fixes to the ATen-Vulkan backend.

The first is that GPU sizes for a 4 dimensional tensor with width packing had a small bug:

```
      case 4:
        switch (memory_layout) {
          case api::GPUMemoryLayout::TENSOR_WIDTH_PACKED:
            gpu_sizes.at(0) = sizes.at(0);
            gpu_sizes.at(1) = sizes.at(1);
            // should be gpu_sizes.at(2) == sizes.at(2)
            gpu_sizes.at(2) = sizes.at(3);
            gpu_sizes.at(3) = api::utils::align_up(sizes.at(3), INT64_C(4));
            break;
```

This was fixed by simplifying the logic of GPU size calculation for texture storage.

The second was to modify the ctype mapping of the `api::kHalf` scalar type to be `float` instead of `unsigned short`. This is because GLSL does not natively support `float16`, so even with an FP16 texture type, CPU/GPU transfer shaders will have to read from and write to `float` buffers.

In the future, we will look into integrating [VK_KHR_shader_float16_int8](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_shader_float16_int8.html) into ATen-Vulkan to allow for 16 bit and 8 bit types to be referenced explicitly.

Test Plan: CI

Differential Revision: D55018171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122096
Approved by: https://github.com/jorgep31415
2024-03-19 13:27:27 +00:00
39877abee2 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-19 13:06:42 +00:00
edd04b7c16 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-19 13:06:42 +00:00
6b5259e507 [lint] bump lint dependency PyYAML to 6.0.1 to support Python 3.12 (#122022)
[PyYAML 6.0.0](https://pypi.org/project/PyYAML/6.0) was released 2.5 years ago and it is not installable with Python 3.12.

This PR bumps the version of [PyYAML to 6.0.1](https://pypi.org/project/PyYAML/6.0.1) in `lintrunner` configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122022
Approved by: https://github.com/Skylion007
2024-03-19 12:23:49 +00:00
8168338063 Add CPU implementation for torch._int_mm (s8*s8->s32) (#121792)
Fixes #121647

**Description**
Currently, the op `torch._int_mm` only supports CUDA devices. This PR adds a CPU implementation for it.
Besides the request from the issue, this op may also be useful for planned CPU implementations of [LLM.int8()](https://arxiv.org/abs/2208.07339) in [Bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.
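
A hedged usage sketch (shapes chosen arbitrarily; backend-specific shape constraints may still apply):

```python
import torch

a = torch.randint(-128, 128, (16, 32), dtype=torch.int8)
b = torch.randint(-128, 128, (32, 8), dtype=torch.int8)
c = torch._int_mm(a, b)  # int8 x int8 matmul accumulating into int32
assert c.dtype == torch.int32
```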

**Test plan**
`python test/test_linalg.py -k test__int_mm_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121792
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-03-19 08:44:33 +00:00
0d845f7b07 Fix auto_functionalize (#121990)
Differential Revision: D54964130

When we re-export, the auto_functionalize HOP will be in the graph. Therefore, we need to implement a proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed to be functional, it is OK to just fall through it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-03-19 07:11:11 +00:00
a2a88f39ee Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121537
Approved by: https://github.com/ezyang
2024-03-19 06:15:00 +00:00
0ff1109e26 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor GPU trace to be device-agnostic. GPU trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and shareable among device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-19 06:02:28 +00:00
09ce76809c Improve compiler detection on MacOS (#121406)
By relying on the `is_apple_clang` helper function rather than on the compiler name (as `gcc` is clang on macOS):
```
% which gcc; gcc -v
/usr/bin/gcc
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.3.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
```
But
```
% /opt/homebrew/bin/gcc-13 -v
Using built-in specs.
COLLECT_GCC=/opt/homebrew/bin/gcc-13
COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
Target: aarch64-apple-darwin23
Configured with: ../configure --prefix=/opt/homebrew/opt/gcc --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls --enable-checking=release --with-gcc-major-version-only --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --with-system-zlib --build=aarch64-apple-darwin23 --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Homebrew GCC 13.2.0)
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121406
Approved by: https://github.com/malfet, https://github.com/jansel
2024-03-19 05:32:08 +00:00
FEI
8499767e96 add sdpa choice for DeviceType::PrivateUse1 (#121409)
Fixes  #116854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121409
Approved by: https://github.com/drisspg
2024-03-19 05:08:46 +00:00
5bc7f7f977 [dynamo] Make tx.next_instruction lazy (#122066)
Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 2.5s to 2.4s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122066
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060, #122063
2024-03-19 04:23:30 +00:00
153a01833b [dynamo] Optimize SourcelessBuilder (#122063)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 2.7s to 2.5s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122063
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060
2024-03-19 04:23:30 +00:00
8082adcf65 [dynamo] Only rename a proxy once (#122060)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 3.9s to 2.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122060
Approved by: https://github.com/oulgen
ghstack dependencies: #122039, #122043, #122055, #122058
2024-03-19 04:23:27 +00:00
2bec55c5f9 [dynamo] Remove VariableTracker.parents_tracker (#122058)
This is leftover from mutable variable tracker days and no longer needed.

Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 4.2s to 3.9s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122058
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055
2024-03-19 04:23:24 +00:00
3c706bf483 [dynamo] Optimize BuiltinVariable (#122055)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.1s to 4.2s (compared to 2 PRs ago).

This works by precomputing (and caching) the parts of `BuiltinVariable.call_function` that don't depend on the values of args/kwargs.
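
A generic, hedged illustration of the caching idea (names are illustrative, not dynamo's actual code):

```python
import functools
import operator

@functools.lru_cache(maxsize=None)
def _lookup_handler(op_name, lhs_type, rhs_type):
    # the arg-value-independent part: computed once per (op, type, type) key
    return getattr(operator, op_name)

def call_builtin(op_name, lhs, rhs):
    handler = _lookup_handler(op_name, type(lhs), type(rhs))
    return handler(lhs, rhs)

assert call_builtin("add", 1, 2) == 3
```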

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122055
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043
2024-03-19 04:23:20 +00:00
07caea5c12 [dynamo] Refactor COMPARE_OP and comparison builtins (#122043)
This removes the duplicate handling of comparison ops between symbolic_convert and builtin and refactors the handling to use the binop infrastructure.  This change regresses overheads a bit, but this is fixed in the next PR.

New test skips are variants of `type(e) is np.ndarray` previously falling back to eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122043
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039
2024-03-19 04:23:17 +00:00
769ff86b91 [dynamo] Optimize COMPARE_OP (#122039)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.6 to 5.1s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122039
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-19 04:23:14 +00:00
cyy
e1706bba3b [Clang-tidy header][27/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122023)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following #122015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122023
Approved by: https://github.com/ezyang
2024-03-19 03:26:15 +00:00
5e26873912 Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.
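
A hedged sketch of the scenario (not the actual test; the config flag is assumed to be needed to capture data-dependent output shapes without a graph break):

```python
import torch
import torch._dynamo.config as dynamo_config

dynamo_config.capture_dynamic_output_shape_ops = True  # assumption, see above

@torch.compile(fullgraph=True)
def f(x):
    # nonzero's output length is data-dependent -> unbacked SymInt when traced
    return torch.nonzero(x)

with torch.inference_mode():
    print(f(torch.tensor([0.0, 1.0, 0.0, 2.0])))
```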

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-19 03:20:33 +00:00
8860c625ea [dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726
Approved by: https://github.com/jansel
2024-03-19 03:11:31 +00:00
f84d560236 [dynamo] Raise accumulated cache size limit (#122130)
Fixes #114511

This was raised by IBM folks, where an LLM compile was failing because the model had more than 64 layers.
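
A hedged illustration (the config knob name is taken from `torch._dynamo.config` as I understand it; treat it as an assumption):

```python
import torch._dynamo.config as dynamo_config

# Models with many compiled layers/sub-modules may need a higher limit to
# avoid falling back to eager once the accumulated cache fills up.
dynamo_config.accumulated_cache_size_limit = 256
```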

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122130
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #121954, #122005
2024-03-19 02:35:48 +00:00
7084528eb9 [dynamo][model_output] Do not include none for CustomizedDictVariable (#122005)
Fixes https://github.com/pytorch/pytorch/issues/120923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122005
Approved by: https://github.com/weifengpy, https://github.com/jansel
ghstack dependencies: #121954
2024-03-19 02:35:48 +00:00
2b06098380 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-19 02:22:04 +00:00
6502c888cf Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010
Approved by: https://github.com/eellison
2024-03-19 02:17:10 +00:00
18d94d7165 Make FX nodes sortable (#122071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122071
Approved by: https://github.com/oulgen
2024-03-19 01:40:56 +00:00
1f4d4d3b78 [fx] preserver partiioner order fix (#122111)
Summary:
The previous implementation seems to introduce a key-value pair of {"node": None}. This causes an error in logging later on, because we extract the name from "node" but it is a string instead of a torch.fx.Node.

With this change, the tests seem to pass.

Test Plan:
CI

ExecuTorch CI:
buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_models

Reviewed By: larryliu0820

Differential Revision: D55026133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122111
Approved by: https://github.com/mikekgfb
2024-03-19 01:00:44 +00:00
34f36a28df [MPS] Fwd-fix for clamp regression (#122148)
Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it

- Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with
  ```
    /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library'
   ```
- Change the order of the max and min calls, as it is apparently important for
  consistency: `min(max(a, b), c)` might not equal `max(min(a, c), b)` if `c` is not always less than or equal to `b` (see the snippet below)
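
A small numeric snippet showing why the order matters when the bounds are inconsistent (hi < lo):

```python
a, lo, hi = 5, 2, 1
print(min(max(a, lo), hi))  # 1
print(max(min(a, hi), lo))  # 2
```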

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148
Approved by: https://github.com/huydhn
2024-03-19 00:52:45 +00:00
ae983d2d6e Fix typo in sparse.rst (#121826)
Change word "on" to "one" when talking in the third person.

Fixes #121770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826
Approved by: https://github.com/janeyx99
2024-03-19 00:17:19 +00:00
e6cf3e90a5 [AOTAutograd / Functionalization] Fix incorrect expand_inverse (#122114)
This is a rebase of https://github.com/pytorch/pytorch/pull/114538,
originally submitted by @jon-chuang.

Fixes #114302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122114
Approved by: https://github.com/bdhirsh
2024-03-18 22:52:57 +00:00
ba69dc6675 [Easy] add option to print compilation time (#121996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996
Approved by: https://github.com/davidberard98
2024-03-18 22:42:41 +00:00
2ab8b34433 Error out in case of in-source builds (#122037)
Such builds cannot succeed, as the arch-specific ATen dispatch mechanism creates temporary files that get added to the build system on every rebuild, which results in build failures.

Fixes https://github.com/pytorch/pytorch/issues/121507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122037
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-03-18 21:48:18 +00:00
e6a461119a [functorch] Add batch rule for linalg.lu_unpack (#121811)
Fixes: https://github.com/pytorch/pytorch/issues/119998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121811
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-03-18 21:24:16 +00:00
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
a17cd226d6 [inductor] Enable FX graph caching on another round of inductor tests (#121994)
Summary: Enabling caching for these tests was blocked by https://github.com/pytorch/pytorch/pull/121686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121994
Approved by: https://github.com/eellison
2024-03-18 20:55:18 +00:00
7c5e29ae71 Back out "Support triton.language.dtype with torch.compile (#121690)" (#122108)
Summary: Some hard-to-deal-with package import/export related problems. Let's revert and start with a clean slate.

Test Plan: CI

Differential Revision: D55024877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122108
Approved by: https://github.com/ezyang
2024-03-18 20:50:28 +00:00
685ace3834 [compiled autograd] add dynamo segfault test (#122004)
To catch issues like https://github.com/pytorch/pytorch/issues/121862 in CI. This passes because we reverted the PRs, and https://github.com/pytorch/pytorch/pull/121870 confirms that this test can catch it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122004
Approved by: https://github.com/eellison
2024-03-18 20:07:15 +00:00
40acc84aaf Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-18 19:38:15 +00:00
0a1b3be216 chore: add unit test to verify split_by_tags output_type (#121262)
Add a test case as per https://github.com/pytorch/pytorch/pull/120361#issuecomment-1979163324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121262
Approved by: https://github.com/atalman
2024-03-18 19:19:26 +00:00
676a77177e Revert "[BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)"
This reverts commit 4cbf963894e78d1cfedffe4f829740dc99163caa.

Reverted https://github.com/pytorch/pytorch/pull/121908 on behalf of https://github.com/jeanschmidt due to this is due to OIDC can't work on forked PR due to token write permissions can't be shared ([comment](https://github.com/pytorch/pytorch/pull/121908#issuecomment-2004707582))
2024-03-18 19:03:11 +00:00
df1cdaedeb Log restart reasons and extra compile time in CompilationMetrics (#121827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121827
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-03-18 18:59:25 +00:00
74c09a757b Simplify Storage meta conversion with PyObject preservation (#122018)
Thanks to https://github.com/pytorch/pytorch/pull/109039 we can rely on
finalizers on Storage PyObject to handle removal from dict.

Irritatingly, we still have to attach a finalizer, because we don't have
a dict that is weak in both key AND value (only one or the other).
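
A pure-Python analogue of the idea (hedged; the actual change is at the C++/PyObject level and the names below are illustrative):

```python
import weakref

class FakeStorage:  # stand-in object; real Storages are handled in C++
    pass

meta_cache = {}

def remember(storage, meta):
    key = id(storage)
    meta_cache[key] = meta
    # the finalizer removes the entry once the key object is collected,
    # since a dict weak in both key and value is not available
    weakref.finalize(storage, meta_cache.pop, key, None)

s = FakeStorage()
remember(s, "meta-storage")
del s
assert not meta_cache
```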

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122018
Approved by: https://github.com/eellison, https://github.com/kurtamohler
2024-03-18 18:55:58 +00:00
32410f80ec [Caffe2 CPU tests] Update CMakeLists.txt (#119643)
I was trying to build PyTorch with USE_GLOG=ON (so we could get better timestamps around the nccl logging) and ran into this error

```
[1/7] Linking CXX executable bin/verify_api_visibility
FAILED: bin/verify_api_visibility
: && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O2 -g -DNDEBUG -rdynamic     -Wl,--no-as-needed caffe2/CMakeFiles/verify_api_visibility.dir/__/aten/src/ATen/test/verify_api_visibility.cpp.o -o bin/verify_api_visibility -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/usr/local/cuda/lib64:/root/conda/lib:/mnt/code/pytorch/build/lib:  lib/libgtest_main.a  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /root/conda/lib/libmkl_intel_lp64.so  /root/conda/lib/libmkl_gnu_thread.so  /root/conda/lib/libmkl_core.so  -fopenmp  /usr/lib64/libpthread.so  -lm  /usr/lib64/libdl.so  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /root/conda/lib/libglog.so.0.4.0  /root/conda/lib/libgflags.so.2.2.2  -lpthread  /usr/local/cuda/lib64/libcudart.so  /usr/local/cuda/lib64/libnvToolsExt.so  lib/libgtest.a  -pthread && /root/conda/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/verify_api_visibility && :
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /mnt/code/pytorch/build/lib/libtorch.so: undefined reference to symbol '_ZTVN10__cxxabiv117__class_type_infoE@@CXXABI_1.3'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/lib64/libstdc++.so.6: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
```

Adding stdc++ explicitly to the list of libraries to link seems to fix the build, and I was able to get a working build of PyTorch.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119643
Approved by: https://github.com/zdevito
2024-03-18 18:35:32 +00:00
5d52b163d1 [dynamo] Optimize load/store/const op handling (#122038)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 6.7s to 5.6s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122038
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034, #122035
2024-03-18 18:08:06 +00:00
4034873a31 [dynamo] Optimize builtin handling (#122035)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 7.3s to 6.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122035
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034
2024-03-18 18:08:06 +00:00
6ca0323615 [dynamo] Optimize VariableTracker.__post_init__ (#122034)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 8.6s to 7.3s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122034
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033
2024-03-18 18:08:06 +00:00
115c9c6d6b Remove __getattribute__ on autograd.Function (#122033)
Improves `benchmarks/dynamo/microbenchmarks/overheads.py` from 38.7us to
34.3us.

See #122029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122033
Approved by: https://github.com/zou3519, https://github.com/soulitzer
ghstack dependencies: #122032
2024-03-18 18:08:06 +00:00
5a10b56083 [dynamo] Small microbenchmark changes (#122032)
Used to generate numbers in #122029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032
Approved by: https://github.com/yanboliang
2024-03-18 18:08:06 +00:00
1a58e9d357 [TD] LLM indexer to run daily (#121835)
Run indexer daily
Run indexer in docker container

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121835
Approved by: https://github.com/osalpekar, https://github.com/malfet
2024-03-18 16:34:01 +00:00
ceb1910bad Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 11b36e163df66196d24fbded4b37ef8f8c032640.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to New action is breaking current ci in not rebased PRs ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2004393980))
2024-03-18 16:33:23 +00:00
11b36e163d [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

Old-style `runs-on` definitions accept a string, while the new style requires an object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, specifying a group requires changing the format of the YAML. Unfortunately, there is no way to accomplish this with any trick in the book that I am aware of, because GitHub Actions YAML is not templatable and supports only minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-18 15:40:43 +00:00
c4d24b5b7f special-case cuda array interface of zero size (#121458)
Fixes #98133
retry of #98134
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121458
Approved by: https://github.com/bdice, https://github.com/ptrblck, https://github.com/mikaylagawarecki
2024-03-18 15:21:38 +00:00
f7908d9fa8 enable reshape+linear+reshape fusion for dynamic shapes (#121116)
reshape+linear+reshape fusion for dynamic shapes has been disabled in https://github.com/pytorch/pytorch/pull/107123.
Re-enable it by comparing the symbolic values in case of dynamic shapes. This will improve the performance for dynamic shape cases.
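
As a rough illustration, here is a minimal sketch of the reshape + linear + reshape pattern this fusion targets (names and shapes are made up; the actual fusion is part of the mkldnn fusion pass):
```
import torch

lin = torch.nn.Linear(64, 64)

@torch.compile(dynamic=True)
def f(x):                       # x: (batch, seq, 64)
    b, s, c = x.shape
    y = lin(x.reshape(b * s, c))
    return y.reshape(b, s, c)

print(f(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```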

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121116
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 14:46:27 +00:00
f2f8eeea94 Inductor: fix Conv output stride for dynamic shapes (#121400)
Fixes https://github.com/pytorch/pytorch/issues/120873.
Fixes the output stride of Conv in the case of dynamic shapes. The previous logic in inductor assumed that the output stride of Conv is always channels last while it is actually contiguous if `dynamic_shapes and is_contiguous_storage_and_layout(x)`.

### Static shape
In static shape cases, since weight is prepacked (`weight_t.is_mkldnn()` will be `true`), we'll always force output to be channels last in the Conv kernel, thus it's fine to have the assumption in Inductor that the output stride of Conv is always channels last.
96ed37ac13/aten/src/ATen/native/mkldnn/Conv.cpp (L357-L358)

### Dynamic shape
In dynamic shape cases, we don't prepack the weight for Conv; instead, the Conv kernel decides the output layout based on the input and weight layouts.
96ed37ac13/torch/_inductor/fx_passes/mkldnn_fusion.py (L1024-L1025)

For input with `channels = 1`, e.g. a tensor of size `(s0, 1, 28, 28)` and stride `(784, 784, 28, 1)`, with `req_stride_order` in channels-last order, Inductor's `require_stride_order` on `x` won't change the tensor's strides, since the stride of size-1 dimensions is ignored
96ed37ac13/torch/_inductor/ir.py (L5451)

The Conv kernel, however, considers such a tensor **contiguous** rather than channels last, so the output of the Conv kernel will be in contiguous format.
96ed37ac13/aten/src/ATen/native/ConvUtils.h (L396-L404)

To align with the behavior of the Conv kernel, we set the output_stride in such cases to be contiguous instead of channels last.
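
As a standalone illustration of the ambiguity described above (a sketch that uses a concrete batch size instead of `s0`):
```
import torch

# A (N, 1, 28, 28) tensor whose default strides match the contiguous layout.
x = torch.randn(8, 1, 28, 28)
print(x.stride())                                           # (784, 784, 28, 1)
print(x.is_contiguous())                                    # True
# The stride of the size-1 channel dimension is ignored by the layout checks,
# so the very same tensor also passes the channels-last check, which is why
# require_stride_order leaves its strides untouched.
print(x.is_contiguous(memory_format=torch.channels_last))   # True
```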

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121400
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 10:56:58 +00:00
206da97b8b [aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)
Looks like we already support aoti_torch_cuda_sort in the C shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122052
Approved by: https://github.com/oulgen
2024-03-18 09:14:35 +00:00
65ccac6f17 Fix triton import time cycles (#122059)
Summary: `has_triton` causes some import-time cycles. Let's use `has_triton_package`, which is sufficient.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test -- --exact 'fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test - test_collect_features_from_graph_module_nodes (fblearner.flow.projects.model_processing.pytorch_model_export_utils.logical_transformations.tests.filter_inference_feature_metadata_test.FilterInferenceFromFeatureMetadataTest)'
```
now passes

Differential Revision: D55001430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122059
Approved by: https://github.com/aakhundov
2024-03-18 05:50:32 +00:00
bc9d054260 [executorch hash update] update the pinned executorch hash (#122061)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122061
Approved by: https://github.com/pytorchbot
2024-03-18 05:02:27 +00:00
7380585d97 [vision hash update] update the pinned vision hash (#122062)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122062
Approved by: https://github.com/pytorchbot
2024-03-18 03:41:50 +00:00
e39aedfcc5 Fix fx graph triton import bug (#122041)
Summary: Unless we register triton as a special import, the FX graph import mechanism imports it as `from fx-generated._0 import triton as triton`, which is obviously broken.

Test Plan:
I could not figure out how to write a test for this but
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//tgif/lib/tests/gpu_tests:lowering_pass_test -- -r test_default_ait_lowering_multi_hardwares
```
now passes

Differential Revision: D54990782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122041
Approved by: https://github.com/aakhundov
2024-03-17 22:48:51 +00:00
5030913d6a [test] Delete variables that have been declared but not referenced di… (#121964)
Delete variables that have been declared but not referenced in aten/src/ATen/test/cuda_distributions_test.cu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121964
Approved by: https://github.com/janeyx99
2024-03-17 09:45:05 +00:00
cyy
d9460758df [Clang-tidy header][26/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122015)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122015
Approved by: https://github.com/ezyang
2024-03-17 07:56:45 +00:00
c568b84794 [dynamo][guards] Move backend match to eval_frame (#121954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121954
Approved by: https://github.com/jansel
2024-03-17 06:52:10 +00:00
fc504d719f [executorch hash update] update the pinned executorch hash (#122036)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122036
Approved by: https://github.com/pytorchbot
2024-03-17 04:56:37 +00:00
6f74b76072 Move get_unwrapped outside of disable_functorch (#121849)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121849
Approved by: https://github.com/albanD
2024-03-16 22:25:07 +00:00
3bd38928ba [export] Improve consistency for nn_module_stack metadata, add checks to _trace.py (#120661)
We would like to improve consistency for nn_module_stack metadata in torch.export.

This PR ensures that all tests in test/export/test_export.py has the following constraints:
- Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules
- Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields)
- Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661
Approved by: https://github.com/avikchaudhuri
2024-03-16 21:44:52 +00:00
6d9588a12b [inductor] disable linear weight prepacking pass on double (#121478)
Fix #121175

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121478
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-03-16 13:24:21 +00:00
9990d1bc22 Add 'profiler/python' to the package. (#121892)
Fixes #ISSUE_NUMBER
expose the `py_symbolize` interface for use.
thank you
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121892
Approved by: https://github.com/zdevito
2024-03-16 11:11:26 +00:00
5f601a41e0 Pin protobuf to 3.20.2 on macOS (#121918)
The newer protobuf 5.26.0, released on March 13th, is causing failures in `test_hparams_*` from `test_tensorboard`, where the stringified metadata is wrong when escaping double quotes. For example, 3bc2bb6781.  This looks like an upstream issue from Tensorboard, which doesn't work with this brand new protobuf version https://github.com/tensorflow/tensorboard/blob/master/tensorboard/pip_package/requirements.txt#L29

The package has been pinned on Docker https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-ci.txt#L155, so it should be pinned on macOS too.  We want to eventually just have one requirements.txt file.

Fixes https://github.com/pytorch/pytorch/issues/122008
Fixes https://github.com/pytorch/pytorch/issues/121927
Fixes https://github.com/pytorch/pytorch/issues/121946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121918
Approved by: https://github.com/kit1980
2024-03-16 09:48:05 +00:00
4d9d5fe540 [executorch hash update] update the pinned executorch hash (#122009)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122009
Approved by: https://github.com/pytorchbot
2024-03-16 04:46:45 +00:00
4d92928fe2 [dynamo] Add tests for fake FSDP (#121610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121610
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735, #120965
2024-03-16 04:29:59 +00:00
0b7d9711d4 [dynamo] Add support for nn.Parameter constructor (part 2) (#120965)
This handles the case where the tensor isn't an input.

The changes to dynamo tests are cases where we would previously fall back to eager.
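
A rough sketch of the newly handled case (illustrative only, not taken from the PR's tests), where the tensor fed to `nn.Parameter` is created inside the compiled region:
```
import torch

@torch.compile(backend="eager")
def make_param(x):
    w = torch.ones_like(x) * 2        # intermediate tensor, not a graph input
    p = torch.nn.Parameter(w)         # constructing a Parameter inside the graph
    return p * x

print(make_param(torch.randn(3)))
```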

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120965
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735
2024-03-16 04:29:58 +00:00
040b925753 [Compiled Autograd] Reorder accumulate grad nodes (#121735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121735
Approved by: https://github.com/xmfan
2024-03-16 04:29:56 +00:00
f0b9a8344a [vision hash update] update the pinned vision hash (#121177)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121177
Approved by: https://github.com/pytorchbot
2024-03-16 03:25:08 +00:00
b94691700e [FSDP] Avoided CPU sync in clip_grad_norm_ (#122001)
Copying a scalar 0 tensor on CPU to GPU or constructing a scalar 0 tensor on GPU requires a CPU sync with the GPU. Let us avoid doing ops that involve it.

`FSDP.clip_grad_norm_` already first checks if all parameters are not sharded and calls into `nn.utils.clip_grad_norm_`, so at the point of the code changes, there is guaranteed to be some sharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122001
Approved by: https://github.com/wanchaol
2024-03-16 03:01:49 +00:00
7bc91d5dc2 [mergebot][BE] If we don't have any required checks, don't run required checks (#121921)
This PR addresses the issue identified in #121920. The existing problem is that all tests are deemed mandatory if none are selected as required. This behavior is particularly noticeable during a force merge operation.

In the context of a force merge, it may not be necessary to execute any tests which are not required (imo). However, this proposed change could be seen as controversial, hence it has been separated from the main update for further discussion and review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121921
Approved by: https://github.com/huydhn
ghstack dependencies: #121920
2024-03-16 01:35:21 +00:00
2b71b21a3f Don't use Proxy torch function in the sym size calls (#121981)
Fixes #ISSUE_NUMBER

Changes from https://github.com/pytorch/pytorch/pull/121938 + adds test

@bypass-github-pytorch-ci-checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121981
Approved by: https://github.com/davidberard98
2024-03-16 01:20:26 +00:00
37e563276b Document complex optimizer semantic behavior (#121667)
<img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667
Approved by: https://github.com/albanD
2024-03-16 00:43:47 +00:00
12662900f9 [inductor] FX graph cache: Fix bug handling constants (#121925)
Summary: During key calculation for FX graph caching: Rather than specialize on "small" vs. "large" tensor constants (i.e., inlined vs. not inlined), always hash on the tensor value. Doing so avoids the complication of trying to later attach the constant values as attributes to an already-compiled module. Instead, different constants will cause an FX graph cache miss and we'll just compile.

Test Plan: New unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121925
Approved by: https://github.com/eellison
2024-03-16 00:11:51 +00:00
cyy
6b0f61891f [Clang-tidy header][25/N] Fix clang-tidy warnings and enable clang-tidy on c10/cuda/*.{cpp,h} (#121952)
This PR enables clang-tidy to code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121952
Approved by: https://github.com/Skylion007
2024-03-16 00:09:54 +00:00
0cc60a05da Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381)"
This reverts commit ca80d07ac71c1bfc9b13c3281a713fed89f15e0f.

Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR.  Please take a look a reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))
2024-03-15 23:53:05 +00:00
07ec3356b9 Revert "Force upsample to be float32 (#121324)"
This reverts commit 2770e3addd9f05101705f0fef85a163e0034b8a5.

Reverted https://github.com/pytorch/pytorch/pull/121324 on behalf of https://github.com/huydhn due to I think it is better to revert and reland this next week 2770e3addd ([comment](https://github.com/pytorch/pytorch/pull/121324#issuecomment-2000617536))
2024-03-15 23:20:01 +00:00
256c0ec1e5 [docs] Added comment on replicate -> partial for _NormPartial (#121976)
Add a version of https://github.com/pytorch/pytorch/pull/121945#discussion_r1525697167 as a comment in the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121976
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869, #121945
2024-03-15 23:04:06 +00:00
b717aa6f36 Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 2c33e3a372c077badc561b4aad4997e52c03610a.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to I am seeing lots of inductor jobs failing after this change 2c33e3a372.  They looks unrelated though but this change updates Docker image so may be something sneaks in.  I will try to revert this to see if it helps and will reland the change after ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2000547641))
2024-03-15 22:05:21 +00:00
ca80d07ac7 Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri
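
For reference, a small sketch of the CPU-matching semantics this change brings to MPS:
```
import torch

x = torch.tensor([float("nan"), -2.0, 5.0])
# NaN should propagate through clamp rather than being clamped into [0, 1].
print(torch.clamp(x, min=0.0, max=1.0))   # tensor([nan, 0., 1.])
# With this fix, the same is expected for x.to("mps") on Apple-silicon machines.
```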

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-15 21:54:50 +00:00
26aaabb979 [c10d] initialize lastEnqueuedSeq_ and lastCompletedSeq_ (#121980)
Summary:
We found that these 2 uninitialized numbers were logged with some very
large or negative values, which is confusing, so we need to initialize
them. Now -1 indicates the number is invalid or that no work has been completed or
enqueued yet; 0 could be a legit seq id.
Test Plan:
Build

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121980
Approved by: https://github.com/xw285cornell, https://github.com/wconstab, https://github.com/kwen2501, https://github.com/XilunWu
2024-03-15 21:45:15 +00:00
dfc5e9325d format caffe2/torch/_export/serde/serialize.py (#121670)
Summary: black caffe2/torch/_export/serde/serialize.py

Test Plan: tests

Differential Revision: D54654847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121670
Approved by: https://github.com/angelayi
2024-03-15 21:30:16 +00:00
53d2188df9 Update get_aten_graph_module (#121937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937
Approved by: https://github.com/andrewor14
2024-03-15 20:35:55 +00:00
af86d67d61 [Doc][NVTX] Add documentation for nvtx.range (#121699)
The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663).
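
A minimal usage sketch of the context manager being documented (assuming a CUDA device and an external profiler such as Nsight Systems to actually display the range):
```
import torch

x = torch.randn(1024, 1024, device="cuda")
with torch.cuda.nvtx.range("my_matmul"):   # pushes/pops an NVTX range
    y = x @ x
torch.cuda.synchronize()
```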

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699
Approved by: https://github.com/janeyx99
2024-03-15 20:26:44 +00:00
b92daff6e9 [DTensor] Enable ASGD foreach optimizer and add the associated unit test (#121942)
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.

Note that we need to investigate why, when using ASGD, we need higher atol and rtol when comparing model parameters. Listing it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
2024-03-15 20:21:27 +00:00
f4dd2fda51 [DTensor] Supported 2D clip_grad_norm_ (#121945)
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.

Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
2024-03-15 20:11:24 +00:00
2c33e3a372 [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

Old-style `runs-on` definitions accept a string, while the new style requires an object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, specifying a group requires changing the format of the YAML. Unfortunately, there is no way to accomplish this with any trick in the book that I am aware of, because GitHub Actions YAML is not templatable and supports only minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-15 20:09:50 +00:00
6f4fa8e9a1 [inductor] FX graph cache: simplify "current callable" logic (#121903)
Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903
Approved by: https://github.com/eellison
2024-03-15 20:00:08 +00:00
d0d09f5977 Fix torch.compile links (#121824)
Fixes https://github.com/pytorch/pytorch.github.io/issues/1567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824
Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet
ghstack dependencies: #121823
2024-03-15 19:49:37 +00:00
8a5a377190 Move doc links to point to main (#121823)
The previous links were pointing to an outdated branch

Command: `find . -type f -exec sed -i "s:docs/main:docs/master:g" {} + `

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121823
Approved by: https://github.com/albanD, https://github.com/malfet
2024-03-15 19:49:37 +00:00
535bc71d03 Enable FX graph caching in another batch of inductor tests (#121697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121697
Approved by: https://github.com/eellison
2024-03-15 19:38:51 +00:00
3ee319c49c Fall back to eager mode when viewing with differing bitwidths (#120998) (#121786)
The inductor lowering code for viewing a tensor as a type with a different bitwidth currently doesn't generate valid triton code. This change compares the source and destination dtypes and, if their sizes differ, falls back to the eager-mode aten implementation. Prior to this change, this condition would throw an exception.

Fixes #120998.
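
A hedged repro sketch of the pattern that now falls back to eager instead of raising (a bit-width-changing `Tensor.view(dtype)` inside a compiled function):
```
import torch

@torch.compile
def reinterpret(x):
    return x.view(torch.int8)          # float32 -> int8: different bitwidths

out = reinterpret(torch.randn(4))
print(out.shape)                        # torch.Size([16]): 4 floats -> 16 bytes
```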

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121786
Approved by: https://github.com/peterbell10, https://github.com/bertmaher
2024-03-15 19:33:30 +00:00
409b1a6081 Add lowering for cummax, cummin (#120429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120429
Approved by: https://github.com/peterbell10
2024-03-15 19:04:38 +00:00
d04faf4531 [dynamo][compile-time] Remove preserve rng state per op (#121923)
We already have one globally - 02bb2180f4/torch/_dynamo/convert_frame.py (L477)

I don't think we need per op.

Saves ~2 seconds on this benchmark

~~~
def fn(x):
    for _ in range(10000):
        x = torch.ops.aten.sin(x)
    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121923
Approved by: https://github.com/jansel
2024-03-15 18:24:46 +00:00
67ec870234 Fix FakeTensorUpdater logic for updating fake tensors (#116168)
Fixes https://github.com/pytorch/pytorch/issues/114464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116168
Approved by: https://github.com/peterbell10
2024-03-15 18:22:24 +00:00
239d87af5e combine loops so fn_name correct in error message (#121601)
The error message shown when input aliasing is detected in `while_loop_func` may not have the correct `fn_name`, as it is set only in the previous for loop. This change merges the two loops so that `fn_name` has the correct value.

No Issue Number for this minor change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121601
Approved by: https://github.com/albanD
2024-03-15 17:14:56 +00:00
39fdde7f84 [release] Increase version 2.3.0->2.4.0 (#121974)
Branch cut for 2.3.0 completed hence advance main version to 2.4.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121974
Approved by: https://github.com/jeanschmidt
2024-03-15 17:09:33 +00:00
565d1e28ab update kineto submodule commit id (#121843)
Summary: Update kineto submodule commit id so that pytorch profiler can pick up kineto changes from https://github.com/pytorch/kineto/pull/880

Test Plan: CI

Differential Revision: D54828357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121843
Approved by: https://github.com/aaronenyeshi
2024-03-15 16:55:25 +00:00
3c3d7455a3 Disable inductor (default) and inductor (dynamic) by default in the perf run launcher (#121914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121914
Approved by: https://github.com/desertfire
2024-03-15 16:46:24 +00:00
ef25d83a62 [export] Add serialization support for tokens (#121552)
Differential Revision: [D54906766](https://our.internmc.facebook.com/intern/diff/D54906766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121552
Approved by: https://github.com/zhxchen17
2024-03-15 16:15:11 +00:00
014f91a9d9 [FSDP2] implement HSDP (#121569)
support HSDP in per-parameter sharding FSDP: https://github.com/pytorch/pytorch/issues/121023

HSDP is a hybrid of FSDP and DDP: reduce-scatter grads intra-node (FSDP), and all-reduce grads inter-node (DDP)

for unit test, we are testing 2 + 2 GPUs in single node: ``pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp``

allreduce overlaps with next reduce-scatter in profiler traces
<img width="886" alt="Screenshot 2024-03-14 at 3 02 52 PM" src="https://github.com/pytorch/pytorch/assets/134637289/98f1f2b5-c99d-4744-9938-10d0431487e5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121569
Approved by: https://github.com/awgu
2024-03-15 10:00:18 +00:00
4cbf963894 [BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)
Switch to use LF S3 bucket for pull on linux-jammy-py3_9-gcc and docs jobs. This is required to migrate to ARC and move to use LF resources.

Depends on https://github.com/pytorch/pytorch/pull/121907
Follow up issue https://github.com/pytorch/pytorch/issues/121919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121908
Approved by: https://github.com/malfet
2024-03-15 09:09:53 +00:00
2770e3addd Force upsample to be float32 (#121324)
Fixes #121072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121324
Approved by: https://github.com/ezyang
2024-03-15 07:50:45 +00:00
e25054b248 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved the compilation code into _compiled_autograd_impl, which frees stack-allocated objects (e.g. AutogradCompilerCall) before the compiled graph is called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
2024-03-15 07:12:38 +00:00
5a2b4fc8f0 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
2024-03-15 06:51:27 +00:00
fc33bbf827 better support set_default_dtype(torch.float16), update doc (#121730)
1. Fixes #121300
2. Previously, calling `torch.tensor([2j])` after `torch.set_default_dtype(torch.float16)` would cause a runtime error. This PR also fixes that and enables the test (see the sketch below).
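
A minimal sketch of the scenario from point 2 (the exact resulting dtype is an assumption on my part):
```
import torch

torch.set_default_dtype(torch.float16)
t = torch.tensor([2j])                  # previously raised a runtime error
print(t.dtype)                          # presumably the half-precision complex dtype
torch.set_default_dtype(torch.float32)  # restore the usual default
```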

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121730
Approved by: https://github.com/peterbell10
2024-03-15 06:48:42 +00:00
8fdd8125b6 [executorch hash update] update the pinned executorch hash (#121871)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121871
Approved by: https://github.com/pytorchbot
2024-03-15 05:25:36 +00:00
cyy
fb10e13000 [Clang-tidy header][24/N] Fix clang-tidy warnings on c10/cuda/*.{cpp,h} (#120781)
This PR begins to clean clang-tidy warnings of code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120781
Approved by: https://github.com/ezyang
2024-03-15 05:03:22 +00:00
e4fda049c2 DTensor: add comm tests to test_tp_examples (#121669)
This adds some basic comm tests to test_tp_examples. This validates that the expected distributed calls are being made for `test_transformer_training`.

Fixes #121649

Test plan:

```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121669
Approved by: https://github.com/wanchaol
2024-03-15 03:37:48 +00:00
02083f5452 [DCP][DSD] Add AdamW to distributed state dict unit tests (#121774)
Thanks @fegin for removing the fsdp root module check in DCP to unblock test updates. https://github.com/pytorch/pytorch/pull/121544

This PR adds "optimzer_class" as a kwarg for the subtests of the following tests to add AdamW as an option.

- test_fsdp
- test_compiled_fsdp
- test_fsdp2
- test_ddp
- test_fsdp_ddp
- test_cpu_offload_full_state_dict

In addition, we temporarily remove the two _verify_osd_by_load calls in _test_save_load, as state dict loading seems to affect parameters. Creating an issue https://github.com/pytorch/pytorch/issues/121186 to keep track.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121774
Approved by: https://github.com/Skylion007
ghstack dependencies: #121773
2024-03-15 03:33:33 +00:00
efbeefbb84 [executorch] Make trymerge force merges actually work with executorch (#121920)
This PR addresses an issue with the trymerge function for executorch, which currently uses Facebook CLA instead of Easy CLA. This bug has been patched in #121921. However, the patch is potentially controversial, and we still want to verify Facebook CLA if it exists. Therefore, this PR includes Facebook CLA in our set of mandatory checks.

Additionally, this PR removes Facebook CLA from one of the mocks. This change is necessary because the specific PR used for testing fails due to the presence of Facebook CLA in the mock.

## Testing:
We run `find_matching_merge_rule(pr = GitHubPR("pytorch", "executorch", 2326), skip_mandatory_checks=True, skip_internal_checks=True)` to check if things work

https://pastebin.com/HHSFp2Gw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121920
Approved by: https://github.com/huydhn
2024-03-15 03:21:44 +00:00
a623666066 [dynamo][compile-time] Make output_graph new_var linear (#121858)
Fixes https://github.com/pytorch/pytorch/issues/121679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121858
Approved by: https://github.com/jansel
2024-03-15 03:20:04 +00:00
3bc2bb6781 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside the loop. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing memory. If the working set of the loop body is large, the gap between register and memory accesses is significant.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // array accesses always go through memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Descriptions
Following aten to use `two pass reduction` with `omp parallel` for deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Result (avg speedup):
| suite |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| huggingface | 1.346  | 1.343 |
| timm | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-15 02:03:10 +00:00
0cd094a4fd Revert "[aoti] Fix compilation bug for buffer mutations (#121688)"
This reverts commit 9f314d4aa82169ee552ae2a8ad701bd0441a12b7.

Reverted https://github.com/pytorch/pytorch/pull/121688 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121688#issuecomment-1998740094))
2024-03-15 01:34:04 +00:00
01d7c948e2 Make torch/_inductor/comms.py recognize native funcol IRs as collective IRs (#118498)
### Summary

As title. After this PR, Inductor should recognize native funcol IRs as collectives wherever the existing funcol IRs are recognized as collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118498
Approved by: https://github.com/wanchaol
2024-03-15 01:24:36 +00:00
60ccf81490 [dynamo] Refactor update_block_stack into a seperate function (#121810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121810
Approved by: https://github.com/williamwen42
ghstack dependencies: #121790
2024-03-15 01:01:05 +00:00
1e9a7df8fe [dynamo] Compile time optimizations in tx.step() (#121790)
`python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
- Before: `symbolic_convert_overhead_stress_test: 10.7s`
- After: `symbolic_convert_overhead_stress_test: 8.6s`

`tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790
Approved by: https://github.com/oulgen
2024-03-15 01:01:05 +00:00
1afa8e0985 Fix #83153: torch.nn.hardtanh allowed min_val to be greater than max_val (#121627)
Fixes #83153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121627
Approved by: https://github.com/albanD
2024-03-15 00:57:45 +00:00
710446b1eb [dtensor] refactor and generalize stack strategy (#121869)
This PR rewrites the stack strategy to be more generalized. Basically, the
follow pattern for stack/cat-like strategies needs to be smarter, i.e. it
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS

So this PR refactors how the follow strategy works, and makes sure
we start following the strategy that incurs the lowest cost, i.e. for
multiple PR, RP placements, we should be able to further delay the
pending sum reductions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
2024-03-15 00:34:25 +00:00
92ed8553a6 Revert "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864)
This reverts commit 9373ad0bb87b364375a468c296d2daef0e8817d7.

Revert "Add Cudagraphs disable checking (#121018)"

This reverts commit 4af0e634bf02309583dfe3b5c3421442fda5ec7e.

Causes compilation time increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864
Approved by: https://github.com/eellison
2024-03-15 00:03:09 +00:00
d604ab81a2 [PyTorch] Fix static runtime sigrid_hash precomputed multiplier pass (#120851)
This pass was broken.

Differential Revision: [D54336561](https://our.internmc.facebook.com/intern/diff/D54336561/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54336561/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120851
Approved by: https://github.com/houseroad
2024-03-15 00:02:38 +00:00
cceabe873f [jit] ClassType hashing: hash on compilation_unit as well (#121928)
Following up on #121874 - it turns out that in our case, we're seeing repeated class names that are from different compilation units.  Our previous hash function wasn't considering the compilation unit, leading to hash collisions (and then exponential memory usage in the number of copies of this class name)

Differential Revision: [D54916455](https://our.internmc.facebook.com/intern/diff/D54916455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121928
Approved by: https://github.com/eellison
ghstack dependencies: #121874
2024-03-14 23:16:08 +00:00
2d9cee20a2 [jit] AliasDB type hash - don't always return 0 (#121874)
This hash was missing an assignment, so for almost all types it was returning "0".

c10::flat_hash_map turns out to have really bad behavior with a terrible hash like this, nearly exponential in memory usage.

Differential Revision: [D54916424](https://our.internmc.facebook.com/intern/diff/D54916424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121874
Approved by: https://github.com/eellison
2024-03-14 23:16:08 +00:00
57b20c51b9 Don't record autograd state ops while torch.compile in pre-dispatch export (#121736)
Summary: Refer to OSS PR for details

Test Plan: CI

Differential Revision: D54812833

In pre-dispatch export, we have a special proxy torch mode where we intercept the torch._C._set_grad_enabled op to correctly capture the user's intention on train/eval. However, this is a bit problematic when we trace torch.cond during export, as it calls torch.compile internally. As a result, we end up capturing unwanted autograd context manager calls that happen inside dynamo framework code because the top-level tracer is still active. We fix it by turning off this proxy torch mode. We can still capture autograd ops inside cond branches because dynamo will translate them into a HOP for us, so we don't have to intercept with the special proxy mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121736
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
2024-03-14 23:06:10 +00:00
bd7beef529 [Inductor] Update the cpp_wrapper entry function signature (#121745)
Summary: Update the entry function to use AtenTensorHandle instead of at::Tensor. This makes the compilation of the generated cpp wrapper code much faster: test_cpu_cpp_wrapper.py from 35 min to 21 min, and test_cuda_cpp_wrapper.py from 21 min to 14 min.

Differential Revision: [D54818715](https://our.internmc.facebook.com/intern/diff/D54818715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121745
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #121523, #121743, #121744
2024-03-14 22:23:00 +00:00
8be80706b4 [AOTI] Add pybind for tensor_converter util functions (#121744)
Differential Revision: [D54818716](https://our.internmc.facebook.com/intern/diff/D54818716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121744
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523, #121743
2024-03-14 22:20:51 +00:00
46493ee9b5 [AOTI][refactor] Update tensor_converter util functions (#121743)
Summary: Update the signatures of unsafe_alloc_new_handles_from_tensors and alloc_tensors_by_stealing_from_handles. This is a preparation step towards adding pybind for these two functions, which will be used by cpp_wrapper JIT Inductor.

Differential Revision: [D54818717](https://our.internmc.facebook.com/intern/diff/D54818717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121743
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523
2024-03-14 22:17:54 +00:00
3df1b3b0ad [jit] support getattr/hasattr on NamedTuple (#121863)
getattr is already supported on objects and, for the most part, on NamedTuples as well. The only remaining gap seems to be that hasattr only accepted objects, not NamedTuples. This PR adds support and some basic tests.
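
A small sketch of the newly accepted pattern (illustrative only, not the PR's actual test):
```
from typing import NamedTuple

import torch

class Point(NamedTuple):
    x: float
    y: float

@torch.jit.script
def read_x(p: Point) -> float:
    if hasattr(p, "x"):                 # hasattr previously only accepted objects
        return p.x
    return 0.0

print(read_x(Point(1.0, 2.0)))          # 1.0
```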

Differential Revision: [D54888612](https://our.internmc.facebook.com/intern/diff/D54888612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121863
Approved by: https://github.com/eellison
2024-03-14 22:07:28 +00:00
818b14025a [AOTI][refactor] Remove is_legacy_abi_kernel and abi_compatible_kernel (#121523)
Summary: is_legacy_abi_kernel was used for _scaled_dot_product_flash_attention fallback. It is only needed for C shim kernel name matching now, and the name matching is done with a direct string comparison. Also consolidate the fallback cpp kernel naming logic in CppWrapperCpu.

Differential Revision: [D54727789](https://our.internmc.facebook.com/intern/diff/D54727789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121523
Approved by: https://github.com/chenyang78
2024-03-14 22:05:38 +00:00
43e243180b Add gpt-fast as a static benchmark (#121886)
Run:
```
python benchmarks/gpt_fast/benchmark.py
```
It generated a cvs file ```gpt_fast_benchmark.csv``` with the content like:
```
name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48%
Llama-2-7b-chat-hf,int8,155,158.964615,102.56%
Mixtral-8x7B-v0.1,int8,97,99.760132,102.85%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121886
Approved by: https://github.com/Chillee
2024-03-14 21:46:59 +00:00
0e68eb1505 Add privateuseone flags for c10::EventFlag (#121118)
Fixes #117341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121118
Approved by: https://github.com/albanD
2024-03-14 20:07:12 +00:00
9f314d4aa8 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
However, there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78
2024-03-14 19:35:26 +00:00
0636c11811 [AOTInductor] Include build cmds at the end of wrapper file (#121872)
Summary:
For easier debugging, include build commands at the end of codegen wrapper.

{F1468438991}

Test Plan: CI

Differential Revision: D54882164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121872
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-03-14 18:41:17 +00:00
c409292197 [sigmoid] Use deserializer from oss. (#121839)
Summary:
Old path:
thrift -> thrift deserializer -> graph module.
new path:
thrift -> python dataclass -> oss deserializer -> graph_module

Test Plan:
CI
buck2 test mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Reviewed By: SherlockNoMad

Differential Revision: D54855251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121839
Approved by: https://github.com/angelayi
2024-03-14 18:38:58 +00:00
499136a4dd [Inductor] Fix a dynamic shape problem when lowering diagonal (#121881)
Summary: When computing the diagonal size, we need to use the correct symbolic min/max functions.

Differential Revision: [D54884899](https://our.internmc.facebook.com/intern/diff/D54884899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121881
Approved by: https://github.com/aakhundov
2024-03-14 18:36:37 +00:00
5b1642516f [with_effects] Skip over profiler.record_function_exit (#121829)
Summary:
tldr: User calls to `torch.autograd.profiler.record_function` fail when tracing with non-strict pre-dispatch export due to an effect token failure, so the solution is to skip over these operators 😅

Some user code contains calls to a `torch.autograd.profiler.record_function` context, like https://fburl.com/code/uesgknbq and https://fburl.com/code/iogbnsfw, which is used for adding user-defined events into the profiler.

Currently these function calls will be skipped/removed in dynamo (https://fburl.com/code/fkf7qmai) but **non-strict pre-dispatch export** will hit these operators during tracing. However, it seems that although these operators get hit by the dispatcher, they don't actually show up in the final graph (maybe they get DCE-d).

However, an issue comes up with a recent change with effect tokens (D54639390) which creates tokens if it sees a ScriptObject during tracing. The operator `torch.ops.profiler.record_function_exit` takes in a ScriptObject, so the effect tokens framework now tries to add an effect token to this operator, but results in the following error: (https://www.internalfb.com/intern/everpaste/?handle=GI-hvBknzj2ZxYkBABNzdztDxJVAbsIXAAAB, P1195258619)

The reason is that this operator only gets hit during pre-dispatch, not post-dispatch tracing. During pre-dispatch tracing, we first trace using post-dispatch to collect metadata needed for functionalization, and then we do pre-dispatch tracing to construct the graph. The metadata collection phase is also when we determine which operators need effect tokens and create those tokens. However, since the operator only shows up in pre-dispatch tracing, we do not create any tokens. During the actual pre-dispatch tracing to create the graph, we then run into this operator and try to get a token, but none exists, causing an error :(

This PR just blocks the record_function operator from being looked at by the effect tokens framework. But a proper fix might be to have functionalization run on the pre-dispatch graph or have the operator also show up in the post-dispatch graph. But since in the PT2 stack dynamo just gets rid of this operator so that it won't show up anywhere downstream, I think we can also just ignore this operator.
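
For context, a sketch of the kind of user code involved (the export call and the `strict=False` flag are illustrative assumptions):
```
import torch

class M(torch.nn.Module):
    def forward(self, x):
        with torch.autograd.profiler.record_function("my_block"):
            return x.sin() + 1

# Non-strict export hits the record_function enter/exit ops during tracing;
# with this change the effect-token logic simply skips them.
ep = torch.export.export(M(), (torch.randn(4),), strict=False)
print(ep)
```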

Test Plan: Fixed test for P1195258619

Differential Revision: D54857444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121829
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
2024-03-14 18:09:43 +00:00
f1f7c5c31e [ez] Document for add_var_to_val (#121850)
Summary: Add doc for ShapeEnv.add_var_to_val

Test Plan: doc only change

Reviewed By: izaitsevfb

Differential Revision: D54872335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121850
Approved by: https://github.com/izaitsevfb
2024-03-14 18:01:09 +00:00
4c3a052acf [BE] Add S3 bucket argument to number of workflows (#121907)
Namely, it adds the `s3-bucket` argument to the following workflows (with the default value set to `gha-artifacts`):
- _docs
- _linux-test workflows
- download-build-artifacts
- pytest-cache-download
- upload-test-artifacts

This prerequisite is required in order to start migrating to other S3 buckets for asset storage; it is one of the required steps to migrate to ARC and move our assets away from our S3 to the Linux Foundation S3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907
Approved by: https://github.com/malfet
2024-03-14 17:57:05 +00:00
38d7d366b9 [FSDP2] Added 2D DCP save/load test (#121747)
To prepare for FSDP2 + TP/SP in torchtrain, we should verify that we can resume training correctly with DCP save/load. For loading into a new model/optimizer instance, torchtrain uses lightweight `ModelWrapper` and `OptimizerWrapper`. In the added unit test, we use `get_optimizer_state_dict` directly to show the minimal requirement for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121747
Approved by: https://github.com/wz337
2024-03-14 17:24:17 +00:00
443444dc7f [c10d] Add generic scuba logging capability into c10d (#121859)
Summary:
This diff periodically (e.g., every 30s) logs critical collective
progress status to a Scuba table, starting with a few metrics such as the last
enqueued seq id.

With the Scuba table, we hope to easily detect the straggler of a PG,
e.g., a rank that has not progressed its seq_ for X seconds while other ranks in the same PG have a larger seq_.

The implementation needs to make sure that Scuba will be used only for FB internal use
cases.

For OSS, we still provide a generic logger data struct and logger that can be
easily extended. If users do not register the logger, nothing will be logged.

Test Plan:
Re-use the existing unit test for the fb side of operations, such as
test_register_and_dump in test_c10d_manifold, change the dump period to a
very small number, e.g., 1ms, and verify that the logs are correctly shown in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy

Reviewed By: wconstab

Differential Revision: D54556219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
2024-03-14 16:03:45 +00:00
83f8e51404 Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning (#119986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119986
Approved by: https://github.com/kadeng
ghstack dependencies: #119685
2024-03-14 16:03:10 +00:00
be0bdf111c relax tol for flaky nansum_out_dtype_cuda_float32 test (#121550)
TestReductionsCUDA.test_nansum_out_dtype_cuda_float32 would fail or pass depending on the random inputs. This was observed by ROCm internal QA testing, but the same problematic random inputs break the test on CUDA as well, verified on a V100.

There is precedent in another test within the same file to relax tolerance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121550
Approved by: https://github.com/albanD
2024-03-14 15:28:45 +00:00
7e13b5ba29 Checkout release branch rather then commit_hash when building triton release (#115379) (#121901)
Cherry pick of https://github.com/pytorch/pytorch/pull/115379 from Release 2.2 that should be applied to main and Release 2.3 as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121901
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt
2024-03-14 14:42:29 +00:00
956059fa2e [Fix] Fixed behaviour for the conversion of complex tensors to bool (#121803)
Fixes #120875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121803
Approved by: https://github.com/lezcano
2024-03-14 13:35:15 +00:00
1251f0fa31 Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch, https://github.com/kadeng
2024-03-14 13:25:23 +00:00
38d9bb5abc Make PyTorch compilable against upcoming Numpy-2.0 (#121880)
Test plan:
```
% python -c "import torch;import numpy;print(numpy.__version__, torch.tensor(numpy.arange(3, 10)))"
2.1.0.dev0+git20240312.9de8a80 tensor([3, 4, 5, 6, 7, 8, 9])
% python -c "import torch;print(torch.rand(3, 3).numpy())"
[[0.0931946  0.44874293 0.8480404 ]
 [0.93877375 0.10188377 0.67375803]
 [0.02520031 0.89019287 0.5691561 ]]

```
Fixes https://github.com/pytorch/pytorch/issues/121798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121880
Approved by: https://github.com/albanD
2024-03-14 05:36:50 +00:00
b4c53aa0ec Do not compile FP16 arith internally (#121844)
Also, decorate unused args with `C10_UNUSED` to fix linter warnings
Test Plan: `buck2 build -c fbcode.arch=aarch64  //caffe2:ATen-cpu`

Differential Revision: D54870507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121844
Approved by: https://github.com/osalpekar
2024-03-14 05:19:02 +00:00
3eb322ff29 Handle transitive replacements in Triton kernel mutation analysis (#121867)
Summary: Previously, we didn't handle transitive replacements in MLIR walk-based function info mining in the Triton kernel mutation analysis pass. As a result, for the TTIR below:

```
tt.func private @cumsum__fp32S1_16S__1cconstexpr_1__2cconstexpr_False_(%arg0: tensor<1x16xf32> loc("...":296:0)) -> tensor<1x16xf32> attributes {noinline = false} {
    %0 = "tt.scan"(%arg0) <{axis = 1 : i32, reverse = false}> ({
    ^bb0(%arg1: f32 loc(unknown), %arg2: f32 loc(unknown)):
      %1 = tt.call @_sum_combine__fp32_fp32__(%arg1, %arg2) : (f32, f32) -> f32 loc(#loc16)
      tt.scan.return %1 : f32 loc(#loc16)
    }) : (tensor<1x16xf32>) -> tensor<1x16xf32> loc(#loc16)
    tt.return %0 : tensor<1x16xf32> loc(#loc18)
  } loc(#loc15)
```

the mined function dict looked like this:

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Intermediate(idx=26),
                                 Intermediate(idx=26)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

whereas it should look like this (note the `Param(idx=0)` arguments of the `tt.call`):

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Param(idx=0),
                                 Param(idx=0)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

This is fixed in the PR.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_cumsum
.
----------------------------------------------------------------------
Ran 1 test in 1.771s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121867
Approved by: https://github.com/oulgen
2024-03-14 04:06:37 +00:00
4cd503c1f3 Enable FX graph cache for a batch of inductor tests (#121696)
Summary: Get more FX graph cache coverage by enabling it for these unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696
Approved by: https://github.com/eellison
2024-03-14 03:39:59 +00:00
15abc56bd5 Graph break on step closure in optimizer (#121777)
Fixes https://github.com/pytorch/pytorch/issues/116494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121777
Approved by: https://github.com/yanboliang
2024-03-14 03:18:23 +00:00
f85f58bf86 Fix quantized linear vulkan tests (#120960)
Summary: Fixed quantized linear Vulkan tests by using an old pack_biases function.

Test Plan:
**Vulkan quantized api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

...
...
...
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (5 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (4 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (2 ms)
...
...
[----------] 85 tests from VulkanAPITest (1704 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1704 ms total)
[  PASSED  ] 85 tests.

  YOU HAVE 8 DISABLED TESTS

**Vulkan api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

[----------] Global test environment tear-down
[==========] 426 tests from 1 test suite ran. (4997 ms total)
[  PASSED  ] 423 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] VulkanAPITest.log_softmax_underflow
[  FAILED  ] VulkanAPITest.log_softmax

Differential Revision: D54396367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120960
Approved by: https://github.com/yipjustin
2024-03-14 02:23:00 +00:00
a37caa6ed3 [Quant][Inductor] Enable quantization linear pattern fusion with int8_mixed_bf16 for gelu (#116004)
**Summary**
Enable the QLinear unary pattern for gelu with int8_mixed_bf16

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_int8_mixed_bf16

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116004
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #114853, #114854
2024-03-14 01:52:12 +00:00
43d68e9c8f [Quant][Inductor] Enable quantization linear pattern fusion for gelu inside inductor (#114854)
**Summary**
Enable QLinear Unary pattern for gelu with int8

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_cpu

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114854
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #114853
2024-03-14 01:49:14 +00:00
25e00545bb [Quant][PT2E] Enable linear and linear-unary post-op gelu quant recipe for x86 inductor quantizer (#114853)
**Summary**
Add Gelu for linear-unary post-op quantization recipe to x86 inductor quantizer.

**Test plan**
python -m pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_unary_gelu
python test/test_quantization.py -k test_linear_unary_with_quantizer_api
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114853
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2024-03-14 01:46:35 +00:00
a04e7fca8e Use memcache versioning for autotune remote cache (#121748)
Summary: The internal training platform doesn't get updated very frequently, so let's use versioning for memcache.

Test Plan: existing tests

Differential Revision: D54818197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121748
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-03-14 00:36:10 +00:00
7e076c75bd [C10D] Fix coalescedCollective op Flight Recording (#120430)
Also noticed and filed https://github.com/pytorch/pytorch/issues/120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120430
Approved by: https://github.com/kwen2501
2024-03-13 23:55:00 +00:00
bf7ac4ddf7 Revert "[export] allow Dim(1,2) for export dynamic shapes (#121642)"
This reverts commit a8dcbf2749f2081f939621db2d38fd15ab7e34a3.

Reverted https://github.com/pytorch/pytorch/pull/121642 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121642#issuecomment-1996121710))
2024-03-13 23:51:20 +00:00
3e02a7efcd Only FA2 doesn't support attn-mask (#121825)
Fixes #121783

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121825
Approved by: https://github.com/cpuhrsch
2024-03-13 23:03:39 +00:00
a8dcbf2749 [export] allow Dim(1,2) for export dynamic shapes (#121642)
Current dynamic shapes implementation fixes lower range of Dims to be 2 for analysis, but allows 0/1 shapes during runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0, and avoids erroring out when conflicting with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121642
Approved by: https://github.com/avikchaudhuri
2024-03-13 22:59:07 +00:00
70c6f542f2 Revert "[dynamo] Convert invalid args into graph breaks (#121784)"
This reverts commit 0df39480f6a74c9094555e8a61a8c8bb01716d4e.

Reverted https://github.com/pytorch/pytorch/pull/121784 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks ONNX test in trunk 0c1ac4484d ([comment](https://github.com/pytorch/pytorch/pull/121784#issuecomment-1995979435))
2024-03-13 22:12:43 +00:00
aaff8d274a CUDA fast path for _chunk_cat() (#120678)
This PR provides CUDA fast path implementation for ATen Op `_chunk_cat` (#121081).

Performance on a production benchmark:

- Float16 in, Float16 out: 249 -> 500
- BFloat16 in, BFloat16 out: 248 -> 500
- BFloat16 in, Float32 out: 126 -> 278
- Float32 in, Float32 out: 153 -> 260
- Float64 in, Float64 out: 79 -> 132
- int8 in, int8 out: 332 -> 908
- int16 in, int16 out: 250 -> 489
- int32 in, int32 out: 153 -> 260
- int64 in, int64 out: 79 -> 132

Unit: Billion elements per second. Hardware: H100. Baseline: [Existing FSDP implementation](7b3febdca7/torch/distributed/_composable/fsdp/_fsdp_collectives.py (L176))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120678
Approved by: https://github.com/yifuwang
2024-03-13 22:02:06 +00:00
c53e3f57b5 allow fp16 in quant/dequant decompositions (#121738)
Test Plan:
```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp16 --pt2e_quantize "xnnpack_dynamic" -2
```

Reviewed By: kirklandsign

Differential Revision: D54785950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121738
Approved by: https://github.com/jerryzh168
2024-03-13 21:45:08 +00:00
c7193f4099 [DDP][PT2D][2D] Enable DDP + TP and add test for compiled DDP + TP (#120479)
This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict.

This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce.

**TODOs**

- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [x] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of all_reduces.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479
Approved by: https://github.com/wz337
ghstack dependencies: #113209
2024-03-13 21:41:22 +00:00
dd568f4207 [Export, AOTInductor] Populate ShapeEnv's var_to_val during deserialization (#121759)
Summary:
Deserialization didn't populate ShapeEnv's `var_to_val` field properly, and AOTInductor is relying on this field to compile dynamic shape properly.
As a result, when AOTI failed at compiling a deserialized ExportedProgram.

Test Plan: buck2 test  mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Differential Revision: D54559494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121759
Approved by: https://github.com/avikchaudhuri
2024-03-13 21:28:25 +00:00
a2a4693c1b Revert "Init CUDA instead of faking memory stats (#121698)"
This reverts commit 2460f0b1c7bb6e088aca1f6e9bb62c834053d71b.

Reverted https://github.com/pytorch/pytorch/pull/121698 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
45a835cef2 Revert "[compiled autograd] free stack objects before calling compiled graph (#121707)"
This reverts commit 5b90074540577267c29f5f784be123ee54f6491d.

Reverted https://github.com/pytorch/pytorch/pull/121707 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
8b1b61bc70 [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly implement additional compiled_args + apply_with_saved methods. This was the only way I could think of to guarantee soundness.
- will throw if we can't hash the saved_data, i.e. for any type other than list and dict that is not implemented in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically fail silently if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph is called that contains a different custom autograd::Function with an identical implementation. This case seems extremely unlikely, and the only alternative to hashing I can think of is compiling with reflection.
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Differential Revision: [D54818488](https://our.internmc.facebook.com/intern/diff/D54818488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-13 21:13:21 +00:00
58ff55aac5 Add support for tt.scan to triton kernel mutation analysis (#121828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121828
Approved by: https://github.com/aakhundov, https://github.com/Skylion007
2024-03-13 20:37:56 +00:00
8e6d572b4e [DDP][PT2D] Allreduce fusion fx pass using concat and all_reduce_coalesced (#113209)
Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/)

**TL;DR**
This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with a concat op and 2) fusion with all_reduce_coalesced. When DDP detects that the Python reducer is being used, DDP will automatically turn on the fusion.

This PR does not invent any algorithm and simply reflects the bucket size users set to DDP.

**Implementation Details**
*Fusion with concat op*
The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping are also performed to recover the individual gradients.

Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we can perform the gradient scaling once over the concatenated buffer.
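A minimal standalone sketch of this concat-based fusion (illustrative eager-mode code using plain collectives, not the actual Inductor post_grad implementation):

```python
import torch
import torch.distributed as dist


def fused_allreduce_grads(grads, world_size):
    # concatenate all flattened gradients into one buffer and issue a single all_reduce
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat)
    # gradient scaling can be done once over the concatenated buffer
    flat.div_(world_size)
    # split and reshape back into the individual gradients
    outs = torch.split(flat, [g.numel() for g in grads])
    return [o.reshape(g.shape) for o, g in zip(outs, grads)]
```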

*Fusion with `all_reduce_coalesced`*
The idea of this fusion is to use the `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoids the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we cannot do one simple gradient scaling but have to rely on `foreach_div` for the gradient scaling.

**Limitations**
Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. The support of multiple DDP instances with different PG and data type will come in the later PRs.

**TODOs**
- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [ ] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of `all_reduce`s.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209
Approved by: https://github.com/yf225
2024-03-13 20:37:09 +00:00
0c1ac4484d Support call_method in DDPOptimizer (#121771)
This PR fixes Issue #111279.

While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be:
```python
class ToyModel(nn.Module):
    def __init__(self,):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear.forward(x) # Error
        # return self.linear(x) # OK
```

Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes.

This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771
Approved by: https://github.com/yf225
2024-03-13 20:03:15 +00:00
0df39480f6 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
ghstack dependencies: #121615, #121616
2024-03-13 20:02:33 +00:00
5b90074540 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved the compilation code into _compiled_autograd_impl, which frees stack-allocated objects (e.g. AutogradCompilerCall) before the compiled graph is called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
ghstack dependencies: #121698
2024-03-13 19:31:44 +00:00
2460f0b1c7 Init CUDA instead of faking memory stats (#121698)
This is very confusing when checking memory usage while allocations are only happening via the C API. We should change it to a warning/error or just init CUDA. Codepaths that run in non-CUDA environments shouldn't call into these functions in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698
Approved by: https://github.com/jansel
2024-03-13 19:31:44 +00:00
cd949d133e Support setUpClass & tearDownClass with instantiate_device_type_tests() (#121686)
Summary: instantiate_device_type_tests() creates dynamic test case classes that derive from a "template class". By default, the test harness will call the setUpClass() and tearDownClass() methods defined by the template class (if the template class defines them). We can explicitly create these methods in the dynamic class and arrange to call those methods in both base classes. That allows us to support setUpClass & tearDownClass test classes used with instantiate_device_type_tests().
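A minimal sketch of the pattern this enables (the template class and its test body are hypothetical):

```python
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class TestSharedSetupTemplate(TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.shared_resource = object()  # expensive per-class setup

    @classmethod
    def tearDownClass(cls):
        cls.shared_resource = None
        super().tearDownClass()

    def test_uses_shared_resource(self, device):
        self.assertIsNotNone(self.shared_resource)


# generates per-device classes (CPU, CUDA, ...) that now also invoke the
# template's setUpClass/tearDownClass
instantiate_device_type_tests(TestSharedSetupTemplate, globals())

if __name__ == "__main__":
    run_tests()
```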
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121686
Approved by: https://github.com/ezyang, https://github.com/eellison
2024-03-13 18:28:42 +00:00
ffabb25c48 Count the number of entries directly in avg_pool2d lowering (#121429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121429
Approved by: https://github.com/peterbell10
ghstack dependencies: #116085
2024-03-13 18:19:47 +00:00
a19a05fd1d Add lowering for avg_pool{1, 3}d (#116085)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116085
Approved by: https://github.com/peterbell10
2024-03-13 18:19:47 +00:00
79fac48bb3 Use pytorch bot's labeler (#121762)
Change corresponds to https://github.com/pytorch/test-infra/pull/4995
Testing (very light) in https://github.com/malfet/deleteme/pull/81
Should help with https://github.com/pytorch/test-infra/issues/4950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121762
Approved by: https://github.com/huydhn
2024-03-13 17:16:49 +00:00
05df03ec1b Allow custom attributes for torch function subclasses (#121693)
Added custom attribute access with test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121693
Approved by: https://github.com/anijain2305
2024-03-13 17:01:57 +00:00
92a2b214f8 Make translation validation more user friendly (#120880)
Two main changes:

- Don't rethrow the exception when we fail in TV, just throw the entire
  thing and trust the user will inspect logs / backtrace to see we
  failed in TV

- Don't add an event to the TV logs until we've confirmed that the event
  actually runs without erroring.  This prevents us from putting events
  that e.g., fail because the guard on data dependent size, and the
  failing in TV.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120880
Approved by: https://github.com/lezcano, https://github.com/ysiraichi
2024-03-13 15:21:59 +00:00
b1d5998956 Upgrade to tlparse 0.3.7 (#121772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121772
Approved by: https://github.com/Skylion007
2024-03-13 15:21:20 +00:00
5498804ec2 [MPS] Fix naive matmul for BFloat16 (#121731)
Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD
2024-03-13 14:34:03 +00:00
559ca13b3f [dynamo] Refactor TorchInGraphFunctionVariable for compile time (#121616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121616
Approved by: https://github.com/oulgen
ghstack dependencies: #121615
2024-03-13 14:21:21 +00:00
51cf57c6c6 Revert "Include torch warn in each error in cudnn/Conv_v8.cpp (#120719)"
This reverts commit 5fd7f5c4e336c2c3041e10529990c620cc8cf9a5.

Reverted https://github.com/pytorch/pytorch/pull/120719 on behalf of https://github.com/janeyx99 due to sorry but am reverting as this prints unwanted warnings even when an exception is not thrown  ([comment](https://github.com/pytorch/pytorch/pull/120719#issuecomment-1994491826))
2024-03-13 14:09:38 +00:00
a157a0d00d [constraints] Fix scalar type for constraint_range to Long (#121752)
Differential Revision: [D54822125](https://our.internmc.facebook.com/intern/diff/D54822125)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121752
Approved by: https://github.com/ezyang
2024-03-13 11:11:09 +00:00
7fe0cc53e9 make _process_dynamic_shapes an implementation detail (#121713)
Summary: `_process_dynamic_shapes` converts new dynamic shapes to old constraints, but in the future may not need to do so. Preparing for that future.

Test Plan: CI

Differential Revision: D54780374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121713
Approved by: https://github.com/tugsbayasgalan
2024-03-13 08:33:00 +00:00
5088e4956e Add quantized conv transpose2d op (#120151)
Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[  PASSED  ] 417 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

Run quantized vulkan api test: Note that the quantized linear tests are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[  PASSED  ] 77 tests.
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large

 9 FAILED TESTS
  YOU HAVE 8 DISABLED TESTS

Differential Revision: D52344261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120151
Approved by: https://github.com/yipjustin
2024-03-13 08:09:57 +00:00
e99fa0042c Back out "[DeviceMesh] Add support for nD slicing (#119752)" (#121763)
Summary:
Original commit changeset: e52b8809c8d8

Original Phabricator Diff: D54778906

We have to back out this diff.
D54778906 seems to be causing test failures for APF, blocking trunk health and hence the release. We are just starting to look at the issue. T182209248

Test Plan: Sandcastle

Reviewed By: satgera

Differential Revision: D54825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
2024-03-13 07:22:08 +00:00
be33d31ae2 add std::ostream& operator<< for BFloat16 in BFloat16.h (#121302)
This PR moves `operator<<` of `BFloat16` to `BFloat16.h`.

Previously, this function lived in `TensorDataContainer.h`. If you need to `std::cout` a `BFloat16` variable when debugging, `TensorDataContainer.h` has to be included. This is inconvenient and counterintuitive.

Other dtypes such as `Half` define their `operator<<` in the headers where they are defined, such as `Half.h`. Therefore, it makes more sense to move `operator<<` of `BFloat16` to `BFloat16.h`.
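A minimal sketch of the usage this enables (assuming the usual `c10/util` include path; previously `TensorDataContainer.h` would have had to be included):

```cpp
#include <c10/util/BFloat16.h>
#include <iostream>

int main() {
  c10::BFloat16 v(1.5f);
  // operator<< now lives next to the type, so plain iostream printing works
  std::cout << "v = " << v << std::endl;
  return 0;
}
```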

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121302
Approved by: https://github.com/ezyang
2024-03-13 06:47:34 +00:00
5986552ebe [nit][DCP][DSD] Remove variables not being used in test_state_dict.py #121204 (#121773)
Replacing https://github.com/pytorch/pytorch/pull/121204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121773
Approved by: https://github.com/Skylion007
2024-03-13 06:35:04 +00:00
da2a9a0512 _foreach_copy with different src/dst dtypes (#121717)
Fixes #115171

```
torch.version.git_version = '6bff6372a922fe72be5335c6844c10e2687b967d', torch.cuda.get_device_name() = 'NVIDIA RTX 6000 Ada Generation'
[------------------ foreach copy - self: torch.float32 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          14.2        |          12.6        |           12.7
      num_tensors: 256   |         688.0        |         510.3        |          514.0
      num_tensors: 1024  |        2768.0        |        2053.3        |         2047.7

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.9        |            8.8
      num_tensors: 256   |         497.6        |         344.3        |          348.3
      num_tensors: 1024  |        1991.9        |        1392.0        |         1389.0

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.8        |            8.8
      num_tensors: 256   |         497.5        |         344.5        |          348.0
      num_tensors: 1024  |        1993.2        |        1390.4        |         1387.5

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float32 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          19.0        |          17.9        |           18.1
      num_tensors: 256   |         707.2        |         540.2        |          543.1
      num_tensors: 1024  |        2900.6        |        2156.6        |         2159.2

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.8        |          13.7        |           13.1
      num_tensors: 256   |         513.2        |         352.6        |          350.4
      num_tensors: 1024  |        2047.6        |        1404.4        |         1400.4

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.6        |          12.8        |           14.2
      num_tensors: 256   |         511.9        |         351.8        |          350.6
      num_tensors: 1024  |        2045.4        |        1402.2        |         1401.4

Times are in microseconds (us).

```
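A minimal usage sketch matching the benchmark shapes above (the exact signature of this private op is an assumption):

```python
import torch

dsts = [torch.empty(512, 512, dtype=torch.bfloat16, device="cuda") for _ in range(32)]
srcs = [torch.randn(512, 512, dtype=torch.float32, device="cuda") for _ in range(32)]
# each destination is overwritten with the corresponding source cast to the destination dtype
torch._foreach_copy_(dsts, srcs)
```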
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121717
Approved by: https://github.com/janeyx99
2024-03-13 05:42:28 +00:00
a13dd92d88 [dynamo] Minor compile time optimizations in torch.py (#121615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121615
Approved by: https://github.com/oulgen
2024-03-13 05:36:22 +00:00
d619be57c0 [executorch hash update] update the pinned executorch hash (#121056)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121056
Approved by: https://github.com/pytorchbot
2024-03-13 04:54:16 +00:00
0c1d59b72f CI: Fix flaky artifact upload step (#121733)
This PR changes the upload artifact step of the wheels and conda build to write
each matrix entry to a different file. This is because updating the same file
from multiple jobs can be flaky as is warned in the docs for upload-artifact

> Warning: Be careful when uploading to the same artifact via multiple jobs as artifacts may become corrupted. When uploading a file with an identical name and path in multiple jobs, uploads may fail with 503 errors due to conflicting uploads happening at the same time. Ensure uploads to identical locations to not interfere with each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121733
Approved by: https://github.com/huydhn
ghstack dependencies: #121268
2024-03-13 04:42:52 +00:00
52ed35bb64 [inductor] Update triton pin (#121268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121268
Approved by: https://github.com/oulgen, https://github.com/malfet
2024-03-13 04:42:52 +00:00
07330ff7b6 [MPS][BE] Define _compute_tolerances (#121754)
Right now logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`
So move tolerance definition logic into a shared `_compute_tolerances` function and
only keep differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.

Also, increase tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove GRAD xfaillist for those

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
2024-03-13 04:08:06 +00:00
f83392b677 cublasLt workspace warning info is misleading, the unit of measuremen… (#121073)
The cublasLt workspace warning message is misleading: the unit of measurement should be KiB instead of bytes.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121073
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-03-13 03:37:40 +00:00
e755dab0d1 [ROCm] Enable several test_unary_ufuncs UTs on ROCm (#121104)
Enabled:
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atanh_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atanh_cuda_complex128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121104
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2024-03-13 03:34:22 +00:00
f24ae66abf [AOTInductor] Skip tests on RoCM for duplicate_constant_folding (#121750)
Summary: Skip AMD tests for duplicated kernels in constant folding

Test Plan: Diff is test

Differential Revision: D54820804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121750
Approved by: https://github.com/huydhn
2024-03-13 03:21:21 +00:00
9f235971f0 Gate tt.reduce Triton mutation tests on Triton version (#121753)
Summary: The goal is to make the `test_argmax` and `test_reduce_sum` to work both before and after https://github.com/openai/triton/pull/3191 is included into the Triton pin. This is important to make those tests work during the Triton pin update process both in OSS and internally.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_reduce_sum -k test_argmax
..
----------------------------------------------------------------------
Ran 2 tests in 1.906s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121753
Approved by: https://github.com/Skylion007
2024-03-13 01:43:02 +00:00
7d05c4c093 Remove error anti-pattern when dealing with dynamic shape output (#121681)
There are cases where capture_dynamic_output_shape_ops=True and we will still see a DynamicOutputShapeException, for example when an op doesn't have a meta kernel implemented to return the correct dynamic shape output. If we blindly give users instructions to set capture_dynamic_output_shape_ops to True, they will try it and see no change, as witnessed in this issue:
https://github.com/pytorch/pytorch/issues/121036#issuecomment-1985221435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121681
Approved by: https://github.com/tugsbayasgalan
2024-03-13 00:45:23 +00:00
9df0dca7f6 Revert "[ Inductor ] Shape padding honors output stride preservation (#120797)"
This reverts commit 57fc35a3af09f7657b2be593a1046f0ac2dd50ab.

Reverted https://github.com/pytorch/pytorch/pull/120797 on behalf of https://github.com/williamwen42 due to perf regression on dashboard ([comment](https://github.com/pytorch/pytorch/pull/120797#issuecomment-1992857428))
2024-03-13 00:43:34 +00:00
02bb2180f4 [torch export] replace traceback.extract_stack with CapturedTraceback.extract (#121449)
Summary:
with a simple bench in TestDeserializer.test_basic function:
```
time_start = time.time()
for i in range(1000):
    self.check_graph(MyModule(), inputs)
warnings.warn(f"time_taken: {time.time() - time_start}")
```
and forcing FakeTensorConfig.debug to True, record_stack_traces to True, and the logging level to debug, it shows that the changed code is consistently around 20 secs faster (~90s vs. originally ~110s).
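A minimal sketch of the replacement (the import path and the `format()` call are assumptions based on the description):

```python
from torch.utils._traceback import CapturedTraceback

# capture the current stack lazily instead of eagerly formatting it with
# traceback.extract_stack(); only pay the formatting cost when actually needed
tb = CapturedTraceback.extract(skip=1)
print("".join(tb.format()))
```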

Test Plan:
test passed, see summary

compared debug trace before and after:
- exactly the same for fake tensor and proxy callsite https://www.internalfb.com/intern/diffing/?paste_number=1189883685
- slightly different for the user frame in proxy node https://www.internalfb.com/intern/diffing/?paste_number=1189884347

Differential Revision: D54237017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121449
Approved by: https://github.com/angelayi
2024-03-13 00:19:05 +00:00
7a53dedb07 CI: Specify libc and libstdcxx versions in conda environments (#121556)
Without this we get mismatches between the GLIBC and GLIBCXX ABI used
by conda packages vs pytorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121556
Approved by: https://github.com/isuruf, https://github.com/malfet
2024-03-13 00:12:54 +00:00
68be750e17 Cleanup some exception handling in triton mutation tracking (#121739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121739
Approved by: https://github.com/Skylion007
ghstack dependencies: #121690
2024-03-13 00:02:36 +00:00
a9274c9a2c Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672)
This PR corrects the example in the AOTInductor documentation, which currently fails with:
```
/home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’
   21 |     std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl;
      |
```
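A hedged sketch of one way to fix the snippet, as the description above implies: bind the inputs to a named lvalue instead of passing a temporary vector:

```cpp
std::vector<torch::Tensor> inputs = {torch::randn({2, 10}, at::kCPU)};
std::cout << runner.run(inputs)[0] << std::endl;
```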

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672
Approved by: https://github.com/desertfire
2024-03-12 23:43:40 +00:00
79ee6bbde3 Support triton.language.dtype with torch.compile (#121690)
Putting this PR as an RFC since I have resorted to some horrible hacks in order to make this work.
```
(Pdb) p triton.language.float32
triton.language.fp32
(Pdb) p str(triton.language.float32)
'fp32'
(Pdb) p repr(triton.language.float32)
'triton.language.fp32'
```
This means that we need to "rewrite" them for fx graph and inductor execution.

This PR allows Mamba2 to work with `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690
Approved by: https://github.com/Skylion007
2024-03-12 23:21:46 +00:00
22bb24986d [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388
Approved by: https://github.com/jansel
2024-03-12 22:48:48 +00:00
519151a062 [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on recompilation in each run.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/izaitsevfb
2024-03-12 22:18:43 +00:00
a95ceb51a2 Release fix pinning slow-tests.json (#121746)
The apply-release-changes script adds a version to SLOW_TESTS_FILE, which should not change.

Test:
```
SLOW_VER=test
sed -i -e s#/slow-tests.json#"/slow-tests.json?versionId=${SLOW_VER}"#  tools/stats/import_test_stats.py
```
Output:
```
SLOW_TESTS_FILE = ".pytorch-slow-tests.json"
...
url = "https://ossci-metrics.s3.amazonaws.com/slow-tests.json?versionId=test"
```

related to: https://github.com/pytorch/pytorch/pull/121726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121746
Approved by: https://github.com/huydhn
2024-03-12 22:04:55 +00:00
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
844bfbbd2e feat: Update Dockerfile default versions for Python, OS, and CUDA arch list (#121560)
- Update Dockerfile default versions for Python, OS, and CUDA arch list
	- Python 3.8 is EOL later this year, the `docker.Makefile` has 3.10 as default
	- `docker.Makefile` is using 22.04 so this just aligns that
	- The GPU feature list is quite dated, most of those architectures are long past EOL and we aren't getting the newer cards (A100, H100) into that list until now https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121560
Approved by: https://github.com/seemethere, https://github.com/Neilblaze, https://github.com/atalman, https://github.com/malfet
2024-03-12 21:43:26 +00:00
d62bdb087d [Profiler] add missing field device_resource_id (#121480)
Fixes #121479

Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480
Approved by: https://github.com/aaronenyeshi
2024-03-12 21:42:53 +00:00
5478a4e348 Don't run non-strict for test case that doesn't need non-strict (#121710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121710
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #121652, #121678, #121687
2024-03-12 21:32:33 +00:00
5b506c8bce Revert "[dynamo][guards] Use lazy variable tracker for func defaults (#121388)"
This reverts commit 04a5d6e8d3f09ee6741484bcfea022228f747b09.

Reverted https://github.com/pytorch/pytorch/pull/121388 on behalf of https://github.com/osalpekar due to causing executorch model-test failures internally. See [D54707529](https://www.internalfb.com/diff/D54707529) ([comment](https://github.com/pytorch/pytorch/pull/121388#issuecomment-1992619251))
2024-03-12 21:31:18 +00:00
522d972924 [eazy] add more log when accuracy check fail (#121656)
Add these log to debug the regress of accuracy test for dm_nfnet_f0 model for training.

With these extra log when the accuracy check fail, we can verify if it's close to succeed or not. If yes that indicates there is no real issue but just flaky and we probably can tune the tolerance to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-12 20:58:20 +00:00
f50c652422 avoid aten dispatch shadowing type with variable (#121659)
Summary:
`DECLARE_DISPATCH` shadows the dispatch-stub type with a variable of the same name:
`extern TORCH_API struct name name` -> `extern TORCH_API struct gemm_stub gemm_stub`, for instance.
This is probably dangerous behavior to rely on, as the compiler always needs to resolve to the type and/or the variable based on context. The previous macro fails with VS2022.

Test Plan: `buck2 build arvr/mode/win/vs2022/cpp20/opt //xplat/caffe2:aten_pow_ovrsource`

Differential Revision: D54699849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121659
Approved by: https://github.com/albanD
2024-03-12 20:50:47 +00:00
6d8a7d6e58 [pytorch] optional zero points on dequantize per channel (#121724)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2364

bypass-github-export-checks

Test Plan: sandcastle

Reviewed By: mikekgfb

Differential Revision: D54709217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121724
Approved by: https://github.com/mikekgfb
2024-03-12 19:54:11 +00:00
a6149eba12 [easy] Refactor MultiOutput. codegen_list_tuple_access to use subclass type checks (#121662)
Summary:
# Why?

Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list`, which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`), and `immutable_list` isn't explicitly supported here.

Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well.
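A minimal sketch of the difference between the two checks (illustrative, not the actual codegen):

```python
from torch.fx.immutable_collections import immutable_list

vals = immutable_list([1, 2, 3])

# concrete-type check: misses subclasses such as immutable_list
print(type(vals) is list)      # False

# subclass-aware runtime check: accepts list and any of its subclasses
print(isinstance(vals, list))  # True
```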

Test Plan: ci

Differential Revision: D54764829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662
Approved by: https://github.com/aakhundov
2024-03-12 19:27:56 +00:00
90e886aa6c Sanity check for non-strict (#121687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #121652, #121678
2024-03-12 18:21:32 +00:00
443e241cc5 Don't cache predispatch kernels (#121712)
Summary: Title

Test Plan: CI

Differential Revision: D54791087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712
Approved by: https://github.com/ydwu4
2024-03-12 18:05:59 +00:00
a26480a4d1 [dtensor] move early return check into redistribute autograd function (#121653)
This PR fixes a redistribute bug by moving the early-return check into the
redistribute autograd function, so that even when we redistribute to the
same placement, the grad_placements from the `to_local` call might be
different and the redistribute backward still needs to happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
2024-03-12 17:37:30 +00:00
00a53b58dd Refactor release only changes to two step execution (#121728)
Refactor release-only changes to a two-step execution.

1. Step ``tag-docker-images.sh``: tags the latest docker images for the current release. This step takes about 30 min to complete. It may fail due to space issues on the local host or HTTP connection issues when pulling images, and hence should be rerun if it fails.

2. Apply release-only changes: ``apply-release-changes.sh`` prepares a PR with the release-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121728
Approved by: https://github.com/jeanschmidt
2024-03-12 17:22:22 +00:00
4e63d9065a [dynamo] Delete record replay tests as they are not maintained (#121705)
Fixes https://github.com/pytorch/pytorch/issues/115518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121705
Approved by: https://github.com/mlazos
2024-03-12 17:16:34 +00:00
cd1751b14f [dynamo] Measure Dynamo cache latency lookup (#121604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604
Approved by: https://github.com/jansel
ghstack dependencies: #121614, #121622
2024-03-12 17:09:11 +00:00
22489bfe70 [dynamo][guards-cpp-refactor] Directly call root guard manager in eval_frame (#121622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121622
Approved by: https://github.com/jansel
ghstack dependencies: #121614
2024-03-12 17:09:11 +00:00
2348e8e4e7 [dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614)
Use NO_HASATTR guard for the common part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614
Approved by: https://github.com/jansel
2024-03-12 17:08:56 +00:00
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec5fb9bccb49bf4ee4ab561e872c0f50.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
b84f94f6a3 Restore timestamps on C++ logs without glog (#121384)
It looks like it was commented out because the original implementation was not sufficiently portable. I had to do some rewrites to the innards to make it portable. No Windows nanoseconds support because I'm lazy.

I tested by running `build/bin/TCPStoreTest` and observing the log messages there.  I am actually not sure how to look at the log messages from Python though.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121384
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-03-12 17:01:32 +00:00
704e15307e [caffe2] replace refernces to np.asscalar (#121332) (#121545)
Summary:

`np.asscalar` was deprecated and removed in a recent Numpy. It used to be implemented the following way, and the recommended alternative is to call `item()` directly:
```python
def asscalar(a):
    return a.item()
```
This fixes all of the references.

Test Plan: visual inspection and automated tests

Differential Revision: D54697760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121545
Approved by: https://github.com/malfet
2024-03-12 16:58:47 +00:00
d1715c3adb [export] Update error message for set_grad (#121666)
Context: https://fb.workplace.com/groups/222849770514616/posts/381979051268353/?comment_id=383334957799429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121666
Approved by: https://github.com/ydwu4
2024-03-12 16:41:45 +00:00
3c8c7e2a46 [dynamo] Tweak naming for module hook bw_state (#121609)
Some minor changes not related to the other PRs in the stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121609
Approved by: https://github.com/yanboliang
2024-03-12 16:27:56 +00:00
7a68e0a3e8 [DCP][state_dict] Remove the check of FSDP has root (#121544)
Root may not exist due to FSDP lazy initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121544
Approved by: https://github.com/Skylion007
ghstack dependencies: #121273, #121276, #121290
2024-03-12 15:43:19 +00:00
85dc254364 [DTensor] Moved Transformer sharding to staticmethod (#121660)
To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests.

Test Plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #121360, #121357
2024-03-12 15:08:57 +00:00
cc51e100f5 [ET-VK] Enable Dynamic shape support via tensor virtual and physical resizing (#121598)
Summary:
## Context

This changeset lays the foundations for supporting dynamic shapes in the ExecuTorch Vulkan delegate via allowing Tensors to be resized in one of two ways:

1. Discarding underlying `vkImage` or `vkBuffer` and reallocating a new `vkImage` or `vkBuffer` with updated sizes. This method is intended to be used when the current `vkImage` or `vkBuffer` is not large enough to contain the new sizes.
2. Update the tensor's size metadata without reallocating any new resources. This allows shaders to interpret the underlying `vkImage` or `vkBuffer` as if it were smaller than it actually is, and allows command buffers to be preserved when sizes are changed.

Test Plan: Check CI. Tests have also been added to `vulkan_compute_api_test` that test the two methods of tensor resizing.

Differential Revision: D54728401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121598
Approved by: https://github.com/jorgep31415
2024-03-12 14:32:00 +00:00
2a99e6f299 Update error message (#121644)
Summary:
We don't want people to move to NCCL exp without an explicit opt-in. It seems that sparse allreduce was accidentally called and people were confused about whether they should use NCCL exp instead.

Update the error message to explicitly say that sparse_allreduce is not supported.

Test Plan: sandcastle

Differential Revision: D54759307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
2024-03-12 13:04:21 +00:00
edf22f3a48 Modify signature of dequantize ops for decomposed quantized Tensor (#119173) (#121450)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308

Note: The initial purpose of this PR is to draw suggestion and feedback regarding better alternative, if any.

At present, the dequantize op for the decomposed quantized Tensor representation, e.g. dequantize_per_tensor(), assumes the output dtype is torch.float and hence does not have the output dtype in its operator argument list. However, this op signature becomes unusable when that assumption breaks: when the output dtype is different from torch.float, there is no way to specify it during dequantization.

This change is aimed at generalizing the signature of dequantize ops like dequantize_per_tensor() for wider use cases where the output dtype can be different from torch.float and needs to be passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also welcome suggestions and feedback regarding any better alternative that could be used instead.

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel

Reviewed By: digantdesai

Differential Revision: D53590486

Pulled By: manuelcandales

Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
2024-03-12 12:36:31 +00:00
06d2392003 Support tt.reduce in Triton kernel analysis pass (#121706)
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.

Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path which is active only on when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run the `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When pin updates, we'll make those tests official (left a TODO comment).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
2024-03-12 11:38:28 +00:00
78b4793c96 [dynamo][compile-time] Caching VTs to reduce compile-time (#121031)
Reduces the `torch.compile(backend="eager")` for this code

~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

From 18 seconds to 12 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031
Approved by: https://github.com/jansel
2024-03-12 09:19:50 +00:00
52ad2b682c Generate predispatch tests (#121678)
In this PR, we create another dynamic test class for TestExport tests that basically serializes/deserializes pre-dispatch IR. I encountered 4 additional failures; 3 of them are due to a different operator showing up in the graph, and only one is a legitimate failure, which is tracked by another task internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678
Approved by: https://github.com/angelayi
ghstack dependencies: #121652
2024-03-12 08:34:50 +00:00
656134c38f [ROCm] enable complex128 in test_addmm_sizes_all_sparse_csr for rocm for trivial (k,n,m) cases (#120504)
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for trivial cases  (m or n or k = 0)

CUSPARSE_SPMM_COMPLEX128_SUPPORTED is also used for `test_addmm_all_sparse_csr` and `test_sparse_matmul`, and both of them are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-03-12 07:29:57 +00:00
86a2d67bb9 Simplify guards using info from previous guards (#121463)
Let me see what CI thinks about this one. Will add tests tomorrow.

Fixes https://github.com/pytorch/pytorch/issues/119917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121463
Approved by: https://github.com/ezyang
2024-03-12 04:22:20 +00:00
703e83e336 Fix AARCH64 builds (#121700)
After https://github.com/pytorch/pytorch/pull/119992 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121700
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2024-03-12 04:17:47 +00:00
159f30331f [quant][pt2e] Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548)
Test Plan:
```
buck run caffe2/test:quantization_pt2e
```

Differential Revision: D54454707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121548
Approved by: https://github.com/jerryzh168
2024-03-12 02:59:12 +00:00
7fc497711d Also test predispatch serialization (#121652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121652
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2024-03-12 02:37:59 +00:00
6ca9ae4f86 Express y grid > 2^16 in terms of z grid (#121554)
CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen.
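A minimal sketch of the idea (this is not Inductor's actual codegen; the helper and constant names are made up here):

```python
MAX_Y_GRID = 65535  # CUDA's per-dimension limit for the y grid

def split_y_grid(total_y_blocks: int) -> tuple[int, int]:
    # Express an oversized y extent as a (y, z) pair so that y * z covers
    # total_y_blocks while each launch dimension stays within the limit.
    if total_y_blocks <= MAX_Y_GRID:
        return total_y_blocks, 1
    z = -(-total_y_blocks // MAX_Y_GRID)  # ceiling division
    y = -(-total_y_blocks // z)
    return y, z

# Inside the kernel, the logical row index is then reconstructed as, roughly,
#   y_index = program_id(1) + program_id(2) * y_grid
print(split_y_grid(200_000))  # (50000, 4) under these assumptions
```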

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554
Approved by: https://github.com/aakhundov
2024-03-12 02:36:19 +00:00
fb1d7935bb [optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618
Approved by: https://github.com/albanD
2024-03-12 02:33:21 +00:00
a37e22de70 Add Flash Attention support on ROCM (#121561)
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)

- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
    * MI300X is supported. More architectures will be added once Triton supports them.
- [x] Only supports power of two sequence lengths.
    * Now it supports arbitrary sequence lengths
- [ ] No support for varlen APIs.
    * varlen API will be supported in the next release of AOTriton
- [x] Only support head dimension 16,32,64,128.
    * Now it supports arbitrary head dimensions <= 256
- [x] Performance is still being optimized.
    * Kernel is selected according to autotune information from Triton.

Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API

This is a more extensive fix to #112997
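A hedged usage sketch (requires a supported GPU build; the context manager below is the standard way to request the flash backend and is not specific to this PR):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))
# Ask for the flash-attention backend only; this errors out if it is unavailable.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```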

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-12 01:16:53 +00:00
3a5f48d55f Port remove_split_ops to PT2 pre-grad passes (#121674)
Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain.

Test Plan:
Run test cases

Run the oemae lower benchmark with and without this fix. FLOP/s 29 -> 34.

Reviewed By: frank-wei

Differential Revision: D54711064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674
Approved by: https://github.com/frank-wei
2024-03-12 01:15:19 +00:00
5b5d423c2e Benchmark templates (#118880)
Adding support for benchmarking templates in `benchmark_fusion`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880
Approved by: https://github.com/shunting314
2024-03-11 23:55:13 +00:00
7676433012 [AOTInductor] Reuse generated kernels between constant graph and main graph (#121564)
Summary: We copy the src_to_kernel map from the constant graph to the main graph so that we avoid generating duplicate kernels, and pass through the name counter so that no duplicated names are generated.

Test Plan: Included in commit

Differential Revision: D54706767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-03-11 22:44:38 +00:00
272cf29e4d [FSDP2][BE] Refactored check_1d_sharded_parity to use mesh (#121357)
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
2024-03-11 22:34:42 +00:00
cd1dc5e484 Delete requirements-flake8.txt (#121657)
The file seems to be unused and also has a different flake8 version compared to .lintrunner.toml, creating confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121657
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-03-11 22:29:25 +00:00
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda519ccd7f28de17689edd43c52743bc9.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
498a94a7f5 Don't install torchfix for python<3.9 (#121655)
Fixes https://github.com/pytorch/pytorch/issues/121591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121655
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-03-11 22:18:42 +00:00
b2f09c1859 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit d27509c384c9847cd2ac1f5d63ec143704b50591.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/xmfan due to breaking internal builds, see D54707287 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1989542344))
2024-03-11 22:18:36 +00:00
d1f45a93af Check for releasing GIL at compiletime (#116695)
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch, making the code cleaner and faster.

@albanD This is the GIL change extracted from #112607 as discussed.

Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be called multiple times
- After `std::move(f)` the lambda is not safe to call anymore

CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-03-11 22:04:56 +00:00
fd13a56f61 Refactor some testing helpers for FX graph cache testing (#121520)
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
2024-03-11 21:46:27 +00:00
e01b07e1e8 [ROCm] Autocast RNN Support (#121539)
Fixes #116361

Implements an Autocast wrapper for MIOpen RNNs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121539
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2024-03-11 21:14:43 +00:00
fc712311ce port fuse_parallel_linear (without changing weights) to PT2 pre-grad (#121617)
Summary: Does not change the weights structure, so it is compatible with const folding and realtime weight updates

Test Plan: run added test cases

Reviewed By: frank-wei

Differential Revision: D53843428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617
Approved by: https://github.com/frank-wei
2024-03-11 20:51:11 +00:00
3461404869 [pt2 export]fix name collision on constant name (#121145)
Summary: Taking the rightmost part of the fqn causes name conflicts when there are multiple instances of the same class. Changed to replace "." in the fqn with "_" to avoid invalid syntax in input args.

Test Plan: added test case

Differential Revision: D54435230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
2024-03-11 20:40:59 +00:00
b091a32909 Add a section on release wiki about pytorchbot cherry-pick command (#121648)
I added a section about the new `pytorchbot cherry-pick` command to the release wiki so that more people know about it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121648
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-03-11 20:09:58 +00:00
dd2062c737 fix CMake FindCUDA module for cross-compiling (#121590)
Fix two cross-compiling issues in `FindCUDA.cmake` (xref: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/224).

1. `setup.py` reads the cached `CUDA_TOOLKIT_ROOT_DIR`, so it must be cached.
41286f1505/setup.py (L593)

I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9323.

2. [SBSA toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=arm64-sbsa&Compilation=Cross&Distribution=Ubuntu&target_version=20.04&target_type=deb_network_cross) is in `sbsa-linux` directory. See also https://gitlab.kitware.com/cmake/cmake/-/issues/24192

I also submitted it to the upstream CMake: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121590
Approved by: https://github.com/malfet
2024-03-11 20:09:52 +00:00
5fd7f5c4e3 Include torch warn in each error in cudnn/Conv_v8.cpp (#120719)
Fixes #120702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120719
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-03-11 20:05:42 +00:00
9aa3fedb75 Slightly faster FX graph iterator (#121611)
Before:
```
iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s)
```

After:
```
iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s)
```
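The numbers above come from a micro-benchmark; a minimal, illustrative way to exercise graph iteration yourself (at a much smaller scale) is:

```python
import torch.fx as fx

def f(x):
    for _ in range(1000):
        x = x + 1
    return x

gm = fx.symbolic_trace(f)  # the Python loop unrolls into ~1000 add nodes
print(sum(1 for _ in gm.graph.nodes))
```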

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611
Approved by: https://github.com/oulgen
2024-03-11 20:00:19 +00:00
ae22bdaefe Update torchbench commit pin, add sam_fast benchmark (#121420)
After this, the sam_fast benchmark can now be run in the pytorch repo:
```
SEGMENT_ANYTHING_FAST_USE_FLASH_4=0 benchmarks/dynamo/torchbench.py --inference --amp --performance --backend=inductor --explain --only sam_fast
```

sam_fast is designed for inference only, with cuda and amp on. The code adds these restrictions to the benchmark.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121420
Approved by: https://github.com/oulgen, https://github.com/msaroufim
2024-03-11 19:48:53 +00:00
dccc1ca839 [torch] Use __prepare_scriptable__ for closures (#121553)
Summary:
This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229
The object is using __prepare_scriptable__ correctly inside of torch.jit.script()
but the clousre that is obtained below is using the non-prepared version.
This causes issues when the prepared and non-prepared versions are in different python modules.

Test Plan:
```
buck2 run mode/opt caffe2/test:jit -- -r test_decorator
```

Differential Revision: D54308741

Re-exporting, as #120806 #121307 were not properly merged.

Co-authored-by: Daniel Herrera <dherrera@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553
Approved by: https://github.com/huydhn, https://github.com/seemethere
2024-03-11 19:14:19 +00:00
b4160fd9c7 Clean up macOS x86 binaries build jobs (#116726)
This will stop building binaries for MacOS x86 on PyTorch including nightly and all future releases.  If we want this for 2.2, this can be cherry-picked there.

* [x] https://github.com/pytorch/pytorch/pull/116725
* [ ] https://github.com/pytorch/pytorch/pull/116726

Fixes https://github.com/pytorch/pytorch/issues/114602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116726
Approved by: https://github.com/atalman
2024-03-11 19:09:39 +00:00
8d03c59d59 Bring torch_xla pin to the latest torch_xla commit (03/08/2024). (#121529)
Update the torch_xla pin to a more recent one (03/08/2024). We need to make sure the torch_xla pin stays up-to-date so that pytorch can test against an up-to-date version of torch_xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121529
Approved by: https://github.com/atalman
2024-03-11 18:25:42 +00:00
39ed038f41 [TEST] Prepare test_cumulative_trapezoid for SciPy 1.12 (#121541)
Follow up on #119326 with addressed comment: https://github.com/pytorch/pytorch/pull/119326#issuecomment-1939428705:
> I'd like to propose a slightly different approach. We could check if scipy is version `1.12.0`. If so, overload `scipy_cumulative_trapezoid` with a function that specifically checks `t.shape[axis] == 0`, and in that case return an array of the same shape as `t`, which is the expected behavior as far as I understand. That way, we're not just skipping the test cases

I would like to add that the version check is not necessary as in any case the outcome is the same.
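A minimal sketch of the overload described in the quoted comment (assuming the test's reference helper is a thin wrapper around SciPy):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def scipy_cumulative_trapezoid(y, *args, axis=-1, **kwargs):
    # For an empty integration axis, return an array of the same shape as y,
    # which is the expected reference behavior regardless of the SciPy version.
    if y.shape[axis] == 0:
        return np.empty_like(y)
    return cumulative_trapezoid(y, *args, axis=axis, **kwargs)

print(scipy_cumulative_trapezoid(np.zeros((3, 0)), axis=1).shape)  # (3, 0)
```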

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121541
Approved by: https://github.com/nWEIdia, https://github.com/albanD
2024-03-11 17:48:29 +00:00
6801595349 Fix round robin sharding (#121022)
Fix round robin sharding when there are no test times and sort_by_time=False

Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-03-11 17:30:12 +00:00
e2ac2dc13a Update NCCL submodule to v2.20.5 (#121635)
Updates the NCCL submodule to 2.20.5. Includes a lot of bugfixes for reduction and connection issues. Should also improve performance. We have been running 2.20.5 internally for a few weeks; the binary pip wheels have finally been published, so we can update main.

Release notes here: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-20-5.html#rel_2-20-5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121635
Approved by: https://github.com/malfet
2024-03-11 17:23:59 +00:00
89add71168 fix synchronization behavior for copies with type change (#121341)
Fixes #121320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121341
Approved by: https://github.com/albanD
2024-03-11 17:09:45 +00:00
03717430cc Fix lower precision check for MKLDNN on Windows (#121618)
Fixes #120788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121618
Approved by: https://github.com/xuhancn, https://github.com/jgong5, https://github.com/mingfeima, https://github.com/seemethere
2024-03-11 16:09:20 +00:00
e29004615f Add NEON accelerated torch.mv kernel (#119992)
This reduces `torch.mv` time for a 256x768 matrix by a 256-element vector from 209 usec to 16 usec for the non-transposed case and from 104 to 18 usec when transposed
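A rough, illustrative way to time the non-transposed and transposed cases on your own machine (not the benchmark used for the table below; numbers will vary):

```python
import torch
from torch.utils.benchmark import Timer

m = torch.rand(256, 768)
v = torch.rand(768)   # for torch.mv(m, v)
w = torch.rand(256)   # for torch.mv(m.t(), w)
print(Timer("torch.mv(m, v)", globals={"torch": torch, "m": m, "v": v}).blocked_autorange())
print(Timer("torch.mv(m.t(), w)", globals={"torch": torch, "m": m, "w": w}).blocked_autorange())
```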

Also, add an fp16-accumulation flavor to the same ops (controlled by the private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction`, which yields slightly better numbers), summarized in the following table:

| op | original | F32+NEON | F16+NEON|
| ---| -------- | ---------- | ----- |
| torch.mv(m, v) | 209.53 usec | 16.25 usec | 14.68 usec |
| torch.mv(m.t(), v) |  104.80 usec | 28.68 usec | 24.82 usec |

Test plan: CI on MacOS for both CPU and MPS tests fp32<->fp16 matmul consistency (for example, "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32 reductions are performed, but fails if fp16 accumulation is used)

To investigate:
 - why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992
Approved by: https://github.com/mikekgfb
2024-03-11 16:00:01 +00:00
fac06a12c8 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail on the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-11 15:35:45 +00:00
6c11d3ce0c Add support to save safetensors checkpoint directly into onnx (#121001)
Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external PyTorch checkpoint files used to initialize the model are automatically detected and used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model.

This API extends the mechanism for HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished.

Without this PR, the user would have to convert the safetensors files to PyTorch format and feed them to `torch.onnx.ONNXProgram.save` manually.
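A hedged usage sketch of the flow described above; the toy module, the file names, and the `model_state` keyword are illustrative assumptions and may not match the final API exactly:

```python
import torch

class TinyModel(torch.nn.Module):  # stand-in for a HuggingFace model
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)

with torch.onnx.enable_fake_mode() as fake_context:
    model = TinyModel()
    example_input = torch.randn(1, 16)
    export_options = torch.onnx.ExportOptions(fake_context=fake_context)
    onnx_program = torch.onnx.dynamo_export(model, example_input, export_options=export_options)

# With this change, a safetensors checkpoint can be handed to save() directly.
onnx_program.save("model.onnx", model_state="model.safetensors")
```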
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001
Approved by: https://github.com/BowenBao, https://github.com/malfet
2024-03-11 15:21:59 +00:00
485f8ebc07 add __repr__ function to FunctionSchema for Python (#121484)
Fixes #118566

Unlike **OpOverload** or **OpOverloadPacket**, the schema contains a lot of complex information, so keeping it as-is is probably a good choice for me, but in theory the **\_\_repr__** function should show the class name as well as some other key information.
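For example (the exact printed form is indicative only):

```python
import torch

schema = torch.ops.aten.add.Tensor._schema  # a torch._C.FunctionSchema
print(repr(schema))
# e.g. aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
```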

If you have a better alternative, please let me know. Thank you.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121484
Approved by: https://github.com/Skylion007
2024-03-11 15:16:50 +00:00
d1510e01fa Upgrade submodule onednn to v3.3.5 (#120767)
This upgrade contains the fixes to the known issues brought by oneDNN v3.3.2, including issues https://github.com/pytorch/pytorch/issues/115346, https://github.com/pytorch/pytorch/issues/120211 and https://github.com/pytorch/pytorch/issues/120406 and those listed in PR #112700.

Issue https://github.com/pytorch/pytorch/issues/115346 (perf regression) was fixed by oneDNN v3.3.4. No new regression was found with v3.3.5. The detailed results of v3.3.4 are given below and compared with v3.1.1 (the oneDNN version in PyTorch before it was updated to v3.3.2).
1. A performance regression with 5.8% perf drop from `pytorch_stargan-train` (see https://github.com/pytorch/benchmark/issues/2076#issuecomment-1847545843)
Validation results with this patch: Latency increased by 0.60%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
metrics-1484287.json
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 418.851717
    }
}
oneDNN v3.3.4
{
    "name": "cpu",
    "environ": {
        "pytorch_git_version": "6c8c5ad5eaf47a62fafbb4a2747198cbffbf1ff0"
    },
    "metrics": {
        "latency": 421.381313
    }
}
```

2. Performance regression of FP32 rexnet_100 with Inductor, dynamic shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issue-2030859592)
Validation results with this patch: Latency reduced by 3.23%
```
Tested on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz instance (IceLake)
oneDNN v3.1.1
(inductor speedup over eager mode) 2.876x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,2.875904,113.314765,18.455283,0.990437,1302.636134,1315.212902,351,1,0,0

oneDNN v3.3.4
(inductor speedup over eager mode) 3.003x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,rexnet_100,128,3.003012,109.653012,91.547260,0.990048,1302.532506,1315.625370,351,1,0,0
```

3. Performance regression of AMP hf_T5_generate and tinynet_a with Inductor, static shape, multi-threads (see https://github.com/pytorch/pytorch/issues/115346#issuecomment-1856029962)
Validation results with this patch: Latency reduced by 0.85%
```
Tested on an AWS spr metal instance
oneDNN v3.1.1
(inductor speedup over eager mode) 1.120x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.120018,1197.807729,205.905466,0.442803,125.179904,282.698957,10550,48,8,4

oneDNN v3.3.4
(inductor speedup over eager mode) 1.134x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,hf_T5_generate,1,1.133594,1187.701514,205.855527,0.422012,128.405094,304.268493,10550,48,8,4
```

The following issues about functionality are fixed by this upgrade. Test cases are also added for these issues.
- https://github.com/pytorch/pytorch/issues/120211
- https://github.com/pytorch/pytorch/issues/120406
- https://github.com/pytorch/pytorch/issues/120547

-----

Below are detailed data of torchbench CPU userbenchmark test and Inductor FP32/AMP inference tests. No regression of perf or functionality was found.
I.  *torchbench CPU userbenchmark test*
Suite | Speedup
-- | --
eager_throughtput_bf16_infer | 1.001848
eager_throughtput_fp32_infer | 1.000257
eager_throughtput_fx_int8 | 1.003069
jit_llga_throughtput_amp_bf16 | 1.000682
jit_llga_throughtput_fp32 | 1.000313
eager_throughtput_bf16_train | 0.998222
eager_throughtput_fp32_train | 1.003384

II. *Inductor FP32/AMP inference tests*
i.  FP32 static default
suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.09
timm_models | tinynet_a | multiple | 128 | 1.14

ii.  FP32 dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | alexnet | multiple | 128 | 1.08
torchbench | basic_gnn_edgecnn | multiple | 1 | 0.98
torchbench | timm_efficientnet | multiple | 64 | 1.08

iii. AMP static default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | hf_distil_whisper | multiple | 1 | 1.18
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | BartForConditionalGeneration | multiple | 2 | 1.19
timm_models | eca_halonext26ts | multiple | 128 | 1.13
timm_models | nfnet_l0 | multiple | 128 | 1.13
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | spnasnet_100 | multiple | 128 | 1.15
timm_models | tf_efficientnet_b0 | multiple | 128 | 1.22
timm_models | tinynet_a | multiple | 128 | 1.49
torchbench | hf_Bert_large | single | 1 | 1.16
huggingface | XLNetLMHeadModel | single | 1 | 1.07

iv.  AMP dynamic default

suite | name | thread | batch size | Ratio Speedup(New/old)
-- | -- | -- | -- | --
torchbench | timm_efficientnet | multiple | 64 | 1.32
huggingface | PLBartForConditionalGeneration | multiple | 4 | 1.14
timm_models | nfnet_l0 | multiple | 128 | 1.15
timm_models | rexnet_100 | multiple | 128 | 1.45
timm_models | tinynet_a | multiple | 128 | 1.34
huggingface | XLNetLMHeadModel | single | 1 | 1.09

-----

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120767
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-03-11 12:56:59 +00:00
605c0a28aa [dtensor][debug] force visualize_sharding not to print for empty tensors (#121217)
**Summary**
The current `visualize_sharding` code cannot print empty DTensor objects, which leads to an exception. This PR skips the print logic if the DTensor passed in has 0 elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121217
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385, #121382
2024-03-11 09:22:49 +00:00
3a5ab17bdc [dtensor][debug] visualize_sharding skip if the current rank is not in mesh (#121382)
**Summary**
We should skip the `visualize_sharding()` function on those ranks that are not a part of the DTensor's mesh. Otherwise, an exception will be thrown in the current visualize logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385
2024-03-11 09:22:49 +00:00
b383123e37 [dtensor][debug] visualize_sharding only compute offset on the first rank in mesh (#121385)
**Summary**
avoid computing on ranks where we do not plan to visualize the DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121385
Approved by: https://github.com/wanchaol
2024-03-11 09:22:31 +00:00
9c50ecc84b Fix get_rank under a non-default group. (#120481)
Fixes #120213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120481
Approved by: https://github.com/yifuwang
2024-03-11 05:40:54 +00:00
7cc476ea16 [dynamo] Fix support for nn.Parameter constructor (part 1) (#120163)
This captures calls to `torch.nn.Parameter` by lifting them to graph inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120163
Approved by: https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #121086
2024-03-11 05:14:42 +00:00
32488b0664 [dynamo] Support _unsafe_set_version_counter (#121086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121086
Approved by: https://github.com/yanboliang
2024-03-11 05:14:42 +00:00
7a4e451184 [Dynamo] Fix function overrides (#120885)
To check for the existence of `__torch_function__`, the code intended to iterate over each element but got a `TupleVariable` when the ordinary `has_torch_function()` was being used. This case needs a further unpack.

Fixes #120653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120885
Approved by: https://github.com/yanboliang
2024-03-11 02:18:43 +00:00
f11f2b0d55 split predispatch pass into multiple passes (#121592)
Summary:
It's very difficult to debug the passes' ineffectiveness with them mingled in one single pass container. Here we extract them into separate passes with diagnostics info.

This is also required for a later change, where we must run shape prop on each of these passes, in order for the subsequent passes to have the correct shape information.

Reviewed By: frank-wei

Differential Revision: D53579545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121592
Approved by: https://github.com/frank-wei
2024-03-11 00:30:55 +00:00
13e8181b7b relax assertion on fake shape (#121599)
Summary: It seems like if you use `capture_pre_autograd_graph`, fake tensor shapes can be ints instead of symints.

Test Plan: fixes the AssertionError in N5057219

Differential Revision: D54729142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121599
Approved by: https://github.com/angelayi, https://github.com/BoyuanFeng
2024-03-10 22:51:10 +00:00
660ec3d38d [Export] Fix bug removing node from wrong graph (#121574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121574
Approved by: https://github.com/ydwu4
2024-03-10 04:46:11 +00:00
41286f1505 [IntraNodeComm] fix a hybridCubeMeshAllReduceKernel breakage caused by a recent refactor (#121575)
`hybridCubeMeshAllReduceKernel` uses the latter half of the p2p buffers as relay buffers. The relay buffer address is calculated using a bf16 base pointer and the buffer size in bytes. The breakage was caused by not taking the element size into account.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
2024-03-10 00:55:25 +00:00
60cd2a43ca [DeviceMesh] Add support for nD slicing (#119752)
Fixes one of the issue mentioned in #118639
@mvpatel2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119752
Approved by: https://github.com/wanchaol
2024-03-10 00:16:37 +00:00
e90cddb0d3 [inductor] Log triton kernel source and metadata on failure (#120494)
If Triton compilation fails it's much easier to debug when given the
kernel source directly, versus a PyTorch repro.

This would have helped root cause
https://github.com/pytorch/pytorch/issues/118589 almost immediately

Differential Revision: [D54119568](https://our.internmc.facebook.com/intern/diff/D54119568/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120494
Approved by: https://github.com/peterbell10, https://github.com/eellison, https://github.com/jansel
2024-03-09 20:12:27 +00:00
168a04e752 [inductor] Changes to support newer triton pin (#121267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121267
Approved by: https://github.com/lezcano
ghstack dependencies: #121438
2024-03-09 18:17:36 +00:00
459c5bca58 [inductor] Refactor common triton imports into one function (#121438)
This means when codegen depends on a particular import we only need to
add it in one place and it's applied to all triton kernels.

This also changes codegen slightly so instead of generating
`@pointwise` we now generate `@triton_heuristics.pointwise` just so
the imports are the same for all kernel types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121438
Approved by: https://github.com/lezcano
2024-03-09 18:17:36 +00:00
8c96b4367a Remove opmath cast for im2col decomp (#121363)
It is unclear why an opmath cast is needed for the im2col decomp, given that the decomposition mainly performs padding, slicing, indexing, and shape manipulation. There is no need to perform these operations in a higher precision; doing so requires more memory and yields lower performance.

Sample script to demonstrate the inserted cast before this change:

```python
import torch
from torch._decomp.decompositions import im2col

def func(x):
    return torch.nn.functional.unfold(
        x, kernel_size=[3, 1], padding=[2, 0], dilation=1, stride=1
    )

x = torch.rand(1, 1, 5, 5, dtype=torch.float16)

eo = torch._dynamo.export(
    func, aten_graph=True, decomposition_table={torch.ops.aten.im2col.default: im2col}
)(x)
eo.graph_module.print_readable()
```

```
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f16[1, 1, s0, s0]";

        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        arg0_1 = arg0

        _to_copy: "f32[1, 1, s0, s0]" = torch.ops.aten._to_copy.default(arg0_1, dtype = torch.float32)
        ...
        constant_pad_nd: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.constant_pad_nd.default(_to_copy, [0, 0, 2, 2], 0.0);  _to_copy = None
        ...
        slice_1: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(constant_pad_nd, 0, 0, 9223372036854775807);  constant_pad_nd = None
        slice_2: "f32[1, 1, s0 + 4, s0]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 9223372036854775807);  slice_1 = None
        index: "f32[1, 1, 3, s0 + 2, 1, s0]" = torch.ops.aten.index.Tensor(slice_2, [None, None, unsqueeze_5, add_3]);  slice_2 = unsqueeze_5 = add_3 = None
        permute: "f32[1, 1, 3, 1, s0 + 2, s0]" = torch.ops.aten.permute.default(index, [0, 1, 2, 4, 3, 5]);  index = None
        ...
        view: "f32[1, 3, s0**2 + 2*s0]" = torch.ops.aten.view.default(permute, [1, 3, mul]);  permute = mul = None
        _to_copy_1: "f16[1, 3, s0**2 + 2*s0]" = torch.ops.aten._to_copy.default(view, dtype = torch.float16);  view = None
        return pytree.tree_unflatten([_to_copy_1], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121363
Approved by: https://github.com/lezcano
2024-03-09 15:37:27 +00:00
71d0202627 [dynamo] support rewriting dist.all_reduce with explicitly specified reduce op (#120181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120181
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-03-09 08:28:22 +00:00
cf9742371c Revert "Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)"
This reverts commit 752d164b2f0d401042de4a75f36f7e84bae91daa.

Reverted https://github.com/pytorch/pytorch/pull/119685 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is crashing on ROCm 752d164b2f ([comment](https://github.com/pytorch/pytorch/pull/119685#issuecomment-1986773384))
2024-03-09 07:20:53 +00:00
761783a4ff [profiler] Fix recorded profiler step number (#121127)
Fixes [121126](https://github.com/pytorch/pytorch/issues/121126)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121127
Approved by: https://github.com/briancoutinho
2024-03-09 06:54:51 +00:00
242e03ba86 [dtensor] add async_op option to redistribute and some refactor (#121477)
The async output option was only available in the `full_tensor()` call, but it's
generally good to make this option available in the `redistribute` call directly
so that the user can control it.

This PR adds an async_op option to the redistribute call, allowing the user to control
whether to perform tensor redistribution asynchronously or not.

By default we set this to False; this follows the semantics of the c10d
collectives.
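A hedged usage sketch of the new option (run under torchrun with one GPU per rank; the DTensor APIs were still under the private torch.distributed._tensor namespace around the time of this change):

```python
import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Replicate, Shard

world_size = int(os.environ["WORLD_SIZE"])
mesh = init_device_mesh("cuda", (world_size,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
# async_op=True returns without waiting on the underlying collective.
out = dt.redistribute(mesh, [Replicate()], async_op=True)
```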

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
2024-03-09 06:17:23 +00:00
a6a67da333 [quant] Add error check for input_edge annotation (#121536)
Summary:
Raises an error when an input edge contains non-Node elements, like constant values, in the annotation.

Test Plan:
python test/test_quantization.py -k test_input_edge_sanity_check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121536
Approved by: https://github.com/andrewor14
2024-03-09 06:13:04 +00:00
e8836759d0 [export] Add effect token to export (#121424)
Following the creation of effect tokens (https://github.com/pytorch/pytorch/pull/120296), we want to now add support for these tokens in export because the calling/returning convention has changed. The inputs are now `(tokens, params, buffers, constants, user_inputs)` and the outputs are `(tokens, buffer_mutations, user_mutations, user_outputs)`. The graph looks something like:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %attr : [num_users=2] = placeholder[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %with_effects : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, _TorchScriptTesting.takes_foo.default, %attr, %arg1_1), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 1), kwargs = {})
    %with_effects_1 : [num_users=2] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%getitem, _TorchScriptTesting.takes_foo.default, %attr, %getitem_1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 0), kwargs = {})
    %getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects_1, 1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %getitem_3), kwargs = {})
    return (getitem_2, add)
```

During unlifting, we will first remove the tokens and with_effect calls using the `remove_effect_tokens` pass. (cc @SherlockNoMad on the pass to remove tokens). This is so that this won't change the calling conventions when retracing. The graph after unlifting looks something like:
```
graph():
    %attr_1 : [num_users=2] = get_attr[target=attr]
    %arg1_1 : [num_users=2] = placeholder[target=arg1_1]
    %takes_foo_default_1 : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %arg1_1), kwargs = {})
    %takes_foo_default : [num_users=1] = call_function[target=torch.ops._TorchScriptTesting.takes_foo.default](args = (%attr_1, %takes_foo_default_1), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %takes_foo_default), kwargs = {})
    return (add,)
```

Serialization support will be added in a followup.
Note: tokens only affect custom ops that take in ScriptObjects, not ScriptObject methods yet.

Differential Revision: [D54639390](https://our.internmc.facebook.com/intern/diff/D54639390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121424
Approved by: https://github.com/tugsbayasgalan
2024-03-09 02:43:26 +00:00
eb3919944d [C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
The necessity of this PR lies in the fact that autograd engine + DDP calls `all_reduce` from C++, so the changes must be made in C++.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "~/complex_ddp.py", line 72, in <module>
[rank0]:     main()
[rank0]:   File "~/complex_ddp.py", line 64, in main
[rank0]:     loss.backward()
[rank0]:   File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```

I believe that, to minimize the Python overhead, the same could be done for the rest of the ops; what do you think @kwen2501?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-03-09 02:00:54 +00:00
752d164b2f Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch
2024-03-09 02:00:50 +00:00
13a25c647f [export] improve binary op fast path broadcast check (#121546)
# Context
I believe we have an incorrect guard being created during FakeTensor's binary op fast path.

Consider this case
```
# op.shape: (10, 192); final_shape: (s0, 10, 192)
# Guard Ne(s0, 10) is created when we create SymBool(10 == s0)
if isinstance(op, torch.Tensor) and op.shape == final_shape:
    break
```

As of right now, `op.shape == final_shape` checks whether one of the binary op's operands has the same shape as the binary op's output.
* If one of them is a dynamic shape, then we'll create a guard via`SymBool` creation (i.e. `s0 == 10`).
* If the `SymBool` expr resolves to `false`, then we'll create the guard `Ne(s0, 10)`.

This is a problem when the number of dimensions isn't the same between `op.shape` & `final_shape`. Take the case above for example: `op.shape: (10, 192); final_shape: (s0, 10, 192)`. Although the shapes aren't the same, that doesn't necessarily mean that `s0 != 10`.

Some thoughts (feel free to ignore): what if the number of dimensions is equal but one of the shapes has symbols? Here are three cases:
  1. `op.shape: (9000, 10, 192); final_shape: (s0, 10, 192)` -- not broadcastable.
  2. `op.shape: (1, 10, 192); final_shape: (s0, 10, 192)` -- 0/1 specialization wins?
  3. `op.shape: (100, 10, 192); final_shape: (s0, 10, 192) where s0 = 100` -- Ask user to mark `s0` as a constant.

# Test
```
$ TORCHDYNAMO_VERBOSE=1 PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_dynamic_shapes.py -k test_export_fast_binary_broadcast_check_dynamic_shapes

torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (dim0)! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of dim0 = L['a'].size()[0] in the specified range 3 <= dim0 <= 1024 satisfy the generated guard Ne(L['a'].size()[0], 3).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121546
Approved by: https://github.com/aakhundov
2024-03-09 01:49:42 +00:00
d482614fec [DCP] Makes fsspec public (#121508)
Fixes #118033

Also removes the `_checkpointer.py` class
Original PRs:
- https://github.com/pytorch/pytorch/pull/121330
- https://github.com/pytorch/pytorch/pull/121329

We're also disabling `test_fsdp` since it is failing on random PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508
Approved by: https://github.com/fegin
2024-03-09 01:14:18 +00:00
6791b0c09e Change default torch_function behavior to be disabled when torch_dispatch is defined (take 2) (#120632)
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
2024-03-09 01:08:37 +00:00
ca9678405a [CUDA graphs] Pool argument for make_graphed_callables (#121475)
It is just a nice feature to have for situations where users want multiple graph captures and/or graphed callables to share the same memory pool.
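A hedged sketch of sharing one pool between a graphed callable and a manual capture (requires CUDA; the keyword name follows the description above):

```python
import torch

pool = torch.cuda.graph_pool_handle()

lin = torch.nn.Linear(64, 64, device="cuda")
x = torch.randn(8, 64, device="cuda")
graphed_lin = torch.cuda.make_graphed_callables(lin, (x,), pool=pool)

g = torch.cuda.CUDAGraph()
y = torch.randn(8, 64, device="cuda")
with torch.cuda.graph(g, pool=pool):  # same memory pool as the graphed callable
    out = y * 2
```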

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121475
Approved by: https://github.com/eellison, https://github.com/eqy
2024-03-09 00:15:38 +00:00
b2f19dd284 [C10d][UCC] Retain CUDA context in progress_loop (#121446)
UCC requires a CUDA context to be present, while `progress_loop` (f61192b014/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L333)) runs on a side thread that does not have the context present (even though it sets the device).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121446
Approved by: https://github.com/kwen2501
2024-03-09 00:09:47 +00:00
ed8eebd1c2 Changed cublas reproducibility URL (#121534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121534
Approved by: https://github.com/Skylion007
2024-03-08 23:46:21 +00:00
b0a0850a5c [DCP] Replaced storage() with untyped_storage() (#121538)
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]:  if tensor.storage().size() != tensor.numel():
```
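A minimal sketch of the replacement pattern; note that untyped storage sizes are reported in bytes, so the element size has to enter the comparison:

```python
import torch

tensor = torch.randn(8)
# Old (warns): tensor.storage().size() != tensor.numel()
if tensor.untyped_storage().size() != tensor.numel() * tensor.element_size():
    print("tensor does not own a dense storage of exactly numel elements")
```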

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
2024-03-08 23:46:17 +00:00
8887c95004 [inductor] Skip welford combine on first reduction loop iteration (#121488)
On the first iteration we short circuit `welford_reduce` since we know
the accumulators are filled with the default values.

This is split out from #120330 to hopefully avoid the meta-internal failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121488
Approved by: https://github.com/lezcano
2024-03-08 23:40:48 +00:00
fe78cf040b [profiler] add a function to allow adding preset user-defined metadata to traces (#121487)
Summary:
The `add_metadata_json` function in the profiler only works when called during trace collection. However, sometimes we want to pass in some user-defined metadata and amend it to the trace before trace collection starts, e.g. when the profiler is defined.
This PR adds a function `preset_metadata_json` for this purpose. The preset metadata will be stored and amended to the trace later.
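A hedged usage sketch; whether the new function lives on the profiler object and its exact argument names follow the description above and are assumptions here:

```python
import torch
from torch.profiler import profile

prof = profile()
# Assumed shape of the API: metadata is registered before collection starts,
# then attached to the trace when it is produced.
prof.preset_metadata_json("run_config", '{"model": "demo", "batch_size": 8}')
with prof:
    torch.randn(128, 128) @ torch.randn(128, 128)
prof.export_chrome_trace("trace.json")
```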

Differential Revision: D54678790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121487
Approved by: https://github.com/aaronenyeshi
2024-03-08 23:18:48 +00:00
9eb8fae02d Revert "Fix round robin sharding (#121022)"
This reverts commit effdea5fc62c6bf13cb8035f7bfcc205f05a8b6a.

Reverted https://github.com/pytorch/pytorch/pull/121022 on behalf of https://github.com/clee2000 due to made sharding really uneven ([comment](https://github.com/pytorch/pytorch/pull/121022#issuecomment-1986552662))
2024-03-08 23:16:24 +00:00
bc02fca358 [dtensor] to_local backward grad placement passthrough (#121474)
to_local accepts a `grad_placements` argument if the user chooses to pass one; previously
we enforced the grad_out to have the "same" placement as the current
DTensor for safety.

But I realized that we DO NOT need to enforce this constraint. Why?
The backward placement does not need to be the same as the forward tensor placement; this
is already the case for param vs. param.grad (i.e. param can be replicate
and grad can be partial), so we should not restrict this for activation
vs. activation grad either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474
Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang
2024-03-08 23:11:49 +00:00
9373ad0bb8 Switch cudagraph backend to cudagraph trees (#121019)
Switch torch.compile(..., backend="cudagraphs") to use cudagraph trees. Enabled a few tests in cudagraph_trees; note that there is another existing test suite for the cudagraphs backend: https://github.com/pytorch/pytorch/blob/main/test/dynamo/test_cudagraphs.py.

This is basically the inductor cudagraphs without inductor.
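A minimal usage sketch (requires a CUDA device; the backend name comes straight from the description above):

```python
import torch

def f(x):
    return torch.sin(x) + torch.cos(x)

compiled = torch.compile(f, backend="cudagraphs")
x = torch.randn(1024, device="cuda")
print(compiled(x).shape)
```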

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121019
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #121017, #121018
2024-03-08 22:56:26 +00:00
7b3febdca7 Change assertion throw to error message for const_run_impl call. (#121396)
Summary:
Currently we do not have an easy mechanism to distinguish between models created with some specific config.
We use a warning instead of failing directly.

Test Plan: Messaging change only.

Reviewed By: aakhundov

Differential Revision: D54622522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121396
Approved by: https://github.com/chenyang78
2024-03-08 22:48:43 +00:00
038b2e8780 [c10d] Add complex support for P2P (#121240)
Fixes the following error when `tensor` is a complex tensor:
```
[rank0]:     return pg.send([tensor], dst, tag)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Unconvertible NCCL type ComplexFloat
```
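A hedged two-rank sketch of the now-supported pattern (run under torchrun with NCCL and one GPU per rank):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.randn(4, dtype=torch.complex64, device="cuda")
if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
dist.destroy_process_group()
```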

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121240
Approved by: https://github.com/shuqiangzhang
2024-03-08 22:47:49 +00:00
4af0e634bf Add Cudagraphs disable checking (#121018)
Adds the same cudagraphs disable checking from inductor - cudagraph trees to cudagraphs backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121018
Approved by: https://github.com/ezyang
ghstack dependencies: #121017
2024-03-08 22:47:24 +00:00
7d0ad5c6f0 [FSDP2] Zeroed padded tensor in _apply (#121509)
This PR replaces the `Tensor.resize_` call with explicit zeroing of the padded tensor. Uninitialized padding is not good since it can give false-positive NaNs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121509
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2024-03-08 22:31:19 +00:00
f2d5e96db4 [export] Add docs for 2.3 release (#121466)
- Added docs about non-strict export
- Added example using derived dims
- Added api docs for ep.run_decompositions() (https://github.com/pytorch/pytorch/issues/119480)
- Tried to include/cover everything in https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121466
Approved by: https://github.com/zhxchen17
2024-03-08 22:29:48 +00:00
2c2d6ce515 Revert "CI sanity check test for env vars (#120519)"
This reverts commit f43b9c56c598b3a0f4d8e1d85f1e67b8f273d235.

Reverted https://github.com/pytorch/pytorch/pull/120519 on behalf of https://github.com/clee2000 due to broken on slow d27509c384 https://github.com/pytorch/pytorch/actions/runs/8208843198/job/22453617568 ([comment](https://github.com/pytorch/pytorch/pull/120519#issuecomment-1986480624))
2024-03-08 22:01:35 +00:00
35d3adb4b0 Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
# Motivation

In the backward pass of per-parameter-sharding FSDP, each rank performs reduce-scatter to sync gradients across ranks. A rank chunks each gradient tensor into `world_size` slices along the 0th dimension and concatenates all slices along the 1st dimension. Gradient tensors are padded before concatenation when tensor.size(0) % world_size != 0.

### Example 1
Consider `world_size=3` and tensors A (2x4), B (3x3), C (1x2):

Input tensors:
```
AAAA   BBB   CC
AAAA   BBB
       BBB
```

Reduce-scatter-copy-in Output:
```
AAAABBBCC
AAAABBB00
0000BBB00
```

### Example 2
Consider `world_size=2` and tensors A (2x4), B (3x3), C(1x2), D(4x2):

Input tensors:
```
AAAA   BBB   CC   DD
AAAA   BBB        DD
       BBB        DD
                  DD
```

Reduce-scatter-copy-in first pad:
```
AAAA   BBB   CC   DD
AAAA   BBB   00   DD
       BBB        DD
       000        DD
```

Then chunk and cat along dim as the output:
```
AAAABBBBBBCCDDDD
AAAABBB00000DDDD
```

The performance of reduce-scatter-copy-in is critical to per-parameter-sharding FSDP. However, implementing reduce-scatter-copy-in by composing existing ATen ops involves `cat` and irregular `pad`, leading to redundant data copies and unsatisfactory performance.

# PR
We provide aten native support for reduce-scatter-copy-in, namely `_chunk_cat()`:

```
_chunk_cat(Tensor[] tensors, int dim, int num_chunks) -> Tensor
```

This PR includes the registration of `_chunk_cat` and `_chunk_cat.out`, OpInfo tests, and a basic implementation composing existing ATen ops.
In the next PR, we will add the CUDA implementation. Compared with baselines composing existing ATen ops, the `_chunk_cat()` CUDA implementation improves copy bandwidth from 498 GB/s to 966 GB/s on a production benchmark.
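A hedged sketch reproducing Example 1 above with the new (private, underscore-prefixed) op; the expected output layout follows the examples in this description:

```python
import torch

A = torch.ones(2, 4)          # the "AAAA" rows
B = torch.full((3, 3), 2.0)   # the "BBB" rows
C = torch.full((1, 2), 3.0)   # the "CC" row
out = torch._chunk_cat([A, B, C], dim=0, num_chunks=3)
print(out.shape)  # expected torch.Size([3, 9]) per Example 1
```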

## Requirements on input

1. If input tensors have different ndims, dim should be non-negative and less than the ndims of every input tensor. If all input tensors have the same ndims, we support both negative and non-negative dim.
2. For wrapped_dim, all tensors should have the same size for 0,...,wrapped_dim-1 dimensions. No requirements for (wrapped_dim, ...)-th dimension.
3. Expect positive num_chunks
4. Expect non-empty input tensor list and each input tensor should have at least 1 element

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121081
Approved by: https://github.com/albanD
2024-03-08 21:48:12 +00:00
a656e12bf5 Disable test_torch_name_rule_map_updated in code (#120627)
I am getting tired of this test  ;-;

It gets disabled because it's broken, and then gets fixed, but something breaks it while it's disabled, so it's still broken and the infra is not handling it well.

Disable issue is https://github.com/pytorch/pytorch/issues/114831
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120627
Approved by: https://github.com/yanboliang
2024-03-08 21:00:51 +00:00
82bb06334d Update python binding for in-place foreach to return List[Tensor] (#121405)
fixes #104817
taking over #118622

```c++
// _foreach_atan_
static PyObject * THPVariable__foreach_atan_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "_foreach_atan_(TensorList self)",
  }, /*traceable=*/false);

  ParsedArgs<1> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
  }
  // aten::_foreach_atan_(Tensor(a!)[] self) -> ()

  // auto dispatch__foreach_atan_ = [](at::TensorList self) -> at::TensorList {
  auto dispatch__foreach_atan_ = [](at::TensorList self) -> void {
    pybind11::gil_scoped_release no_gil;
    at::_foreach_atan_(self);
  };
  dispatch__foreach_atan_(_r.tensorlist(0));
  PyObject* self_tensorlist = _r.args[0];
  Py_INCREF(self_tensorlist);
  return self_tensorlist;
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
...
// _foreach_div_
static PyObject * THPVariable__foreach_div_(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "_foreach_div_(TensorList self, ScalarList scalars)",
    "_foreach_div_(TensorList self, Tensor other)",
    "_foreach_div_(TensorList self, TensorList other)",
    "_foreach_div_(TensorList self, Scalar scalar)",
  }, /*traceable=*/false);

  ParsedArgs<2> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  if(_r.has_torch_function()) {
    return handle_torch_function(_r, nullptr, args, kwargs, THPVariableFunctionsModule, "torch");
  }
  switch (_r.idx) {
    case 0: {
      // aten::_foreach_div_.ScalarList(Tensor(a!)[] self, Scalar[] scalars) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, at::ArrayRef<at::Scalar> scalars) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, scalars);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.scalarlist(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 1: {
      // aten::_foreach_div_.Tensor(Tensor(a!)[] self, Tensor other) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, const at::Tensor & other) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, other);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.tensor(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 2: {
      // aten::_foreach_div_.List(Tensor(a!)[] self, Tensor[] other) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, at::TensorList other) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, other);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.tensorlist(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
    case 3: {
      // aten::_foreach_div_.Scalar(Tensor(a!)[] self, Scalar scalar) -> ()

      // auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> at::TensorList {
      auto dispatch__foreach_div_ = [](at::TensorList self, const at::Scalar & scalar) -> void {
        pybind11::gil_scoped_release no_gil;
        at::_foreach_div_(self, scalar);
      };
      dispatch__foreach_div_(_r.tensorlist(0), _r.scalar(1));
      PyObject* self_tensorlist = _r.args[0];
      Py_INCREF(self_tensorlist);
      return self_tensorlist;
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121405
Approved by: https://github.com/soulitzer
2024-03-08 21:00:01 +00:00
d27509c384 [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- requires users to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly add compiled_args + apply_with_saved implementations. This was the only way I could think of to keep things sound
- will throw if we can't hash the saved_data, i.e. for any unimplemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph is then called that contains a different custom autograd::Function with an identical implementation. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.
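As an illustrative sketch of the Python side, driving the backward pass through compiled autograd looks roughly like the code below. The `torch._dynamo.compiled_autograd.enable` entry point and the choice of `torch.compile` as the compiler function are assumptions for illustration; the C++ opt-in (is_traceable, compiled_args, apply_with_saved) described above happens separately in the custom autograd::Function.

```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

model = torch.nn.Linear(8, 8)
loss = model(torch.randn(4, 8)).sum()

# Capture the backward pass with compiled autograd so the autograd graph
# (including any opted-in C++ autograd::Functions) can be traced and compiled.
with compiled_autograd.enable(torch.compile):
    loss.backward()
```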

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-08 20:43:29 +00:00
f43b9c56c5 CI sanity check test for env vars (#120519)
Make a test that fails on purpose to trigger retries.  Check the opposite of success (that env vars exist)

It's a bit hacky because I want it to fail in the normal flow in order to trigger reruns, but I don't want to expose the failures to users since it's confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120519
Approved by: https://github.com/huydhn
2024-03-08 20:28:50 +00:00
75bb049d38 Skip AOT Inductor test_cond_* tests on ROCm (#121522)
Summary: The newly added tests in https://github.com/pytorch/pytorch/pull/121120 are failing in the `ciflow/periodic` jobs. Here we skip those on ROCm to avoid the need to disable those tests manually on ROCm.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond_nested
...
----------------------------------------------------------------------
Ran 6 tests in 72.122s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121522
Approved by: https://github.com/huydhn, https://github.com/malfet
ghstack dependencies: #121120
2024-03-08 20:13:55 +00:00
53d5276d69 Improve Dynamo support for torch function and class methods in general (#121365)
I was originally trying to solve https://github.com/pytorch/pytorch/issues/120799 but got sidetracked along the way.
This PR contains a couple fixes. Let me know if you want me to split them up!

- Properly handle invalid user code when "super()" is called from non-method/classmethod. It will now properly raise the same error as CPython
- Fix base VariableTracker `__str__` method shadowing all `__repr__` methods defined in subclasses
- Fix accessing a classmethod on a user object to bind "cls" and not "self"
- Fix custom class handling of super() call to properly handle mixed regular/class/static methods
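As a hedged illustration of the classmethod fix (not taken from the PR's tests), the pattern it targets looks like this; the exact behavior before the fix is an assumption:

```python
import torch

class Helper:
    @classmethod
    def scale(cls, x):
        # With the fix, "cls" is bound to Helper even when the classmethod
        # is looked up through an instance inside compiled code.
        return x * 2

def fn(x):
    return Helper().scale(x)

out = torch.compile(fn)(torch.randn(3))
```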

Locally, test_repros.py -k test_batch_norm_act still fails; the generated graph module is:
```
Call using an FX-traced Module, line 8 of the traced Module's generated forward function:
    x = self.forward(l_x_);  self = l_x_ = None
    x_1 = self.L__self___act(x);  x = None
```
note that "self" is being unset on the first line even though it is used on the second one.
For reference, this is the test c268ce4a6d/test/dynamo/test_repros.py (L1368-L1369)
I cannot figure out where the generated forward function comes from though, any hint would be welcome!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121365
Approved by: https://github.com/jansel
2024-03-08 20:03:49 +00:00
c0996866f4 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit 4305c64fea154ee1ab566e19bd7568753fc30916.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/izaitsevfb due to breaking internal builds(take 3) ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-1986338164))
2024-03-08 20:01:03 +00:00
c78f72d7e7 [c10d] Deprecate torch.distributed.pipeline (#121464)
In favor of PiPPy (Pipeline Parallelism for PyTorch) https://github.com/pytorch/PiPPy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121464
Approved by: https://github.com/wz337, https://github.com/awgu
2024-03-08 19:55:02 +00:00
27a0900946 Revert "[fx] Preserve Fx graph node order in partitioner across runs (#115621)"
This reverts commit 25c74a93cdf67545a4e3e1bedf2dbabbddfc5845.

Reverted https://github.com/pytorch/pytorch/pull/115621 on behalf of https://github.com/izaitsevfb due to depends on #120076, which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/115621#issuecomment-1986324796))
2024-03-08 19:50:57 +00:00
937e89f252 cudagraphs backend refactoring (#121017)
This is just some refactoring.. no functional changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121017
Approved by: https://github.com/ezyang
2024-03-08 19:47:41 +00:00
bc117898f1 Revert "Update XLA pin (#121501)"
This reverts commit 9d83f9dc0e4535f6535389201bc3c4a37f3305e3.

Reverted https://github.com/pytorch/pytorch/pull/121501 on behalf of https://github.com/malfet due to We are trying to revert underlying change first ([comment](https://github.com/pytorch/pytorch/pull/121501#issuecomment-1986289409))
2024-03-08 19:34:44 +00:00
22cd2658b4 Disable GroupRegistry's thread isolation by default (#121457)
Today `GroupRegistry` employs thread isolation by default, i.e. every thread sees its own process group registry. This is intended to work for one-device-per-process (for python use cases) and one-device-per-thread case (for custom native runtimes).

However, there's a problem - there are python use cases that initialize/register process groups in one thread and run collectives in another thread. This use case should be supported. However, since `GroupRegistry` employs thread isolation by default, collectives in different threads can't find the registered process groups.

This PR fixes the issue by:
- Make `GroupRegistry` work in non-thread isolation mode by default. This would match the behavior w/o the native process group registry.
- Introduces `set_thread_isolation_mode` so one-device-per-thread runtimes can enable thread isolation mode explicitly.

Differential Revision: [D54658515](https://our.internmc.facebook.com/intern/diff/D54658515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121457
Approved by: https://github.com/wanchaol
2024-03-08 19:31:24 +00:00
2c9c57c061 Only profiling when it's enabled. (#121404)
Summary:
The profiling, even when disabled, takes up about 1.5% cpu for a model I'm looking into.

This patch just splits into with/without profile runs.

The potential downside is that now the script can't enable profiling from within itself. That capability doesn't seem to be used anywhere. If that's a crucial use case, we can do something about it, but ideally we wouldn't.

Test Plan:
Link with profiles:
https://fburl.com/scuba/strobelight_services/ihxsl7pj

```
buck2 run fbcode//caffe2/test/cpp/jit:jit
```

Reviewed By: zhxchen17

Differential Revision: D54066589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121404
Approved by: https://github.com/zhxchen17
2024-03-08 19:23:14 +00:00
df06b94778 Add complex support to parametrizations.spectral_norm (#121452)
Fixes https://github.com/pytorch/pytorch/issues/121091
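A minimal sketch of what complex support enables here (illustrative only; the specific dtype and layer below are assumptions, not the PR's test code):

```python
import torch
from torch.nn.utils.parametrizations import spectral_norm

# Apply the spectral_norm parametrization to a complex-valued linear layer.
layer = spectral_norm(torch.nn.Linear(4, 4, dtype=torch.complex64))
y = layer(torch.randn(2, 4, dtype=torch.complex64))
```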

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121452
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2024-03-08 19:17:20 +00:00
0f3f4f5534 Revert "[nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)"
This reverts commit 4186c365313e909dfc8574c4469e5015439c2924.

Reverted https://github.com/pytorch/pytorch/pull/121204 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/121204#issuecomment-1986252526))
2024-03-08 19:08:50 +00:00
d55d803812 Add operator length hint support (#121495)
Seemed like an easy operator to squeeze into the 2.3 release. Added a simple test. Partially addresses #116396
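A hedged sketch of the pattern this enables under `torch.compile` (illustrative; the function below is not from the PR's test):

```python
import operator
import torch

def fn(x, seq):
    # operator.length_hint(seq) is now handled by dynamo
    return x + operator.length_hint(seq)

out = torch.compile(fn)(torch.ones(2), [1, 2, 3])  # adds 3
```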

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121495
Approved by: https://github.com/albanD
2024-03-08 19:08:33 +00:00
9b03a06288 [BE] [MPS] Fix out resize logic in torch.where (#121476)
By deleting `where_mps`  and registering MPS dispatch for `where_kernel`.
As result of this change resizing and type-checking logic is shared between MPS, CPU and  CUDA backends.

Add a test case to `TestMPS.test_where` (it should eventually be removed once `out` OpInfo testing is enabled for MPS).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121476
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473, #121494
2024-03-08 18:59:37 +00:00
9cc89970a9 [BE] Cleanup where_self_out (#121494)
- Avoid extra assignments by using a ternary instead of if-else
- Do not call type-cast unless it is needed (in most cases only one of the two arguments will need to be cast)
- Avoid an extra assignment for condition_ by calling `cast` under the `if` condition
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121494
Approved by: https://github.com/albanD, https://github.com/Skylion007
ghstack dependencies: #121473
2024-03-08 18:59:37 +00:00
1866ee6735 Enable out OpInfo testing for torch.where (#121473)
And fix behavior discrepancy between CPU and CUDA by raising an error when `out.dtype` is unexpected
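Roughly, the behavior being aligned looks like this (a sketch, not the PR's test code; the exact error raised for a mismatched dtype is an assumption):

```python
import torch

cond = torch.tensor([True, False, True])
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.zeros(3)

out = torch.empty(3)
torch.where(cond, a, b, out=out)  # matching dtype: fine on CPU and CUDA

bad_out = torch.empty(3, dtype=torch.int64)
# torch.where(cond, a, b, out=bad_out)  # unexpected out.dtype now errors on both backends
```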

Fixes https://github.com/pytorch/pytorch/issues/121397
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121473
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-03-08 18:59:37 +00:00
0dd21c0c34 Update Quantizable LSTM to support QAT (#121448)
Summary: Title.

Test Plan:
* CI
* N3684627

Differential Revision: D54653542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121448
Approved by: https://github.com/andrewor14
2024-03-08 18:55:50 +00:00
b52e0bf131 Deprecate torch.autograd.function.traceable, is_traceable (#121413)
- There are no usages of this internally.
- There are very few usages of this in OSS (most of these are forks of old
repositories).
- This flag doesn't do anything.

We're deprecating it to prevent confusion. I will delete it immediately
after the branch cut.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121413
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-03-08 18:41:07 +00:00
08460f4bae [tp] remove deprecated tp_mesh_dim arg (#121432)
This PR removes the deprecated tp_mesh_dim arg to prepare for release.
As we deprecated this arg for a while (by throwing deprecating
messages), we should remove it before the release

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
2024-03-08 17:46:44 +00:00
30982ce072 [tp] doc fixes (#121431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121431
Approved by: https://github.com/wz337
2024-03-08 17:46:44 +00:00
effdea5fc6 Fix round robin sharding (#121022)
Fix round robin sharding when there are no test times and sort_by_time=False

Adds more tests to test_test_selections for sort_by_time=False
Adds more checks to test_split_shards_random for serial/parallel ordering + ordering of tests
Refactoring of dup code

Tested locally by running `python test/run_test.py --shard 3 5` with no test times downloaded and checked that it wasn't an empty list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121022
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-03-08 17:01:34 +00:00
9d83f9dc0e Update XLA pin (#121501)
To 8078b8f38c

Fixes regression caused by https://github.com/pytorch/pytorch/pull/120076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121501
Approved by: https://github.com/Skylion007, https://github.com/aakhundov, https://github.com/albanD
2024-03-08 16:53:10 +00:00
a2a8c1fda0 [AOTDispatch] Return mutated inputs directly when keeping mutations (#120514)
Fixes #120242

The example from the issue now results in the graph
```python
def forward(self, arg0_1, arg1_1):
    sin = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None
    copy_ = torch.ops.aten.copy_.default(arg1_1, sin);  arg1_1 = sin = None
    return (copy_,)
```

and the corresponding inductor kernel eliminates the intermediate buffer
completely

```python
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (5, ), (1, ))
    assert_size_stride(arg1_1, (5, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        # Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, arg1_1, 5, grid=grid(5), stream=stream0)
        del arg0_1
    return (arg1_1, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120514
Approved by: https://github.com/ezyang, https://github.com/oulgen, https://github.com/lezcano
2024-03-08 16:33:26 +00:00
f7ec984b1b [DTensor][XLA] support XLA backend in distribute_module API (#121355)
Addresses #92909  cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121355
Approved by: https://github.com/wanchaol
2024-03-08 15:47:33 +00:00
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
c253d1c1db Add links to _ex variants in all linalg functions that support them (#121451)
Fixes https://github.com/pytorch/pytorch/issues/96632
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121451
Approved by: https://github.com/ezyang
2024-03-08 12:19:16 +00:00
975d428425 [Quant] Add the operator of decomposed fake quant per channel (#121297)
**Summary**
Add the operator of `quantized_decomposed.fake_quant_per_channel` and test the forward and backward of this op with comparing to ATen.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_fake_quant_per_channel
```

**Next Step**
Optimize the performance: from the generated code of forward and backward graph, the code didn't vectorize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121297
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-03-08 10:51:37 +00:00
8ed0932172 Update link to OpenVINO backend in torch.compiler.rst (#121303)
This is a permalink, so it will remain active regardless of documentation version changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121303
Approved by: https://github.com/soulitzer
2024-03-08 08:17:13 +00:00
b3f24b57fb fix accidental specialization with faketensor input checks (#121460)
Summary: When fake tensors are passed to a graph module and we do runtime assertions on them, we can accidentally trigger specialization guards. It's better to just relax the checking for these.

Test Plan: confirmed that problem in T181400371 is now fixed

Differential Revision: D54658960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121460
Approved by: https://github.com/angelayi
2024-03-08 08:02:37 +00:00
2e789ad522 [DCP][state_dict][doc] Update the distributed state_dict document (#121290)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121290
Approved by: https://github.com/LucasLLC
ghstack dependencies: #121273, #121276
2024-03-08 07:58:18 +00:00
e628f2cc66 suggested fixes for congruences (#121418)
Differential Revision: D54636152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121418
Approved by: https://github.com/zhxchen17
2024-03-08 07:19:51 +00:00
96ed37ac13 [DCP] Makes async_save public (#121325)
Makes async_save public

Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325
Approved by: https://github.com/wz337
ghstack dependencies: #121317
2024-03-08 05:13:13 +00:00
13366a101a [DCP][state_dict][doc] Fix the documents for distributed_state_dict (#121276)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121276
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #121273
2024-03-08 03:29:47 +00:00
72dd9b2430 [inductor] Make some improvements to FX graph caching (#117888)
Summary: This is in preparation to enable FX graph caching by default. First fix some bugs uncovered by running all unit tests under `test/inductor/`. I'll enable in a separate diff in case we need to revert. Summary of changes:
* Turn off caching for tests that require a compilation, e.g., when checking that a relevant counter was incremented
* Bypass caching when we see mkldnn tensors as constants (they currently don't serialize, so we can't save to disk)
* Include various global settings that could affect compilation in the cache key calculation.
* Handle a few config settings that break key calculation.
* Handle code paths where no ShapeEnv is available (the cache impl requires a shape env as part of handling guards)
* Skip caching when freezing is enabled (Freezing can embed constants that wouldn't be static across runs).
* Fix the clear() method to not throw when the cache /tmp dir doesn't exist

Test Plan: Ran all tests under `test/inductor/` twice with TORCHINDUCTOR_FX_GRAPH_CACHE=1 to exercise any test that might be affected by caching.
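For reference, a minimal sketch of opting into the cache from Python (assuming the `fx_graph_cache` inductor config flag; the test plan above uses the environment variable for the same purpose):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.fx_graph_cache = True  # same effect as TORCHINDUCTOR_FX_GRAPH_CACHE=1

@torch.compile
def f(x):
    return torch.sin(x) + 1

f(torch.randn(8))
```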

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117888
Approved by: https://github.com/eellison
2024-03-08 02:30:49 +00:00
909d73d8cb [DCP] Removes no_dist and coordinator_rank from public DCP API's (#121317)
[DCP] Removes `no_dist` and `coordinator_rank` from public DCP API's

Differential Revision: [D54591181](https://our.internmc.facebook.com/intern/diff/D54591181/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121317
Approved by: https://github.com/fegin
2024-03-08 02:14:12 +00:00
23ac0cd561 more passing dynamo tests (#121378)
These are just tests that I noticed passed on current main

Running:
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/dynamo/test_dynamic_shapes.py test/dynamo/test_compile.py -k 'test_export_decomp_dynamic_shapes or test_export_dynamic_dim_cleanup_dynamic_shapes or test_export_multi_dynamic_dim_constraint_dynamic_shapes or test_export_multi_dynamic_dim_unsafe_relationship_dynamic_shapes or test_export_no_raise_dynamic_shapes or test_export_preserve_constraints_as_metadata_scalar_dynamic_shapes or test_export_raise_on_relationship_dynamic_shapes or test_exported_graph_serialization_dynamic_shapes  or test_retracibility_dict_container_inp_out_dynamic_shapes or test_retracibility_nested_list_out_dynamic_shapes or test_exception_table_e2e_2_dynamic_shapes or test_exception_table_e2e_dynamic_shapes or test_exception_table_parsing_dynamic_shapes or test_inference_mode_dynamic_shapes or test_inplace_view_on_graph_input_dynamic_shapes or test_numpy_torch_operators_dynamic_shapes or test_py311_jump_offset_dynamic_shapes or test_lazy_module_no_cls_to_become_dynamic_shapes or test_batchnorm_e2e_dynamic_shapes or test_functools_wraps_dynamic_shapes or test_jit_trace_errors_dynamic_shapes or test_multi_import_dynamic_shapes or test_requires_grad_guards_with_grad_mode2_dynamic_shapese or test_dynamo_signatures'
```
BEFORE: `1 failed, 1 passed, 22 skipped, 1372 deselected`
AFTER: `24 passed, 1372 deselected`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121378
Approved by: https://github.com/oulgen
2024-03-08 01:59:01 +00:00
4186c36531 [nit][DCP][DSD] Remove Unused Variables in test_state_dict.py (#121204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121204
Approved by: https://github.com/Skylion007
2024-03-08 01:54:25 +00:00
0f8c9acc29 Revert "[fake_impls] Fix seed/offset device for attention kernels (#120839)" (#121447)
This reverts commit df3c8b8390bc601072b0ee9b2c39e07adf370fe2.

It regressed cudagraphs+PT2 performance on SDPA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121447
Approved by: https://github.com/Chillee
2024-03-08 01:48:23 +00:00
dc514b967e [dtensor][TP] check funcol calls and improve doc for loss parallel (#121366)
Since CommDebugMode is fixed, we can check that loss parallel is working as expected.

Under loss parallel, the forward computation should invoke 3 all-reduces, and the backward computation should invoke no functional collectives.

Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121366
Approved by: https://github.com/wanchaol
2024-03-08 01:41:31 +00:00
25c74a93cd [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
partitioner generates different graph in recompilation on each run
Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-03-08 01:37:53 +00:00
7dc1ab8989 make dynamo work with _LazyGraphModule.lazy_forward (#121259)
Fix https://github.com/pytorch/pytorch/issues/121198 .

We previously already triggered the real recompilation for LazyGraphModule when it runs through the dynamo context. But people may pass in LazyGraphModule._lazy_forward rather than the LazyGraphModule instance itself. This PR handles that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121259
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-03-08 01:37:39 +00:00
9bff1599b6 [Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373)
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
    - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
    - pass in `local_rank_id` from subprocess start

Test Plan: No functional changes.

Differential Revision: D54038627

#suppress-api-compatibility-check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
2024-03-08 01:37:34 +00:00
c86a1ce125 [dynamo][guards-cpp-refactor] Func defaults and kwdefaults accessor (#121338)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121338
Approved by: https://github.com/jansel
ghstack dependencies: #121327
2024-03-08 01:24:00 +00:00
79a04f2df9 [dynamo][guards-cpp-refactor] Permit dict version guard in DictGuardManager (#121327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121327
Approved by: https://github.com/jansel
2024-03-08 01:24:00 +00:00
962c1b4c69 Update XNNPACK revision to fcbf55a (#120583)
Update XNNPACK dependency to revision fcbf55a. This is part of a larger, synchronized update of the dependency version for PyTorch, ExecuTorch, and FB internal targets.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120583
Approved by: https://github.com/mcr229
2024-03-08 01:19:22 +00:00
090616d9a1 [Indutor] Support auto-tuned custom PT ops in abi compatible mode (#120877)
Differential Revision: D54344556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120877
Approved by: https://github.com/aakhundov
2024-03-08 01:16:57 +00:00
04a5d6e8d3 [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388
Approved by: https://github.com/jansel
2024-03-08 01:10:46 +00:00
5d8e4126b6 Fixup test_trace_rules (#121351)
Summary:
Fixes
https://www.internalfb.com/intern/testinfra/diagnostics/7599824578133672.281475099376195.1709732674/

(for some reason this test didn't run in OSS)?

Reached out to Yanbo Liang for additional context:
 {F1465435684}

Test Plan:
Local:
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16325548673376150/

Differential Revision: D54605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121351
Approved by: https://github.com/malfet, https://github.com/yanboliang
2024-03-08 00:50:45 +00:00
af62a70fab [export] Fix nn_module_stack in retracing (#121423)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1391916691446538/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121423
Approved by: https://github.com/zhxchen17
2024-03-08 00:34:11 +00:00
4f120dc2a6 Clean up mode handling in python dispatcher (#121083)
Things that were bad before this PR:
1. Temporarily unsetting functional tensor mode and proxy mode both had duplicate implementations
2. There are variants of mode-handling private utils that have duplicate implementations (different APIs calling the same repeated implementation, so I refactored them)
3. The _push_mode API used to take a dispatch key argument which is not necessary.
4. There are unused APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121083
Approved by: https://github.com/zou3519
2024-03-08 00:30:34 +00:00
0811f15270 [DCP][state_dict] Let _offload_state_dict_to_cpu to return the companion_obj if it exist. (#121273)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121273
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-03-08 00:24:29 +00:00
f76e541ec7 [BE] NO MORE discrepancy between forloop foreach capturable YAY (#121269)
and I will not let it happen again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121269
Approved by: https://github.com/albanD
ghstack dependencies: #121260, #121264
2024-03-08 00:00:30 +00:00
9d6c5be781 Add ASGD capturable API for forloop (#121264)
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.

This PR also fixes a discrepancy between `_init_group` and `__set_state__` because we have the constants live on params' device always.

There are some next steps though:
- ASGD can be made faster by keeping etas, mus, and steps on CPU when NOT capturable. (I had mistakenly thought foreachifying was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower). No one has complained yet though.  ¯\_(ツ)_/¯

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
2024-03-08 00:00:30 +00:00
24821fec26 Add RAdam capturable API for forloop (#121260)
Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260
Approved by: https://github.com/mlazos
2024-03-08 00:00:30 +00:00
b1657beac1 feat: Add min, max ranges to mark_dynamic API (#119737)
Fixes https://github.com/pytorch/pytorch/issues/115137

This PR adds:

- mark_dynamic API will accept `min`, `max` values to create a bounded constraint on the dim.
- test case in test_misc.py which checks if `ConstraintViolationError` is triggered if `torch.compile` gets a input dimension out of bounds.
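A minimal sketch of the bounded form of the API (illustrative; the bounds below are arbitrary):

```python
import torch
import torch._dynamo

x = torch.randn(8, 32)
# Constrain dim 0 to stay within [2, 64] while remaining dynamic.
torch._dynamo.mark_dynamic(x, 0, min=2, max=64)

compiled = torch.compile(lambda t: t * 2)
compiled(x)
# Inputs whose dim 0 falls outside [2, 64] should trigger ConstraintViolationError.
```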

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119737
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-03-07 23:26:03 +00:00
e0c534fe02 Revert "[Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)"
This reverts commit 156954d6a2a05f3ce8288dd054691102e596e461.

Reverted https://github.com/pytorch/pytorch/pull/105590 on behalf of https://github.com/ezyang due to https://github.com/pytorch/pytorch/issues/121288#issuecomment-1981980699 ([comment](https://github.com/pytorch/pytorch/pull/105590#issuecomment-1984745827))
2024-03-07 23:06:29 +00:00
3d089de851 Add torch.cond support to AOT Inductor (#121120)
Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).

Notable additions:

- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).

- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.

- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).

- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.
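For orientation, a small `torch.cond` program of the kind exercised here looks roughly like the sketch below; the export call is just one way to feed such a module into the AOT Inductor flow and is not the PR's exact test:

```python
import torch

class M(torch.nn.Module):
    def forward(self, p, x):
        def true_fn(x):
            return torch.sin(x)

        def false_fn(x):
            return torch.cos(x)

        return torch.cond(p, true_fn, false_fn, (x,))

ep = torch.export.export(M(), (torch.tensor(True), torch.randn(4)))
```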

Codegen examples from the new unit tests:

- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-07 22:39:57 +00:00
26740f853e Remove unnecessary use of ctx.resolve_tools. (#120493)
In this case, it's simpler to use ctx.actions.run(executable = ...), which already ensures that the runfiles associated with the executable are present.

(It's also possible to use ctx.actions.run_shell(tools = ...) with a custom command line, but it's unclear to me that indirecting through the shell is needed here.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120493
Approved by: https://github.com/ezyang
2024-03-07 22:33:17 +00:00
d14d62b7aa [dynamo] add more refleak tests (#120657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120657
Approved by: https://github.com/jansel
2024-03-07 22:25:43 +00:00
6490441d8f Remove dead get_shape_groups (#120813)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120813
Approved by: https://github.com/albanD
2024-03-07 22:20:30 +00:00
18d574a07a [Inductor] Use indices for constants in triton_meta (#121427)
@bertmaher pointed out that constants are passed with their indices, not their names. Looking at triton source, this appears to be true 392370b303/python/triton/runtime/jit.py (L381-L385)
I'm guessing both indices and names work here, but let's be consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121427
Approved by: https://github.com/aakhundov
2024-03-07 21:59:43 +00:00
f61192b014 Fix for Wait kernel lowering in inductor not accepting MultiOutputs from non-collective calls (#121428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121428
Approved by: https://github.com/yifuwang
2024-03-07 21:29:25 +00:00
76f1461892 [export] Serialize union fields with single entry dict. (#121263) (#121337)
Summary:

remove "$type" and "$value" fields, instead only serialize as {type: value} for union fields directly.

bypass-github-export-checks

Test Plan: CI

Differential Revision: D54600943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121337
Approved by: https://github.com/tugsbayasgalan
2024-03-07 21:24:28 +00:00
4c58f2b675 [PyTorch] Use uint32_t for ProcessedNode::num_outputs (#121335)
We already use uint32_t for indexing, and the notion of a single graph node with more than four billion outputs stretches credulity.

Differential Revision: [D54598821](https://our.internmc.facebook.com/intern/diff/D54598821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121335
Approved by: https://github.com/Skylion007
2024-03-07 21:15:05 +00:00
ea8f6e2e54 Subclass view fake-ification via reified ViewFuncs (#118405)
This PR:
* Uses reified ViewFuncs to swap in fake tensors / symbolic SymInts for view replay during subclass view fake-ification
* Enables automatic dynamic on view bases -> fakeifies according to the resultant symbolic context instead of the old "all-static" approach
* Covers the following view types:
    * subclass -> dense
    * dense -> subclass
    * subclass -> subclass
* Dense -> dense views are handled the old way via an `as_strided()` call, as it's likely there is no view func available

Differential Revision: [D54269082](https://our.internmc.facebook.com/intern/diff/D54269082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118405
Approved by: https://github.com/ezyang
2024-03-07 19:56:16 +00:00
63ec5cd158 TD Heuristic for tests mentioned in PR body, less verbose TD printing (#120621)
Move tests that are mentioned in PR body or commit message to front.  Also attempts to find any issues/PRs mentioned in the PR body and search for those too (ex if you link a disable issue and that issue contains the test file that it was failing on)

looking for: dynamo/test_export_mutations

Also removes some printed information in TD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120621
Approved by: https://github.com/osalpekar
2024-03-07 19:36:11 +00:00
c7a65f58b0 [CI] Script to fetch creds from current AWS session (#121426)
Some implementations, like OpenDAL, do not work with AWS IMDSv2; this script bridges the gap and enables more recent `sccache` releases (which switched from simple-s3 to OpenDAL) to work in the current CI system

When launched it prints something like:
```
export AWS_ACCESS_KEY_ID=XXXXX
export AWS_SECRET_ACCESS_KEY=YYYY
export AWS_SESSION_TOKEN=ZZZZ
```
which can be `eval`ed, after which sccache can use those credentials.

Validated in https://github.com/pytorch/pytorch/pull/121323
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121426
Approved by: https://github.com/Skylion007
2024-03-07 19:25:54 +00:00
2b1661c7a0 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit 05c256849b464deee16ccd70152fd54071c6c79c.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D54617701 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1984214079))
2024-03-07 18:53:51 +00:00
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
83d095c213 [BE] Remove unnecessary requires_cuda in common_optimizers.py (#121249)
@mlazos had already added the needed decorator on the test itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121249
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/albanD
ghstack dependencies: #121183
2024-03-07 17:57:02 +00:00
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.
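A minimal usage sketch of the new flag (illustrative only; capturable optimizer state is expected to live on a CUDA device):

```python
import torch

model = torch.nn.Linear(8, 8, device="cuda")
opt = torch.optim.Adamax(model.parameters(), lr=1e-3, capturable=True)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()  # state tensors stay on device, so the step can be CUDA-graph captured
```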

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00
af88425cdc Forward fix lint after 121202 (#121425)
Forward fix after #121202, where the lintrunner job failed due to being unable to checkout the pytorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121425
Approved by: https://github.com/ezyang, https://github.com/aakhundov, https://github.com/malfet
2024-03-07 17:20:13 +00:00
suo
c3c15eb9a6 [export] update docs to not export raw functions (#121272)
as title

Differential Revision: [D54555101](https://our.internmc.facebook.com/intern/diff/D54555101/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121272
Approved by: https://github.com/zhxchen17
2024-03-07 17:18:07 +00:00
862b99b571 Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 3239f86a3df133b5977d988324639e0de7af8749.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/malfet due to Breaks internal tests, likely due to the increased memory requirements ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1983875400))
2024-03-07 16:16:07 +00:00
eea37c6db4 [profiler] record nccl version in distributed info (#121044)
Summary: Add a field of NCCL version in distributed info if backend is NCCL

Differential Revision: D54432888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121044
Approved by: https://github.com/aaronenyeshi
2024-03-07 15:56:02 +00:00
cyy
3aa512cd72 [Clang-tidy header][23/N] Enable clang-tidy coverage on aten/src/ATen/*.{cpp,h} (#121380)
This PR finishes the works beginning with #https://github.com/pytorch/pytorch/pull/120763 by enabling clang-tidy on aten/src/ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121380
Approved by: https://github.com/Skylion007
2024-03-07 15:11:07 +00:00
9a45001905 [dynamo] relax missing symbols runtime assert (#121339)
Differential Revision: [D54603361](https://our.internmc.facebook.com/intern/diff/D54603361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121339
Approved by: https://github.com/ezyang
2024-03-07 14:53:38 +00:00
0339f1ca82 [Inductor] Allocate another shard for testing cpp-wrapper JIT (#121310)
Summary: The ABI-compatible mode for cpp wrapper has not been turned on by default, so test it separately. Expect to add more tests for the shard.

Differential Revision: [D54617287](https://our.internmc.facebook.com/intern/diff/D54617287)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121310
Approved by: https://github.com/chenyang78
ghstack dependencies: #121309
2024-03-07 14:24:21 +00:00
7e598c0053 [Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309)
Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309
Approved by: https://github.com/chenyang78
2024-03-07 14:22:06 +00:00
57fc35a3af [ Inductor ] Shape padding honors output stride preservation (#120797)
This fix makes sure that shape padding honors inductors 'keep_output_strides' setting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120797
Approved by: https://github.com/eellison
2024-03-07 13:52:29 +00:00
cyy
4305c64fea Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-07 09:52:21 +00:00
1ce5049692 [inductor] fix the layout problem for nll_loss2d_backward (#121173)
Fixes https://github.com/pytorch/pytorch/issues/120759 .

The CUDA implementation of nll_loss2d_backward.default requires the 'self' tensor to be contiguous. This implicit assumption may be broken by layout optimizations. The fix here is to add the constraint when we explicitly define the fallback for the op.

Not sure if we can improve the cuda kernel to relax the constraint though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121173
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-07 09:05:07 +00:00
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
e8e3049f57 [FSDP2] Relaxed check for parent mesh (#121360)
Mixing 1D and 2D `DTensor`s in the same sharded state dict should be okay, so we can remove the check that a parameter for FSDP to shard must be a `DTensor` if passing a child mesh to FSDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121360
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #120351, #121328
2024-03-07 08:09:25 +00:00
db36d21f5c Add SDPA pattern for HuggingFace models BF16 (#121202)
### Description

- Add pattern for bf16 input type with fp32 attention mask. (Example model: ElectraForCausalLM)
- Add pattern with batch_size=1 to avoid some clones in graph. (Example model: text-classification+prajjwal1-bert-tiny)

### Newly matched models
Dtype: bf16, machine: SPR

#### Dynamo HuggingFace models

- ElectraForCausalLM (speedup=2.09x)
- ElectraForQuestionAnswering (speedup=4.22x)
- AlbertForQuestionAnswering (speedup=1.36x)
- AlbertForMaskedLM (speedup=1.39x)

#### OOB HuggingFace models

- multiple-choice+google-electra-base-discriminator
- text-classification+prajjwal1-bert-tiny
- text-classification+prajjwal1-bert-mini
- text-classification+google-electra-base-generator
- text-classification+bert-large-cased
- casual-language-modeling+xlm-roberta-base
- text-classification+roberta-base
- text-classification+xlm-roberta-base
- text-classification+albert-base-v2
- token-classification+google-electra-base-generator
- masked-language-modeling+bert-base-cased

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121202
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-07 07:40:00 +00:00
953c6c37cb Wrap remote cache creation with a try-catch (#121340)
Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code with a try-catch.

Test Plan: CI

Differential Revision: D54604339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
2024-03-07 07:05:49 +00:00
291ce86a6c Modify StorageImplCreateHelper (#118459)
I want to use tensor.untyped_storage()[a:b] for the ``PrivateUse1`` backend but it fails. The code goes into ``THPStorage_get``:
bb6eba189f/torch/csrc/Storage.cpp (L525-L540)

Here ``torch`` will create a new ``c10::StorageImpl`` but does not take the ``PrivateUse1`` backend into account.
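For reference, the slicing pattern in question looks like this on CPU (a sketch; the point of the change is to make the same path respect a custom PrivateUse1 backend):

```python
import torch

t = torch.arange(8, dtype=torch.uint8)
storage = t.untyped_storage()
chunk = storage[2:6]   # goes through THPStorage_get and builds a new StorageImpl
print(len(chunk))      # 4 bytes
```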

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459
Approved by: https://github.com/albanD
2024-03-07 06:26:55 +00:00
f848e9c646 [Quant][Inductor] Fix q/dq per channel lowering with 64-bit qparams (#120984)
Fixes #120869

Fix lowering of `quantize_per_channel` and `dequantize_per_channel` with float64 scale and int64 zero point.
The generated code is incorrect without explicit type conversion. Add type conversions to the lowering pass, i.e., float64 (double) -> float32 and int64 -> int32.

**Test plan**
python test/inductor/test_cpu_repro.py -k test_per_channel_fake_quant_module_uint8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120984
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-07 06:23:52 +00:00
4f9d4e1ab0 [DTensor][XLA] refactor DTensor _xla API (#113214)
In response to the change pytorch/xla#5776 and #92909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113214
Approved by: https://github.com/wanchaol
2024-03-07 06:18:05 +00:00
cyy
c723514ef4 [CUDACachingAllocator] Simplify update_stat and avoid casts (#120964)
update_stat in CUDACachingAllocator.cpp was split into increase and decrease functions in this PR to simplify the implementation and avoid type casts throughout the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120964
Approved by: https://github.com/albanD
2024-03-07 05:55:38 +00:00
55232c4e1c Make CausalBias a torch.Tensor subclass again (#121358)
# Summary
This was removed in #116071 in order to enable compile support and re-adding this seems to still work with compile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121358
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-03-07 05:20:47 +00:00
df2ad1fecc [dtensor][debug] have visualize_sharding correctly print for sub-mesh DTensor (#121216)
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (the global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.
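A rough usage sketch (the import path and the 4-rank launch under torchrun are assumptions, not taken from this PR):

```python
# torchrun --standalone --nproc-per-node=4 this_script.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard
from torch.distributed._tensor.debug import visualize_sharding

mesh = init_device_mesh("cpu", (4,))
dt = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])
visualize_sharding(dt)  # now printed by the rank at mesh coordinate (0, ..., 0)
```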

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
2024-03-07 04:50:15 +00:00
77873f6fe5 [dtensor][1/N] add torchrec even row-wise sharding example (#120260)
**Summary**
our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.

This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.

**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
2024-03-07 04:50:15 +00:00
9cc0f23e5c [dtensor][debug] allow visualize_sharding to print header (#121179)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121179
Approved by: https://github.com/wanchaol
2024-03-07 04:50:06 +00:00
a2854ae904 Bugfix consume_prefix_in_state_dict_if_present function to keep the order of the state_dict (#117464)
This PR proposes to keep the same order as the original state_dict, as the issue creator proposed. It also fixes a bug concerning how ``_metadata`` is handled (see below), and includes other small changes to properly remove the prefix when it is present.

In the original code, ``_metadata`` was handled as a ``key``.

```
    # also strip the prefix in metadata if any.
    if "_metadata" in state_dict:
```

This is not the case, ``_metadata`` is actually an ``attribute``. Hence, the previous condition is changed to:

```
    # also strip the prefix in metadata if any.
    if hasattr(state_dict, "_metadata"):
```

This PR also includes the necessary test.
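A small usage sketch of the helper (illustrative; the dict below stands in for a DDP-style state_dict):

```python
import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present

state_dict = {"module.weight": torch.zeros(2, 2), "module.bias": torch.zeros(2)}
consume_prefix_in_state_dict_if_present(state_dict, "module.")
print(list(state_dict.keys()))  # ['weight', 'bias']; the original ordering is preserved
```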

Fixes #106942

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
2024-03-07 04:00:49 +00:00
edd80f87b8 Prevent infinite recursion within Tensor.__repr__ (#120206)
`Tensor.__repr__` calls functions which can perform logging, which ends up logging `self` (via `__repr__`), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).

The change to torch/testing/_internal/common_utils.py came up during testing - in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which caused an error.

Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
2024-03-07 02:24:45 +00:00
eb4d87f237 graph break on sparse tensors constructions (#120458)
Fix some tests in https://github.com/pytorch/pytorch/issues/119780
sparse_bsc_tensor is not supported
https://github.com/pytorch/pytorch/pull/117907

Also more about the issue here.
https://docs.google.com/document/d/1EIb4qG88-SjVFn5TloLERliYdxIu2hwYoAA8skjOVfo/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120458
Approved by: https://github.com/ezyang
2024-03-07 02:17:41 +00:00
1a28ebffb3 [TP] Introduce Sequence Parallel Style for Laynorm/RMSNorm/Dropout (#121295)
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of users manually using DTensors.
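A rough sketch of applying the new style (illustrative; the module names, the 8-rank mesh, and running under torchrun are assumptions):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

mesh = init_device_mesh("cuda", (8,))
block = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8)
# Shard the LayerNorm layers along the sequence dimension.
parallelize_module(block, mesh, {"norm1": SequenceParallel(), "norm2": SequenceParallel()})
```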

I call this SequenceParallel, which might cause some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, rather than
worrying about the issue of reusing the old name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
2024-03-07 02:04:59 +00:00
967dd31621 [cuDNN] Cleanup cuDNN < 8.1 ifdefs (#120862)
Follow-up of #95722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120862
Approved by: https://github.com/Skylion007
2024-03-07 01:46:25 +00:00
b9087f8571 [profiler] Add execution_trace_observer as an optional argument to profiler (#119912)
# Update Profiler API to collect Execution Traces

## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        excution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```

See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.

## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads.  It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)

Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks and helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]

## Why correlate Execution Trace with PyTorch/Kineto Trace

Both Execution Traces and Kineto traces provide different types of information, and combining them is valuable. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps would be inaccurate as several operations (NCCL, Embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.

## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section

# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check the collected ET has right range of events.
- Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference.

```
pytest test/profiler/test_profiler.py  -k test_execution_trace_with_kineto -rP

Running 1 items in this shard

test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-03-07 01:30:26 +00:00
eb1145436a [DCP] Adds main in format utils (#120128)
Adds main in format utils. Usage:

`python -m torch.distributed.checkpoint.format_utils dcp_to_torch dcp_dir torch_file.pt`

or

`python -m torch.distributed.checkpoint.format_utils torch_to_dcp torch_file.pt dcp_dir`

Differential Revision: [D53791355](https://our.internmc.facebook.com/intern/diff/D53791355/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120128
Approved by: https://github.com/fegin, https://github.com/wz337
2024-03-07 01:18:17 +00:00
cyy
5cc511f72f Use c10::irange and fix other index types in ForeachReduceOp.cu (#121123)
This PR follows the suggestions in #121066 and changes most loops to c10::irange.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121123
Approved by: https://github.com/soulitzer
2024-03-07 00:11:27 +00:00
c268ce4a6d Make ATen-cpu cuda/rocm agnostic (#121082)
Summary: This specific rocm logic will make aten-cpu code diverge between rocm and cuda. This is not good because we won't be able to share aten-cpu.so between rocm and cuda. More specifically, it will prevent us from building aten-hip by default, which would require us to set up rocm-specific rules - an extra burden for our build system.

Test Plan: sandcastle + oss ci

Differential Revision: D54453492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
2024-03-06 23:51:40 +00:00
e50ded03a6 Use type check for also is_not (#113859)
Handle `is_not` for:

9647a251cb/torch/_dynamo/variables/builtin.py (L1314-L1317)

I noticed that https://github.com/pytorch/pytorch/issues/111713 exists; I think there's no harm in landing this first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113859
Approved by: https://github.com/Skylion007
2024-03-06 23:12:42 +00:00
a88356f45c [dtensor] make add_.Tensor/div_.Scalar to be linear pointwise instead (#121294)
add_.Tensor and div_.Scalar should support linearity so that we can delay resolving the partial
results.

This fixes the additional collective in the layer norm layer that we have seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121294
Approved by: https://github.com/tianyu-l
2024-03-06 22:52:18 +00:00
2f064d895c Switch TORCH_TRACE to accept a directory by default (#121331)
Directory is better because it works smoothly with distributed
runs; otherwise you'd need to modify torchrun to set up distinct
log names for each file.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D54597814](https://our.internmc.facebook.com/intern/diff/D54597814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121331
Approved by: https://github.com/albanD
2024-03-06 22:46:18 +00:00
372f192050 [DTensor] Initialized RNG tracker if needed (#121328)
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).

```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
2024-03-06 22:21:44 +00:00
b0e2ed4d67 removing some macros (#120314)
Summary: Will be making some changes in the surrounding code; they are going to be easier without macros.

Differential Revision: D54001770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120314
Approved by: https://github.com/zhxchen17
2024-03-06 22:06:05 +00:00
69cedc16c5 Add padding dimension checks and tests (#121298)
Fixes #121093

Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```

To fix this, condition checks were added so that a RuntimeError with a descriptive message is raised instead, specifying the input dimensions required.
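
A minimal sketch of the new behavior (a hedged example; the exact error wording and the 4-D/5-D input requirement noted in the comment are assumptions based on the description above):
```python
import torch

x = torch.randn(2, 3)  # 2-D input; replication_pad3d expects a 4-D or 5-D input
try:
    # Previously this could segfault; with the added checks it raises a
    # RuntimeError describing the required input dimensions.
    torch._C._nn.replication_pad3d(x, (1, 1, 1, 1, 1, 1))
except RuntimeError as e:
    print(e)
```
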
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
2024-03-06 21:55:34 +00:00
d7a5e59647 [dynamo] support group=None when rewriting collectives (#121043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121043
Approved by: https://github.com/awgu
2024-03-06 21:37:19 +00:00
3fee05f242 Triage the remaining fallbacks (#121312)
Building off work from @amjames. There may be some misclassifications; feel free to flag them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121312
Approved by: https://github.com/jansel
2024-03-06 21:23:47 +00:00
e865700f6a [FSDP2] Added initial meta-device init support (#120351)
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.

We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor

We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.

```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
  model = ...
  parallelize_module(model, tp_mesh, ...)
  fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
  assert param.device.type == "meta"

model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
  if hasattr(module, "reset_parameters"):
    module.reset_parameters()
```

This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
2024-03-06 21:18:25 +00:00
3cf02c5e06 [Dev Container] Fix container build by preventing conda prompt (#121128)
Without this, the build freezes at the prompt:
  Proceed ([y]/n)?

I'm using rootless podman in vscode instead of docker, but I think that should not affect this.
...or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.

Btw, I also had to uncomment the line "remoteUser": "root" in devcontainer.json to finish the post-installation properly, but I guess there might be other workarounds - and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
2024-03-06 20:50:40 +00:00
58ac4a2007 Remove llava from ci_expected_accuracy as it's flaky (#121322)
https://github.com/pytorch/pytorch/pull/121029 added it to CI, but the test is flaky on HUD; it alternates between fail_accuracy and fail_to_run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121322
Approved by: https://github.com/desertfire
2024-03-06 20:47:01 +00:00
23fb37fa41 Revert "[export] Serialize union fields with single entry dict. (#121263)"
This reverts commit 7feabe9b73e6ba7724b62ea91df27049defdf378.

Reverted https://github.com/pytorch/pytorch/pull/121263 on behalf of https://github.com/osalpekar due to A large number of inductor benchmarking jobs failing starting this PR. See for details: 7feabe9b73 ([comment](https://github.com/pytorch/pytorch/pull/121263#issuecomment-1981680049))
2024-03-06 19:58:55 +00:00
76f3663efe Fixed a memory leak when calling from_numpy on a numpy array with an … (#121156)
…unsupported dtype.

Fixes #121138.

The lambda function that DECREFs the object is not called when the dtype conversion function throws. This PR moves the conversion before the INCREF, which prevents the memory leak.
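
A hedged repro sketch of the leak scenario, assuming a structured NumPy dtype is one of the unsupported dtypes and that the failed conversion raises an exception:
```python
import numpy as np
import torch

# A dtype torch does not support (structured dtype); the conversion throws.
arr = np.zeros(8, dtype=[("a", np.float32), ("b", np.int32)])

for _ in range(10_000):
    try:
        torch.from_numpy(arr)
    except Exception:
        # Before the fix, each failed call leaked a reference to `arr`
        # because the DECREF lambda never ran after the INCREF.
        pass
```
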

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121156
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-03-06 19:37:38 +00:00
360761f7d0 [Torchelastic] Create root log directory by default (#121257)
Summary:
After the refactoring in https://github.com/pytorch/pytorch/pull/120691, the default behavior unintentionally changed from creating a tempdir for logging to the torch Elastic Agent not capturing any logs.

Reverting the behavior to:
- make a tempdir when the log dir is not specified
- allow a non-empty root log dir
    - Note: in case the attempt folder exists, it will be pruned here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L294

Differential Revision: D54531851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121257
Approved by: https://github.com/d4l3k
2024-03-06 18:50:38 +00:00
418568d2e3 Add Float8 support to onnx exporter (#121281)
Fixes #106877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121281
Approved by: https://github.com/BowenBao, https://github.com/titaiwangms
2024-03-06 18:46:56 +00:00
cyy
5a2527db22 [Clang-tidy header][22/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#121102)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120763.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121102
Approved by: https://github.com/Skylion007
2024-03-06 18:36:31 +00:00
c5ef4df274 guard on grads being None in compiled optimizers (#121291)
Fixes #115607

We were missing guards when the grads were set to `None`. So if we compiled the optimizer with the grads set to their proper values, and then with the grads set to `None`, we'd continuously run the `None` version because all of the guards would pass and it would be ordered before the correct version in the cache.
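
A hedged sketch of the scenario being guarded against (the simple SGD setup is an assumption; the cache-ordering detail above is internal):
```python
import torch

p = torch.nn.Parameter(torch.randn(4))
opt = torch.optim.SGD([p], lr=0.1)

@torch.compile
def step():
    opt.step()

p.grad = torch.ones_like(p)
step()                           # compiled while a real grad is present

opt.zero_grad(set_to_none=True)  # grads are now None
step()                           # without a guard on "grad is None", both calls
                                 # could match the same cached version
```
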

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-06 18:33:23 +00:00
7feabe9b73 [export] Serialize union fields with single entry dict. (#121263)
Summary: Remove the "$type" and "$value" fields; instead, serialize union fields directly as {type: value}.

Test Plan: CI

Differential Revision: D54553770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121263
Approved by: https://github.com/tugsbayasgalan
2024-03-06 18:16:16 +00:00
c66d68ba51 [PT2] Add tolist() to FunctionalTensor for torch.export (#121242)
Adding tolist() to FunctionalTensor so torch.export can handle TorchRec data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121242
Approved by: https://github.com/ezyang
2024-03-06 18:10:44 +00:00
05c256849b [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- Requires users to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, plus possibly an additional compiled_args + apply_with_saved implementation. This was the only way I could think of to ensure soundness.
- Will throw if we can't hash the saved_data, i.e. for any type other than list and dict not implemented in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically fail silently if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hashing I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-06 18:01:56 +00:00
b27d76949b [ROCm] Enable several fake_crossref UTs on ROCm (#121112)
Enabled unit tests:

test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_norm_subgradients_at_zero_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_norm_nuc_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_amp_svd_cuda_float32
test_ops::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_svd_cuda_float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121112
Approved by: https://github.com/ezyang
2024-03-06 17:36:47 +00:00
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5b7d4aa412a3988d3d91ca4c5679303.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
8dd4b6a78c Fix venv compatibility issue by updating python_lib_path (#121103)
The path referenced by sys.executable is the absolute path of the executable binary for the Python interpreter, which may not be appropriate here. Instead, sys.base_exec_prefix is more suitable, and this change correctly resolves the library when using a venv. I have tested it with a venv created by rye.

https://docs.python.org/3.6/library/sys.html#sys.executable

> A string giving the absolute path of the executable binary for the Python interpreter, on systems where this makes sense. If Python is unable to retrieve the real path to its executable, [sys.executable](https://docs.python.org/3.6/library/sys.html#sys.executable) will be an empty string or None.

https://docs.python.org/3.6/library/sys.html#sys.exec_prefix

> A string giving the site-specific directory prefix where the platform-dependent Python files are installed; by default, this is also '/usr/local'. This can be set at build time with the --exec-prefix argument to the configure script. Specifically, all configuration files (e.g. the pyconfig.h header file) are installed in the directory exec_prefix/lib/pythonX.Y/config, and shared library modules are installed in exec_prefix/lib/pythonX.Y/lib-dynload, where X.Y is the version number of Python, for example 3.2.

https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix

> Set during Python startup, before site.py is run, to the same value as [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix). If not running in a [virtual environment](https://docs.python.org/3.6/library/venv.html#venv-def), the values will stay the same; if site.py finds that a virtual environment is in use, the values of [prefix](https://docs.python.org/3.6/library/sys.html#sys.prefix) and [exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.exec_prefix) will be changed to point to the virtual environment, whereas [base_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_prefix) and [base_exec_prefix](https://docs.python.org/3.6/library/sys.html#sys.base_exec_prefix) will remain pointing to the base Python installation (the one which the virtual environment was created from).
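
For illustration, a small snippet showing the difference inside a virtual environment (the paths in the comments are examples only):
```python
import sys

# Inside a venv, sys.executable points at the venv's interpreter, while
# sys.base_exec_prefix keeps pointing at the base installation that actually
# ships the platform-dependent library files.
print(sys.executable)        # e.g. /home/user/.venvs/proj/bin/python
print(sys.base_exec_prefix)  # e.g. /usr
```
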
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121103
Approved by: https://github.com/ezyang
2024-03-06 17:00:46 +00:00
a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00
54d92f2e37 Add jacrev support in torch.compile (#121146)
The changes are simple: moved a few entries in trace_rules.py and included tests that compare the graph generated by jacrev.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121146
Approved by: https://github.com/zou3519
2024-03-06 16:05:33 +00:00
49d1fd31cf Fuse nodes with sizes (s0*s1*...,) and (s0, s1, s2, ...) (#120077)
Description:
- This PR tries to fuse nodes with compatible sizes, for example `node1: (s0, s1, s2)` and `node2: (s0 * s1 * s2)`. On `main` these two nodes cannot be fused due to their different sizes. With this PR we can recompute node2's size, body, etc. using node1's indexing constraints and thus fuse the two nodes.
- This should only affect the cpu device.

Example:
```python
from unittest.mock import patch
import torch
from torch._inductor.graph import GraphLowering
from torch._inductor import config

# Force multple scheduler nodes creation to fuse them
config.realize_opcount_threshold = 1

@torch.compile(fullgraph=True, dynamic=True)
def fn(x: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    o1 = x * w1.view(1, 1, 1, -1)
    o2 = x * w2.view(1, 1, 1, -1)
    output = o1 + o2
    return output

in_nodes = []
outputs = []
run_node = GraphLowering.run_node

graph_lowering_obj = None

def run_node_alt(self, n):
    global graph_lowering_obj

    graph_lowering_obj = self
    in_nodes.append(n)
    output = run_node(self, n)
    outputs.append(output)

    return output

x = torch.rand(1, 3, 32, 32)
w1 = torch.randn(32)
w2 = torch.randn(32)

with patch.object(GraphLowering, "run_node", run_node_alt):
    fn(x, w1, w2)

print("graph_lowering_obj.buffers:", graph_lowering_obj.buffers)
print("graph_lowering_obj.scheduler:", graph_lowering_obj.scheduler.nodes)
```

Output on `main`:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1), SchedulerNode(name='buf2')]
```
Output on this PR:
```
graph_lowering_obj.buffers: [ComputedBuffer(name='buf0', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg1_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul,
  origins={mul}
)), ComputedBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(arg3_1, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(arg4_1, i3)
      tmp2 = tmp0 * tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=mul_1,
  origins={mul_1}
)), ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[1, s1, s0, s0], stride=[s0**2*s1, s0**2, s0, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf0, i3 + i1 * s0**2 + i2 * s0)
      tmp1 = ops.load(buf1, i3 + i1 * s0**2 + i2 * s0)
      tmp2 = tmp0 + tmp1
      return tmp2
  ,
  ranges=[1, s1, s0, s0],
  origin_node=add,
  origins={add}
))]
graph_lowering_obj.scheduler: [FusedSchedulerNode(nodes=buf0_buf1_buf2)]
```

Context:
While working on https://github.com/pytorch/pytorch/pull/120411 (the upsampling bicubic decomposition), I saw an extra for-loop in the generated C++ code summing up two buffers. Exploring the cause, it happened because the buffer's number of ops went beyond `config.realize_opcount_threshold`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120077
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/peterbell10
2024-03-06 12:19:45 +00:00
aa0b0944d5 [dynamo] Re-dispatch torch.Tensor.new into torch.Tensor.new_empty method. (#121075)
Fix: https://github.com/pytorch/xla/issues/6009

This PR adds another case to `TensorVariable.method_new` special case, where it
re-dispatches `new` into `new_empty`.

Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:

```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())

    # new_x.device() == "xla"
    # x.device() == "xla:0"

    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```

Resulting in the following error:

```python
Traceback (most recent call last):
  ...
  File "torch/_dynamo/utils.py", line 1654, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1655, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1776, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "torch/_dynamo/utils.py", line 1758, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
    return self.wrap_meta_outputs_with_default_device_logic(
  File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
    return tree_map(wrap, r)
  File "torch/utils/_pytree.py", line 900, in tree_map
    return treespec.unflatten(map(func, *flat_args))
  File "torch/utils/_pytree.py", line 736, in unflatten
    leaves = list(leaves)
  File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
    ) = FakeTensor._find_common_device(func, flat_args)
  File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
    merge_devices(arg)
  File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
    raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```

Using `new_empty` instead fixes this error because it uses the device from the source
tensor rather than inferring it from the current dispatch key set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel
2024-03-06 11:49:27 +00:00
e3bd6efe72 [dynamo][guards-cpp-refactor] Prevent duplication of leaf guards (#121164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121164
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147, #121154
2024-03-06 08:36:45 +00:00
b6b2d5b00a [dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121154
Approved by: https://github.com/jansel
ghstack dependencies: #121121, #121147
2024-03-06 08:36:45 +00:00
52d89d8491 [dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121147
Approved by: https://github.com/jansel
ghstack dependencies: #121121
2024-03-06 08:36:45 +00:00
af7f55ffc8 [dynamo][guards-cpp-refactor] Add argnames in pybind'ings (#121121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121121
Approved by: https://github.com/jansel
2024-03-06 08:36:45 +00:00
0b9bfcf9bb [non-strict export] support tensor attribute without other args (#121176)
Summary: Without args we have a hard time detecting fake modes. This causes a fake mode mismatch error in non-strict (specifically, `aot_export_module`) when the module contains tensor attributes, because we create a fresh fake mode when we cannot detect one. The fix is to pass the same fake mode throughout.

Test Plan: added test

Differential Revision: D54516595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121176
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-03-06 08:10:00 +00:00
8087912622 Revert "[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)"
This reverts commit 0ab2ec37383e44fa00c520de6e2b40845fccc6f3.

Reverted https://github.com/pytorch/pytorch/pull/120185 on behalf of https://github.com/briancoutinho due to This PR contains a list search in '_parse_kineto_events()' that can lead to very high cost of running this post trace, training jobs getting stuck for mins ([comment](https://github.com/pytorch/pytorch/pull/120185#issuecomment-1980180774))
2024-03-06 06:39:51 +00:00
099ff51d45 torch check the division by zero in batch_norm_update_stats (#120882)
Fixes #120803

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120882
Approved by: https://github.com/CaoE, https://github.com/malfet
2024-03-06 05:40:21 +00:00
2eec0e7c5f [BE] Remove __inline__ from __global__ (#121246)
in layer_norm_kernel.cu since the qualifier seems to be ignored according to:

```
[18/263] Building CUDA object
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/layer_norm_kernel.cu.o
/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"

/home/mkozuki/ghq/github.com/crcrpar/torch-3/aten/src/ATen/native/cuda/layer_norm_kernel.cu(300):
warning #20050-D: inline qualifier ignored for "__global__" function

Remark: The warnings can be suppressed with "-diag-suppress
<warning-number>"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121246
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-06 05:16:52 +00:00
31bfa59970 Capture primitive data type arguments for profiling python_function (#120949)
RECORD_FUNCTION in python_function only captures arguments that are Tensors. However, it is very common for users to pass non-tensor arguments to custom ops, for example, the sequence length in a GPT attention custom op. My previous PR tried to capture all non-tensor arguments, but it turned out to be very expensive in some cases.

This PR adds support for primitive arguments (or containers of primitives) in RECORD_FUNCTION.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949
Approved by: https://github.com/soulitzer
2024-03-06 05:09:22 +00:00
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
f72eb5ae4c __grid__constant is only supported on cuda version >= 11.8 (#121275)
Summary: Update the macros to avoid using __grid__constant when compiling for devices > sm80 with a CUDA version < 11.8.

Test Plan: buck2 build --keep-going --config buck2.log_configured_graph_size=true --flagfile fbcode//mode/dev fbcode//sigrid/predictor/client/python:ig_sigrid_client_pybinding

Differential Revision: D54556796

Co-authored-by: Driss Guessous <drisspg@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121275
Approved by: https://github.com/drisspg
2024-03-06 03:44:59 +00:00
dad1b76584 Introduce EphemeralSource for symbols that should be simplified out (#120948)
Context: view fake-ification should handle closed-over state in ViewFuncs for use in view replay by:
* fake-ifying tensors
* symbolicizing SymInts

This avoids invalid specialization during view replay. However, the symbols / tensors created as intermediates in the view chain should not stick around or be guarded on. This PR introduces an `EphemeralSource` intended to be used as a source for this purpose. It has the following properties:
* Considered first to be simplified out in symbol simplification logic
* Errors if guarded on

Differential Revision: [D54561597](https://our.internmc.facebook.com/intern/diff/D54561597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120948
Approved by: https://github.com/ezyang
2024-03-06 02:30:52 +00:00
d968fc442b [FSDP] restore fully_shard after exit from mock.patch (#121058)
Manually restore fully_shard after \_\_exit\_\_ from the mock.patch ctx. This will fix flaky CIs in trunk.
```
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py
```

This is a workaround to make mock.patch(fully_shard) work with multiple threads:
* thread 1 sets func.\_\_module\_\_[fully_shard] = patched function
* thread 2 reads func.\_\_module\_\_[fully_shard], thinks it is the original, and fails to restore fully_shard during \_\_exit\_\_
* this PR manually restores fully_shard after \_\_exit\_\_

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121058
Approved by: https://github.com/awgu
2024-03-06 02:14:59 +00:00
eqy
8dafc81ba9 [cuBLAS][cuBLASLt] Fix expected failures for int_mm on sm75 (turing) (#121277)
CC @malfet @atalman @ptrblck @tinglvv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121277
Approved by: https://github.com/malfet
2024-03-06 01:51:01 +00:00
ce6a7d56fc Don't merge qnnpack (#120676)
Summary: The qnnpack library merge fails on some applications. This fix implements the Android build team's recommendation to prevent merging for qnnpack.

Test Plan:
1. Measure the binary size impact
1. Release build failed previously; now it should succeed

Differential Revision: D54048156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120676
Approved by: https://github.com/kimishpatel
2024-03-06 01:42:13 +00:00
4b3903379a Add assign argument to torch.Tensor.module_load (#121158)
Make `torch.__future__.get_swap_module_params_on_conversion() == True` account for `assign` argument to `nn.Module.load_state_dict`

Similar to when `torch.__future__.set_swap_module_params_on_conversion()` is `False`, `assign=True` means that we do not incur a `self.copy_(other)` and the properties of `other` will be preserved
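
A minimal sketch of the intended behavior, assuming a plain float64 state dict for an `nn.Linear` (names and shapes are illustrative only):
```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
sd = {"weight": torch.randn(2, 2, dtype=torch.float64),
      "bias": torch.randn(2, dtype=torch.float64)}

# With assign=True there is no self.copy_(other), so the properties of the
# incoming tensors (here the float64 dtype) are preserved on the module.
m.load_state_dict(sd, assign=True)
print(m.weight.dtype)  # torch.float64
```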

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121158
Approved by: https://github.com/albanD
ghstack dependencies: #121157
2024-03-06 01:32:06 +00:00
27389e03f0 [easy] Fixed requires_grad preservation for nn.Module.load_state_dict(assign=True) (#121157)
Always preserve requires_grad of param in module. Documentation fixed in PR stacked above.
Also fix the test case to load a state_dict generated with `keep_vars=False` (the default).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121157
Approved by: https://github.com/albanD
2024-03-06 01:32:06 +00:00
87a533ed1b c10:intrusive_ptr, self assignment (#119275)
Summary:
In C++ books/sources, the self-assignment check is often considered a bad practice, since self-assignment is very unlikely.

See, for example, that libc++ doesn't have it:
cf94e0082e/libcxx/include/__memory/shared_ptr.h (L651)

How about we remove it?

Test Plan:
This check accounts for roughly 1% of the cycles assigned to intrusive_ptr::operator=
https://fburl.com/scuba/strobelight_services/9qqnrkdn

This is not a lot in absolute cycles, but since these are GPU machines, it can be substantial.

Differential Revision: D53471639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119275
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-03-06 01:11:56 +00:00
412c687e2e Fix permuted sum precision issue for lower precision on CPU (#108559)
Fixes #83149
There is a limitation of `TensorIterator` reductions:
The non-permuted input tensor will be coalesced down to a 2-d tensor by `TensorIterator`, whereas the permuted case may become a >2-d operation (for example, two reduced dimensions and a non-reduced dim).
Since the CPU reduction loop of `TensorIterator` only operates on two dimensions at a time, the intermediate sums will be truncated to lower precision.
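
A hedged illustration of the kind of discrepancy this addresses, comparing the contiguous and permuted reductions against a float64 reference (exact magnitudes depend on the random input):
```python
import torch

x = torch.randn(64, 128, 256, dtype=torch.bfloat16)
ref = x.double().sum()

# Before the fix, the permuted (non-contiguous) reduction could accumulate in
# lower precision than the contiguous one on CPU.
print((x.sum().double() - ref).abs())
print((x.permute(2, 0, 1).sum().double() - ref).abs())
```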

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108559
Approved by: https://github.com/mingfeima, https://github.com/peterbell10
2024-03-06 01:01:35 +00:00
34e3f6f3c9 fix segfault in torch.native_channel_shuffle when input is empty (#121199)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

fix https://github.com/pytorch/pytorch/issues/121092

`torch.channel_shuffle` already handles empty inputs correctly. `torch.native_channel_shuffle` bypassed the `numel == 0` check, which caused a division by zero in the underlying kernel.
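
A small sketch of the fixed case, assuming an empty batch dimension:
```python
import torch

x = torch.empty(0, 4, 2, 2)  # empty input (batch size 0)
# torch.channel_shuffle already handled this; native_channel_shuffle skipped
# the numel == 0 check and divided by zero inside the kernel.
out = torch.native_channel_shuffle(x, 2)
print(out.shape)  # torch.Size([0, 4, 2, 2])
```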

* __->__ #121199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121199
Approved by: https://github.com/malfet
2024-03-06 00:46:36 +00:00
8473cd92e4 remove compute capability 3.5 for CUDA 12 (#114930)
CUDA 12 has removed compute capability 3.5. NVCC throws the error: `nvcc fatal   : Unsupported gpu architecture 'compute_35'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114930
Approved by: https://github.com/malfet
2024-03-06 00:40:57 +00:00
d13ed8503c CI: Add aarch64 docker build and ciflow tags (#120931)
Adding workflows for the aarch64 Linux docker build with ACL installed as a system dependency.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120931
Approved by: https://github.com/atalman, https://github.com/malfet
2024-03-06 00:31:22 +00:00
cac36e232e [PyTorch] Split StaticModule out of test_static_runtime (#121028)
I want to use StaticModule in another (internal) test, so splitting it out.

Differential Revision: [D54384817](https://our.internmc.facebook.com/intern/diff/D54384817/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121028
Approved by: https://github.com/suo
2024-03-05 23:14:07 +00:00
f5391dad82 Update docs to point to new sdpa_kernel context manager (#121180)
# Summary

Updates the SDPA docs to fix some small inaccuracies and point to the new sdpa_kernel context manager. The enum-like SDPBackend type bound from C++ does not render its fields for some reason, so they are listed manually for now.
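
A hedged usage sketch of the context manager the docs now point to (the backend choice, CUDA availability, and tensor shapes are assumptions here):
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = k = v = torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.float16)

# Restrict scaled_dot_product_attention to a single backend for this region.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```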

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121180
Approved by: https://github.com/mikaylagawarecki
2024-03-05 22:19:48 +00:00
8bb3e0b643 [pytorch] Name the main and autograd threads for better debugging (#121170)
The main thread and the autograd threads are latency-critical. They launch CPU/GPU/accelerator kernels, and if for some reason they get preempted, the rank can become a straggler in a distributed training application. Naming these threads lets us debug performance issues that impact the latency-sensitive threads.

I used Kineto traces to verify if the thread names were propagated:

<img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f">

Also:

```
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3065920      C   ...me#python#py_version_3_10     1968MiB |
|    1   N/A  N/A   3065926      C   ...me#python#py_version_3_10     1978MiB |
|    2   N/A  N/A   3065930      C   ...me#python#py_version_3_10     2084MiB |
|    3   N/A  N/A   3065936      C   ...me#python#py_version_3_10     2016MiB |
|    4   N/A  N/A   3065939      C   ...me#python#py_version_3_10     1998MiB |
|    5   N/A  N/A   3065943      C   ...me#python#py_version_3_10     2070MiB |
|    6   N/A  N/A   3065948      C   ...me#python#py_version_3_10     2026MiB |
|    7   N/A  N/A   3065952      C   ...me#python#py_version_3_10     2070MiB |
+-----------------------------------------------------------------------------+
[me@myhost ~]$ ps -T -p 3065920
    PID    SPID TTY          TIME CMD
3065920 3065920 pts/14   00:01:04 pt_main_thread
...
3065920 3092181 pts/14   00:00:40 pt_autograd_d0
3065920 3092182 pts/14   00:00:00 pt_autograd_d1
3065920 3092183 pts/14   00:00:00 pt_autograd_d2
3065920 3092184 pts/14   00:00:00 pt_autograd_d3
3065920 3092185 pts/14   00:00:00 pt_autograd_d4
3065920 3092186 pts/14   00:00:00 pt_autograd_d5
3065920 3092187 pts/14   00:00:00 pt_autograd_d6
3065920 3092188 pts/14   00:00:00 pt_autograd_d7
...

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170
Approved by: https://github.com/albanD
2024-03-05 22:15:39 +00:00
24944f6717 [doc] Fix math display in ChannelShuffle doc (#121247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121247
Approved by: https://github.com/mikaylagawarecki
2024-03-05 21:30:51 +00:00
b3a9d677a3 [ez] Add super() calls in test_custom_ops (#121239)
Some disable issues are getting spammed.
Check that test_impl_invalid_devices gets skipped by the disable issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121239
Approved by: https://github.com/zou3519
2024-03-05 21:16:06 +00:00
34a28f01dd [Autograd] Improve error for leaf tensors as out argument to fallback (#121089)
Closes  #120988

Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```

Compared to functions that don't fallback, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```

This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace` which allows non-leaf tensors
that require grad to pass without error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
2024-03-05 21:13:27 +00:00
eae9751e82 Fix linalg_eigvals invalid use of composite dispatch key (#121142)
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
2024-03-05 21:13:27 +00:00
393b4ab432 Fixes issue_119785 (#121048)
Fixes #119785

- Removed all sentinel files of `test_causal_variants_.*`.

- The `test_causal_variants_causal_variant_` tests could pass after removing the dynamo_skips files.

- The `test_causal_variants_compile_causal_variant` tests fail with `PYTORCH_TEST_WITH_DYNAMO=1`. These tests already call torch.compile, so @skipIfTorchDynamo was added to skip them under `PYTORCH_TEST_WITH_DYNAMO`.

**Tests**
```
$ PYTORCH_TEST_WITH_DYNAMO=1 pytest test_transformers.py -v -k "test_causal_variants"
================================================================== test session starts ==================================================================
platform linux -- Python 3.10.13, pytest-7.4.0, pluggy-1.0.0 -- /home/shuqiyang/.conda/envs/pytorch/bin/python
cachedir: .pytest_cache
rootdir: /data/users/shuqiyang/pytorch
configfile: pytest.ini
collected 77250 items / 77218 deselected / 32 selected
Running 32 items in this shard

test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.7745s]                  [  3%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.8020s]                  [  6%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0385s] (Lower righ...) [  9%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.5046s]                  [ 12%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.6483s]                   [ 15%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.8537s]                   [ 18%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.8388s]                   [ 21%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.4859s]                   [ 25%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu SKIPPED [0.0084s] (Th...) [ 28%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu SKIPPED [0.0086s] (Th...) [ 31%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0081s] (Th...) [ 34%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu SKIPPED [0.0085s] (Th...) [ 37%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu SKIPPED [0.0082s] (Thi...) [ 40%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu SKIPPED [0.0085s] (Thi...) [ 43%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu SKIPPED [0.0081s] (Thi...) [ 46%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu SKIPPED [0.0085s] (Thi...) [ 50%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [9.4185s]                [ 53%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.4273s]                [ 56%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0280s] (Lower ri...) [ 59%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [8.0999s]                [ 62%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.3785s]                 [ 65%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.3818s]                 [ 68%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.3864s]                 [ 71%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.7668s]                 [ 75%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda SKIPPED [0.0089s] (...) [ 78%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda SKIPPED [0.0087s] (...) [ 81%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0087s] (...) [ 84%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda SKIPPED [0.0084s] (...) [ 87%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda SKIPPED [0.0087s] (T...) [ 90%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda SKIPPED [0.0087s] (T...) [ 93%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda SKIPPED [0.0084s] (T...) [ 96%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda SKIPPED [0.0087s] (T...) [100%]

=================================================== 14 passed, 18 skipped, 77218 deselected in 39.72s ===================================================
```
```
$ pytest test_transformers.py -v -k "test_causal_variants"
================================================================== test session starts ==================================================================
platform linux -- Python 3.10.13, pytest-7.4.0, pluggy-1.0.0 -- /home/shuqiyang/.conda/envs/pytorch/bin/python
cachedir: .pytest_cache
rootdir: /data/users/shuqiyang/pytorch
configfile: pytest.ini
collected 77250 items / 77218 deselected / 32 selected
Running 32 items in this shard

test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.2410s]                  [  3%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.3984s]                  [  6%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0011s] (Lower righ...) [  9%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.0095s]                  [ 12%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.1749s]                   [ 15%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.2138s]                   [ 18%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.2715s]                   [ 21%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.0108s]                   [ 25%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cpu PASSED [0.4864s]          [ 28%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cpu PASSED [0.5346s]          [ 31%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cpu SKIPPED [0.0011s] (Lo...) [ 34%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cpu PASSED [0.1722s]          [ 37%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cpu PASSED [0.2341s]           [ 40%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cpu PASSED [0.4786s]           [ 43%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cpu PASSED [0.4635s]           [ 46%]
test_transformers.py::TestAttnBiasCPU::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cpu PASSED [0.0861s]           [ 50%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [9.7579s]                [ 53%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.0044s]                [ 56%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0007s] (Lower ri...) [ 59%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [9.2065s]                [ 62%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.0081s]                 [ 65%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.0063s]                 [ 68%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.0059s]                 [ 71%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.0055s]                 [ 75%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda PASSED [0.1200s]        [ 78%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda PASSED [0.1032s]        [ 81%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape2_cuda SKIPPED [0.0010s] (...) [ 84%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape3_cuda PASSED [0.1151s]        [ 87%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda PASSED [0.0705s]         [ 90%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape1_cuda PASSED [0.0713s]         [ 93%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape2_cuda PASSED [0.0696s]         [ 96%]
test_transformers.py::TestAttnBiasCUDA::test_causal_variants_compile_causal_variant_CausalVariant_UPPER_LEFT_shape3_cuda PASSED [0.1516s]         [100%]

=================================================== 28 passed, 4 skipped, 77218 deselected in 39.23s ====================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121048
Approved by: https://github.com/zou3519
2024-03-05 20:19:02 +00:00
8ccf8b2c47 Avoid COW input materialize in more forward ops (#121070)
Affected operators are: addr, cdist, sparse.sampled_addmm, sparse.mm,
matrix_exp, softmax, cross_entropy

Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121070
Approved by: https://github.com/ezyang
2024-03-05 19:47:24 +00:00
81dbc487c7 ci: add "typing_extensions" package to ci requirements list (#121136)
This is required for torchgen.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121136
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-05 18:26:01 +00:00
3239f86a3d [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace) the recommended workspace size for Hopper is 32 MiB and for the rest architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, that is why I am keeping it unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-05 18:13:05 +00:00
8aeb247a3d [export] Remove WrapperModule. (#121042)
Summary: WrapperModule seems like a good idea but may introduce some surprising behavior for users. For example, it never registers enclosed modules as submodules, and therefore it's unclear what the state dict for the exported program should look like: some people may argue for including every state in the state dict, while others want to keep them as constants.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D54326331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042
Approved by: https://github.com/angelayi
2024-03-05 18:10:22 +00:00
0e604becc5 [NJT] support chunk on batch dim (#119713)
- support chunk op on batch dim
- support empty_like op
- add tests for the like ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119713
Approved by: https://github.com/jbschlosser
2024-03-05 17:57:50 +00:00
ae4c85960f Add Deberta pass (#121206)
Adding DebertaForQuestionAnswering to inductor benchmark pass, as it did not show up before

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121206
Approved by: https://github.com/desertfire
2024-03-05 17:56:25 +00:00
5abf7972d1 [DCP][state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378)
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` will use `copy_` to copy the GPU data to the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.

This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.

**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on a H100 machine with PCIe 5

cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)

# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)

# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)

# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)

# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```

Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
2024-03-05 17:48:15 +00:00
cyy
6ecd65886a Remove unnecessary const_casts (#121225)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121225
Approved by: https://github.com/soulitzer
2024-03-05 17:34:24 +00:00
85c807b3fd [export] Ensure optional fields always have default value. (#121163)
Summary: Add additional check to make sure we can always unset an optional field.

Test Plan: CI

Differential Revision: D54504243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121163
Approved by: https://github.com/tugsbayasgalan
2024-03-05 17:16:49 +00:00
35004b8ab4 [dynamo] Fix handling of invalid args (#121110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121110
Approved by: https://github.com/yanboliang
ghstack dependencies: #121106
2024-03-05 17:16:04 +00:00
4f19b5f7ef [dynamo] Remove extra guard for tensor constant attrs (#121106)
Also deletes some unused code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121106
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-03-05 17:16:04 +00:00
e4352182bd Disable remote cache test on ROCM (#121210)
Fixes #121194
Fixes #121166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121210
Approved by: https://github.com/aakhundov
2024-03-05 16:35:40 +00:00
f25a25fde5 Fix lintrunner-noclang (#121205)
Fix lintrunner-noclang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121205
Approved by: https://github.com/Skylion007
2024-03-05 16:18:36 +00:00
fbf36d01a0 Update Triton (#119457)
Fix pytorch nightly compilation for cuda linking

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119457
Approved by: https://github.com/lezcano
2024-03-05 15:04:12 +00:00
59d9f1e227 Spectral norm value test (#121068)
The spectral norm implementation has extensive tests, but there doesn't appear to be any check that the spectral norm (= top singular value) is actually calculated correctly. There should be at least one such test case.

This adds one such testcase for the parameterizations.py implementation of spectral norm.
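
A minimal sketch of what such a value check might look like (not the actual test added; shapes, iteration count, and tolerance here are assumptions):

```
import torch
from torch.nn.utils.parametrizations import spectral_norm

lin = spectral_norm(torch.nn.Linear(8, 8))
for _ in range(50):
    lin(torch.randn(4, 8))  # forward passes run the power iteration

# The parametrized weight should have top singular value ~= 1.
top_sv = torch.linalg.matrix_norm(lin.weight.detach(), ord=2)
assert torch.allclose(top_sv, torch.tensor(1.0), atol=5e-2)
```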

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121068
Approved by: https://github.com/soulitzer
2024-03-05 14:46:31 +00:00
d621e3e3b8 Add exhaustive module and optimizer tests for torch.load(state_dict, weights_only=True) (#121049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121049
Approved by: https://github.com/janeyx99
2024-03-05 14:27:50 +00:00
42821d462a [ATen][Native][CUDA] Decrease max_threads in ctc_loss (#120746)
There will be some changes in CUDA 12.4 that would require a smaller number of threads per block with double precision in `ctc_loss`. This PR addresses that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120746
Approved by: https://github.com/ptrblck, https://github.com/janeyx99
2024-03-05 14:14:41 +00:00
12191f4b3e Fix make triton command on release branch (#121169)
Fixes #120044

Should fix build from source instructions on release branch here: https://github.com/pytorch/pytorch#from-source

Please note we are using /test/ channel for release here to make sure it works, before actual release is completed.

Test main:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/nightly/
Collecting pytorch-triton==3.0.0+a9bc1a3647
  Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Ba9bc1a3647-cp310-cp310-linux_x86_64.whl (239.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 239.0/239.0 MB 8.7 MB/s eta 0:00:00
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==3.0.0+a9bc1a3647) (3.13.1)
Installing collected packages: pytorch-triton
  Attempting uninstall: pytorch-triton
    Found existing installation: pytorch-triton 2.2.0
    Uninstalling pytorch-triton-2.2.0:
      Successfully uninstalled pytorch-triton-2.2.0
Successfully installed pytorch-triton-3.0.0+a9bc1a3647
```

Test release/2.2:
```
make triton
pip3 uninstall -y triton
WARNING: Skipping triton as it is not installed.
Looking in indexes: https://download.pytorch.org/whl/test/
Collecting pytorch-triton==2.2.0
  Using cached https://download.pytorch.org/whl/test/pytorch_triton-2.2.0-cp310-cp310-linux_x86_64.whl (183.1 MB)
Requirement already satisfied: filelock in /home/atalman/miniconda3/envs/py310/lib/python3.10/site-packages (from pytorch-triton==2.2.0) (3.13.1)
Installing collected packages: pytorch-triton
  Attempting uninstall: pytorch-triton
    Found existing installation: pytorch-triton 3.0.0+a9bc1a3647
    Uninstalling pytorch-triton-3.0.0+a9bc1a3647:
      Successfully uninstalled pytorch-triton-3.0.0+a9bc1a3647
Successfully installed pytorch-triton-2.2.0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121169
Approved by: https://github.com/seemethere
2024-03-05 13:53:53 +00:00
ee557d8f61 skip detectron2_fcos_r_50_fpn in dynamic shape test (#120697)
As reported in https://github.com/pytorch/pytorch/issues/119434, `detectron2_fcos_r_50_fpn` failed with dynamic shape testing, so we propose to skip the dynamic batch size testing of this model in this PR.

* Error msg is
```
  File "/home/jiayisun/pytorch/benchmarks/dynamo/common.py", line 3877, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

* Root Cause is
The benchmark code only annotates an input dim as dynamic when its size equals the batch size (c617e7b407/benchmarks/dynamo/common.py (L3867-L3871)). If it fails to find any dim equal to the batch size, the above error is thrown.
However, the inputs of `detectron2_fcos_r_50_fpn` are as follows:

```
([{'file_name': '/home/jiayisun/benchmark/torchbenchmark/data/.data/coco2017-minimal/coco/val2017/000000001268.jpg', 'height': 427, 'width': 640, 'image_id': 1268, 'image': tensor([[[147., 124.,  82.,  ...,   3.,   4.,   5.],
         [125., 104.,  65.,  ...,   3.,   3.,   4.],
         [ 87.,  68.,  34.,  ...,   2.,   2.,   2.],
         ...,
         [ 47.,  45.,  41.,  ...,  45.,  45.,  45.],
         [ 46.,  44.,  40.,  ...,  44.,  45.,  46.],
         [ 46.,  44.,  40.,  ...,  43.,  45.,  46.]],

        [[154., 129.,  84.,  ...,   3.,   4.,   5.],
         [133., 110.,  69.,  ...,   3.,   3.,   4.],
         [ 95.,  76.,  43.,  ...,   2.,   2.,   2.],
         ...,
         [ 44.,  42.,  38.,  ...,  34.,  37.,  39.],
         [ 43.,  41.,  37.,  ...,  35.,  39.,  41.],
         [ 43.,  41.,  37.,  ...,  35.,  40.,  43.]],

        [[171., 140.,  85.,  ...,   3.,   4.,   5.],
         [147., 120.,  71.,  ...,   3.,   3.,   4.],
         [103.,  83.,  47.,  ...,   2.,   2.,   2.],
         ...,
         [ 46.,  44.,  40.,  ...,  16.,  20.,  22.],
         [ 45.,  43.,  39.,  ...,  17.,  22.,  26.],
         [ 45.,  43.,  39.,  ...,  18.,  24.,  28.]]])}, ... ],)
```

None of the input dims equal the input batch size, so I think we need to skip the dynamic batch size testing for this model.
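
A rough sketch of the harness logic described above (function and variable names are assumed, not the actual benchmark code):

```
import torch

def mark_batch_dims_dynamic(example_inputs, batch_size):
    marked = False
    for t in example_inputs:
        if not isinstance(t, torch.Tensor):
            continue
        for dim, size in enumerate(t.shape):
            if size == batch_size:
                torch._dynamo.mark_dynamic(t, dim)
                marked = True
    # detectron2_fcos_r_50_fpn trips this assert: no input dim matches the batch size
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
```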

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120697
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/desertfire
2024-03-05 12:12:18 +00:00
c4a1570864 Temporarily increased compile time limit of #GPUs to 120. (#121076)
Fixes #115331.

This is a temporary fix that increases the compile-time limit on the number of GPUs to 120 until #119639 can be merged. Changing the parameter to 128 leads to annoying errors, as some checks would become tautological (`int8_t` is always < 128).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121076
Approved by: https://github.com/albanD
2024-03-05 11:39:14 +00:00
de8af28083 [FSDP][StateDict] Allow FULL_STATE_DICT option for 2D (#120837)
Fixes #120722

TL;DR for the issue:
As users are expected to use get_model_state_dict to do state_dict retrieval, I think it's fine to remove the warning and RuntimeError.
More context in #120722.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120837
Approved by: https://github.com/Skylion007
2024-03-05 10:03:44 +00:00
cyy
507611f9ae [CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969)
Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969
Approved by: https://github.com/albanD
2024-03-05 09:53:05 +00:00
46c9d646dd [Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)
Fixes #118793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120812
Approved by: https://github.com/zou3519
2024-03-05 09:05:26 +00:00
311cc564f6 Fix README Typo (#120892)
Fixes a README typo so that the prompt is consistent with VSCode 1.87.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120892
Approved by: https://github.com/albanD, https://github.com/drisspg
2024-03-05 09:05:21 +00:00
a7e93c341f [hoo] Add with_effects to handle side effectful ops (#120296)
Proposal: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bnm38nu3yfno
Implementation discussion: https://docs.google.com/document/d/179QyhicGzTXJ5jvTAoAosP_Nzgf3PpgZwU_E3VV9PlM/edit#heading=h.bj61609o1buq

Result with print:
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %with_effects : [num_users=1] = call_function[target=torch._higher_order_ops.effects.with_effects](args = (%arg0_1, aten.print.default, moo), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%with_effects, 0), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg1_1, %arg1_1), kwargs = {})
    return [getitem, add]
```

Follow ups:
* Add handling to auto_functionalize
* Add support for tokens on the export side
* Add support for tokens on the inductor side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120296
Approved by: https://github.com/zou3519
2024-03-05 08:58:32 +00:00
29976519a1 Make configs hash part of remote cache key (#121152)
Summary:
While testing I noticed that if we generate different configs, we will fail to use the remote cache, so let's include the configs in the cache key.
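
An illustrative sketch of the idea (the hashing scheme and names here are assumptions, not the actual implementation):

```
import hashlib
import json

def remote_cache_key(kernel_source: str, configs) -> str:
    # Fold the generated configs into the key so two runs that produce
    # different configs never share (and clobber) the same cache entry.
    configs_repr = json.dumps(sorted(repr(c) for c in configs))
    return hashlib.sha256((kernel_source + configs_repr).encode()).hexdigest()
```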

Not sure how to write a deterministic test for this.

Test Plan: existing tests

Differential Revision: D54500957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121152
Approved by: https://github.com/aakhundov
2024-03-05 08:01:24 +00:00
43416e3059 Correctly read the cache key for remote cache (#121151)
Summary: While investigating why we were calling put each time, I noticed that the memcache backend returns a list instead of a direct result, which means we were correctly fetching the cached result but not using it.

Test Plan: The test should now work as expected

Differential Revision: D54500851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121151
Approved by: https://github.com/aakhundov
2024-03-05 07:33:20 +00:00
9e16622397 Move JK check to on-demand (#121182)
Summary: Some tests are failing due to checking JK during forking. Let's move the JK check to on-demand.

Differential Revision: D54518293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121182
Approved by: https://github.com/aakhundov
2024-03-05 07:03:25 +00:00
9ccff0aff9 Remove ids_of_folded_args from test_triton_kernel_equal_to_1_arg (#121192)
Summary: Due to the Triton pin update in https://github.com/pytorch/pytorch/pull/119457, `test_triton_kernel_equal_to_1_arg` started to break, as `ids_of_folded_args` has vanished from the upstream Triton codebase.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 6 tests in 6.790s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121192
Approved by: https://github.com/oulgen, https://github.com/bertmaher
2024-03-05 06:35:04 +00:00
4b49bc19e8 [export][reland] Disable exported_program.__call__ (#120019)
Summary: Reland of D53075378 / https://github.com/pytorch/pytorch/pull/119466

Test Plan: CI

Differential Revision: D53827930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120019
Approved by: https://github.com/ydwu4
2024-03-05 05:29:46 +00:00
6ddf5cf85e [AOTI] Update cpp wrapper codegen to use v2 C shim (#120714)
Summary: To use the torchgen-ed v2 C shim interface, cpp wrapper codegen needs to update its rule for generating the right parameter and function call. Because changing the emitted code will cause a FC breakage, we add a flag to control the behavior.

Differential Revision: [D54258086](https://our.internmc.facebook.com/intern/diff/D54258086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120714
Approved by: https://github.com/chenyang78
ghstack dependencies: #120513
2024-03-05 04:32:32 +00:00
bd19d6d822 [AOTI] Use torchgen to generate C shim functions (#120513)
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as

* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values

https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.

This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.

Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
2024-03-05 04:28:44 +00:00
ffe45a8188 [ATen-vulkan] Implement global shader registry (#121088)
Differential Revision: D54447700

## Context

This changeset updates Vulkan SPIR-V codegen to introduce a global SPIR-V shader registry and register shaders dynamically at static initialization time. This change makes it possible to define and link custom shader libraries to the ATen-Vulkan runtime.

Before:

* `gen_vulkan_spv.py` generated two files, `spv.h` and `spv.cpp` which would contain the definition and initialization of Vulkan shader registry variables.

After:

* Introduce the `ShaderRegistry` class in `api/`, which encapsulates functionality of the `ShaderRegistry` class previously defined in the generated `spv.h` file
* Introduce a global shader registry (defined as a static variable in the `api::shader_registry()` function)
* Define a `ShaderRegisterInit` class (taking inspiration from `TorchLibraryInit`) that allows for dynamic shader registration
* `gen_vulkan_spv.py` now only generates `spv.cpp`, which defines a static `ShaderRegisterInit` instance that triggers registration of the compiled shaders to the global shader registry.

Benefits:

* Cleaner code base; we no longer have `ShaderRegistry` defined in a generated file, and don't need a separate implementation file (`impl/Registry.*`) to handle shader lookup. All that logic now lives under `api/ShaderRegistry.*`
* Makes it possible to compile and link separate shader libraries, providing similar flexibility as defining and linking custom ATen operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121088
Approved by: https://github.com/manuelcandales, https://github.com/jorgep31415
2024-03-05 03:56:57 +00:00
c3c618c750 Update torchbench pin (#121029)
Fixes https://github.com/pytorch/pytorch/issues/117280 after bumping the HF version in https://github.com/pytorch/benchmark/pull/2179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121029
Approved by: https://github.com/desertfire
2024-03-05 03:21:32 +00:00
a15c02562a Fix dynamo failure (#121167)
Summary: Title

Test Plan: CI

Differential Revision: D54509198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121167
Approved by: https://github.com/izaitsevfb
2024-03-05 03:19:59 +00:00
3381f282c3 Revert "Update Triton (#119457)"
This reverts commit d49864f6a526d3def25f8da2fa9b8815b3347b9d.

Reverted https://github.com/pytorch/pytorch/pull/119457 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing test_triton_kernels in trunk d49864f6a5 ([comment](https://github.com/pytorch/pytorch/pull/119457#issuecomment-1977792634))
2024-03-05 01:46:44 +00:00
9deaa2e812 [BE]: FURB187 Use inplace reverse on lists: faster, more readable. (#121140)
Use the `reverse()` method, as it's faster and in-place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121140
Approved by: https://github.com/albanD
2024-03-05 01:36:17 +00:00
ec4146c535 [inductor] skip foreach kernel for benchmark fusion (#121168)
Benchmark fusion currently does not support foreach kernels. If we don't explicitly skip foreach kernels, we end up with exceptions in `codegen_node_schedule` because individual nodes in a foreach kernel may have incompatible shapes from a pointwise/reduction perspective.

cc Manman Ren ( @manman-ren ) who reported the issue when turning on benchmark fusion on BertForMaskedLM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121168
Approved by: https://github.com/Chillee
2024-03-05 01:27:55 +00:00
bcf35c6ae6 [tensorboard] Handle bfloat16 type in add_histogram (#120087)
Summary:
add_histogram fails for this data type. Updating conversion code to handle it.

Stack trace for the failure -

`
[trainer0]Traceback (most recent call last):
[trainer0]  File "<torch_package_0>.tensorboard/logging/summary_v2.py", line 203, in unscriptable_record_summary
[trainer0]    unscriptable_histogram(name, t, step, ranks)
[trainer0]  File "<torch_package_0>.tensorboard/logging/fx_v1.py", line 146, in unscriptable_histogram
[trainer0]    Adhoc.writer().add_histogram(tag, x, step.int())
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer.py", line 40, in wrapper
[trainer0]    resp = super_method(*args, **kwargs)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/writer_oss.py", line 526, in add_histogram
[trainer0]    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/summary.py", line 482, in histogram
[trainer0]    values = make_np(values)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 23, in make_np
[trainer0]    return _prepare_pytorch(x)
[trainer0]  File "/tmp/aienv/images/aienv_image_09slg3j1/torch/utils/tensorboard/_convert_np.py", line 30, in _prepare_pytorch
[trainer0]    x = x.detach().cpu().numpy()
[trainer0]TypeError: Got unsupported ScalarType BFloat16
`
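
A minimal sketch of the kind of conversion fix described (the actual change is in `_convert_np.py`; the exact code here is an assumption):

```
import torch

def to_numpy_safe(x: torch.Tensor):
    # numpy has no bfloat16 dtype, so upcast before calling .numpy()
    if x.dtype == torch.bfloat16:
        x = x.float()
    return x.detach().cpu().numpy()
```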

Test Plan: Updated unit test that was failing before but passes after this change.

Reviewed By: hamzajzmati, jcarreiro

Differential Revision: D53841197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120087
Approved by: https://github.com/jcarreiro, https://github.com/yanboliang
2024-03-05 00:27:21 +00:00
a3a8137484 [onnxrt, dynamo] Fix run with inputs on mix devices (#121159)
`onnxrt` previously assumed all tensors are on the same device; this PR fixes that by setting the device individually for each tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121159
Approved by: https://github.com/thiagocrepaldi
2024-03-04 23:39:33 +00:00
83c312990f Add missing newline to repro and some utility thing in repro (#121051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121051
Approved by: https://github.com/ezyang, https://github.com/shunting314, https://github.com/eellison
2024-03-04 22:52:54 +00:00
eba28a6f91 [VK-API][Op Redesign][3/n] Expose new Context and Resource APIs (#121060)
Summary: For use in the next diff.

Test Plan: sc

Differential Revision: D54397862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121060
Approved by: https://github.com/SS-JIA
2024-03-04 22:26:07 +00:00
70c23a51ac Revert "[ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)"
This reverts commit 0a38a6ac8046e4d3f9cfaba86b7ec6517038646f.

Reverted https://github.com/pytorch/pytorch/pull/120925 on behalf of https://github.com/clee2000 due to broke inductor models and caused accuracy regression on nightly dashboard 0a38a6ac80 https://github.com/pytorch/pytorch/actions/runs/8118465367/job/22193590228 ([comment](https://github.com/pytorch/pytorch/pull/120925#issuecomment-1977556485))
2024-03-04 22:13:23 +00:00
df3c8b8390 [fake_impls] Fix seed/offset device for attention kernels (#120839)
1) Fix fake_impls to return the correct device for these attention
   kernels.
2) Remove special-casing and test file xfails
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120839
Approved by: https://github.com/drisspg
2024-03-04 22:02:32 +00:00
6a5c7d5f95 [ATen-vulkan] Enable deferred descriptor pool initialization (#121134)
Differential Revision: D54487619

## Context

Allow the descriptor pool of an `api::Context` object to be initialized in a deferred fashion, instead of forcing initialization upon construction. This mode of operation will be used in the ExecuTorch Vulkan delegate, where the exact number of descriptor sets can be determined once the graph is built instead of needing to "guess" an adequate amount.

## Implementation Details

* Check `config.descriptorPoolMaxSets > 0` to check if the descriptor pool should be initialized
* Introduce `DescriptorPool::init()` function to trigger initialization
* Introduce safeguards against using an uninitialized descriptor pool

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121134
Approved by: https://github.com/manuelcandales
2024-03-04 21:37:32 +00:00
0c07c0c15f Revert "add int4 packed gemm support on CPU device (#117475)"
This reverts commit 30befa592e0675cc694f87a4f6fb80894709e719.

Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))
2024-03-04 21:20:57 +00:00
74b19fa8b9 fix fsdp device mesh depenency issue (#121061)
as reported in https://github.com/pytorch/torchtrain/pull/103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121061
Approved by: https://github.com/awgu
2024-03-04 21:20:09 +00:00
7a065e3b23 improve the constantLR doc (#120852)
Fixes #120716
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120852
Approved by: https://github.com/janeyx99
2024-03-04 21:15:27 +00:00
cb812c9832 Add windows constraint to mkl package in wheel (#121014)
Follow up on: https://github.com/pytorch/pytorch/pull/102604
Address this comment: https://github.com/pytorch/pytorch/pull/102604#discussion_r1419944305

Wheel metadata for all wheels published to PyPI must match, otherwise poetry install will fail; see this comment:
https://github.com/pytorch/pytorch/issues/88049#issuecomment-1302555269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121014
Approved by: https://github.com/malfet
2024-03-04 20:54:26 +00:00
4cdc2d7096 [dynamo] Remove expected dynamo test failures (#120836)
Fixes some of the tests in #120643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120836
Approved by: https://github.com/zou3519
2024-03-04 20:41:49 +00:00
a98c17edc7 Revert "add int8 packed gemm support on CPU device (#118056)"
This reverts commit f84375ca5db623a6a53cbce2864d27dfad626228.

Reverted https://github.com/pytorch/pytorch/pull/118056 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/118056#issuecomment-1977368720))
2024-03-04 20:09:40 +00:00
9ff65d56a5 Revert "delete useless cast_outputs call in unary_op_impl_float_out (#120486)"
This reverts commit d053dcfa69a52e6b9f9f2ba997b6bffbc9b29bb5.

Reverted https://github.com/pytorch/pytorch/pull/120486 on behalf of https://github.com/izaitsevfb due to Fails meta internal tests ([comment](https://github.com/pytorch/pytorch/pull/120486#issuecomment-1977343125))
2024-03-04 19:52:23 +00:00
26431db939 [ONNX] Perform implicit casting of constants for the onnx::where operator (#118733) (#120619)
This PR fixes the problem of having the `Where` operator bound to different types in cases where the dtype is not explicitly set. The PR extends the implicit casting to the onnx::Where operator to fix the issue, and includes the corresponding unit test.

Fixes #118733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120619
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-04 19:27:30 +00:00
58047205ed Delete unnecessary code (#120365)
Summary: Title

Test Plan: CI

Differential Revision: D53828357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120365
Approved by: https://github.com/Skylion007
2024-03-04 18:02:58 +00:00
2e6c08a14b Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-rolled on top of the modified kernel, changing how dropout is saved for backward and removing the head_dim_pad, since the latter would make the kernel mutate in place, which has a bad interaction with functionalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-03-04 17:36:22 +00:00
d49864f6a5 Update Triton (#119457)
Fix pytorch nightly compilation for cuda linking

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119457
Approved by: https://github.com/bertmaher
2024-03-04 17:04:59 +00:00
6566b3db67 Add an autotune cache for inductor generated kernels (#120963)
Summary: Inductor currently has a best-config cache for the kernels that it generates. This is a local cache implemented by writing to the file system. This diff makes the cache remote by reusing the existing Triton caching mechanism, built on Memcache internally and Redis externally.

Test Plan:
tested locally using `TORCH_INDUCTOR_AUTOTUNE_REMOTE_CACHE=1`

Look at scuba to verify the local testing: https://fburl.com/scuba/triton_remote_cache/z6pypznk

The plan is to land this diff with this turned off and gradually introduce this.

Differential Revision: D54398076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120963
Approved by: https://github.com/jansel
2024-03-04 16:58:37 +00:00
3ef0befdc9 Better error messages for impl_abstract_pystub (#120959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120959
Approved by: https://github.com/drisspg
2024-03-04 15:24:36 +00:00
ce2903080c Add sparse compressed fake tensor support (#120920)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120920
Approved by: https://github.com/ezyang
2024-03-04 14:38:45 +00:00
c06499981d Add a decomposition for torch.put, 2. (#120179)
As in the title. It is an updated copy of https://github.com/pytorch/pytorch/pull/115306 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120179
Approved by: https://github.com/lezcano, https://github.com/peterbell10, https://github.com/jgong5
2024-03-04 14:37:30 +00:00
8ba49d0e53 Fix compilation error: load_fp32_from_fp16’ was not declared in this scope for ppc64le (#120307)
This patch adds the missing implementation of `load_fp32_from_fp16` for half. Fixes the error `load_fp32_from_fp16 was not declared in this scope`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120307
Approved by: https://github.com/jgong5
2024-03-04 11:08:39 +00:00
27ac73073b Fix hipification issue (#121107)
Differential Revision: D54470055

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:201:61: error: comparison of integers of different signs: 'R' (aka 'unsigned int') and 'int' [-Werror,-Wsign-compare]
    return ((threadIdx.x  + thread_work_elem*num_threads()) < remaining);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ^ ~~~~~~~~~
```

```
buck-out/v2/gen/fbcode/713b128926d8b21f/caffe2/__ATen-hip__/buck-headers/ATen/native/hip/MemoryAccess.cuh:223:15: error: unused variable 'to' [-Werror,-Wunused-variable]
    scalar_t *to = reinterpret_cast<scalar_t *>(data[0]) + block_work_size() * idx;
              ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121107
Approved by: https://github.com/chenyang78
2024-03-04 09:41:21 +00:00
2e50566722 [dtensor] change distribute_module input/output_fn to accept module (#120895)
This is a BC-breaking change to distribute_module. The underlying rationale for this change is that sometimes, in the input_fn/output_fn, the user would want access to the current module for some attributes. This might not be common, but in some cases it's worth having access to the module.

An outstanding use case we want to support is float8: if we want to make float8 work with the TP API, the input_fn/output_fn of the TP parallel styles would need access to the module, where the module might encapsulate a `dynamic_linear.emulate` attribute that is useful for input/output casting.

Since this is needed for fp8 and DTensor is still under prototype release, I feel it's worth the change, and it's better we make the change as early as possible.

Right now this is a soft BC break, which means we still maintain BC but throw deprecation messages.
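
A hedged sketch of the new module-aware hook signature described above (argument order and setup are assumptions based on this description; assumes a 2-rank distributed environment is already initialized):

```
import torch.nn as nn
from torch.distributed._tensor import distribute_module
from torch.distributed.device_mesh import init_device_mesh

def input_fn(module, inputs, device_mesh):
    # The hook can now read module attributes (e.g. a float8 emulate flag)
    # to decide how to cast the inputs.
    return inputs

def output_fn(module, outputs, device_mesh):
    return outputs

mesh = init_device_mesh("cuda", (2,))
dmod = distribute_module(nn.Linear(8, 8), mesh, input_fn=input_fn, output_fn=output_fn)
```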

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895
Approved by: https://github.com/tianyu-l
2024-03-04 07:22:32 +00:00
3045b16488 Do not use warm_pool() if TorchTnT is used (#121047)
Summary: This diff is needed to avoid QPS drop when parallel compilation is used with TorchTNT.

Test Plan:
On TNT
* https://www.internalfb.com/mast/job/torchx-ldm_train-hxjhl0k1wjz93
On PyPer
* f537224855

Differential Revision: D54430900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121047
Approved by: https://github.com/yanboliang
2024-03-04 06:14:11 +00:00
cyy
4b494d0750 Fix comparison of integer expressions of different signedness (#121066)
Fixes these warnings
```
src/aten/src/ATen/native/cuda/ForeachReduceOp.cu:190:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121066
Approved by: https://github.com/tringwald, https://github.com/Skylion007
2024-03-04 02:14:10 +00:00
c83dfc8854 [PT2][Inductor] Fix missing "example_value" for nodes introduced by group batch fusion (#120974)
Summary: Similar to D54140488, we fix more such bugs

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion
```
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```

Differential Revision: D54399360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120974
Approved by: https://github.com/jackiexu1992
2024-03-04 02:11:57 +00:00
cead0363a8 [jit][nested strided tensor] support nested tensor in check_trace (#121039)
Summary:
torch.testing.assert_equal doesn't support nested strided tensors because sizes is not implemented.

This adds special handling for nested tensors by checking for them and unbinding them if they are found.

Test Plan: test_trace_with_nested_strided_tensor_output

Differential Revision: D54430238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121039
Approved by: https://github.com/YuqingJ
2024-03-04 01:15:45 +00:00
089f4c0bd9 If data dependent, check if guard_size_oblivious would fix problem and report if so (#121011)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121011
Approved by: https://github.com/lezcano
2024-03-03 23:23:14 +00:00
cyy
13fadea888 [Clang-tidy header][21/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#120763)
This PR continues to fix clang-tidy warnings in aten/src/ATEN/*, following #120574.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120763
Approved by: https://github.com/Skylion007
2024-03-03 23:18:43 +00:00
4f0481e1d5 [inductor] add decompostition for mm in backward (#120933)
Summary:
1) As a follow-up to D53602514, found a new way to decompose mm in the backward pass: sum the permuted input and reduce along dim 0. Benchmark result: P1190140001 (30x speedup).
Some explanation of why the original mm decomposition is slow: for an m x k x n mm, when m is small and k is large, the stride for the lhs is [m, 1], hence it needs to access memory k times to load all the data. As a result, the decomposition becomes effective with a permute, since the stride becomes [k, 1].

2) Add another pattern for large k. Benchmark result: P1190596489 (28x speedup).

3) Fix the "value not found" error in ig ctr. f536115499

Test Plan:
pt2 decompose:

 {F1462894821}
decompose: f536159404
baseline: f536282578
705k vs 725k 4% for ig ctr

Differential Revision: D54294491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120933
Approved by: https://github.com/mengluy0125
2024-03-03 18:46:42 +00:00
b7f2522692 [dynamo][compile-time] Remove unnecessary tree_map_only (#121052)
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.

~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121052
Approved by: https://github.com/jansel
ghstack dependencies: #121053
2024-03-03 06:59:43 +00:00
368f242e37 Revert "[PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)"
This reverts commit 8c2e569928a200893fe971e615b82a2f9ce32630.

Reverted https://github.com/pytorch/pytorch/pull/120454 on behalf of https://github.com/desertfire due to breaks nightly dashboard cudagraphs run ([comment](https://github.com/pytorch/pytorch/pull/120454#issuecomment-1975001824))
2024-03-03 02:58:47 +00:00
0e0a621e0c [dynamo] Minor refactors (#120966)
These are changes I pulled out of the above PRs due to not being
related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120966
Approved by: https://github.com/yanboliang
2024-03-03 02:20:48 +00:00
8e4301077e [dynamo][comp-time] BuiltinVariableTracker - inspect signature only on failure (#121053)
Reduces the torch.compile(backend="eager") compile time for this code by 1-2 seconds.
~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121053
Approved by: https://github.com/jansel
2024-03-02 23:03:00 +00:00
7aced61c46 [DCP] deletes legacy formatting test (#120127)
Should no longer be necessary

Differential Revision: [D53791345](https://our.internmc.facebook.com/intern/diff/D53791345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120127
Approved by: https://github.com/fegin
ghstack dependencies: #119816
2024-03-02 22:04:39 +00:00
7f81563e5e [dynamo][guards-cpp-refactor] Skip type and length check guard for DictGuardManager (#120739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120739
Approved by: https://github.com/jansel
ghstack dependencies: #120673
2024-03-02 13:15:53 +00:00
82d1465d8d [dynamo][guards-cpp-refactor] DICT_CONTAINS guard (#120673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120673
Approved by: https://github.com/jansel
2024-03-02 13:15:53 +00:00
bab4b5a341 [dist][sharded_tensor] Fix ChunkShardingSpec metadata offsets for empty shards (#121002)
ChunkShardingSpec generated metadata where offsets exceed the tensor size.

Example:

Torchrec prepared ShardedTensorMetadata:
```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]
),
```
Calling ShardedTensor._init_from_local_shards_and_global_metadata(), the ShardedTensor ShardingSpec builds this metadata:

```
ShardedTensorMetadata(shards_metadata=[
ShardMetadata(shard_offsets=[0, 0], shard_sizes=[2, 512], placement=rank:0/cuda:0),
ShardMetadata(shard_offsets=[2, 0], shard_sizes=[2, 512], placement=rank:1/cuda:1),
ShardMetadata(shard_offsets=[4, 0], shard_sizes=[2, 512], placement=rank:2/cuda:2),
ShardMetadata(shard_offsets=[6, 0], shard_sizes=[2, 512], placement=rank:3/cuda:3),
ShardMetadata(shard_offsets=[8, 0], shard_sizes=[2, 512], placement=rank:4/cuda:4),
ShardMetadata(shard_offsets=[10, 0], shard_sizes=[0, 512], placement=rank:5/cuda:5),
ShardMetadata(shard_offsets=[12, 0], shard_sizes=[0, 512], placement=rank:6/cuda:6)
],
size=torch.Size([10, 512]), tensor_properties=TensorProperties(dtype=torch.float16, layout=torch.strided, requires_grad=False, memory_format=torch.contiguous_format, pin_memory=False))
```
The deduced ChunkShardingSpec:
```
ChunkShardingSpec(dim=0, placements=[rank:0/cuda:0, rank:1/cuda:1, rank:2/cuda:2, rank:3/cuda:3, rank:4/cuda:4, rank:5/cuda:5, rank:6/cuda:6])
```

The fix is to limit offsets by dim size.
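
An illustrative sketch of the capping behavior (function and variable names are assumed):

```
def chunk_offsets(shard_sizes, dim_size):
    # Cap the running offset at the dim size so trailing empty shards all
    # report offset == dim_size (10 in the example above) instead of overshooting (12).
    offsets, current = [], 0
    for size in shard_sizes:
        offsets.append(min(current, dim_size))
        current += size
    return offsets

print(chunk_offsets([2, 2, 2, 2, 2, 0, 0], 10))  # [0, 2, 4, 6, 8, 10, 10]
```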

Differential Revision: [D54419513](https://our.internmc.facebook.com/intern/diff/D54419513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121002
Approved by: https://github.com/wz337
2024-03-02 08:58:48 +00:00
suo
66b20b4297 [export][ez] minor variable rename (#121040)
since `_export()` now takes an `nn.Module` only (which is asserted against at an upper layer), we should change this variable name from `f` to `mod` and remove some unnecessary isinstance checks

Differential Revision: [D54430381](https://our.internmc.facebook.com/intern/diff/D54430381/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121040
Approved by: https://github.com/angelayi
ghstack dependencies: #121037
2024-03-02 08:49:06 +00:00
suo
505637198a [export] cleanup to rewrite steps (#121037)
1. Some underscores for consistency of private functions.
2. remove dead code in `_replace_param_buffer_names`

Differential Revision: [D54429206](https://our.internmc.facebook.com/intern/diff/D54429206/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121037
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-02 08:45:50 +00:00
b0cfa96e82 [Torchelastic][Logging] Pluggable logsspecs using python entrypoints and option to specify one by name. (#120942)
Summary:
Expose an option for users to specify the name of the LogsSpec implementation to use.
- It has to be defined in entrypoints under the `torchrun.logs_specs` group.
- It must implement LogsSpec, defined in the prior PR/diff.

Test Plan: unit test+local tests

Reviewed By: ezyang

Differential Revision: D54180838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
2024-03-02 08:07:52 +00:00
f351a71dbb remove constraints from capture_pre_autograd_graph (#120981)
Differential Revision: D54407296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120981
Approved by: https://github.com/zhxchen17
2024-03-02 07:00:51 +00:00
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
  latency for shape (128, 128) = 0.153 ms
  latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
af5376c444 [dtensor] add support for loss parallel (#119877)
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.
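
A minimal usage sketch based on this description (`logits` and `labels` are placeholders; `logits` is assumed to be a DTensor sharded on the class dimension in a tensor-parallel setup):

```
import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

with loss_parallel():
    loss = F.cross_entropy(logits, labels)  # logits sharded on the class dim
    loss.backward()
```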

Here are the underlying rationales why we are going through these op replacements:

1. `nn.functional.cross_entropy` is the common method OSS users use for things like transformer training; to avoid changing user code, we want users to still use this function for loss calculation if they are already using it.
2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_foward/backward`, and DTensor now supports those ops already (#117723 #119255 #118917 #119256). They are doing computation with input *replicated* on the class dimension.
3. However when the input of this loss calculation is **sharded on the class dimension**, to run sharded computation efficiently, we need to run both `aten.log_softmax` and `aten.nll_loss_foward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we are just overriding these two ops, so we need to have some way to **decompose** these two ops into smaller ops to have collectives run in the middle of these two ops.
4. We explored the existing decompositions (#118950). It seems to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way, which would trigger an additional expensive collective. Recently some users also reported similar issues https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheel here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
2024-03-02 05:06:26 +00:00
c4ed456fc3 [inductor] fix accuracy failure for a few models under freezing (#121054)
Fixes https://github.com/pytorch/pytorch/issues/120545. The reason these models fail the accuracy test with freezing is the conv-batchnorm fusion, which causes relatively big numerical churn.

For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass.

For the failed TB models, the numerical difference is too large. After a discussion with @eellison, we decided to skip them with freezing for now.

On the other hand, we should probably dig more into why the conv-bn fusion causes such a large numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
2024-03-02 04:53:59 +00:00
f84375ca5d add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
ghstack dependencies: #117475
2024-03-02 04:35:49 +00:00
5258c3645d [ATen-vulkan][EZ] Bug fixes: only create the image view when memory has been bound, invalidate cmd on flush (#121027)
Summary:
## Context

Introduce some simple fixes for bugs in the Vulkan Compute API that were causing errors on Android.

1. When using deferred allocation for image textures, it is undefined behaviour to create a `vkImageView` for a `vkImage` that has not yet been bound to memory. Fix this by creating the image view only after the `vkImage` has been bound to memory.
2. When flushing the `api::Context`, the command pool is flushed but any current command buffers are not invalidated. This will cause a segmentation fault if the command buffer is not submitted prior to calling `flush()`, because subsequent calls to `submit_*_job()` will use the old command buffer which will have been freed when the command pool is flushed. To fix, invalidate any existing command buffers when calling `flush()`.

Test Plan:
Build the test binary for Android:

```
buck build --target-platforms=ovr_config//platform/android:arm64-fbsource -c ndk.custom_libcxx=false //xplat/caffe2:pt_vulkan_api_test_bin --show-output
```

Push and run the test binary on a local android phone.

Differential Revision: D54425370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121027
Approved by: https://github.com/mcr229, https://github.com/cbilgin
2024-03-02 04:35:46 +00:00
2d9efad38f Add the bound check for flatten with out_dim (#120894)
Fixes #120762

The bound is not valid in the example below, but it goes unchecked.
```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1, out_dim='a')
```

The same bound is already checked for this case:

```
a = torch.tensor([1, 2, 3])
a.flatten(start_dim=0, end_dim=1)
```

- Therefore, just apply the same check.

@malfet @janeyx99
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120894
Approved by: https://github.com/malfet, https://github.com/spzala
2024-03-02 03:56:55 +00:00
06fe6ed82b [dynamo bug burndown] update tensor creation to support sequences of tensors (#120872)
Fixes https://github.com/pytorch/pytorch/issues/120645

`_internal_new_from_data` calls `_recursive_build`, but we run into errors such as the following cases.
```
Failed running call_function <function tensor at 0xDEADBEEF>:
scalar_tensor(): argument (position 1) must be Number, not FakeTensor

# e.g. cases
1. [FakeTensor(..., size=(20, 1), dtype=torch.float64), ..., FakeTensor(..., size=(20, 1), dtype=torch.float64)]
- Here, we call _recursive_build(sizes=[4] ...) which hits the base case `if dim == ndim:` in the 2nd level of recursion.
- So, we try to return `scalar_tensor(FakeTensor)`
2. [[(FakeTensor(..., size=(1,), dtype=torch.int64), FakeTensor(..., size=(), dtype=torch.int64)]]

# side note: when can size = ()? Probably from scalar_tensor.
>>> torch.scalar_tensor(1).shape
torch.Size([])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120872
Approved by: https://github.com/ezyang
2024-03-02 02:22:59 +00:00
a3b81666b1 [Dynamo] Fix guards for code objects (#120909)
By comparing them only by id, and raising an assert if someone calls into `EQUALS_MATCH`, which renders the following example compilable:
```python
import torch

@torch.compile()
def foo(x, y):
    code = compile(y, "foo", "exec")
    exec(y)
    return x

print(foo(torch.rand(3), "print('Hello World')"))
```

Fixes https://github.com/pytorch/pytorch/issues/120647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120909
Approved by: https://github.com/jansel
2024-03-02 02:17:17 +00:00
f7a2bae0ac Change TestOpWaitiness to use MultiProcessTestCase (#121046)
The test has been failing sporadically recently in CI and the failures are not reproducible locally, likely due to some nasty race condition related to a combination of MultiThreadedTestCase, the use of global state and finalizers, and the recently introduced test decorator for native funcol migration.

Switching the test to use MultiProcessTestCase provides better isolation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121046
Approved by: https://github.com/weifengpy
2024-03-02 01:12:14 +00:00
4cf6d1172b [FSDP2] Used ReduceOp.AVG if fp32 reduce-scatter (#120919)
This PR uses the `ncclAvg` op (via `ReduceOp.AVG`) when doing fp32 reduce-scatter. This allows the division by world size to happen in the reduce-scatter kernel itself, which seems to save an extra memory read/write for the division. This yields ~1.5% speedup on the Llama-7B workload (and makes per-parameter FSDP faster than flat-parameter FSDP 😅 ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120919
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
ghstack dependencies: #120238, #120910
2024-03-02 00:39:16 +00:00
85157af784 Fix more xfails for scaled_dot_product_attention (#121032)
Follow-up to #120928; should fix #120921.

I missed one test in #120928 - test_dispatch_symbolic_meta_outplace_all_strides. This wasn't caught because #120921 was open at the time, disabling the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121032
Approved by: https://github.com/drisspg
2024-03-02 00:28:44 +00:00
7c71d7f32b [DTensor] Supported foreach=True for clip_grad_norm_ (#120910)
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.

`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
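
A usage sketch (assuming `model` holds DTensor parameters whose gradients have already been computed):

```
import torch

total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=1.0, foreach=True
)
```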

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
2024-03-02 00:28:09 +00:00
f0e8e7cf43 [DTensor] Supported foreach=False for clip_grad_norm_ (#120238)
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).

To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
2024-03-02 00:25:16 +00:00
30befa592e add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-02 00:17:34 +00:00
c8e56b4965 [c10d] dump from one and only one thread (PG0's monitor thread) (#120893)
Summary:
When there are multiple PGs in a process and a hardware failure happens, we found that multiple PGs/threads in the same process compete to dump the same records at the same time. This affects the reliability of dumps.

In this PR, we make the change such that only one thread/PG can dump: PG0's monitor thread. We use a static variable to indicate that something (e.g., a collective timeout) has triggered the dump locally.

The monitor thread dumps debug info under any one of these 3 conditions:
1: the static variable is set to true by the watchdog thread when it detects a timeout or a pipe-dump signal
2: a timeout signal is received from other ranks through tcpstore
3: no heartbeat of the watchdog is observed
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
2024-03-02 00:13:13 +00:00
3d7cf8f392 Revert "Limit loop unrolling (#120023)"
This reverts commit 6cc7f9a2e6bedff3109ea066278e9805713da4bb.

Reverted https://github.com/pytorch/pytorch/pull/120023 on behalf of https://github.com/anijain2305 due to breaks llms export ([comment](https://github.com/pytorch/pytorch/pull/120023#issuecomment-1974104633))
2024-03-02 00:04:08 +00:00
d8395830ea [ONNX][dynamo_export] Skip instance_norm decomp for export (#120866)
Otherwise, instance_norm is decomposed into batch_norm with training set to True.
The downstream exporter has no way to figure out that training is actually not needed.
ONNX does have an InstanceNormalization operator defined; however, due to the decomp,
it unnecessarily exports as batch norm plus glue code.

Depends on https://github.com/microsoft/onnxscript/pull/1284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120866
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2024-03-01 23:51:16 +00:00
581fe26792 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
2024-03-01 23:45:43 +00:00
0a38a6ac80 [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
According to the [cuBLAS API Reference](https://docs.nvidia.com/cuda/cublas/index.html#cublassetworkspace) the recommended workspace size for Hopper is 32 MiB and for the rest architectures 4 MiB. This PR increases the workspace size accordingly. I am not aware of the recommended workspace size for HIP, that is why I am keeping it unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120925
Approved by: https://github.com/eqy, https://github.com/malfet
2024-03-01 23:32:59 +00:00
06b52dd103 TD outside of test job (#118250)
Give TD its own job so that each shard can get the results from this one job artifact; the shards will always be in sync with each other and we no longer need to worry about consistency issues.

* Move test discovery to its own file that is not dependent on torch so it can be run without building torch
  * Cannot do cpp test discovery before building pytorch
* Move TD calculation to own file that will create a json file with the final results
* TD is now job/build env agnostic
* TD will rank all tests, including those that test jobs may not want to run (ex it will rank distributed tests along with default tests, even though these tests are never run on the same machine together)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118250
Approved by: https://github.com/huydhn
2024-03-01 23:08:10 +00:00
d08ce51881 [compiled autograd] refactor eager test loading and run custom ops tests (#120679)
TestCustomOp's tests use helper attributes and functions from a util parent class. To support arbitrary test classes, we need to refactor the current approach. Instead of allowlisting certain methods, we copy the whole class and only overwrite the "test_.*" methods.

Compiled autograd fails on ~10/90 of the newly added tests. test_autograd_function_backed_op is the example we discussed in PT-2D meeting about requiring c++ autograd::Function support. I'm addressing this in #120732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120679
Approved by: https://github.com/jansel, https://github.com/zou3519
2024-03-01 22:48:17 +00:00
8cb4855d1e Release the GIL in serialization when it is safe to do so (#120818)
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects

Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there.
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
2024-03-01 22:37:26 +00:00
fd2ab1f613 [PT2][Inductor] Change the split cat log to debug (#120823)
Summary: Address the report in https://github.com/pytorch/pytorch/issues/120771.

Test Plan: see signal

Differential Revision: D54323475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120823
Approved by: https://github.com/jackiexu1992
2024-03-01 22:34:23 +00:00
797d4fbdf4 [export] Log operator set. (#120951)
Summary: as title. We want to count the number of total operator calls, and the distinct set of operators in the exported graph.

Test Plan: CI

Differential Revision: D54390298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120951
Approved by: https://github.com/tugsbayasgalan
2024-03-01 20:58:31 +00:00
d3876f73e7 Preserve metadata for MutableMapping and MutableSequence in pin_memory and collate_fn (#120553)
For the user-defined `Mapping` type, it may contain some metadata (e.g., pytorch/tensordict#679, https://github.com/pytorch/pytorch/pull/120195#issue-2141716712). Simply using `type(mapping)({k: v for k, v in mapping.items()})` does not take this metadata into account. This PR uses `copy.copy(mapping)` to create a clone of the original collection and iteratively updates the elements in the cloned collection. This preserves the metadata in the original collection via `copy.copy(...)` rather than relying on the `__init__` method of the user-defined classes.
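A small self-contained illustration of the approach (the `MetaDict` class and the values are hypothetical, for demonstration only):

```python
import copy

class MetaDict(dict):
    """A user-defined mapping that carries extra metadata."""
    def __init__(self, *args, device=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.device = device

batch = MetaDict({"a": 1, "b": 2}, device="cpu")

# type(batch)({...}) would call __init__ without `device` and silently drop the metadata;
# copy.copy keeps the instance attributes, and the values are then overwritten in place.
out = copy.copy(batch)
for k, v in batch.items():
    out[k] = v + 1

print(out, out.device)  # {'a': 2, 'b': 3} cpu
```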

Reference:

- pytorch/tensordict#679
- #120195

Closes #120195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120553
Approved by: https://github.com/vmoens
2024-03-01 20:43:42 +00:00
a7c799fb85 [executorch] Add support for method variants in aten executorch code gen (#121016)
Summary: Title.

Test Plan: The added unittest

Differential Revision: D54423028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121016
Approved by: https://github.com/larryliu0820
2024-03-01 20:33:02 +00:00
7a64eb65e4 Fix Dynamo tests failing with "Failed running call_function <built-in function linalg_norm" (#120993)
When iterating the ord value over an array, we share the same torchdynamo context. This makes dynamo treat the `ord` variable as a dynamic shape, causing problems.

In the `vector_norm` decomposition, casting the int type ord to float will fix this problem.
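A minimal, hypothetical repro of the pattern described above (not taken from the PR); the explicit `float(...)` cast mirrors what the decomposition now does internally:

```python
import torch

@torch.compile
def norm(x, ord):
    return torch.linalg.vector_norm(x, ord)

x = torch.randn(8)
for ord in (1, 2, 3):            # iterating int ords within a single dynamo context
    print(norm(x, float(ord)))   # the float cast avoids treating `ord` as a dynamic shape
```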

Fixes https://github.com/pytorch/pytorch/issues/119795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120993
Approved by: https://github.com/lezcano
2024-03-01 20:27:45 +00:00
39e4d1a535 Make TestEmbeddingNNDeviceTypeCPU::test_EmbeddingBag_per_sample_weights_and_no_offsets_cpu_int32_float32 compatible with TorchDynamo (#120831)
Previously, the test case directly accessed the tensor data via tensor.data, which is not supported on FakeTensor, so we manually copy the tensor as a workaround.
Fixes: https://github.com/pytorch/pytorch/issues/119788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120831
Approved by: https://github.com/janeyx99
2024-03-01 20:27:41 +00:00
e02047add4 [BE][Ez]: Update ruff to 0.3.0 (#121003)
Update ruff to 0.3.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121003
Approved by: https://github.com/malfet
2024-03-01 20:20:55 +00:00
af93849a3a [pt2 export] small fix on non_persistent buffer unlift (#120715)
Summary: Change to get_buffer from the input plain_graph_module instead of the new stateful_gm when restoring non_persistent buffers, since the stateful_gm doesn't contain the buffer yet.

Test Plan:
Added test case.
`buck test caffe2/test:test_export -- test_unlift_nonpersistent_buffer`

Differential Revision: D54216772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120715
Approved by: https://github.com/zhxchen17
2024-03-01 20:20:00 +00:00
19fcf6de1a Add lowering for fraction_max_pool2d (#120460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120460
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-03-01 20:13:20 +00:00
cdb50d0380 remove constraints from aot_compile (#120979)
Differential Revision: D54405986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120979
Approved by: https://github.com/zhxchen17
2024-03-01 20:06:21 +00:00
55ae8fb1f6 Switched m1 runners to the lable macos-m1-stable (#120997)
Switched m1 runners to use  `macos-m1-stable` label, which points to exactly the same M1 running MacOS-13.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120997
Approved by: https://github.com/malfet
2024-03-01 19:52:34 +00:00
de3202abea [EZ][BE] Remove Python-2 installation logic (#121015)
Not sure why it's still there in 2024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121015
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2024-03-01 19:39:02 +00:00
b474a523c6 Ban passing in free function into capture_pre_autograd_graph (#120817)
Summary: Today we don't allow free functions to be the tracing callable in torch.export. As part of migrating capture_pre_autograd_graph usages to torch.export, we need to ban free functions in capture_pre_autograd_graph as well.

Test Plan: CI

Differential Revision: D54319597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120817
Approved by: https://github.com/zhxchen17, https://github.com/andrewor14
2024-03-01 19:38:58 +00:00
ce50db22c2 Handle transposition pattern seen in SDPA with unbacked SymInts (#121005)
Fixes https://github.com/pytorch/pytorch/issues/121000

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121005
Approved by: https://github.com/lezcano
2024-03-01 18:58:19 +00:00
11f2e8beac [Dynamo, Compiled] Save some python overhead when calling compiled function with many tangents (#118730)
When a dynamo backend captures the entire forward pass and the entire backward pass without a graph break, there can be many (from memory, hundreds or thousands for a big model) `contiguous` calls. Here we save that overhead by checking `is_contiguous` before the `contiguous` call.
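The idea, as a tiny hedged sketch (the helper name is mine, not from the PR):

```python
import torch

def maybe_contiguous(t: torch.Tensor) -> torch.Tensor:
    # Skip the extra Python-level call when the tensor is already contiguous;
    # repeated hundreds or thousands of times per step, that overhead adds up.
    return t if t.is_contiguous() else t.contiguous()

t = torch.randn(4, 4)
assert maybe_contiguous(t) is t                    # already contiguous: returned as-is
assert maybe_contiguous(t.t()).is_contiguous()     # transposed view gets materialized
```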

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118730
Approved by: https://github.com/thiagocrepaldi, https://github.com/ezyang
2024-03-01 18:57:18 +00:00
0b18ed1c47 [FSDP] Added warning about unsupported double backwards (#120926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120926
Approved by: https://github.com/Skylion007
2024-03-01 18:40:30 +00:00
f01a23d01b Don't aggressively rewrite asserts for symbolic expressions (#120564)
Fixes: https://github.com/pytorch/pytorch/issues/118417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120564
Approved by: https://github.com/ezyang
2024-03-01 17:46:36 +00:00
c844b377fa [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600
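(The bytecode printout referenced above is an internal paste.) A minimal, hypothetical usage sketch of the config, assuming `reorderable_logging_functions` is exposed as a set of callables on `torch._dynamo.config`:

```python
import torch
import torch._dynamo.config as dynamo_config

# Register print as reorderable so it no longer forces a graph break.
dynamo_config.reorderable_logging_functions.add(print)

@torch.compile
def f(x):
    y = x + 1
    print("y =", y)   # replayed after the compiled graph runs instead of breaking the graph
    return y * 2

f(torch.randn(4))
```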

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 17:04:24 +00:00
9fc56f8209 Exclude operators that produce unbacked symbols (#120917)
Unbacked symbols vary at runtime which means they are not CUDA
graphable.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120917
Approved by: https://github.com/eellison
2024-03-01 16:56:08 +00:00
ea7149aa22 Replace TTIR string parsing with structured MLIR walk in Triton kernel mutation analysis (#120476)
Summary: Previously, we relied on the `lark`-based parsing of the string TTIR representation dumped by the Triton compiler. However, this has proven to be brittle in the face of changes both in the user-written Triton kernel code and in the Triton compiler code.

In this PR, we add an alternative way of mining the function information from the TTIR based on walking the tree of structured MLIR entities. To this end, we rely on the MLIR bindings exposed by `libtriton` (related PR in Triton: https://github.com/openai/triton/pull/3191).

For now, we introduce gating based on whether `ttir_module.hasattr("walk")`. This will allow switching to the newly introduced TTIR analysis approach only when the new MLIR bindings (including that of `ModuleOp::walk`) become available in the Triton pin. Before then, we'll keep using the old string TTIR parsing-based approach.

Test Plan: The new functionality was tested locally with the latest Triton version compiled with the added new MLIR bindings: all Triton kernel mutation tests in `test_triton_kernels.py` are passing. Here we rely on the CI for regression testing, but it won't cover the new functionality due to gating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120476
Approved by: https://github.com/oulgen
2024-03-01 16:20:24 +00:00
8861507ba3 Fix guard for SUPPORTED_NODES (#120798)
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.

This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module.  It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.

Also added a unit test which fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
2024-03-01 16:03:21 +00:00
b8e6ca6f76 Add sparse compressed meta tensor support (#120707)
As in the title.

Replaces https://github.com/pytorch/pytorch/pull/120498 and https://github.com/pytorch/pytorch/pull/120562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120707
Approved by: https://github.com/ezyang
ghstack dependencies: #120703
2024-03-01 13:28:47 +00:00
70d4d109f2 Make SparseCsr a functionality dispatch key (#120703)
As in the title.

To enable meta and fake tensor support for sparse compressed tensors in compliance with the meta/fake tensor support for sparse COO tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120703
Approved by: https://github.com/ezyang
2024-03-01 13:28:46 +00:00
eee040c939 expose nested header to wheel (#120603)
Expose the nested tensor headers in the PyTorch wheel, so developers can reuse PyTorch's nested-tensor-related util headers from the wheel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120603
Approved by: https://github.com/jbschlosser, https://github.com/gujinghui
2024-03-01 09:59:45 +00:00
c646030cd2 Support higher order op functionalization in predispatch IR (#115314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115314
Approved by: https://github.com/bdhirsh
2024-03-01 09:13:47 +00:00
82b356193d Move VariableInfo into its own file to avoid circular dependency (#120732)
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward declare VariableInfo, but this VariableInfo was also being used in other nodes like PyNode so it felt cleaner to do it this way.

Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
2024-03-01 08:48:13 +00:00
8c2e569928 [PT2D] Make the speedup benchmark works with DDP + CompiledAutograd (#120454)
With DDP + CompiledAutograd, we could not use the same parallelized model to do the test. This PR copies the model.

Differential Revision: [D54094257](https://our.internmc.facebook.com/intern/diff/D54094257/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120454
Approved by: https://github.com/yf225, https://github.com/xmfan
2024-03-01 08:35:22 +00:00
cyy
77ef9d4022 Add verbose parameter to torch.hub.list (#120717)
This PR adds ```verbose``` to ```torch.hub.list``` to let users disable extraneous output.
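A hedged usage example (requires network access to fetch the repo listing; `pytorch/vision` is just an illustrative hub repo):

```python
import torch

# verbose=False suppresses the extra progress/cache messages while listing entrypoints.
entrypoints = torch.hub.list("pytorch/vision", verbose=False)
print(entrypoints[:5])
```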
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120717
Approved by: https://github.com/ezyang
2024-03-01 07:39:48 +00:00
63b259492a Revert "[dynamo] Reorder logs (#116106)"
This reverts commit c5472628ff9dedff57722941ac1b2a50af880197.

Reverted https://github.com/pytorch/pytorch/pull/116106 on behalf of https://github.com/clee2000 due to landrace with 342e7929b804ec56121e82e92d6a199b549c38b1, which removed the import for warnings.  Should be an easy fix after rebase c5472628ff ([comment](https://github.com/pytorch/pytorch/pull/116106#issuecomment-1972586180))
2024-03-01 06:25:46 +00:00
eqy
86e6497c6f [Inductor][cuDNN] Disable tf32 in test_mutate_view_for_conv_output (#120953)
Another disablement of TF32 to unblock #120642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120953
Approved by: https://github.com/Skylion007
2024-03-01 05:51:29 +00:00
6ed26392b3 Update xfails for scaled_dot_product_attention (#120928)
Update xfails for test_dispatch_meta_outplace and test_dispatch_symbolic_meta_outplace.

These tests are sometimes expected to fail because we moved the registrations from meta_registrations.py to fake_impls.py. AFAIK this is okay: fake tensors will still work because we have special handling in fake_impls.py. The purpose of this PR is to update the xfails so they correctly xfail the failing tests.

Previously, I set these to xfail only for bfloat16, float16, and float32, but not float64; but this isn't really correct. Explanation below:

Scaled dot product attention (SDPA) has multiple implementations, including efficient_attention, flash_attention, and unfused attention. flash_attention supports fp16, bf16. efficient_attention supports fp16, bf16, fp32. unfused attention supports all dtypes.

efficient_attention and flash_attention implementations will fail the meta tests, but the unfused attention will not. Certain platforms may support none, both, or one of efficient_attention and flash_attention. Unfused attention will pass because it falls back to constituent ops which have registered meta kernels.

So: on CUDA, we have all 3 available: in bf16, fp16, fp32, we'll select one of the fused implementations (where this test will fail).
On ROCM, we don't have efficient_attention: so fp32 will use the unfused implementation, where the test will pass.

Fix in this PR:
* If any fused impl is available, then xfail float16 & bfloat16
* If efficient_attention is available, then also xfail float32
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120928
Approved by: https://github.com/drisspg
2024-03-01 05:16:11 +00:00
2a08a51738 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.
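A usage-level sketch of the kind of scalar runtime assert this enables; whether `torch._check` on an unbacked value lowers to exactly this operator is my assumption, and `capture_scalar_outputs` is only set so that `.item()` yields an unbacked SymInt:

```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # let .item() produce an unbacked SymInt

@torch.compile
def f(x):
    u = x.item()
    torch._check(u >= 2)   # a scalar-level runtime assert, no scalar_to_tensor round-trip
    return torch.zeros(u)

print(f(torch.tensor(5)).shape)
```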

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
ghstack dependencies: #120800
2024-03-01 05:06:36 +00:00
77aea289ae Add test to check that COW inputs are not materialized (#119507)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
2024-03-01 05:05:28 +00:00
13a54ce279 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-03-01 05:05:28 +00:00
d053dcfa69 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
The cast_outputs function is only used for the CPU device, and it is already called in the cpu_xxx_vec helpers, e.g. cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-01 04:54:11 +00:00
c5472628ff [dynamo] Reorder logs (#116106)
Currently when there is a print/warning in the graph, dynamo graph breaks causing export to fail. However export would like to just skip over these print/warning calls: https://github.com/pytorch/pytorch/issues/113792.

Additionally there's a torch.compile feature request to "reorder prints" so that instead of graph breaking when hitting prints/logging, we can skip over these prints to create larger compiled graphs, and then print the results out after those compiled graphs: https://github.com/pytorch/pytorch/issues/93739. This PR also adds the `reorderable_logging_functions` config for users to register logging functions to be reordered (like `print` or a custom logging function). Printout of the bytecode after reordering the prints looks like the following: P914736600

There are some limitations to the printing right now:
* You can only register logging functions, not methods
* Inputs to the logging functions can only be tensors, constants, and format strings
* Inputs to the logging functions which will later be mutated in-place will not be printed correctly

TODO: Add the following tests
* print function with argument of nested data structure;
* print function with argument of nested data structure being updated inside of compile region (this would test if we handle side effect correctly);
* custom defined logging functions with nn.Module or nn.Module attribute arguments;
* custom defined logging functions with submodule input/output as arguments (we need to handle the mapping and fused-out value);
* custom defined logging functions with tensor argument and mutation inside of the function (TBD: this may increase memory usage);

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116106
Approved by: https://github.com/yanboliang
2024-03-01 04:48:44 +00:00
02a410ee12 Enable TORCH_TRACE by default in all Tupperware like environments (#120915)
Summary:
This is a reimplemented version of the FB specific code in https://www.internalfb.com/diff/D54230697

The new strategy is that we unconditionally install an FB handler to trace_log logger (and always set level to DEBUG). When the first log message is emitted, we check the JK/filesystem to see if we should actually do logging. If we decide we don't do logging, we remove the handler from trace_log and are done.

build_only[github-export-checks,executorch,pytorch_benchmark,pytorch_quantization,pytorch_distributed,pytorch_distributed_gpu,pytorch_dynamo_inductor,pytorch_functorch,pytorch_fx2trt,pytorch_diff_train_tests_ads,glow_fb_pytorch_tests,training_platform,training_platform_compatibility,training_toolkit_applications,training_toolkit_examples,training_toolkit_model_optimization,dper3_pytorch,xplat_caffe2,pytorch_dev,android-pytorch-instrumentation-tests,smartpytorchgithub_first_try_merge,frl-target-determinator,f6-buck,training_platform_for_github,sigmoid_cpu,sigmoid_gpu,aiplatform_modelprocessing_for_github,accelerators_workloads_models_slimdsnn,ae_aotinductor_benchmark_test,aps_,aps_deterministic_ne_tests,dper_lib_silvertorch,torchrec,torchrec_fb,deeplearning_aot_inductor]

Test Plan:
sandcastle

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/inference/tests:test_single_gpu_executor -- --exact 'torchrec/inference/tests:test_single_gpu_executor - TorchDeployGPUTest.NestedModelSingleGPU'
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test -- --exact 'dper_lib/silvertorch/modules/dynamic_stats/tests:accumulators_test - test_global_fixed_interval_accumulator (dper_lib.silvertorch.modules.dynamic_stats.tests.accumulators_test.GlobalFixedIntervalUnivalentAcculumatorTest)'
```

Also running a test flow with/without JK enabled

Differential Revision: D54275086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120915
Approved by: https://github.com/yanboliang
2024-03-01 04:47:13 +00:00
518a23bb03 support bool as Scalar Type in TorchScript (#113835)
Fixes #112402
Fixes #75465

From the description in #75465, the bool type should be a subtype of int, and `register_prim_ops.cpp` already supports converting from bool to int or float.
So this patch fixes bool as a Scalar in TorchScript.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113835
Approved by: https://github.com/davidberard98
2024-03-01 04:20:15 +00:00
2e84d01d05 [executorch hash update] update the pinned executorch hash (#120747)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120747
Approved by: https://github.com/pytorchbot
2024-03-01 04:02:09 +00:00
65d568680c Revert "[Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)"
This reverts commit 1104e0798c8206e0226f2d68f6bb065645e6276f.

Reverted https://github.com/pytorch/pytorch/pull/120812 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failure test_simple_model look legit 1104e0798c ([comment](https://github.com/pytorch/pytorch/pull/120812#issuecomment-1972460001))
2024-03-01 03:53:27 +00:00
e49f31ca02 [onnxrt, dynamo] Enable custom ONNX model transforms in onnxrt dynamo backend (#120854)
A global transform list is created. All backend instances call the transform functions in that list sequentially to modify the exported ONNX model before sending it to the ORT session. For example, `record_onnx_model_transform` below is a no-op transform that only records the ONNX graphs sent to ONNXRuntime.

```python
        recorded_models = []

        def record_onnx_model_transform(onnx_model):
            # Record the ONNX model seen by the transform.
            recorded_models.append(onnx_model)

        from torch.onnx import (
            register_backend_graph_transform,
            unregister_backend_graph_transform,
        )
        # Register so that `onnxrt` backend calls it to modify ONNX model.
        register_backend_graph_transform(record_onnx_model_transform)

        def example_model(x: torch.Tensor):
            y = torch.sigmoid(x)
            z = x + y
            return z

        # During the compilation, the exported ONNX model will be
        # modified by calling `record_onnx_model_transform` before
        # sending the model to `onnxruntime.InferenceSession`.
        compiled_model = torch.compile(
            example_model,
            backend="onnxrt",
            dynamic=True,
        )
        # Now, `recorded_models` should contain one `onnx.ModelProto` representing
        # `example_model(x: torch.Tensor)`.

        # Remove the pass when not needed. If `record_onnx_model_transform` is not
        # removed, it will be applied to all models compiled by `backend="onnxrt"`.
        unregister_backend_graph_transform(record_onnx_model_transform)
```

In the future, we plan to use this mechanism to register all graph transforms, such as graph fusion and general ONNX optimization, for `onnxrt`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120854
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-01 03:24:17 +00:00
67c97a9aad fix the scale dot attention doc (#120859)
Fixes #120810

The code verifies the broadcast behavior (from the issue),
```
import torch

B = 3
S = 5
L = 7
E = 16
EV = 32
additional_batches = [2, 4]

query_shape = [B] + additional_batches + [L, E]
key_shape = [B] + additional_batches + [S, E]
value_shape = [B] + additional_batches + [S, EV]

query = torch.rand(*query_shape)
key = torch.rand(*key_shape)
value = torch.rand(*value_shape)
mask = torch.zeros((1, 1, S), dtype=torch.bool)
mask[:, :, S // 2 :] = True

# query.to("cuda")
# key.to("cuda")
# value.to("cuda")
# mask.to("cuda")

attention = torch.nn.functional.scaled_dot_product_attention(query, key, value, mask)

print(f"query shape = {query.shape}")
print(f"key shape = {key.shape}")
print(f"value shape = {value.shape}")
print(f"mask shape = {mask.shape}")
print(f"attention shape = {attention.shape}")

#in both CPU and cuda, output shape is:
# query shape = torch.Size([3, 2, 4, 7, 16])
# key shape = torch.Size([3, 2, 4, 5, 16])
# value shape = torch.Size([3, 2, 4, 5, 32])
# mask shape = torch.Size([1, 1, 5])
# attention shape = torch.Size([3, 2, 4, 7, 32])

## test add is broadcasting mask to query@(key.mT)
res = query@(key.mT)
print(res.shape)
res2 = torch.add(res, mask)
print(res2.shape)
```

At code level, in the default backend,
ab38354887/aten/src/ATen/native/transformers/attention.cpp (L735)

the add operation is broadcasting the `attn_mask` to `auto attn = at::matmul(query, key.transpose(-2, -1) * scaling_factor);`

- Changed the doc in [torch/nn/functional.py](https://github.com/pytorch/pytorch/pull/120859/files#diff-c358c214f663ba0c8b9c6846fbe0042fa29494cf02fe4714a17dcd0d268b035b).
- Also fixed a few inconsistencies in the cpp comments.

@mikaylagawarecki

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120859
Approved by: https://github.com/drisspg
2024-03-01 02:54:08 +00:00
b35551f357 Ban reset_to_zero argument to triton.autotune in user defined kernels (#120938)
Fixes #120802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120938
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-03-01 02:37:24 +00:00
06f8af30fa Change FakeTensor serialization to consider only an _active_ FakeTensor mode (#120848)
Summary: https://github.com/pytorch/pytorch/pull/108186 made some changes related to FakeTensor serialization such that saving and loading a tensor will give us a meta tensor, even if FakeTensor mode is not enabled. This means we can't properly save and load Tensors as part of Fx graph caching. This PR changes the logic to check whether there's an _active_ FakeTensor mode.

Test Plan:
* New unit tests
* Validated unit tests introduced in https://github.com/pytorch/pytorch/pull/108186 still pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120848
Approved by: https://github.com/eellison, https://github.com/thiagocrepaldi
2024-03-01 02:37:21 +00:00
e3dbd194f4 [dynamo] Support module backwards hooks (#120685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120685
Approved by: https://github.com/yanboliang, https://github.com/xmfan
2024-03-01 02:24:26 +00:00
9b2c35b4fe [dynamo] Fix convolution meta kernel when input channel is 0 (#120944)
Addresses https://github.com/pytorch/pytorch/issues/118797

Adding in special channel handling logic from eager (set output channels to 0 when input channels are 0):
67d3e4f2a2/aten/src/ATen/native/Convolution.cpp (L1400-L1403)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120944
Approved by: https://github.com/zou3519
2024-03-01 01:18:21 +00:00
d534a49767 Reinplace auto_functionalized (#120829)
Fixes https://github.com/pytorch/pytorch/issues/120441

We follow how triton_kernel_wrapper_functional gets re-inplaced:
- If we see auto_functionalized, then first we compute what inputs we
  actually need to clone ("tensors_to_clone") and fixup the graph. This happens in
  `reinplace_and_refine_tensors_to_clone`, which I have refactored out
  of the triton_kernel_wrapper_functional reinplacing code.
- Later on, after the reinplacing pass, we have a decomposition pass for
  auto_functionalized. In that decomposition pass, we make use of the
  "tensor_to_clone" info and only clone those inputs in the
  decomposition.
- We shepherd "tensor_to_clone" from the first step to the second step
  by setting the .meta field on the auto_functionalized node.

Test Plan:
- existing tests
- tested locally by reading the output of TORCH_LOGS="post_grad_graphs"
- added assertExpectedInline tests for the post_grad_graphs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120829
Approved by: https://github.com/oulgen
2024-03-01 00:55:19 +00:00
791f8ef350 [Composable APIs] Add composable API fully_shard deprecation warning (#120929)
`fully_shard`(https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fsdp/fully_shard.py) will be used by new FSDP2 and we want to add a deprecation warning to the existing composable API's `fully_shard`(https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fully_shard.py#L40).

Planned release schedule is as follows https://dev-discuss.pytorch.org/t/release-cadence-for-year-2023-2024/1557:

Minor Version | Release branch cut | Release date | First patch release date | Second patch release date
-- | -- | -- | -- | --
2.3 | Mar 2024 | Apr 2024 | May 2024 | Jun 2024
2.4 | May 2024 | Jul 2024 | Aug 2024 | Sep 2024
2.5 | Aug 2024 | Oct 2024 | Nov 2024 | Dec 2024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120929
Approved by: https://github.com/awgu
2024-03-01 00:55:16 +00:00
fd35aafc26 Teach dynamo about vjp (#119405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119405
Approved by: https://github.com/zou3519
ghstack dependencies: #118407
2024-03-01 00:21:10 +00:00
9d5dea7812 [DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816)
as title

Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816
Approved by: https://github.com/fegin
2024-03-01 00:21:05 +00:00
33da8d5c12 Revert "Fix guard for SUPPORTED_NODES (#120798)"
This reverts commit 1b8bb027f676aa8c4260a3f6b9a5c98c37d25dc7.

Reverted https://github.com/pytorch/pytorch/pull/120798 on behalf of https://github.com/kit1980 due to the new test fails internally, see D54343456 ([comment](https://github.com/pytorch/pytorch/pull/120798#issuecomment-1972134227))
2024-02-29 23:19:22 +00:00
7ebfe21724 Fix nll loss dynamo failure (#120805)
Fix for https://github.com/pytorch/pytorch/issues/119791 Part of dynamo bug bash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120805
Approved by: https://github.com/Skylion007, https://github.com/zou3519, https://github.com/malfet
2024-02-29 22:34:49 +00:00
d03b11ad5b Pass inductor strides forward in ddp optimizer (#120523)
Note: Returning Fake Tensors on First AOT Autograd Call

Inductor will optimize strides of outputs when it deems it profitable, for instance converting to channels last. When we split the graph here into multiple inductor compilations, we need to make sure that the output strides of one compilation are appropriately passed to the subsequent compilations. However, the mapping from inductor output to dynamo output is non-trivial due to aot_autograd's deduping, de-aliasing, mutation, re-writing, subclass handling, etc. In order to replay all this logic we set a flag such that the first invocation of inductor in aot_autograd will return fake tensors with appropriate strides. Then, all of aot autograd's runtime logic is replayed. This gives us the appropriately strided outputs here, which reflect the runtime strides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120523
Approved by: https://github.com/yf225, https://github.com/bdhirsh
2024-02-29 22:25:00 +00:00
772db2a3ae Fix handling of torch.return_types in dynamo (#120826)
Handle quasi-namedtuples as a special case in dynamo
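For reference, a small example of the structured return type in question (my illustration, not from the PR):

```python
import torch

@torch.compile
def f(x):
    # torch.sort returns a torch.return_types.sort object, a namedtuple-like structure
    result = torch.sort(x)
    return result.values + result.indices

print(f(torch.randn(4)))
```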

Fixes #120651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120826
Approved by: https://github.com/anijain2305
2024-02-29 22:11:35 +00:00
da559c98e3 Fix isin decomp and add python meta registration (#120821)
Fixes #119792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120821
Approved by: https://github.com/malfet, https://github.com/peterbell10
2024-02-29 22:08:50 +00:00
76d3a6bb4a Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)"
This reverts commit 381a7ad3f1cd38bf8e814ae9d275f101a2136139.

Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))
2024-02-29 22:06:13 +00:00
e7039e3a0b [dynamo][easy] Dynamo test changes (#120927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120927
Approved by: https://github.com/yanboliang
ghstack dependencies: #120864, #120730
2024-02-29 22:05:41 +00:00
39c092d242 Skip semi-structured-sparse on windows (#120807)
# Summary

We can see that in this job on the other PR: https://github.com/pytorch/pytorch/actions/runs/8086597674/job/22096699337?pr=120641#step:11:11272

building the SemiStructuredSparse kernel is erroring on Windows machines, so I think we should land this.

### Details

Introduced in here:  https://github.com/pytorch/pytorch/pull/120434

We don't compile it for Windows, so we should have skipped this test.

There is another PR: https://github.com/pytorch/pytorch/pull/120641
which removes this skip for Windows; if that is green we should do that, otherwise we should skip the Windows tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120807
Approved by: https://github.com/alexsamardzic, https://github.com/jcaip
2024-02-29 21:48:52 +00:00
1a1f58ffbe [rocm][cmake] retrieve rocm location from ROCM_SOURCE_DIR env if specified (#120898)
This PR allows us to build PyTorch with a rocm that is not installed
to the default location, i.e. /opt/rocm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120898
Approved by: https://github.com/jianyuh
2024-02-29 21:32:45 +00:00
b2dddcfe27 [FSDP2][DCP][DSD] Add test to ensure FSDP2 model/optim state_dict work after a full training loop (#120871)
This PR adds tests to test distributed state dict work properly for FSDP2's model and optimizer state_dict after a full training loop.

We test the combinations of these options on an evenly sharded model.
```
{
    "reshard_after_forward": [True, False],
    "optimizer_class": [torch.optim.Adam],
    "compile_model": [True, False],
},
```

Followup: 1. Add test for unevenly sharded model. 2. Add test to include `torch.optim.AdamW` (seems to have some gaps currently, still investigating)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120871
Approved by: https://github.com/fegin
2024-02-29 21:24:00 +00:00
67d3e4f2a2 [TorchElastic] Refactoring to support non-default logging strategy (#120691)
Summary:
Pulling out logging parameters into logging specs that can be overridden (follow-up changes cover a possible override mechanism)

Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Creates a tempdir otherwise
- Creates a subdir for a run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json

In some instances users would like to customize the behavior, including file names, based on context. We do already have a mechanism to template the multiplexed teed output prefix.

With current changes, users can create custom log spec that can use env variables to change the behavior.

Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without going through the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.

Test Plan: CI + unit tests

Differential Revision: D54176265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
2024-02-29 20:59:17 +00:00
277bc97709 [FSDP2][ez] Combined communication test files (#120904)
This just combines the unit tests for the collective ops (copy-in/all-gather/copy-out and copy-in/reduce-scatter/view-out) with the unit tests for the communication schedule, mainly to avoid having too many test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120904
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
ghstack dependencies: #120659
2024-02-29 20:36:04 +00:00
0b924d7cde Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 7eb7ac815f0247a62b621897cea95ec4ca56d52e.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/kit1980 due to Broke internal tests, see D54230858 ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1971878323))
2024-02-29 20:12:50 +00:00
0a7666801d SymIntify prod_backward (#120776)
Fixes https://github.com/pytorch/pytorch/issues/120608

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120776
Approved by: https://github.com/albanD
2024-02-29 20:05:22 +00:00
313abcdba2 [c10d] fix the unwanted reason (#120863)
Summary:
Addressing #120849. Currently c10d treats a reason as a failure, hence gives some unwanted false positive errors. This is a quick fix, but we need to revisit the error handling logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120863
Approved by: https://github.com/kwen2501
2024-02-29 19:58:11 +00:00
f94933ed42 Refine value ranges on inequalities (#120800)
This is basically done the obvious way. For better or worse, I jammed this into what used to be `_maybe_guard_eq` but now is `_maybe_guard_rel`. I was careful to test all the off by one conditions, and each permutation. Let me know if you think I missed anything. Importantly, this now works for unbacked SymInts.

While testing, I noticed we are silently duck sizing all symbolic variables in `test_dynamic_shapes.py`. This may or may not be covering up bugs.

Along the way, I had to fix a bug in export constraints, where we weren't checking that the final var_to_range was consistent with what the user requested at top level.

After I implemented all this, I realized that applying this to non-unbacked SymInts was duplicative with @ysiraichi's previous work on https://github.com/pytorch/pytorch/pull/97963 . The upside is I now understand what Yukio was trying to do in the original PR, and I think my new logic is simpler and less error prone. In Yukio's earlier diff, Yukio tried very hard to avoid changing what guards we actually issue (since this would cause tests to wobble). Thus, when he refined a range, he also saved the guard that actually caused the range to refine. In this PR, I don't bother saving these guards; instead I just tighten var_to_range directly and rely on generating guards on this to be correct. The key insight is that if I assert `x < y`, it's always safe to emit (potentially) more restrictive range guards, because this won't invalidate our guards, it will just make them a little too strong (but actually, I think we are precise along the way.) If these guards make it unnecessary to test `x < y`, because now the ranges for x and y are disjoint, this is fine, we've subsumed the x < y guard and can just not bother testing it. If I've gotten it right, TV will agree with me.

In fact, I had a bug in this PR which TV didn't catch, which is that when we have a recorded var_to_guards for a symbol, we unconditionally never generate the range guard for it, even if the var_to_guards is potentially inconsistent with var_to_range (because var_to_range was updated separately). With var_to_guards removed, I don't have to worry about this inconsistency.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120800
Approved by: https://github.com/Skylion007, https://github.com/avikchaudhuri, https://github.com/ysiraichi
2024-02-29 19:41:51 +00:00
81c4c0dda2 [functional collective] don't import torchdynamo when running torchdeploy (#120900)
Summary: Importing torchdynamo in `functional_collective_impl.py` seems to break loading of torchdeploy models.

Test Plan: CI

Differential Revision: D54355011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120900
Approved by: https://github.com/fegin
2024-02-29 19:20:54 +00:00
f7a809c96a fix dupe deprecated warning in dynamo export (#120896)
Summary:
When we convert `dynamic_shapes` to `constraints` and pass them to `_dynamo.export`, we shouldn't give a deprecation warning. Such a conversion happens when calling `torch.export.export`, for example, but it can also happen when calling `capture_pre_autograd_graph` (which itself has this deprecation warning when `constraints` are passed directly as well).

Since `_log_export_usage` is an indicator of a top-level call (it is `True` by default but set to `False`, or at least passed through, by callers), we can (ab)use it to indicate when to give this deprecation warning.

Test Plan: none

Differential Revision: D54350172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120896
Approved by: https://github.com/BoyuanFeng, https://github.com/zhxchen17
2024-02-29 18:57:42 +00:00
0290fe65bd Test TD (test removal) on crossref (#119426)
Current threshold is to cut the bottom 75% of test files, which results in 13 min of tests getting cut.
test_ops, functorch/test_ops, test_decomp, and other really long-running test files are not getting cut and make the top 25% take really long (still 90+ min).

The original plan was to test on rocm but I'm worried about queuing given that cutting 75% of test files only cuts off 13 min, and crossref is rarely referenced by others and people keep talking about getting rid of it, so it's a good alternative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119426
Approved by: https://github.com/huydhn
2024-02-29 18:53:43 +00:00
1458f1de66 Revert "Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)"
This reverts commit 4b7a521856ca5fb0fc28edd18591f77fff5a6ba1.

Reverted https://github.com/pytorch/pytorch/pull/118935 on behalf of https://github.com/atalman due to Significantly increases build time. Optimization is needed ([comment](https://github.com/pytorch/pytorch/pull/118935#issuecomment-1971723284))
2024-02-29 18:42:21 +00:00
96eff4ef70 [inductor max autotune] Detailed autotuning result logs ( machine-readable ) (#119004)
This diff introduces a new separate logging of autotuning results,
with the intention of making the results analyzable, specifically
those for the new experimental Cutlass backend.

Results are logged as text files with one JSON document corresponding to a single benchmark result per line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119004
Approved by: https://github.com/jansel
ghstack dependencies: #120620
2024-02-29 18:24:13 +00:00
a911eb74ae [dynamo] Graph break when faking named tensors (#120779)
Fixes #120644
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120779
Approved by: https://github.com/zou3519
2024-02-29 18:22:15 +00:00
1104e0798c [Dynamo] Fix inspect.getattr_static doesn't work well for torch.utils._cxx_pytree.PyTreeSpec (#120812)
Fixes #118793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120812
Approved by: https://github.com/zou3519
2024-02-29 18:19:14 +00:00
ca679384c2 [rocm][cmake] correctly check the ROCM_SOURCE_DIR environment (#120858)
The existing use of "if(NOT ENV{ROCM_SOURCE_DIR})" seems to be
not working correctly, e.g.

```
$ cmake --version
cmake version 3.26.4

$ cat CMakeList.txt
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(FOO)

if(NOT ENV{ROCM_SOURCE_DIR})
  message(INFO ": not defined 1")
else()
  message(INFO ": defined 1: $ENV{ROCM_SOURCE_DIR}")
endif()

if("$ENV{ROCM_SOURCE_DIR}" STREQUAL "")
  message(INFO ": not defined 2")
else()
  message(INFO ": defined 2: $ENV{ROCM_SOURCE_DIR}")
endif()
$ ROCM_SOURCE_DIR=/tmp cmake .
INFO: not defined 1
INFO: defined 2: /tmp
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/yangche/tmp/tmp
```

This PR replaces it with a STREQUAL check. Note that the choice
of STREQUAL is to avoid cases like:

```
$ ROCM_SOURCE_DIR= cmake .
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120858
Approved by: https://github.com/jianyuh, https://github.com/jeffdaily
2024-02-29 17:49:00 +00:00
9e016debeb [dynamo] Fix inference_mode context variable (#120830)
<idk what im doing>
Fixes #120646

The module for torch.inference_mode should be torch

The input to `create` is a bool (mode?) and `_enter_inference_mode` expects a bool but [BlockStackEntry](50073248ed/torch/_dynamo/symbolic_convert.py (L206)) expects `target_values` to be a list?
[inference_mode](50073248ed/torch/autograd/grad_mode.py (L205))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120830
Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/tugsbayasgalan
2024-02-29 17:10:06 +00:00
98c4ba683e [EZ][BE] Fix ResourceWarning (#120886)
By closing the file handle

Fixes
```
/Users/nshulga/git/pytorch/pytorch/test/quantization/core/test_docs.py:132: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/nshulga/git/pytorch/pytorch/docs/source/quantization.rst' mode='r' encoding='UTF-8'>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120886
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-29 17:07:39 +00:00
664dd61b29 Add some more symbolic shapes related files to ciflow/inductor (#120887)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120887
Approved by: https://github.com/janeyx99, https://github.com/malfet
2024-02-29 16:59:32 +00:00
558316b5f4 Emit grid wrapper inlined with the user defined triton kernel (#120824)
Fixes #120801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120824
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #120809
2024-02-29 16:17:45 +00:00
84e2accd6c Make triton_meta be part of user defined triton kernel cache (#120809)
Tensors with different shapes will generate different triton meta (divisibility rules), so we need this to be part of the cache key.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120809
Approved by: https://github.com/chenyang78, https://github.com/jansel
2024-02-29 16:17:45 +00:00
342e7929b8 [export] kill deprecated constraints API (#120860)
Summary:
Previously `export` would take `constraints` built with `dynamic_dim(...)`s. This has been deprecated for a while; one can now pass in a `dynamic_shapes` spec built with `Dim(...)`s.
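For reference, a minimal example of the `Dim`-based `dynamic_shapes` spec that replaces the old `dynamic_dim` constraints (the module and shapes are illustrative):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sum(dim=-1)

batch = Dim("batch")  # replaces dynamic_dim(x, 0)-style constraints
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
print(ep.graph)
```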

Here we kill this deprecated API. Eventually this will lead to simplification of the underlying implementation, since the new `Dim`-based specs can map 1-1 with symbolic shapes concepts without going through indirect machinery of `dynamic_dim`-based constraints. It is expected that internal APIs like `_dynamo.export` and `_trace._export_to_torch_ir` will change when that happens.

Leaving `aot_compile` and `capture_pre_autograd_graph` entry points alone for now. This will eventually be updated anyway.

Test Plan: updated tests

Differential Revision: D54339703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120860
Approved by: https://github.com/suo, https://github.com/tugsbayasgalan
2024-02-29 16:15:50 +00:00
3cfed01228 [AOTI] Store OpOverload in ir.ExternKernel (#120629)
Summary: Currently the logic for filling in the default value for optional arguments is scattered in several places. By storing OpOverload in the base ExternKernel class, we can simplify codegen_kwargs; this is a preparation step for enabling the torchgen-ed C shim. The default value filling logic for FallbackKernel can also be simplified, but that can come later.

Differential Revision: [D54258089](https://our.internmc.facebook.com/intern/diff/D54258089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120629
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987, #120592
2024-02-29 15:51:33 +00:00
fa7241ed79 [AOTI] Change the cpp wrapper codegen for sdpa (#120592)
Summary: Switch codegen for sdpa to always point to v2 in the C shim. Since aoti_torch__scaled_dot_product_flash_attention_v2 has been introduced for a while, there shouldn't be any FC issue in production.

Differential Revision: [D54258090](https://our.internmc.facebook.com/intern/diff/D54258090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120592
Approved by: https://github.com/chenyang78
ghstack dependencies: #119987
2024-02-29 15:49:23 +00:00
52e3c78a43 [AOTI][refactor] Move a few util functions in aoti_torch (#119987)
Summary: Move these util functions from an anonymous namespace to a common header so that later torchgen-ed files can use them.

Differential Revision: [D54258088](https://our.internmc.facebook.com/intern/diff/D54258088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119987
Approved by: https://github.com/chenyang78
2024-02-29 15:46:47 +00:00
5b9e5f854b [profiler] Log process group id instead of backend id (#120475)
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.

However, it is inconvenient to correlate collectives with backend id. Instead, using pg id(uid) to correlate directly is a better solution.
This PR changes the ID information exposed in record_param_comms from backend_id to pg_id.

Differential Revision: D53558257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
2024-02-29 15:04:33 +00:00
576c0482a5 Remove hard numpy dependency from guards.py (#119519)
I'm not sure if this is the ideal behavior / best fix for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119519
Approved by: https://github.com/albanD
2024-02-29 14:37:33 +00:00
5db5049b34 Move TRITON_CONSTRAINT setting to common binary_populate_env.sh, BE - Cleanup unused build scripts (#120744)
1. This moves TRITON_CONSTRAINT to the common binary_populate_env.sh so that it is set for all wheels.
Test in CI via the ``ciflow/binaries`` label. Please note we only set this constraint when PYTORCH_EXTRA_INSTALL_REQUIREMENTS is set, and that variable is set for all the wheels that get uploaded to PyPI. Hence the Triton constraint needs to be set in the same place.
This is done for regular wheels and ROCm wheels separately, since ROCm wheels use a different Triton package.

3. Cleanup legacy unused code
Test:
``
git grep setup_linux_system_environment.sh
``

Needs: https://github.com/pytorch/builder/pull/1712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120744
Approved by: https://github.com/huydhn
2024-02-29 14:25:34 +00:00
f988f649be [IntraNodeComm] accept P2P buffer size as constructor argument (#120856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120856
Approved by: https://github.com/wanchaol
ghstack dependencies: #120855
2024-02-29 11:43:52 +00:00
22b5548f5d [IntraNodeComm] refactor all_reduce variants as private methods (#120855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120855
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-02-29 11:43:52 +00:00
96793e0f10 [ROCm] enable scaled_gemm (#117822)
scaled_gemm for ROCm using hipblaslt.  As of ROCm 6.0, HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER is not supported.  A work-around is provided, performing the absmax operation on the output buffer, but this results in some loss of accuracy for the absmax result.  For this reason the feature should be considered beta/preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117822
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
2024-02-29 10:20:48 +00:00
09aefe1502 Fix ouput typos (#120870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120870
Approved by: https://github.com/clee2000
2024-02-29 08:29:14 +00:00
14c5ebc8a1 [Dynamo] Do not attempt to make nditer spawned arrays writable (#120868)
As they are not writable, converting `numpy.nditer`-spawned arrays to writable is too expensive, and the tensor values are copied anyway

Minimal reproducer:
```python
import numpy as np
import torch

@torch.compile
def f(x):
    return x + 1.0

for x in np.nditer(np.arange(3)):
    print(f(x))
```

Fixes https://github.com/pytorch/pytorch/issues/119787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120868
Approved by: https://github.com/jansel
2024-02-29 07:49:59 +00:00
169c220bf8 [torch.compile] Provide capability to register callback on compile start/stop (#120764)
This is a requirement from Meta internal cases, where people want to register a callback function to detect if a job is stuck during compilation.
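
A hedged usage sketch; the decorator names below (`torch._dynamo.on_compile_start` / `on_compile_end`) are assumed from this PR's feature description and may differ from the final API:

```python
import torch

@torch._dynamo.on_compile_start
def _on_start():
    print("dynamo compilation started")

@torch._dynamo.on_compile_end
def _on_end():
    print("dynamo compilation finished")

torch.compile(lambda x: x + 1)(torch.randn(4))
```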

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120764
Approved by: https://github.com/jansel
2024-02-29 07:37:52 +00:00
82cbd9b131 [dynamo][guards-cpp-refactor] PythonLambdaGuardAccessor (#120730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120730
Approved by: https://github.com/jansel
ghstack dependencies: #120864
2024-02-29 07:25:13 +00:00
66d05a8900 [dynamo] Fix source for default dict default_factory (#120864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120864
Approved by: https://github.com/yanboliang, https://github.com/Skylion007, https://github.com/jansel
2024-02-29 07:25:13 +00:00
df1e855313 [fake_impls] fix max_seqlen return values in efficient_attention_forward (#120842)
To match the actual implementation, we should return the max_seqlen_q/k, not M, N, when in the sparse case

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L981-L996)

Note that although the .cu file sets max_seqlen_k = 0 in the sparse case, it actually returns max_seqlen_k or N:

7e185277cd/aten/src/ATen/native/transformers/cuda/attention.cu (L1224-L1231)

Tests - added in the next PR (#102839, which also fixes other parts of the test_fake tests so that we can un-xfail them and actually run the tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120842
Approved by: https://github.com/YuqingJ
ghstack dependencies: #120682
2024-02-29 07:12:27 +00:00
eqy
d1d50d2e4c [Inductor][cuDNN] Disable tf32 in test_mutate_base_for_conv_output (#120867)
Looks like there is a sum? comparison where TF32 may not provide the necessary accuracy, leading to failures on sm86.

CC @Skylion007 , hopefully this unblocks #120642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120867
Approved by: https://github.com/Skylion007
2024-02-29 06:59:32 +00:00
cyy
8a42cff7b1 [DeviceIndex][7/N] Use DeviceIndex in XPU (#120576)
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120576
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2024-02-29 05:54:23 +00:00
4b18ab869f [torch.export] Support is_compiling() flag for non-strict mode (#119602)
Summary: In non-strict mode of torch.export() we didn't set those `is_compiling()` flags to `True`, which is needed by some models.
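
A minimal sketch of the behavior being fixed; whether the flag is queried via `torch.compiler.is_compiling()` or one of the internal `is_compiling()` helpers is an assumption here:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # Should report True while torch.export traces the model, even in
        # non-strict mode after this change.
        if torch.compiler.is_compiling():
            return x + 1
        return x - 1

ep = torch.export.export(M(), (torch.randn(3),), strict=False)
```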

Test Plan: Unit tests and manual testing.

Differential Revision: D53624452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
2024-02-29 05:52:51 +00:00
0a46102b37 Add equal_to_1 to triton_meta for user-written Triton kernels (#120579)
Summary: Previously, we omitted `equal_to_1` from the `triton_meta` part of the `@user_autotune` decorator. For user-written Triton kernels, this could lead to perf regressions, as the kernel in the Inductor codegen is compiled without `equal_to_1` specialization.
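
To make the scenario concrete, a hedged sketch (not the repro from the issue) of a user-written Triton kernel where an integer argument happens to be 1, which is exactly what Triton specializes via `equal_to_1`:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_strided(x_ptr, out_ptr, stride, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs * stride, mask=mask)
    tl.store(out_ptr + offs, x, mask=mask)

def copy(x):
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    # For a contiguous 1D tensor stride(0) == 1, so Triton normally specializes
    # this argument; the fix keeps that specialization when the kernel is
    # launched from Inductor-generated code.
    copy_strided[grid](x, out, x.stride(0), x.numel(), BLOCK=1024)
    return out
```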

Fixes #120478. The repro from the issue, on A100:

Before this PR:

```
Triton matmul:           0.0167 seconds
Triton matmul compiled:  0.0751 seconds
```

After this PR:

```
Triton matmul:           0.0168 seconds
Triton matmul compiled:  0.0072 seconds
```

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k  test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 3 tests in 3.545s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120579
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/chenyang78
2024-02-29 05:19:39 +00:00
4407138bf6 [inductor][eazy] fix a typo in test (#120832)
In theory we can test anything, but the test name mentions attention, so we should multiply by the inv_scale rather than divide by it. I guess that was the initial intention of the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120832
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-02-29 05:04:04 +00:00
2d17230212 [inductor] Do not reuse buffers across scopes in mem planning (#120777)
Summary: Previously, in `memory_plan_reuse` we assumed that the generated code is flat, in the sense that it can't have nested scopes. However, with nested control flow codegen-ing, this is no longer the case. This caused bugs where buffers were reused across the visibility boundaries of different nested scopes.

In this PR, we add nested planning states in `memory_plan_reuse` on entering and exiting a scope in the codegen. This restricts buffer reusability to the currently active (peak) scope / planning state.

Test Plan:

```
python test/inductor/test_control_flow.py -k test_subgraphs_with_parameters
...
----------------------------------------------------------------------
Ran 27 tests in 149.413s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120777
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #120665
2024-02-29 03:52:02 +00:00
f5b99976ad [C10D] Make _set_pg_timeout work with DeviceMesh PG (#120850)
Fixes #120847

Makes _set_pg_timeout work on nccl and/or gloo backends instead of working only on one backend (gloo) in cases where both backends exist for the group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120850
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-29 03:41:15 +00:00
26d6ddc232 [bug burndown]Fix #119784 (#120846)
Addresses https://github.com/pytorch/pytorch/issues/119784. Interestingly, the tests seem to just pass (yay!). Tested locally that the failing set of tests passes using `PYTORCH_TEST_WITH_DYNAMO=1 pytest functorch/test_vmap.py -v`

Will wait for CI to pass first before bugging people for reviews.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120846
Approved by: https://github.com/Skylion007
2024-02-29 03:30:40 +00:00
fad228c7cc Fix a potential race condition in the test decorators for enabling/disabling native funcol (#120833)
Previously, we parametrized some tests to run with both native and py funcol by flipping a global variable. However, some of these tests are multi-threaded tests, and the parametrization mechanism could lead to race conditions.

This PR changes the mechanism to use `mock.patch`, which is applied on a per-thread basis.
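
A minimal, self-contained sketch of the pattern; the flag holder and test body below are made up for illustration:

```python
from unittest import mock
import types

# Stand-in for the real module-level flag that used to be flipped globally.
funcol_config = types.SimpleNamespace(USE_NATIVE_FUNCOL=False)

def run_test_with_native_funcol(test_fn):
    # Scoped via mock.patch where the test runs, instead of mutating a shared
    # global that racing threads could observe mid-change.
    with mock.patch.object(funcol_config, "USE_NATIVE_FUNCOL", True):
        return test_fn()

print(run_test_with_native_funcol(lambda: funcol_config.USE_NATIVE_FUNCOL))  # True
print(funcol_config.USE_NATIVE_FUNCOL)  # False again outside the patch
```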

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120833
Approved by: https://github.com/wconstab
2024-02-29 03:19:44 +00:00
2c0c70f763 [Dynamo] enumerate imported names for eval_frame.py (#120778)
Fixes https://github.com/pytorch/pytorch/issues/120699 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120778
Approved by: https://github.com/Skylion007
2024-02-29 03:08:43 +00:00
ef9e89984c [pytorch] Support output types that are non tensors (#120804)
Summary:
per title
This is needed because some modules return None and non tensors as output

Test Plan: sandcastle?

Reviewed By: zhxchen17

Differential Revision: D54311609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120804
Approved by: https://github.com/zhxchen17
2024-02-29 02:49:10 +00:00
0dbef1618f [inductor] Apply fx passes recursively to nested subgraphs (#120665)
Summary: The current machinery of Inductor's `compile_fx` assumes that the incoming fx graph is flat. As a result, everything before `graph.run` is applied to the outermost graph. This assumption was valid before #119759, but now there is control flow bringing (arbitrarily deeply) nested fx subgraphs to `compile_fx`.

In this PR, we start extending the `compile_fx` machinery to deal with nested fx subgraphs. Namely, we recursively apply Inductor's `pre_grad`, `joint_graph`, and `post_grad` passes to the nested subgraphs in the incoming fx graph.

For the recursive application of the `pre_grad` passes (which require example inputs per subgraph), we don't pass example inputs for the nested subgraphs. A few different attempts to infer the latter via fake tensor prop have led to different side effects in the model. Therefore, to the nested subgraphs, we only apply a subset of `pre_grad` passes that doesn't require example inputs.
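
For context, a hedged sketch of the kind of user code that produces nested fx subgraphs (using `torch.cond`; the exact spelling of the control-flow op is an assumption here):

```python
import torch

def inner(x):
    return torch.cond(x.sum() > 0, lambda x: x + 1, lambda x: x - 1, (x,))

def outer(x):
    # Nested control flow: the true branch itself contains a cond, so Inductor
    # sees subgraphs within subgraphs and must apply its fx passes recursively.
    return torch.cond(x.mean() > 0, inner, lambda x: x * 2, (x,))

compiled = torch.compile(outer)
print(compiled(torch.randn(8)))
```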

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 26 tests in 59.252s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120665
Approved by: https://github.com/eellison
2024-02-29 02:34:54 +00:00
db1cc781db Revert "[dynamo] Function => FunctionCtx for placeholder obj (#120577)"
This reverts commit ee01d0807b924874a329be78c6ee880f556645db.

Reverted https://github.com/pytorch/pytorch/pull/120577 on behalf of https://github.com/jansel due to Causing breakages internally ([comment](https://github.com/pytorch/pytorch/pull/120577#issuecomment-1970254363))
2024-02-29 01:56:09 +00:00
b2e4b621cc Reduce create_env log level to DEBUG (#120772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120772
Approved by: https://github.com/albanD
2024-02-29 01:33:16 +00:00
9e0631cc8a get CommsDebugMode to work with DTensor (#118769)
Tested with Wanchao's repro:
```
from typing import Tuple, List, Dict, cast
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, DTensor, Shard, Placement, Replicate

mesh = init_device_mesh(device_type="cuda", mesh_shape=(2,))
x = torch.randn(4, 8, requires_grad=True)
y = torch.randn(4, 32, requires_grad=True)
x_dtensor = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
y_dtensor = DTensor.from_local(y, mesh, [Shard(0)], run_check=False)
from torch.distributed._tensor.debug import CommDebugMode
comm_mode = CommDebugMode()
with comm_mode:
    z = torch.mm(x_dtensor, y_dtensor)
print(comm_mode.get_comm_counts())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118769
Approved by: https://github.com/wanchaol
2024-02-29 01:11:05 +00:00
381a7ad3f1 [C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745
Approved by: https://github.com/zdevito
ghstack dependencies: #120724, #120270
2024-02-29 01:03:31 +00:00
f85d3a022c [C10D] Fix pointToPoint op Flight Recording (#120270)
Fix and test issues with both coalesced and individual send/recv ops

Considered an alternate approach and then ditched it
 - alternate approach: #119757
 - reason ditched: prefer recording individual collective events inside
   coalescing region instead of just the event at the end of the region,
   which also would not have tensor sizes or opnames without additional
   state variables added

Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
  recording when recording in workEnqueue.  Adding the info onto the
  Work obj would be possible, but adds to overhead of copying Works
  which we do on every collective. We can get info off the input/output
  tensors directly in initWork, but we don't want to keep refs to those
  tensors alive while the work is Enqueued, so we'd have to specifically
  copy size lists or something.

This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
initialized and enqueued.  But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.

It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().

The testing uncovers some odd pre-existing behavior and leaves them
alone for now. We could change some of these
- seq starts off at 1, not 0 for the first op (but this is inconsistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
2024-02-29 01:03:31 +00:00
7f4d673885 [C10D] Add record_id to flight recorder (#120724)
In cases where sequence number is shared between events (e.g. coalesced
collectives) we want to ensure a unique (and ordered) ID per record.

Note: the records are already in a list, so their ID could be implicitly
observed.  But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
2024-02-29 01:03:31 +00:00
950b484356 skip three pyhpc models with dynamic shape test (#120599)
As reported in https://github.com/pytorch/pytorch/issues/119434, `pyhpc_isoneutral_mixing`, `pyhpc_equation_of_state` and `pyhpc_turbulent_kinetic_energy` failed with dynamic shape testing; we propose to skip the dynamic batch size testing of these 3 models in this PR.

* Error msg is
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/benchmarks/dynamo/common.py", line 3879, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 1048576
```

* Root Cause is
  *  The benchmark code will only annotate an input's dim as dynamic when its size equals the batch size c617e7b407/benchmarks/dynamo/common.py (L3867-L3871). If it fails to find any dim equal to the batch size, the above error is thrown.
  * However, for these 3 models, none of the inputs' dims will equal the input batch size, given the [relationship of dim sizes](26b85eadde/torchbenchmark/models/pyhpc_equation_of_state/__init__.py (L12-L16))
  ```
    shape = (
        math.ceil(2 * size ** (1/3)),
        math.ceil(2 * size ** (1/3)),
        math.ceil(0.25 * size ** (1/3)),
    )
  ```
  * Another thing is that `pyhpc_isoneutral_mixing` and `pyhpc_equation_of_state` can pass the dynamic batch size accuracy testing, because the batch size has been set to 4 in accuracy testing (c617e7b407/benchmarks/dynamo/common.py (L3456)) and `math.ceil(2 * size ** (1/3))` happens to equal 4.

* Since the input dim sizes have the above relationship, running these models with dynamic shapes would require annotating `dim[0](s0) = dim[2](s1) * 8`; per the discussion in https://github.com/pytorch/pytorch/issues/117477#issuecomment-1897108756 @avikchaudhuri, it looks like this is not currently expressible. So, I think we need to skip the dynamic batch size testing for these 3 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120599
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-02-29 00:38:06 +00:00
3179107629 [DDP][PT2D] Ignore gradient sync if the gradient is not defined (#120419)
From the test, accum_grad_hook can still be fired even if the gradient is None. We need to ignore the gradient sync for this case.

Differential Revision: [D54076485](https://our.internmc.facebook.com/intern/diff/D54076485/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120419
Approved by: https://github.com/yf225, https://github.com/XilunWu
2024-02-29 00:27:54 +00:00
ab38354887 Allow str inputs in non-strict tracing (#120536)
Previously, torch.export in non-strict mode was failing on str inputs while creating fake inputs for tracing (fakify()), and using graph nodes to create constraints. This fixes those 2 stages to allow strs to pass.

Failing test case:
```
class Foo(torch.nn.Module):
            def forward(self, a, b, mode):
                return torch.div(a, b, rounding_mode=mode)

        foo = Foo()
        inps = (torch.randn(4, 4), torch.randn(4), "trunc")
        exported = export(foo, inps)
        with self.assertRaisesRegex(
            RuntimeError, "to be equal to trunc, but got floor"
        ):
            _ = exported.module()(torch.randn(4, 4), torch.randn(4), "floor")
        self.assertTrue(torch.allclose(exported.module()(*inps), foo(*inps)))
```

Before:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
E
======================================================================
ERROR: test_runtime_assert_for_prm_str_non_strict (__main__.NonStrictExportTestExport.test_runtime_assert_for_prm_str_non_strict)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/pianpwk/Documents/pytorch/torch/testing/_internal/common_utils.py", line 2744, in wrapper
    method(*args, **kwargs)
  File "/Users/pianpwk/Documents/pytorch/test/export/testing.py", line 40, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export.py", line 1588, in test_runtime_assert_for_prm_str
    exported = export(foo, inps)
               ^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/test/export/test_export_nonstrict.py", line 16, in mocked_non_strict_export
    return export(*args, **kwargs, strict=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/__init__.py", line 186, in export
    return _export(
           ^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 541, in wrapper
    raise e
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 527, in wrapper
    ep = fn(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/exported_program.py", line 83, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/export/_trace.py", line 707, in _export
    ) = make_fake_inputs(f, args, kwargs, constraints)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 133, in make_fake_inputs
    fake_args, fake_kwargs = tree_map_with_path(
                             ^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in tree_map_with_path
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 734, in unflatten
    leaves = list(leaves)
             ^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/utils/_pytree.py", line 1519, in <genexpr>
    return treespec.unflatten(func(*xs) for xs in zip(*all_keypath_leaves))
                              ^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 134, in <lambda>
    lambda kp, val: fakify(fake_mode, kp, val, t_constraints, sources),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pianpwk/Documents/pytorch/torch/_export/non_strict_utils.py", line 68, in fakify
    raise ValueError("Only tensors allowed as input")
ValueError: Only tensors allowed as input

To execute this test, run the following from the base repo dir:
     python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str_non_strict

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.008s

FAILED (errors=1)
```

After:
```
(pytorch-local) pianpwk@pianpwk-mbp pytorch % python test/export/test_export_nonstrict.py -k test_runtime_assert_for_prm_str
.
----------------------------------------------------------------------
Ran 1 test in 0.237s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120536
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/gmagogsfm
2024-02-28 23:56:30 +00:00
1b8bb027f6 Fix guard for SUPPORTED_NODES (#120798)
The special-case code for handling SUPPORTED_NODES was producing a guard that looked like:
```
"G['torch'].utils._pytree.SUPPORTED_NODES[<class '__main__.CausalLMOutputWithPast'>].type"
```
resulting in an eval error when trying to evaluate the guard.

This change adds a new source type (`ClassSource`) which is given a class type (in this case `CausalLMOutputWithPast`) and attempts to fetch it from its defining module.  It then uses that to build the `SUPPORTED_NODES` guards instead of referring to the type directly.
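
For illustration, a hedged sketch of how a user-defined class ends up in `SUPPORTED_NODES` in the first place (pytree registration), which is what the generated guard refers to; the class here is a stand-in for the one from the repro:

```python
import torch
import torch.utils._pytree as pytree
from dataclasses import dataclass

@dataclass
class CausalLMOutputWithPast:  # stand-in for the class from the issue
    logits: torch.Tensor

pytree.register_pytree_node(
    CausalLMOutputWithPast,
    lambda out: ((out.logits,), None),                 # flatten: children, context
    lambda children, _: CausalLMOutputWithPast(*children),  # unflatten
)

@torch.compile
def f(out: CausalLMOutputWithPast):
    return out.logits + 1

print(f(CausalLMOutputWithPast(torch.randn(3))))
```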

Also added a unit test which fails before this change and passes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120798
Approved by: https://github.com/anijain2305
2024-02-28 23:34:17 +00:00
aa36821615 [Memory Snapshot] Stop clearing history when changing context (#120436)
Summary:
This change will avoid clearing the memory event history, when changing the context from `record_memory_history(context=None)` to `record_memory_history(context="python")`.

Now it will continue recording memory events with changing context on the fly. Only `record_memory_history(enabled=None)` will clear the history.
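
A hedged sketch of the workflow this enables, using the underscore-prefixed CUDA memory APIs (the wrapper names in the test plan below are internal helpers):

```python
import torch

# Start recording allocation events without call-stack context.
torch.cuda.memory._record_memory_history(context=None, stacks="python")
# ... run a few warm-up iterations ...

# Switch context on the fly; previously this cleared the history, now the
# earlier events are kept and later ones gain python call stacks.
torch.cuda.memory._record_memory_history(context="all", stacks="python")
# ... run more iterations ...

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```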

Test Plan:
# Ran on the following local Resnet50 example:

- At iteration=0, record_memory_history(context=None, stacks="python")
- At iteration=3, record_memory_history(context="all", stacks="python")
- After iteration=4, export_memory_snapshot()

## Before:
 - Only collects the last 2 iterations with python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/86154532-9f73-4d10-9194-19e8c96ee4f3)

## After:
 - Collects all 5 iterations, where first 3 iterations have no call stacks, and last 2 iterations have python call stacks.
![image](https://github.com/pytorch/pytorch/assets/17602366/c2c277d6-b400-4da2-85c8-a7f119d409f8)
![image](https://github.com/pytorch/pytorch/assets/17602366/dc9da2f8-41cc-44b0-9c32-ec3cbe79d2c4)

Differential Revision: D54084017

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120436
Approved by: https://github.com/zdevito, https://github.com/leitian
2024-02-28 22:46:26 +00:00
86ff31c4a0 Revert "Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)"
This reverts commit cabc09a5f259f1cc1e3bad1d80b5e5274838bced.

Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))
2024-02-28 22:30:18 +00:00
dbe0967a0a Revert "Add test to check that COW inputs are not materialized (#119507)"
This reverts commit 2ebf2c88baa4667d55eda92f4c8424db505af781.

Reverted https://github.com/pytorch/pytorch/pull/119507 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/119507#issuecomment-1970022840))
2024-02-28 22:26:59 +00:00
7e185277cd [cuDNN] bump cuDNN-frontend submodule to 1.1.2 (#120761)
Hopefully resolves additional `CUDNN_STATUS_SUCCESS` failures that we have been seeing on H100 (though curiously not on upstream CI, perhaps due to the different hardware being tested)

Need to confirm the fix on our end before merging

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120761
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2024-02-28 22:15:43 +00:00
9c9bde515c Factor out Submod compilers (#120527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120527
Approved by: https://github.com/kadeng
2024-02-28 22:11:47 +00:00
5b5bcf0470 Test that tlparse understands the structured logs we output (#120658)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120658
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #120712, #120289
2024-02-28 21:58:39 +00:00
d6c202975c Move attention kernels from meta_registrations to fake_impls (#120682)
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.

This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors

Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
2024-02-28 21:49:13 +00:00
50073248ed add a note wrt torch.nn.functional.scaled_dot_product_attention (#120668)
Follow-up change to https://github.com/pytorch/pytorch/pull/120565

- Added a note in the transformer class pointing out that the mask definition is opposite to that of :attr:`attn_mask` in torch.nn.functional.scaled_dot_product_attention.
@mikaylagawarecki
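
A small sketch of the convention difference the note documents (boolean masks only): for scaled_dot_product_attention, True means "may attend", while nn.Transformer-style masks use True to mean "masked out".

```python
import torch
import torch.nn.functional as F

L = 4
keep = torch.ones(L, L, dtype=torch.bool).tril()   # SDPA convention: True = attend
transformer_mask = ~keep                            # nn.Transformer convention: True = masked out

q = k = v = torch.randn(1, 1, L, 8)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=keep)
```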

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120668
Approved by: https://github.com/mikaylagawarecki
2024-02-28 21:16:34 +00:00
e2ee87d48b Fix segfault on mac when running vulkan tests (#120337)
Summary: Vulkan gtests were segfaulting on mac because the memory for barriers can get destroyed after the local function (CommandBuffer::insert_barrier) where it is created exits. Since we provide this barrier pointer to the Vulkan library, it needs to stay alive even after the function exits, else we get crashes.

Test Plan:
See that there is no segfault on mac with the fix and the tests can run:

Compile gtests:
buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Crash w/o diff
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (88 ms)
[ RUN      ] VulkanAPITest.copy_to_buffer
Segmentation fault: 11

With diff there is no crash:
bash-3.2$ buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 85 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 85 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform_buffer_copy
[       OK ] VulkanAPITest.uniform_buffer_copy (296 ms)
.....
[  FAILED  ] VulkanAPITest.gelu_quint8_self (23 ms)
[----------] 85 tests from VulkanAPITest (1494 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1494 ms total)
[  PASSED  ] 72 tests.
[  FAILED  ] 13 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large
[  FAILED  ] VulkanAPITest.gelu_qint8
[  FAILED  ] VulkanAPITest.gelu_qint8_self
[  FAILED  ] VulkanAPITest.gelu_quint8
[  FAILED  ] VulkanAPITest.gelu_quint8_self

The above failing tests were failing before as well and are being worked on.

Differential Revision: D54023146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120337
Approved by: https://github.com/SS-JIA
2024-02-28 20:55:47 +00:00
e317e39a02 Fix nonlinearity arg issue in RNN (#120234)
Fixes #114617

This PR fixes the issue with `nonlinearity`, so that it can be passed as an arg or a kwarg.
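
A quick illustration of what the fix allows (both spellings construct the same module):

```python
import torch

# nonlinearity is the 4th positional parameter of torch.nn.RNN
rnn_pos = torch.nn.RNN(10, 20, 2, "relu")
rnn_kw = torch.nn.RNN(10, 20, 2, nonlinearity="relu")
```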

Alternatively, if making `nonlinearity` kwarg-only is preferred, I can revert to another commit. cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120234
Approved by: https://github.com/mikaylagawarecki
2024-02-28 20:53:18 +00:00
8b22fe9594 [FX passes] Set group/batch fusion log to DEBUG level (#120780)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120780
Approved by: https://github.com/jackiexu1992
2024-02-28 20:48:11 +00:00
4903e33e19 Revert "Capture non tensor arguments in record_function (#120017)"
This reverts commit 5c5b71b6eebae76d744261715231093e62f0d090.

Reverted https://github.com/pytorch/pytorch/pull/120017 on behalf of https://github.com/soulitzer due to regresses perf on autograd Function when using profiler ([comment](https://github.com/pytorch/pytorch/pull/120017#issuecomment-1969883792))
2024-02-28 20:43:33 +00:00
01ec8df6d8 [Compiled Autograd] Introduce BackwardState capture (#120382)
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)

We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
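
A hedged sketch of the pattern that becomes supported: an interior, dynamically created hook registered inside a compiled region. The backend choice and the need to enable compiled autograd are assumptions here.

```python
import torch

torch._dynamo.config.compiled_autograd = True  # assumed required for capture

@torch.compile(backend="aot_eager")
def f(x, scale):
    y = x.sin()
    # Interior to the graph and dynamically generated (a lambda closing over scale).
    y.register_hook(lambda g: g * scale)
    return y.cos()

x = torch.randn(4, requires_grad=True)
f(x, 2.0).sum().backward()
```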

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
2024-02-28 20:36:47 +00:00
c016ffed5b [C10D] Fix logic for default group=None in _set_pg_timeout (#120686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120686
Approved by: https://github.com/yifuwang
2024-02-28 20:31:14 +00:00
11de40f82f [flight recorder] record process group configuration (#120262)
Summary: Record process group configuration (i.e. ranks involved in a process group) to facilitate NCCL related debugging.

Differential Revision: D53792087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120262
Approved by: https://github.com/shuqiangzhang
2024-02-28 20:31:08 +00:00
5aa7f8646f [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120742)
Relanding https://github.com/pytorch/pytorch/pull/120639 + a fix to drop `matrix_instr_nonkdim` that does not align with `BLOCK_M` or `BLOCK_N`

Matrix multiplication with Triton is usually done in a tiled way. For a large tile size that a single hardware instruction cannot handle, e.g. a 128x128 matmul tile, it has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, matrix_instr_nonkdim controls the shape of the mma instructions, and its default value is 0 in Triton. This means by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. There are other mma instructions available, such as 16x16x16, which requires matrix_instr_nonkdim=16. This change enables tuning the value for Gemm, which seems to improve its performance by 20%-2x.

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120742
Approved by: https://github.com/xw285cornell
2024-02-28 20:27:14 +00:00
b020ee5b05 [PyTorch Use MaybeOwned when promoting indices/offsets in embedding_bag (#120755)
We're currently doing two unnecessary reference count
operations in the case where promotion doesn't need to happen.

Differential Revision: [D54285999](https://our.internmc.facebook.com/intern/diff/D54285999/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120755
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #120752
2024-02-28 20:13:30 +00:00
98d1529474 [PyTorch] fix mixed int32/int64 indices/offsets for embedding_bag_out (#120752)
This was an oversight in D27482738 (#55189) -- it only patched the regular embedding_bag operator, but static runtime uses the out variant.

Differential Revision: [D54285460](https://our.internmc.facebook.com/intern/diff/D54285460/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120752
Approved by: https://github.com/houseroad
2024-02-28 20:13:30 +00:00
db92558229 [codemod][lowrisk] Fix deprecated use of 0/NULL (#120740)
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D54163060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120740
Approved by: https://github.com/Skylion007
2024-02-28 20:13:13 +00:00
491c2b4665 Let torch dynamo inline torch.func.grad (#118407)
When dynamo sees torch.func.grad, it tries to inline all frames related to it.
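
A minimal illustration (fullgraph=True just to assert that dynamo traces through the grad call rather than graph-breaking):

```python
import torch

def f(x):
    return (x ** 2).sum()

@torch.compile(fullgraph=True)
def grad_f(x):
    return torch.func.grad(f)(x)

print(grad_f(torch.randn(3)))  # expected: 2 * x
```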

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118407
Approved by: https://github.com/zou3519
2024-02-28 20:05:00 +00:00
5472923998 derived dim (#118729)
With the current `Dim`-based dynamic shapes API for export, one can express that shapes of different input shapes must be equal by reusing the same `Dim`. However, non-trivial relationships between such input shapes cannot be expressed.

Recently we are seeing more and more examples of code that require this additional expressibility, e.g., where a pair of shapes might differ by one, or a shape might be double another (or simply even).

This PR introduces the concept of a "derived" `Dim`, i.e., a linear arithmetic expression over a `Dim`. By using a combination of `Dim`s and derived `Dim`s to specify input shapes, the desired relationships can be expressed naturally. E.g., a pair of shapes might be `dim` and `dim + 1`, or `dim` and `2*dim`, or even `2*dim` and `dim + 1`.
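
A minimal sketch (illustrative model, not from the PR) of expressing such a relationship with a derived Dim:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        # Only valid when y.shape[0] == 2 * x.shape[0]
        return torch.cat([x, x], dim=0) + y

dim = Dim("dim")
ep = export(
    M(),
    (torch.randn(3, 4), torch.randn(6, 4)),
    dynamic_shapes={"x": {0: dim}, "y": {0: 2 * dim}},
)
```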

We extend the current infrastructure that translates `Dim`s to deprecated `dynamic_dim`-based constraints to work with derived `Dim`s. As usual, we raise constraint violation errors when shape guards cannot be verified given a dynamic shapes spec; suggest fixes; and raise runtime errors when future inputs violate the spec.

Importantly, some guards that used to cause forced specializations in the constraint solver because they were deemed "too complex" now do not do so, because they can now be specified as constraints. Since this was what motivated the introduction of a `disable_constraint_solver` flag to some internal APIs, we may not need that flag any more.

Note that shapes of placeholders in exported programs can now contain symbolic expressions and not just symbols.

Differential Revision: D53254587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118729
Approved by: https://github.com/ezyang
2024-02-28 19:48:32 +00:00
9c55aa6ff6 TransformerEncoder/Decoder: add type hints (#120550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120550
Approved by: https://github.com/mikaylagawarecki
2024-02-28 19:36:08 +00:00
4b7a521856 Update flash_attention kernel from 2.3.6 to 2.5.5 (#118935)
# Summary
Updates FlashAttention kernel code from tag [2.3.6](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.3.6) to [2.5.5](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.5).

The usual changes were then re-rolled on top of the modified kernel, changing how dropout is saved for backward and removing the head_dim_pad, since that would make the kernel mutate in place, which has a bad interaction with functionalization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118935
Approved by: https://github.com/cpuhrsch
2024-02-28 19:31:15 +00:00
a9d9077f12 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit 7c556428c74a79c6d9c272826344a0828d3f66f5.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54286923 ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1969634480))
2024-02-28 18:57:09 +00:00
1c67f6cb26 fix decomposition of aten.diag_embed (#120549)
Fixes #117019
Make inputs where one dim is negative and the other is nonnegative be solved correctly in the decomposition of `aten.diag_embed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120549
Approved by: https://github.com/Dalian991, https://github.com/janeyx99
2024-02-28 18:48:01 +00:00
f422467ccb [BE]Delay the call to set_pytorch_distributed_envs_from_justknobs (#120625)
When the default process group is initialized twice, `init_process_group` will show an explicit message indicating that.

However, with `set_pytorch_distributed_envs_from_justknobs` being the very first line in `init_process_group`, the error message becomes implicit and it is hard to understand the root cause when testing with the FB code base.

Differential Revision: [D54206202](https://our.internmc.facebook.com/intern/diff/D54206202/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120625
Approved by: https://github.com/wconstab, https://github.com/yifuwang
2024-02-28 18:34:45 +00:00
91190d8087 [quant][pt2e] Relax model_is_exported input (#120720)
Summary: This commit relaxes the `model_is_exported` API to work for `torch.nn.Module`s in addition to just `torch.fx.GraphModule`s, simplifying downstream uses.
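
A hedged usage sketch; the exact import path of `model_is_exported` is an assumption here and may differ:

```python
import torch
# Import path assumed; adjust to wherever model_is_exported actually lives.
from torch.ao.quantization.pt2e.export_utils import model_is_exported

m = torch.nn.Linear(4, 4)
print(model_is_exported(m))  # False: a plain nn.Module is now accepted instead of erroring

gm = torch.export.export(m, (torch.randn(2, 4),)).module()
print(model_is_exported(gm))  # expected True for an exported module
```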

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D54263935](https://our.internmc.facebook.com/intern/diff/D54263935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120720
Approved by: https://github.com/tugsbayasgalan
2024-02-28 18:32:03 +00:00
f67c77c497 Update engine.cpp (#120773)
Minor comment fix; `backward` and `grad` are flipped here. See https://pytorch.org/docs/stable/_modules/torch/autograd.html#backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120773
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/soulitzer
2024-02-28 18:23:35 +00:00
0ab2ec3738 [XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185)
This pull request provides an update on the recent advancements made in the PyTorch profiler with regard to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.

# Motivation

The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.

# Principles

1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.

# Solutions

### a. Pathway Identification:

Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.

### b. `use_device` Logic Revision:

With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.

### c. Kernel List Segregation:

To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.

### d. Formatted Output:

To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.
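
Putting the pieces above together, a minimal usage sketch might look like the following (assuming a PyTorch build with XPU support and an available XPU device; the model and tensor names are illustrative):

```
import torch

model = torch.nn.Linear(128, 128).to("xpu")
x = torch.randn(32, 128, device="xpu")

# The new use_xpu flag selects the XPU pathway, mirroring use_cuda.
with torch.autograd.profiler.profile(use_xpu=True) as prof:
    model(x)

# Same consumer-facing outputs as the CUDA pathway: tables and Chrome traces.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
prof.export_chrome_trace("xpu_trace.json")
```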

# Additional Enhancements

### a. Enumerations in `.pyi` Files:

Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.

### b. Correct DeviceType Returns:

Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.

### c. Bug Fixes in `cuda_corr_map`:

Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.

# Further Abstraction

Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.

# Next Pull Request

The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.

We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-02-28 17:50:32 +00:00
3e8b56d362 [Inductor] Track constant's original_fqn mapping (#120524)
When compiling a deserialized ExportedProgram, the constant's original_fqn is not populated. The highlighted line below is missing, and a later assertion breaks because original_fqn is absent.

```
        constants_info_[0].name = "L__self___w_pre";
	constants_info_[0].dtype = static_cast<int32_t>(cached_torch_dtype_float32);
	constants_info_[0].offset = 0;
	constants_info_[0].data_size = 64;
	constants_info_[0].from_folded = false;
	constants_info_[0].shape = {4, 4};
	constants_info_[0].stride = {4, 1};
	// constants_info_[0].original_fqn = "w_pre";   // this line is missing
```

Inductor relies on `dynamo_flat_name_to_original_fqn` to populate the original_fqn field. This field originates from `graph_module.meta["dynamo_flat_name_to_original_fqn"]` and is set during dynamo tracing. However, when compiling
a deserialized ExportedProgram, we don't do dynamo tracing, so this field is missing.

As a fix, I maintain AOTI's own mapping for constant tensors' FQNs.

Differential Revision: D54097073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120524
Approved by: https://github.com/chenyang78
2024-02-28 17:36:29 +00:00
702e82da28 [cuDNN][Flash Attention] Minor cleanup for cuDNN SDPA (#120750)
Cleaning up before hopefully starting work on backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120750
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-02-28 17:32:07 +00:00
364faafe75 [DCP] Asserts CPU backend for async_save (#120241)
If a CPU device is not present, collectives will hang in the threaded case due to: https://github.com/pytorch/pytorch/issues/115861

This PR asserts that a CPU device is enabled in the process group backend.

Differential Revision: [D53952864](https://our.internmc.facebook.com/intern/diff/D53952864/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120241
Approved by: https://github.com/fegin
2024-02-28 17:21:30 +00:00
c8a34a4013 [ez] Smaller weight for some TD heuristics (#120736)
Normalize to a different number for the fuzzier heuristics.

Could this be done as a weighting elsewhere? Yes, but I'm putting it here since I'm not sure which object would hold it best.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120736
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-02-28 17:07:45 +00:00
dfe7b9d471 Move user defined triton tests to inductor test folder (#120738)
Summary: FBCode CI does not compile torch with CUDA for tests in the dynamo folder. Instead of adding a special rule, let's move these tests to the inductor folder.

Test Plan:
```
buck run mode/opt //caffe2/test/inductor/:triton_kernels
```
now works instead of skipping tests

Differential Revision: D54280629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120738
Approved by: https://github.com/aakhundov
2024-02-28 17:03:41 +00:00
df40847486 Add xpu header to include/ATen/xpu (#120786)
# Motivation
Add XPU header files to `include/ATen/xpu` to make them public.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120786
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 16:22:14 +00:00
7881b95c73 Don't suppress error codes in lint job, properly activate conda (#120769)
Before:

```
2024-02-28T02:38:24.3757573Z + conda activate /opt/conda/envs/py_3.9
2024-02-28T02:38:24.3757872Z
2024-02-28T02:38:24.3758116Z CondaError: Run 'conda init' before 'conda activate'
```

Now, this would actually fail the job, and I also fix the bug.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120769
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/malfet
2024-02-28 15:17:31 +00:00
facfc0baaf Update _constrain_as_size docs (#120728)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120728
Approved by: https://github.com/Skylion007
2024-02-28 15:03:10 +00:00
82099ab87b [easy] Reword unexpected success error messages and generated github issues now that we have sentinel files (#120766)
It's a bit annoying to have to read through the test name in verbose mode just to see what the test's sentinel file is actually called when encountering an unexpected success. Now that we have sentinel files, we can directly list the file path from root in the error message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120766
Approved by: https://github.com/Skylion007
2024-02-28 11:15:29 +00:00
46e3f670b4 refactor code to share across different devices (#120602)
# Motivation
Refactor utils code to make it possible to share across CUDA, XPU, and other backends.

# Solution
Move `_dummy_type` and `_LazySeedTracker` to torch._utils;

# Additional Context
When upstreaming, these code changes are isolated into an additional PR to minimize their impact on the CUDA code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120602
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang
2024-02-28 09:42:58 +00:00
a11a49af58 Add NCCL work sequence number to work info (#120596)
Summary: Expose the sequence number in the work info. The number can help applications identify an NCCL work more precisely.

Test Plan:
1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq
2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest

Differential Revision: D54180050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596
Approved by: https://github.com/kwen2501
2024-02-28 07:54:37 +00:00
be31e522ce [PT2][Inductor] Fix "example_value" absent for stack nodes (#120655)
Summary:
We observed that stack nodes are missing "example_value" in DPA+FIRST, which blocks further split-cat optimization. Full error log: P1187633689.

pre grad graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPUFOBWniTeB6s8DAN8z9sHTadpxbr0LAAAz

We found that it was introduced by the new stack nodes in the group batch fusion, thus we fix the bug to enable further split cat optimization.

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
before fix: P1187633689
```
W0221 13:32:09.334000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_19
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_6
W0221 13:32:09.335000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_5
W0221 13:32:09.336000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_4
W0221 13:32:09.517000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_18
W0221 13:32:09.518000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_17
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_15
W0221 13:32:09.521000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_14
W0221 13:32:09.522000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_16
W0221 13:32:09.524000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_12
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_11
W0221 13:32:09.525000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_13
W0221 13:32:09.527000 139773455527936 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_9
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_8
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_10
W0221 13:32:09.528000 139773455527936 torch/_inductor/fx_passes/split_cat.py:274] [0/0_1] example value absent for node: stack_7
```

after fix:
P1189491364
```
W0226 13:19:56.542000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: sigmoid_16
W0226 13:19:56.543000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_16
W0226 13:19:56.703000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_20
W0226 13:19:56.707000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_19
W0226 13:19:56.711000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_18
W0226 13:19:56.713000 139770599518208 torch/_inductor/fx_passes/split_cat.py:186] [0/0_1] example value absent for node: add_17
```

Differential Revision: D54140488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120655
Approved by: https://github.com/jackiexu1992
2024-02-28 05:35:36 +00:00
12995a5d9d [2/2] Intel GPU Runtime Upstreaming for Generator (#118613)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers the following generator-related APIs (a minimal usage sketch follows the list):

- `torch.xpu.default_generators`
- `torch.xpu.get_rng_state`
- `torch.xpu.get_rng_state_all`
- `torch.xpu.initial_seed`
- `torch.xpu.manual_seed`
- `torch.xpu.manual_seed_all`
- `torch.xpu.seed`
- `torch.xpu.seed_all`
- `torch.xpu.set_rng_state`
- `torch.xpu.set_rng_state_all`
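
A minimal usage sketch of these APIs (assuming a PyTorch build with XPU support and at least one XPU device):

```
import torch

torch.xpu.manual_seed(42)           # seed the current XPU device
torch.xpu.manual_seed_all(42)       # seed all XPU devices

state = torch.xpu.get_rng_state()   # snapshot the current device's RNG state
# ... run some random ops on "xpu" ...
torch.xpu.set_rng_state(state)      # restore it for reproducibility

print(torch.xpu.initial_seed())     # the seed used to initialize the generator
```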

# Additional Context
The differences with CUDA:
The generator-related frontend Python APIs map 1:1 onto CUDA's.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD
2024-02-28 05:28:11 +00:00
8ba4cb451f Fix an import loop (#119820)
Summary:
We ran into the following import loop when testing aps:

```
Traceback (most recent call last):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
    code = _serve_one(child_r, fds,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 234, in prepare
    _fixup_main_from_name(data['init_main_from_name'])
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/multiprocessing/spawn.py", line 258, in _fixup_main_from_name
    main_content = runpy.run_module(mod_name,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 224, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/runtime/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/icvr/icvr_launcher.py", line 29, in <module>
    class ICVRConfig(AdsComboLauncherConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/ads_launcher.py", line 249, in <module>
    class AdsComboLauncherConfig(AdsConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/aps_models/ads/common/app_config.py", line 16, in <module>
    class AdsConfig(RecTrainAppConfig):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 47, in <module>
    class EmbeddingKernelConfig:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/apf/rec/config_def.py", line 52, in EmbeddingKernelConfig
    cache_algorithm: CacheAlgorithm = CacheAlgorithm.LRU
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 501, in <module>
    class ParameterSharding:
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torchrec/distributed/types.py", line 527, in ParameterSharding
    sharding_spec: Optional[ShardingSpec] = None
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 48, in <module>
    class ShardingSpec(ABC):
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py", line 55, in ShardingSpec
    tensor_properties: sharded_tensor_meta.TensorProperties,
  File "/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharded_tensor/__init__.py", line 21, in <module>
    def empty(sharding_spec: shard_spec.ShardingSpec,
ImportError: cannot import name 'ShardingSpec' from partially initialized module 'torch.distributed._shard.sharding_spec.api' (most likely due to a circular import) (/mnt/xarfuse/uid-26572/e04e8e0a-seed-nspid4026534049_cgpid5889271-ns-4026534028/torch/distributed/_shard/sharding_spec/api.py)
```

Using future annotations (`from __future__ import annotations`) to mitigate.
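
A minimal sketch of the pattern, combined here with a TYPE_CHECKING guard as a common companion (module and class names are illustrative, not the actual files involved):

```
# shard_types.py (illustrative)
from __future__ import annotations  # annotations are no longer evaluated at class-definition time

from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only imported for static type checkers, so no runtime import cycle.
    from sharding_spec_api import ShardingSpec

@dataclass
class ParameterSharding:
    # Without future annotations, evaluating this annotation would require
    # importing sharding_spec_api while it is still partially initialized.
    sharding_spec: Optional[ShardingSpec] = None
```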

Test Plan:
```
hg update 1b1b3154616b70fd3325c467db1f7e0f70182a74
CUDA_VISIBLE_DEVICES=1,2 buck2 run @//mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_rep
```

Differential Revision: D53685582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119820
Approved by: https://github.com/fegin
2024-02-28 05:09:16 +00:00
e9a961f66a [dynamo][refactor] Use originating_source for HASATTR (#120723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120723
Approved by: https://github.com/jansel
ghstack dependencies: #120520, #120590, #120721
2024-02-28 05:00:59 +00:00
a774baa501 [audio hash update] update the pinned audio hash (#120748)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120748
Approved by: https://github.com/pytorchbot
2024-02-28 04:47:38 +00:00
184e815c74 Add TORCH_LOGS_FORMAT=short alias (#120757)
Shorthand for `"%(levelname)s:%(name)s:%(message)s"` which is hard to
remember.

I find the default formatter annoying since just the metadata fills up
most of the width of my terminal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120757
Approved by: https://github.com/ezyang
2024-02-28 04:40:48 +00:00
bd5f290505 [vision hash update] update the pinned vision hash (#120749)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120749
Approved by: https://github.com/pytorchbot
2024-02-28 04:36:16 +00:00
bfa71b523d add complex32 to v3_dtypes (#120388)
Fixes [#120290](https://github.com/pytorch/pytorch/issues/120290)
Fixes https://github.com/pytorch/pytorch/issues/73502

Use `v3_dtypes` and `torch._utils._rebuild_tensor_v3` to handle `torch.save` of complex32 tensors.
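
A minimal round-trip exercising this path might look like the following sketch (torch.complex32 is also known as torch.chalf):

```
import io
import torch

t = torch.zeros(4, dtype=torch.complex32)
buf = io.BytesIO()
torch.save(t, buf)      # serialized via the v3 tensor rebuild path
buf.seek(0)
loaded = torch.load(buf)
assert loaded.dtype == torch.complex32
```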

result:
![image](https://github.com/pytorch/pytorch/assets/37650440/18b6cbb3-fb3f-4855-9d48-374014647988)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120388
Approved by: https://github.com/albanD
2024-02-28 02:32:29 +00:00
5a53c0ff23 [dynamo][refactor] Rename LIST_LENGTH to SEQUENCE_LENGTH, separate DICT_LENGTH (#120721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120721
Approved by: https://github.com/jansel
ghstack dependencies: #120520, #120590
2024-02-28 02:19:10 +00:00
1627d9e06d [aot_inductor] added a utility function aoti_torch_print_tensor_handle (#120660)
Added a function to print tensor values for a tensor handle.
It can be injected into the cpp wrapper code to help debug
numerical issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120660
Approved by: https://github.com/desertfire
2024-02-28 02:08:34 +00:00
d21c6eb215 Do not wrap output with input device inside _to_copy (#119868)
Fixing https://github.com/pytorch/pytorch/issues/118790

This diff reverts a small part of the code that was introduced in https://github.com/pytorch/pytorch/pull/104689

The PR above added a comment that "In case of dtype promotion, fake tensor converted into tensor",
but it's not always the case that a dtype conversion causes a fake tensor to become a plain tensor.

When such a conversion does not happen, we get the following error:
```
Creating a new Tensor subclass FakeTensor but the raw Tensor object is already associated to
 a python object of type FakeTensor
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119868
Approved by: https://github.com/ezyang, https://github.com/thiagocrepaldi
2024-02-28 01:51:43 +00:00
33499ec41b [FSDP2][DCP][DSD] Add FSDP2 model state dict unit test with distributed state dict (#120680)
This adds some initial unit tests for FSDP2 model state dict only.

This PR adds two tests:

1. Add a unit test checking parity between FSDP2 `model.state_dict()` and distributed_state_dict's `get_model_state_dict`.
2. Add a unit test to make sure `StateDictOptions(full_state_dict=True, cpu_offload=True)` in distributed_state_dict works for FSDP2 model state_dict.

Optimizer state dict will be in follow up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120680
Approved by: https://github.com/awgu
2024-02-28 01:40:04 +00:00
1aa9099839 [CLANGTIDY] Enable clang-tidy in torch/csrc/xpu (#120616)
# Motivation
refer to [#118504](https://github.com/pytorch/pytorch/pull/118504), enabling clang-tidy in `torch/csrc/xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120616
Approved by: https://github.com/albanD
2024-02-28 01:35:25 +00:00
1a1fc1047d Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py is some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace, which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). `trace_structured` is the main way to emit a trace log. Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long and is emitted after the metadata log line and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
ghstack dependencies: #120712
2024-02-28 01:01:41 +00:00
677e67c399 Update nn.Module._apply to not gate on should_use_set_data when swap_tensors is set (#120659)
This updates the nesting of if statements in `nn.Module._apply` such that if

`torch.__future__.set_swap_module_params_on_conversion(True)`, we always try to swap regardless of whether
- `torch._has_compatible_shallow_copy_type(param, fn(param))`
- `torch.__future__.set_overwrite_module_params_on_conversion` is set

This means that `meta_module.to_empty(device=...)` can now use the swap_tensors path, as sketched below. cc @awgu
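
A minimal sketch of the now-supported flow (the module choice and target device are illustrative):

```
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

# Build the module on the meta device, then materialize it via to_empty(),
# which can now take the swap_tensors path.
with torch.device("meta"):
    meta_module = nn.Linear(4, 4)

meta_module.to_empty(device="cpu")
print(meta_module.weight.device)  # cpu (parameters are allocated but uninitialized)
```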

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120659
Approved by: https://github.com/albanD
2024-02-28 00:59:34 +00:00
213b3ac3f2 [BE] fail_* variables don't need to be shared across restarts, they're set only once (#120712)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120712
Approved by: https://github.com/yanboliang
2024-02-28 00:48:11 +00:00
2ebf2c88ba Add test to check that COW inputs are not materialized (#119507)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119507
Approved by: https://github.com/ezyang
ghstack dependencies: #120455
2024-02-28 00:37:33 +00:00
cabc09a5f2 Avoid COW materialization in at::parallel_for/parallel_reduce (#120455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455
Approved by: https://github.com/albanD
2024-02-28 00:37:33 +00:00
cyy
1e9fafc160 [Clang-tidy header][20/N] Fix clang-tidy warnings in aten/src/ATEN/*.{cpp,h} (#120574)
This PR fixes some clang-tidy warnings in aten/src/ATEN/*.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120574
Approved by: https://github.com/Skylion007
2024-02-28 00:13:05 +00:00
9c597ff137 use condition_variable and wait_until in nccl dump on timeout (#120544)
Fixes test_c10d_nccl.py -k test_timeout_dumps_timing_enabled_True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120544
Approved by: https://github.com/atalman
2024-02-28 00:06:08 +00:00
14b258b5bc Fix broken link in README (#120698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120698
Approved by: https://github.com/janeyx99
2024-02-27 23:55:06 +00:00
5929d4e830 [CUDA][cuBLAS] Check if a context is present when grabbing a cuBLAS handle (#120131)
cuBLAS has indicated that certain kernels will transition to using the driver API over the CUDA runtime API, which we've observed to break existing tests (e.g., DataParallel) that use multithreading and may not eagerly grab a context via `cudaSetDevice`.

CC @Aidyn-A @ptrblck

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120131
Approved by: https://github.com/atalman
2024-02-27 22:45:16 +00:00
f36e00b8ce Revert "[inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)"
This reverts commit 78f53a3f731ee67dcffd308519ed48a745640dde.

Reverted https://github.com/pytorch/pytorch/pull/120639 on behalf of https://github.com/izaitsevfb due to breaking ROCm ([comment](https://github.com/pytorch/pytorch/pull/120639#issuecomment-1967585568))
2024-02-27 21:05:57 +00:00
6cc7f9a2e6 Limit loop unrolling (#120023)
Tacotron2 causes massive loop unrolling resulting in very large graphs (26k nodes) which was causing inductor (and tracing itself) to choke.

The unrolling size is controlled by the environment variable TORCHDYNAMO_MAX_LOOP_UNROLL_NODES which defaults to the arbitrary value 5000.

This updates the tacotron2 timings as follows:
eager timing: 3m:23s -> 35s
aot_eager timing: 4m:12s -> 39s
inductor timing: 22m:24s ->1m

For reference the big loop in tacotron2 was this one (model.py[405]):
```
        while len(mel_outputs) < decoder_inputs.size(0) - 1:
            decoder_input = decoder_inputs[len(mel_outputs)]
            mel_output, gate_output, attention_weights = self.decode(decoder_input)
            mel_outputs += [mel_output.squeeze(1)]
            gate_outputs += [gate_output.squeeze(1)]
            alignments += [attention_weights]
```
which gets unrolled and inlined adding about 36 nodes to the graph per iteration.

Fixes #98467
Relates to #102839 which hopefully will result in a better fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120023
Approved by: https://github.com/yanboliang
2024-02-27 20:44:21 +00:00
f3dd2a544c Revert "Add structured trace logs (#120289)"
This reverts commit 9dfaef962cda5f65eec53e5fd6f07b5226ea65cb.

Reverted https://github.com/pytorch/pytorch/pull/120289 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54230697 ([comment](https://github.com/pytorch/pytorch/pull/120289#issuecomment-1967477120))
2024-02-27 19:49:05 +00:00
eqy
65efece3a4 [CUDA][cuBLAS] Bump test_cublas_baddbmm_large_input tolerances (#117889)
Unfortunate that the current `rtol=1e-5` hits a literal 1 / 1000000 mismatch (`rtol=1.04e-5`) on L40.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117889
Approved by: https://github.com/atalman
2024-02-27 19:05:20 +00:00
5b5c167adc [dynamo] Add some helpers to PyCodegen (#120684)
These are used in later PRs in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120684
Approved by: https://github.com/yanboliang
2024-02-27 18:46:51 +00:00
0c8bb6f70c [dtensor] standardize tuple strategy handling for foreach ops (#120695)
This PR refactors the tuple strategy handling logic and allows
TupleStrategy to have both input and output specs for each OpStrategy child,
so that we can further enable operators like foreach norm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
2024-02-27 18:23:11 +00:00
440a9b212d [profiler] log process group config information in distributedInfo field (#119443)
Summary: Process group config is essential for analyzing collective patterns. We have already added this to the Execution Trace. Now expose the information in Kineto as well.

Differential Revision: D53557965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
2024-02-27 18:21:54 +00:00
78f53a3f73 [inductor][Gemm] Autotune with matrix_instr_nonkdim for AMDGPU (#120639)
Summary:
Matrix multiplication with Triton is usually performed in a tiled fashion. A tile size too large for a single hardware instruction to handle, e.g. a 128x128 matmul, has to be broken down into a sequence of smaller hardware mma instructions. On AMDGPU, matrix_instr_nonkdim controls the shape of these mma instructions; its default value in Triton is 32. This means that by default Triton will decompose a large tiled matmul operation into a sequence of 32x32x8 mma instructions. Other mma instructions are available, such as 16x16x16, which requires matrix_instr_nonkdim=16. This change enables tuning this value for Gemm, which seems to improve its performance by 20% - 2x.

Similar changes have been made to the HSTU ragged attention kernel in D53386525.

Test Plan:

Before:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0410 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0487 ms 84.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0544 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0633 ms 64.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0687 ms 59.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.0716 ms 57.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0748 ms 54.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.0788 ms 52.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.1014 ms 40.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.1069 ms 38.4%
  SingleProcess AUTOTUNE takes 8.1153 seconds
```

After:
  ```
AUTOTUNE mm(1024x1024, 1024x1024)
  ExternKernelCaller(extern_kernels.mm) 0.0417 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0470 ms 88.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0488 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0490 ms 85.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.0525 ms 79.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.0553 ms 75.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0574 ms 72.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.0634 ms 65.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=2 0.0655 ms 63.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.0681 ms 61.2%
  SingleProcess AUTOTUNE takes 11.4076 seconds
```

Before:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2094 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2452 ms 85.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2763 ms 75.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2836 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2854 ms 73.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.2951 ms 71.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.2970 ms 70.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 0.4184 ms 50.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 0.5097 ms 41.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 0.5570 ms 37.6%
  SingleProcess AUTOTUNE takes 3.4052 seconds
```

After:
  ```
AUTOTUNE mm(2048x2048, 2048x2048)
  ExternKernelCaller(extern_kernels.mm) 0.2117 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2429 ms 87.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2485 ms 85.2%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2526 ms 83.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2537 ms 83.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 0.2554 ms 82.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2623 ms 80.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 0.2695 ms 78.5%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 0.2758 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 0.2792 ms 75.8%
  SingleProcess AUTOTUNE takes 11.3538 seconds

```

Before:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5901 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 1.9380 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 1.9943 ms 79.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.0640 ms 77.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.0941 ms 75.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.1272 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 2.1554 ms 73.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 2.2931 ms 69.3%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 3.7016 ms 43.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 4.6021 ms 34.6%
  SingleProcess AUTOTUNE takes 9.0523 seconds
```

After:
  ```
AUTOTUNE mm(4096x4096, 4096x4096)
  ExternKernelCaller(extern_kernels.mm) 1.5862 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.6924 ms 93.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.7616 ms 90.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.8159 ms 87.4%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 1.9340 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 1.9352 ms 82.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0378 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 2.0983 ms 75.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1138 ms 75.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 2.1657 ms 73.2%
  SingleProcess AUTOTUNE takes 8.2225 seconds
```

Before:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 12.0134 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 14.8082 ms 81.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 15.4242 ms 77.9%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 16.6869 ms 72.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 16.7751 ms 71.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 17.0145 ms 70.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 17.1363 ms 70.1%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=4 18.2159 ms 66.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=8 29.4726 ms 40.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=0, num_warps=2 37.9039 ms 31.7%
  SingleProcess AUTOTUNE takes 11.0074 seconds
```

After:
  ```
AUTOTUNE mm(8192x8192, 8192x8192)
  ExternKernelCaller(extern_kernels.mm) 11.9554 ms 100.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 12.9953 ms 92.0%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 13.7726 ms 86.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 13.9647 ms 85.6%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=8 14.9728 ms 79.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=8 15.3729 ms 77.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 15.3955 ms 77.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=32, num_stages=0, num_warps=4 15.5647 ms 76.8%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.0037 ms 74.7%
  ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=0, num_warps=4 16.7432 ms 71.4%
  SingleProcess AUTOTUNE takes 14.9839 seconds
```

Reviewed By: xw285cornell, nmacchioni

Differential Revision: D54203170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120639
Approved by: https://github.com/xw285cornell, https://github.com/jansel
2024-02-27 18:16:33 +00:00
3f62b05d31 [export] Use forward hooks to capture module signatures. (#120468)
Summary:
When we export in strict mode and turn on preserve_module_call_signature, the following assertion error occurs today:
```
child_split[: len(parent_split)] == parent_split
```
This is because we're monkey-patching the forward call directly, which breaks attribute propagation in the tracer. It's better to implement this with a forward hook, because then we don't have to alter the original module structure at all during export.
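
As a generic illustration (not the actual export code), a forward hook can observe a submodule's calls without touching its structure:

```
import torch
import torch.nn as nn

captured = []

def record_call(module, args, output):
    # args is the tuple of positional inputs; record shapes as a stand-in
    # for the kind of call-signature information export wants to preserve.
    captured.append((type(module).__name__,
                     [tuple(a.shape) for a in args],
                     tuple(output.shape)))

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
handle = model[0].register_forward_hook(record_call)
model(torch.randn(2, 4))
handle.remove()
print(captured)  # [('Linear', [(2, 4)], (2, 8))]
```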

Test Plan: CI

Differential Revision: D54102714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120468
Approved by: https://github.com/ydwu4
2024-02-27 17:44:06 +00:00
ed3c256b61 Add lowering for adaptive_max_pool2d (#120254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120254
Approved by: https://github.com/lezcano
2024-02-27 16:32:18 +00:00
27bb73fe46 [AOTI] Fix a strict-aliasing warning (#120628)
Summary: This gets rid of an annoying compile time warning, "dereferencing type-punned pointer will break strict-aliasing rules"

Differential Revision: [D54207229](https://our.internmc.facebook.com/intern/diff/D54207229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120628
Approved by: https://github.com/Skylion007
2024-02-27 15:09:13 +00:00
c29ac05ac0 [inductor] correctly retrieve the "shared" attribute from a Triton binary (#120666)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120666
Approved by: https://github.com/jansel
2024-02-27 13:10:09 +00:00
435063aa89 Decomposition for upsample_linear{1d, 3d} (#114774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114774
Approved by: https://github.com/lezcano, https://github.com/vfdev-5, https://github.com/peterbell10
2024-02-27 11:57:45 +00:00
2ad66e6bf0 Fix test failure: Add torch.cuda._get_device_properties to dynamo trace rules (#120620)
In this PR stack, there were unrelated test failures within test_trace_rules.py. It turned out that torch.cuda._get_device_properties should be registered in _dynamo/trace_rules.py; a test failed because it was not.

This is a small fix which tries to get rid of the test failure by manually registering that function.

Note:
I am not sure whether this is the best way to fix this, as I am neither familiar with the trace rules nor with the introduction of torch.cuda._get_device_properties.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120620
Approved by: https://github.com/Skylion007
2024-02-27 10:46:01 +00:00
e3d64c4d5d [dynamo] Desugar accumulate_grad, fix .grad handling (#120590)
Fixes https://github.com/pytorch/pytorch/issues/118435
Fixes https://github.com/pytorch/pytorch/issues/119906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120590
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #120520
2024-02-27 10:12:26 +00:00
9db6a849ed [FSDP] Clean missing and unexpected keys (#120600)
Currently, when loading with strict=False, or with strict=True and looking at the
error message, FQNs are garbled with FSDP details such as "_fsdp_wrapped_module".
This makes it tricky for upstream applications to validate which
keys are expected to be missing / unexpected (for example with PEFT, where state_dict is loaded
non-strict), and makes the error message more complicated with FSDP details.

This PR cleans those prefixes by using `clean_tensor_name` in FSDP's existing
post-load_state_dict hooks. Currently, only the full_state_dict impl is tested; the rest of the impls can be tested as follow-up work.
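
Illustratively, the cleanup amounts to stripping FSDP wrapper components from each FQN, e.g. (a sketch, not the actual `clean_tensor_name` implementation):

```
# Sketch only: strip the FSDP wrapper prefix from a fully qualified name.
FSDP_WRAPPED_MODULE = "_fsdp_wrapped_module."

def clean_fqn(fqn: str) -> str:
    return fqn.replace(FSDP_WRAPPED_MODULE, "")

print(clean_fqn("model._fsdp_wrapped_module.layer1.weight"))  # model.layer1.weight
```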

Differential Revision: [D54182472](https://our.internmc.facebook.com/intern/diff/D54182472/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120600
Approved by: https://github.com/XilunWu, https://github.com/fegin
2024-02-27 07:43:45 +00:00
b2a318d856 [PyTorch][ExportedProgram] add 'is_lifted_tensor_constant' and 'get_lifted_tensor_constant' utils (#120546)
as title

Differential Revision: [D54149274](https://our.internmc.facebook.com/intern/diff/D54149274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120546
Approved by: https://github.com/kirklandsign
2024-02-27 07:16:55 +00:00
7c556428c7 Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1` (a brief sketch follows this list).
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
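
A sketch of the intended Python-frontend behavior after this change (the exact exception type raised for out-of-range indices is an assumption):

```
import torch

# Device objects can be constructed without the device being present,
# so large indices are representable up to MAX_NUM_DEVICES - 1.
d = torch.device("cuda", 511)
print(d.index)  # 511

# Out-of-range indices are now rejected instead of silently wrapping around.
try:
    torch.device("cpu", 600)
except Exception as e:  # assumed to raise; exact error type not specified here
    print("rejected:", e)
```
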
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn
2024-02-27 07:05:48 +00:00
cbbc309cae [pytree][reland] Require pytree serialized_type_name (#120636)
Relanding https://github.com/pytorch/pytorch/pull/119718 as the diff which prevents breakages of torchrec [D53857843](https://www.internalfb.com/diff/D53857843) has landed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120636
Approved by: https://github.com/avikchaudhuri
2024-02-27 06:53:33 +00:00
12f724c779 [export] preserve constant fqn (#120664)
Summary:
Previously we were renaming constants to `lifted_constant_tensor0` or equivalent. This PR changes things so that the constants retain the same FQN as in the original eager module.

Actually, `symbolic_trace` already is supposed to do this, but the code path is not triggered when used from `make_fx`, since we don't pass an actual `nn.Module` instance to `trace()`, but rather a multiply-wrapped-functionalized-lambda-thing.

So, I reproduced the essential logic outside of make_fx, at the export layer.

Test Plan: added a unit test

Differential Revision: D54221616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120664
Approved by: https://github.com/SherlockNoMad
2024-02-27 06:35:51 +00:00
a358b23a6a Keep test order due to rename_privateuse1_backend is disposable (#120464)
With the change in https://github.com/pytorch/pytorch/pull/120399,
since rename_privateuse1_backend is disposable, running test_external_module_register with a renamed backend may cause problems. Change the test case name to keep the right (ASCII) order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120464
Approved by: https://github.com/albanD
2024-02-27 05:38:43 +00:00
5a5b654481 [BE]: Enable ruff LOG checks (#120674)
Enable LOG error codes in ruff to find bad usages of the logger: https://docs.astral.sh/ruff/rules/#flake8-logging-log
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120674
Approved by: https://github.com/ezyang
2024-02-27 04:37:20 +00:00
b6139b1e57 [PyTorch][CUDA Caching Allocator] Export sync-stream-and-free-HBM counter in memory_stats for performance debugging (#120050)
Differential Revision: D53734057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120050
Approved by: https://github.com/xw285cornell
2024-02-27 04:34:53 +00:00
a1c641f118 [executorch hash update] update the pinned executorch hash (#120675)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120675
Approved by: https://github.com/pytorchbot
2024-02-27 03:59:16 +00:00
237773132d Restore artifact name in log messages (#120671)
Yuzhen Huang was complaining to me that searching for `__recompile`
no longer works. This is because the glog format shows the filename, not the
logger name, so we lost the artifact name. Add it back.

Looks like:

```
V0226 15:56:04.142000 139828992779264 torch/_dynamo/guards.py:1084] [0/2] __guards: ___check_type_id(L['inputs'], 7626144)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120671
Approved by: https://github.com/Skylion007
2024-02-27 03:37:11 +00:00
ac28571742 [vision hash update] update the pinned vision hash (#119944)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119944
Approved by: https://github.com/pytorchbot
2024-02-27 03:25:51 +00:00
9d423f0e91 [audio hash update] update the pinned audio hash (#120135)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120135
Approved by: https://github.com/pytorchbot
2024-02-27 03:20:00 +00:00
63f874b476 [dynamo][guards-cpp-refactor] DictGetItemGuardAccessor for f_locals (#120593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120593
Approved by: https://github.com/jansel
2024-02-27 03:13:55 +00:00
27990045ff docker: Only match tags that start with v* (#120670)
To avoid issues where version could be confused with a ciflow tag.

Example:

```
❯ git describe --tags --always
ciflow/periodic/c3496d50f0bb437c70f27085f71155209277bfd4-47-g4ca24959d1a
❯ git describe --tags --always --match "v[1-9]*.*"
v1.8.0-rc1-36500-g4ca24959d1a
```

Resolves https://github.com/pytorch/pytorch/issues/120392

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120670
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-02-27 02:55:33 +00:00
cf6df886a0 Remove hard numpy dependency from experimental_ops.py (#119520)
Based on similar code in the codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119520
Approved by: https://github.com/albanD
2024-02-27 02:46:13 +00:00
2de7468d2b Switch to native functional collective by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.

Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they remain covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).

Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation (a sketch of using it follows below). We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.

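A minimal sketch of using this escape hatch; whether the variable is read at import time or lazily is an assumption, so it is set before importing torch here:
```
import os

# Hedged sketch: opt back into the previous (py) funcol implementation.
os.environ["TORCH_DISABLE_NATIVE_FUNCOL"] = "1"

import torch  # noqa: E402
```
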
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
2024-02-27 01:53:56 +00:00
8a59f49da2 [dynamo][compile-time] Collect guard debug stack info only with logs enabled (#120520)
Reduces backend=eager compile time from 33 to 19 seconds for `MobileBertForQuestionAnswering`. This also helps an internal model where the guards.add function takes 124 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120520
Approved by: https://github.com/mlazos
2024-02-27 01:51:16 +00:00
2e0e545759 [EZ][BE] Use nested namespace in functorch (#120663)
I should really enable this clang-tidy check rather than doing it by hand
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120663
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-02-27 01:45:32 +00:00
b3fe53e1ad [1/2] Intel GPU Runtime Upstreaming for Generator (#118528)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the last runtime component we would like to upstream is `Generator` which is responsible for the pseudo-random number generation. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `aten`.

# Design
Following the previous design, `c10::GeneratorImpl` is the device-agnostic abstraction of a random number generator. So we introduce an XPU generator, `XPUGeneratorImpl`, inheriting from `c10::GeneratorImpl`, to manage random states on an Intel GPU device. The Intel GPU runtime `Generator` adopts the same algorithm as the CPU one. The corresponding C++ files are placed in the aten/src/ATen/xpu/ folder and built into `libtorch_xpu.so`.
This PR provides the following APIs:
- `getDefaultXPUGenerator`
- `createXPUGenerator`

# Additional Context
The 2nd PR will cover `python frontend`.

The differences from CUDA:
The generator-related ATen C++ APIs map 1:1 with CUDA.
XPUGeneratorImpl's member functions differ slightly from CUDA's.
The following CUDA counterpart APIs are not provided:
- capture_prologue
- capture_epilogue
- philox_cuda_state
- reset_rnn_state

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118528
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-02-27 01:39:40 +00:00
f064dec7e0 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-27 01:34:59 +00:00
ef9b6d6816 Replace individual detaches with overall torch.no_grad decorator (#120638)
Fixes https://github.com/pytorch/pytorch/issues/120611.

At first, I thought there were too many detaches, but @awgu and I concluded that both `clip_grad_norm_` and `clip_grad_value_` should run under torch.no_grad, similar to the optimizer step. One option is to keep calling `detach`, but doing that on many tensors is slower than entering no_grad mode (I think?), and Andrew had noticed: "the 1st round of detaches takes 10 ms for FSDP2, whereas existing FSDP's clip_grad_norm_ only takes 3 ms total" since there are more tensors in FSDP2.

This change also disables grad mode for the foreach path of `clip_grad_value_`; the first attempt omitted this by oversight. Not sure how to add a test case for this, since grad mode is turned back on after the call.

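A minimal sketch of the pattern this change adopts, shown for the value-clipping case; it is illustrative only and not the actual torch/nn/utils/clip_grad.py implementation:
```
import torch

# Hedged sketch: run the whole clipping routine under no_grad (like an optimizer
# step) instead of detaching every gradient tensor individually.
@torch.no_grad()
def clip_grad_value_sketch(parameters, clip_value: float) -> None:
    grads = [p.grad for p in parameters if p.grad is not None]
    if grads:
        torch._foreach_clamp_min_(grads, -clip_value)
        torch._foreach_clamp_max_(grads, clip_value)
```
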
The new profile is not much different from the one at the bottom of this stack, but the number of detaches is 0 :D:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (c71bcceb)]$ python playground2.py
STAGE:2024-02-26 13:07:15 211224:211224 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:07:16 211224:211224 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:07:16 211224:211224 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        70.63%     110.415ms        70.63%     110.415ms       5.811ms       0.000us         0.00%       0.000us       0.000us            19
                               aten::linalg_vector_norm         0.18%     284.000us        26.00%      40.636ms      40.636ms       3.000us         0.99%       3.000us       3.000us             1
                                            aten::clamp         0.09%     148.000us        14.88%      23.261ms      23.261ms       1.000us         0.33%       1.000us       1.000us             1
                                               aten::to         0.75%       1.170ms        14.05%      21.970ms      84.826us       0.000us         0.00%     258.000us       0.996us           259
                                         aten::_to_copy         2.28%       3.562ms        13.31%      20.800ms     161.240us       0.000us         0.00%     258.000us       2.000us           129
                                    aten::_foreach_norm         4.44%       6.935ms        12.72%      19.878ms       9.939ms      19.000us         6.29%      21.000us      10.500us             2
                                              aten::add         0.11%     173.000us        10.97%      17.153ms      17.153ms       1.000us         0.33%       1.000us       1.000us             1
                                            aten::stack         2.99%       4.673ms         9.15%      14.300ms      14.300ms       0.000us         0.00%       6.000us       6.000us             1
                                            aten::copy_         5.49%       8.586ms         8.96%      14.001ms     108.535us     258.000us        85.43%     258.000us       2.000us           129
                                       aten::reciprocal         0.11%     179.000us         8.35%      13.051ms      13.051ms       1.000us         0.33%       1.000us       1.000us             1
                                              aten::cat         0.64%     993.000us         4.42%       6.902ms       6.902ms       6.000us         1.99%       6.000us       6.000us             1
                                            aten::zeros         0.04%      69.000us         4.28%       6.698ms       3.349ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::zero_         0.04%      66.000us         4.13%       6.462ms       3.231ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::fill_         0.06%      98.000us         4.09%       6.396ms       3.198ms       2.000us         0.66%       2.000us       1.000us             2
                                    aten::_foreach_mul_         1.50%       2.342ms         3.79%       5.924ms       2.962ms      10.000us         3.31%      10.000us       5.000us             2
                                            aten::empty         3.27%       5.115ms         3.27%       5.115ms      19.826us       0.000us         0.00%       0.000us       0.000us           258
                                    aten::empty_strided         2.07%       3.237ms         2.07%       3.237ms      25.093us       0.000us         0.00%       0.000us       0.000us           129
                             cudaDeviceEnablePeerAccess         1.93%       3.023ms         1.93%       3.023ms       1.512ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.21%       1.896ms         1.74%       2.725ms      10.645us       0.000us         0.00%       0.000us       0.000us           256
                                        cudaMemcpyAsync         1.01%       1.572ms         1.01%       1.572ms      12.186us       0.000us         0.00%       0.000us       0.000us           129
                                       aten::as_strided         0.54%     839.000us         0.54%     839.000us       3.265us       0.000us         0.00%       0.000us       0.000us           257
                                    cudaStreamWaitEvent         0.34%     539.000us         0.34%     539.000us       2.089us       0.000us         0.00%       0.000us       0.000us           258
                                        cudaEventRecord         0.18%     274.000us         0.18%     274.000us       1.062us       0.000us         0.00%       0.000us       0.000us           258
                                              aten::mul         0.07%     107.000us         0.08%     132.000us     132.000us       1.000us         0.33%       1.000us       1.000us             1
                                  cudaDeviceSynchronize         0.01%      17.000us         0.01%      17.000us       8.500us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       7.000us         0.00%       7.000us       3.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us         0.66%       2.000us       1.000us             2
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         4.30%      13.000us       3.250us             4
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        85.43%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.99%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         3.31%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 156.319ms
Self CUDA time total: 302.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120638
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #120623
2024-02-27 01:27:05 +00:00
df72819f91 clip_grad_norm can use fast foreach path for inf norm (#120623)
Now that foreach_norm supports inf, we should not special case it.

For a mere 256 parameters, we get a 30ms win in CPU time and a ~800us -> 300us decrease in CUDA time. The win only grows with more parameters.

New profile:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (bf1c0490|REBASE-i|detached HEAD)]$ python playground2.py
STAGE:2024-02-26 13:14:10 395517:395517 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:14:11 395517:395517 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:14:11 395517:395517 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        67.01%     102.262ms        67.01%     102.262ms       5.382ms       2.000us         0.66%       2.000us       0.105us            19
                               aten::linalg_vector_norm         0.20%     311.000us        23.44%      35.776ms      35.776ms       3.000us         0.99%       3.000us       3.000us             1
                                               aten::to         0.79%       1.208ms        14.62%      22.311ms      86.143us       0.000us         0.00%     263.000us       1.015us           259
                                            aten::clamp         0.12%     182.000us        13.96%      21.303ms      21.303ms       1.000us         0.33%       1.000us       1.000us             1
                                         aten::_to_copy         2.38%       3.628ms        13.83%      21.103ms     163.589us       0.000us         0.00%     263.000us       2.039us           129
                                    aten::_foreach_norm         4.71%       7.185ms        13.54%      20.659ms      10.329ms      19.000us         6.29%      23.000us      11.500us             2
                                              aten::add         0.14%     211.000us        10.86%      16.580ms      16.580ms       1.000us         0.33%       1.000us       1.000us             1
                                            aten::stack         3.11%       4.744ms         9.59%      14.642ms      14.642ms       0.000us         0.00%       6.000us       6.000us             1
                                            aten::copy_         5.71%       8.721ms         9.27%      14.152ms     109.705us     258.000us        85.43%     263.000us       2.039us           129
                                       aten::reciprocal         0.13%     193.000us         7.93%      12.100ms      12.100ms       1.000us         0.33%       1.000us       1.000us             1
                                              aten::cat         0.67%       1.017ms         4.67%       7.129ms       7.129ms       6.000us         1.99%       6.000us       6.000us             1
                                            aten::zeros         0.05%      79.000us         4.46%       6.800ms       3.400ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::zero_         0.05%      79.000us         4.28%       6.537ms       3.268ms       0.000us         0.00%       2.000us       1.000us             2
                                            aten::fill_         0.09%     131.000us         4.23%       6.458ms       3.229ms       2.000us         0.66%       2.000us       1.000us             2
                                    aten::_foreach_mul_         1.56%       2.377ms         3.86%       5.896ms       2.948ms      10.000us         3.31%      10.000us       5.000us             2
                                            aten::empty         3.55%       5.414ms         3.55%       5.414ms      20.984us       0.000us         0.00%       0.000us       0.000us           258
                                    aten::empty_strided         2.18%       3.323ms         2.18%       3.323ms      25.760us       0.000us         0.00%       0.000us       0.000us           129
                                           aten::detach         0.85%       1.302ms         2.10%       3.199ms      12.496us       0.000us         0.00%       0.000us       0.000us           256
                             cudaDeviceEnablePeerAccess         2.01%       3.069ms         2.01%       3.069ms       1.534ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.24%       1.899ms         1.81%       2.769ms      10.816us       0.000us         0.00%       0.000us       0.000us           256
                                                 detach         1.24%       1.897ms         1.24%       1.897ms       7.410us       0.000us         0.00%       0.000us       0.000us           256
                                        cudaMemcpyAsync         1.01%       1.539ms         1.01%       1.539ms      11.930us       0.000us         0.00%       0.000us       0.000us           129
                                       aten::as_strided         0.58%     881.000us         0.58%     881.000us       3.428us       0.000us         0.00%       0.000us       0.000us           257
                                    cudaStreamWaitEvent         0.35%     540.000us         0.35%     540.000us       2.093us       0.000us         0.00%       0.000us       0.000us           258
                                        cudaEventRecord         0.18%     278.000us         0.18%     278.000us       1.078us       5.000us         1.66%       5.000us       0.019us           258
                                              aten::mul         0.08%     125.000us         0.09%     138.000us     138.000us       1.000us         0.33%       1.000us       1.000us             1
                                  cudaDeviceSynchronize         0.01%      13.000us         0.01%      13.000us       6.500us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       5.000us         0.00%       5.000us       2.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us         0.66%       2.000us       1.000us             2
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      13.000us         4.30%      13.000us       3.250us             4
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        85.43%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         1.99%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.99%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.33%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         3.31%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 152.613ms
Self CUDA time total: 302.000us
```

Compared to on main:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5a0a9644)]$ python playground2.py
STAGE:2024-02-26 13:09:56 285045:285045 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-26 13:09:57 285045:285045 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-26 13:09:57 285045:285045 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        61.42%     113.375ms        61.42%     113.375ms     424.625us      45.000us         5.66%      45.000us       0.169us           267
                               aten::linalg_vector_norm        14.04%      25.909ms        37.67%      69.534ms     271.617us     514.000us        64.65%     559.000us       2.184us           256
                                               aten::to         0.78%       1.433ms        12.87%      23.751ms      91.703us       0.000us         0.00%     278.000us       1.073us           259
                                         aten::_to_copy         2.02%       3.730ms        12.09%      22.318ms     173.008us       0.000us         0.00%     278.000us       2.155us           129
                                            aten::clamp         0.09%     174.000us        11.43%      21.103ms      21.103ms       1.000us         0.13%       1.000us       1.000us             1
                                              aten::add         0.11%     205.000us         9.08%      16.768ms      16.768ms       1.000us         0.13%       1.000us       1.000us             1
                                            aten::copy_         4.94%       9.112ms         8.15%      15.043ms     116.612us     258.000us        32.45%     278.000us       2.155us           129
                                            aten::stack         2.76%       5.091ms         7.97%      14.719ms      14.719ms       0.000us         0.00%       6.000us       6.000us             1
                                       aten::reciprocal         0.11%     194.000us         7.01%      12.933ms      12.933ms       1.000us         0.13%       1.000us       1.000us             1
                                              aten::max         0.09%     165.000us         6.43%      11.868ms      11.868ms       3.000us         0.38%       3.000us       3.000us             1
                                           aten::detach         1.58%       2.911ms         4.12%       7.596ms      14.836us       0.000us         0.00%       0.000us       0.000us           512
                                              aten::cat         0.56%       1.042ms         3.73%       6.882ms       6.882ms       6.000us         0.75%       6.000us       6.000us             1
                                    aten::_foreach_mul_         1.36%       2.503ms         3.33%       6.145ms       3.072ms      10.000us         1.26%      10.000us       5.000us             2
                                                 detach         2.54%       4.685ms         2.54%       4.685ms       9.150us       0.000us         0.00%       0.000us       0.000us           512
                                    aten::empty_strided         1.92%       3.545ms         1.92%       3.545ms      27.481us       0.000us         0.00%       0.000us       0.000us           129
                             cudaDeviceEnablePeerAccess         1.64%       3.022ms         1.64%       3.022ms       1.511ms       0.000us         0.00%       0.000us       0.000us             2
                                        aten::unsqueeze         1.03%       1.892ms         1.49%       2.746ms      10.727us       0.000us         0.00%       0.000us       0.000us           256
                                       aten::as_strided         1.35%       2.494ms         1.35%       2.494ms       4.862us       0.000us         0.00%       0.000us       0.000us           513
                                        cudaMemcpyAsync         1.01%       1.868ms         1.01%       1.868ms      14.481us       4.000us         0.50%       4.000us       0.031us           129
                                    cudaStreamWaitEvent         0.41%     760.000us         0.41%     760.000us       2.946us       8.000us         1.01%       8.000us       0.031us           258
                                        cudaEventRecord         0.15%     276.000us         0.15%     276.000us       1.070us       8.000us         1.01%       8.000us       0.031us           258
                                              aten::mul         0.08%     139.000us         0.08%     153.000us     153.000us       1.000us         0.13%       1.000us       1.000us             1
                                            aten::empty         0.02%      35.000us         0.02%      35.000us      35.000us       0.000us         0.00%       0.000us       0.000us             1
                                  cudaDeviceSynchronize         0.01%      14.000us         0.01%      14.000us       7.000us       0.000us         0.00%       0.000us       0.000us             2
                                cudaDeviceCanAccessPeer         0.00%       5.000us         0.00%       5.000us       2.500us       0.000us         0.00%       0.000us       0.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us     514.000us        64.65%     514.000us       2.008us           256
                         Memcpy PtoP (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     258.000us        32.45%     258.000us       2.000us           129
void at::native::(anonymous namespace)::CatArrayBatc...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us         0.75%       6.000us       3.000us             2
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us         0.38%       3.000us       3.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us         0.13%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us      10.000us         1.26%      10.000us       2.500us             4
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 184.579ms
Self CUDA time total: 795.000us
```

For script:
```
import torch
from math import inf
from torch.nn.utils import clip_grad_norm_

params = [torch.rand(32, 16, device="cuda:3")*5 for _ in range(128)] + [torch.rand(32, 16, device="cuda:4")*-7 for _ in range(128)]
for p in params:
    p.grad = torch.rand_like(p)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    total_norm = clip_grad_norm_(params, 10.0, norm_type=inf)
    torch.cuda.synchronize()

print(p.key_averages().table(sort_by="cpu_time_total"))
print(total_norm)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120623
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-27 01:27:05 +00:00
b01bd1f7a1 Revert "Add torch.ops.aten.print (#120295)"
This reverts commit 3b944113c837e1111510487f4525aa07039462fe.

Reverted https://github.com/pytorch/pytorch/pull/120295 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54123688 ([comment](https://github.com/pytorch/pytorch/pull/120295#issuecomment-1965618191))
2024-02-27 01:18:48 +00:00
17560eb472 Revert "[Dynamo] Remove deadcode: unwrapping script_if_tracing (#120444)"
This reverts commit 4d2073bc3faa7f2014c4fb2f568e68fe195b6f99.

Reverted https://github.com/pytorch/pytorch/pull/120444 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54192376 ([comment](https://github.com/pytorch/pytorch/pull/120444#issuecomment-1965600268))
2024-02-27 00:58:00 +00:00
e874376f6a Mark test_reference_numerics_extremal__refs_frexp_cuda as xfail on windows (#120640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120640
Approved by: https://github.com/clee2000
2024-02-27 00:35:55 +00:00
d341b66e96 Revert [dynamo] support group=None when rewriting collectives (#12018) (#120677)
This reverts commit 298c686d3f7bc26399481b8830e71c4f02ce629c.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120677
Approved by: https://github.com/yifuwang, https://github.com/huydhn
2024-02-27 00:33:35 +00:00
fdae9363b3 [meta registration] efficient_attention_forward fix for NT inputs (#120594)
When cu_seqlens_q is provided, we should use the user-specified max_seqlen_q instead of inferring it as query.size(1):

1c7b0e7cd1/aten/src/ATen/native/transformers/cuda/attention.cu (L989)

This wasn't caught because the value is taken as ceil(max_seqlen / 32) * 32 in the opinfos, and the opinfo inputs were small enough that this value was 32 in either case.

Differential Revision: [D54179733](https://our.internmc.facebook.com/intern/diff/D54179733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120594
Approved by: https://github.com/drisspg
2024-02-27 00:10:37 +00:00
9dfaef962c Add structured trace logs (#120289)
Overall design: https://docs.google.com/document/d/1CX_hJ0PNy9f3R1y8TJrfkSeLkvGjjjLU84BSXgS2AZ8/edit

How to read the diff:
* Most files are me augmenting pre-existing logging with structured variants. For the most part it's simple (esp FX graphs, which have a canonical string representation); it gets more complicated when I decided to JSON-ify some data structure instead of keeping the ad hoc printing (notably, guards and dynamo output graph sizes)
* torch/_functorch/_aot_autograd/collect_metadata_analysis.py contains some unrelated fixes I noticed while auditing artifact logs
* torch/_logging/_internal.py has the actual trace log implementation. The trace logger is implemented as a logger named torch.__trace which is disconnected from the logging hierarchy. It gets its own handler and formatter (TorchLogsFormatter with _is_trace True). There's a teensy bit of FB-specific code to automatically enable trace logging if a /logs directory exists. `trace_structured` is the main way to emit a trace log (see the sketch after this list). Unusually, there are separate "metadata" and "payload" fields. The metadata field should not be too long (as it is serialized as a single line) and is always JSON (we put contextual things like compile id in it); the payload field can be long, is emitted after the metadata log line, and can span multiple lines.
* torch/_logging/structured.py contains some helpers for converting Python data structures into JSON form. Notably, we have a string interning implementation here, which helps reduce the cost of serializing filenames into the log.
* test/dynamo/test_structured_trace.py: the tests are cribbed from test_logging.py, but all rewritten to use expect tests on munged versions of what we'd actually output. Payloads are never tested, since they tend not to be very stable.
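
A hedged sketch of emitting a trace log with `trace_structured`, based on the description above; the exact argument names (`metadata_fn`, `payload_fn`) and the "artifact" record name are assumptions about the API, not guaranteed:
```
from torch._logging._internal import trace_structured

# Hedged sketch: metadata is kept small (one JSON line); the payload may be
# long and is emitted after the metadata line.
trace_structured(
    "artifact",
    metadata_fn=lambda: {"name": "my_debug_artifact", "encoding": "string"},
    payload_fn=lambda: "a long payload\nthat can span multiple lines",
)
```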

https://github.com/ezyang/tlparse is a POC Rust program that can interpret these logs.

Testing that the fbcode detection works at https://www.internalfb.com/mlhub/pipelines/runs/fblearner/534553450 (Meta-only)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120289
Approved by: https://github.com/Skylion007
2024-02-27 00:04:23 +00:00
ecb3f33a1a [dynamo] fix segfault in _debug_get_cache_entry_list (#120635)
Fix https://github.com/pytorch/pytorch/issues/120607.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120635
Approved by: https://github.com/jansel
2024-02-26 23:31:09 +00:00
64660b51f6 Add the hyperlink of the transfomer doc (#120565)
Fixes #120488

- The shape for forward pass is clearly stated in the main [transformer class](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)

- Boolean mask for _key_padding_mask is also explained in the main transformer class.

Therefore, add the hyperlink to the transformer class explicitly so the user can refer back to the main class. Also, correct several symbols in the transformer doc from normal text style to math style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120565
Approved by: https://github.com/mikaylagawarecki
2024-02-26 23:11:58 +00:00
Kai
c59b14163b Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2024-02-26 23:04:52 +00:00
30625ae582 Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have both pytest rerunning the failing test twice via 81abc2b249/test/run_test.py (L966) and our own logic retrying it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-26 22:21:14 +00:00
41adec3c59 Revert "Switch to native functional collective by default (#120370)"
This reverts commit 1f1bc0e6acc3613339b1001a7c9fcd1dfe7b6580.

Reverted https://github.com/pytorch/pytorch/pull/120370 on behalf of https://github.com/yifuwang due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120370#issuecomment-1965362938))
2024-02-26 21:55:13 +00:00
7b1cc140aa Use lxml in scripts/compile_tests when it is available (#120633)
It's around 30x (300s -> 10s) faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120633
Approved by: https://github.com/oulgen
2024-02-26 21:35:22 +00:00
5a0a964444 [Dynamo] Fix guards for script_if_tracing or lru_cache fn with default args (#120390)
Fixes #120387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120390
Approved by: https://github.com/anijain2305
2024-02-26 19:40:14 +00:00
55b5908427 [PT2][Inductor]Add unbind node normalization (#120253)
Summary: Normalize unbind nodes for the followup split_cat pattern detection and node removals

Test Plan:
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/f42297c2-2595-40a2-b270-5cec026f2fe4
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17451448578242323
Network: Up: 132KiB  Down: 88KiB  (reSessionID-fc725143-317a-42a9-bc7e-0bbab6ef9e5c)
Jobs completed: 27. Time elapsed: 3:09.2s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

```
buck2 test mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb
```

Differential Revision: D53964593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120253
Approved by: https://github.com/jackiexu1992
2024-02-26 19:13:26 +00:00
274b362442 [FSDP] Removed .detach in clip_grad_norm_ (#120612)
This seems unnecessary under `no_grad()` context. The unit tests still pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120612
Approved by: https://github.com/Skylion007
ghstack dependencies: #120231
2024-02-26 19:03:00 +00:00
fd3cf88f27 Rewrite docs about why we guard on dynamic dims (#120566)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120566
Approved by: https://github.com/desertfire
2024-02-26 18:58:30 +00:00
759204253f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to use assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit at runtime, which means that if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data-dependent calls on the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-26 17:56:12 +00:00
2fb32a5f07 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53771626](https://our.internmc.facebook.com/intern/diff/D53771626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-26 17:35:23 +00:00
ee01d0807b [dynamo] Function => FunctionCtx for placeholder obj (#120577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120577
Approved by: https://github.com/yanboliang
2024-02-26 17:16:31 +00:00
7eb7ac815f [inductor] Optimize welford reduction (#120330)
This does two things:
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match the two-pass reduction with bfloat16, but it is still a significant improvement.
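
For reference, a hedged sketch in plain Python of a Welford update with both tweaks; the real change lives in the Inductor codegen, so this only illustrates the arithmetic:
```
# Hedged sketch of a single-pass Welford mean/variance accumulator:
# (1) the first iteration just seeds the accumulator, skipping the combine math;
# (2) the division by the running count becomes a multiply by its reciprocal.
def welford_reduce(values):
    mean, m2, weight = 0.0, 0.0, 0
    for i, x in enumerate(values):
        if i == 0:                  # short-circuit: nothing to combine with yet
            mean, m2, weight = x, 0.0, 1
            continue
        weight += 1
        rcp = 1.0 / weight          # reciprocal computed once, then multiplied
        delta = x - mean
        mean += delta * rcp
        m2 += delta * (x - mean)
    return mean, m2, weight

mean, m2, n = welford_reduce([1.0, 2.0, 4.0, 8.0])
print(mean, m2 / n)                 # running mean and (population) variance
```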

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-26 17:01:47 +00:00
c39bbd6def Numbers based TD (#119901)
Convert from a list/bucket based TD system to just a numbers based TD system.  Looks like a massive change but a decent amount of it is tests and removing code.

The main file of interest is interface.py, which GitHub collapses by default due to its size.

The test files pretty much got rewritten entirely since a lot of the old tests are no longer relevant.

Other notable changes:
* Use Frozenset to make TestRun hashable
* Adds tools/test/heuristics/__init__.py to ensure that unittest can discover the tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119901
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2024-02-26 17:01:19 +00:00
86063b4d03 Add torch._print to dynamo trace_rules (#120533)
Fixes #114831

Before:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $  python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
F
======================================================================
FAIL: test_torch_name_rule_map_updated (__main__.TraceRuleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2739, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 328, in test_torch_name_rule_map_updated
    self._check_set_equality(
  File "/data/users/angelayi/pytorch/test/dynamo/test_trace_rules.py", line 302, in _check_set_equality
    self.assertTrue(len(x) == 0, msg1)
AssertionError: False is not true : New torch objects: {<built-in method _print of type object at 0x7ff477e40ee0>} were not added to `trace_rules.torch_c_binding_in_graph_functions` or `test_trace_rules.ignored_c_binding_in_graph_function_names`. Refer the instruction in `torch/_dynamo/trace_rules.py` for more details.

To execute this test, run the following from the base repo dir:
     python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.184s

FAILED (failures=1)
```
After change:
```
(pytorch10) angelayi@devgpu022 ~/local/pytorch [main] $  python test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated
.
----------------------------------------------------------------------
Ran 1 test in 0.209s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120533
Approved by: https://github.com/clee2000, https://github.com/yanboliang, https://github.com/huydhn, https://github.com/Skylion007
2024-02-26 16:52:59 +00:00
8a32a07856 Revert "Add meta device support to sparse compressed tensors (#120498)"
This reverts commit 5d71ba688563ef491bb28d47c493ec6fc7791da2.

Reverted https://github.com/pytorch/pytorch/pull/120498 on behalf of https://github.com/zou3519 due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/120498#issuecomment-1964491999))
2024-02-26 15:59:36 +00:00
b381a4372b make GPT2ForSequenceClassification pass inference accuracy check (#120537)
We need a higher tolerance for GPT2ForSequenceClassification since if I change --bfloat16 in
```
time python benchmarks/dynamo/huggingface.py --accuracy --inference --bfloat16 --backend inductor --disable-cudagraphs --only GPT2ForSequenceClassification
```
to --float16 or --float32 it will pass the accuracy check.

Adding --freezing can also make the test pass for this model. I think that may be due to a different fusion output being generated (depending on whether constant propagation happens, which is controlled by freezing), causing a small numerical difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120537
Approved by: https://github.com/jansel
2024-02-26 11:02:57 +00:00
f4cf25bb24 Fix a bug where nn.functional._AllGather.backward produces wrong gradients (#120582)
Summary:
Fixes #120386

`_AllGather.backward` assumes that `_ReduceScatter` always updates the output buffer in place. However, when the output buffer is non-contiguous, `_ReduceScatter` allocates and returns a different buffer, causing the gradient to be thrown away.

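A hedged sketch of the guard this fix amounts to; `reduce_into` and its arguments are illustrative stand-ins, not the actual `_ReduceScatter` call:
```
import torch

# Hedged sketch: a collective may return a fresh contiguous buffer instead of
# writing into a non-contiguous `out` in place; copying back preserves the result.
def reduce_into(out: torch.Tensor, op) -> torch.Tensor:
    result = op(out)
    if result is not out:
        out.copy_(result)
    return out

buf = torch.zeros(4, 4).t()                          # non-contiguous output buffer
reduce_into(buf, lambda o: o.contiguous().add_(1))   # toy stand-in for the collective
print(buf.sum())                                     # 16.0: the result was copied back
```
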
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120582
Approved by: https://github.com/XilunWu
2024-02-26 09:58:27 +00:00
c617e7b407 Add resnet50/mobilenet_v2_quantized_qat into deterministic_algorithms exclusive list (#120384)
After PR https://github.com/pytorch/pytorch/pull/120026, two `Torchbench` test cases, `resnet50_quantized_qat` and `mobilenet_v2_quantized_qat`, pass the performance test but fail the accuracy test. The failure message is: `mobilenet_v2_quantized_qat, RuntimeError: quantized_resize_cpu_ does not have a deterministic implementation but you set 'torch.use_deterministic_algorithms(True)'.`

- `torch.use_deterministic_algorithms(True)` is only set for the accuracy test. fff9d98e58/benchmarks/dynamo/common.py (L3480)
- However, `quantized_resize_cpu_` only supports nondeterministic algorithms because the resized output memory may be uninitialized. fff9d98e58/aten/src/ATen/native/quantized/cpu/TensorOperators.cpp (L85-L87)

This PR adds these two models to the deterministic_algorithms exclusion list.

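For context, a small hedged example of the global switch involved; the exclusion itself lives in the benchmark harness, so this only shows the toggle:
```
import torch

# The accuracy harness turns this on globally; any op without a deterministic
# implementation (such as quantized_resize_cpu_) then raises a RuntimeError.
torch.use_deterministic_algorithms(True)

x = torch.randn(3)
print(x.cumsum(0))  # fine: cumsum has a deterministic CPU implementation

torch.use_deterministic_algorithms(False)  # performance runs leave this off
```
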
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120384
Approved by: https://github.com/desertfire, https://github.com/jgong5
2024-02-26 05:05:43 +00:00
a299db2983 [dynamo][guards-cpp-refactor] NO_HASATTR guard (#120469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120469
Approved by: https://github.com/jansel
2024-02-26 04:37:40 +00:00
1c7b0e7cd1 [inductor][cpp] disable masked load for non-fp data types (#120558)
Fix https://github.com/pytorch/pytorch/issues/120377. We disable the masked load for non-fp data types for now. The complete support of masks will be added in https://github.com/pytorch/pytorch/pull/119654.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120558
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-02-26 04:12:22 +00:00
ea20885d95 [executorch hash update] update the pinned executorch hash (#120264)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120264
Approved by: https://github.com/pytorchbot
2024-02-26 03:55:32 +00:00
c18623b7ed [dynamo] Reland 120147 - - Use EQUALS_MATCH guard for mod.training (#120578)
To fix the memory leak discovered in https://github.com/pytorch/pytorch/issues/112090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120578
Approved by: https://github.com/jansel
2024-02-26 03:49:47 +00:00
685d862c45 Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new. (#119263)
1) Use items stored in torch._tensor_classes to check items passed from the Python side;
2) Add SparsePrivateUse1 to backend_to_string, layout_from_backend and check_base_legacy_new;
3) Use a more general API to get the Python module name in the get_storage_obj and get_name functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263
Approved by: https://github.com/ezyang
2024-02-26 01:54:30 +00:00
4328e772bf [dynamo][guards-cpp-refactor] DICT_VERSION guard (#120416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120416
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344, #120359
2024-02-25 23:24:24 +00:00
c269e48af0 [dynamo][guards-cpp-refactor] DictGuardManager (#120359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120359
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344
2024-02-25 23:24:24 +00:00
775a4388d9 [dynamo][guards-cpp-refactor] WEAKREF_ALIVE guard (#120344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120344
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342
2024-02-25 23:24:04 +00:00
5d71ba6885 Add meta device support to sparse compressed tensors (#120498)
As in the title.

Unblocks https://github.com/pytorch/pytorch/pull/117907#discussion_r1499251745

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120498
Approved by: https://github.com/ezyang
2024-02-25 16:50:17 +00:00
834c7a1d3e [dynamo][refactor] Move some helper functions to global scope (#120426)
This is to prepare for guard C++ refactor work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120426
Approved by: https://github.com/ezyang
2024-02-25 04:38:20 +00:00
5c7b761f8e Fix default world_size when running on 1 or 0 GPU (#119372)
The mentioned distributed tests would fail if the number of available GPUs isn't sufficient; we need to correct the default world size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119372
Approved by: https://github.com/eqy, https://github.com/fegin
2024-02-25 04:14:34 +00:00
cyy
81f0b2c14e [Clang-tidy header][19/N] Enable clang-tidy on torch/csrc/autograd/profiler_legacy.* (#120552)
This PR enables clang-tidy on torch/csrc/autograd/profiler_legacy.* and cleans some path rules of clang-tidy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120552
Approved by: https://github.com/Skylion007
2024-02-25 03:29:40 +00:00
298c686d3f [dynamo] support group=None when rewriting collectives (#120118)
Resolves case 2 in #120082.

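For illustration, a hedged sketch of the call pattern this makes traceable; it assumes a default process group has already been initialized and is not the exact repro from the issue:
```
import torch
import torch.distributed as dist

@torch.compile
def allreduce_mean(t: torch.Tensor) -> torch.Tensor:
    # group=None (or omitting group) now resolves to the default process
    # group when dynamo rewrites the collective, instead of failing.
    dist.all_reduce(t, group=None)
    return t / dist.get_world_size()
```
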
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120118
Approved by: https://github.com/wconstab
ghstack dependencies: #120370
2024-02-25 03:12:10 +00:00
3e382456c1 Fix compiler check (#120492)
Fixes #119304

1. Add a try/except around the compiler version check.
2. Retry querying the compiler version info.
3. Return False if the compiler info can't be obtained after two attempts (see the sketch below).

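A minimal sketch of that retry logic, assuming a subprocess-based version query; the actual implementation in the codebase may differ in names and details:
```
import subprocess

# Hedged sketch: query the compiler version, retrying once, and report
# "not usable" instead of raising if both attempts fail.
def compiler_is_usable(compiler: str = "g++", attempts: int = 2) -> bool:
    for _ in range(attempts):
        try:
            out = subprocess.check_output(
                [compiler, "--version"], stderr=subprocess.STDOUT, text=True
            )
            if out:
                return True
        except (subprocess.SubprocessError, OSError):
            continue
    return False

print(compiler_is_usable())
```
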
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120492
Approved by: https://github.com/ezyang
2024-02-25 02:41:20 +00:00
0f20cc1e0e Don't use size on TensorVariable when doing out resize test (#120567)
Fixes https://github.com/pytorch/pytorch/issues/120482
Fixes https://github.com/pytorch/pytorch/issues/120511

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120567
Approved by: https://github.com/Skylion007
2024-02-25 02:21:34 +00:00
54c1cf8d8a add distributed checkpoint support for custom device (#120201)
Fixes #120200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120201
Approved by: https://github.com/fegin, https://github.com/wz337
2024-02-24 19:14:29 +00:00
56203fc407 Add profiling for backward (#120540)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120540
Approved by: https://github.com/anijain2305
2024-02-24 16:53:28 +00:00
a17979faa6 [dynamo] add stronger test for dynamo memory leaks (#120459)
This issue was raised by a regression of https://github.com/pytorch/pytorch/issues/112090 caused by https://github.com/pytorch/pytorch/pull/120147.

Make the memory leak test stronger by using weakref to check for model deletion instead of measuring CUDA memory allocation.

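A hedged sketch of the weakref-based check described above; the real test in test/dynamo is more involved, and the backend and model here are arbitrary:
```
import gc
import weakref

import torch

def model_is_released_after_compile() -> bool:
    # Hold only a weak reference to the model; once the strong references are
    # dropped, the weakref should be dead unless dynamo keeps the module alive
    # (e.g. via guards or caches), which would indicate a leak.
    model = torch.nn.Linear(8, 8)
    ref = weakref.ref(model)
    opt = torch.compile(model, backend="eager")
    opt(torch.randn(2, 8))
    del model, opt
    gc.collect()
    return ref() is None
```
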
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120459
Approved by: https://github.com/jansel
2024-02-24 16:30:20 +00:00
a62d9184d5 [ET-VK] Move graph runtime from PT directory to ET directory (#120528)
Summary:
## Context

Move Vulkan graph runtime from PyTorch directory to ExecuTorch directory to improve development logistics:

* ExecuTorch delegate changes will no longer require export to PyTorch directory
* Makes it much easier to enable OSS build for Vulkan delegate

Test Plan:
```
LD_LIBRARY_PATH=/home/ssjia/fbsource/third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/executorch/backends/vulkan/test:vulkan_compute_api_test_bin

buck2 run fbcode//executorch/backends/vulkan/test:test_vulkan_delegate
```

Differential Revision: D54133350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120528
Approved by: https://github.com/manuelcandales
2024-02-24 15:00:21 +00:00
1f1bc0e6ac Switch to native functional collective by default (#120370)
This enables native functional collectives by default. After this PR:
- The Python APIs remain backward compatible. Users will receive a deprecation warning if they use `(rank, tags)` as process group identifier.
- Collectives will be captured as `_c10d_functional` ops in post-grad fx graphs. The change will not affect end-users, but it will impact `torch-xla` which has implemented an all-reduce backend based on the existing `c10d_functional` IR. This excludes the migration for `torch-xla` use cases, which will be coordinated separately (see communications in #93173).
- Collectives will be lowered to and codegen'd by new Inductor collective IRs (`ir._CollectiveKernel` and `ir._WaitKernel`). This change will not affect end-users.

Testing performed:
- We have been running a set of representative unit tests with both the new native funcol and the old py funcol in CI. These tests will continue to run with the old py funcol after this PR, so they remain covered until they are removed.
- Manually verified with e2e llama model training with DTensor + functional collectives (https://github.com/fairinternal/xlformers/tree/pt2_llm/pt2d#create-your-local-development-env).

Fallback mechanism:
- Introduced a temporary environment variable `TORCH_DISABLE_NATIVE_FUNCOL` that allows users to fall back to the previous implementation. We don't expect the migration to break anything; the mechanism is a safety measure to reduce potential disruption in case the PR causes unforeseen breakages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120370
Approved by: https://github.com/wconstab, https://github.com/yf225
2024-02-24 09:38:26 +00:00
33938cfddd [BE][Ez] Update ruff to 0.2.2 (#120517)
Updates ruff to 0.2.2. This updates the config and handles some of the new rules that have come out of preview.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120517
Approved by: https://github.com/albanD
2024-02-24 07:13:53 +00:00
79f059987e Update find_test_dir() to check for skip files relative to the local path first. (#120521)
The search code to find the dynamo skip files wasn't working properly when used with pytest and multiple files:
```
pytest a.py b.py
```
because pytest would point `__main__` at itself instead of the individual file. (This worked fine when only running a single file test)

Change the scanning code to look for the skip directory relative to its own file first.

While in there add/update some comments and log a warning when the directory wasn't found (instead of a hard crash).
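A rough sketch of the lookup order described above (the directory name and helper are assumptions, not the exact code):

```python
import logging
import os

log = logging.getLogger(__name__)

def find_test_dir():
    # Look next to this file first, so `pytest a.py b.py` still finds the
    # skip files even though __main__ points at pytest itself.
    here = os.path.dirname(os.path.abspath(__file__))
    candidate = os.path.join(here, "dynamo_expected_failures")
    if os.path.isdir(candidate):
        return candidate
    # Warn instead of hard-crashing when the directory isn't found.
    log.warning("dynamo skip directory not found next to %s", __file__)
    return None
```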

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120521
Approved by: https://github.com/oulgen
2024-02-24 03:29:25 +00:00
15add24bf2 fix: set codegen in _SplitterBase partitioner (#120361)
For graphs with a single output, the expected output type of a torch.export / torch.compile graph module is a single torch.Tensor rather than a tuple.
However, after applying the `_SplitterBase` partitioner to such a graph module (obtained from torch.export/torch.compile), the resulting graph module returns a tuple of tensors, in this case `(output,)`.

This PR adds codegen to the graphs produced by the `_SplitterBase` partitioner. Setting this ensures pytree unflatten nodes are added automatically to handle unflattening of the output, so single outputs are returned directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120361
Approved by: https://github.com/angelayi
2024-02-24 02:27:20 +00:00
3eefe96297 Update scripts/compile_tests/update_failures.py (#120529)
In order to unbreak this script, I have only tested with
```
./scripts/compile_tests/update_failures.py 97918e8c37e649dc8782bb1822ae954bca904d0f
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120529
Approved by: https://github.com/zou3519
2024-02-23 22:15:44 +00:00
b7df3bba62 add decomposition for frexp (#119217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119217
Approved by: https://github.com/peterbell10
ghstack dependencies: #119284, #120027
2024-02-23 21:52:42 +00:00
f7e79299c7 register torch.return_types in torch.fx._pytree (#120027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120027
Approved by: https://github.com/lezcano, https://github.com/zou3519, https://github.com/XuehaiPan
ghstack dependencies: #119284
2024-02-23 21:52:42 +00:00
c3496d50f0 Fix torch.return_types init signature (#119284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119284
Approved by: https://github.com/peterbell10, https://github.com/XuehaiPan
2024-02-23 21:52:34 +00:00
623632a401 More informative stacklevel for autograd function warning (#120512)
Internal xref:
https://fb.workplace.com/groups/1405155842844877/posts/8064897663537295

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120512
Approved by: https://github.com/albanD
2024-02-23 21:48:55 +00:00
4d2073bc3f [Dynamo] Remove deadcode: unwrapping script_if_tracing (#120444)
After the consolidation of ```trace_rules.lookup```, we already unwrap at
2240018c03/torch/_dynamo/variables/builder.py (L712)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120444
Approved by: https://github.com/anijain2305
2024-02-23 21:22:09 +00:00
8e20385447 [c10d] fix the macro definition of NCCL_COMM_DUMP (#120502)
Summary:
We should dump the comm state only if both macros are defined;
otherwise, use the original definition.

The previous implementation missed the function definition when IS_NCCL_EXP is defined but NCCL_COMM_DUMP is not.

Test Plan:
Build and unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120502
Approved by: https://github.com/dsjohns2, https://github.com/Skylion007
2024-02-23 20:59:39 +00:00
7cd623aa89 Remove monkey-patch for torch.utils._rebuild_tensor (#120446)
Not needed after #108186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120446
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2024-02-23 20:42:50 +00:00
ed0ea2f30b add export to torch.jit.__all__ (#120432)
I use Pyright in VS Code. When I use `@torch.jit.export`, I always see an annoying error saying `export` is not exported.

![image](https://github.com/pytorch/pytorch/assets/9496702/f7b0e17f-6497-4f9a-87dd-55dc627156c3)

Adding it to `__all__` should fix it.

I have seen #92240 and #101678, and I am not sure why `export` is not there. cc @ringohoffman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120432
Approved by: https://github.com/eellison
2024-02-23 20:37:09 +00:00
e29eb39e04 [EZ] Fix typo in gcc version detection (#120489)
It should be `FATAL_ERROR` rather than `FATAL`

I wish cmakelint would have detected it

Also, downgrade this check to 9.3, as all our binary builds are using 9.3 at the moment (will update in a followup PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120489
Approved by: https://github.com/DanilBaibak, https://github.com/Skylion007
2024-02-23 20:31:21 +00:00
007606e520 [dynamo][guards-cpp-refactor] TENSOR_MATCH guard (#120342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120342
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096
2024-02-23 20:10:09 +00:00
4b65d192f0 [dynamo][guards-cpp-refactor] DYNAMIC_INDICES guard (#120096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120096
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093
2024-02-23 20:10:09 +00:00
a92ce46dc3 [dynamo][guards-cpp-refactor] GlobalWeakRefGuardAccessor (#120093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120093
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123
2024-02-23 20:10:01 +00:00
bb331b1eb5 [dynamo][guards-cpp-refactor] LENGTH_CHECK guard (#120123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120123
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119
2024-02-23 20:09:52 +00:00
2eac593ffd [dynamo][guards-cpp-refactor] TUPLE_ITERATOR_LEN guard (#120119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120119
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091
2024-02-23 20:09:43 +00:00
da95421f05 [dynamo][guards-cpp-refactor] TupleIteratorGetItemAccessor (#120091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120091
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089
2024-02-23 20:09:34 +00:00
39f0a5ecc9 [c10d] simplify the dump timeout logic and unify the async call (#120331)
Summary:
The current dump timeout logic is a bit cumbersome as it needs two times: 1. the
timeout, 2. the wake-up time. In theory the caller just needs to wait
at most the timeout value for the dump and then declare the dump
either successful or not. We also unify the async calls using std::async
instead of a customized async launch function for each operation.
Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
2024-02-23 19:46:40 +00:00
8646872ff7 Make balance_gradient preserved in export (#120332)
Summary: We can only not-decompose CompositeImplicit functional custom ops. From the looks of the implementation, this op looks functional. So the fix is just fixing the schema.

Test Plan: CI

Differential Revision: D54019265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120332
Approved by: https://github.com/zhxchen17
2024-02-23 19:14:08 +00:00
182ed1e32c Use a dtype property in torch inductor nodes (#119227)
I usually forget to do `x.get_dtype()` and I type `x.dtype`. Similarly for `layout, device, sizes`. What do you think about making them properties?
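A minimal sketch of the idea (hypothetical class, not the actual Inductor IR node):

```python
class IRNodeSketch:
    def __init__(self, dtype):
        self._dtype = dtype

    def get_dtype(self):
        return self._dtype

    # Expose the accessor as a read-only property so `x.dtype` works
    # the same as `x.get_dtype()`.
    @property
    def dtype(self):
        return self.get_dtype()
```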

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119227
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-02-23 18:40:03 +00:00
d54121d13f Increase bazel CUDA tests timeout to 480s (#120443)
One of the bazel CUDA tests, `//:modules_test`, frequently times out in trunk, so I'm increasing the timeout value to 480s https://bazel.build/reference/test-encyclopedia to see if it helps fix the issue. Bazel CPU tests already use this value.

Here is an example timeout https://github.com/pytorch/pytorch/actions/runs/8009308009/job/21877698886#step:13:3316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120443
Approved by: https://github.com/clee2000
2024-02-23 18:32:35 +00:00
6b35415a54 Create a sentinel file for each dynamo test skips (Part 2) (#120501)
[no ci]

tested on https://github.com/pytorch/pytorch/pull/120451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120501
Approved by: https://github.com/clee2000
ghstack dependencies: #120500
2024-02-23 18:25:30 +00:00
cffdd642a9 Create a sentinel file for each dynamo test skips (Part 1) (#120500)
[no ci]

tested on https://github.com/pytorch/pytorch/pull/120451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120500
Approved by: https://github.com/clee2000
2024-02-23 18:25:30 +00:00
2120f65174 [AT-VK][EZ] Move ops to dedicated folder (#120364)
These ops are at the level of the OperatorRegistry from the previous change. All ExecuTorch ops will go here.
```
ATen/native/vulkan/graph/ops
```
They are not to be confused with the general ATen ops from `native_functions.yaml` that will continue to exist. All PyTorch ops are here.
```
ATen/native/vulkan/ops
```

To help think around this split, note that we can actually implement the latter ATen ops with the former OperatorRegistry ops, since it's currently a subset.

Differential Revision: [D54030933](https://our.internmc.facebook.com/intern/diff/D54030933/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120364
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362, #120363
2024-02-23 18:11:09 +00:00
6d920dd3c6 [ET-VK][Op Redesign][2/n] Introduce OperatorRegistry (#120363)
TSIA

Differential Revision: [D53982439](https://our.internmc.facebook.com/intern/diff/D53982439/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120363
Approved by: https://github.com/SS-JIA
ghstack dependencies: #120362
2024-02-23 18:07:59 +00:00
3e2ac1f094 [AT-VK][EZ] Define OpNode constructor (#120362)
Define an `OpNode` constructor instead of using `emplace_back()`. This will be useful throughout the rest of the stack.

Differential Revision: [D53982443](https://our.internmc.facebook.com/intern/diff/D53982443/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120362
Approved by: https://github.com/SS-JIA
2024-02-23 18:05:17 +00:00
232f09e0ea Add copy of scripts for setting up s390x workers (#120417)
This PR contains scripts used to produce self-hosted s390x worker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120417
Approved by: https://github.com/malfet
2024-02-23 17:01:44 +00:00
3b944113c8 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-23 17:01:22 +00:00
cyy
97918e8c37 [Clang-tidy header][18/N] Enable clang-tidy on headers in torch/csrc/cuda (#118504)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118504
Approved by: https://github.com/albanD
2024-02-23 16:47:33 +00:00
2892d2f31b Revert "[inductor] Optimize welford reduction (#120330)"
This reverts commit 4c6ba16f825ca7b99133efca95da0b7112add66b.

Reverted https://github.com/pytorch/pytorch/pull/120330 on behalf of https://github.com/jeffdaily due to broke ROCm CI while ROCm was in unstable status ([comment](https://github.com/pytorch/pytorch/pull/120330#issuecomment-1961623739))
2024-02-23 16:24:52 +00:00
2c85c9e77e [Memory Snapshot] Add Total memory used after allocation in Trace View (#120339)
Summary: Being able to see max allocated helps improve user experience with memory snapshots.

Test Plan:
Before:
![image](https://github.com/pytorch/pytorch/assets/17602366/534001fa-2fbe-4fc5-bd48-cd82f3277941)

After:
![image](https://github.com/pytorch/pytorch/assets/17602366/f8b9a7bc-3a34-4e38-82cb-f766e54b3fd2)

Reviewed By: zdevito

Differential Revision: D53953648

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120339
Approved by: https://github.com/zdevito
2024-02-23 16:17:14 +00:00
d9db9e62e3 Describe special case in avgpool (#120335)
Fixes #116420

AvgPool1d, AvgPool2d and AvgPool3d now include in their descriptions the special case where `ceil_mode` is True and the last window starts outside the tensor.
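For example, the effect of `ceil_mode` on the output length (illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 10)
# ceil_mode=True uses ceil instead of floor for the output length:
# (10 - 3) / 2 + 1 = 4.5 -> 5 rather than 4. Windows that would start
# entirely inside the right padding are skipped.
print(nn.AvgPool1d(kernel_size=3, stride=2, ceil_mode=True)(x).shape)   # torch.Size([1, 1, 5])
print(nn.AvgPool1d(kernel_size=3, stride=2, ceil_mode=False)(x).shape)  # torch.Size([1, 1, 4])
```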
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120335
Approved by: https://github.com/mikaylagawarecki
2024-02-23 15:29:54 +00:00
cef9f70f4b Move torchbench model configuration into a YAML file. (#120299)
This PR moves other aspects of torchbench's model configuration (e.g. batch size,
tolerance requirements, etc.) into a new YAML file: `torchbench.yaml`. It also merges the
recently added `torchbench_skip_models.yaml` file inside the `skip` key.

This is an effort so that external consumers are able to easily replicate the performance
results and coverage results from the PyTorch HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120299
Approved by: https://github.com/jansel
2024-02-23 14:00:14 +00:00
54bac042e7 Fix error in examples of torch.linalg.lu_factor (#120484)
Found an error in the doc of `torch.linalg.lu_factor` related to `torch.linalg.lu_solve`. Also fix a sphinx issue by the way.
```Python traceback
TypeError: linalg_lu_solve(): argument 'LU' (position 1) must be Tensor, not torch.return_types.linalg_lu_factor
```
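The corrected usage unpacks the named tuple before calling `lu_solve` (illustrative example):

```python
import torch

A = torch.randn(3, 3)
B = torch.randn(3, 2)
# lu_factor returns a named tuple; lu_solve expects the LU tensor and
# the pivots as separate arguments.
LU, pivots = torch.linalg.lu_factor(A)
X = torch.linalg.lu_solve(LU, pivots, B)
print(torch.allclose(A @ X, B, atol=1e-5))
```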
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120484
Approved by: https://github.com/lezcano
2024-02-23 13:19:04 +00:00
b96ea097ee [aotinductor] rename CppWrapperCodeGen and CudaWrapperCodeGen (#120391)
make WrapperCodeGen subclass names consistent with the
file names:

CppWrapperCodeGen -> CppWrapperCpu
CudaWrapperCodeGen -> CppWrapperCuda

Differential Revision: [D54074938](https://our.internmc.facebook.com/intern/diff/D54074938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120391
Approved by: https://github.com/aakhundov
2024-02-23 10:41:50 +00:00
72fec96e59 fix no shard state dict loading (#120367)
Summary: fix no shard state dict loading

Test Plan: CI tests

Differential Revision: D51058607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120367
Approved by: https://github.com/fegin
2024-02-23 07:25:43 +00:00
9e9eaf0032 [CUDA] Workaround register spilling issue in mem-efficient SDP kernels on sm60 (#120445)
We're seeing that a newer version of CUDA introduces register spilling behavior for a few kernels on Pascal---this PR works around them for this specific version.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120445
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-02-23 06:06:37 +00:00
edf1c4e552 [Dynamo] Handle guard_size_oblivious in user code (#120379)
Fixes https://github.com/pytorch/pytorch/issues/120083

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120379
Approved by: https://github.com/yanboliang
2024-02-23 05:38:57 +00:00
a5548c6886 Create a sentinel file for each dynamo test failure (#120355)
Created via
```
import os
current_dir = os.path.dirname(os.path.abspath(__file__))
directory = os.path.join(current_dir, 'dynamo_expected_failures')
for name in dynamo_expected_failures:
    path = os.path.join(directory, name)
    with open(path, 'w') as fp:
        pass
```

Differential Revision: [D54036062](https://our.internmc.facebook.com/intern/diff/D54036062)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120355
Approved by: https://github.com/aorenste, https://github.com/yanboliang
2024-02-23 05:22:11 +00:00
2240018c03 Construct c10::Half from float16_t on ARMv8 (#120425)
This is done by hiding float32 constructors and exposing float16 ones. This allows the compiler to do implicit conversions as needed, and in safe cases optimize out unneeded upcasts to fp32; see the example [below](https://godbolt.org/z/5TKnY4cos)
```cpp
#include <arm_neon.h>

#ifndef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
#error Ieeee
#endif

float16_t sum1(float16_t x, float16_t y) {
    return x + y;
}

float16_t sum2(float16_t x, float16_t y) {
    return static_cast<float>(x) + static_cast<float>(y);
}
```
both sum variants are compiled to a scalar fp16 add, if built for a platform that supports fp16 arithmetic
```
sum1(half, half):                            // @sum1(half, half)
        fadd    h0, h0, h1
        ret
sum2(half, half):                            // @sum2(half, half)
        fadd    h0, h0, h1
        ret
```

Fixes a build error after #119483 in some aarch64 configurations that are defined as supporting FP16 but don't define _Float16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120425
Approved by: https://github.com/mikekgfb, https://github.com/atalman, https://github.com/snadampal
2024-02-23 04:22:45 +00:00
eqy
3f6be7696b [cuDNN][cuDNN RNNv8 API] Fix math type behavior in cuDNN RNN (#120277)
Adds back `CUDNN_TENSOR_OP_MATH` which was erroneously dropped by #115719

CC @malfet @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120277
Approved by: https://github.com/drisspg
2024-02-23 04:11:14 +00:00
36c1cc962a Update cutlass from 3.3.0 to 3.4.1 (#120434)
### COPY OF https://github.com/pytorch/pytorch/pull/120010

### Update
I have rolled the two blocking changes into this PR. I also imported this to fbcode to verify that nothing is breaking:
D53870253

This copy was generated by merging in all the internal only changes into one merged atomic commit and re-exporting to github

### Current Status
- [PR](https://github.com/pytorch/pytorch/pull/118935) aims to update the flash attention kernels to a more recent version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120434
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-02-23 03:57:26 +00:00
cyy
f609f2050f [structural binding][6/N] Replace std::tie with structural binding (#120353)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120353
Approved by: https://github.com/albanD
2024-02-23 03:38:40 +00:00
3426c6f559 update the tensor.scatter_ doc (#120169)
Fixes #119543

- fixed the doc so that `reduce` is documented as a kwarg (see below for details)
- added another documented interface `(int dim, Tensor index, Number value, *, str reduce)`, where
the full signature in the pyi file after the build is
```
def scatter_(self, dim: _int, index: Tensor, value: Union[Number, _complex], *, reduce: str) -> Tensor:
```
This can be further verified in
02fb043522/aten/src/ATen/native/native_functions.yaml (L8014)

Therefore, the value can be int, bool, float, or complex type.

Besides the issue mentioned in #119543, `reduce` should be a kwarg, as shown below
```
 * (int dim, Tensor index, Tensor src)
 * (int dim, Tensor index, Tensor src, *, str reduce)
 * (int dim, Tensor index, Number value)
 * (int dim, Tensor index, Number value, *, str reduce)
 ```

The test case for the scalar value is already implemented in

70bc3b3be4/test/test_scatter_gather_ops.py (L86)

so no additional test case required.
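For illustration, the scalar-value overload with `reduce` passed as a kwarg might be used like this (a sketch; newer code may prefer `scatter_reduce_`):

```python
import torch

x = torch.zeros(3, 5)
index = torch.tensor([[0, 1, 2]])
# Matches the (int dim, Tensor index, Number value, *, str reduce) overload above.
x.scatter_(0, index, 2.0, reduce="add")
print(x)
```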

@mikaylagawarecki  @janeyx99

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120169
Approved by: https://github.com/mikaylagawarecki
2024-02-23 02:51:55 +00:00
bb6f50929b Fix lint after https://github.com/pytorch/pytorch/pull/105590 (#120461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120461
Approved by: https://github.com/Skylion007
2024-02-23 02:45:23 +00:00
2b0168aeb0 [c10d] update the work progress of PG periodically (#120438)
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress,
but we logged them only when a timeout was detected.

We found this is not enough, since the 'straggler' itself might not detect
the timeout, and hence there is no log from the 'straggler'.

In this PR, we log these states periodically so that it is
much easier for us to identify the straggler by checking which rank
has the smallest lastEnqueuedSeq_.
Test Plan:
Log adding, build success

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
2024-02-23 01:40:43 +00:00
8f4ffd3d8a [HigherOrderOp] makes control flow operators respect global decomp table (#120412)
A follow up of @zou3519 's comment on https://github.com/pytorch/pytorch/pull/120366. We create a helper method for this purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120412
Approved by: https://github.com/zou3519
2024-02-23 00:10:20 +00:00
156954d6a2 [Inductor] Add support for NEON ISA in the Inductor C++ backend (#105590)
Fixes #104729

As suggested in the [blog](https://dev-discuss.pytorch.org/t/torchinductor-update-5-cpu-backend-backend-performance-update-and-deep-dive-on-key-optimizations/1117#:~:text=It%20can%20be,sub%2Dclasses.), I subclassed the `VecISA` class and implemented a NEON version of the `vec_reduce_all()` function, to go along with the existing AVX2 and AVX512 versions. Any operation that calls `vec_reduce_all()` will also take the NEON path and benefit from its vectorization.

`vec_reduce_all()` is invoked by Softmax and other operations such as norms. Using the fast path results in 30% time savings for Softmax compared to the previously taken slow path.

  | Slow path | Fast path (NEON intrinsics)
-- | -- | --
Softmax (100 passes, 1024 dimension) | 623.706ms | 452.011ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105590
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-02-22 23:55:35 +00:00
4c6ba16f82 [inductor] Optimize welford reduction (#120330)
This does two things (sketched below):
1) Short circuit `welford_reduce` on the first iteration to ignore the accumulator (big win for small `rnumel`)
2) Replace division with multiplication by reciprocal

Currently this is not enough to match two pass reduction with bfloat16 but it is still a significant improvement.
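A minimal Python sketch of the two tweaks (illustrative only; Inductor emits Triton code, not this):

```python
def welford_reduce(values):
    mean, m2, count = 0.0, 0.0, 0
    for i, x in enumerate(values):
        if i == 0:
            # Short-circuit: on the first element the accumulator is bypassed.
            mean, m2, count = x, 0.0, 1
            continue
        count += 1
        rcp = 1.0 / count          # compute the reciprocal once...
        delta = x - mean
        mean += delta * rcp        # ...and multiply instead of dividing
        m2 += delta * (x - mean)
    return mean, m2, count
```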

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120330
Approved by: https://github.com/lezcano
2024-02-22 23:54:24 +00:00
722afe6171 Revert "[dynamo] Use EQUALS_MATCH guard for mod.training (#120147)"
This reverts commit b642a18e8056287b0e5768f631dd03e0326a8b11.

Reverted https://github.com/pytorch/pytorch/pull/120147 on behalf of https://github.com/williamwen42 due to memory leak, see https://github.com/pytorch/pytorch/issues/112090 ([comment](https://github.com/pytorch/pytorch/pull/120147#issuecomment-1960522018))
2024-02-22 23:46:55 +00:00
3588e7f265 Ignore .numpy() under FakeTensorMode() (#120261)
Fixes #120259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120261
Approved by: https://github.com/jansel
2024-02-22 22:49:20 +00:00
f9eb66e16d [BE][EZ] Flatten preprocessor hierarchy (#120422)
Instead of
```cpp
#if defined(foo)
#else
#if defined(bar)
#else
#endif
#endif
```
use
```cpp
#if defined(foo)
#elif defined(bar)
#else
#endif
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120422
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-22 22:38:08 +00:00
1c7ba330b2 [BE][optim] Simplify _init_group. (#120055)
This version is more concise and avoids second lookup in case `momentum_buffer` is in the `state`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120055
Approved by: https://github.com/janeyx99
2024-02-22 22:15:01 +00:00
5603d95375 [DeviceMesh] Ensure mesh tensor is a cpu tensor (#120046)
More discussion in the last comment in https://github.com/pytorch/pytorch/pull/118614

In general, users won't pass a CUDA tensor to DeviceMesh, as the mesh tensor is just a way to construct a mesh and doesn't require CUDA compute. Following @awgu's suggestion, we enforce that the tensor is a CPU tensor if it is not already, so that we can prevent a device sync.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-02-22 22:03:13 +00:00
c11bd724fe [ROCm] replace ROCmLoops.cuh with hipified CUDALoops.cuh (#120101)
The intent of this change was to minimize code differences between CUDA and ROCm while maintaining or improving performance.

Verified new performance using pytorch/benchmarks/operator_benchmark.

```
python -u -m pt.unary_test --tag-filter all --device cuda
python -u -m pt.binary_test --tag-filter all --device cuda
```

On MI200 this improved performance on average 3%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120101
Approved by: https://github.com/albanD
2024-02-22 21:57:36 +00:00
77692736d1 Use privateuseone during external module register test (#120399)
Fixes #120397

Use privateuseone instead of xpu in test_external_module_register.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120399
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-22 21:32:59 +00:00
edd03f975f highlight readme code block (#120228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120228
Approved by: https://github.com/mikaylagawarecki
2024-02-22 21:23:08 +00:00
1eae8950b9 [Dynamic] Fix dynamic shape size inspection bug (#120341)
Fixes #120198

Differential Revision: [D54035984](https://our.internmc.facebook.com/intern/diff/D54035984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120341
Approved by: https://github.com/ezyang
2024-02-22 21:08:28 +00:00
11e4a9266d Temporarily support ranks + tag as pg identifier in native funcol (#120226)
As communicated in https://github.com/pytorch/pytorch/issues/93173#issuecomment-1907095208, although we are dropping `(ranks, tag)` as group identifier in funcols, there will be a grace period for migration. This PR adds temporary `(ranks, tag)` support in native funcols. It also helps us decouple the py funcol -> native funcol transition from the API change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120226
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #120042, #120043, #120070
2024-02-22 20:24:16 +00:00
5a3e19578f Make tests using CommDebugMode work for both legacy and native funcol (#120070)
We have many tests that use CommDebugMode to verify the occurrence of collectives. These tests do so by querying comm_counts with legacy funcol ops as key. For the purpose of native funcol migration, we need these tests to work for both legacy and native funcol. To avoid the need to modify all tests to accommodate the two implementations, we make CommDebugMode translate native funcol ops into legacy funcol ops until the migration finishes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120070
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #120042, #120043
2024-02-22 20:24:15 +00:00
a4c5f48b11 Prepare test_dtensor.py for native funcol migration (#120043)
This file contains representative tests that we would like to run with both funcol impls during the migration period. Marking them as `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120043
Approved by: https://github.com/wanchaol
ghstack dependencies: #120042
2024-02-22 20:24:15 +00:00
1c9fc720ae Change the .clone() in native funcol's all_reduce to use at::MemoryFormat::Contiguous (#120042)
Summary:
While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does.

Also marking a test affected by this with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol
2024-02-22 20:24:15 +00:00
7b8f6736d1 [cond] make sure subgraphs in cond are decomposed according to current decomp table (#120366)
Fixes https://github.com/pytorch/pytorch/issues/120160. The issue is that previously cond didn't pass the global decomposition table in ProxyMode. This PR adds the current_decomposition_table to the recursive make_fx call.

Test Plan:
see added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120366
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-02-22 20:06:46 +00:00
680cfec295 Fix the default value of side in torch.searchsorted (#120066)
Fixes #119999. Currently the [doc](https://pytorch.org/docs/stable/generated/torch.searchsorted.html#torch.searchsorted) shows the default value of `side = "left"`
<img width="600" alt="Screenshot 2024-02-16 at 10 36 08 AM" src="https://github.com/pytorch/pytorch/assets/7495155/e7d159aa-4985-4f50-9d81-6e71c3116c0d">
while the [implementation ](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L11247) gives the default value of `side = c10::nullopt`.

- fix the [torch doc](https://github.com/pytorch/pytorch/blob/main/torch/_torch_docs.py#L13782) such that the default value of side is None.

- fix the [comment in cpp](4dc75f9084/aten/src/ATen/native/Bucketization.cpp (L19)) such that the default value of side is None.
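For reference, the left/right behavior being documented (illustrative snippet):

```python
import torch

sorted_seq = torch.tensor([1, 3, 5, 7])
# Leaving side unset (None) behaves like "left": the first suitable index is returned.
print(torch.searchsorted(sorted_seq, torch.tensor([3])))                # tensor([1])
print(torch.searchsorted(sorted_seq, torch.tensor([3]), side="right"))  # tensor([2])
```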

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120066
Approved by: https://github.com/malfet
2024-02-22 19:35:17 +00:00
c37d07a1bc [FSDP2] Removed super().__setattr__ call (#120340)
`nn.Module.__setattr__` does not actually call `super().__setattr__()`. If we make this call in our fast path, then we will inadvertently set the parameter as an actual attribute on the module, not just as an entry in the `_parameters` dict. This can lead to a bug where after replacing the parameters on the module (e.g. via `to_empty()` from meta device), we now have both an actual attribute (old) and a new entry in `_parameters` (new). Trying to access the parameter would give the old one since Python only resolves `__getattr__` if normal attribute lookup fails.
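A hypothetical repro of the attribute-shadowing pitfall described above (illustrative only, not FSDP2 code):

```python
import torch
import torch.nn as nn

m = nn.Linear(2, 2)
old = m.weight
object.__setattr__(m, "weight", old)                        # stale plain attribute
m._parameters["weight"] = nn.Parameter(torch.zeros(2, 2))   # new entry in _parameters
# Normal attribute lookup wins, so the stale attribute shadows the new parameter;
# nn.Module.__getattr__ (which reads _parameters) is never reached.
print(m.weight is old)  # True
```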

The bug was exercised in the following PR. I wanted to land this bug fix separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120340
Approved by: https://github.com/yifuwang
ghstack dependencies: #120231
2024-02-22 19:33:57 +00:00
2ba798df60 [inductor] decompose memory bound mm (#120047)
Summary:
Decompose memory bound mm/bmm.
Linear decomposition result:  D53502768
BMM decomposition result: D53148650
We should only decompose when
1) bmm: b is large, and m, n, k are relatively small;
2) mm/addmm: m is large, and n and k are relatively small. E.g. the mm for the input gradient in linear backward should not be decomposed, since m is small and n is large.
Need to conduct more experiments to see if we can find a better strategy for decomposition. I have tried a linear regression model (see the bento results), which does not fit well. For the short term, we use heuristics to determine when to decompose.
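A purely illustrative sketch of such a heuristic (the thresholds here are invented and are not the values used in the PR):

```python
def should_decompose(m, n, k, is_bmm=False, b=1):
    if is_bmm:
        # bmm: the batch dimension is large while m, n, k are all small.
        return b >= 256 and max(m, n, k) <= 32
    # mm/addmm: m is large while n and k are small; e.g. the input-gradient mm
    # in linear backward (small m, large n) should not be decomposed.
    return m >= 1024 and n <= 32 and k <= 32
```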

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm
```

COFFEE APS mc0:
baseline: aps-lsf-0124-bf16-267ccb7a0d
decompose: aps-lsf-0124-bf16-4e3824db40

FIRST AFOC pyper mc1

Differential Revision: D53602514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120047
Approved by: https://github.com/mengluy0125
2024-02-22 19:29:51 +00:00
ce807c17c0 modify comment of SparseTensor coalesce (#120221)
Fixes #ISSUE_NUMBER
Found that the comment for coalesce was incorrect; modify it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120221
Approved by: https://github.com/mikaylagawarecki
2024-02-22 19:24:53 +00:00
bb72bfe2ac Add code example for torch.stack() (#120304)
Fixes #120303
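
For reference, the kind of example added to the docs might look like this (illustrative, not necessarily the exact snippet from the PR):

```python
import torch

a = torch.tensor([1, 2])
b = torch.tensor([3, 4])
print(torch.stack([a, b]))         # tensor([[1, 2], [3, 4]])
print(torch.stack([a, b], dim=1))  # tensor([[1, 3], [2, 4]])
```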

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120304
Approved by: https://github.com/albanD
2024-02-22 18:30:30 +00:00
ca64f7cbb8 Fix rendering in the doc of PackedSequence (#120385)
Correct a typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120385
Approved by: https://github.com/albanD
2024-02-22 18:29:12 +00:00
a77226aa49 [inductor] improve kernel metadata logging (#120274)
Log a few more fields
- num_atomic_add: the perf of kernels using atomic_add is usually data dependent. Our benchmarking code generates all indices as 0, which will result in worse perf than in reality.
- kernel_args_num_gb: estimate the amount of reads/writes for kernel args. In-place args will be double counted. If we have a good estimate, this should be a lower bound on the memory access that the GPU performs. Sometimes the GPU will do more memory accesses, since a single buffer may be accessed multiple times (e.g. for softmax when the input tensor is quite large; the cache only helps a bit here). With this logged, and if we augment the metadata with the amount of memory the GPU actually accessed, it would be nice to dig into kernels where the GPU accesses more memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120274
Approved by: https://github.com/jansel
ghstack dependencies: #120266
2024-02-22 18:28:05 +00:00
b88621040a [profiler] Add kineto init delay when used in daemon mode (#120276)
Fixes #112389

## About

PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.
- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above needs the dynamic linking to libcupti.so to have taken place.
- I understand now that static initializations of compilation units will be called before the dynamic linking, leading to the segfault in #112389

![image](https://github.com/pytorch/pytorch/assets/6922212/29c9e79b-8080-4198-aaae-8a5696dccaec)

## Workaround
We add a delay to the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. This may not be the best solution, but it could help resolve the issue.

## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First export the daemon env variable

### Without any delay
```
>$ python3 linear_model_example.py

INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```

### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py

cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly =  1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```

### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py

INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly =  1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```

### Unit test updated as well.

## Impact
This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
2024-02-22 18:17:33 +00:00
be0ee93467 [pytree] support X | Y union type in tree_map_only (#120389)
Follow-up PR for #119974 with some small tweaks.

1. Support `X | Y` union type for Python 3.10+
2. Enable predicate function in `tree_map_only` in CXX pytree.
3. Remove unnecessary function definition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120389
Approved by: https://github.com/zou3519
2024-02-22 18:17:13 +00:00
65627cfd6a [dtensor] implement scaled dot product attention (flash-attention) (#120298)
As titled, this PR implements the SDPA flash attention op in DTensor.

Flash attention is added first, but efficient attention and other attention
ops should be similar.

fixes https://github.com/pytorch/pytorch/issues/120333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120298
Approved by: https://github.com/XilunWu
ghstack dependencies: #120297
2024-02-22 17:53:47 +00:00
f2452e98a6 Revert "Native Half on ARM (#119483)"
This reverts commit 8f3fd79b23d483e846537b62f49111696d117870.

Reverted https://github.com/pytorch/pytorch/pull/119483 on behalf of https://github.com/malfet due to Broke nightly arm builds (and will be breaking runtime), as F16 arithmetic is ARMv8.2 only, see https://github.com/pytorch/pytorch/actions/runs/8000968963/job/21851281141 ([comment](https://github.com/pytorch/pytorch/pull/119483#issuecomment-1959944948))
2024-02-22 17:41:55 +00:00
c7328602ed [ROCm] enable tests test_sampled_addmm_autograd_cuda_*, test_sample… (#117501)
These tests PASS on ROCM 5.6+ now:

- test_sampled_addmm_autograd_cuda_complex128
- test_sampled_addmm_autograd_cuda_complex64
- test_sampled_addmm_autograd_cuda_float32
- test_sampled_addmm_autograd_cuda_float64
- test_sampled_addmm_cuda_complex128
- test_sampled_addmm_cuda_complex64
- test_sampled_addmm_cuda_float32
- test_sampled_addmm_cuda_float64
- test_autograd_dense_output_addmm_cuda_float64
- test_autograd_dense_output_addmv_cuda_float64
- test_autograd_dense_output_mv_cuda_float64

@pruthvistony @jithunnair-amd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117501
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-02-22 17:24:25 +00:00
1c1028ac49 [DCP] Adds utility for converting torch save to dcp (#119815)
as title

Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815
Approved by: https://github.com/fegin
ghstack dependencies: #119813, #119814
2024-02-22 17:22:11 +00:00
aae7ccd2d5 [FSDP2] disable compile in broken unit tests (#120358)
The following unit tests are broken in the original commit; revert to keep trunk healthy. Will add them back after figuring out the root cause.
```
python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_param_registration
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120358
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-02-22 17:17:23 +00:00
1ab441a7dd [DCP] Adds utility for converting dcp to torch save format (#119814)
as title

Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814
Approved by: https://github.com/fegin
ghstack dependencies: #119813
2024-02-22 16:55:58 +00:00
e0a7b024b0 [ROCm] Skip test_parity* unit tests in test_foreach only if ROCm version < 6.0 (#117301)
Skip test_parity* unit tests in test_foreach.py on ROCm only if ROCm version < 6.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117301
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-02-22 16:21:09 +00:00
de60050801 [inductor] Colorization improvements for bandwidth profiler (#120343)
A couple things:
* Don't colorize output to the log file
* Don't repeatedly warn if colorama isn't installed

Differential Revision: [D54027075](https://our.internmc.facebook.com/intern/diff/D54027075/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120343
Approved by: https://github.com/Chillee, https://github.com/shunting314
2024-02-22 15:25:46 +00:00
03f7235caa [Dynamo] Fix dynamo trace rules (#120371)
```test_trace_rules.py``` is still failing due to this.

Fixes https://github.com/pytorch/pytorch/issues/114831
(Having this here will run the disabled test on the PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120371
Approved by: https://github.com/drisspg, https://github.com/huydhn
2024-02-22 14:32:00 +00:00
0e4bd25a33 [inductor] When generating debug logs don't fail if nvcc not found (#120346)
If nvcc isn't found, subprocess throws a CalledProcessError

Differential Revision: [D54028438](https://our.internmc.facebook.com/intern/diff/D54028438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120346
Approved by: https://github.com/Skylion007, https://github.com/shunting314
2024-02-22 14:25:34 +00:00
c2b2e57032 Intel GPU Runtime Upstreaming for Guard (#118523)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the 5th runtime component we would like to upstream is `Guard`. We will cover device guard and stream guard in this PR.

# Design
Device guard is used mainly for op dispatcher in PyTorch. Currently, PyTorch already has a device guard abstraction `c10::impl::DeviceGuardImplInterface`. In our design, we will introduce an `XPUGuardImpl` class inherits from `c10::impl::DeviceGuardImplInterface`. Register `XPUGuardImpl` to PyTorch after we implement the device switch management mechanism in `XPUGuardImpl`. Besides, we will introduce `XPUGuard`, `OptionalXPUGuard`, `XPUStreamGuard`, and `OptionalXPUStreamGuard`. They are all following the design of CUDA's counterpart. The corresponding C++ file should be placed in c10/xpu/ folder.

# Additional Context
It is unnecessary to add `Guard` code to PyTorch frontend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118523
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #120315
2024-02-22 14:07:21 +00:00
dcfe463600 fix xpu build failure (#120315)
# Motivation
Fix the build failure introduced by [[DeviceIndex][6/N] Use DeviceIndex in more places](https://github.com/pytorch/pytorch/pull/120133); the parameter `total` is undefined at line 100. See https://github.com/pytorch/pytorch/pull/120133/files#diff-00eb8a6f5dfbc341ee9ab9aff0e3dbece8ad73483d4f41a005b1f453cb78221cR91-L102
[PR120133](https://github.com/pytorch/pytorch/pull/120133) forgot to add the label `ciflow/xpu`, so the XPU CI flow was not triggered.

# Solution
Refer to [Why is std::cout not printing the correct value for my int8_t number?](https://stackoverflow.com/questions/7587782): statically cast the int8_t to int16_t, and the condition `device >= 0 && device < total` is enough.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120315
Approved by: https://github.com/Skylion007, https://github.com/cyyever, https://github.com/malfet, https://github.com/EikanWang, https://github.com/gujinghui
2024-02-22 13:43:56 +00:00
faad8ecb26 Use opmath for sinc on CPU (#120311)
This aligns the implementation with CUDA and `torch.compile`

Fixes https://github.com/pytorch/pytorch/issues/118176 https://github.com/pytorch/pytorch/issues/49133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120311
Approved by: https://github.com/jgong5, https://github.com/Chillee
2024-02-22 12:37:50 +00:00
5c5b71b6ee Capture non tensor arguments in record_function (#120017)
Summary: RECORD_FUNCTION only captures an argument when it is a Tensor. However, it is very common for users to pass arguments with primitive data types (int, float, index, bool). This diff adds support for non-tensor arguments in RECORD_FUNCTION.

Test Plan:
unit test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture

Differential Revision: D53674768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017
Approved by: https://github.com/soulitzer
2024-02-22 09:40:08 +00:00
7e6bce9684 [amd] fix unused variable device_flags (#120369)
Summary:
We get a build error due to D53986297 (https://github.com/pytorch/pytorch/pull/119996)

```
caffe2/c10/cuda/__fb_c10_hipify_gen__/out/c10/hip/HIPStream.cpp:40:23: error: unused variable 'device_flags' [-Werror,-Wunused-variable]
static c10::once_flag device_flags[C10_COMPILE_TIME_MAX_GPUS];
```

Reviewed By: jianyuh, xw285cornell

Differential Revision: D54027737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120369
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-02-22 09:36:59 +00:00
5210a22b39 Add basic shampoo test (#120293)
Fixes [T175418669](https://www.internalfb.com/intern/tasks/?t=175418669)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120293
Approved by: https://github.com/bdhirsh
2024-02-22 08:39:55 +00:00
354a436d96 Remove device assert in Gradscaler (#119362)
Fixes #119358

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Co-authored-by: ydwu4 <ydwu2014@gmail.com>
Co-authored-by: PyTorch UpdateBot <pytorchupdatebot@users.noreply.github.com>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Shuqiang Zhang <sqzhang@meta.com>
Co-authored-by: Adnan Akhundov <aakhundov@meta.com>
Co-authored-by: Ting Lu <tingl@nvidia.com>
Co-authored-by: Yang Chen <yangche@fb.com>
Co-authored-by: cyy <cyyever@outlook.com>
Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Jason Ansel <jansel@meta.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Co-authored-by: wz337 <wz337@cornell.edu>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Co-authored-by: Anthony Alayo <anthony.alayo@applovin.com>
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Co-authored-by: Yifu Wang <yifu@fb.com>
Co-authored-by: Yukio Siraichi <yukio.siraichi@gmail.com>
Co-authored-by: atalman <atalman@fb.com>
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: haozhe.zhu <haozhe.zhu@intel.com>
Co-authored-by: lezcano <lezcano-93@hotmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119362
Approved by: https://github.com/ezyang
2024-02-22 08:02:18 +00:00
fff9d98e58 Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)"
This reverts commit e0268821dd2ea0e8a51b81c0ef3b18e77f68a33d.

Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the Window failures are legit as they are failing now in trunk, i.e. 450339ab2d ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1958428416))
2024-02-22 00:12:54 +00:00
8fa6340701 Revert "Ignore .numpy() under FakeTensorMode() (#120261)"
This reverts commit 952b37145b7bb526ea5907ac574e324d274b02ee.

Reverted https://github.com/pytorch/pytorch/pull/120261 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems breaking trunk on Python 3.12 952b37145b ([comment](https://github.com/pytorch/pytorch/pull/120261#issuecomment-1958267417))
2024-02-21 23:09:27 +00:00
cyy
1aad5c98b4 [structural binding][5/N] Replace std::tie with structural binding (#120142)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120142
Approved by: https://github.com/albanD
2024-02-21 22:32:55 +00:00
d514df63ea Reenable triton tests and clean extra clones after the pin update (#120324)
Test Plan: just tests

Differential Revision: D54008642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120324
Approved by: https://github.com/aakhundov, https://github.com/sijiac
2024-02-21 22:25:33 +00:00
952b37145b Ignore .numpy() under FakeTensorMode() (#120261)
Fixes #120259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120261
Approved by: https://github.com/jansel
2024-02-21 22:06:29 +00:00
450339ab2d Test for fatal signal in test_pynode_destruction_deadlock (#120279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120279
Approved by: https://github.com/albanD
2024-02-21 21:53:51 +00:00
306642b66d [export] fix test_passes on ci (#120322)
We put the test case generation in unittest.setUp to avoid running export on machines that run Python 3.12, where dynamo is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120322
Approved by: https://github.com/angelayi, https://github.com/huydhn, https://github.com/malfet
2024-02-21 21:23:40 +00:00
e0268821dd Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639)
Fixes #115331.

This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:

- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`

[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-02-21 21:10:49 +00:00
27c5bbe5cb Add is_nested_int() (#119975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119975
Approved by: https://github.com/jbschlosser
ghstack dependencies: #119661, #119974
2024-02-21 21:10:02 +00:00
2e77629b9f [pytrees] Allow tree_map_only to support predicate function as filter (#119974)
In many places in the code we use `tree_map_only((SymInt, SymBool, SymFloat), foo)` but with nested ints, it is possible to have SymInts that are non-symbolic, so we may want to do something like `tree_map_only(is_symbolic, foo)` instead.

Alternative: wrap nested int SymNodes with something other than SymInt.
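
A short sketch of the predicate-style filter (assuming the `torch.utils._pytree` API after this change):

```python
from torch.utils._pytree import tree_map_only

tree = {"a": 1, "b": "keep", "c": [2, 3]}
# The first argument is a callable deciding which leaves get mapped.
out = tree_map_only(lambda leaf: isinstance(leaf, int), lambda leaf: leaf * 10, tree)
print(out)  # {'a': 10, 'b': 'keep', 'c': [20, 30]}
```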

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119974
Approved by: https://github.com/zou3519
ghstack dependencies: #119661
2024-02-21 21:10:02 +00:00
722e87899a [Memory Snapshot] Clean up elem text (#120245)
Summary:
These UI changes were added:
- Prefix address with Addr: and size with Size:
- Add comma between addr and size
- Remove duplicate (${elem.size} bytes) print out

Test Plan:
Before:
![image](https://github.com/pytorch/pytorch/assets/17602366/2d9867d6-9cdb-405b-aa92-f0daf44f2ba7)
After:
![image](https://github.com/pytorch/pytorch/assets/17602366/c7bd97d3-fdc6-4832-ae35-97a02ea73907)

Differential Revision: D53953187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120245
Approved by: https://github.com/zdevito
2024-02-21 20:59:04 +00:00
a5893926f2 [dtensor] simplify outputs wrapping handling (#120297)
This PR simplifies the output wrapping handling in op dispatch, to make
it simpler and easier to understand.

It also enables a new case: if the output DTensorSpec for the result is
None and the result is a scalar tensor, we just return the scalar
tensor instead of wrapping it with a DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120297
Approved by: https://github.com/wz337
2024-02-21 20:28:20 +00:00
e06978be4b [CI] Add initial inductor cpu smoketest for performance (#116456)
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116456
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-02-21 20:04:50 +00:00
9630bcbd49 [execution trace/chakra] remove backend_id from pg_info (#120038)
Summary:
PR 104373 (https://github.com/pytorch/pytorch/pull/104373) logs the backend, which involves an unsafe dict lookup that might fail.
We decided to deprecate backend_id and use the pg id/name directly.

Differential Revision: D53676181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120038
Approved by: https://github.com/aaronenyeshi
2024-02-21 19:37:18 +00:00
e7eab2f07e Fix to keep stride in return_and_correct_aliasing() (#117860)
Fixes #117794

Fix tripped the assert here: 86dedebeaf/torch/utils/_python_dispatch.py (L216)

From investigation: I found that functionalization of an in-place op (`mul_` in this test case) results in the strides of `TwoTensor`'s `a` / `b` components being mutated to be contiguous. This is not reflected in the outer tensor, causing the assert to be tripped.

After discussion with Brian, I address this in this PR by disallowing input mutations on non-contiguous tensor subclass inputs for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117860
Approved by: https://github.com/bdhirsh
2024-02-21 19:15:27 +00:00
fa77829126 Remove bc linter label triggers after test-infra #4956 (#120148)
After https://github.com/pytorch/test-infra/pull/4956, mergebot will not block merge for a bc linter failure that has been suppressed.  The failure will be ignored instead.

This should help mitigate https://github.com/pytorch/test-infra/issues/4938 because the workflow will not be triggered multiple times when labels are attached.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120148
Approved by: https://github.com/clee2000
2024-02-21 18:36:38 +00:00
e87deb8004 fix: conversion of max memory allocated and reserved to GB (#120172)
Fixes #120171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120172
Approved by: https://github.com/soulitzer, https://github.com/aaronenyeshi
2024-02-21 18:04:47 +00:00
d336be2942 Update torch.mean() description about dtype restriction. (#120208)
Fixes #120173

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120208
Approved by: https://github.com/soulitzer
2024-02-21 18:04:11 +00:00
9c64068ef8 [dynamo][guards-cpp-refactor] TypeGuardAccessor (#120089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120089
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068
2024-02-21 17:56:48 +00:00
ec6783990a [dynamo][guards-cpp-refactor] GlobalsGuardAccessor (#120068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120068
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067
2024-02-21 17:56:48 +00:00
66c52d678f [dynamo][guards-cpp-refactor] GetItemGuardAccessor (#120067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120067
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065
2024-02-21 17:56:36 +00:00
7a0c2a9d0a [dynamo][guards-cpp-refactor] NO_TENSOR_ALIASING guard (#120065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120065
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064
2024-02-21 17:56:18 +00:00
8d5ae8c0b3 [dynamo][guards-cpp-refactor] TENSOR_ALIASING guard (#120064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120064
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062
2024-02-21 17:56:05 +00:00
034955b2fc [dynamo][guards-cpp-refactor] DATA_PTR_MATCH guard (#120062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120062
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060, #120061
2024-02-21 17:55:46 +00:00
cc6cf89c30 [dynamo][guards-cpp-refactor] GLOBAL_STATE guard (#120061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120061
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833, #120060
2024-02-21 17:55:32 +00:00
5066bec743 [dynamo][guards-cpp-refactor] DEFAULT_DEVICE guard (#120060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120060
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827, #119833
2024-02-21 17:55:17 +00:00
8f3fd79b23 Native Half on ARM (#119483)
Summary: Native Half on ARM

Test Plan: sandcastle

Differential Revision: D53585776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119483
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-02-21 17:46:16 +00:00
29b2131c62 [Inductor] Fix bug around out of order constexprs in inductor (#120287)
Inductor signature/config generation code assumes that all constexprs come as the last arguments of the function. This is not always true for user-defined kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120287
Approved by: https://github.com/jansel
2024-02-21 17:39:41 +00:00
cfddfce0d3 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible.  Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
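For concreteness, a small sketch of the arithmetic in the example above (the numbers are the assumed ones from the example, not taken from the actual sharding code):

```python
import math

serial_minutes = 8.0
parallel_minutes = 20.0
procs_per_machine = 2
machines = 6

effective_total = serial_minutes + parallel_minutes / procs_per_machine  # 8 + 10 = 18
per_machine = effective_total / machines                                 # 3 minutes per shard
serial_shards = math.ceil(serial_minutes / per_machine)                  # serial tests fit on 3 shards

print(effective_total, per_machine, serial_shards)  # 18.0 3.0 3
```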

Move serial tests to run first

If I want to move to a purely numbers-based sharding, this ensures that parallel tests run alongside parallel tests as much as possible instead of interleaving serial + parallel tests, which would decrease the effectiveness of parallelization, while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-21 16:40:27 +00:00
a24cba35b0 [c10d][flight recorder] dump additinal NCCL debug info (#120063)
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it as a picklable object. It also adds functions
that should be compiled only when NCCL_COMM_DUMP is defined.
Test Plan:
Integration tests with NCCL will be done later; here we only test the
c10d side of the dump, aka NCCLTraceTest.

Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of _dump_nccl_trace(): we manually feed
dump_nccl_trace with a map of test info, assert on the pickled result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$  python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
2024-02-21 16:35:23 +00:00
06bc203c7b Update dynamo_test_failures list (#120271)
This PR removes and adds some failures and successes that were hidden in the past week (ish).

https://github.com/pytorch/pytorch/pull/119408 (47182a8f4b5e36e280ca3595ba134f53499d2dc9) accidentally removed environment variables on rerun (see PR body of https://github.com/pytorch/pytorch/pull/120251 for slightly more details).

Enabling testing with dynamo is set using an env var, so if a test failed with dynamo, it would rerun without the dynamo env var set, making it pass on retry.  Normally, the flaky test bot would catch this and make an issue for the test, but the CI env var controls whether or not xml test reports get made, and that also got removed on rerun, so the xmls weren't made either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120271
Approved by: https://github.com/DanilBaibak, https://github.com/zou3519
2024-02-21 16:34:34 +00:00
9199468401 Properly trace into mark_static (#120232)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120232
Approved by: https://github.com/yanboliang
2024-02-21 13:51:31 +00:00
d38a3627a5 Support privateUser1 key in RNN op. (#118182) (#118351)
Support privateUser1 key in RNN op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118351
Approved by: https://github.com/bdhirsh
2024-02-21 13:51:27 +00:00
eae025b1d7 Fix bug with block pointer multi dim args (#120263)
Summary:
Now we can parse statements like
```
%22 = tt.make_tensor_ptr %20, [%21, %c128_i64], [%c2048_i64, %c1_i64], [%1, %c0_i32]
```

Test Plan:
Added new test

```
buck2 test mode/opt //hammer/ops/tests/inductor:ragged_hstu_test
```
now passes again with optimizations

Differential Revision: D53975130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120263
Approved by: https://github.com/aakhundov, https://github.com/sijiac
2024-02-21 09:06:20 +00:00
cyy
3cd6a21e8f [DeviceIndex][6/N] Use DeviceIndex in more places (#120133)
This PR follows the series of patches beginning with #119142 and fixes various XPU and python related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120133
Approved by: https://github.com/Skylion007
2024-02-21 06:24:23 +00:00
cyy
d5d13ab15e Remove C10_FALLTHROUGH (#120157)
Since [[fallthrough]] is supported in our C++17 compilers and no other repo is using it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120157
Approved by: https://github.com/Skylion007
2024-02-21 06:18:58 +00:00
d6801578c3 Update tracing rules for new cudnn functions (#120268)
# Summary
This updates the trace_rules with the new cudnn torch functions for sdpa

To repro:
`pytest test/dynamo/test_trace_rules.py -k test_torch_name_rule_map_updated`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120268
Approved by: https://github.com/shuqiangzhang, https://github.com/huydhn, https://github.com/yanboliang
2024-02-21 05:22:44 +00:00
65519d183b Remove old optimizer tests (#120257)
Removes old tests now that all configs are covered in test_compiled_optimizers.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120257
Approved by: https://github.com/eellison
2024-02-21 05:11:23 +00:00
b4cef25a1e add register_device_op_overrides (#119268)
Fixes #119267

Currently https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/common.py#L106 only supports built-in device functions; I'm going to add a register function to get the overrides class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119268
Approved by: https://github.com/jansel
2024-02-21 04:53:07 +00:00
3993771617 Expose recordSize in ChunkRecordIterator (#120239)
Summary: Add a public method to read recordSize in ChunkRecordIterator

Test Plan: ci

Differential Revision: D53931944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120239
Approved by: https://github.com/zoranzhao
2024-02-21 04:33:03 +00:00
26610175d2 pass device_str for async_compile.triton function (#120202)
Fixes #120203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120202
Approved by: https://github.com/jansel
2024-02-21 03:48:57 +00:00
800e9acd43 [inductor] fix bandwidth estimation for StarDep (#120266)
A lot of HF models fail when inductor_config.benchmark_kernel is enabled. The reason is that the bandwidth estimation code assumes every dependency has an index, but StarDep does not. An exception is raised when StarDep.index is accessed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120266
Approved by: https://github.com/eellison, https://github.com/jansel
2024-02-21 03:33:45 +00:00
20f7e5a719 Remove dependency of triton during inductor codegen (#120193)
Fixes #120192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120193
Approved by: https://github.com/jansel
2024-02-21 01:09:48 +00:00
dd6b5e236e Prepare test_inductor_collectives.py for native funcol migration (#120025)
There are some tests in this file that are impl specific, e.g. verifying generated code via `FileCheck`. These tests are covered for native funcol in test_c10d_functional_native.py, therefore marking them with `@run_with_legacy_funcol`.

Other tests are marked with `@run_with_both_funcol_impls`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120025
Approved by: https://github.com/wanchaol
ghstack dependencies: #119982
2024-02-21 00:46:25 +00:00
af765dbdfd [ez] Explicit env for run_test (#120251)
env=None (which is the default) inherits the env from the calling process.  Explicitly set the env to the calling process env so that things can be added to it later

Tested in: e7b4d8ec88
Checked that test-reports (which depend on the CI env var) get made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120251
Approved by: https://github.com/huydhn
2024-02-21 00:40:19 +00:00
a1fc29cd78 Revert "[pytree] add function tree_iter (#120155)"
This reverts commit 372d078f361e726bb4ac0884ac334b04c58179ef.

Reverted https://github.com/pytorch/pytorch/pull/120155 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/120155#issuecomment-1955479765))
2024-02-21 00:21:28 +00:00
701f651f9c Change the parameter type from int to float in torch.nn.Softplus (#120183)
Fixes #120175

1. The C++ API uses double:
f2cf0768d1/torch/csrc/api/include/torch/nn/options/activation.h (L501).

2. The type is also double in the test case:
f2cf0768d1/test/cpp/api/functional.cpp (L1788)

3. Passing a float parameter in Python works perfectly fine:
```
m = nn.Softplus(beta=0.1,threshold=1.2)
input = torch.randn(2)
output = m(input)

print(output)
tensor([7.3749, 7.6852])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120183
Approved by: https://github.com/mikaylagawarecki
2024-02-21 00:14:38 +00:00
35891e5007 Explicitly set nn.Module.set_extra_state return type to None (#120161)
Implicitly, the return type of `set_extra_state` is `NoReturn` since it always raises an error, and pyright will complain about mismatched return types if you override it with an implementation that doesn't also always raise an error. If we explicitly hint the return type as `None` (how we expect people to override it), we can avoid this error message.

```
Method "set_extra_state" overrides class "Module" in an incompatible manner
    Return type mismatch: base method returns type "NoReturn", override returns type "None"
```
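A minimal sketch of the override pattern this typing change is meant to support (the module and its extra state here are made up for illustration):

```python
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = 1.0

    def get_extra_state(self) -> dict:
        return {"scale": self.scale}

    def set_extra_state(self, state: dict) -> None:
        # With the base method annotated as -> None instead of -> NoReturn,
        # pyright no longer flags this override as incompatible.
        self.scale = state["scale"]

m = MyModule()
m.load_state_dict(m.state_dict())  # round-trips the extra state
```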
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120161
Approved by: https://github.com/mikaylagawarecki
2024-02-20 23:57:36 +00:00
e54c4e8659 [aot_autograd] handle subclass input mutations correctly in collect_metadata_analysis.py (#120136)
This PR fixes the issue in https://github.com/pytorch/pytorch/issues/120188.

In collect_metadata_analysis.py, handling of input/output mutations was different from handling in other locations. In other locations, MUTATED_OUT_GRAPH was used to indicate that mutation would require returning an output; in collect_metadata_analysis.py, any type of mutation was being handled as if it would require returning an output.

This PR changes collect_metadata_analysis to match other callsites and refactors computation of mutation types so that it is a property of the dataclass instead of something that needs to be computed manually when constructing an InputAliasInfo.

Differential Revision: [D53950998](https://our.internmc.facebook.com/intern/diff/D53950998)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #120141
2024-02-20 23:30:57 +00:00
b36404159d [aot_autograd] support inplace mutations for subclasses (#120141)
This PR removes the conditional logic depending on requires_subclass_dispatch for mutation handling.

Inputs are labeled with one of three labels: NOT_MUTATED, MUTATED_IN_GRAPH, or MUTATED_OUT_GRAPH. MUTATED_IN_GRAPH indicates mutation that is allowed in the aot autograd graph; MUTATED_OUT_GRAPH indicates mutation that is not allowed in the graph, so the result is computed, returned, and then assigned back to the input after the graph.

Previously, there was logic to handle subclasses differently, so that MUTATED_IN_GRAPH + subclasses would behave like MUTATED_OUT_GRAPH.

This PR simplifies aot_autograd's handling of mutations so that MUTATED_IN_GRAPH will always be handled in graph, even when subclasses are present. Note that there are still some cases where subclass support won't be handled correctly.
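A conceptual sketch (with invented names, not aot_autograd's real code) of the three labels and of how a MUTATED_OUT_GRAPH input gets its mutation applied after the graph runs:

```python
from enum import Enum, auto
import torch

class MutationType(Enum):
    NOT_MUTATED = auto()
    MUTATED_IN_GRAPH = auto()    # the mutation is kept inside the compiled graph
    MUTATED_OUT_GRAPH = auto()   # the graph returns the new value; we copy it back

def run_compiled(graph, inputs, mutation_types):
    outputs, updated_inputs = graph(*inputs)
    for inp, new_val, kind in zip(inputs, updated_inputs, mutation_types):
        if kind is MutationType.MUTATED_OUT_GRAPH:
            inp.copy_(new_val)   # replay the mutation outside the graph
    return outputs

# Toy "functionalized" graph for an add_: it returns the would-be mutated value.
def graph(x):
    return (x + 1).sum(), [x + 1]

x = torch.zeros(3)
run_compiled(graph, [x], [MutationType.MUTATED_OUT_GRAPH])
print(x)  # tensor([1., 1., 1.])
```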

Differential Revision: [D53950999](https://our.internmc.facebook.com/intern/diff/D53950999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120141
Approved by: https://github.com/bdhirsh
2024-02-20 23:30:57 +00:00
96092e1f55 Extend aot_graph_input_parser to sym shapes (#120246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120246
Approved by: https://github.com/shunting314
2024-02-20 23:24:45 +00:00
7acdd08fcc [FSDP2] Used stream APIs for CUDA event handling (#120231)
If we already have Python `Stream` objects, then calling `stream1.wait_stream(stream2)` is syntactic sugar for creating an `event: Event`, recording it in `stream2`, and calling `stream1.wait_event(event)`.

~~Getting a Python `Stream` object incurs some CPU overhead, so we prefer to not change other callsites where we do not already have the `Stream` objects.~~
Update: Calling `event.record()` with no stream specified calls `torch.cuda.current_stream()`, so the overhead should be identical.
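A minimal sketch of the equivalence described above (assumes a CUDA device is available):

```python
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Sugar form:
s1.wait_stream(s2)

# What it is shorthand for:
event = torch.cuda.Event()
event.record(s2)      # record() with no argument would use the current stream
s1.wait_event(event)
```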

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120231
Approved by: https://github.com/yifuwang
ghstack dependencies: #118298, #119985
2024-02-20 21:35:46 +00:00
dfb83df889 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit 47182a8f4b5e36e280ca3595ba134f53499d2dc9.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/clee2000 due to iirc the default setting of env to None causes it to inherit the env of the calling process, I'll make a PR that makes it so that the old env vars don't disappear, and then re merge this on top of it.  Reverting this because I think some important env vars are disappearing (specifically CI) ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1955128676))
2024-02-20 21:28:13 +00:00
2d6c0cc81b Run test_functional_api.py with both legacy and native funcol impls (#119982)
Additional changes: tests in test_functional_api.py use a multi-threaded pg which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
2024-02-20 21:15:37 +00:00
d42ede8ae4 [torch.compile] Log compilation start time for timeline view (#120220)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120220
Approved by: https://github.com/angelayi
2024-02-20 21:07:40 +00:00
be8ba5ef2d Revert "use two pass reduction for deterministic reduction order (#11… (#120243)
This reverts commit cc7ef43423fe36cf1778a9c9643454d62050a5b5.

Manual revert because of the conflict in: test/inductor/test_cpu_repro.py , conflict with this PR: https://github.com/pytorch/pytorch/pull/118365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120243
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-02-20 20:50:29 +00:00
4f0f25b7ce [Inductor][bugFix] fix a bug in merge_splits (#119956)
Summary: RecGPT hit a KeyError when running split_cat, caused by an unhandled corner case.

Test Plan: P1184947021

Differential Revision: D53791839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119956
Approved by: https://github.com/jackiexu1992
2024-02-20 20:38:34 +00:00
957f37686a Refactor instance_descriptor for new triton version (#119636)
Check https://github.com/pytorch/pytorch/pull/119457#issuecomment-1936764161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119636
Approved by: https://github.com/shunting314
2024-02-20 20:26:35 +00:00
8464654ae4 Add missing words to torch.utils.checkpoint doc (#120196)
This PR adds a couple of missing words to the Checkpointing documentation; it doesn't have a specific issue number related to it.

Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
2024-02-20 20:18:42 +00:00
b33e8d3f6b [Inductor][fx pass] Add split cat pattern to remove cat nodes (#115004)
Summary: Titled

Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/8e4179db-363a-41b5-8bd7-cc445a512f6f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15762598708548039
Network: Up: 91KiB  Down: 32KiB  (reSessionID-b0985d82-1919-49c5-b307-ee0ab49b4738)
Jobs completed: 28. Time elapsed: 1:27.1s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 11. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce (IG_CTR)
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
P895047189

Differential Revision: D51777617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115004
Approved by: https://github.com/jackiexu1992
2024-02-20 19:35:20 +00:00
cccacf6c8e add a test that non_overlapping checks don't generate too many guards (#120106)
Pre-emptive test in OSS to ensure that models relying on the "non-overlapping guards" checks do not suffer drastically w.r.t. guard slowness. Current plan is to follow up on this with a "real" fix, to generate a linear number of these guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120106
Approved by: https://github.com/mlazos
2024-02-20 18:38:59 +00:00
6d82a7e9b0 Add pixel_shuffle to core aten decomps (#120092)
Summary:
https://github.com/pytorch/pytorch/pull/118239 added a decomposition
for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We
have also fixed the internal use case so that it no longer special cases on
pixel_shuffle, allowing us to revert the changes in
https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53860966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120092
Approved by: https://github.com/ydwu4
2024-02-20 18:37:32 +00:00
53bfae2c06 [MPS] Add torch.fft. support (#119670)
Increase tolerance for `fft` ops; this warrants further investigation, as the error grows with larger matrix dimensions (see https://github.com/pytorch/pytorch/issues/120237 )

When compiling on MacOS13, implement `+[FakeMPSGraphFFTDescriptor descriptor]` as a redispatch to a real thing.

Fixes https://github.com/pytorch/pytorch/issues/78044
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119670
Approved by: https://github.com/kulinseth, https://github.com/albanD
2024-02-20 18:23:06 +00:00
5f3f8fd3c7 [Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.

Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR unifies the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.
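A hedged sketch of how an external tool might attach these hooks; the import path and the assumption that each hook receives a single metadata argument are version-dependent and not guaranteed by this PR:

```python
import time
from triton.compiler import CompiledKernel  # assumed import path; varies by Triton version

_starts = []

def enter_hook(metadata):
    # metadata is assumed to describe the launching kernel (name, grid, ...)
    _starts.append(time.perf_counter())

def exit_hook(metadata):
    elapsed = time.perf_counter() - _starts.pop()
    print(f"kernel took {elapsed * 1e6:.1f} us")

CompiledKernel.launch_enter_hook = enter_hook
CompiledKernel.launch_exit_hook = exit_hook
```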

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/soulitzer
2024-02-20 16:58:20 +00:00
d3839b624b [ROCm] HIP Lazy Streams (#119996)
For ROCm/HIP, each stream is lazily initialized rather than creating all streams when the first stream is requested. HIP streams are not as lightweight as CUDA streams; the pooling strategy can affect performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119996
Approved by: https://github.com/ezyang
2024-02-20 16:24:04 +00:00
26fbbc3e84 DTensor + dynamo: fix is_shard/replicate always inlining to False (#118668)
Fixes an internal enablement bug. When dynamo traces `is_sharded`/`is_replicate`, it would unconditionally assume the result was False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118668
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191, #118667
2024-02-20 15:23:48 +00:00
609cde94f9 DTensor: use memory_format in the hash for all aten ops that use that arg (e.g. aten.clone) (#118667)
This fixes an internal DTensor enablement bug (I don't have an OSS issue for it)

I finally root-caused this as follows:

(1) we were fakefying a DTensor graph input, that was an autograd non-leaf (it had a grad_fn)

(2) that caused it to go through this `clone()` call during fakeification: https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/meta_utils.py#L549

(3) `clone(torch.preserve_format)` is supposed to return another DTensor with the same strides as the input, but I noticed we were returning a DTensor with contiguous strides incorrectly.

(4) It turns out that DTensor was hashing on the sharding strategy for `aten.clone`, regardless of the `memory_format` kwarg that was passed in.

I could have manually updated the `clone` sharding strategy registration to take `memory_format` into account. But instead, I figured that every aten op with a sharding strategy needs to handle the memory_format kwarg specially - so I tried to generically force DTensor to consider all ATen ops that take a `memory_format` kwarg during hashing.
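A toy illustration of the hashing fix being described (all names here are invented): include the memory_format kwarg in the dispatch cache key so that clone(preserve_format) and clone(contiguous_format) no longer collide.

```python
import torch

def sharding_cache_key(op_name, args_spec, kwargs):
    # fold memory_format into the key instead of ignoring it
    mem_fmt = kwargs.get("memory_format", torch.preserve_format)
    return (op_name, args_spec, mem_fmt)

key_a = sharding_cache_key("aten.clone", ("Shard(0)",), {"memory_format": torch.preserve_format})
key_b = sharding_cache_key("aten.clone", ("Shard(0)",), {"memory_format": torch.contiguous_format})
assert key_a != key_b  # previously these collided because memory_format was ignored in the hash
```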

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118667
Approved by: https://github.com/wanchaol
ghstack dependencies: #117667, #117666, #118209, #118191
2024-02-20 15:23:48 +00:00
6819452a08 fix multiple-fake-modes bug with compile + subclasses (#118191)
This should fix the "multiple fake modes" errors we've been seeing with both float8 tensor and DTensor.

Haven't added a test yet - will add one before landing.

I also have a separate PR that would have made the error significantly nicer (the bad error resulted from us returning a FakeTensor at runtime): https://github.com/pytorch/pytorch/pull/118644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118191
Approved by: https://github.com/drisspg
ghstack dependencies: #117667, #117666, #118209
2024-02-20 15:23:41 +00:00
b4b1480b06 remove redundant to_dtype in Fused Schedular Nodes (#118365)
Fix https://github.com/pytorch/pytorch/issues/115260.
This issue is triggered by `FusedSchedulerNode` cases.
We always store the `lowp buffer` to the `store_cache`, then load the `lowp buffer` from the `store_cache` and `convert it to float` before the `compute ops`.
Now we will generate a `{key: to(float32)_expr, value: the float32 cse var before to_dtype and store}` entry in `cse.cache`.
Then the `to_dtype(float32)` after `load` will hit this cache and not generate a new var with cast code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118365
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-20 13:35:03 +00:00
c28a43988e Fix typo under aten/src/ATen/native directory (#119686)
This PR fixes typo in comments and msgs under `aten/src/ATen/native` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119686
Approved by: https://github.com/lezcano, https://github.com/malfet
2024-02-20 06:31:10 +00:00
389b56b4c4 [dynamo][guards-cpp-refactor] GetAttrGuardAccessor (#119833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119833
Approved by: https://github.com/jansel
ghstack dependencies: #119822, #119827
2024-02-20 05:33:08 +00:00
96f45d15d8 [dynamo][guards-c++-refactor] EQUALS_MATCH guard (#119827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119827
Approved by: https://github.com/jansel
ghstack dependencies: #119822
2024-02-20 05:33:08 +00:00
0802951081 [dynamo][guards-c++-refactor] Introduce LeafGuard, GuardManager and GuardAccessor classes (#119822)
The full blown implementation is in this stack - https://github.com/pytorch/pytorch/pull/110590 which is passing all the test cases on CI. That stack is hard to review. So, breaking apart.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119822
Approved by: https://github.com/jansel
2024-02-20 05:33:08 +00:00
0512ba43ab [executorch hash update] update the pinned executorch hash (#120214)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120214
Approved by: https://github.com/pytorchbot
2024-02-20 04:13:02 +00:00
a7e2b609d3 Skip less replacements (#119570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119570
Approved by: https://github.com/ezyang
2024-02-20 04:10:33 +00:00
cc7ef43423 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside the loop. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing it from memory. If the `working set` of the `loop body` is quite large, the gap between `register` and `memory` accumulation will be significant.
```
vaddss (%xmm0, %xmm11, %xmm11) -> accumulate in register %xmm0
vaddssl ((%rdx, %rdi, 4), %xmm0, %xmm0) -> accumulate in memory address (%rdx, %rdi, 4)
```
Examples code:
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // access array will always from memory
    }
}
```
will be changed to
```
tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Description
Following ATen, use a `two pass reduction` with `omp parallel` for a deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Result (avg speedup):
| |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| huggingface | 1.346  | 1.343 |
| timms | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-20 00:46:59 +00:00
ae7830051d [BE] Delete GCC-7 ICE workarounds (#120122)
As one needs gcc-9 to compile PyTorch, those workarounds are no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120122
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/Skylion007
2024-02-20 00:31:20 +00:00
0bdeaad936 Revert "add register_device_op_overrides (#119268)"
This reverts commit 2864a7e161cc107f7e4c00cccdf860a6089c73c3.

Reverted https://github.com/pytorch/pytorch/pull/119268 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/119268#issuecomment-1953231324))
2024-02-19 22:31:32 +00:00
3ad067fe2b [CPP] Update GCC minversion check to 9 or newer (#120126)
It's already a requirement for building PyTorch, but it should also be a
requirement for linking extensions with it, since mixing compilers can lead to
runtime crashes: the `std::optional` template layout is incompatible between
gcc-9 and older compilers.

Also, update minimum supported clang version to 9.x (used to build Android), as clang-5 is clearly not C++17 compliant.

Fixes https://github.com/pytorch/pytorch/issues/120020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120126
Approved by: https://github.com/Skylion007
2024-02-19 22:05:00 +00:00
48bdd0fb47 [ROCm] TunableOp bugfix filename handling (#120144)
Fixes nightly wheel seg fault during pytorch shutdown.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120144
Approved by: https://github.com/xw285cornell
2024-02-19 21:31:29 +00:00
f1fbba8f35 Revert "Fix lint after #119268 (#120207)"
This reverts commit d9d0f1dccc59ce6f0cb150ac236654c24a0d1118.

Reverted https://github.com/pytorch/pytorch/pull/120207 on behalf of https://github.com/atalman due to Broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/120207#issuecomment-1953170249))
2024-02-19 21:21:12 +00:00
a73a98c9ae Revert "Updating sleef submodule to eb3d97785 to fix export errors (#119953)"
This reverts commit fa9cbdce993601276765ad7701871f7e04a400c6.

Reverted https://github.com/pytorch/pytorch/pull/119953 on behalf of https://github.com/atalman due to Broke trunk linux-focal-cpu-py3.10-gcc9-bazel-test and linux-focal-cuda12.1-py3.10-gcc9-bazel-test. These are not flaky failures. ([comment](https://github.com/pytorch/pytorch/pull/119953#issuecomment-1953118780))
2024-02-19 20:26:33 +00:00
d9d0f1dccc Fix lint after #119268 (#120207)
Fixes lint after: https://github.com/pytorch/pytorch/issues/119268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120207
Approved by: https://github.com/davidberard98
2024-02-19 20:01:45 +00:00
92bf2a4550 [torchbench] Update skipped models. (#120117)
This PR updates the list of benchmarks that should (not) be skipped. Here's a summary of
the changes:

- `detectron2_maskrcnn`: #120115
- `fambench_xlmr`: moved to canary models
- `hf_Bert` and `hf_Bert_large`: pass
- `maml`: pass
- `clip`: renamed to `hf_clip`
- `gat`, `gcn`, and `sage`: moved to canary models

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120117
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-02-19 18:08:32 +00:00
637cf4a3f2 Test parametrization utils for native funcol migration (#119950)
```
Between the time we switch to the native funcol by default and the time when
we are confident that we can remove the legacy implementation, we want to
ensure that the legacy funcol remains covered by unit tests. This is to
prepare for any potential (but unlikely) reverts. The following utilities
help achieve this goal.

run_with_{native,legacy}_funcol - mark a test to run with only
{native,legacy} funcol. These decorators are for impl specific tests (e.g.
verifying generated code with FileCheck).

run_with_both_funcol_impls - parametrize a test to run with both legacy and
native funcol.

run_with_both_funcol_impls_with_arg - same as run_with_both_funcol_impls, but
passes `enable_native_funcol` to the test so impl specific checks can be
carried out.
```

This PR also marks some tests we want to cover in this fashion. More tests will be marked in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119950
Approved by: https://github.com/wanchaol
ghstack dependencies: #119881
2024-02-19 02:46:03 +00:00
40786ca509 Handle unwaited work objects on process termination (#119881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119881
Approved by: https://github.com/wconstab
2024-02-19 02:46:02 +00:00
84de851539 [Inductor] Enable the decomposition of quant/dequant per channel (#119177)
**Summary**
Part 2 of fixing https://github.com/pytorch/pytorch/issues/119141 which needs vectorized code generation of per channel quant and int8 data type.
Enable decomposition of quant/dequant per channel to make it vectorized code generation.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_uint8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_uint8_bf16_input
python -u -m pytest -s -v test_cpu_repro.py -k test_per_channel_fake_quant_int8_bf16_input
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119177
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-02-19 01:30:44 +00:00
fa9cbdce99 Updating sleef submodule to eb3d97785 to fix export errors (#119953)
Fixes #119952 with submodule updates

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119953
Approved by: https://github.com/ezyang
2024-02-19 00:56:24 +00:00
f2cf0768d1 [dynamo][distributed] handle _rank_not_in_group, _get_or_create_default_group (#119628)
Copy of #117692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119628
Approved by: https://github.com/yanboliang
2024-02-18 22:34:35 +00:00
372d078f36 [pytree] add function tree_iter (#120155)
Fixes #119768

- #119768

This PR adds a new function `tree_iter` that lazily iterates over the tree leaves. It is different from the `tree_leaves` function, as the latter traverses the whole tree first to build a list of leaves.

```python
for leaf in tree_iter(tree):
    ...
```

is much more efficient than:

```python
for leaf in tree_leaves(tree):
    ...
```

where `tree_leaves(tree)` is `list(tree_iter(tree))`.
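A standalone sketch of the laziness argument, using a toy flatten rather than the real torch.utils._pytree implementation:

```python
def toy_tree_iter(tree):
    # generator: yields leaves one at a time, depth-first
    if isinstance(tree, (list, tuple)):
        for child in tree:
            yield from toy_tree_iter(child)
    elif isinstance(tree, dict):
        for child in tree.values():
            yield from toy_tree_iter(child)
    else:
        yield tree

def toy_tree_leaves(tree):
    # eager: materializes every leaf up front
    return list(toy_tree_iter(tree))

# Early exit only visits the leaves it needs instead of materializing all of them.
first = next(toy_tree_iter([[1, 2], {"a": 3}]))  # 1
```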

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120155
Approved by: https://github.com/vmoens
2024-02-18 09:16:50 +00:00
61a3a7628c [nit][DTensor][Test] Update test name to reflect the actual test (#118960)
test_name: test_partial_mul_failure -> test_partial_mul

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118960
Approved by: https://github.com/XilunWu
2024-02-18 08:23:06 +00:00
2864a7e161 add register_device_op_overrides (#119268)
Fixes #119267

Currently https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/common.py#L106 only supports built-in device functions; I'm going to add a register function to get the overrides class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119268
Approved by: https://github.com/jansel
2024-02-18 06:11:54 +00:00
70bc3b3be4 [executorch hash update] update the pinned executorch hash (#120165)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120165
Approved by: https://github.com/pytorchbot
2024-02-18 03:44:50 +00:00
d74bdd5042 [inductor] Always allow 64 bit in next_power_of_2 (#120164)
see #120153 #120152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120164
Approved by: https://github.com/yanboliang
2024-02-18 03:22:46 +00:00
de15781af0 [cuDNN] Bump cuDNN frontend submodule to 1.1.1 (#120137)
Hopefully addresses the failure seen when trying to bump to 1.1.0 (#119642) CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120137
Approved by: https://github.com/Skylion007
2024-02-18 02:57:02 +00:00
b642a18e80 [dynamo] Use EQUALS_MATCH guard for mod.training (#120147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120147
Approved by: https://github.com/jansel
ghstack dependencies: #120132, #120140, #120145
2024-02-18 00:31:36 +00:00
0b11b0edd6 [dynamo][refactor] Use existing helper functions for CLOSURE_MATCH (#120145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120145
Approved by: https://github.com/jansel, https://github.com/Fidget-Spinner
ghstack dependencies: #120132, #120140
2024-02-18 00:31:36 +00:00
0c972c7c4e enhance next_power_of_2 function (#120153)
Fixes #120152

cc  @ezyang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @jansel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120153
Approved by: https://github.com/jansel
2024-02-17 20:18:46 +00:00
2fea475215 [dynamo] Refactor reconstruct() not to return anything (#120150)
This simplifies things slightly and avoids some bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120150
Approved by: https://github.com/yanboliang
2024-02-17 17:13:41 +00:00
757fc663a8 [dynamo][refactor] Use TYPE_MATCH instead of manually constructing guard (#120140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120140
Approved by: https://github.com/jansel, https://github.com/yanboliang
ghstack dependencies: #120132
2024-02-17 16:03:36 +00:00
48d96c08f2 [dynamo][guards] Use EQUALS_MATCH for NAME_MATCH (#120132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120132
Approved by: https://github.com/jansel, https://github.com/yanboliang
2024-02-17 16:03:36 +00:00
cyy
a9953a5ef3 Remove unused c10/util/C++17.h inclusion and outdated checks (#120149)
This continues the work of cleaning up pre-C++17 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120149
Approved by: https://github.com/ezyang
2024-02-17 14:28:17 +00:00
fac598c4ae [inductor] allow padding mm/bmm/addmm in the presence of dynamic dims (#120073)
Previously, pad_mm skipped cases where any input tensor has a symbolic
dimension or stride. This is too constrained in practice.
This PR enables this pass to pad non-symbolic dimensions in
the presence of dynamic dims. For example, with this PR, we could
pad the K dimension (i.e. 1921) for torch.mm(A[s0, 1921], B[2048, 1921]).
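A conceptual sketch of the idea under assumed names (not the actual pad_mm heuristics): pad only the static K dimension to a convenient multiple and rely on the zero padding cancelling in the contraction, so no output slicing is needed.

```python
import torch

def pad_k_to_multiple(a, b, multiple=8):
    # a: [M, K] (M may be dynamic), b: [K, N]; K is static here
    k = a.shape[1]
    pad = (-k) % multiple
    if pad == 0:
        return a, b
    a_p = torch.nn.functional.pad(a, (0, pad))        # pad last dim of a with zeros
    b_p = torch.nn.functional.pad(b, (0, 0, 0, pad))  # pad first dim of b with zeros
    return a_p, b_p

a = torch.randn(13, 1921)   # 13 stands in for a dynamic M
b = torch.randn(1921, 2048)
a_p, b_p = pad_k_to_multiple(a, b)  # K: 1921 -> 1928
torch.testing.assert_close(a @ b, a_p @ b_p, rtol=1e-4, atol=1e-4)
```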

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120073
Approved by: https://github.com/jansel
2024-02-17 12:22:20 +00:00
2f8a80ecb2 Fix skip for test_set_nccl_pg_timeout (#120130)
The test is failing on our internal CI with the below error
```RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

The purpose of this test is NCCL-specific, so it doesn't make sense to run it in a 1-GPU setting either.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120130
Approved by: https://github.com/wconstab, https://github.com/eqy
2024-02-17 07:36:14 +00:00
badf84bd6b [inductor] Add torch.cond support to JIT Inductor (#119759)
Summary: `torch.cond` is already supported in Dynamo and Export: the `true_fn` and `false_fn` subgraphs are traced as child fx graphs of the main graph and passed to the `torch.cond` higher-order operator in the fx graph. However, this breaks in Inductor, as the latter doesn't have a way of dealing with child fx subgraphs and properly lowering and codegen-ing them.

In this PR, we add `torch.cond` support in Inductor. This is achieved by adding subgraph lowering and codegen-ing infrastructure as well as new `Conditional` IR node type weaving the parent graph with the true and false child subgraphs.

Here we only implement `torch.cond` support in JIT Inductor (Python wrapper codegen). The implementation in AOT Inductor (C++ wrapper codegen), including ABI-compatibility mode, will follow.
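A hedged usage sketch of the torch.cond flow this PR lowers (assuming the prototype signature `torch.cond(pred, true_fn, false_fn, operands)`; exact conventions may differ by version):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile(backend="inductor")
def f(pred, x):
    # both branches must return tensors of the same shape/dtype
    return torch.cond(pred, true_fn, false_fn, (x,))

x = torch.randn(4)
print(f(torch.tensor(True), x))   # sin branch
print(f(torch.tensor(False), x))  # cos branch
```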

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 24 tests in 86.790s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119759
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-17 07:25:27 +00:00
30000aa3fd [c10d] remove one line of verbose log (#120138)
Summary:
I don't find existing DBG mode support in c10d. This line is flooding the log; removing it to unblock users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120138
Approved by: https://github.com/wconstab
2024-02-17 06:39:57 +00:00
fa0e39560c [AOTI] Fix a typo (#120094)
Differential Revision: [D53861810](https://our.internmc.facebook.com/intern/diff/D53861810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120094
Approved by: https://github.com/khabinov, https://github.com/sijiac
2024-02-17 05:28:58 +00:00
0a7471e0df [executorch hash update] update the pinned executorch hash (#120134)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120134
Approved by: https://github.com/pytorchbot
2024-02-17 05:00:35 +00:00
ac2ba7889d [export] turn on replace_set_grad_with_hop_pass in pre_dispatch (#119915)
This PR turns on replace_set_grad_with_hop_pass for pre_dispatch export. To do that, we need to propagate the metadata from the original submodule to the new higher-order op and fix the node names as required by the _sig_to_specs pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119915
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913, #119914
2024-02-17 02:18:35 +00:00
737630268c [export] manually create test cases for split and inline (#119914)
This PR makes the tests for inline and sequential_split stop relying on set_grad_enabled being in the graph, because those calls will be gone once we turn on the replace_set_grad_with_hop_pass in the following diff. Instead, we'll manually insert them into the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119914
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810, #119913
2024-02-17 02:18:35 +00:00
8d81e61fb6 [export] make node_inline_ also inline the get_item calls (#119913)
As titled. Before this PR, after we split and then inline_, there were getitem calls in the graph that the original graph module doesn't have. This PR removes the additional get_item calls by inlining them.

Test Plan:
Added new test cases for graphs that return multiple outputs and takes multiple inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119913
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736, #119810
2024-02-17 02:18:27 +00:00
812f05d731 [export] add replace_set_grad_with_hop_pass (#119810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119810
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732, #119736
2024-02-17 02:18:19 +00:00
4769e6916a [export] add node_inline_ to prepare replacing set_grad_enabled with hop (#119736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119736
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #119732
2024-02-17 02:18:11 +00:00
068659ddc2 [export] add sequential_split to prepare replacing set_grad_enabled with hop (#119732)
This PR is 1/N in a series transforming global-state-mutating ops, such as torch._C._set_grad_enabled calls, in the pre-dispatch graph into a higher-order op so that the graph becomes more functional. We make use of split_module to help us do the transformation.

This PR preserves node.name from the original module by adding a new kwarg `keep_original_node_name` to split_module.

For a graph looks like this:
```python
def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin = torch.ops.aten.sin.default(add);  add = None
    sum_1 = torch.ops.aten.sum.default(sin);  sin = None
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_1 = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
    sub = torch.ops.aten.sub.Tensor(add_1, 1)
    return pytree.tree_unflatten((add_1, sub), self._out_spec)
```
Before the change, split graph returns the following graphs and subgraphs (notice the change from `add` -> `add_tensor`, `sin` -> `sin_default`):
```python
def forward(self, arg_0):
    arg0_1, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    submod_0 = self.submod_0(arg0_1);  arg0_1 = None
    submod_1 = self.submod_1(submod_0);  submod_0 = None
    submod_2 = self.submod_2(submod_1)
    return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)

# submod_0
def forward(self, arg0_1):
    add_tensor = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin_default = torch.ops.aten.sin.default(add_tensor);  add_tensor = None
    sum_default = torch.ops.aten.sum.default(sin_default);  sin_default = None
    return sum_default

# submod_1
def forward(self, sum_1):
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_tensor = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    return add_tensor

# submod_2
def forward(self, add_1):
    _set_grad_enabled = torch._C._set_grad_enabled(True)
    sub_tensor = torch.ops.aten.sub.Tensor(add_1, 1);  add_1 = None
    return sub_tensor
    """)

```

After the change, the test produce the following graph, all the node names in original graph module are preserved in sub_modules.
```python

def forward(self, arg_0):
    sub, = fx_pytree.tree_flatten_spec(([arg_0], {}), self._in_spec)
    submod_0 = self.submod_0(sub);  sub = None
    submod_1 = self.submod_1(submod_0);  submod_0 = None
    submod_2 = self.submod_2(submod_1)
    return pytree.tree_unflatten((submod_1, submod_2), self._out_spec)

# submod_0
def forward(self, arg0_1):
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    sin = torch.ops.aten.sin.default(add);  add = None
    sum_1 = torch.ops.aten.sum.default(sin);  sin = None
    return sum_1

# submod_1
def forward(self, sum_1):
    _set_grad_enabled = torch._C._set_grad_enabled(False)
    add_1 = torch.ops.aten.add.Tensor(sum_1, 1);  sum_1 = None
    return add_1

# submod_2
def forward(self, add_1):
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
    sub = torch.ops.aten.sub.Tensor(add_1, 1);  add_1 = None
    return sub

```

Note that currently, we call split_module on the graph after pre-dispatch aot. The difference is even larger if we `split_module` the graph module produced by dynamo, where all the original variable names in the user program are preserved after dynamo but lost after `split_module` without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119732
Approved by: https://github.com/tugsbayasgalan
2024-02-17 02:18:04 +00:00
becfda005e tiny improvement to the cprofile wrapper (#120100)
1. Right now we double-increment the profile counter. The PR avoids that so we don't end up with profile_0, profile_2, profile_4, ...
2. Log the latency of running the passed-in function with profiling on, so we can easily skip those _compile calls which return quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120100
Approved by: https://github.com/eellison
2024-02-17 02:10:25 +00:00
36e118b810 [inductor] logging meta data for inductor generated triton kernel (#120048)
I want to log metadata for inductor-generated triton kernels for a couple of purposes:
1. With this metadata, it should be convenient to find unaligned reduction kernels and try the idea here https://github.com/pytorch/pytorch/issues/119929 . It's nice to try it on kernels that are used in real models.
2. Based on the collected kernel metadata, I can build a simple offline tool that benchmarks each kernel with ncu and augments each kernel's metadata with latency, theoretical membw (estimated memory accesses / latency), and actually achieved membw. Hopefully this can point us to some good optimization opportunities.

Command:
```
TORCHINDUCTOR_CACHE_DIR=`realpath ~/inductor-caches/kernel-metadata-log` TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 time python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training
```

The best practice here is to point the inductor cache to a folder outside of /tmp so that one can always re-run a kernel from the path stored in its metadata (folders under /tmp may get removed by the system).

Here are the first 1000 rows of collected metadata for huggingface: https://gist.github.com/shunting314/cf4ebdaaaa7e852efcaa93524c868e5f

And here are all 10K kernels collected for huggingface; the gist cannot be rendered as a CSV since it's too large: https://gist.github.com/shunting314/7f841528e2debdc2ae05dece4ac591be .
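A hedged sketch of how the collected table could be scanned for reduction kernels; the file name and column names here are assumptions about the metric-table output, not a documented schema:

```python
import csv

with open("kernel_metadata.csv") as f:
    for row in csv.DictReader(f):
        # Keep only rows that look like reduction kernels (column names are assumed).
        if row.get("reduction_hint", "").lower() not in ("", "none"):
            print(row.get("kernel_name"), row.get("reduction_hint"), row.get("kernel_path"))
```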

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120048
Approved by: https://github.com/jansel
2024-02-17 02:09:27 +00:00
24968ff042 Add quantized gelu (#119935)
Summary: Added quantized gelu for the Vulkan backend.

Test Plan:
**Tested it on "On Demand RL FBSource"**

LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_quantized_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="VulkanAPITest.gelu_q*"

----------------------------------------------------------------------------------

Note: Google Test filter = VulkanAPITest.gelu_q*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.gelu_qint8
[       OK ] VulkanAPITest.gelu_qint8 (318 ms)
[ RUN      ] VulkanAPITest.gelu_qint8_self
[       OK ] VulkanAPITest.gelu_qint8_self (214 ms)
[ RUN      ] VulkanAPITest.gelu_quint8
[       OK ] VulkanAPITest.gelu_quint8 (152 ms)
[ RUN      ] VulkanAPITest.gelu_quint8_self
[       OK ] VulkanAPITest.gelu_quint8_self (142 ms)
[----------] 4 tests from VulkanAPITest (828 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (828 ms total)
[  PASSED  ] 4 tests.

Differential Revision: D52985437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119935
Approved by: https://github.com/jorgep31415
2024-02-17 01:17:25 +00:00
7973ac586d [Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404)
Summary:
Include the CUDAAllocatorConfig at the time of the snapshot in the snapshot file. This adds the variables:

```
  double garbage_collection_threshold;
  size_t max_split_size;
  size_t pinned_num_register_threads;
  bool expandable_segments;
  bool release_lock_on_cudamalloc;
  bool pinned_use_cuda_host_register;
  std::string last_allocator_settings;
  std::vector<size_t> roundup_power2_divisions;
```

Test Plan:
`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True',
 'max_split_size': -1,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': True,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 0,
  '2': 0,
  '4': 0,
  '8': 0,
  '16': 0,
  '32': 0,
  '64': 0,
  '128': 0,
  '256': 0,
  '512': 0,
  '1024': 0,
  '2048': 0,
  '4096': 0,
  '8192': 0,
  '16384': 0,
  '32768': 0}}
```
`PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces
```
{'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]',
 'max_split_size': 2097152000,
 'garbage_collection_threshold': 0.0,
 'expandable_segments': False,
 'pinned_num_register_threads': 1,
 'release_lock_on_cudamalloc': False,
 'pinned_use_cuda_host_register': False,
 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8}
}
```
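A hedged sketch of pulling a snapshot that should now carry these allocator settings; the exact metadata keys are an assumption here:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")           # make the allocator do some work
snap = torch.cuda.memory._snapshot()                  # dict with segments, traces, and (now) allocator config
torch.cuda.memory._dump_snapshot("snapshot.pickle")   # file consumed by the memory-viz tooling
```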

Differential Revision: D53536199

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404
Approved by: https://github.com/zdevito
2024-02-17 01:16:37 +00:00
9aa8bbf7f2 [BE] Delete C10_IS_TRIVIALLY_COPYABLE (#120120)
It's not used anywhere in PyTorch now that the custom implementation of `c10::optional` is gone, and it's not used elsewhere in the org either; see https://github.com/search?type=code&q=C10_IS_TRIVIALLY_COPYABLE+org%3Apytorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120120
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/huydhn
2024-02-17 01:04:30 +00:00
79569d117d Add hpu device support in storage/resize (#119761)
Add hpu device support to:
 - the storage method resize_
 - is_supported_device for fsdp
 - storage in general

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119761
Approved by: https://github.com/mikaylagawarecki
2024-02-17 01:04:27 +00:00
6b63d3bac9 [ONNX][dynamo_export] Adjust to new symbolic shape name format in value_info (#119855)
Bump onnxscript in CI and adjust the test case expectation of the experimental exported shape naming format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119855
Approved by: https://github.com/thiagocrepaldi
2024-02-17 00:51:19 +00:00
cyy
e61c8ef3aa Simplify c10::is_pod implementation and remove unneeded inclusion of C++17.h (#118212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118212
Approved by: https://github.com/albanD
2024-02-17 00:14:09 +00:00
cyy
6952d6ddad [structural binding][4/N] Replace std::tie with structural binding (#120039)
This PR follows https://github.com/pytorch/pytorch/pull/119774; it continues the work of cleaning up std::tie.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120039
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-02-17 00:05:58 +00:00
761fa5d6ec Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
There are two scenarios:

* Scenario 1: The checkpoint was saved with pytorch < 1.6
* Scenario 2: The checkpoint was saved with pytorch >= 1.6

Repro Scenario 1:

```python
from torch._subclasses import fake_tensor
import transformers

fake_mode = fake_tensor.FakeTensorMode()
with fake_mode:
    fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")
```

Error:

```bash
Some weights of the model checkpoint at sshleifer/tiny-gpt2 were not used when initializing GPT2Model: ['lm_head.weight']
- This IS expected if you are initializing GPT2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:463 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    460 │   │   │   )                                                                             │
│    461 │   │   return safe_load_file(checkpoint_file)                                            │
│    462 │   try:                                                                                  │
│ ❱  463 │   │   return torch.load(checkpoint_file, map_location="cpu")                            │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1030 in load                                                 │
│                                                                                                  │
│   1027 │   │   │   │   return _legacy_load(opened_file, map_location, _weights_only_unpickler,   │
│   1028 │   │   │   except RuntimeError as e:                                                     │
│   1029 │   │   │   │   raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None           │
│ ❱ 1030 │   │   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args  │
│   1031                                                                                           │
│   1032                                                                                           │
│   1033 # Register pickling support for layout instances such as                                  │
│                                                                                                  │
│ /opt/pytorch/torch/serialization.py:1258 in _legacy_load                                         │
│                                                                                                  │
│   1255 │   _sys_info = pickle_module.load(f, **pickle_load_args)                                 │
│   1256 │   unpickler = UnpicklerWrapper(f, **pickle_load_args)                                   │
│   1257 │   unpickler.persistent_load = persistent_load                                           │
│ ❱ 1258 │   result = unpickler.load()                                                             │
│   1259 │                                                                                         │
│   1260 │   deserialized_storage_keys = pickle_module.load(f, **pickle_load_args)                 │
│   1261                                                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:201 in _rebuild_tensor_v2                                           │
│                                                                                                  │
│   198 def _rebuild_tensor_v2(                                                                    │
│   199 │   storage, storage_offset, size, stride, requires_grad, backward_hooks, metadata=None    │
│   200 ):                                                                                         │
│ ❱ 201 │   tensor = _rebuild_tensor(storage, storage_offset, size, stride)                        │
│   202 │   tensor.requires_grad = requires_grad                                                   │
│   203 │   if metadata:                                                                           │
│   204 │   │   set_tensor_metadata(tensor, metadata)                                              │
│                                                                                                  │
│ /opt/pytorch/torch/_utils.py:180 in _rebuild_tensor                                              │
│                                                                                                  │
│   177 def _rebuild_tensor(storage, storage_offset, size, stride):                                │
│   178 │   # first construct a tensor with the correct dtype/device                               │
│   179 │   t = torch.tensor([], dtype=storage.dtype, device=storage._untyped_storage.device)      │
│ ❱ 180 │   return t.set_(storage._untyped_storage, storage_offset, size, stride)                  │
│   181                                                                                            │
│   182                                                                                            │
│   183 def get_tensor_metadata(tensor):                                                           │
│                                                                                                  │
│ /opt/pytorch/torch/utils/_stats.py:20 in wrapper                                                 │
│                                                                                                  │
│   17 │   │   if fn.__qualname__ not in simple_call_counter:                                      │
│   18 │   │   │   simple_call_counter[fn.__qualname__] = 0                                        │
│   19 │   │   simple_call_counter[fn.__qualname__] = simple_call_counter[fn.__qualname__] + 1     │
│ ❱ 20 │   │   return fn(*args, **kwargs)                                                          │
│   21 │   return wrapper                                                                          │
│   22                                                                                             │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1160 in __torch_dispatch__                         │
│                                                                                                  │
│   1157 │   def __torch_dispatch__(self, func, types, args=(), kwargs=None):                      │
│   1158 │   │   assert self not in _get_current_dispatch_mode_stack(), func                       │
│   1159 │   │   try:                                                                              │
│ ❱ 1160 │   │   │   return self.dispatch(func, types, args, kwargs)                               │
│   1161 │   │   except TypeError:                                                                 │
│   1162 │   │   │   log.exception("fake tensor raised TypeError")                                 │
│   1163 │   │   │   raise                                                                         │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1318 in dispatch                                   │
│                                                                                                  │
│   1315 │   │                                                                                     │
│   1316 │   │   # we are falling through to running non constant tensors, any input constant tha  │
│   1317 │   │   # is written to must be invalidated                                               │
│ ❱ 1318 │   │   self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)   │
│   1319 │   │                                                                                     │
│   1320 │   │   # Try for fastpath                                                                │
│   1321 │   │   if has_symbolic_sizes:                                                            │
│                                                                                                  │
│ /opt/pytorch/torch/_subclasses/fake_tensor.py:1557 in invalidate_written_to_constants            │
│                                                                                                  │
│   1554 │   │   any_constant = any(e.constant is not None for e in flat_arg_fake_tensors)         │
│   1555 │   │   if any_constant and get_schema_info(func).is_mutable():                           │
│   1556 │   │   │   schema_info = get_schema_info(func)                                           │
│ ❱ 1557 │   │   │   _, new_kwargs = normalize_function(                                           │
│   1558 │   │   │   │   func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True         │
│   1559 │   │   │   )                                                                             │
│   1560 │   │   │   for k, v in new_kwargs.items():                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:297 in normalize_function                              │
│                                                                                                  │
│   294 │   │   new_args_and_kwargs = _args_kwargs_to_normalized_args_kwargs(sig, args, kwargs,    │
│   295 │   else:                                                                                  │
│   296 │   │   assert callable(target)                                                            │
│ ❱ 297 │   │   torch_op_schemas = get_signature_for_torch_op(target)                              │
│   298 │   │   matched_schemas = []                                                               │
│   299 │   │   if torch_op_schemas:                                                               │
│   300 │   │   │   # Iterate through all of the schema until we find one that matches             │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in get_signature_for_torch_op                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:167 in <listcomp>                                      │
│                                                                                                  │
│   164 │   │   │   return (None, None) if return_schemas else None                                │
│   165 │   │   schemas = torch._C._jit_get_schemas_for_operator(aten_fn)                          │
│   166 │                                                                                          │
│ ❱ 167 │   signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]          │
│   168 │   return (signatures, schemas) if return_schemas else signatures                         │
│   169                                                                                            │
│   170 @compatibility(is_backward_compatible=False)                                               │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:70 in _torchscript_schema_to_signature                 │
│                                                                                                  │
│    67 │   from inspect import Parameter                                                          │
│    68 │   parameters : List[Parameter] = []                                                      │
│    69 │   for arg in ts_schema.arguments:                                                        │
│ ❱  70 │   │   arg_type = _torchscript_type_to_python_type(arg.type)                              │
│    71 │   │   default = arg.default_value if arg.has_default_value() else Parameter.empty        │
│    72 │   │   # TODO: Figure out if this is safe. It seems like when generating the type signa   │
│    73 │   │   # PythonArgParser, we emit signatures with `input` instead of `self` as the firs   │
│                                                                                                  │
│ /opt/pytorch/torch/fx/operator_schemas.py:64 in _torchscript_type_to_python_type                 │
│                                                                                                  │
│    61 │   eval'ing the annotation_str. _type_eval_globals sets up expressions                    │
│    62 │   like "List" and "Future" to map to actual types (typing.List and jit.Future)           │
│    63 │   """                                                                                    │
│ ❱  64 │   return eval(ts_type.annotation_str, _type_eval_globals)                                │
│    65                                                                                            │
│    66 def _torchscript_schema_to_signature(ts_schema : torch._C.FunctionSchema) -> inspect.Sig   │
│    67 │   from inspect import Parameter                                                          │
│ <string>:1 in <module>                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
NameError: name 'Storage' is not defined

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:467 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    464 │   except Exception as e:                                                                │
│    465 │   │   try:                                                                              │
│    466 │   │   │   with open(checkpoint_file) as f:                                              │
│ ❱  467 │   │   │   │   if f.read(7) == "version":                                                │
│    468 │   │   │   │   │   raise OSError(                                                        │
│    469 │   │   │   │   │   │   "You seem to have cloned a repository without having git-lfs ins  │
│    470 │   │   │   │   │   │   "git-lfs and run `git lfs install` followed by `git lfs pull` in  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/codecs.py:322 in decode                                       │
│                                                                                                  │
│    319 │   def decode(self, input, final=False):                                                 │
│    320 │   │   # decode input (taking the buffer into account)                                   │
│    321 │   │   data = self.buffer + input                                                        │
│ ❱  322 │   │   (result, consumed) = self._buffer_decode(data, self.errors, final)                │
│    323 │   │   # keep undecoded input until the next call                                        │
│    324 │   │   self.buffer = data[consumed:]                                                     │
│    325 │   │   return result                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/pytorch/bug_repro.py:16 in <module>                                                         │
│                                                                                                  │
│   13 fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")                  │
│   14 assert fake_model is not None                                                               │
│   15 with fake_mode:                                                                             │
│ ❱ 16 │   fake_model = transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2")  # raises    │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py:484 in │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   481 │   │   │   )                                                                              │
│   482 │   │   elif type(config) in cls._model_mapping.keys():                                    │
│   483 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
│ ❱ 484 │   │   │   return model_class.from_pretrained(                                            │
│   485 │   │   │   │   pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs,   │
│   486 │   │   │   )                                                                              │
│   487 │   │   raise ValueError(                                                                  │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:2604 in          │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   2601 │   │   if from_pt:                                                                       │
│   2602 │   │   │   if not is_sharded and state_dict is None:                                     │
│   2603 │   │   │   │   # Time to load the checkpoint                                             │
│ ❱ 2604 │   │   │   │   state_dict = load_state_dict(resolved_archive_file)                       │
│   2605 │   │   │                                                                                 │
│   2606 │   │   │   # set dtype to instantiate the model under:                                   │
│   2607 │   │   │   # 1. If torch_dtype is not None, we use that dtype                            │
│                                                                                                  │
│ /opt/conda/envs/ptca/lib/python3.8/site-packages/transformers/modeling_utils.py:479 in           │
│ load_state_dict                                                                                  │
│                                                                                                  │
│    476 │   │   │   │   │   │   "model. Make sure you have saved the model properly."             │
│    477 │   │   │   │   │   ) from e                                                              │
│    478 │   │   except (UnicodeDecodeError, ValueError):                                          │
│ ❱  479 │   │   │   raise OSError(                                                                │
│    480 │   │   │   │   f"Unable to load weights from pytorch checkpoint file for '{checkpoint_f  │
│    481 │   │   │   │   f"at '{checkpoint_file}'. "                                               │
│    482 │   │   │   │   "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please s  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OSError: Unable to load weights from pytorch checkpoint file for '/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin' at
'/root/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set
from_tf=True.
```

Repro scenario 2:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during torch.load (when fake mode is active) by changing the storage's device to 'meta'. A hedged usage sketch follows below.
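The sketch mirrors the scenario-2 repro above; the printed type is what the fix is expected to produce, not verified output:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

model = torch.nn.Linear(5, 10)
with tempfile.NamedTemporaryFile() as f:
    torch.save(model.state_dict(), f.name)
    with fake_tensor.FakeTensorMode():
        state_dict = torch.load(f.name)    # no longer raises; storages are redirected to 'meta'
print(type(state_dict["weight"]))           # expected: a FakeTensor
```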

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang, https://github.com/atalman
2024-02-16 23:42:50 +00:00
7ad4ab4765 Remove unused import (#120004)
Summary: Title

Test Plan: CI

Differential Revision: D53820298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120004
Approved by: https://github.com/zhxchen17, https://github.com/Skylion007
2024-02-16 22:00:44 +00:00
7b1f5c874f [PT2][Optimus][Observability] Log the optimus graph transformation to the scuba (#119745)
Summary: The current everstore upload logging may cause excessive compilation time when the model has lots of graph breaks (post: https://fb.workplace.com/groups/257735836456307/permalink/633533465543207/). Here we log the transformation only when the graph has changed.

Test Plan:
timeout flows:
f528209775
f530084719

Differential Revision: D53692344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119745
Approved by: https://github.com/jackiexu1992
2024-02-16 21:32:04 +00:00
006eead7d2 [dynamo][functional_collectives] Add all_to_all_single, all_gather_list, reduce_scatter_list to dynamo remapping (#119683)
Differential Revision: [D53758434](https://our.internmc.facebook.com/intern/diff/D53758434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119683
Approved by: https://github.com/ezyang
2024-02-16 21:28:39 +00:00
4f4629d522 [Dynamo] Fix ListIteratorVariable repr to avoid log flooding (#120053)
This issue was found in a Meta-internal use case.
Before:
```
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:682 [0/0] TRACE starts_line /data/users/ybliang/debug/debug4.py:11 in <listcomp> (f) (inline depth: 1)
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:682 [0/0]         a = [sum(x) for x in result]
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE BUILD_LIST 0 []
V0215 18:33:41.761000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST .0 [ListVariable()]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable([LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=0)]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable([LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), LazyVariableTracker()]
V0215 18:33:41.762000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.763000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), BuiltinVariable(sum)]
V0215 18:33:41.763000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), BuiltinVariable(sum), ListVariable()]
V0215 18:33:41.764000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1), ConstantVariable(int: 50)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=1)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), LazyVariableTracker()]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2)]
V0215 18:33:41.765000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), BuiltinVariable(sum)]
V0215 18:33:41.766000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), BuiltinVariable(sum), ListVariable()]
V0215 18:33:41.766000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2), ConstantVariable(int: 68)]
V0215 18:33:41.767000 140489262883968 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable([ListVariable(), ListVariable(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker(), LazyVariableTracker()], index=2)]
```
After:
```
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:682 [0/0] TRACE starts_line /data/users/ybliang/debug/debug4.py:11 in <listcomp> (f) (inline depth: 1)
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:682 [0/0]         a = [sum(x) for x in result]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE BUILD_LIST 0 []
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST .0 [ListVariable()]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=0)]
V0215 18:27:57.901000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=1), LazyVariableTracker()]
V0215 18:27:57.902000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.902000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=1), BuiltinVariable(sum)]
V0215 18:27:57.903000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=1), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.903000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=1), ConstantVariable(int: 55)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=1)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=2), LazyVariableTracker()]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=2), BuiltinVariable(sum)]
V0215 18:27:57.904000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=2), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=2), ConstantVariable(int: 64)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE JUMP_ABSOLUTE 4 [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE FOR_ITER 18 [ListVariable(), ListIteratorVariable(length=10, index=2)]
V0215 18:27:57.905000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE STORE_FAST x [ListVariable(), ListIteratorVariable(length=10, index=3), LazyVariableTracker()]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_GLOBAL sum [ListVariable(), ListIteratorVariable(length=10, index=3)]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LOAD_FAST x [ListVariable(), ListIteratorVariable(length=10, index=3), BuiltinVariable(sum)]
V0215 18:27:57.906000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE CALL_FUNCTION 1 [ListVariable(), ListIteratorVariable(length=10, index=3), BuiltinVariable(sum), ListVariable()]
V0215 18:27:57.907000 140556649206912 torch/_dynamo/symbolic_convert.py:708 [0/0] TRACE LIST_APPEND 2 [ListVariable(), ListIteratorVariable(length=10, index=3), ConstantVariable(int: 56)]
```
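A minimal sketch of the compact repr shown above, using a stand-in class rather than the actual dynamo ListIteratorVariable:

```python
class ListIteratorVariable:
    def __init__(self, items, index=0):
        self.items = items
        self.index = index

    def __repr__(self):
        # Log only length and position instead of every contained VariableTracker.
        return f"{type(self).__name__}(length={len(self.items)}, index={self.index})"

print(ListIteratorVariable(list(range(10)), index=1))
# ListIteratorVariable(length=10, index=1)
```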

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120053
Approved by: https://github.com/williamwen42
2024-02-16 21:19:37 +00:00
26343451be DTensor: make tensor_flatten more compatible for dynamo getattr (#118209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118209
Approved by: https://github.com/ezyang, https://github.com/wanchaol
ghstack dependencies: #117667, #117666
2024-02-16 21:16:07 +00:00
ee7bcf23db dynamo: support attribute access on tensor subclasses without sources (#117666)
Fixes https://github.com/pytorch/pytorch/issues/117596

This was needed for Float8Tensor. Before this PR, dynamo would sometimes handle attribute access on tensor subclasses correctly, but it would choke on tensor subclasses with no source (it would fall back to using a `GetAttrVariable` to represent the attribute access, which is a problem if the attribute is a tensor that we later want to call tensor methods on).

I supported two cases:

(1) the attribute is a tensor, which is part of the `attrs` returned by the subclass's `__tensor_flatten__`. This creates a `TensorVariable`
(2) the attribute is a constant, which is part of the constant metadata returned by `__tensor_flatten__`. As per the contract of tensor_flatten, this should be a `ConstantVariable`. It could be possible that we allow non-constant metadata in the future, but we don't support that today.
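A minimal subclass sketch illustrating the two cases (dispatch and error handling are omitted; the class and attribute names are illustrative, not Float8Tensor):

```python
import torch

class ScaledTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, elem, scale):
        t = torch.Tensor._make_wrapper_subclass(cls, elem.shape, dtype=elem.dtype, device=elem.device)
        t._elem = elem     # inner tensor -> case (1), becomes a TensorVariable
        t._scale = scale   # constant metadata -> case (2), becomes a ConstantVariable
        return t

    def __tensor_flatten__(self):
        return ["_elem"], {"scale": self._scale}

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride):
        return ScaledTensor(inner_tensors["_elem"], ctx["scale"])
```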

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117666
Approved by: https://github.com/zou3519
ghstack dependencies: #117667
2024-02-16 21:16:07 +00:00
67f6aca0d0 dynamo: respect autograd.Function + multiple save_for_backward calls (#117667)
Fixes https://github.com/pytorch/pytorch/issues/117652, a corner case that I hit while debugging some Float8 issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117667
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-02-16 21:16:07 +00:00
4ac857f94e Support broadcast in native funcol (#119229)
### Summary

@LucasLLC recently implemented `broadcast` in funcol, but it is not yet available in the native funcol ops. This PR adds broadcast support to native funcol.

- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with python functol broadcast and `AsyncCollectiveTensor`
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
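A hedged usage sketch of the python funcol entry point this wires up (requires an initialized process group; the argument names are an assumption):

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def broadcast_from_rank0(t: torch.Tensor) -> torch.Tensor:
    # Returns an AsyncCollectiveTensor that synchronizes on first real use.
    return funcol.broadcast(t, src=0, group=dist.group.WORLD)
```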

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
2024-02-16 21:01:34 +00:00
24d5caba6e [EZ] Fix argument parsing in build_with_debinfo (#120088)
`nargs="?"` accept 0 or 1 argument, but `nargs="*"` accepts 0 or any number of arguments, which is the intended behavior of the tool

Test plan: Run `python tools/build_with_debinfo.py aten/src/ATen/native/cpu/BlasKernel.cpp aten/src/ATen/native/BlasKernel.cpp` and observe that it generates torch_cpu with those two files containing debug information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120088
Approved by: https://github.com/Skylion007
2024-02-16 20:06:52 +00:00
2d4aa91a10 Fix searchsorted function signature in docs (#120086)
Side should be an optional string, to match the definition in native_functions: fbe8e0f92d/aten/src/ATen/native/native_functions.yaml (L11246)
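For reference, `side` as a string keyword (the values chosen here are illustrative):

```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
values = torch.tensor([3, 6, 9])
torch.searchsorted(boundaries, values)                # tensor([1, 3, 4])  (default side='left')
torch.searchsorted(boundaries, values, side="right")  # tensor([2, 3, 5])
```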

Fixes https://github.com/pytorch/pytorch/issues/119999

Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/120086/generated/torch.searchsorted.html#torch-searchsorted

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120086
Approved by: https://github.com/lezcano
2024-02-16 20:00:04 +00:00
288d1f3698 [Optim][Rprop] Replace new().resize_as_() by torch.full_like() (#119978)
As titled.
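The pattern being replaced, shown as a hedged standalone illustration (not the actual Rprop code):

```python
import torch

p = torch.randn(3, 4)
old = p.new().resize_as_(p).fill_(0.01)   # old pattern: empty tensor, resize, fill
new = torch.full_like(p, 0.01)            # new pattern: single call, same dtype/device/shape
assert torch.equal(old, new)
```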
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119978
Approved by: https://github.com/janeyx99
2024-02-16 19:54:04 +00:00
6ea4480818 [quant][pt2e] Add model_is_exported util function (#119726)
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
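A hedged usage sketch; the import path below is an assumption about where the util lands:

```python
import torch
from torch.ao.quantization.pt2e.export_utils import model_is_exported  # path is an assumption

class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.dropout(x, training=self.training)

exported = torch.export.export(M(), (torch.randn(3),)).module()
if model_is_exported(exported):
    # Exported graphs need the dedicated move_exported_model_to_* APIs rather than .eval()/.train().
    print("use the exported-model train/eval helpers")
```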

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported

Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD
2024-02-16 19:29:36 +00:00
312ce35c1f Rename singleton int to nested int (#119661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661
Approved by: https://github.com/ezyang
2024-02-16 19:21:17 +00:00
b97fa6ac30 Make roll a decomposition and remove its lowering (#119857)
We use the fact that we now propagate indexing properly to avoid having
to maintain two different implementations of the op. Doing this we also remove
a spurious guard on this op.

We move the ref into a decomp as we now use advanced indexing.
The only difference in the implementation is that we now use
advanced indexing rather than `torch.cat`.

We also remove it from core. Let's see how this goes.
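A hedged reference sketch of roll via advanced indexing (`index_select` stands in for the gather-style indexing; this is not the actual decomposition):

```python
import torch

def roll_ref(x: torch.Tensor, shift: int, dim: int) -> torch.Tensor:
    size = x.size(dim)
    if size == 0:
        return x.clone()
    idx = (torch.arange(size, device=x.device) - shift) % size   # wrap-around index math
    return x.index_select(dim, idx)

x = torch.arange(6).reshape(2, 3)
assert torch.equal(roll_ref(x, 1, 1), torch.roll(x, 1, 1))
```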

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119857
Approved by: https://github.com/peterbell10, https://github.com/larryliu0820
ghstack dependencies: #119863, #119864
2024-02-16 19:14:39 +00:00
8b02d64197 Correct index propagation for % (#119864)
The current index propagation transformed % into `fmod`, which was
incorrect. We now perform the index propagation only in the most common
case, where it is correct to do so.
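The distinction, for reference: Python-style modulo follows the divisor's sign while fmod follows the dividend's, so they disagree on negative operands.

```python
import math
import torch

print(-7 % 3)                                  # 2    (modulo: sign of the divisor)
print(math.fmod(-7, 3))                        # -1.0 (fmod: sign of the dividend)
print(torch.remainder(torch.tensor(-7), 3))    # tensor(2)
print(torch.fmod(torch.tensor(-7), 3))         # tensor(-1)
```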

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119864
Approved by: https://github.com/peterbell10
ghstack dependencies: #119863
2024-02-16 19:14:39 +00:00
00524970e8 Simplify indexing when doing ModularIndexing + index propagation. (#119863)
We now avoid creating an unnecessary ternary operator in a reasonably
common case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119863
Approved by: https://github.com/peterbell10
2024-02-16 19:14:39 +00:00
86dedebeaf Revert "Add pixel_shuffle to core aten decomps (#119899)"
This reverts commit 9201d7335a25d9a91e10c1914c399419af0bd7c3.

Reverted https://github.com/pytorch/pytorch/pull/119899 on behalf of https://github.com/huydhn due to Sorry for reverting your change but keep the diff D53766709 around while investigating the failed tests is not a good practice and could lead to out of sync issue, so it is better to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/119899#issuecomment-1948970686))
2024-02-16 17:44:59 +00:00
b10ae9e54c [pytree] Properly register immutable collections (#120036)
Summary:
Getting error like:
```
No registered serialization name for <class 'torch.fx.immutable_collections.immutable_dict'> found. Please update your _register_pytree_node call with a `serialized_type_name` kwarg.
```

Reviewed By: suo

Differential Revision: D53833323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120036
Approved by: https://github.com/SherlockNoMad
2024-02-16 17:39:12 +00:00
124c251510 Guarantee init cuda before attaching hooks (#120052)
Summary: If cuda is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty, which means that the registration hooks are not set up. As a result, a new segment_alloc will not be registered, causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that cuda is initialized before attaching the hooks. If cuda is already initialized, then lazyInitCUDA is a no-op.

Test Plan:
Testing this on fsdp+tp example model where cuda is not initialized before init_process_group.

Job without the fix keeps dynamically registering:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION
The following keeps looping:
[0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true
[0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000

Job with the fix does not have this issue:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION

Reviewed By: minsii, kwen2501, xw285cornell

Differential Revision: D53770989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052
Approved by: https://github.com/kwen2501
2024-02-16 17:36:53 +00:00
fbe8e0f92d Fix missing right square bracket to match glog format (#119966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119966
Approved by: https://github.com/oulgen
ghstack dependencies: #119869
2024-02-16 15:14:00 +00:00
9726d7ca8e Add lowering for logcumsumexp (#118753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118753
Approved by: https://github.com/peterbell10
ghstack dependencies: #119809
2024-02-16 14:04:38 +00:00
3f4dd9bfa4 Back out "[pytree] Require serialized_type_name" (#120041)
Summary:
D53785493 breaks apf.rec.ir.tests.ir_export_deserialize_test.IRExportDeserializeTest: test_export_deserialize_ebc failed:

https://www.internalfb.com/sandcastle/workflow/3436246515685789584

Test Plan: buck2 test mode/opt apf/rec/ir/tests:ir_export_deserialize_test

Differential Revision: D53834881

Co-authored-by: Wilson Hong <wilsonhong@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120041
Approved by: https://github.com/ydwu4
2024-02-16 10:02:25 +00:00
4625ecb858 Add decomp for linalg.cross (#119809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119809
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-02-16 09:58:38 +00:00
3693d8f467 Do not convert UnsupportedFakeTensorException into RuntimeError in runNode for proper graph breaking. (#120026)
Fix: https://github.com/pytorch/pytorch/issues/119779 by properly graph breaking; a proper fix for a complete solution would be to handle quantized tensors.

If an UnsupportedFakeTensorException is thrown when generating a fake tensor, it is handled and converted into an
Unimplemented inside wrap_fake_exception, which is then translated to a graph break.

However, run_node used to convert UnsupportedFakeTensorException into a runtime error, creating runtime
errors instead of graph breaks whenever generating a fake tensor for a quantized tensor failed.
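A self-contained sketch of the control flow described above, with stand-in exception types (not the actual dynamo code):

```python
class UnsupportedFakeTensorException(Exception):  # stand-in for the real exception
    pass

class Unsupported(Exception):                     # stand-in for dynamo's graph-break signal
    pass

def wrap_fake_exception(fn):
    try:
        return fn()
    except UnsupportedFakeTensorException as e:
        # Handled here: converted into a graph break rather than propagated.
        raise Unsupported(f"fake tensor creation failed: {e}") from e

def run_node(make_fake):
    # After this PR: no blanket re-wrap into RuntimeError, so the graph break survives.
    return wrap_fake_exception(make_fake)
```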

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120026
Approved by: https://github.com/jansel
2024-02-16 09:21:58 +00:00
54025c01a7 [DCP][state_dict] Let distributed_state_dict filter out the compiler prefix (#119830)
Let distributed_state_dict filter out the compiler prefix
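A hedged illustration of what stripping the compiler prefix means for a `torch.compile`-wrapped module's keys; the dict below is made up, and `_orig_mod.` is the usual prefix:

```python
COMPILER_PREFIX = "_orig_mod."

raw = {"_orig_mod.fc.weight": "w", "_orig_mod.fc.bias": "b"}
clean = {k.removeprefix(COMPILER_PREFIX): v for k, v in raw.items()}
print(clean)  # {'fc.weight': 'w', 'fc.bias': 'b'}
```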

Differential Revision: [D53681864](https://our.internmc.facebook.com/intern/diff/D53681864/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119830
Approved by: https://github.com/wz337
2024-02-16 08:59:58 +00:00
bc7f3efb09 [aot_inductor] move CppWrapperCodeGen into a separate file (#119871)
This reverts commit d8e319a961bb872027f0abdc413d6beb7502ac9b.

Differential Revision: [D53817853](https://our.internmc.facebook.com/intern/diff/D53817853)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119871
Approved by: https://github.com/albanD, https://github.com/khabinov
ghstack dependencies: #119870
2024-02-16 08:14:20 +00:00
78c9b2948a [aot_inductor] move CudaWrapperCodeGen into a separate file (#119870)
This reverts commit 3ab08946d5052eaeda11d683d6a58e801a032755.

Differential Revision: [D53817852](https://our.internmc.facebook.com/intern/diff/D53817852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119870
Approved by: https://github.com/khabinov
2024-02-16 08:10:51 +00:00
8f9f12c068 Intel GPU Runtime Upstreaming for Device Allocator (#118091)
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream to PyTorch the key functionality of the device `Allocator` dedicated to XPU and, following our design, prepare to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
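A hedged usage sketch of what this allocator serves once the XPU backend is built; the availability of `torch.xpu` is an assumption in this snippet:

```python
import torch

if hasattr(torch, "xpu") and torch.xpu.is_available():
    a = torch.empty(1024, 1024, device="xpu")  # allocation routed through XPUAllocator
    del a                                       # block returned to the per-device cache, not to the driver
```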

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU; the second PR covers the host `Allocator`.
Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`.

The differences with CUDA:
only the key functionality is included; it lacks AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segments...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
2024-02-16 06:46:00 +00:00
b8be8b639f Add Runtime Constant-Folding function of AOTInductor for AOTInductorModels used internally. (#119823)
Summary:
1. Make sure folded constants generated internally don't get exposed.
2. Add runConstantFolding and related API calls

Test Plan:
```buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=*PyTorchPredictorContainerTest.LoadAOTInductorModel*
```
The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`,
which would trigger runConstantFolding from predictor's module loading.

Reviewed By: SherlockNoMad

Differential Revision: D53718139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823
Approved by: https://github.com/chenyang78
2024-02-16 06:45:48 +00:00
4dc75f9084 Intel GPU Runtime Upstreaming for Event (#117734)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation that is being executed. In some circumstances, `Event` gives us fine-grained control over operation execution.

# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It is created lazily on an XPU device when recording an `XPUStream`. Meanwhile, an `XPUEvent` can wait for another `XPUEvent` or for all the kernels submitted to an `XPUStream` to complete. Aligned with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For the frontend, the `XPUEvent` runtime API will be bound to Python as `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.
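
A hedged usage sketch of the intended Python surface (assumes an XPU-enabled build and the stream bindings from the earlier PRs in this stack; it mirrors the `torch.cuda.Event` pattern and omits `elapsed_time`):

```python
# Hedged sketch of recording and waiting on a torch.xpu.Event.
import torch

stream = torch.xpu.Stream()
event = torch.xpu.Event()

with torch.xpu.stream(stream):
    y = torch.randn(1024, device="xpu").sin()
    event.record(stream)       # the event is created lazily on first record

event.synchronize()            # host waits for the work captured by the event
print(event.query())           # True once the recorded work has completed
```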

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. `XPUEvent` also doesn't support IPC across processes. Other than that, we have an almost 1:1 mapping with CUDA.

It lacks the following APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
2024-02-16 06:28:26 +00:00
02fb043522 Change native funcol inductor tests to use fake pg (#119104)
Summary:
Previously these tests require more than 2 GPUs to run. Changing them to use fake pg so they can run more often.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119104
Approved by: https://github.com/wconstab
ghstack dependencies: #119103
2024-02-16 05:18:45 +00:00
62e5840b36 [Dynamo] Do not create TorchInGraphFunctionVariable for tags (#120005)
Fixes https://github.com/pytorch/pytorch/issues/119793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120005
Approved by: https://github.com/yanboliang
2024-02-16 03:37:32 +00:00
ddde1e4dee [executorch hash update] update the pinned executorch hash (#119943)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119943
Approved by: https://github.com/pytorchbot
2024-02-16 03:36:56 +00:00
4eefe7285a Use ARMV8 fconv insns to speed up scalar fp16<->fp32 (#120012)
Thanks to discussion with @mikekgfb I've realized that FP16_ARITH is the feature available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` by using the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVT--Floating-point-Convert-precision--scalar--?lang=en)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 msec to 27 msec (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )

This is a reland of https://github.com/pytorch/pytorch/pull/119895 that got reverted because it was not buildable using Jetson toolkit

"Fixed" the problem by guarding the fast conversions with `!defined(__CUDACC__)`  (for internal folks, tested it by running `buck build @arvr/mode/embedded/jetson/linux/opt-stripped //xplat/caffe2:caffe2_ops_cuda_ovrsource` )
I also extended the conversion to all AArch64 platforms, not just the ones that support the FP16 arithmetic extensions (i.e. ARMv8.2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120012
Approved by: https://github.com/huydhn
2024-02-16 03:04:06 +00:00
3e5e8590f4 Account for inference mode in FakeTensor cache (#119963)
Summary: an fbcode test exposed a shortcoming where we serve a FakeTensor from the cache with the wrong inference_mode. Take the current mode into account in the cache key so we only serve entries created under the same mode we're currently in.
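
A minimal sketch of the idea, assuming a simplified key builder (the real cache key hashes far more state than shown here):

```python
# Hedged sketch: fold the current inference mode into the FakeTensor cache key
# so entries created under one mode are never served under the other.
import torch

def make_cache_key(op_name, args_fingerprint):
    return (op_name, args_fingerprint, torch.is_inference_mode_enabled())

with torch.inference_mode():
    key_a = make_cache_key("aten.add", ("f32[3]", "f32[3]"))
key_b = make_cache_key("aten.add", ("f32[3]", "f32[3]"))
assert key_a != key_b   # differing modes now map to distinct cache entries
```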

Test Plan: New unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119963
Approved by: https://github.com/eellison
2024-02-16 02:53:33 +00:00
8bfc87ce74 fixed flop counter formula for conv transposed backwards pass (#119874)
Fixes #119806
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119874
Approved by: https://github.com/zou3519
ghstack dependencies: #119521
2024-02-16 02:43:49 +00:00
17c345ebd9 [FSDP] compile compute and CI with @test_compiled_fsdp (#119933)
goal: all unit tests for eager. we want to test torch.compile by default

this PR adds ``@test_compiled_fsdp(compile_compute_on_module=None/TransformerBlock)`` to unit tests. now it's compiling compute-only as follows.

```
module.compile() # include user registered hooks if any
fully_shard(module)
```

torch.compile does not work following component yet
* compiling AC
* compiling reshard_after_forward=2
* delayed_all_gather, delayed_reduce_scatter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119933
Approved by: https://github.com/awgu, https://github.com/jansel
2024-02-16 01:48:51 +00:00
c802c50196 Setup Nvidia Runtime before Indexer (#119923)
Sets up Nvidia Runtime and runs indexer inside a docker container.

Verified this works by running the indexer jobs (all the setup is correct, it OOMs for an unrelated reason, for which a fix is on the way).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119923
Approved by: https://github.com/huydhn
2024-02-16 00:33:18 +00:00
4319735ace Add meta registration for _foreach_norm (2nd try) (#119927)
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. Instead, we now launch multiple kernels with a simpler version of the struct (keeping the number of kernels launched to a minimum).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
2024-02-16 00:23:23 +00:00
707cde9b31 [DTensor][Test] Improve math_ops test (#118956)
The DTensor fully_shard_tensor was created but not used in shard_math_ops test previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118956
Approved by: https://github.com/wanchaol
2024-02-15 23:59:25 +00:00
cyy
94f19fe545 [3/N] Replace std::tie with structural binding (#119962)
This PR follows https://github.com/pytorch/pytorch/pull/119774, it is a continued work to clean up std::tie

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119962
Approved by: https://github.com/albanD
2024-02-15 23:48:28 +00:00
2a63dd8889 [Dynamo] Support lazy module with namedtuple/dict input (#119972)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119972
Approved by: https://github.com/jansel
2024-02-15 23:18:18 +00:00
f9f602fcb8 Clean up decorators (#119925)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119925
Approved by: https://github.com/eellison
2024-02-15 22:51:53 +00:00
444c628e06 Include the scalar tensor auto-transfer in the doc (#119967)
Fixes #119609

@albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119967
Approved by: https://github.com/albanD
2024-02-15 22:37:39 +00:00
47300221c2 Revert "[export] Change runtime asserts to using assert_scalar (#119608)"
This reverts commit f4d641ba2fb11fca2ba47f0c425d8a4a1adbffb6.

Reverted https://github.com/pytorch/pytorch/pull/119608 on behalf of https://github.com/huydhn due to This break ONNX trunk job 65fd8b6730 ([comment](https://github.com/pytorch/pytorch/pull/119608#issuecomment-1947436402))
2024-02-15 22:25:24 +00:00
da1df5d7b8 [ROCm] Update triton wheels to ROCm 6.0 (#119765)
Upgrades nightly triton issues to ROCM 6.0 and adds bitcodes for gfx941 and gfx942.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119765
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2024-02-15 21:57:51 +00:00
3f4f91f2eb [inductor][eazy] fix profiler (#119959)
print_performance previously returned the total execution time for `times` runs, but now it returns the average execution time of a single run. Change the profiler to be consistent with that. I'm not sure there is a good way to add a test for this, though.
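
For illustration, a hypothetical helper showing the before/after semantics (not the actual inductor utility):

```python
# Hedged sketch: return the average per-run time rather than the total across
# `times` runs, matching the new print_performance behavior.
import time

def measure(fn, times=10):
    start = time.perf_counter()
    for _ in range(times):
        fn()
    total = time.perf_counter() - start
    return total / times   # previously the equivalent of `return total`
```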

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119959
Approved by: https://github.com/eellison
2024-02-15 21:47:09 +00:00
65fd8b6730 Revert "[export] Disable exported_program.__call__ (#119466)"
This reverts commit c26884f06345bf61e0843d13db84e76236ff6142.

Reverted https://github.com/pytorch/pytorch/pull/119466 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119466#issuecomment-1947384298))
2024-02-15 21:42:32 +00:00
744898b311 Add doc page for environment variables that affect PyTorch Runtime (#119087)
# Summary

The goal of this PR is to add a doc page listing a number of environment variables that affect the PyTorch runtime. It will likely not be exhaustive, but hopefully it will be added to and updated to stay relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119087
Approved by: https://github.com/janeyx99, https://github.com/eqy
2024-02-15 21:41:38 +00:00
d707e3c9c6 Fix handling none source in build_torch_function_fn (#119724)
Fix https://github.com/pytorch/pytorch/issues/119580

When a UserDefinedObjectVariable is created it does not always have a source, e.g. when it is an intermediate value.
This diff fixes the handling of a None source in two locations during the inlining of a user torch function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119724
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-02-15 21:21:47 +00:00
9548860b37 Fix typo in istft docstring (#119776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119776
Approved by: https://github.com/colesbury
2024-02-15 21:20:00 +00:00
a2f07bb317 Fix typo under docs directory (#119657)
This PR fixes typo under `docs` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119657
Approved by: https://github.com/colesbury
2024-02-15 21:14:34 +00:00
2d7a395c0f Fix typo in functional.py (#119775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119775
Approved by: https://github.com/colesbury
2024-02-15 21:14:29 +00:00
c3b4d78e17 [Dynamo][Easy] Fix a small bug in test_trace_rules.py (#119973)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119973
Approved by: https://github.com/zou3519
2024-02-15 20:44:32 +00:00
b4c7afe101 [pytree] Require serialized_type_name (#119718)
Differential Revision: [D53785493](https://our.internmc.facebook.com/intern/diff/D53785493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119718
Approved by: https://github.com/suo
2024-02-15 20:32:44 +00:00
f32560c939 Remove Redundant Bullet Point (#120007)
Fast path explanation for scaled_dot_product_attention in nn.MultiHeadAttention mentioned inputs being batched with batch_first = True twice.  Removed the second mention of this requirement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120007
Approved by: https://github.com/mikaylagawarecki
2024-02-15 19:47:35 +00:00
605de946cf Clarify the patience in ReduceLROnPlateau (#119872)
Fixes #119763
@janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119872
Approved by: https://github.com/janeyx99
2024-02-15 19:43:06 +00:00
26b6de43e5 Revert "Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)" (#120001)
This reverts commit d833e2f2364a01c6fdab689a8bb5bbf55a5b60f7.

This is failing some RL builds internally using clang 13 D53791577

https://github.com/pytorch/pytorch/pull/119895#issuecomment-1946859332.  The bot doesn't like a commit being merged into the stack base and fails to revert the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120001
Approved by: https://github.com/malfet
2024-02-15 19:41:51 +00:00
9b6fae2d79 Tweak to pr#119719 - eager & fullgraph (#119921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119921
Approved by: https://github.com/oulgen
2024-02-15 19:31:56 +00:00
01ee85c8ab [PyTorch][Vulkan]remove redundant test of log_softmax (#119964)
Summary: `vulkan_api_test.cpp` already has [a test for `log_softmax`](https://www.internalfb.com/code/fbsource/[c79b73bd7d5f661c81ff3cf999cfa1af664f0c48]/xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp?lines=4521), so we remove the redundant `DISABLED_log_softmax`. According to the comment the test was disabled because "the op is not working correctly. Add it back when it is fixed." Actually it's a simple typo mistake: the [CPU output should use `at::log_softmax` instead of `at::softmax`](https://www.internalfb.com/code/fbsource/[c79b73bd7d5f661c81ff3cf999cfa1af664f0c48]/xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp?lines=4548). Since we already have a test for `log_softmax`, the fix isn't necessary and we remove this disabled test.

Test Plan:
Full vulkan_api_test P1184744699:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin
...
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 427 tests from VulkanAPITest (23633 ms total)

[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (23634 ms total)
[  PASSED  ] 426 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```

Reviewed By: jorgep31415

Differential Revision: D53766200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119964
Approved by: https://github.com/jorgep31415
2024-02-15 19:16:56 +00:00
8835ff1b09 [AMD] Update hipify code to oss (#119958)
Summary: Syncing the hipify code to third party. Trunk was broken by multiple diffs D53716382 D53744795

Test Plan: sandcastle

Differential Revision: D53790854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119958
Approved by: https://github.com/jianyuh, https://github.com/drisspg
2024-02-15 19:14:34 +00:00
143b5f2745 Fix the missing device in _memory_profiler (#119751)
Fixes #119722.
1. Added the missing device argument in
```
max_memory_allocated = torch.cuda.max_memory_allocated()
max_memory_reserved = torch.cuda.max_memory_reserved()
```
2. Fixed the device parameter to be a device string (`device_str`). Based on these [lines](2bda6b4cb8/torch/profiler/profiler.py (L291)), the input device is a string (`device_str`) for
```
self.mem_tl.export_memory_timeline_html
self.mem_tl.export_memory_timeline_raw
self.mem_tl.export_memory_timeline
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119751
Approved by: https://github.com/aaronenyeshi
2024-02-15 19:11:15 +00:00
98fd23cccc [EASY] Move OpsHandler and MockHandler to their own file (#119851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119851
Approved by: https://github.com/lezcano
ghstack dependencies: #119728
2024-02-15 18:54:41 +00:00
6f324e8776 [ATen] Tag isinf as a pointwise op (#119728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119728
Approved by: https://github.com/lezcano
2024-02-15 18:54:41 +00:00
eqy
e386bfa688 [CUDA][cuSPARSE] Work around IMA in cuSPARSE ALG1 on SM 8.9 devices (#119610)
Originally surfaced from the discuss forum:
https://discuss.pytorch.org/t/issue-with-torch-sparse-mm-while-running-on-gpu/188669

This has been forwarded to cuSPARSE but we have not yet received a commitment on their end to fix this issue directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119610
Approved by: https://github.com/jeffdaily, https://github.com/jcaip
2024-02-15 18:28:45 +00:00
2429495820 [FSDP2][ez] Made typing more strict to avoid cast (#119985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119985
Approved by: https://github.com/Skylion007, https://github.com/fegin
ghstack dependencies: #118298
2024-02-15 18:20:35 +00:00
840426e793 [export] Log export time. (#119960)
Summary: as title. we are logging the time to complete one export session.

Test Plan: CI

Differential Revision: D53737766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119960
Approved by: https://github.com/angelayi
2024-02-15 18:04:15 +00:00
9b38ee2343 Revert "Alternate sharding (#119078)"
This reverts commit 861acda20577739d52dd0bcf09e162192f25020f.

Reverted https://github.com/pytorch/pytorch/pull/119078 on behalf of https://github.com/clee2000 due to failing 861acda205 ([comment](https://github.com/pytorch/pytorch/pull/119078#issuecomment-1946583857))
2024-02-15 16:59:50 +00:00
a83a1bc43b Adding c10 device type to newly added DeviceAccelerator (#119961)
Follow up to https://github.com/pytorch/pytorch/pull/104364,

A new file got submitted yesterday that uses DeviceType without the c10 namespace. This fixes that. I haven't yet figured out a way to set up a test for this, but I will submit a follow-up PR once I do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119961
Approved by: https://github.com/ezyang
2024-02-15 14:56:05 +00:00
e5bfdde7ba Fix the skip condition for test_c10d tests (#119938)
We are seeing this error for c10d tests when running on a single GPU. Add a skip when there are insufficient GPUs.

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
referring to https://github.com/pytorch/pytorch/pull/84980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119938
Approved by: https://github.com/eqy, https://github.com/fegin
2024-02-15 11:03:39 +00:00
c26884f063 [export] Disable exported_program.__call__ (#119466)
Summary: `ExportedProgram` is an artifact produced by torch.export, containing the graph that is exported, along with other attributes about the original program such as the graph signature, state dict, and constants. One slightly confusing thing that users run into is that they treat the `ExportedProgram` as a `torch.nn.Module`, since the object is callable. However, as we do not plan to support all features that `torch.nn.Module`s have, like hooks, we want to create a distinction between it and the `ExportedProgram` by removing the `__call__` method. Instead users can create a proper `torch.nn.Module` through `exported_program.module()` and use that as a callable.
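
A short usage sketch of the new flow (hedged; this is not code from the diff itself):

```python
# Hedged sketch: obtain a callable nn.Module from an ExportedProgram instead of
# calling the ExportedProgram directly.
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
# ep(torch.randn(2))          # no longer supported after this change
mod = ep.module()             # materialize a proper torch.nn.Module
print(mod(torch.randn(2)))
```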

Test Plan: CI

Differential Revision: D53075378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119466
Approved by: https://github.com/zhxchen17, https://github.com/thiagocrepaldi
2024-02-15 08:49:34 +00:00
f4d641ba2f [export] Change runtime asserts to using assert_scalar (#119608)
By changing runtime symbolic asserts to using assert_scalar, the asserts can call into `expect_true` and modify the shape env so that we can run through the traced graph module with fake tensors. With assert_async, the asserts only get hit during runtime, but that means if we run the graph module with fake tensors, the asserts will not affect the shape env, so later data dependent calls to the fake tensors may result in GuardOnDataDependentSymNode errors.

https://github.com/pytorch/pytorch/issues/119587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119608
Approved by: https://github.com/ezyang
2024-02-15 07:13:42 +00:00
c83af673bc Allow CUDA extension builds to skip generating cuda dependencies during compile time (#119936)
nvcc flag `--generate-dependencies-with-compile` doesn't seem to be supported by `sccache` for now. Builds with this flag enabled will not benefit from sccache.

This PR adds an environment variable that allows users to set this flag and skip generating those nvcc dependencies, to speed up their builds with compiler caches. If everything is a "fresh build" in CI, we don't care about unnecessary recompiles during incremental builds.

related: https://github.com/pytorch/pytorch/pull/49344

- [ ] todo: raise an issue to sccache

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119936
Approved by: https://github.com/ezyang
2024-02-15 07:03:59 +00:00
cyy
d4882e438a [DeviceIndex][5/N] Use DeviceIndex in more places (#119866)
This PR follows the series of patches beginning with #119142 and fixes various CUDA related methods to use DeviceIndex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119866
Approved by: https://github.com/Skylion007
2024-02-15 07:01:43 +00:00
cyy
68328ad394 Check existence of caffe2::mkl target (#119945)
Fixes #118862
If libtorch is included multiply times in different sub-folders, linking caffe2::mkl may incur errors like
```
  Cannot specify link libraries for target "caffe2::mkl" which is not built
  by this project.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119945
Approved by: https://github.com/ezyang
2024-02-15 06:28:17 +00:00
0898ead2d5 Timestamp Embedding Indices Generated for TD (#119955)
Timestamps the generated embedding indices. Moves the old indices to an `archived/` folder and then uploads the index to a `latest/` folder. There will be a short period in between these operations where there is no index in `latest/`. To handle this case, any workflow fetching the index (such as the retriever) should use a retry with backoff when copying from S3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119955
Approved by: https://github.com/huydhn
2024-02-15 04:48:40 +00:00
af346df6a0 [PyTorch][Vulkan]fix the issue of log 0 after softmax (#119898)
Summary: In some cases the output of `softmax` are so small that they are below the float16 precision. These values are represented as 0 in float16 and result in `-inf` when log is applied. According to [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format#Exponent_encoding), the minimum strictly positive (subnormal) value is 2^−24 ≈ 5.9605 × 10^−8. Therefore, we add 6 x 10^-8 to the output of softmax to avoid the numerical issue.
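
A hedged numerical illustration of the underflow and the epsilon fix (plain PyTorch, not the Vulkan shader code):

```python
# In float16, softmax outputs below ~5.96e-8 (the smallest subnormal) flush to
# zero, so a following log() yields -inf; adding 6e-8 keeps the result finite.
import torch

x = torch.tensor([0.0, 30.0])
p = torch.softmax(x, dim=0).to(torch.float16)   # exp(-30) underflows to 0 in fp16
print(torch.log(p.float()))                     # log(0) -> -inf
print(torch.log(p.float() + 6e-8))              # finite after adding the epsilon
```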

Test Plan:
We add two tests:
- `log_softmax_underflow_exception` tests the log_softmax without adding epsilon to the output of softmax, so we expect to get nan or -inf. (**NOTE**: this test has passed on both devserver and on Android device, but failed on the `
fbsource//xplat/caffe2:vulkan_ops_testAndroid` test on CI. In this test, `log` of small numbers [even `log 0` shows output -88 instead of `-inf`](https://interncache-cco.fbcdn.net/v/t49.3276-7/379414752_342395058779076_6447867753374424757_n.txt?ccb=1-7&_nc_sid=ce8ad4&efg=eyJ1cmxnZW4iOiJwaHBfdXJsZ2VuX2NsaWVudC9pbnRlcm4vc2l0ZS94L3Rlc3RpbmZyYSJ9&_nc_ht=interncache-cco&oh=00_AfApTdId1WOHUqdoSTc66s6adnrQt1YS0NDT-LDppIvX0g&oe=65D0CC99). We cannot reproduce this error on device now, so we **DISABLE** this test for now to integrate into CI.)
- `log_softmax_underflow` tests the updated implementation of log_softmax, nan and -inf have been removed

## test on devserver

```
luwei@devbig984.prn1 /data/users/luwei/fbsource (9f6b78894)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan    //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*log_softmax_underflow*"
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
Buck UI: https://www.internalfb.com/buck2/baaaa683-60da-4dd8-95b9-6848fe1d7d74
Network: Up: 53KiB  Down: 1.4MiB  (reSessionID-9580ce4f-7e1e-4c65-8497-52443329b796)
Jobs completed: 6. Time elapsed: 24.2s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 1, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN      ] VulkanAPITest.log_softmax_underflow
[       OK ] VulkanAPITest.log_softmax_underflow (169 ms)
[----------] 1 test from VulkanAPITest (169 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (169 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 1 DISABLED TEST
```

full test results: P1184164670
```
[----------] 428 tests from VulkanAPITest (21974 ms total)

[----------] Global test environment tear-down
[==========] 428 tests from 1 test suite ran. (21974 ms total)
[  PASSED  ] 427 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 11 DISABLED TESTS
```

## test on device:
- build
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ buck2 build -c ndk.static_linking=true -c pt.enable_qpl=0  --target-platforms=ovr_config//platform/android:arm32-fbsource //xplat/caffe2:pt_vulkan_api_test_binAndroid  --show-output
```
- push to device and run
```
[luwei@devbig984.prn1 /data/users/luwei/fbsource (82c91e8da)]$ adb shell /data/local/tmp/pt_vulkan_api_test_binAndroid --gtest_filter="*log_softmax_underflow*"
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *log_softmax_underflow*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax_underflow_exception
[ RUN      ] VulkanAPITest.log_softmax_underflow
[       OK ] VulkanAPITest.log_softmax_underflow (292 ms)
[----------] 1 test from VulkanAPITest (293 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (294 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 1 DISABLED TEST

```

Reviewed By: yipjustin

Differential Revision: D53694989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119898
Approved by: https://github.com/jorgep31415
2024-02-15 03:59:44 +00:00
cd08dc37f8 Support tracing native functional collective via python APIs (#119103)
Summary:
- Inlined `torch.distributed.distributed_c10d._get_group_size_by_name`
- Updated all torch.compile tests in test_c10d_functional_native.py to use funcol python APIs (as opposed to the dispatcher ops)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119103
Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/wanchaol
2024-02-15 03:33:49 +00:00
cyy
5f9b432494 [2/N] Replace std::tie with structural binding (#119879)
This PR follows #119774; Python-generated code was changed to use structural binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119879
Approved by: https://github.com/albanD
2024-02-15 02:56:34 +00:00
9ff9798716 Fix a bug in kernel analysis with ttir defined args (#119934)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119934
Approved by: https://github.com/aakhundov
2024-02-15 02:49:11 +00:00
7f5b87c953 [torch.compile] Log more compilation time breakdown (#119865)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119865
Approved by: https://github.com/ezyang
2024-02-15 02:20:07 +00:00
516f38a144 [RelEng] Define BUILD_BUNDLE_PTXAS (#119750)
That would bundle PTXAS into a `bin` folder

When compiling for Triton, define `TRITON_PTXAS_PATH` if `ptxas` is bundled with PyTorch. This is needed to make PyTorch compiled against CUDA-11.8 usable with an 11.8 driver, as Triton is bundled with the latest ptxas (CUDA-12.3 at the time of the PyTorch-2.2 release).

Needs 5c814e2527 to produce valid binary builds

Test plan:
- Create dummy ptxas in `torch/bin` folder and observe `torch.compile` fail with backtrace in Triton module.
- Run following script (to be added to binary tests ) against CUDA-11.8 wheel:
```python
import torch
import triton

@torch.compile
def foo(x: torch.Tensor) -> torch.Tensor:
  return torch.sin(x) + torch.cos(x)

x=torch.rand(3, 3, device="cuda")
print(foo(x))
# And check that CUDA versions match
cuda_version = torch.version.cuda
ptxas_version = triton.backends.nvidia.compiler.get_ptxas_version().decode("ascii")
assert cuda_version in ptxas_version, f"CUDA version mismatch: torch build with {cuda_version}, but Triton uses ptxs {ptxas_version}"
```

Fixes https://github.com/pytorch/pytorch/issues/119054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119750
Approved by: https://github.com/jansel, https://github.com/atalman
2024-02-15 02:08:57 +00:00
a07fd51b6b [caffe2] Add an avx512 implementation of adagrad_update (#113289)
Summary: As per title

Test Plan: contbuilds

Differential Revision: D50947444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113289
Approved by: https://github.com/ezyang
2024-02-15 01:45:30 +00:00
861acda205 Alternate sharding (#119078)
Changes sharding to attempt to put all serial tests on as few shards as possible. Parallel tests are then distributed across all shards, with most of them likely ending up on the non-serial shards.

Example: 8 minutes of serial tests, 20 minutes of parallel tests, 2 proc per machine, 6 machines
-> 8 + 20/2 = 18 total minutes of tests
-> 18 / 6 machines = 3 min per machine
-> all serial tests should fit on 3 machines (3min, 3 min, 2min)
-> majority of parallel tests should go on last 4 machines, one of which is shared with the serial tests
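
The same arithmetic as the example above, as a hedged sketch (not the actual run_test.py sharding code):

```python
# Back-of-the-envelope sharding math from the example above.
import math

serial_min, parallel_min = 8.0, 20.0
procs_per_machine, machines = 2, 6

effective_total = serial_min + parallel_min / procs_per_machine   # 18 minutes
per_machine = effective_total / machines                          # 3 minutes per shard
serial_shards = math.ceil(serial_min / per_machine)               # serial tests fit on 3 shards
print(effective_total, per_machine, serial_shards)                # 18.0 3.0 3
```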

Move serial tests to run first

If I want to move to purely numbers-based sharding, this ensures that parallel tests run alongside parallel tests as much as possible instead of interleaving serial + parallel tests (which decreases the effectiveness of parallelization), while also ensuring that test reordering is still mostly effective.

See 73e816ee80 for example logs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119078
Approved by: https://github.com/huydhn
2024-02-15 01:32:44 +00:00
b4252d73b1 Make pattern matcher more robust (#119876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119876
Approved by: https://github.com/cccclai
2024-02-15 00:48:16 +00:00
daf1050ae5 [dtensor] refactor sharding cost model to count for latency (#119897)
This PR refactors the sharding cost model to do a more accurate
estimation of redistribute cost, including both collective latency and
communication time.

The previous cost model did not rescale the latency and communication
time, so the latency factor was too small to be counted; in
the case of small tensors, multiple collectives were preferred over a
single collective, which is wrong.
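
A hedged sketch of the rescaled cost estimate (made-up constants; the real model lives in the DTensor sharding propagation code):

```python
# Count a fixed per-collective latency in addition to the bandwidth term, so
# many small collectives no longer look cheaper than one larger collective.
def redistribute_cost(bytes_moved, num_collectives,
                      latency_us=30.0, bandwidth_gb_s=100.0):
    latency = num_collectives * latency_us                    # startup cost per collective
    transfer = bytes_moved / (bandwidth_gb_s * 1e9) * 1e6     # transfer time in microseconds
    return latency + transfer

# For a small tensor, one collective now beats four once latency is counted.
print(redistribute_cost(4096, 1), redistribute_cost(4096, 4))
```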

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119897
Approved by: https://github.com/tianyu-l
2024-02-15 00:35:56 +00:00
99cb807e25 Skip test_wrap_bad if run under pytest (#115070)
Pytest replaces sys.stdout/stderr with `TextIOWrapper` instances which do not support `fileno()`.
Hence, skip that test in this case.
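
A hedged sketch of the skip condition (hypothetical helper and test names):

```python
# Skip tests that need real file descriptors on sys.stdout/sys.stderr when the
# suite is driven by pytest, whose captured streams lack fileno().
import sys
import unittest

def running_under_pytest() -> bool:
    return "pytest" in sys.modules

class TestWrap(unittest.TestCase):
    @unittest.skipIf(running_under_pytest(),
                     "pytest replaces stdout/stderr with objects lacking fileno()")
    def test_wrap_bad(self):
        sys.stdout.fileno()   # would fail under pytest's captured streams
```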

Fixes #115069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115070
Approved by: https://github.com/clee2000
2024-02-15 00:10:05 +00:00
d833e2f236 Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)
Thanks to discussion with @mikekgfb I've realized that SVE is the
feature available by default on Apple Silicon, so let's use it to speed up the
portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` by using the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FCVT--Floating-point-convert-precision--predicated--)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 msec to 27 msec (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119895
Approved by: https://github.com/mikekgfb
ghstack dependencies: #119892
2024-02-14 23:42:53 +00:00
096ebcca73 [FSDP2] Added gradient accumulation w/o reduction (#118298)
This PR adds a way to do gradient accumulation without collectives (i.e. reduce-scatter for FSDP and reduce-scatter/all-reduce for HSDP, though HSDP is not yet implemented). Since the `no_sync()` context manager has received some feedback, we simply define a method on the module to set whether the module requires gradient synchronization or not, where this method can recurse or not.
```
# Before with `no_sync()`:
with fsdp_model.no_sync() if not is_last_microbatch else contextlib.nullcontext():
  # Forward/backward

# After with a setter:
fsdp_model.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```
Having the method be able to recurse or not also gives some flexibility. For example, some large modules can still reduce-scatter, while some smaller modules can avoid it to save communication bandwidth:
```
fsdp_modules_to_reduce_scatter: Set[nn.Module] = ...
for module in fsdp_model.modules():
  if isinstance(module, FSDP) and module not in fsdp_modules_to_reduce_scatter:
    module.set_requires_gradient_sync(not is_last_microbatch)
# Forward/backward
```

(Separately, we may expose a helper for `return [module for model.modules() if isinstance(module, FSDP)]`.)

---

To show the spirit of this API choice, I also included `set_requires_all_reduce` that would give us the ability to only reduce-scatter but not all-reduce for HSDP (originally from the MiCS paper). If we want to flexibly support heterogeneous sharding where FSDP is applied to some modules and HSDP to others in the same model, then having a module-level method that has the option to not recurse makes sense to me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118298
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223, #118755, #119825
2024-02-14 23:09:59 +00:00
8f27fde2f5 [export] Log private api uses. (#119848)
Summary:
as title.
The following APIs are logged:
- capture_preautograd_graph
- torch._export.aot_compile
- external usage of _export_to_torch_ir (AOTInductor, Pippy)
- constraints API
- public use of torch._dynamo.export

Test Plan: CI

Differential Revision: D53735599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119848
Approved by: https://github.com/suo
2024-02-14 22:58:23 +00:00
340b6fa972 Deduplicate docs between global and non-global full backward hooks (#119708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119708
Approved by: https://github.com/albanD
ghstack dependencies: #114970
2024-02-14 22:53:44 +00:00
3713103db4 Revert "[Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)"
This reverts commit 4e93b00b692118b8531f3807ec95eb4c538ea419.

Reverted https://github.com/pytorch/pytorch/pull/119450 on behalf of https://github.com/soulitzer due to Regressed perf on the dashboard ([comment](https://github.com/pytorch/pytorch/pull/119450#issuecomment-1944876761))
2024-02-14 22:44:21 +00:00
756cf2913d Fix NJT stride access in SDPA dispatcher logic (#119846)
`._stride` -> `._strides`

Adds test to cover this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119846
Approved by: https://github.com/drisspg, https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #119910
2024-02-14 22:37:52 +00:00
0560c193a6 Fix meta registration for _flash_attention_forward() [ROCm forward fix] (#119910)
Addresses ROCm failures from #119812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119910
Approved by: https://github.com/drisspg
2024-02-14 22:37:52 +00:00
734ae20f2e [C10] Expand half unittest (#119892)
So far it's been only testing legacy conversion, rather than the one actually used when `at::Half` is constructed
Test `fp16` to `fp32` for the whole range of its 65536 values, though skip NaN comparisons, as different algorithms are not guaranteed to yield identical NaN representations and they are different anyway.

Do a small code cleanup, remove extraneous semicolons as well as named namespace inside unnamed one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119892
Approved by: https://github.com/kit1980
2024-02-14 22:32:43 +00:00
3470ab42bb [DCP] Automatically set no_dist if distributed is unavailable (#119813)
[DCP] Automatically set `no_dist` if distributed is unavailable
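
Conceptually (a hedged sketch, not the exact DCP code):

```python
# Default no_dist to True when there is no usable process group.
import torch.distributed as dist

def resolve_no_dist(no_dist=None):
    if no_dist is None:
        no_dist = not (dist.is_available() and dist.is_initialized())
    return no_dist
```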

Differential Revision: [D53718043](https://our.internmc.facebook.com/intern/diff/D53718043/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119813
Approved by: https://github.com/fegin, https://github.com/wz337
2024-02-14 22:25:07 +00:00
cd380c794f [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-02-14 22:02:06 +00:00
9ec8dd2467 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-14 22:00:43 +00:00
6b04251b87 [inductor][scheduler] Use set for origin (#119861)
xref - https://github.com/pytorch/pytorch/issues/119440

This avoids node > node comparison if the origin order is same in the origins tuple. However, I am unable to come up with a test case where this could happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119861
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-02-14 22:00:38 +00:00
29235c7063 Handle aliases correctly in foreach (#119508)
Fixes https://github.com/pytorch/pytorch/issues/119436

<s>In essence we need to ensure aliases are run in separate foreach kernels so that they are ordered correctly. Previously, aliases could end up in the same kernel which creates weird scheduling dependencies.</s>

There was a bug in cycle detection/can_fuse which was creating cycles when more than two aliases were used in foreach nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119508
Approved by: https://github.com/jansel
2024-02-14 21:21:28 +00:00
e0f6fa6a7c Windows Dynamo Error Removal CI Check (#115969)
Rebase of #111313 onto `main`, for CI validation

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115969
Approved by: https://github.com/PaliC, https://github.com/thiagocrepaldi
2024-02-14 21:14:36 +00:00
9201d7335a Add pixel_shuffle to core aten decomps (#119899)
Summary: https://github.com/pytorch/pytorch/pull/118239 added a decomposition for pixel_shuffle, so pixel_shuffle no longer needs to be a Core ATen Op. We have also fixed the internal use case so that it no longer special cases on pixel_shuffle, allowing us to revert the changes in https://github.com/pytorch/pytorch/pull/118921.

Test Plan: CI

Differential Revision: D53766709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119899
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-02-14 21:01:11 +00:00
244b124bb8 Add linux cpu test for 3.12 (#117853)
This is continuation of work: https://github.com/pytorch/pytorch/pull/113987

Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853
Approved by: https://github.com/albanD
2024-02-14 20:52:23 +00:00
bb67a28738 [DTensor] Enable Adamax foreach optimizer (#119850)
Enable Adamax foreach optimizer and add DTensor unit test for Adamax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119850
Approved by: https://github.com/wanchaol
2024-02-14 20:43:00 +00:00
2aad3f93f8 Fix guards for field access through properties (#119719)
When building guards that went through a property, we were analyzing the property using getattr_static, but the guard itself wasn't built using getattr_static. So if the property was "unusual", we generated misbehaving code which referenced a non-existent `__closure__` field.

Fixes #118786

Note that after this change some of the referenced tests are still failing with a different error - but getting further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119719
Approved by: https://github.com/oulgen
2024-02-14 20:42:55 +00:00
7797a8c2cb [testing][inductor] Allow grad tolerance override (#119844)
Introduce `grad_atol` and `grad_rtol` kwargs, default behavior is
preserved by using `atol` and `rtol` values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119844
Approved by: https://github.com/peterbell10
2024-02-14 20:18:48 +00:00
15f1b9f1c4 Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:

* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols that you may be interested in are now explicitly spelled out in the error message.
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
2024-02-14 20:01:07 +00:00
0e6eee3c89 [ROCm] TunableOp (#114894)
Some operations, such as GEMMs, could be implemented using more than one library or more than one technique. For example, a GEMM could be implemented for CUDA or ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and hipblaslt libraries allow the user to query for all possible algorithms and then choose one. How does one know which implementation is the fastest and should be chosen? That's what TunableOp provides.

See the README.md for additional details.
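
The core idea, as a hedged Python sketch (the real implementation is C++ under aten/src/ATen/cuda/tunable and benchmarks far more carefully):

```python
# Tune-then-cache: time each candidate implementation of an op once, remember
# the fastest, and use it for subsequent calls with the same signature.
import time

_best_impl = {}   # op signature -> chosen implementation

def tunable_call(signature, candidates, *args):
    if signature not in _best_impl:
        timings = []
        for impl in candidates:
            start = time.perf_counter()
            impl(*args)                                   # benchmark run
            timings.append((time.perf_counter() - start, impl))
        _best_impl[signature] = min(timings, key=lambda t: t[0])[1]
    return _best_impl[signature](*args)
```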

TunableOp was ported from onnxruntime starting from commit 08dce54266.  The content was significantly modified and reorganized for use within PyTorch.  The files copied and their approximate new names or source content location within aten/src/ATen/cuda/tunable include the following:

- onnxruntime/core/framework/tunable.h -> Tunable.h
- onnxruntime/core/framework/tuning_context.h -> Tunable.h
- onnxruntime/core/framework/tuning_context_impl.h -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/gemm_common.h -> GemmCommon.h
- onnxruntime/core/providers/rocm/tunable/gemm_hipblaslt.h -> GemmHipblaslt.h
- onnxruntime/core/providers/rocm/tunable/gemm_rocblas.h -> GemmRocblas.h
- onnxruntime/core/providers/rocm/tunable/gemm_tunable.cuh -> TunableGemm.h
- onnxruntime/core/providers/rocm/tunable/rocm_tuning_context.cc -> Tunable.cpp
- onnxruntime/core/providers/rocm/tunable/util.h -> StreamTimer.h
- onnxruntime/core/providers/rocm/tunable/util.cc -> StreamTimer.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114894
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-02-14 19:03:49 +00:00
90f785dc34 Change default TORCH_LOGS format to match Meta/glog standard (#119869)
Before:

```
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-13 19:34:50,591] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['x'], 70049616)                            # assert x.shape[0] > 2  # b.py:5 in f
[2024-02-13 19:34:50,592] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # assert x.shape[0] > 2  # b.py:5 in f
```

After this change, the logs look like this:

```
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1023 [0/0] GUARDS:
V0214 07:00:49.354000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] ___check_type_id(L['x'], 70050096)                            # assert x.shape[0] > 2  # b.py:5 in f
V0214 07:00:49.355000 139646045393920 torch/_dynamo/guards.py:1039 [0/0] hasattr(L['x'], '_dynamo_dynamic_indices') == False           # assert x.shape[0] > 2  # b.py:5 in f
```

The main differences from what we had before:

* We don't print DEBUG/INFO/WARNING, instead, we only print a single character. DEBUG, somewhat oddly, maps to V, because it corresponds to glog VERBOSE
* The year is omitted, and a more compact representation for date/month is adopted. Somewhat perplexingly, six digits are allocated for the nanoseconds, even though Python typically doesn't have that level of resolution
* The thread ID is included (in a containerized environment, this thread id will be typically much lower)
* Instead of using the module name, we give a filepath, as well as the line the log message was emitted from. I think the line number is a nice touch and improvement over our old logs, but one downside is we do lose the artifact name in the log message, in case anyone was grepping for that.
* I chose to move the compile id prefix to the very end so as to keep a uniform layout before it, but I do think there are benefits to having it before the filename

Meta only: This format was reverse engineered off of 6b8bbe3b53/supervisor/logging.py and https://www.internalfb.com/code/fbsource/[e6728305a48540110f2bdba198aa74eee47290f9]/fbcode/tupperware/front_end/log_reader/filter/StreamingLogLineFilter.cpp?lines=105-114

Now, I think this may be slightly controversial, but I have chosen to apply this format *by default* in OSS. My reasoning is that many PT2 developers work with the logs in OSS, and keeping the format identical to what we run in prod will make it easier for these skills to transfer.

The non-negotiable portion of the new format is "V0213 19:28:32"; the date string is expected to be in exactly this form or Tupperware will fail to parse it as a date.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119869
Approved by: https://github.com/oulgen, https://github.com/mlazos, https://github.com/Skylion007
2024-02-14 18:56:35 +00:00
d999222fba [dtensor] add op support for nll_loss_backward (#119256)
As titled. This is a followup to PR #118917 on nll_loss_forward. It also fixes an issue in it: the forward function produces two return values, the loss `result` and the `total_weight`. The previous PR didn't explicitly deal with the `total_weight` part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119256
Approved by: https://github.com/wanchaol
2024-02-14 18:50:42 +00:00
47182a8f4b Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) before our own logic retries it as well.

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-14 18:40:23 +00:00
6cf48187c5 [export] Remove references to capture_pre_autograd_graph inside test_export (#119875)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D53728889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119875
Approved by: https://github.com/angelayi
2024-02-14 17:59:10 +00:00
ee3a7bdc2d [export] Don't error if nn_module_stack doesn't contain a class (#119753)
Summary: When we deserialize nn_module_stack, sometimes the module no longer exists in the python environment so we cannot deserialize it back into the python type and instead it's kept as a string. This causes downstream failures when retracing due to one of our checks in export. This diff just bypasses the check.

Test Plan: CI

Reviewed By: chakriu

Differential Revision: D53527706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119753
Approved by: https://github.com/zhxchen17
2024-02-14 16:56:11 +00:00
3e21c785a4 [ROCm] Initial ir.Scan/aten.cumsum lowering support on ROCm (#119369)
It was noted in https://github.com/pytorch/pytorch/pull/117992 that ROCm is still falling back to eager for scans with inductor.

Initially as part of https://github.com/pytorch/pytorch/pull/106581 ROCm was disabled on this feature due to lack of triton support.

This PR will enable support for lowering scan operations on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119369
Approved by: https://github.com/peterbell10
2024-02-14 16:13:46 +00:00
fb492f7ca1 [inductor] Reorder if check to avoid more expensive check. (#119817)
If `mkldnn` is not enabled or not available there is no point in performing a relatively expensive `all` check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119817
Approved by: https://github.com/Skylion007
2024-02-14 16:04:31 +00:00
184605ae7d [inductor] Replace generators with map. (#119818)
It's more concise and efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119818
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze
2024-02-14 16:02:52 +00:00
edd9ddf73f Propagate allow_non_graph_fake between get_fake_values_from_nodes and get_fake_values (#119731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119731
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314, #119435
2024-02-14 15:26:17 +00:00
cyy
87c6cd2f00 [1/N] Replace std::tie with structural binding (#119774)
This PR replaces some std::tie calls with structural binding from C++17.  This not only makes the code more compact, but also has some performance gain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119774
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-14 09:25:04 +00:00
a45c627f27 [c10d][flight recorder] store a copy of string in entry (#119837)
Summary:
Previously, we just stored the char pointer in the entry; the string is a
temporary object and will already be destructed by the time we want to dump/access it.

A quick fix is to store a copy of the string, but without changing the
upstream char*.

An alternative is to change every profilingTitle into std::string; however,
this would need a comprehensive overhaul of the code up to the
c10d::work layer above workNCCL, RecordFunction, etc.

We chose the first option for this change

Resolve #119808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
2024-02-14 09:13:56 +00:00
4a50572c92 [inductor] Recursivly unwrap_storage_for_input when convert_to_reinterpret_view fails (#119867)
Summary:
When, during the `ExternKernel.realize_input` call, the underlying `ExternKernel.convert_to_reinterpret_view` fails, we currently fall back to `cls.copy_input` here:

31e59766e7/torch/_inductor/ir.py (L3805-L3816)

This creates a `TensorBox(StorageBox(...))` wrapped output, which causes a problem for this assertion:

31e59766e7/torch/_inductor/ir.py (L3479)

Here we add special-case handling to unwrap `x` recursively.
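
For illustration, a minimal sketch of the recursive unwrapping, assuming inductor's `TensorBox`/`StorageBox` wrappers expose the wrapped node via `.data` (this is a sketch, not the actual helper added in `ir.py`):

```python
from torch._inductor.ir import StorageBox, TensorBox

def unwrap_storage_boxes(x):
    # peel nested TensorBox(StorageBox(...)) layers until we reach the
    # underlying IR node, instead of tripping the assertion on wrapped outputs
    while isinstance(x, (TensorBox, StorageBox)):
        x = x.data
    return x
```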

Test Plan:
This local repro:

```
@torch.compile()
def f(a, b, mat1, mat2):
    bias = torch.bmm(a + 3.14, b).permute(0, 2, 1).reshape(3992, -1)
    return torch.addmm(bias, mat1, mat2)
f(
    torch.randn(3992, 20, 40).cuda(),
    torch.randn(3992, 40, 192).cuda(),
    torch.empty(3992, 1024).cuda(),
    torch.empty(1024, 3840).cuda(),
)
```

with this line:

690f54b0f5/torch/_inductor/fx_passes/post_grad.py (L650)

changed to `if cond(*args, **kwargs):` fails before and succeeds after this PR.

Differential Revision: D53743146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119867
Approved by: https://github.com/xw285cornell
2024-02-14 07:50:34 +00:00
9f44274373 Add tests to verify disabled optimizers (#118919)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118919
Approved by: https://github.com/janeyx99
2024-02-14 07:45:16 +00:00
ca55468416 Target Determinator Indexer Workflow (#118824)
As described in [this talk](https://www.youtube.com/watch?v=I95KmF6KSIA) and [this repo](https://github.com/osalpekar/llm-target-determinator),  we are experimenting with using CodeLlama-powered information retrieval for target determination.

The idea is that we create embeddings for PyTorch test functions, and store this index in S3. Then when a new PR comes in, we create embedding(s) for that PR, compare them to the index of test embeddings, and run only the most relevant tests.

This PR creates a workflow that does the indexing part (creating embeddings for functions and store in S3). All the logic for running the indexer is in [osalpekar/llm-target-determinator](https://github.com/osalpekar/llm-target-determinator). This workflow just checks out the relevant repos, installs the dependencies, runs the torchrun command to trigger indexing, and uploads the artifacts to S3.
Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118824
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-14 06:21:18 +00:00
caf9d9d7c1 [executorch hash update] update the pinned executorch hash (#119733)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119733
Approved by: https://github.com/pytorchbot
2024-02-14 06:15:25 +00:00
54a30f6d4e [Dynamo] Update trace_rules.py and re-enable skipped tests (#119860)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119860
Approved by: https://github.com/angelayi
2024-02-14 05:22:55 +00:00
8ba2675488 Fix for-loop divisibility parsing (#119859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119859
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836, #119838
2024-02-14 05:09:59 +00:00
1f0e4ac146 Add support for while-loops in ttir analysis (#119838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119838
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835, #119836
2024-02-14 05:09:59 +00:00
5ffac768f6 Add support for labels to ttir analysis (#119836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119836
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834, #119835
2024-02-14 05:09:59 +00:00
3f09c5ee66 Add TTIR verification (#119835)
Make sure the TTIR generated is valid before attempting to analyze. Incorrectly written triton code would produce broken TTIR. Minor discussion on https://github.com/openai/triton/issues/3120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119835
Approved by: https://github.com/aakhundov
ghstack dependencies: #119834
2024-02-14 05:09:59 +00:00
b257ff80da Add test scf.for with multi return (#119834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119834
Approved by: https://github.com/aakhundov
2024-02-14 05:09:59 +00:00
72bbbab70a Add the missing test_dynamo_list_index from #119151 (D53392287) (#119854)
D53392287 botched the export somehow and the exported PR https://github.com/pytorch/pytorch/pull/119151 didn't contain the added test.  The discrepancy is showing up on the diff train patch-up diff D53694548.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119854
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-02-14 04:10:02 +00:00
563f1b9fef [inductor] Use torch.cuda.clock_rate instead of triton.testing.nvsmi (#118662)
`triton.testing.nvsmi` invokes `nvidia-smi` as a subprocess, and Meta
prod usually doesn't make nvidia-smi available.  Might as well just use
something that's native to torch.

Differential Revision: [D53235814](https://our.internmc.facebook.com/intern/diff/D53235814/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118662
Approved by: https://github.com/jansel
2024-02-14 03:23:49 +00:00
80379ef0aa [dynamo-must-fix] Use ID_MATCH for UserDefinedClass (#119853)
Fixes https://github.com/pytorch/pytorch/issues/119715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119853
Approved by: https://github.com/jansel
2024-02-14 03:14:42 +00:00
4240304da4 [TorchElastic] Handle SystemExit with code == 0 (#119697)
Summary:
Fix for a case where the --run-path option fails to exit if the script exits with a non-error status code.
When there is an error exit code, run-path correctly detects the error and fails when calling spawn.join(). However, for the non-error case, the current behavior is to check the return value of the operation; the fix is to return None so that our MP code detects an exit.

Test Plan:
cat /tmp/script.py
~~~
import sys
def main():
    exit_code = 1
    if len(sys.argv) > 1:
        exit_code = int(sys.argv[1])
    sys.exit(exit_code)

if __name__=="__main__":
    main()
~~~

Case of exit code with 0 (prior behavior - never exits):
torchrun --run-path /tmp/script.py 0

~~~
[2024-02-12 09:20:57,523] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:20:58,980] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
(conda:pytorch) ➜  workspace echo $?
0
~~~

Existing behavior for non-zero exit code still works:
torchrun --run-path /tmp/script.py
~~~
(conda:pytorch) ➜  workspace torchrun --run-path /tmp/script.py
[2024-02-12 09:16:20,667] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:22,197] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 64668) of fn: run_script_path (start_method: spawn)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] Traceback (most recent call last):
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/api.py", line 441, in _poll
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     self._pc.join(-1)
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]   File "/Users/kurman/workspace/pytorch/torch/multiprocessing/spawn.py", line 177, in join
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR]     raise ProcessExitedException(
[2024-02-12 09:16:25,795] torch.distributed.elastic.multiprocessing.api: [ERROR] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
Traceback (most recent call last):
  File "/Users/kurman/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 812, in main
    run(args)
  File "/Users/kurman/workspace/pytorch/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kurman/workspace/pytorch/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-12_09:16:25
  host      : kurman-mbp.dhcp.thefacebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 64668)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
~~~

Differential Revision: D53653874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119697
Approved by: https://github.com/wconstab
2024-02-14 03:09:09 +00:00
5ce305270b Add a decomposition for isin() (#115390)
Co-authored-by: Peter Bell <peterbell10@live.co.uk>
Co-authored-by: Mario Lezcano Casado <3291265+lezcano@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115390
Approved by: https://github.com/peterbell10
2024-02-14 03:03:42 +00:00
75a6d6aef7 [inductor] Support storage resizing (#119749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119749
Approved by: https://github.com/yf225
ghstack dependencies: #119647, #119671
2024-02-14 03:03:38 +00:00
31e59766e7 Fix meta registration for _flash_attention_forward() (#119812)
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
2024-02-14 02:38:53 +00:00
179ecab7e7 Do full checkout in lint workflow to rebuild new Docker images (#119858)
From https://github.com/pytorch/pytorch/pull/119575, using `fetch-depth: 1` didn't work for `calculate-docker-image` when rebuilding a new one.  Specifically, doing a full checkout is needed for `git rev-parse HEAD~:.ci/docker` to get the Docker tag.

This shows up as a trunk failure after the recent Docker image update 507db17675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119858
Approved by: https://github.com/PaliC, https://github.com/clee2000, https://github.com/malfet
2024-02-14 02:37:54 +00:00
690f54b0f5 [dynamo][nit] Cleanup analyze_kernel_mutations nits. (#119703)
Using `extend` is more efficient and other changes are stylistic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119703
Approved by: https://github.com/Skylion007
2024-02-14 02:04:13 +00:00
f9f0c67445 beef up non-overlapping checks for detecting false aliasing of graph inputs (#119826)
This extra check is needed for some more complicated parameter sizes/strides for an internal model

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119826
Approved by: https://github.com/albanD
2024-02-14 01:46:30 +00:00
c9459e7f55 Update atomicMaxFloat (#119577)
# Summary

Initially reported in https://github.com/pytorch/pytorch/issues/119320

I found that by updating this function the NaN values went away. I then created a godbolt to try and highlight the difference between the two versions:
https://godbolt.org/z/3sKqEqn4M

However, they appear to always produce the same value as the nvcc version is varied, except that for some versions -inf is chosen and for others the correct subnormal is chosen... I am having a hard time finding an isolated test case for this but will keep working on it.

### Update:
I added printf statements to the version and indeed some values/*addr contain -0.0f. Hence why this update fixes the reported issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119577
Approved by: https://github.com/yifuwang
2024-02-14 01:17:16 +00:00
suo
8e029dc616 [export] fix tuple return with symints (#119829)
as title.

Differential Revision: [D53726648](https://our.internmc.facebook.com/intern/diff/D53726648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119829
Approved by: https://github.com/zhxchen17, https://github.com/khabinov
2024-02-14 01:16:38 +00:00
4a5b2cd6cb Revert "Windows Dynamo Error Removal CI Check (#115969)"
This reverts commit 45e7af5818f1d4ab1cf568390b3721b9be4251a9.

Reverted https://github.com/pytorch/pytorch/pull/115969 on behalf of https://github.com/PaliC due to this pr ended up breaking some of our periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115969#issuecomment-1942934386))
2024-02-14 01:11:46 +00:00
16369816a2 [sparse] semi-structured sparse refactor (#117302)
Summary:

This PR is a refactor of semi-structured sparsity support.

**deprecation**:

Before, `torch.sparse.to_sparse_semi_structured` had a kwarg param
`transposed=False`, which has been removed. This kwarg was unused, and
passing it now throws a deprecation warning.

Namely, I've taken the subclassing implementation that xFormers has
created and brought it over to PyTorch, as part of our plan to upstream
runtime 2:4 sparsity.

I've also copied over all the op support that Daniel implemented that
did not depend on the fast sparsification routines into
`_sparse_semi_structured_ops.py`.

With this subclass, all of our internal tests pass, as well as those in
xFormers.

The main change is that we now define a base subclass,
`SparseSemiStructuredTensor` that is inherited from for each of the
specific backends.

We can also now arbitrarily override the sparse dispatch table with
`_load_dispatch_table()`, the idea being that this is still general enough
that users don't need to modify pytorch source code to get their model
working.

This also adds in padding support and stores alg_id and fuse_transpose
as flags on the tensor, instead of hardcoding them.

There still remains two components in xFormers that will need to be
ported over eventually:
- the autograd functions  (`Sparsify24`, `Sparsify24_like`)
- fast sparsification routines that they rely on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117302
Approved by: https://github.com/alexsamardzic, https://github.com/HDCharles
2024-02-14 01:10:40 +00:00
2536c5186e [BE] Properly mark destructor overrides (Take 2) (#119656)
Otherwise, at least on MacOS builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MTIAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~CUDAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MPSHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
```

 Likely introduced by https://github.com/pytorch/pytorch/pull/119329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
2024-02-14 01:05:58 +00:00
cyy
cb0886ecf2 [DeviceIndex][4/N] Use DeviceIndex in more places (#119741)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741
Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang
2024-02-14 00:29:10 +00:00
suo
b2e779868f make internal lintrunner mypy clean (#119840)
as title

Differential Revision: [D53732505](https://our.internmc.facebook.com/intern/diff/D53732505/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119840
Approved by: https://github.com/ezyang
2024-02-14 00:25:42 +00:00
507db17675 Update HF pin (#119717)
Sometime between now and the previous pin update, HF introduced a
ModelOutputs type, which was not pytree serializable, causing
aot_compile to fail on new HF models
(https://fb.workplace.com/groups/1075192433118967/permalink/1377977852840422/).
With https://github.com/huggingface/transformers/pull/27871, we
can now pytree serialize HF ModelOutputs types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119717
Approved by: https://github.com/desertfire
2024-02-14 00:17:16 +00:00
b51e0246b7 sccache version update (#119554)
Fixes #37928

`sccache` is updated to a newer version (`v0.7.4`) to fix the `multiple input files` non-cacheable calls for `CUDA` builds.

This should make `Cache hits (CUDA)`  work as expected and improve the speed dramatically.

---

Additional information:

- Modified the `install_sccache.bat` check structure due to the GitHub Action error `Process completed with exit code 255.`
    - The error occurs when the freshly downloaded `sccache` is called with the `--show-stats` or `--start-server` arguments within the script
    - Now, it checks for the file's existence and kills/deletes the executable before the download

- Removed `sccache-cl` since it is no longer needed with newer versions of `sccache`

---

`win-vs2019-cpu-py3 / build` - `16m 27s`

![image](https://github.com/pytorch/pytorch/assets/148207261/b5628e6c-64bb-4293-9d07-480f56df44f1)

`win-vs2019-cuda11.8-py3 / build` - `17m 4s` **(previously ~45 mins - 1h30mins)**

![image](https://github.com/pytorch/pytorch/assets/148207261/e4ab01cb-0f56-41e8-984f-110e643b9c09)

Now `Cache Hits (CUDA)` hits all `304` objects and the `Non-cacheable reasons` error is fixed.

![image](https://github.com/pytorch/pytorch/assets/148207261/c8c25d2e-3fc1-4edb-8982-99c1f490cb54)

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119554
Approved by: https://github.com/malfet
2024-02-13 23:50:40 +00:00
be35fc9ea7 Size oblivious test for slice optimization (#119625)
Fixes https://github.com/pytorch/pytorch/issues/119623

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119625
Approved by: https://github.com/albanD
2024-02-13 23:47:52 +00:00
d81d5f52d5 [FSDP2][ez] Replaced groupby with all for same-dtype check (#119825)
The `groupby` logic to check if all all-gather inputs have the same dtype is not so readable. Let us use `all` instead.
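
A minimal sketch of the simplified check (the helper name is hypothetical, not the actual FSDP2 code):

```python
def all_same_dtype(tensors):
    # `all` reads more directly than grouping the tensors by dtype with groupby
    first = tensors[0].dtype
    return all(t.dtype == first for t in tensors)
```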

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119825
Approved by: https://github.com/Skylion007
ghstack dependencies: #119550, #118136, #118223, #118755
2024-02-13 23:28:53 +00:00
cf117e37d5 Refactor THPStorage_resize_ (#119671)
Moving code around to allow it to be reused in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119671
Approved by: https://github.com/yf225
ghstack dependencies: #119647
2024-02-13 23:28:47 +00:00
ca777fbbb7 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-02-13 23:15:24 +00:00
e9b78f2db0 Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324)
Improve performance of inductor searching large graphs for potential fusions.
Also adds some direct unit tests of find_independent_subset_greedy() to ensure that the rewrite didn't break behavior.

Fixes #98467

Previously find_independent_subset_greedy() was recursive and the example from the issue would cause it to blow out the stack. This changes it to be iterative and also caches some of the computed dependencies (it can't cache all of them because the caller is allowed to change the graph during the iteration).
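
A minimal sketch of the recursion-to-iteration idea, under assumed names and a simplified dependency interface (not the actual `torch._inductor` implementation):

```python
def find_independent_subsets_greedy(nodes, transitive_deps):
    # `nodes` is assumed to be in topological order; `transitive_deps(n)` is
    # the set of nodes n (transitively) depends on.  Each yielded subset
    # contains only mutually independent nodes, and a plain while-loop
    # replaces the earlier recursion so huge graphs cannot overflow the stack.
    remaining = list(nodes)
    while remaining:
        chosen, deferred = [], []
        chosen_set = set()
        for node in remaining:
            if transitive_deps(node).isdisjoint(chosen_set):
                chosen.append(node)
                chosen_set.add(node)
            else:
                deferred.append(node)
        yield chosen
        remaining = deferred
```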

Fusion is still slow - but at least finishes.

After this change the example given in #98467 has the following backend timings (on one particular CPU):
eager timing: 3m:23s
aot_eager timing: 4m:12s
inductor timing: 22m:24s

Possible future work to improve this further:
1. In dynamo limit the amount of inlining allowed before falling back to a graph break. This test ends up tracing through 483k bytecodes generating the graph.
2. In inductor have a limit so we don't exhaustively search the graph for fusion possibilities.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118324
Approved by: https://github.com/oulgen
2024-02-13 22:54:53 +00:00
ba1eb0e27f [ROCm] upgrade CI to 6.0 (#119495)
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119495
Approved by: https://github.com/huydhn
2024-02-13 22:39:03 +00:00
df9b44436a [ROCm] Enable float16/complex32 fft tests on ROCm (#117296)
This PR is to enable float16/complex32 fft tests on ROCm.
Sample results are attached here:
[test_spectral_ops_results.log](https://github.com/pytorch/pytorch/files/13908533/test_spectral_ops_results.log)

test_decomp::TestDecompCUDA::test_comprehensive_fft*
test_decomp::TestDecompCUDA::test_quick_fft*
test_jit_fuser_te::TestNNCOpInfoCUDA::test_nnc_correctness_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_meta_outplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_inplace_fft*
test_meta::TestMetaCUDA::test_dispatch_symbolic_meta_outplace_fft*
test_meta::TestMetaCUDA::test_meta_inplace_fft*
test_meta::TestMetaCUDA::test_meta_outplace_fft*
test_ops::TestCommonCUDA::test_complex_half_reference_testing_fft*
test_ops::TestCommonCUDA::test_python_ref__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_executor__refs_fft*
test_ops::TestCommonCUDA::test_python_ref_meta__refs*
test_ops::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft*
test_schema_check::TestSchemaCheckModeOpInfoCUDA::test_schema_correctness_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft__refs_fft*
test_spectral_ops::TestFFTCUDA::test_empty_fft_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error__refs_fft*
test_spectral_ops::TestFFTCUDA::test_fft_half_and_chalf_not_power_of_two_error_fft*
test_spectral_ops::TestFFTCUDA::test_fft_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_fft_type_promotion_cuda*
test_spectral_ops::TestFFTCUDA::test_fftn_round_trip_cuda*
test_spectral_ops::TestFFTCUDA::test_hfftn_cuda_float16
test_spectral_ops::TestFFTCUDA::test_ihfftn_cuda_float16
test_utils::TestDeviceUtilsCUDA::test_device_mode_ops_fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117296
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-02-13 22:35:32 +00:00
63d64c8995 [MPS] Enable more bfloat16 ops (#119738)
Introduce a convenience inlinable `mps::supportedFloatingType` function
that returns true if the type is Float, Half or BFloat16

Test by running LLM inference using bfloat16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119738
Approved by: https://github.com/Skylion007
2024-02-13 22:11:00 +00:00
eb9a3383c2 [MPS] Add naive std_mean implementation (#119777)
By just calling `std_mps` and `mean` in sequence
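
Roughly, the naive composite amounts to the following sketch (not the actual `ReduceOps.mm` code):

```python
import torch

def naive_std_mean(x, dim, keepdim=False):
    # call std and mean in sequence, as described above
    return torch.std(x, dim=dim, keepdim=keepdim), torch.mean(x, dim=dim, keepdim=keepdim)
```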

Move `var_mean` decomp to `ReduceOps.mm`, as it should be faster to skip dispatching to Python, which one can validate by running the following script:
```python
from timeit import default_timer

import torch
from torch.utils.benchmark import Measurement, Timer

def bench_var_mean(
    m, n, k,
    dtype = torch.float32,
    device:str = "cpu",
) -> Measurement:
    setup = f"""
     x = torch.rand({m}, {n}, {k}, dtype={dtype}, device="{device}")
    """

    t = Timer(
        stmt="torch.var_mean(x, dim=1)", setup=setup, language="python", timer=default_timer
    )
    return t.blocked_autorange()

for x in [100, 1000]:
    rc = bench_var_mean(1000, x, 100, device="mps")
    print(f"{x:5} : {rc.mean*1e6:.2f} usec")
```
which before the change reports 681 and 1268 usec, and after, 668 and 684 usec (which probably means that the GPU is not saturated, but the overhead from switching between the native and interpreted runtimes is smaller).

Fixes https://github.com/pytorch/pytorch/issues/119663

TODOs:
 - Refactor the codebase and implement proper composite function (that must be faster)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119777
Approved by: https://github.com/albanD
2024-02-13 21:51:29 +00:00
ee5b59dd4b [ROCm] CatArrayBatchedCopy performance improvement (#118685)
Tune the grid and block sizes for ROCm.  Add a contig kernel separate from aligned+contig.

Verified new performance using pytorch/benchmarks/operator_benchmark.

`python -m pt.cat_test --device=cuda --tag-filter all`

On MI200 this improved performance on average 4%, and on MI300 14%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118685
Approved by: https://github.com/malfet
2024-02-13 21:51:20 +00:00
6665b96ebb Rewrite maybe_reduce more carefully for unbacked SymInt (#119562)
Fixes https://github.com/pytorch/pytorch/issues/119476

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119562
Approved by: https://github.com/albanD
ghstack dependencies: #119559
2024-02-13 21:40:06 +00:00
28f299a870 [c10d] Fix compilation of NCCL_EXP path (#119805)
Fixes issue pointed out in https://github.com/pytorch/pytorch/pull/119421#issuecomment-1941694621

When refactoring ProcessGroupNCCL, some code in the NCCL_EXP path wasn't done cleanly.

Cc: @kunalb @H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119805
Approved by: https://github.com/H-Huang
2024-02-13 21:26:59 +00:00
f9200c8608 [BE][Ez]: FURB129: remove unneeded readlines() (#119796)
Applies a refurb rule to remove any readlines() in a for loop iteration as it just creates a temporary list in memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119796
Approved by: https://github.com/ezyang
2024-02-13 21:21:22 +00:00
3319dbcd23 Update vmap guard to avoid recompilations (#119061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061
Approved by: https://github.com/zou3519
2024-02-13 20:50:23 +00:00
abadbbc4b0 [c10d][flight recorder] remove unintended assignment of entry (#119748)
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The above line of code has the unintended consequence of invoking copy
assignment of entry objects, as the reference itself cannot be re-assigned.

Also, what could cause the crash is that the entry reference could become
invalid if entries_ is resized by other threads, and this could result in a
'copy to a garbage location'. The fix is to use a pointer, which can be
re-assigned after re-acquiring the lock.

Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
2024-02-13 20:18:58 +00:00
34638c82a6 [mergebot] No unique behavior for facebook bot re pending jobs (#119735)
if fb bot says merge without -f, do normal behavior and wait for pending checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119735
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-02-13 20:07:24 +00:00
8ec3d8e35f Fixed FxGraphDrawer compat constructor (#119767)
Match FxGraphDrawer compat constructor signature to avoid the following failure when `pydot` is not installed:
```
  File "/pytorch/torch/_functorch/partitioners.py", line 933, in draw_graph
    g = graph_drawer.FxGraphDrawer(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: __init__() got an unexpected keyword argument 'dot_graph_shape'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119767
Approved by: https://github.com/eellison
2024-02-13 19:36:01 +00:00
8ec8d78ef2 [quant][pt2e][be] Rename eval_utils -> export_utils (#119725)
It's not really eval_utils anymore, since we added some training
related utils. Instead it should be util functions that are
related to general export use cases.

Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725
Approved by: https://github.com/tugsbayasgalan
2024-02-13 19:10:06 +00:00
0a2e000edf [BE] Enabled mypy in common_fsdp.py (#118755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118755
Approved by: https://github.com/Skylion007, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136, #118223
2024-02-13 19:05:30 +00:00
8c1480f568 [FSDP2] Added mixed precision (#118223)
This PR adds mixed precision configured via `MixedPrecisionPolicy`.
- By default (`cast_forward_inputs=True`), each FSDP module will cast forward floating-point input tensors to `param_dtype` if specified. If the user wants to own the cast, then the user can disable it by passing `False`.
- Symmetrically, by default (`output_dtype=None`) each FSDP module will not cast the forward output. If the user wants to customize the output dtype, then the user can pass a `torch.dtype`.
- `param_dtype` configures the unsharded parameters' dtype for forward/backward computation and hence the all-gather dtype.
- `reduce_dtype` configures the gradient reduction dtype. If `reduce_dtype=None` and `param_dtype is not None`, then `reduce_dtype` inherits from `param_dtype` for simplicity.

We test against a manually implemented reference implementation instead of comparing against existing FSDP since the comparison is more direct to what we want to test.
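
A usage sketch of the policy described above; the import path and the `mp_policy` keyword are assumptions about the experimental per-parameter FSDP frontend at the time, not settled public API:

```python
import torch
# assumed import path for the per-parameter FSDP ("FSDP2") frontend
from torch.distributed._composable.fsdp import MixedPrecisionPolicy, fully_shard

# all-gather/compute in bf16, reduce gradients in fp32
policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
# applied when sharding a module, e.g. under an initialized process group / device mesh:
# fully_shard(module, mp_policy=policy)
```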

---

**Overhead benchmarks to inform design**
The dilemma is as follows:
- The common path for FSDP is bf16 parameter mixed precision, where we cast sharded parameters from fp32 to bf16 before all-gathering them.
- The baseline implementation is to `torch._foreach_copy_` the sharded parameters to the flat `all_gather_input`, which gets passed to `dist.all_gather_into_tensor`.
    - The baseline incurs 1 extra fp32 read and 1 extra bf16 write per parameter because `_foreach_copy` takes the slow path, calling `copy_` in a loop, and `copy_` calls `dst.copy_(src.to(bf16))` where `dst` is bf16 and `src` is fp32.
    - These `copy_` calls stay in C++ and do not require calling `at::as_strided`.
- The issue with this baseline implementation is that it requires knowing that all parameters in the group will be cast from fp32 to bf16 to do this `_foreach_copy_` from fp32 sources to a bf16 destination.
- We want per-parameter FSDP to support mixed dtype all-gathers, which would involve different parameters providing different dtype all-gather inputs and viewing them as uint8 for a combined flat all-gather input, where this viewing-as-uint8 step is only needed in the mixed dtype case.
- However, this incurs more CPU overhead, so we want to investigate this in more detail.

We consider 150 `nn.Parameter`s with shapes taken from an internal model (where the shapes only affect the copy bandwidth, not the CPU overhead). We focus on world size 128 first. We consider two experiments: (1) run the copy-in with no head start, allowing CPU boundedness affect GPU time, and (2) run the copy-in with a CPU head start, removing CPU overhead from affecting GPU time.

No head start:
- Baseline `torch._foreach_copy_`: 0.525 ms CPU; 0.528 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.828 ms CPU; 0.836 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.933 ms CPU; 0.937 ms GPU

Head start (removing CPU boundedness from GPU times):
- Baseline `torch._foreach_copy_`: 0.393 ms GPU
- `.to(bf16)` before `torch._foreach_copy_`: 0.403 ms GPU
- `.to(bf16).view(uint8)` before `torch._foreach_copy_`: 0.403 ms GPU

Some other interesting notes:
- Constructing a set of all all-gather input dtypes: ~0.015 ms -- this would be the overhead cost of checking whether we need to view as uint8 (i.e. whether we have mixed dtype); alternatively, we could always view as uint8 (but that loses the mixed precision policy info from the profiler trace)
- Changing from `[t.to(bf16).view(uint8) for t in ts]` to two list comprehensions like `[t.to(bf16) for t in ts]; [t.view(uint8) for t in ts]` actually reduces CPU overhead 🤔  (by ~0.04 ms)

We see that the main difference is just CPU overhead. The GPU times are almost the same. (Actually, sweeping over 8, 16, 32, 64 world size, we do see difference in GPU time inversely proportional to world size, as expected since smaller world sizes copy more data. However, even at world size 8, the difference is only 0.407 ms vs. 0.445 ms GPU time.) Note though that the CPU overhead differences are exacerbated when the PyTorch profiler is turned on, and how much so seems to depend on the CPU capability.

Seeing these numbers, I am inclined to prefer to just incur the CPU overhead, especially given that if we want to support the mixed dtype case for fp8 all-gather, we will need to incur this anyway. If the CPU overhead becomes a problem on a real workload, then we will need to figure out options then, one being using `torch.compile` possibly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118223
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550, #118136
2024-02-13 19:05:30 +00:00
3956ce01e0 [FSDP2] Added autograd/memory/overlap/frozen/2D/AC tests (#118136)
This PR adds tests for autograd (mainly backward hooks), memory, overlap, and frozen parameters.
- Autograd: unused forward output, unused forward module, non-tensor activations (common in internal models)
- Memory: expected GPU memory usage after init, forward, backward, and optimizer step
- Overlap: communication/computation overlap in forward and backward
- Frozen: expected reduce-scatter size, training parity

This PR adds some initial 2D (FSDP + TP) training and model state dict tests. The only change required for model sharded state dict is to make sure parameters are sharded before save and load.

This PR adds tests that `fully_shard` can use `torch.utils.checkpoint`, `_composable.checkpoint`, and `CheckpointWrapper` on a transformer.

(I squashed all of these into one PR now to save CI cost.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118136
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #119550
2024-02-13 19:05:30 +00:00
39c68efd85 [dynamo] Capture untyped_storage().resize_() (#119647)
This makes storage resizing work with `backend=eager`, the next two PRs make it work for inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119647
Approved by: https://github.com/yf225
2024-02-13 19:03:28 +00:00
c0e5cca4f8 [DDP] Change the --no-optimize-ddp flag to reflect the latest usage (#119437)
Compiled DDP now has 4 different optimization modes. This PR changes the Dynamo benchmark flag to reflect that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119437
Approved by: https://github.com/wconstab, https://github.com/xmfan
2024-02-13 16:53:56 +00:00
c2522554dd Prevent DCE'ing unbacked SymInt for view outputs (#119552)
Fixes https://github.com/pytorch/pytorch/issues/119414

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119552
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-02-13 16:32:21 +00:00
52de407b6c Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-13 15:56:59 +00:00
0fd371c868 fix torch.cumsum docs (#117944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117944
Approved by: https://github.com/zou3519
2024-02-13 15:29:06 +00:00
c2a835d710 [inductor] Refactor device guard Python codegen to allow nested indentation (#119673)
Summary: The codegen of `with torch.cuda._DeviceGuard` context manager in the Python wrapper code is implemented via `device_cm_stack: contextlib.ExitStack()`. As the context managers in the stack are `code.indent()`, this means that the whole stack is unindented at once on `device_cm_stack.close()`. This becomes problematic when attempting to codegen indented code (e.g., for control flow in Python and / or nested subgraph codegen-ing).

In this PR, we refactor the device guard codegen-ing in Python by replacing the `device_cm_stack` by explicit indent and unindent calls for entering and exiting the `with torch.cuda._DeviceGuard` context manager. This allows for nested device guard context managers and better aligns with other indented codegen-ing intertwined with it (e.g., for nested subgraph codegen-ing).

This is necessary for the upcoming support for `torch.cond` (and other control flow operators) in Inductor. Before that, the only change in the Python wrapper codegen is that the `return outputs` is now happening outside the `with torch.cuda._DeviceGuard` context manager.
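
As an illustration (not inductor's actual wrapper-codegen class), explicit `indent()`/`unindent()` calls compose with nested codegen in a way that a single `ExitStack` of indents, unwound all at once, cannot:

```python
import io

class IndentedWriter:
    def __init__(self):
        self.buf, self.level = io.StringIO(), 0
    def writeline(self, line):
        self.buf.write("    " * self.level + line + "\n")
    def indent(self):     # explicit calls can be nested arbitrarily,
        self.level += 1   # e.g. device guard -> subgraph -> device guard
    def unindent(self):
        self.level -= 1

w = IndentedWriter()
w.writeline("with torch.cuda._DeviceGuard(0):")
w.indent()
w.writeline("# ... kernel calls, possibly nested subgraph code ...")
w.unindent()
w.writeline("return outputs  # emitted outside the device guard, as described")
print(w.buf.getvalue())
```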

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119673
Approved by: https://github.com/peterbell10
2024-02-13 15:05:30 +00:00
f4b5f710e8 Fix typo in private attr of inference_mode (#119167)
This PR amends #102642.

`torch.inference_mode`'s attribute to store the actual context is inconsistent between `__init__` and `__enter__`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119167
Approved by: https://github.com/albanD
2024-02-13 14:59:59 +00:00
3629287151 Implement analysis for for-loops (#119730)
This PR adds support for for-loop parsing and analysis. While doing so, I ran into some constant value and function name problems so I fixed them as well. Technically, it should be possible to break this into multiple PRs but since these are small, I'm bundling them together.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119730
Approved by: https://github.com/aakhundov
2024-02-13 09:02:53 +00:00
2ae655b4f1 caffe2: remove support for specifically running "flaky tests" (#112007)
Summary:
In March 2019 D14468816 introduced some infra to mark tests as flaky
while still running them. In July 2019 D15797371 removed the last use of this
feature. Remove the related code as well.

Test Plan: ci

Reviewed By: mlogachev

Differential Revision: D50601204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112007
Approved by: https://github.com/malfet
2024-02-13 07:56:37 +00:00
60148f1761 [EZ] Set maximum supported version of Python as 3.12 (#119743)
Doesn't really affect anything other than metadata on PyPI website
Otherwise programming languages tab on https://pypi.org/project/torch/2.2.0/ shows supported version 3.8 to 3.10:
<img width="239" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/e17f9982-8833-4cd8-b8d8-b2f1cb538548">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119743
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-02-13 06:56:32 +00:00
beb0041384 improve cuda graph symint logging msg (#119739)
Users were confused by `recording cudagraph tree for None`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119739
Approved by: https://github.com/mlazos
2024-02-13 06:26:36 +00:00
bfb9ea1a43 fix compile DTensor.from_local in trace_rule_look up (#119659)
There's a bug when converting from TorchVariable to trace rule lookups:
in some corner cases the DTensor.from_local call does not match the trace-name
rule lookup, resulting in a None lookup and a fallback to the
UserFunctionVariable, which makes the tracing silently wrong by tracing
into the DTensor.from_local function. It is not exactly clear yet why the
lookup failed.

This PR fixes the DTensor.from_local tracing to make sure that in every case
we hit the InGraphFunctionVariable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119659
Approved by: https://github.com/yifuwang
2024-02-13 05:21:19 +00:00
379183a0dd Skip log line if no tensors were dedupped (#119742)
Skips log line if nothing was dedupped. Avoids unhelpful logs like:
```
2024-02-13 01:31:52,113 _dedup_tensors.py:46 INFO p:MainProcess t:MainThread: Duplicate keys to remove: {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119742
Approved by: https://github.com/Skylion007
2024-02-13 05:18:16 +00:00
a4c476a081 [BE] Use more GTest primitives in XPU unit tests (#119527)
# Motivation
Use `EXPECT_EQ` to refine XPU's UT when relying on gtest.

# Solution
use `EXPECT_EQ` directly instead of `ASSERT_EQ_XPU`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119527
Approved by: https://github.com/malfet
2024-02-13 05:18:03 +00:00
cyy
47a2e6b6b8 Fix C++20 build (#112333)
Currently C++20 builds fail because of an incorrect template initialization order. This PR adjusts the order of these classes and a constructor to address the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112333
Approved by: https://github.com/albanD
2024-02-13 05:10:19 +00:00
2bda6b4cb8 [DTensor] Only wait on AsyncCollectiveTensor after DTensor-based state dict loading (#119716)
Summary:
This PR serves as a follow-up fix to address numerical correctness concerns identified in PR #118197, and we should only wait on `AsyncCollectiveTensor`.

Without the change, we occasionally ran into the exception: `AttributeError("'Tensor' object has no attribute 'wait'")`

Test Plan:
**CI**:
Wait for the CI test

**Test with prod model**:
- Tested with models and no-longer ran into the exception after checkpoint loading.

Differential Revision: D53680406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337
2024-02-13 04:30:45 +00:00
2502a01110 Linear-BN Fusion: add precondition check (#119264)
Fixes #118990

The root cause is that `out_features` of Linear does not match `num_features` of BatchNorm, resulting in a shape mismatch while computing `fused_w` and `fused_b`. This can happen for linear-bn folding because the linear layer operates over the last dim, `(*, H_in)`, while the bn layer operates over the channel dim, `(N, C_in, H, W)`.

To preserve the shapes of the original linear weight and bias in linear-bn folding, check that linear `out_features` matches bn `num_features`. If they don't match, bn `num_features` needs to be 1 to broadcast.
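
A minimal sketch of that precondition (hypothetical helper, not the actual fusion code):

```python
import torch.nn as nn

def can_fold_linear_bn(linear: nn.Linear, bn: nn.BatchNorm1d) -> bool:
    # BN statistics must broadcast over the linear output: either match
    # out_features exactly, or be of size 1
    return bn.num_features in (linear.out_features, 1)
```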

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264
Approved by: https://github.com/eellison
2024-02-13 04:16:34 +00:00
15ef52a015 [MPS] Enable conj and conj_physical (#119669)
The former is only available on MacOS 14+, but at least on older MacOSes it would raise an exception rather than returning a non-conjugated tensor

Preliminary step for enabling FFT ops (without it `ifft` would never work)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119669
Approved by: https://github.com/albanD
ghstack dependencies: #119681
2024-02-13 02:27:51 +00:00
214f06ae3a Revert "Add Accelerator device and shell hooks (#119329)"
This reverts commit 4b9568a360c4a90220e78e43435be8c56bc33fb2.

Reverted https://github.com/pytorch/pytorch/pull/119329 on behalf of https://github.com/huydhn due to Breaks internal build and requires OSS file update to fix it ([comment](https://github.com/pytorch/pytorch/pull/119329#issuecomment-1940278598))
2024-02-13 02:23:45 +00:00
7d4b666870 Revert "[BE] Properly mark destructor overrides (#119656)"
This reverts commit 069581b3ca354c3b34079d23bc237442d6f28cc3.

Reverted https://github.com/pytorch/pytorch/pull/119656 on behalf of https://github.com/huydhn due to I need to revert this to unblock the revert of https://github.com/pytorch/pytorch/pull/119329#issuecomment-1939637967 and will reland this after resolving the conflicts ([comment](https://github.com/pytorch/pytorch/pull/119656#issuecomment-1940270997))
2024-02-13 02:20:45 +00:00
2921c2b3d9 [mypy] refactor mkldnn_fusion._is_valid_binary to avoid [union-attr] has no attribute (#119085)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119085
Approved by: https://github.com/Skylion007
2024-02-13 02:13:46 +00:00
db228f1efd [Lint] replace [assigment] with [method-assign] for methods (#119706)
started with TODO fix from here https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py#L746
using ignore[method-assign] instead of ignore[assignment]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119706
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kit1980
2024-02-13 02:06:04 +00:00
9f8c84a399 Revert "Add missing include for internal build (#119721)"
This reverts commit e0cabebad94f1cf35742f8ec14f9938be3a195ab.

Reverted https://github.com/pytorch/pytorch/pull/119721 on behalf of https://github.com/huydhn due to This fixes the build failures but there is still an issue with the missing libcaffe2_torch_fb_sparsenn_sparsenn_operators_gpu.so on D53686094 ([comment](https://github.com/pytorch/pytorch/pull/119721#issuecomment-1940191340))
2024-02-13 01:56:12 +00:00
ea8e4fd5ac Support FunctoolsPartialVariable::get_function, fix NamedTupleVariable::as_proxy and handle call_function in get_fake_values_from_nodes (#119435)
partially address https://github.com/pytorch/pytorch/issues/118785
This diff fixes three things:
1. Adds get_function to FunctoolsPartialVariable. Note that it will be available only if all args are constant;
otherwise it would throw unimplemented in the call to asPythonConstant.

2. NamedTupleVariable takes its args dispatched, not as a list (e.g. NamedTuple(a, b, c) vs NamedTuple([a, b, c]));
fix that by specializing asProxy.

3. A call to create_arg from within create_proxy changes a Python NamedTuple into a function call node without
associating an example value! Updated get_fake_values_from_nodes to handle such a case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119435
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #119314
2024-02-13 01:44:08 +00:00
74d55b0e63 [dynamo] Support torch.distributed.fsdp._flat_param._same_storage_size (#119627)
Replaces #117690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119627
Approved by: https://github.com/Skylion007
2024-02-13 01:27:37 +00:00
472500e32a Revert "Avoid performing replacements when it would unrefine ranges (#117356)"
This reverts commit 0e6b314fc2e7c965717e939a4e457a9b9d7e133e.

Reverted https://github.com/pytorch/pytorch/pull/117356 on behalf of https://github.com/huydhn due to Sorry for reverting the change but it looks like the forward fix still needs more work https://github.com/pytorch/pytorch/pull/119712, so it would be cleaner to reland them ([comment](https://github.com/pytorch/pytorch/pull/117356#issuecomment-1940032407))
2024-02-13 01:16:58 +00:00
2492f8748e Revert "Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)"
This reverts commit f208795182a22ebaef84a284750669fa372157cb.

Reverted https://github.com/pytorch/pytorch/pull/119412 on behalf of https://github.com/huydhn due to Sorry for reverting the change but it looks like the forward fix still needs more work https://github.com/pytorch/pytorch/pull/119712, so it would be cleaner to reland them ([comment](https://github.com/pytorch/pytorch/pull/119412#issuecomment-1939937937))
2024-02-13 00:52:19 +00:00
830ed6d9b2 [quant][pt2] Fix _disallow_eval_train error message (#119694)
Fix the message to use the right function name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119694
Approved by: https://github.com/tugsbayasgalan
2024-02-13 00:17:53 +00:00
55483fc2c9 Min-cut partitioner always saves tensors that are returned as-is in backward (#114970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114970
Approved by: https://github.com/Chillee
2024-02-13 00:04:41 +00:00
bd9db6a9c7 Update to TorchFix 0.4.0 (#119424)
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
5acd1f0f7d Add cherry-pick workflow (#119352)
After https://github.com/pytorch/test-infra/pull/4758, we can create a new workflow on PyTorch to receive `try-cherry-pick` dispatch event from the bot, and create the cherry pick PR.

* [x] Cherry pick a PR after it has been landed and create a cherry pick PR to the target release branch.
* [ ] The second part after this is to update the release tracker with the info.  This will be done in a subsequent PR.
* [ ] ghstack is not yet supported
* [ ] Cherry pick a reverted commit is not yet supported (from @kit1980 comment)

### Testing

The script can be used locally:

```
python cherry_pick.py --onto release/2.2 --classification release --github-actor huydhn 118907
The cherry pick PR is at https://github.com/pytorch/pytorch/pull/119351
```

The test cherry pick PR is created at https://github.com/pytorch/pytorch/pull/119351

Unit testing this on CI is tricky, so I test this out on canary instead.

* https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933162707 creates the PR at https://github.com/pytorch/pytorch-canary/pull/201
  * One more test on canary with the new token https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933229483.  The minimum required permission from what I see is `workflow`
* Cherry picking conflicts could happen and needs to be handled manually https://github.com/pytorch/pytorch-canary/pull/194#issuecomment-1933142975
* ~Require a linked issue when cherry picking regressions, critical fixes, or fixing new features https://github.com/pytorch/pytorch-canary/pull/193#issuecomment-1933174520~ Relax this requirement to a suggestion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119352
Approved by: https://github.com/atalman
2024-02-12 23:12:10 +00:00
suo
f15b517055 [export] suppress type error (#119720)
Differential Revision: [D53681243](https://our.internmc.facebook.com/intern/diff/D53681243/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119720
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-02-12 22:54:36 +00:00
b3df3e4e94 Restore OpInfo/ModuleInfo tests in Inductor-wrapped tests (#119693)
I accidentally disabled this without realizing it. It turns out that
PYTORCH_TEST_WITH_INDUCTOR=1 implies PYTORCH_TEST_WITH_DYNAMO=1, which
activates skipIfTorchDynamo decorators.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119693
Approved by: https://github.com/bdhirsh
2024-02-12 22:44:45 +00:00
e0cabebad9 Add missing include for internal build (#119721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119721
Approved by: https://github.com/huydhn
2024-02-12 22:36:16 +00:00
70c93c6097 [inductor] Update JIT Inductor cpp wrapper entry function signature (#119280)
Summary: Change JIT Inductor cpp wrapper entry function to use similar signature as AOTInductor, i.e. using an array of AtenTensorHandle instead of a vector of at::Tensor as the inputs and return output through a pointer. This makes it easier to consolidate the ABI compatible and non-compatible modes.

Differential Revision: [D53478825](https://our.internmc.facebook.com/intern/diff/D53478825)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119280
Approved by: https://github.com/chenyang78
2024-02-12 22:24:35 +00:00
02b60e76c9 make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous but `v` is transposed, then before this PR, `dv` would be computed to be contiguous.
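A quick way to see the stride-preservation behavior of `empty_like` (an illustration, not part of the PR):

```python
import torch

k = torch.randn(2, 16, 513, 64)                  # contiguous
v = torch.randn(2, 513, 16, 64).transpose(1, 2)  # same shape, transposed strides

print(torch.empty_like(k).is_contiguous())  # True
print(torch.empty_like(v).is_contiguous())  # False, strides of v are preserved
```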

Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
    grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
    grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    grad_v = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    return grad_q, grad_k, grad_v
```

But (I think?) the logic in the sdpa backward impl was a typo.

I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).

A minimal repro that I made looks like this:
```
import torch

# in this repro, "grad_out" and "value" are transposed tensors,
# but "key" and "value" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)

out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())

# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())

# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
2024-02-12 22:12:29 +00:00
cyy
10789ccd83 Remove redundant CMake NUMA code (#119650)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119650
Approved by: https://github.com/ezyang
2024-02-12 21:53:44 +00:00
34a61c527b Revert "Enable x86 CPU vectorization on windows (#118980)"
This reverts commit 5f69d95b2b303382fe4cf301e73e36414c879c5c.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to This is breaking Window binary build https://github.com/pytorch/pytorch/actions/runs/7874475000/job/21484997298 where it failed to build sleef ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-1939619212))
2024-02-12 21:33:14 +00:00
cyy
10f3abc6b8 [DeviceIndex][3/N] Use DeviceIndex in more places (#119635)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119635
Approved by: https://github.com/ezyang
2024-02-12 21:31:27 +00:00
064b61009b Correctly formatting the example in get_state_dict (#119532)
This PR corrects the example formatting in https://pytorch.org/docs/stable/distributed.checkpoint.html. In the linked issue, @wz337 also commented that the return type was not showing up correctly. I didn't see any formatting issue there, but I could be wrong.

Fixes #118837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119532
Approved by: https://github.com/fegin
2024-02-12 21:28:22 +00:00
ad217d4266 [ez] Add try catch for deleting old branches (#119696)
I think some characters in branch names affect the API calls, so just assume those branches are protected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119696
Approved by: https://github.com/huydhn
2024-02-12 21:08:59 +00:00
7eecbf8a30 Remove unnecessary skipIfTorchDynamo from test_jit_fuser_te (#118728)
And add some expected failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118728
Approved by: https://github.com/bdhirsh
2024-02-12 20:55:29 +00:00
28c30f29be Update documentation for set_flush_denormal support on ARM (#119354)
**Documentation update for set_flush_denormal():**
-> set_flush_denormal() is now supported on ARM CPUs.
-> **PR:** https://github.com/pytorch/pytorch/pull/115184  (Already merged)

**Reference page:** https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119354
Approved by: https://github.com/drisspg
2024-02-12 20:53:22 +00:00
7d780ff86f Revert "Enable fake tensor caching in fbcode by default (#118555)"
This reverts commit 0f2fbbff109cbc184a6a88247813dbcddaea2e5f.

Reverted https://github.com/pytorch/pytorch/pull/118555 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing one model test internally. Please take a look at the diff for more info D53189048 ([comment](https://github.com/pytorch/pytorch/pull/118555#issuecomment-1939550273))
2024-02-12 20:51:23 +00:00
110919c984 Check QNNPACK support for the platform before running test (#119139)
Do not run test ConstantPropagation.CustomClassesCanBePropagated on a platform where QNNPACK is not supported.

For example, this test fails on M1 Mac because QNNPACK is not supported there:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
unknown file: Failure
as described in more detail in issue #88613.

After the PR, the test passes successfully as below:
[----------] 1 test from ConstantPropagation
[ RUN      ] ConstantPropagation.CustomClassesCanBePropagated
[       OK ] ConstantPropagation.CustomClassesCanBePropagated (0 ms)
[----------] 1 test from ConstantPropagation (0 ms total)

Fixes #88613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119139
Approved by: https://github.com/jcaip
2024-02-12 20:21:07 +00:00
7adfeba47a Add Python 3.12 as experimental to release 2.2 (#119705)
Add 3.12 as experimental version to Release 2.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119705
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-02-12 20:13:54 +00:00
suo
82248f0b1c [export] improve FakeTensor serialization (#119531)
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.

The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.

This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object—there was no obvious place to store the metadata for this. This actually happens—TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.

This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.
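As a loose analogy only (the export serializer does this differently, and directly on FakeTensor), the "patch the reducer" idea can be expressed with Python's standard `copyreg` hook, using hypothetical names:

```python
import copyreg
import pickle

class FakeTensorStub:  # hypothetical stand-in for FakeTensor, for illustration
    def __init__(self, shape, dtype):
        self.shape, self.dtype = shape, dtype

def _reduce_fake(t):
    # Serialize the metadata together with the object instead of stashing it
    # elsewhere in the graph state.
    return (FakeTensorStub, (t.shape, t.dtype))

copyreg.pickle(FakeTensorStub, _reduce_fake)
blob = pickle.dumps(FakeTensorStub((2, 3), "float32"))
roundtripped = pickle.loads(blob)
```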

As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).

Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
2024-02-12 19:28:08 +00:00
482345d747 Refactor out shape test into InputMetadata::maybe_reduce (#119559)
I'm going to gut this function shortly, and having it all on
InputMetadata is convenient for this purpose.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119559
Approved by: https://github.com/soulitzer
2024-02-12 19:27:48 +00:00
c24b74efc7 Revert "Optimize multi_tensor_apply (#119153)"
This reverts commit 24be7daf799ed94e1964e2ce440ccaad15962719.

Reverted https://github.com/pytorch/pytorch/pull/119153 on behalf of https://github.com/yifuwang due to This PR is breaking cuda graph for multi_tensor_apply ([comment](https://github.com/pytorch/pytorch/pull/119153#issuecomment-1939365823))
2024-02-12 19:11:29 +00:00
8d8fb9783c [MPS][EZ] Fix cfloat->chalf conversion on MacOS13 (#119681)
By using `view_as_real` when type casting between two complex types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119681
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-12 19:09:10 +00:00
eb0f9efd31 fix is_ and is_not (#118978)
Fix issue https://github.com/pytorch/pytorch/issues/118805

Note: this was a refresh PR of https://github.com/pytorch/pytorch/pull/118806
discussion there is relevant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118978
Approved by: https://github.com/lezcano
2024-02-12 19:04:40 +00:00
0e5b6594b7 [Dynamo] Minor cleanup of redundant function lookup logics (#119666)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119666
Approved by: https://github.com/angelayi
2024-02-12 19:00:39 +00:00
ed20e9118b Fixed hash issue in fx_graph_cse (#119567)
Description:
- Fixed issue with hash collision for `hash((primals_2, 1.0)) == hash((primals_2, 1))`
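The underlying Python behavior, for reference (not the PR's code): equal ints and floats hash and compare equal, so a table keyed on (node, literal) conflates `sub(x, 1)` with `sub(x, 1.0)` unless the literal's type is taken into account.

```python
assert hash(1) == hash(1.0) and 1 == 1.0

table = {("sub", 1): "sub"}
print(("sub", 1.0) in table)  # True: the float variant is wrongly treated as a duplicate
```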

Repro code:
```python
import torch
from torch._functorch.compile_utils import fx_graph_cse

def func(inpt, osize):
    size = inpt.shape[-1]
    s1 = size - 1
    s2 = size - 1.0
    scale = s2 / (osize - 1.0)
    inpt = torch.clamp(inpt, 0, s1)
    return scale * inpt

gms = []
def toy_backend(gm, _):
    gms.append(gm)
    return gm.forward

torch._dynamo.reset()
fn = torch.compile(backend=toy_backend, dynamic=True)(func)
t = torch.rand(3, 100)
out = fn(t, 50)
gm = gms[0]

print(gm.graph)
new_fx_g = fx_graph_cse(gm.graph)
print(str(new_fx_g))
```
Original graph
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
New wrong graph where `sub_2` is replaced incorrectly with `sub`:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=2] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```
With this PR the new graph is the following:
```
graph():
    %s0 : torch.SymInt [num_users=0] = placeholder[target=s0]
    %s1 : torch.SymInt [num_users=0] = placeholder[target=s1]
    %l_inpt_ : torch.Tensor [num_users=2] = placeholder[target=L_inpt_]
    %l_osize_ : torch.SymInt [num_users=1] = placeholder[target=L_osize_]
    %size : [num_users=1] = call_method[target=size](args = (%l_inpt_,), kwargs = {})
    %getitem_1 : [num_users=2] = call_function[target=operator.getitem](args = (%size, 1), kwargs = {})
    %sub : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1), kwargs = {})
    %sub_1 : [num_users=1] = call_function[target=operator.sub](args = (%getitem_1, 1.0), kwargs = {})
    %sub_2 : [num_users=1] = call_function[target=operator.sub](args = (%l_osize_, 1.0), kwargs = {})
    %truediv : [num_users=1] = call_function[target=operator.truediv](args = (%sub_1, %sub_2), kwargs = {})
    %inpt : [num_users=1] = call_function[target=torch.clamp](args = (%l_inpt_, 0, %sub), kwargs = {})
    %mul : [num_users=1] = call_function[target=operator.mul](args = (%truediv, %inpt), kwargs = {})
    return (mul,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119567
Approved by: https://github.com/eellison
2024-02-12 18:52:11 +00:00
27ffede878 [reland] Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #119102
2024-02-12 18:48:06 +00:00
b2043c0543 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-12 18:45:49 +00:00
893dcac068 [c10d] explicitly abort communicators in destroy_process_group call (#119250)
Summary:
This PR tries to resolve issue #119215.

Basically, process group shutdown (and hence ncclCommAbort) is called in
the destroy_process_group APIs for the corresponding PGs, and in the
destructor of ProcessGroup we avoid calling abort/ncclCommAbort.
Instead, the destructor just checks whether the user has already called destroy_process_group explicitly.
If not, it will log a warning and encourage/expect users to do so
to clean up the PG's resources.
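A minimal sketch of the cleanup pattern this expects from users (assumes the usual launcher environment variables such as RANK and WORLD_SIZE are set):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
t = torch.ones(1, device=f"cuda:{dist.get_rank()}")
dist.all_reduce(t)
# Explicitly shut the process group down so its NCCL communicators are aborted
# here, rather than relying on the ProcessGroup destructor.
dist.destroy_process_group()
```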

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
2024-02-12 18:40:28 +00:00
31f00b0160 Clarify that legacy cat behavior only applies for 1-D tensor (#119684)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119684
Approved by: https://github.com/albanD
2024-02-12 18:13:04 +00:00
059bf1baa4 Separate clang lint? (#119575)
25 min -> 17 + 13 min, which is still not as fast as I want it to be but I'll take it
Lintrunner provides some parallelism by default, but it's not perfect

Reducing fetch-depth from all to 1 further reduces time by ~2-3 minutes

From non clang's logs:
```
2024-02-09T22:05:39.5297616Z Requirement already satisfied: PyYAML==6.0 in /opt/conda/lib/python3.11/site-packages (6.0)
2024-02-09T22:12:23.6164708Z Collecting black==23.12.1
```
I don't know why this part takes so long, maybe it's just buffering?  Clang version doesn't show this issue

See 5a750c8035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119575
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-02-12 17:46:31 +00:00
bc521f2ce3 In dynamo tracing for index() use None as the default indicator for end and not -1 (#119151)
Summary: In dynamo tracing, `index()`'s implementation currently has the default begin index as `0` and the default end index as `-1`, which means that by default we're dropping the last element. Rather, we should use `None`, which ensures that the last element is also checked.
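A plain-Python illustration of why `-1` is the wrong default for the end position (not the Dynamo code itself):

```python
xs = [1, 2, 3]
print(xs[0:-1])    # [1, 2]     an end of -1 silently excludes the last element
print(xs[0:None])  # [1, 2, 3]  None means "through the end", which index() needs
```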

Test Plan: CI

Differential Revision: D53392287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119151
Approved by: https://github.com/yanboliang
2024-02-12 17:45:05 +00:00
cf474a09f5 Decompose torch.ops.higher_order.auto_functionalized in Inductor (#118673)
We'd like to get auto_functionalized to work with AOTInductor. To get
there, we decompose `output = auto_functionalized(inplace_op, ...)` into its
corresponding aten ops (clones + inplace_op) before the Inductor lowering phase.
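Semantically, the decomposition does something like the following sketch, shown here with a hypothetical in-place op rather than the actual Inductor pass:

```python
import torch

def my_inplace_op_(x):  # hypothetical mutating op, for illustration only
    return x.relu_()

def decomposed(x):
    # A clone is inserted by the decomposition, then the original mutating op
    # runs on the clone, leaving the input untouched.
    x_clone = x.clone()
    my_inplace_op_(x_clone)
    return x_clone

print(decomposed(torch.tensor([-1.0, 2.0])))  # tensor([0., 2.])
```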

This decomposition must happen at the end of the Inductor FX passes
because it introduces in-place operations.

The pattern matcher's "replace this single node with multiple nodes" API
isn't robust enough here. The problem is that `auto_functionalized`
returns a single output (this output is a List), but the decomposition
ends up returning the unpacked List (e.g. it may return two tensors).
Previously, there was an assertion that this was not the case; I fixed
up `replace_with_graph` to handle this.

Future: Not all of the clones are necessary (e.g. if the input's last
usage is this operator, then we don't need to clone it). We can add this
logic later.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118673
Approved by: https://github.com/oulgen
2024-02-12 17:30:01 +00:00
8069b29603 [export] Implement logging for scuba. (#119585)
Summary: As we're growing the user surface of torch.export, we'd like to understand better how people are using our APIs. It's also possible to analyze the usages based on static analysis, but due to the fact that there could be many creative ways to call things in Python, I think just building some logging infra will benefit us in the short term and gain us some insights.

Test Plan:
buck test caffe2/test:test_export

Reviewed By: tugsbayasgalan

Differential Revision: D53618220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119585
Approved by: https://github.com/avikchaudhuri
2024-02-12 17:28:14 +00:00
757201c213 Refactor ExportedProgram to expose the functions for pre and postprocessing (#119513)
Reason:
Consumers of ExportedProgram might choose to further lower exported_program.graph_module
to something else.
They will then need to set up the calling convention to call it.

This refactor concentrates the calling-convention handling in one place so it can be reused.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119513
Approved by: https://github.com/zhxchen17
2024-02-12 17:22:27 +00:00
72d9a38118 add get_function to TorchInGraphFunctionVariable (#119314)
partially address https://github.com/pytorch/pytorch/issues/118785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119314
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-12 16:35:34 +00:00
1c1dc0e4e0 [sparse] Add in out_dtype support (i8i8->bf16, i32) for cusparselt (#119296)
Summary:

Adds in out_dtype support for (i8i8->bf16) and (i8i8->i32) matmul with
cuSPARSELt.

Test Plan:

```
python test/test_sparse_semi_structured.py -k mixed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119296
Approved by: https://github.com/cpuhrsch, https://github.com/alexsamardzic
2024-02-12 16:02:36 +00:00
5f69d95b2b Enable x86 CPU vectorization on windows (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as [] and /.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-02-12 16:01:30 +00:00
52a3de6cbf [AOTI][refactor] Move ThreadLocalCachedOutputTensor into a separate header (#119392)
Summary: Move common functionality into a separate header so that later JIT and AOT Inductor can share it.

Test Plan: CI

Differential Revision: D53523452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119392
Approved by: https://github.com/khabinov
2024-02-12 15:56:16 +00:00
24bdd03d23 Revert "Reify view_func() closures as ViewFuncs (#118404)"
This reverts commit d5a6762263a98e5153bc057c8ba4f377542c7e55.

Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))
2024-02-12 12:38:51 +00:00
79df897608 Fix some tests in test_c10d_functional_native.py (#119102)
Summary:
This PR fixes a few tests that were broken because `empty` became `empty_strided_cuda` in the generated code.

Also changed some _c10d_functional calls to funcol calls to add coverage to tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119102
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-12 09:28:18 +00:00
0342b227e5 Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)"
This reverts commit f3e7d809936d9f1bf63102e8afe241e13ed8766a.

Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))
2024-02-12 07:34:20 +00:00
cyy
8a3c241094 Remove unused header inclusion (#119667)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119667
Approved by: https://github.com/Skylion007
2024-02-12 05:36:25 +00:00
dcb08a7044 Add CUDAEvent recording for constant folding to show up. (#119216)
Summary: Add a layer of calls so that CUDAEvent shows up for constant folding.

Test Plan: Existing tests

Differential Revision: D53437934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119216
Approved by: https://github.com/khabinov
2024-02-12 03:46:36 +00:00
bc4d0277cd [executorch hash update] update the pinned executorch hash (#119648)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119648
Approved by: https://github.com/pytorchbot
2024-02-12 03:42:07 +00:00
76fac69577 add a couple more cases to pointwise_cat perf tests (#119521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119521
Approved by: https://github.com/ezyang, https://github.com/eellison
2024-02-12 03:41:08 +00:00
647564dbaa Implement conditional statements in kernel analysis (#119664)
This PR changes `ops` from a dict of RET => OP to RET => List[OP], since multiple OPs can now return the same RET. In real execution only one of these OPs will be executed, so there is no need to worry about renaming. For analysis, we pessimistically assume any one of them could be executed (which is the safest assumption for analysis purposes).

Example TTIRs that can now be handled:
```
    scf.if %13 {
      %14 = tt.get_program_id y : i32 loc(#loc13)
      %c0_i32_1 = arith.constant 0 : i32 loc(#loc14)
      %15 = arith.cmpi eq, %14, %c0_i32_1 : i32 loc(#loc14)
      scf.if %15 {
        %16 = arith.addf %8, %11 : tensor<4xf32> loc(#loc16)
        %17 = tt.splat %arg2 : (!tt.ptr<f32, 1>) -> tensor<4x!tt.ptr<f32, 1>> loc(#loc17)
        %18 = tt.addptr %17, %4 : tensor<4x!tt.ptr<f32, 1>>, tensor<4xi32> loc(#loc17)
        tt.store %18, %16, %5 {cache = 1 : i32, evict = 1 : i32} : tensor<4xf32> loc(#loc18)
      } else {
      } loc(#loc15)
    } else {
    } loc(#loc12)
```

and

```
    %14 = scf.if %13 -> (tensor<4xf32>) {
      %17 = arith.addf %8, %11 : tensor<4xf32> loc(#loc13)
      scf.yield %17 : tensor<4xf32> loc(#loc13)
    } else {
      %17 = arith.mulf %8, %11 : tensor<4xf32> loc(#loc14)
      scf.yield %17 : tensor<4xf32> loc(#loc14)
    } loc(#loc12)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119664
Approved by: https://github.com/aakhundov
2024-02-12 01:54:26 +00:00
663dd5d006 [inductor] Update the compile options for CppPythonBindingsCodeCache (#119415)
Differential Revision: [D53554681](https://our.internmc.facebook.com/intern/diff/D53554681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119415
Approved by: https://github.com/jansel, https://github.com/khabinov
2024-02-11 21:25:34 +00:00
069581b3ca [BE] Properly mark destructor overrides (#119656)
Otherwise, at least on MacOS builds are littered with:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/DeviceAccelerator.h:6:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MTIAHooksInterface.h:23:11: warning: '~MTIAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MTIAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/CUDAHooksInterface.h:65:11: warning: '~CUDAHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~CUDAHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/MPSHooksInterface.h:21:11: warning: '~MPSHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~MPSHooksInterface() = default;
          ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:15:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
```

 Likely introduced by https://github.com/pytorch/pytorch/pull/119329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119656
Approved by: https://github.com/Skylion007
2024-02-11 21:07:16 +00:00
a4cc6b85dc [dynamo][eval][perf] Remove unnecessary dict copies. (#119305)
Both of these variables are already created using `dict(...)` so making yet another `dict` copy is pure overhead and boilerplate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119305
Approved by: https://github.com/Skylion007
2024-02-11 20:29:26 +00:00
e5f46a1d35 Check alignment of ReinterpretView args of custom Triton kernels (#119649)
Summary: Currently, when a custom (user-written) Triton kernel has a ReinterpretView argument in IR, we're always skipping the alignment checking for this argument when preparing the `signature_of` for the AOT compilation of the Triton kernel (via setting `TensorArg.check_alignment` to `False`). This is problematic for user-written kernels where, albeit reinterpreted, the argument of the Triton kernel (the data pointer) can still be aligned to 16. When we skip alignment checking, the performance of the AOT-compiled internal Triton kernels can degrade 2x--3x.

In this PR, we replace `TensorArg.check_alignment` by `TensorArg.offset`, in which we specify the offset of the `ReinterpretView.layout` relative to the underlying `ir.Buffer` (corresponding to the data pointer before reinterpretation). As the size and stride of the layout don't change the alignment properties, those can be skipped. Importantly, for `ReinterpretView` arguments of custom Triton kernels, we use `arg.data.get_name()` as the buffer name. That, together with the offset, is used to check the alignment.

Bonus: the namedtuples in `codegen/common.py` are refactored as `dataclass`es, with nicer type hints and default values (for the newly added `TensorArg.offset`).

Test Plan:

```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view
...
----------------------------------------------------------------------
Ran 6 tests in 27.952s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119649
Approved by: https://github.com/oulgen
2024-02-11 20:21:17 +00:00
b8e4423278 [torch][cuda][perf] Avoid unnecessary dicts. (#118011)
It's unnecessary and inefficient to create a `dict` from list indices to list values just to check whether a particular `idx` exists there. That approach has `O(N)` time and space complexity, whereas checking against the `list` directly is `O(1)` in both.
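A plain-Python illustration of the pattern being removed (names are illustrative, not from the patch):

```python
xs = ["a", "b", "c"]
idx = 1

# Wasteful: materialize a dict of index -> value just to test membership.
exists = idx in dict(enumerate(xs))   # O(n) time and extra space

# Equivalent check against the list itself.
exists = 0 <= idx < len(xs)           # O(1)
```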

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118011
Approved by: https://github.com/Skylion007
2024-02-11 19:29:24 +00:00
95a8d5b1bc [random] Replace for loop with list comprehension. (#119143)
It's more idiomatic and efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119143
Approved by: https://github.com/Skylion007
2024-02-11 19:29:19 +00:00
4394e0dc2c [inductor] Use list comprehension to initialize unused_views. (#119618)
It's more idiomatic and efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119618
Approved by: https://github.com/Skylion007
2024-02-11 18:57:18 +00:00
24be7daf79 Optimize multi_tensor_apply (#119153)
### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
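For reference, the op used in the benchmark below; a sketch that assumes a CUDA device is available:

```python
import torch

if torch.cuda.is_available():
    srcs = [torch.randn(n, device="cuda") for n in (1 << 10, 1 << 14, 1 << 18)]
    dsts = [torch.empty_like(s) for s in srcs]
    # One horizontally fused multi_tensor_apply launch instead of a Python loop of copies.
    torch._foreach_copy_(dsts, srcs)
```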

### Benchmark (WIP)

The only benchmark I've conducted so far on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmarks on more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119153
Approved by: https://github.com/janeyx99
2024-02-11 18:12:22 +00:00
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
4ee8aac432 [MPS] Enable bfloat16 support on MacOS 14 (#119641)
Per [MPSDataType](https://developer.apple.com/documentation/metalperformanceshaders/mpsdatatype/mpsdatatypebfloat16?changes=_11&language=objc) documentation bfloat16 are supported in MacOS Sonoma or later

Added missing `MPSDataTypeBFloat16` and `MTLLanguageVersion3_1` enums to `MPSGraphSonomaOps.h`

TODO: Enable more testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119641
Approved by: https://github.com/Skylion007
2024-02-11 16:25:29 +00:00
68e009dd8f [BE][EZ] Use dispatch_sync_with_rethrow in searchsorted (#119646)
For proper exception handling; otherwise, raising a C++ exception inside a dispatch block will crash the app (discovered while enabling more BFloat16 ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119646
Approved by: https://github.com/Skylion007
2024-02-11 07:19:00 +00:00
6cd82253ae fix torch.set_float32_matmul_precision doc (#119620)
Fixes #119606; clarify the explicitly stored number of bits in the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119620
Approved by: https://github.com/eqy, https://github.com/malfet
2024-02-11 06:41:37 +00:00
cyy
88183923d2 Remove unneeded linking of torch_shm_manager in CMake (#119540)
This PR aims to clean up torch_shm_manager dependency in CMake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119540
Approved by: https://github.com/ezyang
2024-02-11 06:33:35 +00:00
0bed0501fa Don't skip register-spilling configs in custom Triton kernel auto-tuning (#119634)
Summary: There has been some empirical evidence that, for (non-trivial) custom (user-written) Triton kernels, a register-spilling config yields the best result in auto-tuning. For this reason, we don't skip register-spilling configs when auto-tuning custom Triton kernels.

<details>
<summary>An example of auto-tuning result with the register-spilling config outperforming others</summary>

```
BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.748896, nreg 255, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.723424, nreg 249, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 2.202656, nreg 190, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.748256, nreg 255, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.724896, nreg 249, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 2.201632, nreg 190, nspill 0, #shared-mem 8704
BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.651664, nreg 255, nspill 56, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.846368, nreg 255, nspill 14, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.841792, nreg 243, nspill 0, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.651584, nreg 255, nspill 56, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.846432, nreg 255, nspill 14, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.841904, nreg 243, nspill 0, #shared-mem 13312
BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.236448, nreg 255, nspill 254, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.484384, nreg 255, nspill 174, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.131168, nreg 255, nspill 6, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.236544, nreg 255, nspill 254, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.483648, nreg 255, nspill 174, #shared-mem 22528
BLOCK_M: 16, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.131408, nreg 255, nspill 6, #shared-mem 22528
BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.516112, nreg 255, nspill 28, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.737792, nreg 255, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.411632, nreg 193, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.515904, nreg 255, nspill 28, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.736608, nreg 255, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.409808, nreg 193, nspill 0, #shared-mem 13312
BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.553536, nreg 255, nspill 130, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569792, nreg 255, nspill 56, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.892448, nreg 255, nspill 4, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.553584, nreg 255, nspill 130, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569568, nreg 255, nspill 56, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.892240, nreg 255, nspill 4, #shared-mem 18432
BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.332928, nreg 255, nspill 366, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.922256, nreg 255, nspill 228, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.758400, nreg 255, nspill 26, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.333440, nreg 255, nspill 366, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.922336, nreg 255, nspill 228, #shared-mem 28672
BLOCK_M: 32, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.758496, nreg 255, nspill 26, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.231648, nreg 255, nspill 292, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.639424, nreg 255, nspill 90, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.917952, nreg 240, nspill 0, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.230624, nreg 255, nspill 292, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.639168, nreg 255, nspill 90, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 16, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.917440, nreg 240, nspill 0, #shared-mem 22528
BLOCK_M: 64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.838080, nreg 255, nspill 354, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.569184, nreg 255, nspill 178, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.614720, nreg 255, nspill 28, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.838048, nreg 255, nspill 354, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.569472, nreg 255, nspill 178, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 32, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.615104, nreg 255, nspill 28, #shared-mem 28672
BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 1.012128, nreg 255, nspill 522, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.861536, nreg 255, nspill 378, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 1, enable_warp_specialization: False, enable_persistent: False: 0.771584, nreg 255, nspill 134, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 2, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 1.012512, nreg 255, nspill 522, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 4, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.861024, nreg 255, nspill 378, #shared-mem 40960
BLOCK_M: 64, BLOCK_N: 64, num_warps: 8, num_ctas: 1, num_stages: 2, enable_warp_specialization: False, enable_persistent: False: 0.771712, nreg 255, nspill 134, #shared-mem 40960
```

</details>

In the above, the winning config is `BLOCK_M: 32, BLOCK_N: 16, num_warps: 2, num_ctas: 1, num_stages: 2`, although it has non-zero `nspill 28`. This is an example where we need to consider all configs, including the register-spilling ones, to obtain the best result from auto-tuning.

In the worst case, this will just make auto-tuning longer, but can't regress the results. And, as the number of custom Triton kernels in the model is normally much smaller than the number of Inductor-generated ones, this should be acceptable.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119634
Approved by: https://github.com/oulgen
2024-02-11 02:13:25 +00:00
3ab08946d5 Revert "[aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)"
This reverts commit 0597dab523c0a341e136452a8f723f12700164c0.

Reverted https://github.com/pytorch/pytorch/pull/119448 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119448#issuecomment-1937345167))
2024-02-10 23:04:36 +00:00
d8e319a961 Revert "[aot_inductor] move CppWrapperCodeGen into a separate file (#119491)"
This reverts commit 760056bbdc552314e7e81adc45e11766ac0f333c.

Reverted https://github.com/pytorch/pytorch/pull/119491 on behalf of https://github.com/DanilBaibak due to Reverted as a dependency for #119448 ([comment](https://github.com/pytorch/pytorch/pull/119491#issuecomment-1937344548))
2024-02-10 23:02:05 +00:00
6db6a1b526 [aten] Use emplace instead of insert. (#119614)
This avoids pair construction when the inserted key is already present in the dict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119614
Approved by: https://github.com/Skylion007
2024-02-10 22:35:00 +00:00
2c8722182e [dynamo][guards] Avoid unnecessary stack copies. (#119115)
There is no need to make a `frame_summary_stack` copy when it is not modified. The proposed change uses a copy-on-write functional approach that is easy to understand and is more efficient when `self.loc_in_frame` is `None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119115
Approved by: https://github.com/Skylion007
2024-02-10 21:56:00 +00:00
cyy
568740f080 [DeviceIndex][2/N] Use DeviceIndex instead of int in allocators (#119545)
Follows #119142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119545
Approved by: https://github.com/ezyang
2024-02-10 20:27:59 +00:00
57d8f67619 [Dynamo][17/N] Rename SkipFilesVariable to SkipFunctionVariable and move to functions.py (#119619)
This is follow-up-3 from https://github.com/pytorch/pytorch/pull/118971#issue-2114082018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119619
Approved by: https://github.com/jansel
2024-02-10 19:33:37 +00:00
dcce5327bb [core][perf] Use set comprehensions in _RecreateLookupTables. (#119617)
It's more idiomatic and much more efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119617
Approved by: https://github.com/Skylion007
2024-02-10 18:53:25 +00:00
c5116d9e44 Fix optim.lr_scheduler examples in doc to use optimizer vs self.opt (#119563)
Fixes #119561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119563
Approved by: https://github.com/janeyx99
2024-02-10 15:10:43 +00:00
34db6f1b13 Revert "make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)"
This reverts commit 095f4713077639f0e48fa33d051c0de2eb1f8525.

Reverted https://github.com/pytorch/pytorch/pull/119500 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119500#issuecomment-1937003082))
2024-02-10 13:06:30 +00:00
c0f1183eb4 [inductor] Fix compile error on scan with no mask (#119555)
Fixes #119591

Currently this results in invalid syntax:
```python
tmp4 = tl.where(, tmp1, tmp2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119555
Approved by: https://github.com/lezcano
2024-02-10 12:38:40 +00:00
e71c202520 Use CUDA if cuda's macro is set for AOTI runner's pybind (#119616)
Summary:
Use CUDA if cuda's macro is set for AOTI runner's pybind
This is a duplicate of #119438 for landing issues

Test Plan:
Existing tests (D52303882)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119616
Approved by: https://github.com/khabinov
2024-02-10 11:00:47 +00:00
3581428ea0 Do not mark tt.load's arguments as mutated (#119631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119631
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581, #119615
2024-02-10 08:46:50 +00:00
6c5bf5a5ce Implement kernel analysis for functions with multiple return values (#119615)
This diff adds a few improvements:

* Parsing for multiple return value: `tt.return %1, %arg0`
* Parsing for assignment for multiple values: `%1:2` means %1 has two values
* Parsing for usage of a value with multiple values: `%1#0` means 0th index of %1
* Fixes a bug in memo-cycle detection when multiple tests are executed back to back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119615
Approved by: https://github.com/aakhundov
ghstack dependencies: #119581
2024-02-10 08:46:50 +00:00
e693089c7a [Dynamo] Refactor tensor methods handling (#119581)
Fixes part of #119128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119581
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-10 08:46:50 +00:00
699ae72f51 [DCP][state_dict] Fix the issue that get_state_dict/set_state_dict ignore the buffer (#119573)
get_state_dict and set_state_dict currently do not handle buffers appropriately.
This PR fixes the issue.

Fixes https://github.com/pytorch/pytorch/issues/119535.

Differential Revision: [D53616762](https://our.internmc.facebook.com/intern/diff/D53616762/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119573
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-02-10 06:36:58 +00:00
a82c50793e [executorch hash update] update the pinned executorch hash (#119510)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119510
Approved by: https://github.com/pytorchbot
2024-02-10 03:40:34 +00:00
8fd11cb307 [2/2] Intel GPU Runtime Upstreaming for Stream (#117619)
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.

# Design
Currently, it primarily offers stream-related APIs, including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`
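A minimal sketch of how the APIs listed above compose (assumes an XPU-enabled build; `torch.xpu.is_available()` and `torch.xpu.Stream` are assumed to be available alongside them):

```python
import torch

if torch.xpu.is_available():
    s = torch.xpu.Stream()
    with torch.xpu.StreamContext(s):
        y = torch.ones(4, device="xpu") * 2   # work enqueued on s
    torch.xpu.set_stream(s)
    print(torch.xpu.current_stream())
    torch.xpu.synchronize()
```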

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which deals with `Event`.

The differences with CUDA:
XPU has no default or external stream, and the following APIs are missing:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `torch.cuda.is_current_stream_capturing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
2024-02-10 03:39:42 +00:00
f2778e3874 [vision hash update] update the pinned vision hash (#119511)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119511
Approved by: https://github.com/pytorchbot
2024-02-10 03:22:13 +00:00
42ca82dfb1 [audio hash update] update the pinned audio hash (#119612)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119612
Approved by: https://github.com/pytorchbot
2024-02-10 03:22:06 +00:00
3278b4c557 be more conservative until regression is debugged (#119583)
See internal regression: https://www.internalfb.com/diff/D53375778?transaction_fbid=953511712782168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119583
Approved by: https://github.com/Chillee
2024-02-10 03:06:58 +00:00
70a364d402 non-strict improvements: constant args and kwargs (#119529)
This PR makes a couple of improvements to non-strict to bring it closer to strict. (This lets us remove some expected failures from test_export.)

1. Support constant arguments (easy).
2. Support keyword arguments. This forces us to add kwargs to `aot_export_module`. Indeed there is no way to make this work otherwise, because some arguments in a function signature can be keyword-only and thus cannot be simulated by positional arguments alone. Adding kwargs to `aot_export_module` turns out to be fairly routine, but there is a bit of an unsatisfactory fork between how it is called by strict and non-strict: because strict calls it on a graph module, kwargs must be converted to positional arguments. So kwargs in `aot_export_module` really only come into play in non-strict.

Differential Revision: D53600977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119529
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-02-10 02:55:40 +00:00
760056bbdc [aot_inductor] move CppWrapperCodeGen into a separate file (#119491)
This PR moves the CppWrapperCodeGen class into a separate file,
cpp_wrapper.py, to simplify wrapper.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119491
Approved by: https://github.com/desertfire, https://github.com/albanD
2024-02-10 02:15:56 +00:00
095f471307 make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
`dv = at::empty_like(k)` and `dv = at::empty_like(v)` can be materially different, because `empty_like` tries to preserve the strides of the input when possible. So if `k` is contiguous but `v` is transposed, then before this PR, `dv` would be computed to be contiguous.

Alternatively, we could change the meta implementation of `aten._scaled_dot_product_flash_attention` to this:
```
    grad_q = torch.empty_like(query.transpose(1, 2)).transpose(1, 2)
    grad_k = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    grad_v = torch.empty_like(key.transpose(1, 2)).transpose(1, 2)
    return grad_q, grad_k, grad_v
```

But (I think?) the logic in the sdpa backward impl was a typo.

I noticed this because changing the meta formula as above was enough to fix the issue with the `aot_eager` backend in this [link](https://github.com/pytorch/pytorch/issues/116935#issuecomment-1914310523).

A minimal repro that I made looks like this:
```
import torch

# in this repro, "grad_out" and "value" are transposed tensors,
# but "key" and "value" are contiguous
a = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
b = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
c = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
d = torch.randn(2, 513, 16, 64, dtype=torch.float16, device='cuda').transpose(1, 2)
e = torch.randn(2, 16, 513, 64, dtype=torch.float16, device='cuda')
f = torch.randn(2, 16, 513, device='cuda')
g = None
h = None
i = 513
j = 513
k = 0.0
l = False
m = torch.tensor(1, dtype=torch.int64)
n = torch.tensor(1, dtype=torch.int64)

out1_ref, out2_ref, out3_ref = torch.ops.aten._scaled_dot_product_flash_attention_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

from torch._meta_registrations import meta__scaled_dot_product_flash_backward
out1_test, out2_test, out3_test = meta__scaled_dot_product_flash_backward(a, b, c, d, e, f, g, h, i, j, k, l, m, n, scale=0.125)

# prints True True
print(out1_ref.is_contiguous())
print(out1_test.is_contiguous())

# prints True True
print(out2_ref.is_contiguous())
print(out2_test.is_contiguous())

# prints True False
print(out3_ref.is_contiguous())
print(out3_test.is_contiguous())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119500
Approved by: https://github.com/drisspg, https://github.com/ezyang, https://github.com/Skylion007
2024-02-10 02:04:56 +00:00
e1c1b8c2b2 [dynamo] Improve support for backwards hooks (#119525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119525
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-10 01:14:03 +00:00
cyy
05602915f5 Link torch_cpu to cudart only if CUPTI is enabled (#118232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118232
Approved by: https://github.com/ezyang
2024-02-10 00:53:51 +00:00
44796682d0 [torch][ao] Fix module name filter for pytorch2 quantization for underscores (#119344)
Summary:
There was a bug in the module name filter for modules that already had an underscore
in their name: the underscore was replaced with "dot" notation.
This happened because underscores were assumed to always be module separators,
but that isn't the case for modules whose names contain underscores.

Test Plan:
Added a unit test. Before this change, that test failed (due to applying the wrong
qscheme). Now it passes.

Differential Revision: D53502771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119344
Approved by: https://github.com/jerryzh168
2024-02-10 00:29:08 +00:00
34f7dc9eba [ONNX] Support op consistency error reproduction (#119512)
Fixes #119472

Introduce the debugging tool in onnxscript: https://github.com/microsoft/onnxscript/blob/main/onnxscript/tests/function_libs/torch_lib/error_reproduction.py

This tool can help us quickly find the inputs leading to mismatched errors.

NOTE: this produces an `error_reports` folder containing a different markdown report for each mismatched test case.

For example - CREATE_REPRODUCTION_REPORT=1 python -m pytest onnxscript/tests/function_libs/torch_lib/ops_test.py -k test_output_match_fft_fft_cpu_bool

### Summary

The output of ONNX Runtime does not match that of PyTorch when executing test
`test_fx_op_consistency.TestOnnxModelOutputConsistency_opset_version_18_model_type_TorchModelType.TORCH_NN_MODULECPU.test_output_match_fft_fft_cpu_bool`, `sample 3` in ONNX Script `TorchLib`.

To recreate this report, use

```bash
CREATE_REPRODUCTION_REPORT=1 python -m pytest onnxscript/tests/function_libs/torch_lib/ops_test.py -k test_output_match_fft_fft_cpu_bool
```

### ONNX Model

```
<
   ir_version: 8,
   opset_import: ["pkg.onnxscript.torch_lib" : 1, "" : 18, "pkg.onnxscript.torch_lib.common" : 1],
   producer_name: "pytorch",
   producer_version: "2.2.0"
>
main_graph (bool[31] l_args_0_) => (float[31,2] _fft_r2c)
   <bool[31] l_args_0_, float[31] _to_copy, float[31,2] _fft_r2c>
{
   _to_copy = Cast <to: int = 1> (l_args_0_)
   _val_2 = Constant <value: tensor = int64[1] {-1}> ()
   _val_3 = Unsqueeze (_to_copy, _val_2)
   _val_4 = Constant <value: tensor = int64[1] {0}> ()
   _val_5 = Unsqueeze (_val_3, _val_4)
   _val_6 = DFT <axis: int = 1, inverse: int = 0, onesided: int = 0> (_val_5)
   _val_7 = Constant <value: tensor = int64[1] {0}> ()
   _val_8 = Squeeze (_val_6, _val_7)
   _fft_r2c = pkg.onnxscript.torch_lib._fftn_onnx_normalization <dims: ints = [0], forward: int = 1, normalization: int = 0> (_val_3, _val_8)
}
<
  domain: "pkg.onnxscript.torch_lib",
  opset_import: ["" : 18]
>
_fftn_onnx_normalization <normalization,forward,dims>(self, transformed) => (result_15)
{
   self_shape = Shape (self)
   dims = Constant <value_ints: ints = @dims> ()
   self_shape_subscripted = Gather <axis: int = 0> (self_shape, dims)
   total_sample_count = ReduceProd <keepdims: int = 0> (self_shape_subscripted)
   total_sample_count_0 = CastLike (total_sample_count, transformed)
   normalization = Constant <value_int: int = @normalization> ()
   int64_1 = Constant <value: tensor = int64 int64_1 {1}> ()
   cond = Equal (normalization, int64_1)
   result_15 = If (cond) <then_branch: graph = thenGraph_21 () => ( result_3) {
      forward = Constant <value_int: int = @forward> ()
      forward_as_bool = Cast <to: int = 9> (forward)
      result_3 = If (forward_as_bool) <then_branch: graph = thenGraph_23 () => ( result) {
         tmp = Sqrt (total_sample_count_0)
         result = Div (transformed, tmp)
      }, else_branch: graph = elseGraph_23 () => ( result_2) {
         tmp_1 = Sqrt (total_sample_count_0)
         result_2 = Mul (transformed, tmp_1)
      }>
   }, else_branch: graph = elseGraph_21 () => ( result_14) {
      normalization_4 = Constant <value_int: int = @normalization> ()
      int64_2 = Constant <value: tensor = int64 int64_2 {2}> ()
      cond_5 = Equal (normalization_4, int64_2)
      result_14 = If (cond_5) <then_branch: graph = thenGraph_27 () => ( result_9) {
         forward_6 = Constant <value_int: int = @forward> ()
         forward_6_as_bool = Cast <to: int = 9> (forward_6)
         result_9 = If (forward_6_as_bool) <then_branch: graph = thenGraph_29 () => ( result_7) {
            result_7 = Div (transformed, total_sample_count_0)
         }, else_branch: graph = elseGraph_29 () => ( result_8) {
            result_8 = Identity (transformed)
         }>
      }, else_branch: graph = elseGraph_27 () => ( result_13) {
         forward_10 = Constant <value_int: int = @forward> ()
         forward_10_as_bool = Cast <to: int = 9> (forward_10)
         result_13 = If (forward_10_as_bool) <then_branch: graph = thenGraph_35 () => ( result_11) {
            result_11 = Identity (transformed)
         }, else_branch: graph = elseGraph_35 () => ( result_12) {
            result_12 = Mul (transformed, total_sample_count_0)
         }>
      }>
   }>
}
<
  domain: "pkg.onnxscript.torch_lib.common",
  opset_import: ["" : 18]
>
Rank (input) => (return_val)
{
   tmp = Shape (input)
   return_val = Size (tmp)
}
<
  domain: "pkg.onnxscript.torch_lib.common",
  opset_import: ["" : 18]
>
IsScalar (input) => (return_val)
{
   tmp = Shape (input)
   tmp_0 = Size (tmp)
   tmp_1 = Constant <value_int: int = 0> ()
   return_val = Equal (tmp_0, tmp_1)
}
```

### Inputs

Shapes: `['Tensor<torch.Size([31]), dtype=torch.bool>']`

<details><summary>Details</summary>
<p>

```python
kwargs = {}
inputs = (tensor([False, False,  True,  True, False,  True, False,  True, False, False,
         True, False, False, False, False, False,  True,  True,  True,  True,
         True,  True,  True,  True, False, False, False, False,  True,  True,
         True]),)
```

</p>
</details>

### Expected output

Shape: `torch.Size([31, 2])`

<details><summary>Details</summary>
<p>

```python
expected = tensor([[16.0000,  0.0000],
        [-0.2369,  2.6590],
        [ 0.7336, -4.9670],
        [ 2.2093,  2.9865],
        [-0.7166,  1.0928],
        [-3.0614,  3.0015],
        [-1.8945, -0.9677],
        [-2.1538,  0.2513],
        [-2.2432,  1.3978],
        [-0.3429,  1.9494],
        [-0.6495, -1.5423],
        [-0.6005,  2.2398],
        [ 2.2639,  2.6430],
        [ 1.7609,  0.2033],
        [-1.3829, -2.3365],
        [-1.6854, -0.0311],
        [-1.6854,  0.0311],
        [-1.3829,  2.3365],
        [ 1.7609, -0.2033],
        [ 2.2639, -2.6430],
        [-0.6005, -2.2398],
        [-0.6495,  1.5423],
        [-0.3429, -1.9494],
        [-2.2432, -1.3978],
        [-2.1538, -0.2513],
        [-1.8945,  0.9677],
        [-3.0614, -3.0015],
        [-0.7166, -1.0928],
        [ 2.2093, -2.9865],
        [ 0.7336,  4.9670],
        [-0.2369, -2.6590]])
```

</p>
</details>

### Actual output

Shape: `torch.Size([31, 2])`

<details><summary>Details</summary>
<p>

```python
actual = tensor([[ 1.6000e+01, -9.1791e-06],
        [-2.3695e-01,  2.6590e+00],
        [ 7.3355e-01, -4.9670e+00],
        [ 2.2093e+00,  2.9865e+00],
        [-7.1663e-01,  1.0928e+00],
        [-3.0614e+00,  3.0015e+00],
        [-1.8946e+00, -9.6773e-01],
        [-2.1538e+00,  2.5126e-01],
        [-2.2432e+00,  1.3978e+00],
        [-3.4294e-01,  1.9494e+00],
        [-6.4946e-01, -1.5423e+00],
        [-6.0044e-01,  2.2398e+00],
        [ 2.2639e+00,  2.6430e+00],
        [ 1.7609e+00,  2.0326e-01],
        [-1.3829e+00, -2.3365e+00],
        [-1.6854e+00, -3.1130e-02],
        [-1.6854e+00,  3.1161e-02],
        [-1.3829e+00,  2.3365e+00],
        [ 1.7609e+00, -2.0327e-01],
        [ 2.2639e+00, -2.6430e+00],
        [-6.0047e-01, -2.2398e+00],
        [-6.4945e-01,  1.5423e+00],
        [-3.4294e-01, -1.9494e+00],
        [-2.2432e+00, -1.3978e+00],
        [-2.1538e+00, -2.5129e-01],
        [-1.8945e+00,  9.6773e-01],
        [-3.0615e+00, -3.0015e+00],
        [-7.1663e-01, -1.0928e+00],
        [ 2.2093e+00, -2.9865e+00],
        [ 7.3354e-01,  4.9670e+00],
        [-2.3695e-01, -2.6589e+00]])
```

</p>
</details>

### Difference

<details><summary>Details</summary>
<p>

```diff
--- actual
+++ expected
@@ -1,31 +1,31 @@
-tensor([[ 1.6000e+01, -9.1791e-06],
-        [-2.3695e-01,  2.6590e+00],
-        [ 7.3355e-01, -4.9670e+00],
-        [ 2.2093e+00,  2.9865e+00],
-        [-7.1663e-01,  1.0928e+00],
-        [-3.0614e+00,  3.0015e+00],
-        [-1.8946e+00, -9.6773e-01],
-        [-2.1538e+00,  2.5126e-01],
-        [-2.2432e+00,  1.3978e+00],
-        [-3.4294e-01,  1.9494e+00],
-        [-6.4946e-01, -1.5423e+00],
-        [-6.0044e-01,  2.2398e+00],
-        [ 2.2639e+00,  2.6430e+00],
-        [ 1.7609e+00,  2.0326e-01],
-        [-1.3829e+00, -2.3365e+00],
-        [-1.6854e+00, -3.1130e-02],
-        [-1.6854e+00,  3.1161e-02],
-        [-1.3829e+00,  2.3365e+00],
-        [ 1.7609e+00, -2.0327e-01],
-        [ 2.2639e+00, -2.6430e+00],
-        [-6.0047e-01, -2.2398e+00],
-        [-6.4945e-01,  1.5423e+00],
-        [-3.4294e-01, -1.9494e+00],
-        [-2.2432e+00, -1.3978e+00],
-        [-2.1538e+00, -2.5129e-01],
-        [-1.8945e+00,  9.6773e-01],
-        [-3.0615e+00, -3.0015e+00],
-        [-7.1663e-01, -1.0928e+00],
-        [ 2.2093e+00, -2.9865e+00],
-        [ 7.3354e-01,  4.9670e+00],
-        [-2.3695e-01, -2.6589e+00]])
+tensor([[16.0000,  0.0000],
+        [-0.2369,  2.6590],
+        [ 0.7336, -4.9670],
+        [ 2.2093,  2.9865],
+        [-0.7166,  1.0928],
+        [-3.0614,  3.0015],
+        [-1.8945, -0.9677],
+        [-2.1538,  0.2513],
+        [-2.2432,  1.3978],
+        [-0.3429,  1.9494],
+        [-0.6495, -1.5423],
+        [-0.6005,  2.2398],
+        [ 2.2639,  2.6430],
+        [ 1.7609,  0.2033],
+        [-1.3829, -2.3365],
+        [-1.6854, -0.0311],
+        [-1.6854,  0.0311],
+        [-1.3829,  2.3365],
+        [ 1.7609, -0.2033],
+        [ 2.2639, -2.6430],
+        [-0.6005, -2.2398],
+        [-0.6495,  1.5423],
+        [-0.3429, -1.9494],
+        [-2.2432, -1.3978],
+        [-2.1538, -0.2513],
+        [-1.8945,  0.9677],
+        [-3.0614, -3.0015],
+        [-0.7166, -1.0928],
+        [ 2.2093, -2.9865],
+        [ 0.7336,  4.9670],
+        [-0.2369, -2.6590]])
```

</p>
</details>

### Full error stack

```
Tensor-likes are not close!

Mismatched elements: 21 / 62 (33.9%)
Greatest absolute difference: 3.719329833984375e-05 at index (26, 1) (up to 1e-05 allowed)
Greatest relative difference: 0.0005033136694692075 at index (15, 1) (up to 1.3e-06 allowed)
  File "/home/titaiwang/pytorch/test/onnx/test_fx_op_consistency.py", line 1763, in _compare_onnx_and_torch_exported_program
    torch.testing.assert_close(
  File "/home/titaiwang/pytorch/torch/testing/_comparison.py", line 1523, in assert_close
    raise error_metas[0].to_error(msg)

```

### Environment

```
OS: Linux-5.15.135.1-2.cm2-x86_64-with-glibc2.35
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
onnx==1.15.0
onnxruntime==1.17.0
onnxscript==0.1.0.dev20240207
numpy==1.26.0
torch==2.2.0a0+git684ce1b
```
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119512
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-02-09 23:24:01 +00:00
bb287d73ec [ONNX] Apply modularizarion to exported program exporting (#119498)
Apply modularization pass to exported program exporting. The only two things that need to be taken care of are (1) the extra call stack generated by `torch.export.export` and (2) the fact that lifted placeholders have call stacks (unlike the original placeholders).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119498
Approved by: https://github.com/thiagocrepaldi
2024-02-09 22:57:42 +00:00
3372aa51b4 Integrate swap_tensors into nn.Module.load_state_dict (#117913)
Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors`
* `param.module_load(sd[key])`

This method can be overridden using `__torch_function__`.
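
A minimal sketch of such an override (the cast below is purely illustrative, not the behavior this PR prescribes):

```python
import torch

class CastOnLoad(torch.Tensor):
    # Customize how a state-dict value is transformed before it is swapped
    # into a param/buffer of this subclass during load_state_dict.
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.Tensor.module_load:
            dest, src = args[0], args[1]
            return src.to(dest.dtype)  # e.g. cast the incoming value to match
        return super().__torch_function__(func, types, args, kwargs)
```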

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913
Approved by: https://github.com/albanD
2024-02-09 22:32:29 +00:00
a7f82b7d62 [fix] tmp fix for import issue in dtensor (#119582)
A temporary fix for S394053, which is likely caused by a backward-incompatible `import` introduced in D53437243. It's not yet understood why this causes an issue, but let's forward-"fix" it first and then draft a follow-up diff for the proper fix.

Differential Revision: [D53621345](https://our.internmc.facebook.com/intern/diff/D53621345/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119582
Approved by: https://github.com/tianyu-l
2024-02-09 20:50:27 +00:00
bf8db86a19 [FSDP] Added deprecation msg for NO_SHARD (#119553)
This only includes the warning for world size >1 since we clamp to `NO_SHARD` for world size 1. We mainly do not want `NO_SHARD` usage to proliferate further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119553
Approved by: https://github.com/Skylion007
2024-02-09 20:32:03 +00:00
f3e7d80993 [c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421)
Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421
Approved by: https://github.com/shuqiangzhang
2024-02-09 20:23:20 +00:00
0597dab523 [aot_inductor] move CudaWrapperCodeGen into a separate file (#119448)
wrapper.py is getting more complex. Let's first split it
into smaller pieces. Will have another PR to move CppWrapperCodeGen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119448
Approved by: https://github.com/desertfire
2024-02-09 20:18:04 +00:00
9a1df7cfd7 ReduceLROnPlateau init _last_lr (#119366) (#119556)
Fixes #119366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119556
Approved by: https://github.com/janeyx99
2024-02-09 19:35:02 +00:00
bf8a5a11be Fix Inductor CSE Across Separate Reductions (#119410)
We were CSE'ing a load across two separate reduction loop bodies. This is because we were examining an indirect indexing expression that did not have an explicit rindex in its load. I've commented with more details and other potential fixes.

Tried using the minifier unsuccessfully and hand-minified the repro somewhat, but it could be reduced further.

Fix for https://github.com/pytorch/pytorch/issues/119327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119410
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-02-09 19:34:57 +00:00
f208795182 Improve TORCHDYNAMO_EXTENDED_DEBUG for GuardOnDataDependentSymNode (#119412)
This PR substantially improves the error reporting for GuardOnDataDependentSymNode in the following ways:

* The GuardOnDataDependentSymNode error message is rewritten for clarity, and contains a link to a new doc on how to resolve these issues https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.44gwi83jepaj
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL`, which lets you specify a symbol name to get detailed debug information when it is logged (e.g., the full backtrace and user backtrace of the symbol creation). The exact symbols you may be interested in are now explicitly spelled out in the error message (a minimal usage sketch follows after this list).
* We support `TORCHDYNAMO_EXTENDED_DEBUG_CPP` which enables reporting C++ backtraces whenever we would report a backtrace.
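
A minimal usage sketch (the symbol name `u2`, `my_model`, and `example_inputs` are placeholders):

```python
import os

# Set before torch._dynamo reads its config.
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL"] = "u2"
os.environ["TORCHDYNAMO_EXTENDED_DEBUG_CPP"] = "1"

import torch

compiled = torch.compile(my_model)  # `my_model` is a placeholder nn.Module
compiled(*example_inputs)           # failures now carry extended debug info for u2
```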

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119412
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #117356
2024-02-09 19:15:28 +00:00
01e248d6f1 Fix FallbackKernel behavior on mutable ops (#118649)
FallbackKernel wasn't handling mutable ops correctly: it would not report
them in get_mutation_names or get_alias_names. This would lead to silent
incorrectness -- Inductor would incorrectly reorder the mutable op with other
mutable ops.

This PR fixes that:
- we only support mutable operations that are "auto_functionalizable".
  That is, they mutate inputs and do not return aliases of any inputs (see the sketch after this list).
- Following the Triton kernel work, any mutated inputs must be specified
  in get_alias_names and processed via mark_node_as_mutating
- We also do some minor cleanup by killing dead code (FallbackKernel no
  longer processes OpOverloadPacket) and adding some handling around
  HOPs.
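
A hypothetical sketch of an "auto_functionalizable" mutable op (the `mylib` namespace and the op itself are made up for illustration):

```python
import torch

lib = torch.library.Library("mylib", "FRAGMENT")
# Mutates its first input (Tensor(a!)) and returns nothing, so no output
# aliases any input: the shape of mutable op supported here.
lib.define("inplace_scale(Tensor(a!) x, float s) -> ()")

def inplace_scale(x, s):
    x.mul_(s)

lib.impl("inplace_scale", inplace_scale, "CompositeExplicitAutograd")
```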

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118649
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-02-09 19:01:54 +00:00
25a0fa6d13 Revert "[dynamo] Improve support for backwards hooks (#119525)"
This reverts commit b1f4b2a63c038f0090886d7d213825f39c283ea5.

Reverted https://github.com/pytorch/pytorch/pull/119525 on behalf of https://github.com/clee2000 due to broke test_autograd.py::TestAutograd::test_post_accumulate_grad_hook_gets_cleaned_up on dynamo https://github.com/pytorch/pytorch/actions/runs/7847212828/job/21416215820 b1f4b2a63c.  The failure exists on the PR as well, but got masked by the other test.  Putting this as no signal? ([comment](https://github.com/pytorch/pytorch/pull/119525#issuecomment-1936447169))
2024-02-09 18:58:55 +00:00
4b9568a360 Add Accelerator device and shell hooks (#119329)
This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8
It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329
Approved by: https://github.com/ezyang
2024-02-09 18:54:28 +00:00
d5a6762263 Reify view_func() closures as ViewFuncs (#118404)
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.

```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
  virtual ~ViewFunc() {}
  /// Returns any SymInts in the saved state.
  virtual std::vector<c10::SymInt> get_symints() const { return {}; }
  /// Returns the number of SymInts in the saved state.
  virtual size_t num_symints() const { return 0; }
  /// Returns any tensors in the saved state.
  virtual std::vector<at::Tensor> get_tensors() const { return {}; }
  /// Returns the number of tensors in the saved state.
  virtual size_t num_tensors() const { return 0; }
  /// Reapplies the view on the given base using the saved state.
  virtual at::Tensor operator()(const at::Tensor&) const = 0;
  /// Returns a clone of this ViewFunc, optionally with the specified saved state.
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;

protected:
  /// Sets the values of any SymInts in the saved state. The input vector size must
  /// match the number of SymInts in the saved state (i.e. the size of the list
  /// returned by get_symints()).
  virtual void set_symints(std::vector<c10::SymInt>) {}
  /// Sets the values of any Tensors in the saved state. The input vector size must
  /// match the number of Tensors in the saved state (i.e. the size of the list
  /// returned by get_tensors()).
  virtual void set_tensors(std::vector<at::Tensor>) {}
};
```

New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`

The templates for these also contain impls for `ChainedViewFunc` and `ErroringViewFunc`, which are used in a few places within autograd.

Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
  SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
  {};
  virtual ~SliceTensorViewFunc() override {};
  virtual std::vector<c10::SymInt> get_symints() const override;
  virtual size_t num_symints() const override;
  virtual std::vector<at::Tensor> get_tensors() const override;
  virtual size_t num_tensors() const override;
  virtual at::Tensor operator()(const at::Tensor&) const override;
  virtual std::unique_ptr<ViewFunc> clone_and_set(
      std::optional<std::vector<c10::SymInt>> = c10::nullopt,
      std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;

protected:
  virtual void set_symints(std::vector<c10::SymInt>) override;
  virtual void set_tensors(std::vector<at::Tensor>) override;

private:
  int64_t dim;
  c10::optional<c10::SymInt> start;
  c10::optional<c10::SymInt> end;
  c10::SymInt step;
};
...

// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
  ::std::vector<c10::SymInt> symints;
  symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
  if(start.has_value()) symints.insert(symints.end(), *(start));
  if(end.has_value()) symints.insert(symints.end(), *(end));
  symints.push_back(step);
  return symints;
}

size_t SliceTensorViewFunc::num_symints() const {
  return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}

void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
  TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
  auto i = 0;
  if(start.has_value()) start = symints[i];
  i += (start.has_value() ? 1 : 0);
  if(end.has_value()) end = symints[i];
  i += (end.has_value() ? 1 : 0);
  step = symints[i];
}

std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
  ::std::vector<at::Tensor> tensors;
  return tensors;
}

size_t SliceTensorViewFunc::num_tensors() const {
  return static_cast<size_t>(0);
}

void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
  TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());

}

at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
  return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}

std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
    std::optional<std::vector<c10::SymInt>> symints,
    std::optional<std::vector<at::Tensor>> tensors) const {
  auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
  if (symints.has_value()) {
    output->set_symints(std::move(*(symints)));
  }
  if (tensors.has_value()) {
    output->set_tensors(std::move(*(tensors)));
  }
  return output;
}
```

The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
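
A minimal replay/hot-swap sketch, assuming the two visitor callables are accepted positionally after the new base (the pass-through lambdas are illustrative; real callers would return symbolic or fake values):

```python
import torch

base = torch.randn(4, 4, requires_grad=True)
view = base.narrow(0, 1, 2)  # the view records a ViewFunc with its saved state

new_base = torch.randn(4, 4, requires_grad=True)
replayed = view._view_func(
    new_base,
    lambda s: s,  # symint_visitor_fn: could return a swapped SymInt here
    lambda t: t,  # tensor_visitor_fn: could return e.g. a fake tensor here
)
assert replayed.shape == view.shape
```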

For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
2024-02-09 18:51:36 +00:00
261f0138a2 [easy] Fix pass_manager type annotation (#119499)
Summary: passes are str not callable here.

Test Plan: lint

Reviewed By: frank-wei

Differential Revision: D53592166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119499
Approved by: https://github.com/22quinn, https://github.com/Skylion007
2024-02-09 18:39:43 +00:00
suo
5747ec24b4 [export] fix canonicalization for input mutations (#119533)
The comparison was off: user_input_mutation and buffer_mutation had the same numeric value, which led the comparison to move to the next element of the tuple and try to compare `None` to `spec.buffer_mutation.buffer_name`, which doesn't work. So make them different numbers.
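
An illustrative Python-level reproduction of the failure mode (not the actual export spec code): when two kinds share the same numeric value, tuple comparison falls through to the next element, and `None` is not orderable against a string.

```python
# Two specs whose kind enums collapsed to the same number; the second tuple
# element is None for one of them.
specs = [(1, "buf_a"), (1, None)]
sorted(specs)  # TypeError: '<' not supported between instances of 'NoneType' and 'str'
```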

Differential Revision: [D53601300](https://our.internmc.facebook.com/intern/diff/D53601300/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119533
Approved by: https://github.com/zhxchen17
2024-02-09 18:30:39 +00:00
cf42dd09ca [FSDP2] Replaced version-ctx with no_grad; removed no_grad (#119550)
This PR replaces the `_unsafe_preserve_version_counters` context with a simple `torch.no_grad()` context instead. This decreases CPU overhead from (1 context enter/exit + an `N`-tensor loop) to just (1 context enter/exit).

This PR also removes a `torch.no_grad()` from `init_unsharded_param`, since removing it helps torch.compile while not affecting eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119550
Approved by: https://github.com/Skylion007
2024-02-09 18:24:19 +00:00
f3a2094065 [Dynamo][Export] Mitigate legacy issue that aten op as export entrance function (#119528)
This is going to fix a legacy issue like:
```
torch._dynamo.export(torch.ops.aten.scaled_dot_product_attention, ...)(*inputs,)
```
This is not supported anymore; the top-level ```torch.export``` now only supports ```nn.Module```, but some tests still use the internal APIs and hit the ```trace_rules.check``` assertion error. This PR mitigates such cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119528
Approved by: https://github.com/ydwu4
2024-02-09 18:24:09 +00:00
5356b5d1f0 [Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)
This is follow-up-1 for https://github.com/pytorch/pytorch/pull/118971#issue-2114082018. Only code motion and doc update in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119432
Approved by: https://github.com/jansel
2024-02-09 18:18:23 +00:00
7082e24ce8 [quant][pt2e][bc-breaking] Set fold_quantize to True in convert_pt2e (#119425)
Summary: This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Differential Revision: D53550237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119425
Approved by: https://github.com/andrewor14
2024-02-09 18:13:43 +00:00
3f82e435eb Fix delete branches (#119399)
Due to PR_WINDOW, if the magic string exists in the body but the PR was not updated recently, the query wouldn't find it and would delete the branch.  Instead, query separately for branches with the no-delete-branch label, which I created recently.

Might as well query for branches with open PRs while we're at it, so PRs with the stale label won't get their branches deleted either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
2024-02-09 17:28:00 +00:00
c6f39740c7 Revert "Fix delete branches (#119399)"
This reverts commit e1fc7e1ebcf4b87d5c34bf276806212c38ca00f0.

Reverted https://github.com/pytorch/pytorch/pull/119399 on behalf of https://github.com/clee2000 due to has a bug ([comment](https://github.com/pytorch/pytorch/pull/119399#issuecomment-1936291560))
2024-02-09 17:14:23 +00:00
53a6ab3fda [BE] Update Pillow to 10.2.0 (#119517)
As older versions have arbitrary code execution vulnerabilities reported by Dependabot, documented in https://nvd.nist.gov/vuln/detail/CVE-2023-50447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119517
Approved by: https://github.com/kit1980, https://github.com/seemethere
2024-02-09 17:05:28 +00:00
b1f4b2a63c [dynamo] Improve support for backwards hooks (#119525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119525
Approved by: https://github.com/yanboliang
2024-02-09 17:02:40 +00:00
5d6e323549 No TD (test removal) option in CI (#118808)
It currently doesn't do anything, but I will want these env vars later.  Maybe I should start using ghstack

Intention: --enable-td actually gets rid of tests

I am open to better names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118808
Approved by: https://github.com/huydhn, https://github.com/osalpekar
2024-02-09 16:42:27 +00:00
e1fc7e1ebc Fix delete branches (#119399)
Due to PR_WINDOW, if the magic string exists in the body but the PR was not updated recently, the query wouldn't find it and would delete the branch.  Instead, query separately for branches with the no-delete-branch label, which I created recently.

Might as well query for branches with open PRs while we're at it, so PRs with the stale label won't get their branches deleted either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119399
Approved by: https://github.com/huydhn
2024-02-09 16:40:32 +00:00
5d81ade484 [Inductor max autotune] Multithreaded Precompilation (#119386)
When using the Cutlass backend, the compilation
of CUDA source files can totally dominate the runtime required for the benchmarking done
as part of Autotuning.

This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory and a
possible on-disk sccache).

It also ensures that no unnecessary compilation
and benchmarking steps are performed, as was previously the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119386
Approved by: https://github.com/aakhundov
2024-02-09 16:11:30 +00:00
173256424a Update setuptools to 68.2.2 (#119456)
Follow-up to the same PR: Anaconda does not have setuptools v65, but it does have v68.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119456
Approved by: https://github.com/Skylion007
2024-02-09 15:38:25 +00:00
eff93fbd86 Revert "[Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)"
This reverts commit 56364124af8fe148ba8b0c935571ebae6500f33b.

Reverted https://github.com/pytorch/pytorch/pull/119432 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119432#issuecomment-1936122795))
2024-02-09 15:25:25 +00:00
90dabff260 Avoid COW materialize in various operations (#119506)
Operations affected include dot, cross, scatter/gather, shape, sort,
triangular, unary, scalar, pad, complex, to_list, fft

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119506
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502, #119503, #119504
2024-02-09 14:47:19 +00:00
8a09f1320c Avoid COW materialize in index, reduce, compare, unique, and copy ops (#119504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119504
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502, #119503
2024-02-09 14:47:19 +00:00
0e6b314fc2 Avoid performing replacements when it would unrefine ranges (#117356)
Fixes https://github.com/pytorch/pytorch/issues/117268; check this issue for background.

This PR does the following:

* Do not perform a replacement if the expression we're replacing the symbol with has a less refined value range than the original. There's a little bit of trickiness around the handling for values close to INT64_MAX; when checking if a range refines another, I *only* consider the range representable in 64-bit integers. This is enough to prevent us from doing a substitution like `i0 = 10 - i1`, but it appears to still let us do the other substitutions we like, such as `i0 = i1` or `i0 = 12 * i1`
* The test above is order dependent: if we assert an equality BEFORE we have refined a range, we might be willing to do the replacement because there isn't a meaningful range. This means that it's important to mark things as sizes, before you start doing other error checking. `split_with_sizes` is adjusted accordingly. It would be good to raise an error if you get the ordering wrong, but I leave this to future work.
* It turns out this is not enough to fix AOTAutograd, because we lose the size-ness of unbacked SymInts when AOTAutograd retraces the Dynamo graph. So update deferred runtime assert insertion to also insert size-ness and value ranges annotations. Note that, in principle, it shouldn't be necessary to explicitly do the latter; these should just show up as deferred runtime asserts. That's some extra refactoring for a later day.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117356
Approved by: https://github.com/lezcano
2024-02-09 14:43:58 +00:00
064610d8ac Don't guard if there are unbacked SymInts (#119312)
Fixes https://github.com/pytorch/pytorch/issues/119309

Not sure how to write the test.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119312
Approved by: https://github.com/lezcano
2024-02-09 11:02:47 +00:00
a13bb9f6a8 Add symbol_guard_limit_before_specialize (#119347)
Add a flag setting that controls a threshold of guards involving a symbol, after which we force a symbol to be specialized. The roll out plan is to enable this on OSS but not fbcode, and then roll out to fbcode after we get some telemetry from the previous PR.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119347
Approved by: https://github.com/lezcano
2024-02-09 08:44:37 +00:00
a050d146b7 [Inductor] Add Int8 data type into Inductor CPP backend vectorized code generation (#119179)
**Summary**
Part 1 of fixing https://github.com/pytorch/pytorch/issues/119141, which needs vectorized code generation for per-channel quant and the int8 data type.
In the current implementation for quantization, the vectorized code generation only supports the `uint8` data type. In this PR, we introduce support for the `int8` data type within the vectorized code generation.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_decomposed_dequant_relu_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_quant_lowering_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_maxpool2d_lowering_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_per_tensor_fake_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_non_contiguous_load_buf_quant_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output_int8
python -u -m pytest -s -v test_cpu_repro.py -k test_dequant_relu_quant_dequant_relu_quant_lowering_int8
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119179
Approved by: https://github.com/peterbell10, https://github.com/jgong5, https://github.com/jansel
2024-02-09 07:33:12 +00:00
5918622d72 Avoid COW materialize in pooling, batch linalg, upsample, softmax ops (#119503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119503
Approved by: https://github.com/ezyang
ghstack dependencies: #119501, #119502
2024-02-09 06:52:16 +00:00
53deddd66d Avoid COW materialization for TensorInfo with const type (#119502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119502
Approved by: https://github.com/ezyang
ghstack dependencies: #119501
2024-02-09 06:51:43 +00:00
fba5b7f7c8 Avoid COW materialization for TensorAccessors with const type (#119501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119501
Approved by: https://github.com/ezyang
2024-02-09 06:46:00 +00:00
fa071a2e1b Clarifying windows cosine behaviour in the documentation (#119444)
After following the discussion, I've created a PR to update the documentation clarifying the function's behaviour (@tqbl solution 1).

Fixes #110541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119444
Approved by: https://github.com/malfet
2024-02-09 05:57:44 +00:00
0f2fbbff10 Enable fake tensor caching in fbcode by default (#118555)
Summary: Enabled by default in OSS; this switches the default to "on" in fbcode too.

Test Plan: Ran torchbench benchmarks in fbcode

Differential Revision: [D53189048](https://our.internmc.facebook.com/intern/diff/D53189048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118555
Approved by: https://github.com/eellison
2024-02-09 05:42:16 +00:00
2cdf9b7674 [BE] Update requests to 2.31.0 (#119516)
Fixes a potential leak detected by Dependabot and reported in  https://nvd.nist.gov/vuln/detail/CVE-2023-32681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119516
Approved by: https://github.com/kit1980, https://github.com/seemethere
2024-02-09 05:10:16 +00:00
458e83b5b3 Revert "Add FakeTensor support to torch._utils._rebuild_tensor (#108186)"
This reverts commit 113506d2d4a0120e912c8f36e70a621f55378f81.

Reverted https://github.com/pytorch/pytorch/pull/108186 on behalf of https://github.com/atalman due to Reverted Internally ([comment](https://github.com/pytorch/pytorch/pull/108186#issuecomment-1935310344))
2024-02-09 04:19:20 +00:00
930b60f5aa Add Debug Utility To Generate Inputs for AOT Graphs (#119409)
```
    Takes in a function which has been printed with print_readable() and constructs kwargs to run it.
    Currently only handles Tensor inputs and a graph module which might have tensor constants.
    Example:
        Consider a function `forward` defined as follows:
        >>> def forward(self, primals_1: "f32[1001, 6]"):
        ...     _tensor_constant0: "i64[4190]" = self._tensor_constant0
        ...     # Further implementation
        >>> kwargs = aot_graph_input_parser(forward)
        >>> forward(**kwargs)
    """
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119409
Approved by: https://github.com/shunting314
2024-02-09 03:55:19 +00:00
2d474e17cb Don't log canonicalized expressions (#119471)
Fixes https://github.com/pytorch/pytorch/issues/119467
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119471
Approved by: https://github.com/ezyang
2024-02-09 02:46:11 +00:00
8994f2367d Revert "Fix jagged NT softmax semantics (#119459)"
This reverts commit 6adadbaf7943f760ea2375619b1783020b69d4e6.

Reverted https://github.com/pytorch/pytorch/pull/119459 on behalf of https://github.com/malfet due to broke dynamo, see https://github.com/pytorch/pytorch/actions/runs/7835402753/job/21386634602 ([comment](https://github.com/pytorch/pytorch/pull/119459#issuecomment-1935246413))
2024-02-09 02:31:49 +00:00
88429a8084 [inductor] Add split scan kernel (#117992)
This PR adds a new type of Triton kernel in which data is persistent but the
reduction dimension is split over multiple blocks (up to the entire kernel).
Though this is called a reduction dimension, in actuality we only support scans.
Because of this limitation, I have to be able to block fusions of split-scan
operations with reductions, so I chose to add a new `ir.SplitScan` node, which
is identical but allows for differentiation in the scheduler.

The split scan kernel is also the first to require an additional workspace buffer,
which is used to communicate between CUDA blocks. This is slightly tricky because
the exact scratch space requirement isn't known until the grid size is calculated.
Here I work around the issue by setting a minimum rblock size and always allocating
for the maximum possible grid size for a given input tensor.
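
Illustrative only: the kind of workload a split scan targets, namely a long cumulative scan whose reduction dimension is too large for a single persistent block (sizes are arbitrary).

```python
import torch

@torch.compile
def running_total(x):
    return torch.cumsum(x, dim=0)  # a scan, eligible for the split-scan kernel

if torch.cuda.is_available():
    out = running_total(torch.randn(1 << 22, device="cuda"))
```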

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117992
Approved by: https://github.com/jansel
ghstack dependencies: #117991
2024-02-09 01:56:00 +00:00
01edb8a559 [inductor] Refactor triton range_tree handling (#117991)
Currently the dimension handling in triton kernels has various special cases e.g.
- handling "r" for non-reduction vs persistent reduction vs non-persistent reduction.
- handling "x" when `no_x_dim` is set

This adds three new properties to the range tree objects which capture the
same information in a more generic way:
- `is_loop`: true for the "r" dimension of a non-persistent reduction
- `tensor_dim`: Optional index of the triton tensor dimension
- `grid_dim`: Optional index of the triton grid dimension

The motivation here is I want to add a new split scan kernel type which is:
- not a persistent reduction, yet has `is_loop=False` for the "r" dimension
- Has a `grid_dim` for the "r" dimension

These flags now only need to be set once in `initialize_range_trees`, instead of having
to infer them throughout the code based on the tree prefix and various other kernel flags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117991
Approved by: https://github.com/lezcano
2024-02-09 01:56:00 +00:00
6efda849b5 Update chunk_dtensor to support HYBRID_SHARD (#119481)
Fixes https://github.com/pytorch/pytorch/issues/118639.

Adds support for replicating across HSDP dimensions instead of sharding, for the shard placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119481
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-02-09 01:30:53 +00:00
454abb6b99 Disable tests that use bfloat 16 for SM < 80 (#118449)
```
`torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-83b319, line 51; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 51; error   : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 59; error   : Feature 'cvt with .f32.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-83b319, line 65; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas fatal   : Ptx assembly aborted due to errors
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To execute this test, run the following from the base repo dir:
     python test/inductor/test_torchinductor.py -k test_bfloat16_to_int16_cuda`
```

Fixed a test failure that uses bfloat16 on pre-SM80 hardware (V100 is where the failure is seen for this test).

See also #113384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118449
Approved by: https://github.com/eqy, https://github.com/peterbell10
2024-02-09 01:27:22 +00:00
915f9db03c [Dynamo] Support kwargs for lazy module (#119445)
Summary:
Seems like `kwargs` is already supported in `_infer_argument`, so we don't need the extra assertion `len(kwargs) == 0`.

This optimization ensures compatibility with torch.compile() for LazyModules with kwargs inputs, preventing graph breaks.
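
A minimal sketch of a lazy module whose forward takes a keyword argument (the module itself is made up for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.modules.lazy import LazyModuleMixin

class LazyScale(LazyModuleMixin, nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.UninitializedParameter()

    def initialize_parameters(self, x, *, bias=0.0):
        if self.has_uninitialized_params():
            with torch.no_grad():
                self.weight.materialize((x.shape[-1],))
                self.weight.fill_(1.0)

    def forward(self, x, *, bias=0.0):
        return x * self.weight + bias

m = torch.compile(LazyScale())
out = m(torch.randn(2, 4), bias=0.5)  # kwargs no longer trip the assertion
```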

Test Plan: Unit test and CI

Differential Revision: D53558778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119445
Approved by: https://github.com/yanboliang
2024-02-09 00:46:41 +00:00
45c4a0ce9d Update setup tools to 65.5.1 (#119456)
Should fix some Dependabot alerts by:
- Updating setuptools to 65.5.1
- Updating jinja2 to 3.3.1

TODO:
 - Update jinja2 and sphinx for the docs builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119456
Approved by: https://github.com/Skylion007
2024-02-08 23:34:41 +00:00
a8d1645f15 Revert "Add lowering for logcumsumexp (#118753)"
This reverts commit 5a77ee65879b58e99911fd53d92ddb55a1c234eb.

Reverted https://github.com/pytorch/pytorch/pull/118753 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but not seen until trunk job ([comment](https://github.com/pytorch/pytorch/pull/118753#issuecomment-1935074235))
2024-02-08 23:10:33 +00:00
cyy
560c92c324 [DeviceIndex] Use DeviceIndex instead of int in CUDA wrappers (#119142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119142
Approved by: https://github.com/ezyang
2024-02-08 23:00:56 +00:00
e98dbae0a0 [ROCm] enable hipsolver backend for linalg.eigh (#115177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115177
Approved by: https://github.com/lezcano
2024-02-08 22:03:27 +00:00
suo
0f12c0af44 [export] allow user input mutation in aot_export (#119356)
This PR enables input mutation in aot_export by removing the guard and ensuring that the GraphSignature is properly wired up.

This allows us to undo the gross hack in torch.export where we lift user inputs to buffers in order to get around missing upstream aot_export support. It also makes input mutation work properly for non-strict mode.

Mutations on inputs that require_grad are still banned (I added a test for a non-parameter input as well, just to make sure).
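
A minimal sketch of the newly allowed pattern (module and shapes are illustrative):

```python
import torch

class AddInPlace(torch.nn.Module):
    def forward(self, x):
        x.add_(1)  # mutates a user input (which must not require grad)
        return x + 2

ep = torch.export.export(AddInPlace(), (torch.zeros(3),))
print(ep.graph_signature)  # records the user input mutation
```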

Differential Revision: [D53507440](https://our.internmc.facebook.com/intern/diff/D53507440/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119356
Approved by: https://github.com/bdhirsh, https://github.com/zhxchen17, https://github.com/titaiwangms
2024-02-08 22:02:24 +00:00
9f8ade04cc [aot_inductor] replace TORCH_CHECK with AOTI_CHECK in the generate cpp code (#119220)
In some cases where we have TORCH_CHECK in loops, it may cause the host
compiler to spend hours optimizing the run_impl function. This PR
mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK,
where we force the underlying assert function to be noinline.

If forcing noinline causes any serious perf regression, we could
either add an option to turn noinline on/off, or add another option
to just turn AOTI_CHECK into a no-op, similar
to the ```assert``` macro from cassert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-02-08 21:57:27 +00:00
71e772f827 Update logging.cpp for explicit chrono import (#119469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119469
Approved by: https://github.com/davidberard98
2024-02-08 21:57:23 +00:00
45e7af5818 Windows Dynamo Error Removal CI Check (#115969)
Rebase of #111313 onto `main`, for CI validation

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115969
Approved by: https://github.com/ezyang
2024-02-08 21:23:45 +00:00
0827510fd3 [export] Remove torch._export.export (#119095)
XLA changes: https://github.com/pytorch/xla/pull/6486

Test Plan: CI

Differential Revision: D53316196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119095
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17, https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri, https://github.com/jerryzh168
2024-02-08 21:22:04 +00:00
a7754b2b60 [dtensor] switch softmax backward ops to OpStrategy (#119255)
As titled. This is a followup to PR #117723 on softmax forward ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119255
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-08 21:18:39 +00:00
d9a1b25807 Fixed an issue where nn.Linear would cause an internal int underflow … (#119221)
…when trying to reshape a scalar input.

Fixes #119161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119221
Approved by: https://github.com/albanD
2024-02-08 21:06:34 +00:00
7fd6b1c558 s/print/warn in arch choice in cpp extension (#119463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119463
Approved by: https://github.com/malfet
2024-02-08 20:38:51 +00:00
db1a4dcb5a [BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, and almost no ModuleInfos override this (so only `float32` and `float64` are tested).

This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.

Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
2024-02-08 20:35:32 +00:00
4e93b00b69 [Inductor] Setting kernel launch and exit callbacks for inductor generated triton kernels (#119450)
`CompiledKernel.launch_enter_hook` and `CompiledKernel.launch_exit_hook` are hooks that allow external tools to monitor the execution of Triton kernels and read each kernel's metadata. Initially, these hooks have a value of `None`.

Triton's kernel launcher passes hooks and kernel metadata by default, while Inductor's launcher doesn't. This PR unifies the parameters passed to both launchers so that tools can get information from both handwritten Triton kernels and Inductor-generated Triton kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119450
Approved by: https://github.com/jansel
2024-02-08 20:19:18 +00:00
6adadbaf79 Fix jagged NT softmax semantics (#119459)
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
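
A small sketch of the intended semantics (shapes are arbitrary): softmax over the last dim of a jagged nested tensor should normalize each row of each component, i.e. operate on `_values` with the dim shifted down by one.

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 5), torch.randn(4, 5)], layout=torch.jagged
)
out = torch.softmax(nt, dim=-1)  # rows of each component now sum to 1
```
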
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
2024-02-08 20:13:12 +00:00
278a0e1600 [NestedTensor] Support binary pointwise ops with >2 inputs (if inputs are non-tensors) (#119419)
It should usually be safe to run pointwise binary ops with >2 inputs, e.g. threshold_backward(tensor, tensor, scalar): we just operate on the values of the nested tensors and pass in the other args as-is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119419
Approved by: https://github.com/soulitzer
2024-02-08 20:06:40 +00:00
cd9a1934fb [ONNX] Bump to onnx1.15.0 and ort1.17.0 in CI (#119106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119106
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2024-02-08 19:26:13 +00:00
91f038161a [FSDP2] Used split_with_sizes_copy for all-gather copy-out (#119451)
This switches to using @yifuwang's `split_with_sizes_copy.out` fast path!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119451
Approved by: https://github.com/yifuwang
ghstack dependencies: #118017, #118118
2024-02-08 19:04:30 +00:00
suo
def572929b [export/nonstrict] always create FakeTensorMode (#119446)
Previously in non-strict mode we would source a FakeTensorMode from existing tensors if available.

It turns out this is problematic, as it means we can't directly control the behavior of this FakeTensorMode. For example, if the user-provided FakeTensorMode does not set `allow_non_fake_inputs=True`, then we get into trouble with constant tensors, etc.

At the moment, we still have to explicitly re-fakify the module state. @ezyang has recommended against this, but it's necessary because `create_aot_dispatcher_function` calls `detect_fake_mode` on all the inputs, which will error if not all the FakeTensors are on the same mode. We should straighten this out, but we're leaving that for the future.

Differential Revision: [D53559043](https://our.internmc.facebook.com/intern/diff/D53559043/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119446
Approved by: https://github.com/ezyang, https://github.com/zhxchen17
2024-02-08 18:54:18 +00:00
7ec6ac89e8 Add lowering to special.modified_bessel_i0 (#118993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118993
Approved by: https://github.com/peterbell10
2024-02-08 18:42:40 +00:00
9242523ad5 [ET-Vulkan] aten.pow.Tensor_Tensor (#119423)
Summary:
This wires the eager-mode operation to the Vulkan shader. We only cover the case where both inputs are Tensor type, which is on par with the existing operators: add, sub, mul, div, floor_div.

It doesn't seem like we can cover [any other of the 8 cases](https://www.internalfb.com/code/fbsource/[e45c04564445b5e67ebb61e6ba53995729686526]/xplat/caffe2/torch/distributed/_tensor/ops/pointwise_ops.py?lines=310-317) right now. We categorize them below and explain what's missing for each.

## Category 1
The other two of the three "standard" cases require one of the values to be a scalar,
```
z = torch.pow(x, y)
```
```
aten.pow.Scalar,
aten.pow.Tensor_Scalar,
aten.pow.Tensor_Tensor,
```
which is not currently supported.
```
F 00:00:01.746228 executorch:aten_bridge.cpp:21] In function check_tensor_meta(), assert failed (b.sizes().data() != nullptr): ETensor must have valid sizes array
```

## Category 2
IIUC, these operators require an out argument in the declaration. However, when they are traced they collapse into Category 1, e.g., we obtain `aten.pow.Tensor_Tensor`, not `aten.pow.Tensor_Tensor_out`.

This appears in line with current PT-Vulkan, which only [implements the other two categories](https://www.internalfb.com/code/fbsource/[f148c22604b8e409696fd64f814cda89d091fe7a]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/BinaryOp.cpp?lines=533-558).
```
torch.pow(x, y, out=z)
```
```
aten.pow.Scalar_out,
aten.pow.Tensor_Scalar_out,
aten.pow.Tensor_Tensor_out,
```

## Category 3
IIUC, in-place operators are written like this:
```
x.pow_(y)
```
```
aten.pow_.Scalar,
aten.pow_.Tensor,
```
They are not currently supported.
```
  File "/data/users/jorgep31415/fbsource/buck-out/v2/gen/fbcode/b007eb344207ad7d/executorch/backends/vulkan/test/__test_vulkan_delegate__/test_vulkan_delegate#link-tree/torch/_export/verifier.py", line 188, in _check_valid_op
    raise SpecViolationError(
torch._export.verifier.SpecViolationError: operator 'aten.copy_.default' is not functional
```

Test Plan:
```
[jorgep31415@devvm15882.vll0 /data/users/jorgep31415/fbsource (fd1ed5f81)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate -- test_vulkan_backend_pow
File changed: fbcode//executorch/backends/vulkan/vulkan_preprocess.py
Buck UI: https://www.internalfb.com/buck2/7f9ec9e5-cbac-4618-b8ad-d94d10bb50ff
Test UI: https://www.internalfb.com/intern/testinfra/testrun/562950306906309
Network: Up: 3.2KiB  Down: 0B  (reSessionID-ea5af789-c131-4170-ba20-5c5c9718276b)
Jobs completed: 7. Time elapsed: 48.5s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D53547865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119423
Approved by: https://github.com/SS-JIA, https://github.com/malfet
2024-02-08 18:31:33 +00:00
b51b27922b Add to_empty() suggestion in the error message (#119353)
Fixes #119293; the comprehensive documentation is [here](0f478d9d61/docs/source/meta.rst (id11)).
Just added the suggestion to the error message so it is more informative to the user.
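
A short sketch of the suggested pattern (sizes are arbitrary): materialize a meta-device module with `to_empty()` rather than `to()`.

```python
import torch

with torch.device("meta"):
    m = torch.nn.Linear(8, 8)

m = m.to_empty(device="cpu")  # allocates real but uninitialized storage
m.reset_parameters()          # then initialize or load real weights
```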

@albanD

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119353
Approved by: https://github.com/mikaylagawarecki
2024-02-08 18:30:02 +00:00
5a77ee6587 Add lowering for logcumsumexp (#118753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118753
Approved by: https://github.com/peterbell10
2024-02-08 18:29:34 +00:00
7315ec7505 Revert "Fix estimate_nccl_collective_runtime (#118986)"
This reverts commit 0dab6fb35284ed47d1c6339e9d71e4ca3b50dc51.

Reverted https://github.com/pytorch/pytorch/pull/118986 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118986#issuecomment-1934680463))
2024-02-08 18:11:53 +00:00
1d61011c11 [MPS] Add support for complex scalars (#119318)
- Switch to native complex support if running on MacOS Monterey or newer for binary ops.
- Python complex scalars are always represented in PyTorch as ComplexDouble, but MPS does not yet support double precision types, so downcast them to floats
- Also add `cf`(for complex float)  and `ch`(for complex half) to MPSScalar value union
- Fix complex scalar-to-view promotion by introducing a `legacy_complex_as_view` helper function that converts non-float types to complex and promotes CPU complex scalars to MPS before turning them into a view.
- Add `test_tensor_scalar_binops`

Fixes https://github.com/pytorch/pytorch/issues/119088

Test plan: CI (have quite a lot of tests, see new unexpected successes) +  `python -c "import torch;x,y=torch.rand(2, 2, dtype=torch.cfloat, device='mps'),torch.tensor(2+3j,dtype=torch.chalf);print(y+x)"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119318
Approved by: https://github.com/albanD
2024-02-08 18:10:59 +00:00
2b9cba86cf Fix deadlock in ExecutionTraceObserver (#119242) (#119398)
Summary:

With a compiled PyTorch module, in execution_trace_observer.cpp the function convertIValue calls TensorImpl->storage_offset(). That call triggers a recursive call into recordOperatorStart, which causes a deadlock on ob.g_mutex.

This DIFF is to fix this deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low, so performance should NOT be a concern.
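
A minimal Python analogy of the re-entrant locking problem (the actual fix is in C++, swapping `std::mutex` for `std::recursive_mutex`; the function names below just mirror the ones mentioned above):

```python
import threading

# A plain threading.Lock would deadlock when the same thread re-acquires it,
# mirroring recordOperatorStart re-entering while ob.g_mutex is already held.
# An RLock (like std::recursive_mutex) lets the owning thread acquire it again.
lock = threading.RLock()

def convert_ivalue():
    # In the real observer, storage_offset() here triggers another
    # recordOperatorStart on the same thread.
    with lock:
        pass

def record_operator_start():
    with lock:
        convert_ivalue()

record_operator_start()  # completes instead of deadlocking
```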

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53533253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398
Approved by: https://github.com/aaronenyeshi
2024-02-08 18:00:51 +00:00
896cf9d1ce [inductor][cpp] vectorization support for int32/int64 (#119001)
This pull request aims to complete most of the support for vectorizing int32 and int64 data types except for indirect indexing and masks. The basic data type support for uint32 and uint64 is also added but without vectorization. More vectorized conversion functions are added between integer and float. In order to support int64 vectors, a new VectorizedN class is introduced to handle vectors of arbitrary length. Below are the details:
1. Complete most of the int32 and int64 vectorization support including load, store, reduction, constant and conversion. The indirect indexing and masks will be addressed in follow-up PRs, after which, the legality checking logic in `CppVecKernelChecker` can be further simplified.
2. Util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, we'd better move them from cpp_prefix.h to ATen vec to simplify cpp_prefix.h, will be addressed in follow-up PRs.
3. Introduced a new template class VectorizedN, designed to handle vectors of arbitrary length by encapsulating multiple Vectorized<T> instances. This class supports most of the operations of `Vectorized<T>`. It makes the support of int64 vectorization simpler. I will also apply it to bf16/fp16/int8 in the follow-up PRs for better efficiency. For example, bf16 currently only uses half of the vector lanes. With `VectorizedN`, we can use all of the lanes and map a bf16 vector to `VectorizedN<float,2>` on conversion.
4. Basic data type support is added for uint32 and uint64 (in graph.py). Vectorization support will be added later but not of high priority due to fewer usages.

Next steps:

- [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors.
- [ ] Fully utilize vector lanes for bfloat16/float16/int8.
- [ ] Support indirect indexing with vectorized index via scalarization.
- [ ] Clean up `CppVecKernelChecker`.
- [ ] Simplify `cpp_prefix.h` including refactoring vector conversion logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119001
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-02-08 17:38:49 +00:00
8182fce769 Revert "Add cpp stack traces to our own reruns (#119408)"
This reverts commit fbe6f6236e25e27e5968715f824dc8bfb0e37213.

Reverted https://github.com/pytorch/pytorch/pull/119408 on behalf of https://github.com/malfet due to Looks like it introduced intermittent crashes see https://github.com/pytorch/pytorch/actions/runs/7823402867/job/21344456540 for example, testing the theory ([comment](https://github.com/pytorch/pytorch/pull/119408#issuecomment-1934589057))
2024-02-08 17:20:39 +00:00
8da2f81527 [export] Convert internal tests to using .module() (#119105)
Test Plan: CI

Differential Revision: D53091904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119105
Approved by: https://github.com/ydwu4
2024-02-08 17:19:07 +00:00
c3e0836084 [export] Remove CallSpec (#117671)
Summary: This is not really being used anywhere

Test Plan: CI

Differential Revision: D52842563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117671
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-02-08 17:19:03 +00:00
9436710afd Implement shallow copy functions for FunctionalTensorWrapper. (#118783)
Fix: #115792

This PR implements 2 virtual functions of `TensorImpl` that are called when setting the
`tensor.data`:

- `shallow_copy_from`: which calls `copy_tensor_metadata`; and

- `copy_tensor_metadata`: which copies all `FunctionalTensorWrapper` metadata and ~calls
`dest->value_.set_data(src->value_)`~ assigns `dest->value_ = src->value_`, so as to copy also the inner tensor using the same
method

Before this PR, the inner tensor of a `FunctionalTensorWrapper` was being ignored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118783
Approved by: https://github.com/bdhirsh
2024-02-08 17:15:46 +00:00
6d8f192fd0 [DCP] Call os.sync if os.fsync does not work for fsspec (#119287)
Some fsspec storage may not support fileno(). In such a case, we fall back to os.sync()

It may not even be necessary to call `os.sync()` in such a case, since the storage may be a remote storage that requires a special sync API call.
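
A rough sketch of the fallback described above (the helper name is hypothetical; the real change lives in the fsspec storage writer):

```python
import os

def sync_to_disk(file_obj):
    try:
        # Preferred: flush this specific file's data to disk.
        os.fsync(file_obj.fileno())
    except (AttributeError, OSError, ValueError):
        # Some fsspec file objects do not support fileno(); fall back to
        # syncing all filesystem buffers.
        os.sync()
```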

Differential Revision: [D53433425](https://our.internmc.facebook.com/intern/diff/D53433425/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119287
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #118888
2024-02-08 17:10:38 +00:00
b251bca205 [dynamo] inlining into __iter__ of user defined object (#119243)
Fixes #119198.

This PR makes dynamo inline `__iter__` of a user-defined object instead of creating a graph break. Also added a new test, which shows:
1. the loop is unrolled
2. the length of the loop is guarded when inlining `__iter__`
```python
import torch

class Mod:
    def __init__(self):
        self.a = [torch.randn(2, 2), torch.randn(2, 2)]

    def __iter__(self):
        return iter(self.a)

def f(mod):
    ret = []
    for x in mod:
        ret.append(x + 1)
    return ret
```
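
For completeness, a hedged usage sketch reusing the snippet above (the backend choice is arbitrary and just for illustration):

```python
compiled_f = torch.compile(f, backend="eager", fullgraph=True)
print(compiled_f(Mod()))  # no graph break; the loop over mod.a is unrolled
```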

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119243
Approved by: https://github.com/jansel
2024-02-08 17:07:30 +00:00
b181e52a8f [export] Support non-tensor tuple hoo outputs (#119402)
There's an internal custom op which has a None output, so when it becomes auto_functionalized, the HOO's output is (None, Tensor, Tensor, ...). This PR adds support for the None output, and any int/bool outputs from HOOs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119402
Approved by: https://github.com/suo, https://github.com/avikchaudhuri
2024-02-08 16:54:40 +00:00
7f05c72864 [nccl flight recorder] record time we discover start and complete (#119249)
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed even though
that completion happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change, so even if the collective eventually gets marked complete,
we can observe that it happened minutes after it was scheduled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
2024-02-08 16:48:33 +00:00
3a8bf25fdd [SparseCsr] Remove triton sdpa skip after triton pin update (#109601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109601
Approved by: https://github.com/desertfire, https://github.com/amjames
2024-02-08 16:40:25 +00:00
d947534782 [DCP] Enable filesystem/fsspec auto detection (#118888)
This API enables the ability to automatically detect whether to use filesystem or fsspec based on the checkpoint_id.

Differential Revision: [D53318043](https://our.internmc.facebook.com/intern/diff/D53318043/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118888
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-02-08 16:38:04 +00:00
4f2bf7fa87 Print the value of constants in __str__ (#119276)
Not sure why we haven't been doing this really...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119276
Approved by: https://github.com/jansel
2024-02-08 16:23:36 +00:00
579999a731 [PyTorch] Back scalar value to pinned memory for .item() (#119202)
Summary: This diff optimizes the .item() call by backing the scalar value storage with pinned memory, so we don't create an implicit synchronization with the libcuda library.

Test Plan:
# Prod VDD model on H100
Vanguard runs
9.8k qps -> 10.1k qps (~3% improvement)

# .item() Benchmark
1 thread 50k iterations

consistent ~2-3% improvements

With pinned memory
item() took 1.627608060836792 seconds
item() took 1.635591983795166 seconds
item() took 1.6398141384124756 seconds
item() took 1.6378591060638428 seconds
item() took 1.618534803390503 seconds
item() took 1.6467158794403076 seconds
item() took 1.6278800964355469 seconds
item() took 1.6205573081970215 seconds
item() took 1.64951753616333 seconds
item() took 1.6286702156066895 seconds

w/o pinned memory
item() took 1.6783554553985596 seconds
item() took 1.6670520305633545 seconds
item() took 1.6748230457305908 seconds
item() took 1.6708712577819824 seconds
item() took 1.6836023330688477 seconds
item() took 1.6518056392669678 seconds
item() took 1.6769678592681885 seconds
item() took 1.661888837814331 seconds
item() took 1.6627326011657715 seconds
item() took 1.6908581256866455 seconds
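
A rough sketch of the micro-benchmark pattern described above (1 thread, 50k iterations as stated; the tensor shape and device handling are assumptions):

```python
import time
import torch

x = torch.zeros(1, device="cuda")  # any CUDA scalar tensor
torch.cuda.synchronize()
start = time.time()
for _ in range(50_000):
    x.item()  # copies a single scalar from device to host
torch.cuda.synchronize()
print(f"item() took {time.time() - start} seconds")
```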

Differential Revision: D53431148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119202
Approved by: https://github.com/xw285cornell
2024-02-08 16:23:15 +00:00
08657b82f5 Reduce scope of dispatching in logcumsumexp_backward (#119397)
Everything inside the `AT_DISPATCH` block is being compiled 5 times,
so it makes sense to limit it to the only line that uses `scalar_t` which is
the `numeric_limits` query.

Also a small optimization, instead of computing `grad.log()` and `(-grad).log()`
we can compute `grad.abs().log()` which is 2 pointwise ops instead of 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397
Approved by: https://github.com/lezcano, https://github.com/albanD
2024-02-08 15:09:22 +00:00
56364124af [Dynamo][16/N] Move skipfiles to trace_rules.py (#119432)
This is follow-up-1 for https://github.com/pytorch/pytorch/pull/118971#issue-2114082018. Only code motion and doc update in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119432
Approved by: https://github.com/jansel
2024-02-08 09:41:52 +00:00
0a41ac3cf3 [1/2] Intel GPU Runtime Upstreaming for Stream (#117611)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second runtime component we would like to upstream is `Stream` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 2 PRs. This is one of the 2 PRs and covers the changes under `c10`.

# Design
Intel GPU stream is a wrapper of a sycl queue, which schedules kernels on a sycl device. In our design, we maintain a sycl queue pool containing 32 queues per priority per device, and when a queue is requested, one of these queues is returned in round-robin order. The corresponding C++ files related to `Device` will be placed in the `c10/xpu` folder. We provide the `c10::xpu::XPUStream` APIs, like
 - `XPUStream getStreamFromPool`
 - `XPUStream getCurrentXPUStream`
 - `void setCurrentXPUStream`
 - `void device_synchronize`

# Additional Context
In our plan, 2 PRs should be submitted to PyTorch for `Stream`:
1. for c10
2. for python frontend.

The differences with CUDA:
There is no default or external stream in XPU, and the below APIs are missing:
- `getDefaultCUDAStream`
- `getStreamFromExternal`

For CUDA, `cuda::device_synchronize` can sync all streams on the device, but for XPU, `xpu::sync_streams_on_device` only syncs the reserved streams on the device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117611
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-08 09:07:23 +00:00
cyy
7d516bbd5f Update MacOS deployment target to OS version 11.1 (#119373)
To avoid the following error:
```
2024-02-07T12:49:51.8306390Z ld: warning: dylib (/Users/runner/work/_temp/anaconda/envs/wheel_py38/lib/libomp.dylib) was built for newer macOS version (11.1) than being linked (11.0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119373
Approved by: https://github.com/huydhn
2024-02-08 08:19:42 +00:00
5f6b35915a [executorch hash update] update the pinned executorch hash (#119336)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119336
Approved by: https://github.com/pytorchbot
2024-02-08 03:38:53 +00:00
f579c65ef6 Release GIL for torch::autograd::clear_autocast_cache (#119416)
Fixes #119262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119416
Approved by: https://github.com/albanD
2024-02-08 03:22:48 +00:00
9d6bf20022 [FSDP2] Added backward prefetching (#118118)
This PR adds explicit backward prefetching to overlap communication and computation in backward (namely, needed for `reshard_after_forward=True` or `reshard_after_forward: int`). We do this by recording the post-forward order and using its reverse to approximate the backward order.

This works for the typical 1 forward / 1 backward training. However, for more complex schedules, this can run into some gaps:
- We need to know the _true end of backward_.
    - At the true end of backward, we can clear our recorded post-forward order and pre-backward hook state, and we should wait on gradient reductions.
    - There is no easy way to know whether the current backward marks the true end of backward. Therefore, we introduce an API for the user to set this: `fsdp_module.set_is_last_backward(bool)`. For example, for pipeline parallelism's DFS cooldown backward, we can call `fsdp_module.set_is_last_backward(is_last_microbatch)` (see the sketch after the collapsed section below).
- When the user runs backward through only part of the model, our reverse-post-forward-order heuristic risks _mistargeted prefetches_ for unused modules, which would mean the module's parameters are all-gathered and not freed until the end of backward.
    - To error on the side of less memory usage (but no overlap), this PR introduces logic to check whether a module will need its unshard in the current backward (by recording the module's `forward` outputs' `grad_fn`s and querying the autograd engine).
    - Note that there may be _no_ overlap in backward for some parts due to no prefetching.
    - Note further that when running multiple backwards, if the user does not use `set_is_last_backward`, we may not be able to provide a meaningful error message, as the pre-backward hook could be erroneously cleared on the 1st backward.
    - In the future, we may expose more APIs from the autograd engine (similar to `_current_graph_task_execution_order`) to make the prefetching exact. (Currently, `_current_graph_task_execution_order` requires the `with torch.autograd.set_multithreading_enabled(False)`, which is too hard of a constraint as we cannot easily modify users' training loops. We can replace the multi-threading check with a device check. Moreover, in the partial backward case in this PR's unit test, I still hit an [internal assertion](b816760a2f/torch/csrc/autograd/engine.cpp (L476)), so some follow-up is required.)

<details>
<summary> Old Discussion </summary>

For discussion:
- The PR includes a counter `expected_backward_unshard_count` to mitigate mistargeted prefetches in backward. However, it can be seen as a necessary but not sufficient solution.
    - If a module's outputs do not require gradient, then we certainly do not need to unshard the module in backward.
    - However, if a module's outputs do require gradient, then we still may not need to unshard the module for _this_ backward (e.g. if the module did not contribute to `loss` for the current `loss.backward()`).
    - This counter will only address the first case but not the second. If we want to address the second, then we may need more info from the autograd engine.
- For now, I did not include any unit test to cover these behaviors, as I do not have a good example yet.
</details>
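
As referenced above, a hedged sketch of how `set_is_last_backward` might be used in a microbatched (e.g. pipeline-parallel) training step; the FSDP module, microbatch list, and optimizer are passed in as assumptions rather than constructed here:

```python
def run_microbatched_backward(fsdp_module, microbatches, optimizer):
    # Only the backward of the final microbatch should clear the recorded
    # post-forward order / pre-backward hook state and wait on grad reductions.
    for i, batch in enumerate(microbatches):
        fsdp_module.set_is_last_backward(i == len(microbatches) - 1)
        loss = fsdp_module(batch).sum()
        loss.backward()
    optimizer.step()
```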

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118118
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118017
2024-02-08 03:17:45 +00:00
1d2382f141 [DDP] Use compiled_autograd to trace DDP backward allreduce (#110662)
**Summary**
The reducer of `DistributedDataParallel`  is implemented with C++ and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce and later be optimized (fused) in the Inductor.

**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all  `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`.
3.  `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.

**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.

The bucketing is done in a separate PR.

Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
2024-02-08 03:03:15 +00:00
113506d2d4 Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
Partially fixes https://github.com/pytorch/pytorch/issues/105077

Repro:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in a `torch.utils._mode_utils.no_dispatch()` context to skip the fake mode dispatcher for it and thus create a real tensor. It later calls `fake_mode.from_tensor(t)` to finally create the fake tensor.
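
A simplified, hedged sketch of that mechanism (not the actual patch in `torch/_utils.py`; the tensor contents here are arbitrary):

```python
import torch
from torch._subclasses import fake_tensor
from torch.utils._mode_utils import no_dispatch

fake_mode = fake_tensor.FakeTensorMode()
with fake_mode:
    with no_dispatch():
        # With dispatch disabled, set_() runs on a real tensor even though a
        # fake mode is active, so rebuilding from a real storage succeeds.
        t = torch.empty(3)
        t.set_(torch.ones(3).untyped_storage(), 0, (3,), (1,))
    ft = fake_mode.from_tensor(t)  # finally wrap it as a fake tensor
print(type(ft))
```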

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
2024-02-08 03:01:34 +00:00
9a992b0918 [4/4] Intel GPU Runtime Upstreaming for Device (#116869)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR  covers the changes under lazy initialization.

# Design
This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.

# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
2024-02-08 03:01:21 +00:00
3cb7ec312c [PT-Vulkan] aten::conv1d - opt: width-pack weight tensor (>2x speedup) (#118835)
## This diff
This optimization reduces calls to `texelFetch(uKernel, ...)` by 4.

We borrow MatMul's work to do the re-packing:

https://www.internalfb.com/code/fbsource/[7e8ef1b8adeda224a736f8cc4bf870e0a659df95]/xplat/caffe2/aten/src/ATen/native/vulkan/ops/Mm.cpp?lines=20%2C50

## Future optimizations

We are already batching reads from input/weight tensors, and writes to output tensor.

Here are other ideas, which I won't pursue for now. (2) is the most doable.
1. **Batch reads/writes along the dimension that is most commonly > 1.** For weights, the length dimension is definitely correct here, but input/outputs could potentially leverage the length dimensions too. However, `stride != 1` would complicate this optimization.
2. **Batch an optimal number of reads/writes.** Instead of default-ing to 4 elements (since that corresponds to 1 texel), consider more elements such as MatMul's 4x4 texel tile.
3. **Obscure shader compiler optimizations.** Since MatMul seemed to benefit from several seemingly equivalent ways to write code.

Differential Revision: [D53204674](https://our.internmc.facebook.com/intern/diff/D53204674/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118835
Approved by: https://github.com/SS-JIA, https://github.com/liuk22
2024-02-08 02:23:51 +00:00
2349e473f1 Forward fix for same_shape oblivious guard (#119383)
Fixes internal test

```
buck2 test '@fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn_test -- --exact 'accelerators/workloads/models/slimdsnn:slimdsnn_test - test_generate (accelerators.workloads.models.slimdsnn.test_slimdsnn.SlimDSNN)'
```

And I added an OSS test that approximates the internal situation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D53544208](https://our.internmc.facebook.com/intern/diff/D53544208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119383
Approved by: https://github.com/atalman, https://github.com/albanD
2024-02-08 02:11:46 +00:00
64aaa8f508 Fix typo on Contribution Guide (#119428)
Fixes #119427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119428
Approved by: https://github.com/awgu, https://github.com/kit1980
2024-02-08 01:07:27 +00:00
fbe6f6236e Add cpp stack traces to our own reruns (#119408)
Note that I'm not sure why we have both pytest rerun the failing test twice via 81abc2b249/test/run_test.py (L966) and our own logic retry it afterwards as well?

The failing test is only here to make sure it works as expected in the CI env. Will remove before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119408
Approved by: https://github.com/huydhn
2024-02-08 00:54:16 +00:00
33761969a4 Remove parent device mesh check (#118620)
Removes raising error if a device_mesh has a parent.

The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch for the above which I am slowly upstreaming.

I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/Skylion007
2024-02-08 00:49:28 +00:00
029a16c41f [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-02-07 22:29:29 +00:00
6fe5a3adaf release GIL for cudaEventDestroy (#119393)
cudaEventDestroy can become blocking under some circumstances, and then holding GIL will lead to deadlocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119393
Approved by: https://github.com/Skylion007
2024-02-07 22:16:18 +00:00
ad75d9e2ca [easy] Fix test_triton_kernel_reinterpret_view_mem_leak by cloning fwd input (#119219)
```

$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_reinterpret_view_mem_leak

# Before
RuntimeError:
Found following user inputs located at [0] are mutated. This is currently banned in the aot_export workflow.
If you need this functionality, please file a github issue.

fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=True, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutates_storage_metadata=False, requires_grad=False, mutation_type=<MutationType.MUTATED_OUT_GRAPH: 3>),...)

# Now
Ran 6 tests in 13.851s
OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119219
Approved by: https://github.com/oulgen
2024-02-07 21:30:16 +00:00
81abc2b249 Revert "[quant][pt2e][bc-breaking] Remove fold_quantize flag (#118701)"
This reverts commit 482d952e880cf78c103a06f2d483556ab0a89138.

Reverted https://github.com/pytorch/pytorch/pull/118701 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118701#issuecomment-1932866964))
2024-02-07 20:56:16 +00:00
a6e16fe202 Fix global in header warning (#119380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119380
Approved by: https://github.com/janeyx99
2024-02-07 20:35:21 +00:00
35aa353c48 Change watchdog log from "NCCL" to "Process group" (#118121)
This PR changes the watchdog log.
To avoid the misconception that NCCL itself creates the watchdog thread and reports the error log, it is better to change "NCCL" to "Process group" to better indicate the source of the log.

@wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118121
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-02-07 20:14:49 +00:00
892a7bf674 [BE]: Add filelock typing to mypy stubs (#119390)
Realized we used filelock in some places, but didn't have a mypy type stub for it. Noticed it in this PR: https://github.com/pytorch/pytorch/pull/119386
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119390
Approved by: https://github.com/albanD, https://github.com/malfet
2024-02-07 20:14:28 +00:00
d0db80126e [EZ][CI] Fetch full history for MPS jobs (#119401)
Otherwise emitting TD stats will fail with the following warning:
```
Emiting td_test_failure_stats
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/testing/target_determination/heuristics/edited_by_pr.py:37: UserWarning: Can't query changed test files due to Command '['git', 'merge-base', 'origin/main', 'HEAD']' returned non-zero exit status 1.
  warn(f"Can't query changed test files due to {e}")
```

Test plan: Observe that MPS jobs finishes without those warnings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119401
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-02-07 19:29:30 +00:00
51fb99250b Fix missing MAST log when there is Unicode non-decodable text in logs (#119298)
Summary:
## Issue
When there is Unicode non-decodable text in logs, `tail_logger` will stop working afterwards, i.e. f527390102

In the example, the process stopped producing Python logs after 17:20:21 until the job finished
```
[0]:I0201 17:20:21.338000 3429 gen_ai/genie_projects/llm/metaformers/reward_model_score.py:335] Progress: 118 batches out of 512 total batches. 23.05 % | (gpu mem: 25.8GB, free CPU mem: 1387.8GB)
I0201 17:39:14 Stopping twtask-main.service with Service Result: [success] Exit Code: [exited] Exit Status: [0]
```
At the end, `UnicodeDecodeError` was thrown at the end with no call stack.

## Fix
Use `errors="replace"` to avoid throwing an exception when a `UnicodeDecodeError` would otherwise happen.
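
A minimal illustration of the idea (the real change is in the tail logger's stream decoding; the byte string here is made up):

```python
data = b"Progress: 118 batches \xff\xfe out of 512"  # contains invalid UTF-8 bytes

# data.decode("utf-8") would raise UnicodeDecodeError and stop the tailer;
# errors="replace" substitutes U+FFFD and keeps the log stream alive.
print(data.decode("utf-8", errors="replace"))
```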

Test Plan: f528854819

Differential Revision: D53483644

Co-authored-by: Jack Zhang <jackzh@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119298
Approved by: https://github.com/XilunWu
2024-02-07 19:25:43 +00:00
02c24b0b5e Add Python binding resizable to class {Untyped,Typed}Storage (#119286)
This PR exposes `resizable` method of `StorageImpl` to Python frontend to make it accessible for users.

Fixes #119233
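
A quick hedged example of the newly exposed method (the exact result may vary by storage type and device):

```python
import torch

s = torch.empty(4).untyped_storage()
print(s.resizable())  # expected True for a regular CPU storage that owns its memory
```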

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-02-07 19:15:55 +00:00
d054cd3e44 [FSDP2] Added reshard_after_forward (#118017)
This PR adds the `reshard_after_forward: Union[bool, int]` arg and a `reshard()` method. The `reshard_after_forward` argument trades off communication and memory.
- `reshard_after_forward=True`: reshard parameters after forward; unshard (all-gather) in backward
- `reshard_after_forward=False`: no reshard of parameters after forward; no unshard (all-gather) in backward
- `reshard_after_forward: int`: reshard parameters to a smaller world size; unshard (all-gather) over small world size in backward

In comparison with DeepSpeed and existing FSDP:
- `reshard_after_forward=True` == `FULL_SHARD` == ZeRO-3
- `reshard_after_forward=False` == `SHARD_GRAD_OP` == ZeRO-2
- `reshard_after_forward=8` == ZeRO++

ZeRO-1 is `reshard_after_forward=False` without gradient reduction (implemented in a later PR). If we need gradient reduction on an iteration, then ZeRO-2 supersedes ZeRO-1.

We prefer a simple state transition between `SHARDED` / `SHARDED_POST_FORWARD` and `UNSHARDED`, where the state directly defines what tensors are registered to the module. In particular, we _do not_ have a state where the sharded parameters are registered but the unsharded parameters are still in GPU memory. This greatly simplifies our state transitions, but it means that parameters may be non-intuitively registered to the module (e.g. if only the root does not reshard after forward, then the root will be the only without sharded parameters registered). To address this, we introduce a simple `reshard()` method that can force-reshard the parameters. This makes sense to me because the typical case does not care about the registered parameters after forward (in fact, for existing FSDP with `use_orig_params=False`, the unsharded parameters are still registered and are dangling tensors without storage.)

I plan to expose a complementary `unshard(async_op: bool = True)` method in the future.
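
A hedged sketch of using the force-reshard method described above (the already-sharded module and input batch are passed in as assumptions; construction and process-group setup are elided):

```python
def forward_and_free(fsdp_model, batch):
    # With reshard_after_forward=False (ZeRO-2-like), unsharded parameters stay
    # registered after forward; reshard() forces them back to the sharded state
    # so only sharded parameters remain registered.
    out = fsdp_model(batch)
    fsdp_model.reshard()
    return out
```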

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118017
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-02-07 19:14:20 +00:00
482d952e88 [quant][pt2e][bc-breaking] Remove fold_quantize flag (#118701)
Summary:
This is a follow up to https://github.com/pytorch/pytorch/pull/118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The flag `fold_quantize` now defaults to True in `convert_pt2e`, so we'll fold the quantize op into the weight by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
2024-02-07 19:10:51 +00:00
0e2330d84c fix lint (#119395)
Summary: as title

Test Plan: lint

Differential Revision: D53532399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119395
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2024-02-07 19:06:41 +00:00
23b030a79c [easy] Add testing utilties for torch.nn.utils.set_swap_module_params_on_conversion (#118023)
For the above PR, to parametrize existing `load_state_dict` tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118023
Approved by: https://github.com/albanD
ghstack dependencies: #118028, #117167
2024-02-07 18:55:44 +00:00
d5a718d27b Add swap_tensors path to nn.Module._apply (#117167)
Added `torch.__future__.{get/set}_swap_module_params_on_conversion`, which defaults to `False` for now, but we probably want to override this and default to `True` in `nn.Module._apply` if the input is a tensor subclass.

From offline discussion, for now we are **not** allowing `swap_tensor` after the first module forward has been run*** if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1.  The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)). **Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary.**

***From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected.

If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error.

**`RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the`swap_tensors` path as of now**
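
A small hedged example of the new future flag taking effect through `nn.Module._apply` (module choice is arbitrary; `RNN`/`GRU`/`LSTM` are excluded as noted above):

```python
import torch
import torch.nn as nn

torch.__future__.set_swap_module_params_on_conversion(True)

m = nn.Linear(2, 2)
m.double()  # _apply now swaps parameter tensors via torch.utils.swap_tensors
print(m.weight.dtype)  # torch.float64
```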

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167
Approved by: https://github.com/albanD
ghstack dependencies: #118028
2024-02-07 18:55:44 +00:00
91d1d2c421 Make MHA Query Scaling Behaviors Consistent (#119323)
The multi-head attention (MHA) query scaling behaviors are not consistent when [`need_weights`](8ac9b20d4b/torch/nn/modules/activation.py (L1073)) values are different.

On the current main, when `need_weights = True`, the query scaling was performed using a [division](8ac9b20d4b/torch/nn/functional.py (L5434)) and it will be exported as a `Div` operator in ONNX. When `need_weights = False`, the query scaling was performed using a [multiplication](422b4271ae/aten/src/ATen/native/transformers/attention.cpp (L711)) and it will be exported as a `Mul` operator in ONNX defined in the [PyTorch ONNX Symbolics](422b4271ae/torch/onnx/symbolic_opset14.py (L177)).

We should make the query scaling behaviors consistent. On most of the platforms, multiplication performs no worse than division. Therefore, we should use multiplication consistently for both `need_weights = True` and `need_weights = False`.
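
A tiny check of the claim that multiplying by the reciprocal of the scale matches dividing by it (up to floating-point rounding); the shapes are arbitrary:

```python
import math
import torch

q = torch.randn(2, 4, 8)
scale = math.sqrt(q.size(-1))
assert torch.allclose(q / scale, q * (1.0 / scale))
```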
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119323
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD
2024-02-07 18:42:57 +00:00
5eda355e54 [inductor, test] remove cast for test_pow2_cpu (#114912)
Verifies https://github.com/pytorch/pytorch/issues/94010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114912
Approved by: https://github.com/angelayi
2024-02-07 18:32:30 +00:00
0dab6fb352 Fix estimate_nccl_collective_runtime (#118986)
`estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR:
- Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497.
- Adds white-box testing so future issues can be surfaced in tests.
- Add support for native funcol IRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986
Approved by: https://github.com/yf225
ghstack dependencies: #118910, #118911, #118437
2024-02-07 18:02:51 +00:00
088d538a8d Revert "[Inductor] GEMM shape padding improvements (#118522)"
This reverts commit cc46829f96dba05b9b46bae31a1e6d2a053f667e.

Reverted https://github.com/pytorch/pytorch/pull/118522 on behalf of https://github.com/eellison due to regresses HF ~4/5% ([comment](https://github.com/pytorch/pytorch/pull/118522#issuecomment-1932557670))
2024-02-07 17:42:14 +00:00
f6bf7d26e1 Print full exception info in Graph break log (#119292)
So, this is a little awkward, and I wouldn't mind more thoughts on how best to do this.

Let's suppose that you have a graph break inside of an inlined function call. We are not actually going to print this graph break yet; instead, we are going to restart analysis so that we can run up until the inlined function call. When this happens, the only log message we ever get is the log to `graph_break` (seen here) reporting that a graph break has occurred.

In the current code, we don't print the fully formatted exception if you are only using `graph_breaks` logging. So the exception that induced the graph break has its traceback lost forever. For some classes of errors, esp., guard on data-dependent SymInt, this is quite bad.

With this change, we do print the traceback. On this sample program:

```
import torch
import torch._dynamo.config

torch._dynamo.config.capture_scalar_outputs = True

def g(x, y):
    y = x.item()
    if y < 3:
        return x + 2
    else:
        return x + 3

@torch.compile()
def f(x, y):
    y = y * y
    return g(x, y)

f(torch.tensor(4), torch.randn(4))
```

It looks like this:

```
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 878, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return guard_scalar(self.sym_num)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 414, in guard_scalar
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return guard_bool(a)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 663, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return a.node.guard_bool("", 0)  # NB: uses Python backtrace
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/sym_node.py", line 366, in guard_bool
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/recording.py", line 227, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return fn(*args, **kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3670, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     concrete_val = self.size_hint(orig_expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3403, in size_hint
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     raise self._make_data_dependent_error(result_expr, expr)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] During handling of the above exception, another exception occurred:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Traceback (most recent call last):
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return inner_fn(self, inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     self.call_function(fn, args, {})
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     self.push(fn.call_function(self, args, kwargs))
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 279, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return super().call_function(tx, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/functions.py", line 87, in call_function
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return tx.inline_user_function_return(
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2262, in inline_call
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return cls.inline_call_(parent, func, args, kwargs)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2372, in inline_call_
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     tracer.run()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     and self.step()
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     getattr(self, inst.opname)(inst)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 431, in inner
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     eval_result = value.evaluate_expr(self.output)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/tensor.py", line 880, in evaluate_expr
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     raise UserError(  # noqa: TRY200
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] torch._dynamo.exc.UserError: Consider annotating your code using torch._constrain_as_*(). It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] From user code at:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/b.py", line 16, in f
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     return g(x, y)
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/ezyang/b/pytorch/b.py", line 8, in g
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     if y < 3:
[2024-02-06 10:32:24,334] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
```

The end of the log at the restarted computation could maybe be improved too. Right now it looks like this:

```
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 2 [UserFunctionVariable(), LazyVariableTracker(), TensorVariable()]
[2024-02-06 10:32:24,338] [0/0_1] torch._dynamo.output_graph: [DEBUG] COMPILING GRAPH due to GraphCompileReason(reason='Consider annotating your code using torch._constrain_as_*(). It appears that you\'re trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is u0 < 3 (unhinted: u0 < 3).  For more information, run with TORCH_LOGS="+dynamic".\n\nFor more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#constrain-as-size-example', user_stack=[<FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 16 in f>, <FrameSummary file /data/users/ezyang/b/pytorch/b.py, line 8 in g>], graph_break=True)
```

An alternative to doing it this way, is I can make symbolic shapes print a warning log when guard on unbacked SymInt itself, so we don't have to worry about Dynamo generating the backtrace well. If, for the most part, the backtrace for other graph breaks is irrelevant, then this would seem to be a more expedient solution.

PTAL and submit your opinions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119292
Approved by: https://github.com/yanboliang
2024-02-07 17:20:31 +00:00
f79ae7599a [export] fakify module state in nonstrict (#119297)
Summary:
Previously, we were not fakifying module state explicitly in the nonstrict path.

This led to errors when modules were constructed under a fake mode, since the user-provided fake mode was clashing with the one that we had constructed internally to fakify the inputs.

This fixes things to use a single fake mode for everything.

As a side effect, this raised the question of how we ought to serialize state_dicts/constants that might be fake tensors. Naively calling torch.save understandably explodes—so this diff piggybacks on our infra for doing this on meta["val"]. Open to revising this, I'm low confidence that it's the best way to do it.

Test Plan: unit tests

Differential Revision: D53484942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119297
Approved by: https://github.com/tugsbayasgalan
2024-02-07 17:12:22 +00:00
40ec155e58 [AOTI][refactor] Split common aoti_runtime utils into a separate header (#119066)
Summary: Split common utils from aoti_runtime/model.h into a separate header file, because when turning on ABI-compatible mode for JIT Inductor we won't need AOTInductorModel, but we do need some common utils, e.g. RAIIAtenTensorHandle.

Differential Revision: [D53478809](https://our.internmc.facebook.com/intern/diff/D53478809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119066
Approved by: https://github.com/khabinov
2024-02-07 16:54:00 +00:00
059994d2b7 Migrate load_state_dict hook tests to OptimizerInfo (#119310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119310
Approved by: https://github.com/albanD
ghstack dependencies: #119283, #119288, #119299, #119308
2024-02-07 16:00:01 +00:00
0320e62255 Migrate test_state_dict hooks to OptimizerInfo (#119308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119308
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283, #119288, #119299
2024-02-07 16:00:01 +00:00
5c46600f84 [RELAND] refactor lazy init to device-agnostic (#119248)
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.

# Additional Context
No need more UTs.
This is a reland PR, the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846).
This is a common PR, and does not trigger xpu ciflow.

Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman
2024-02-07 15:58:51 +00:00
3625ccfbea Move step global hooks test to OptimizerInfo (#119299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119299
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283, #119288
2024-02-07 15:50:31 +00:00
7b3762e6bc Move step pre/post hook tests to OptimizerInfo (#119288)
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers at around 6-7 each). The test time seems fine though!

With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s

OK
```

Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s

OK
```

The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s

OK
```
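
For reference, a minimal sketch of the per-optimizer step hook API these tests exercise (standard `register_step_pre_hook`/`register_step_post_hook` usage, not code from this PR):
```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def pre_hook(optimizer, args, kwargs):
    print("before step")

def post_hook(optimizer, args, kwargs):
    print("after step")

opt.register_step_pre_hook(pre_hook)
opt.register_step_post_hook(post_hook)

model(torch.randn(3, 4)).sum().backward()
opt.step()  # prints "before step" then "after step"
```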

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
2024-02-07 15:50:31 +00:00
99ddfaf572 Add symbol guard counts instrumentation (#119290)
This helps us understand if there are symbols which are extremely hot
(i.e., have a lot of guards mentioning them).  Extremely hot symbols are
candidates for being turned static.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119290
Approved by: https://github.com/bdhirsh
2024-02-07 14:35:14 +00:00
7c95cc5e03 Add basic reference documentation for symbolic_shapes.py (#118997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118997
Approved by: https://github.com/albanD
2024-02-07 14:33:42 +00:00
1435cfecfa Increase accumulate_grad_ gradient's expected refcount to account for pybind (#119068)
Account for the pybind wrapper of the op holding 1 ref when torch.ops.inductor.accumulate_grad_.default is called at runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119068
Approved by: https://github.com/jansel
ghstack dependencies: #118817, #119334
2024-02-07 10:25:43 +00:00
326dcf9dc8 Never reuse accumulated gradients' buffers (#119334)
Since accumulate grad may steal the gradient's `c10::Storage`, we can't reuse its buffer, otherwise the gradient would get overwritten. From benchmarks, using the inductor's codegen'd _empty_strided_cpu/cuda and assigning to it has lower overhead than deep copying the gradient and reusing its buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119334
Approved by: https://github.com/jansel
ghstack dependencies: #118817
2024-02-07 10:25:42 +00:00
8e14e1d514 Fix gradient refcounts in pybind and compiled autograd (#118817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118817
Approved by: https://github.com/jansel
2024-02-07 10:25:42 +00:00
d85631b721 Revert "Fix deadlock in ExecutionTraceObserver (#119242)"
This reverts commit 6fc775ae13b675f8d02f7f85bc4348bba3ae3dd3.

Reverted https://github.com/pytorch/pytorch/pull/119242 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119242#issuecomment-1931445631))
2024-02-07 07:37:22 +00:00
dfdbd73360 add Half support for flash attention (#119247)
Re-open for https://github.com/pytorch/pytorch/pull/118368.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119247
Approved by: https://github.com/drisspg, https://github.com/malfet
2024-02-07 05:57:41 +00:00
0f478d9d61 [Dynamo][15/N] Merge allow_in_graph/inline/skip trace rules check into trace_rule.lookup (#118971)
Finally we have this PR to merge allow_in_graph/inline/skip trace rules into ```trace_rules.lookup_inner```, where we can define and look up trace rules at both the function level and the file level. Going forward, this is the central place where we define and consult the Dynamo trace rule for any function.
* ```trace_rules.lookup``` is the API that can return allow_in_graph, inline or skip.
* ```skipfiles.check``` is the API that can return inline or skip, since we have multiple places that only do the inline/skip check.
  *  I'll move ```skipfiles.check``` to ```trace_rules.check``` as one of the follow-ups.
* Both functions consult ```trace_rules.lookup_inner``` to get the tracing rule.

To avoid a single big PR, I left a few items as the follow-ups:
* Remove ```skipfiles.py``` and merge the code into ```trace_rules.py```.
* We do a double check in ```symbolic_convert.check_inlineable```; I will refactor and simplify it. We should only do the inline/skip check before generating ```SkipFilesVariable``` and ```UserFunctionVariable```.
* Rename ```SkipFilesVariable``` to ```SkipFunctionVariable```, since we only handle functions.
* The inline/skip reasons are not logged in some cases, since the new lookup framework doesn't always return inline/skip reasons. I'll refactor the logging to record the inline/skip reason in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118971
Approved by: https://github.com/jansel
2024-02-07 05:15:39 +00:00
284b0b5f44 Add --local-ranks-filter to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--local-ranks-filter`, which selects by rank which files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr)

## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, with the file streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to the console. We intercept this empty string and redirect it to "/dev/null" so that nothing is printed to the console.

The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --local_rank_filter=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --local_rank_filter=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-02-07 04:29:54 +00:00
6c3600d008 Enable optional tensorList fallback to cpu. (#119273)
Add optional tensorList fallback to CPU.
Add test cases; the old PR is: https://github.com/pytorch/pytorch/pull/106449

@bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119273
Approved by: https://github.com/bdhirsh
2024-02-07 03:54:13 +00:00
53ee47ca32 [vision hash update] update the pinned vision hash (#119337)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119337
Approved by: https://github.com/pytorchbot
2024-02-07 03:43:26 +00:00
ee1c2449f7 [dynamo] delete dynamo cache entry when guard function is invalidated [attempt 2] (#119107)
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.

Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
2024-02-07 03:32:42 +00:00
fcc36de9d6 [ONNX][dynamo_export] Turn off opmath type promotion for div (#119112)
Skip opmath promotion for `_prims_common.ELEMENTWISE_TYPE_PROMOTION_KIND.INT_TO_FLOAT` as well.
Fixes https://github.com/pytorch/pytorch/issues/118941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119112
Approved by: https://github.com/thiagocrepaldi
2024-02-07 03:27:00 +00:00
45a79323fe Add torch.dtype instances to the public API (#119307)
Fixes #91908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119307
Approved by: https://github.com/albanD
2024-02-07 02:57:49 +00:00
8c2fde1fcf [EZ][BE] [CMake] Remove checks for GCC-7 (#119306)
As PyTorch now uses C++17 and needs gcc-9.4+ to compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119306
Approved by: https://github.com/Skylion007
2024-02-07 01:24:01 +00:00
e9907a3446 [PyTorch] Free up 8 bytes per intrusive_ptr_target (#117986)
We don't need 64-bit reference and weak counts. (We also probably don't need a full 32 bits, but we'll deal with that later.)

Differential Revision: [D52851891](https://our.internmc.facebook.com/intern/diff/D52851891/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117986
Approved by: https://github.com/ezyang
2024-02-07 00:48:00 +00:00
5f2ad407a9 Fix typo on torch.frombuffer() documentation (#119214)
Fixes #114345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119214
Approved by: https://github.com/albanD
2024-02-07 00:41:51 +00:00
5ae6f6cffe Test seo torch cuda (#119324)
Testing if this will help improve SEO of this page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119324
Approved by: https://github.com/albanD
2024-02-07 00:39:51 +00:00
728228a7c7 LazyGraphModule: improve the fix for the FakeTensorMode mismatch issue (#119311)
The previous fix https://github.com/pytorch/pytorch/pull/118981 misses some corner cases. It works when both LazyGraphModule and compiled-autograd are enabled, but it fails with a FakeTensorMode mismatch error again if LazyGraphModule+CompiledAutograd+DynamicShape are all enabled. Note that disabling any of the three avoids the issue.

The reason enabling DynamicShape breaks the previous fix is that we call the bw_compiler here, before running the backward pass, if there are symints saved for backward: 73f0fdea5b/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L382)

The bw_compiler may cause an extra GraphModule recompilation on the bw_module, which makes its forward method become the lazy one again. The fix is simply to delay applying the previous fix until after the potential extra call to the bw_compiler.

Repro on hf_Whisper:
```
CUDA_VISIBLE_DEVICES=1 time benchmarks/dynamo/torchbench.py -dcuda --training --backend=inductor --disable-cudagraphs --accuracy --only hf_Whisper --repeat 1 --compiled-autograd  --dynamic-batch-only
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119311
Approved by: https://github.com/xmfan, https://github.com/jansel
2024-02-07 00:35:39 +00:00
e868a7fedd [AOTI] Rename config.aot_inductor.abi_compatible (#119065)
Summary: Rename config.aot_inductor.abi_compatible to config.abi_compatible, since the cpp_wrapper mode in JIT Inductor will share the same flag.

Differential Revision: [D53478752](https://our.internmc.facebook.com/intern/diff/D53478752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119065
Approved by: https://github.com/khabinov
2024-02-07 00:14:33 +00:00
c814d8e5c2 Fix handling random() calls encountered inside inlined code. (#119218)
Fix https://github.com/pytorch/pytorch/issues/118787

In the compiled function, calls to random() are replaced with a single call to a function
that generates all the random variables.
The random calls encountered during compilation used to be tracked inside a variable
stored in the instruction translator, and when there are nested translators, the tracked
calls would get lost when the inner instruction translator popped out.

This diff fixes that by moving the tracked calls to the output graph, which is shared across translators that are generating the same function.

More details about the issue and why this solution is picked are in the github issue above.
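
A hypothetical toy pattern (not the original repro) showing random() encountered while Dynamo inlines a helper:
```python
import random
import torch

def helper():
    # random() call encountered while the inner instruction translator inlines `helper`
    return random.random()

@torch.compile(backend="eager")
def f(x):
    return x + helper()

print(f(torch.ones(2)))
```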

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119218
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-06 23:48:21 +00:00
5e78c4b0f4 [dynamo] Functools partial reconstruct (#118583)
Replaces #117721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118583
Approved by: https://github.com/yanboliang
ghstack dependencies: #118901, #118616
2024-02-06 23:42:43 +00:00
62cc1053d8 [dynamo] Fix missing guards in FunctoolsPartialVariable (#118616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118616
Approved by: https://github.com/yanboliang
ghstack dependencies: #118901
2024-02-06 23:42:43 +00:00
6fc775ae13 Fix deadlock in ExecutionTraceObserver (#119242)
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.

This diff fixes the deadlock by replacing std::mutex with std::recursive_mutex.

Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low and performance should NOT be a concern.
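
A Python analogue of the fix (illustrative only; the actual change is in C++ and swaps std::mutex for std::recursive_mutex): a plain lock self-deadlocks on re-entry from the same thread, a recursive lock does not.
```python
import threading

lock = threading.RLock()  # recursive-mutex analogue; threading.Lock() would deadlock below

def record_operator_start():
    with lock:
        pass  # observer callback re-acquires the lock

def convert_ivalue():
    with lock:
        # reading storage_offset() on a compiled module re-enters the observer
        record_operator_start()

convert_ivalue()
print("no deadlock")
```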

Test Plan:
Unit Test
    buck test  mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2

Differential Revision: D53299183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242
Approved by: https://github.com/aaronenyeshi
2024-02-06 23:36:22 +00:00
d0ca849fdf Refactor Symint Deduping to separate pass (#118938)
Previously, SymInt deduping was done during proxy tracing, which made it more difficult to reason about. This refactors the deduping into a separate pass.

We only dedupe symints which are resolvable from input symint nodes so as to avoid inducing a dependency on the backward in the forward.

potential fix for : https://github.com/pytorch/pytorch/issues/118224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118938
Approved by: https://github.com/ezyang
2024-02-06 23:07:31 +00:00
dea15c9fdc Revert "Add meta registration for _foreach_norm (#118604)"
This reverts commit b8bb12cd454b716da6a98db826fcc45fd7c0db05.

Reverted https://github.com/pytorch/pytorch/pull/118604 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118604#issuecomment-1930849491))
2024-02-06 22:20:44 +00:00
6c1cca153e [quant][pt2e] Allow users to override train/eval behavior (#119091)
Summary: This commit adds a util for PT2E quantization users
to call `model.train()` and `model.eval()` without error.
Instead, these will automatically call the equivalent
`move_exported_model_to_train/eval` for the user, which only
switch behavior for special ops like dropout and batchnorm.
This enables users to onboard to the PT2E flow more easily.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_allow_exported_model_train_eval

Reviewers: jerryzh168, tugsbayasgalan, zhxchen17

Subscribers: jerryzh168, tugsbayasgalan, zhxchen17, supriyar

Differential Revision: [D53426636](https://our.internmc.facebook.com/intern/diff/D53426636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119091
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-02-06 22:19:58 +00:00
9d46fe603d Revert "[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)"
This reverts commit 4ab852b6c558a0b8e9fea0c863c782fe65f00be0.

Reverted https://github.com/pytorch/pytorch/pull/119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119099#issuecomment-1930839754))
2024-02-06 22:14:36 +00:00
0f68bcaa5c Make filename optional in update_failures.py (#119289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119289
Approved by: https://github.com/zou3519
2024-02-06 21:56:09 +00:00
422b4271ae Change PrivateUse1's resize_bytes to PrivateUse1HooksInterface (#117839)
Reopen from https://github.com/pytorch/pytorch/pull/117211
Modify the logic for entering the registration branch so that existing UTs are not affected.
Co-authored-by: albanD <desmaison.alban@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117839
Approved by: https://github.com/albanD
2024-02-06 20:51:56 +00:00
ae4e866bba [dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438)
Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090.

Changes:
- Move CacheEntry and ExtraState to C++
- Use pybind to control reference counting
- Use std::list instead of manually implementing a linked list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438
Approved by: https://github.com/jansel
2024-02-06 20:48:11 +00:00
73f0fdea5b [fix] accounting for dilation in pool padding assertion (#118897)
Fixes https://github.com/pytorch/pytorch/issues/7541

It is a copy of https://github.com/pytorch/pytorch/pull/111427, I have failed to fix all its issues in time, and it got closed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118897
Approved by: https://github.com/mikaylagawarecki
2024-02-06 20:32:58 +00:00
ec31d11580 [dynamo] Skip dynamo when inside a functorch context (#118901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118901
Approved by: https://github.com/zou3519
2024-02-06 20:22:24 +00:00
f3645fc38b Update auto_functionalize docs (#119228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119228
Approved by: https://github.com/zou3519
2024-02-06 19:50:54 +00:00
f85b0ea8bb Migrate last lbfgs test over to OptimizerInfo (#119283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119283
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-06 19:49:05 +00:00
3f0fd36835 Introduce size oblivious guards (#118579)
Fixes https://github.com/pytorch/pytorch/issues/117361

The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.

This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.

The infra pieces of this PR are:

* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.

The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.

As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)

When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
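
A hedged sketch of the Python API described above (treat the exact import path and the plain-bool handling as assumptions for illustration):
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def framework_style_check(t: torch.Tensor) -> torch.Tensor:
    # For a size-like unbacked SymInt u0, this guard is evaluated assuming u0 >= 2
    # at compile time, while 0 and 1 are still accepted at runtime.
    if guard_size_oblivious(t.shape[0] != 1):
        return t
    return t.squeeze(0)

print(framework_style_check(torch.ones(3, 2)).shape)  # torch.Size([3, 2])
```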

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
2024-02-06 19:45:32 +00:00
5410385c42 [dynamo] support comparing stream with constant (#119199)
Before the pr, we have a graph break for:
```python
def f():
    if torch.cuda.current_stream() is not None:
        return torch.randn(2, 2)
torch.compile(f, backend="eager", fullgraph=True)()
```
This PR supports comparison ops between StreamVariable and ConstantVariable by returning a constant.

It's safe to return a constant in this case because the StreamVariable is guarded by ID_MATCH when created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119199
Approved by: https://github.com/yifuwang, https://github.com/anijain2305, https://github.com/jansel
2024-02-06 19:26:03 +00:00
fa157af69c [mypy] declare type for DynamoTestCase._exit_stack (#119084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119084
Approved by: https://github.com/Skylion007
2024-02-06 18:26:07 +00:00
238d87f74d Add a short code snippet in the RNN doc (#119150)
Fixes #109443,
also remove a duplicated comment line `# Efficient implementation equivalent to the following:` in scaled_dot_product_attention doc.

@mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119150
Approved by: https://github.com/malfet
2024-02-06 17:41:51 +00:00
169c070076 Move catch_errors_wrapper to convert_frame (#119253)
With this change, we now have the invariant that eval_frame only
contains "hot" functions that are called at runtime, as opposed to
cold functions which are only called at compile time.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119253
Approved by: https://github.com/yanboliang
ghstack dependencies: #119251
2024-02-06 17:40:07 +00:00
790858afa9 Make start compiling stack trace omit framework frames (#119251)
Fixes https://github.com/pytorch/pytorch/issues/119238

Here's what it looks like now:

```
$ TORCH_LOGS=+torch._dynamo.convert_frame python a.py
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG] torchdynamo start compiling f /data/users/ezyang/b/pytorch/a.py:3, stack (elided 5 frames):
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]   File "/data/users/ezyang/b/pytorch/a.py", line 7, in <module>
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]     f(torch.randn(2))
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]     return fn(*args, **kwargs)
[2024-02-05 18:52:07,248] [0/0] torch._dynamo.convert_frame: [DEBUG]
$ cat a.py
import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(2))
```

The eval_frame frame is intentionally present, since what happens is you run the torch.compile wrapper, and then you actually hit the user frame to be compiled.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119251
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-02-06 17:40:07 +00:00
22669843c2 Reserve sizes in c10::VaryingShape::concrete_sizes(), c10::TensorType::computeStrideProps() (#119189)
Summary: Costly reallocs.

Test Plan: CI

Reviewed By: efiks

Differential Revision: D53264908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119189
Approved by: https://github.com/Skylion007
2024-02-06 17:13:37 +00:00
8ee9f26ce8 [Dynamo] Remove build_checkpoint_variable from call_getattr (#119236)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119236
Approved by: https://github.com/jansel
2024-02-06 16:59:40 +00:00
2ad3599a71 Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST (#118979)
Summary: Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST

Test Plan: See the one in D53154041
Reviewed By: yjhao, yanboliang, Yuzhen11

Differential Revision: D53154041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118979
Approved by: https://github.com/yanboliang
2024-02-06 16:25:33 +00:00
a77be631e0 Bugfix to MixtureSameFamily's _pad_mixture_dimension (#118947)
Fixes Issue #73792

This is a duplicate of pull request #73864. It's a small bugfix that should have happened a long time ago, but it didn't because I didn't actually follow up with the pull request after originally submitting it. That's my bad; trying to remedy the error.

This contains a fix to _pad_mixture_dimension, which intends to count the number of dimensions in its referent tensors, but accidentally counts the number of elements (and can thus end up creating tensors with potentially thousands of dimensions by mistake). Also contains a single test for the fixed behavior.
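
The distinction at the heart of the fix, illustrated (hypothetical tensor, not the original test):
```python
import torch

t = torch.zeros(10, 20)
print(t.dim())    # 2   -> the number of dimensions _pad_mixture_dimension should count
print(t.numel())  # 200 -> the number of elements the buggy code effectively counted
```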

Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118947
Approved by: https://github.com/soulitzer
2024-02-06 16:24:22 +00:00
499040ac32 Revert "Add FakeTensor support to torch._utils._rebuild_tensor (#108186)"
This reverts commit 426339e4de2efc0cbd501e2bff947ba890ec9817.

Reverted https://github.com/pytorch/pytorch/pull/108186 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/108186#issuecomment-1929978008))
2024-02-06 15:04:48 +00:00
1e4b408b02 [decomp] Add tests for different dtypes to SDPA decomposition (#119239)
Summary: As titled. Skipping torch.bfloat16 because for some reason the
difference is 0.01.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119239
Approved by: https://github.com/drisspg
2024-02-06 11:17:07 +00:00
85033759d6 Update scatter_reduce_ test with parallel backend check (#118708)
**Summary**
Follow-up of https://github.com/pytorch/pytorch/pull/118278, in which the newly added UT `test_scatter_using_atomic_add` failed with the `native parallel backend`, as reported in https://github.com/pytorch/pytorch/issues/118518.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118708
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2024-02-06 09:43:40 +00:00
7d7a3f0b37 [inductor] Support sympy.expr in user-defined Triton kernel grid fn (#119165)
## Problem

A user-defined Triton kernel grid may use a sympy magic method like `Max`. This comes in the form of a `sympy.Expr`, namely `sympy.core.function.FunctionClass`.

Handling this is not trivial since `user_defined_kernel_grid_fn_code` is used in Eager & Inductor. Eager usage below.

## Approach

Pass in wrapper when Inductor codegens grid with ints/sympy.Expr, so we can utilize wrapper functions, such as `codegen_shape_tuple()`.

Differential Revision: D53367012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119165
Approved by: https://github.com/aakhundov
2024-02-06 08:39:55 +00:00
8a8e70477e Fix type hints on nn.attention.sdpa_kernel (#119140)
Fixes #119133
Altered type hint and assert to include SDPBackend; disallowed None in assert.
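
A hedged usage sketch of the context manager whose type hints were fixed (assuming a list of SDPBackend values is accepted; a single SDPBackend is also allowed per this fix):
```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 4, 8, 16)
with sdpa_kernel([SDPBackend.MATH]):  # restrict SDPA to the math backend
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(out.shape)
```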

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119140
Approved by: https://github.com/mikaylagawarecki, https://github.com/cpuhrsch, https://github.com/drisspg
2024-02-06 07:33:22 +00:00
720f781160 [CPU] Optimize softmax as flash attention v2 (#118957)
### Description
Following FlashAttention-2, optimize softmax by moving the division by the sum out of the KV inner loop.
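
An illustrative PyTorch sketch (not the actual C++ kernel) of the FlashAttention-2 style rescaling this relies on: the division by the running sum is deferred until after the KV loop instead of renormalizing inside it.
```python
import torch

def online_softmax_weighted_sum(score_blocks, value_blocks):
    m = torch.tensor(float("-inf"))                 # running max
    s = torch.tensor(0.0)                           # running sum of exponentials
    acc = torch.zeros(value_blocks[0].shape[-1])    # unnormalized output accumulator
    for scores, values in zip(score_blocks, value_blocks):
        m_new = torch.maximum(m, scores.max())
        scale = torch.exp(m - m_new)                # rescale previous accumulator
        p = torch.exp(scores - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ values
        m = m_new
    return acc / s                                  # single division outside the loop

# Sanity check against a direct softmax over the concatenated scores.
scores = [torch.randn(4) for _ in range(3)]
values = [torch.randn(4, 8) for _ in range(3)]
ref = torch.softmax(torch.cat(scores), dim=0) @ torch.cat(values)
print(torch.allclose(online_softmax_weighted_sum(scores, values), ref, atol=1e-5))
```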

### Performance
Stable Diffusion V2.1 on GNR

| Version | Kernel time (s) | Speedup |
|---------|----------------|----------------|
| BF16 Before | 28.67 | |
| BF16 After | 23.55 | 17.86% |
| FP32 Before | 54.20 | |
| FP32 After | 49.47 | 8.73% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118957
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-02-06 07:06:36 +00:00
4ab852b6c5 [c10d] PGNCCL refactor part 1: adds assert size==1 (#119099)
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099
Approved by: https://github.com/wconstab
2024-02-06 06:59:47 +00:00
884b6d2a67 [inductor] Implementing missing magic methods on IR values. (#118933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118933
Approved by: https://github.com/peterbell10
2024-02-06 05:50:26 +00:00
e47f571da7 Revert "Update scatter_reduce_ test with parallel backend check (#118708)"
This reverts commit d670dfb7ae0a88cf010455301eb1d0ef91950f1a.

Reverted https://github.com/pytorch/pytorch/pull/118708 on behalf of https://github.com/leslie-fang-intel due to Test Case still fail ([comment](https://github.com/pytorch/pytorch/pull/118708#issuecomment-1928767568))
2024-02-06 04:37:08 +00:00
12ac3ba383 [executorch hash update] update the pinned executorch hash (#118936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118936
Approved by: https://github.com/pytorchbot
2024-02-06 03:41:33 +00:00
3497388b9f [export] Fix serialization for auto_functionalization (#118810)
- Added support for serializig the auto_functionalization op, which
  required adding the functions `serialize_arbitrary_inputs` and
  `serialize_arbitrary_outputs` which will serialize the inputs/outputs
  without needing a schema, since HOOs do not have a schema.
- Added support for serializing user input mutations
- Added support for serializing operator inputs. They just get turned
  into strings.

Differential Revision: [D53331039](https://our.internmc.facebook.com/intern/diff/D53331039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118810
Approved by: https://github.com/suo
2024-02-06 03:41:05 +00:00
03db96c248 [Dynamo] Enhance autograd.Function strict mode test (#119237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119237
Approved by: https://github.com/zou3519
2024-02-06 02:54:19 +00:00
074f2bb5ce Fix dynamo benchmark runner for torchbench skip sets (#118615)
Fix the dynamo benchmark runner for the torchbench skip sets, which were introduced by PR #118032

This runner.py script is still used in the [Inductor CPU Performance Dashboard](https://github.com/pytorch/pytorch/issues/93531) regular test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118615
Approved by: https://github.com/jgong5, https://github.com/ysiraichi, https://github.com/ezyang
2024-02-06 02:06:54 +00:00
9250965f8b [ez] Lower windows timeout limit for trunk, set test step timeout (#119234)
Lower the Windows timeout to be the same as Linux.

Also set a test step timeout for Windows (the Linux version and details on why are in https://github.com/pytorch/pytorch/pull/93084)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119234
Approved by: https://github.com/huydhn
2024-02-06 01:54:31 +00:00
86d5d1650b [dynamo] support dict.clear() (#119197)
For code like following:
```python
import torch
def f():
    a = {"a": torch.randn(2, 2)}
    a.clear()
    return a
torch.compile(f, backend="eager", fullgraph=True)()
```

We have a graph break before the pr:
```
torch._dynamo.exc.Unsupported: call_method ConstDictVariable() clear [] {}
```

Test Plan:
Added new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119197
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-06 01:17:55 +00:00
c0164f2393 Revert "[BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)"
This reverts commit 04d52d5399ad4abb8af9e8405be79e2a7f8b4c7a.

Reverted https://github.com/pytorch/pytorch/pull/119039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MPS test in trunk 04d52d5399,  may be a landrace ([comment](https://github.com/pytorch/pytorch/pull/119039#issuecomment-1928595240))
2024-02-06 01:13:28 +00:00
3829b55416 [inductor] Support ProxyExecutor argument codegen for sympy.Expr (#119166)
Differential Revision: D53398312

## Problem
Currently, if a sympy expression that uses a magic method like `Max` is passed as an argument to ProxyExecutor, then C++ compilation will fail. We need to use std::max method instead.

```
# What we see
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{Max(1025, u1)}.data(), ...);

# What we want
aoti_torch_proxy_executor_call_function(..., std::vector<int64_t>{std::max(1025L, u1)}.data(), ...)
```

## Approach
Use C++ wrapper's expression printer to handle this conversion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119166
Approved by: https://github.com/aakhundov
2024-02-06 00:33:25 +00:00
781f7c9080 [BE] Use OptimizerInfo step_requires_closure, only_supports_sparse_grads (#119230)
So I had planned ahead of time to use these but forgot to actually use them when migrating tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119230
Approved by: https://github.com/albanD
2024-02-06 00:13:43 +00:00
69344fe987 c10d: Don't add NCCL backend by default without CUDA (#119149)
The NCCL backend requires CUDA (including devices) to be available, so don't use that backend by default when it isn't, in order to avoid the following error when creating a CPU-only device mesh:
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
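
A hedged single-process sketch of the CPU-only scenario (the environment-variable setup is only for a local one-process run):
```python
import os
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# Previously this could fail with "ProcessGroupNCCL is only supported with GPUs"
# on a machine without CUDA devices; a CPU-only mesh should use gloo.
mesh = init_device_mesh("cpu", (1,))
print(mesh)
dist.destroy_process_group()
```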

Fixes #117746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119149
Approved by: https://github.com/kwen2501
2024-02-05 23:55:07 +00:00
fd0bf96c2b [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together initially.

Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly, and multi-kernel Python code is not generated since it is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-05 23:35:41 +00:00
04d52d5399 [BE] Add dtypesIfMPS to ModuleInfo enabling float16 tests for MPS and remove all skipIfMPS for float64 (#119039)
Right now, `ModuleInfo.dtypes` defaults to `torch.testing._internal.common_dtype.floating_types()`, and almost no ModuleInfos override this (so only `float32` and `float64` are tested).

This is the first step to clean up/improve dtype testing for `ModuleInfos` and fix #116626.

Follow-up PRs will update `dtypes=` (and perhaps `dtypesIf{Device}`, if it makes sense) for each `ModuleInfo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119039
Approved by: https://github.com/janeyx99
2024-02-05 23:19:01 +00:00
d9d8c2b79f Remove HSDP validation check (#112435)
Currently, HSDP validates that all intra/inter node PGs are the same. This makes sense if you are only using HSDP with no other forms of parallelism and is a nice but not necessary sanity check.

However, if you want to mix HSDP with other forms, say tensor parallelism on the FFN of a transformer block, the intra/inter node PGs will be different for that layer. This check raises errors in this scenario, so we need to remove this assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112435
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-02-05 22:27:53 +00:00
966db82c9d Revert "Remove extra graph breaks (#118987)"
This reverts commit 9a8e3b07d75e3e9bb902f81b4b6e1042bbe06b58.

Reverted https://github.com/pytorch/pytorch/pull/118987 on behalf of https://github.com/eellison due to reverting because it causes regression ([comment](https://github.com/pytorch/pytorch/pull/118987#issuecomment-1928224447))
2024-02-05 22:19:37 +00:00
b8bb12cd45 Add meta registration for _foreach_norm (#118604)
This PR also fixes the discrepancy between the _foreach_norm fast path and slow path, where storage_offsets would differ between the lists of tensors. Here are some profile results showing that we aren't significantly slower. Do note that we're replacing N `as_strided`/`select` calls with N `empty` calls.

For script:
```
import torch

ts = [torch.rand(32, 16, device="cuda") for _ in range(128)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    res = torch._foreach_norm(ts)
print(p.key_averages().table(sort_by="cpu_time_total"))
```

OG baseline:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7cf98987)]$ python playground2.py
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        25.36%       4.209ms        99.94%      16.586ms      16.586ms       8.000us        88.89%       9.000us       9.000us             1
                                       cudaLaunchKernel        61.21%      10.159ms        61.21%      10.159ms       2.540ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.43%      71.000us        58.35%       9.683ms       9.683ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.33%      55.000us        57.35%       9.517ms       9.517ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.42%      69.000us        57.01%       9.462ms       9.462ms       1.000us        11.11%       1.000us       1.000us             1
                                           aten::select         8.04%       1.335ms        11.29%       1.873ms      14.633us       0.000us         0.00%       0.000us       0.000us           128
                                       aten::as_strided         3.24%     538.000us         3.24%     538.000us       4.203us       0.000us         0.00%       0.000us       0.000us           128
                                            aten::empty         0.90%     150.000us         0.90%     150.000us      75.000us       0.000us         0.00%       0.000us       0.000us             2
                                  cudaDeviceSynchronize         0.06%      10.000us         0.06%      10.000us      10.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        11.11%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        66.67%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        22.22%       2.000us       2.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 16.596ms
Self CUDA time total: 9.000us
```

And here's after this PR:
```
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        30.95%       4.653ms        99.95%      15.026ms      15.026ms       9.000us        90.00%      10.000us      10.000us             1
                                       cudaLaunchKernel        52.41%       7.879ms        52.41%       7.879ms       1.970ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.39%      58.000us        48.29%       7.260ms       7.260ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.35%      53.000us        47.25%       7.103ms       7.103ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.43%      65.000us        46.90%       7.050ms       7.050ms       1.000us        10.00%       1.000us       1.000us             1
                                            aten::empty        15.42%       2.318ms        15.42%       2.318ms      17.969us       0.000us         0.00%       0.000us       0.000us           129
                                  cudaDeviceSynchronize         0.05%       7.000us         0.05%       7.000us       7.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        10.00%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        60.00%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        30.00%       3.000us       3.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 15.033ms
Self CUDA time total: 10.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118604
Approved by: https://github.com/albanD
2024-02-05 22:01:01 +00:00
51e096114b Increase recommended logging in DEFAULT_LOGGING (#119207)
For long running batch jobs, it is best to opt for logs that are too
spammy rather than not spammy enough.  This lines up DEFAULT_LOGGING
with our current internal guidance at Meta.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119207
Approved by: https://github.com/bdhirsh
2024-02-05 21:59:10 +00:00
5086e1cf3f Remove distributed/c10d/Functional.hpp (#119138)
This file is useless and was accidentally checked in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119138
Approved by: https://github.com/Skylion007
2024-02-05 21:58:08 +00:00
200108c6e6 Delete old branches (#117079)
Example https://github.com/pytorch/pytorch/actions/runs/7562281351/job/20592425611?pr=117079 (The code to delete branches isn't being run, it's just listing the branches it wants to delete)

Internal code: https://fburl.com/code/hdvvbfkj

The threshold for a branch with a PR is 30 days regardless of whether the PR is merged (compared to 3 days if merged and 30 days if closed). The threshold for a branch without a PR is 1.5 years (same internally).

Threshold of ~400 queries to github so it doesn't hit token usage limits.  Currently this leads to about 350 branches deleted per run.

Only query for the last 90 days of updated PRs to reduce token usage, so if a branch has a PR but it was updated 90+ days ago, it will think it doesn't have a PR and will wait for the 1.5 years branch update check instead, regardless of whether the PR is open or closed.

I tested that it could delete my own branch and it worked.

labeled with test-config/crossref because I just want the smallest test config possible to reduce CI usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117079
Approved by: https://github.com/malfet
2024-02-05 20:50:05 +00:00
b816760a2f More progress on type checking ValueRanges (#118870)
Type checking Python is a pain. Here are my learnings:

* The types for heavily polymorphic code are going to be verbose; no way around it. I was originally hoping I could lean on polymorphism with a bounded TypeVar to compactly write signatures for many of the ValueRanges methods, but I ran into some unworkaroundable mypy bugs. Writing out all the types explicitly and using `@overload` liberally works pretty well, so I recommend people do that instead of trying to do fancy things (see the sketch after this list).
* Sympy is missing annotations for assumptions, because they are all metaprogrammed. I don't really relish maintaining a typeshed for sympy, so I wrote a small mypy plugin to add them in.
* GADT style refinement is... just not a good idea in practice. Mypy easily gets confused about whether a return value from a refined section is allowed for the outer return type. So many of these have been replaced with less informative implementation types and more informative external types via overloads. Hopefully this is good for use sites.
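
A generic illustration (not code from this PR) of the explicit-`@overload` recommendation, using a toy ValueRanges-like class:
```python
from dataclasses import dataclass
from typing import Union, overload

@dataclass
class VR:
    lower: Union[int, float]
    upper: Union[int, float]

@overload
def clamp(x: int, vr: VR) -> int: ...
@overload
def clamp(x: float, vr: VR) -> float: ...
def clamp(x: Union[int, float], vr: VR) -> Union[int, float]:
    # Explicit overloads let mypy track that an int input yields an int result,
    # without a bounded TypeVar or GADT-style refinement.
    return max(vr.lower, min(vr.upper, x))

print(clamp(7, VR(0, 5)), clamp(2.5, VR(0.0, 5.0)))
```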

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118870
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-02-05 20:29:25 +00:00
b92819a039 Move nn.Module.load_state_dict tests from test_nn.py to separate file (#118028)
Move these tests out so that in https://github.com/pytorch/pytorch/pull/117913 we can run these tests with both `torch.nn.utils.set_swap_module_params_on_conversion({True/False})` settings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118028
Approved by: https://github.com/albanD
2024-02-05 20:17:28 +00:00
71655bccbe Fix wrong mobile build Docker image (#119213)
It turns out that the Docker image name hasn't been updated and is referring to a non-existent image. Maybe we could update `calculate-docker-image` to fail in this case, if there is a way to distinguish a non-existent-name failure from a missing-tag failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119213
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet
2024-02-05 19:48:10 +00:00
962fca6839 [storage][perf] Reduce _get_device_from_module overhead. (#119144)
Using `rsplit` with maxsplit=1 is more efficient since it 1) stops traversal as soon as the first `.` from the right side is encountered, and 2) creates no more than a 2-element list.

This change also reuses `last_part` to avoid unnecessary repetition of a split.
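
A quick illustration of the point:
```python
name = "model.layer3.block.conv.weight"
prefix, last_part = name.rsplit(".", 1)  # stops at the first '.' from the right, returns 2 parts
print(prefix)     # model.layer3.block.conv
print(last_part)  # weight
```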
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119144
Approved by: https://github.com/Skylion007, https://github.com/mikaylagawarecki
2024-02-05 19:33:18 +00:00
b964a1222c Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit c24ffc3f66b2270dfc65a404687b91b55ed580e9.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1927877102))
2024-02-05 19:25:39 +00:00
b2e0f8d82d [mypy] added type annotations to codegen_nodes methods (#119080)
added correct type annotations to scheduler and backends'
codegen_nodes methods

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119080
Approved by: https://github.com/eellison
2024-02-05 18:33:52 +00:00
88e346680b Patch all_gather to support HSDP + TP (#118638)
Update all_gather to support HSDP + TP.

Currently, the `_all_gather_dtensor` function for dtensors only replaces the first dimension with replicate (the FSDP dimension) and does not touch the second dimension (which is assumed to be the TP dimension). With HSDP, we have two dimensions ahead of the TP dimension as opposed to one. This PR updates the function to replace all other dimensions with replicate in order to run the all-gather.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118638
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wz337
2024-02-05 18:29:23 +00:00
f481835115 Revert "add Half support for flash attention on CPU (#118368)" (#119204)
This reverts commit a5a63db3bf937a6eff993d1222fab18cc63f9cb2.

Fixes #ISSUE_NUMBER

Reverts #118368

Got reverted internally, but the branch got deleted so the automation didn't work

Mildly edited stack trace
```

...
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 25, in inner
    return fn(*args, **kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 635, in dispatch_trace
    graph = tracer.trace(root, concrete_args)
  File "torch/fx/experimental/proxy_tensor.py", line 995, in trace
    res = super().trace(root, concrete_args)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 25, in inner
    return fn(*args, **kwargs)
  File "torch/fx/_symbolic_trace.py", line 793, in trace
    (self.create_arg(fn(*args)),),
  File "torch/fx/experimental/proxy_tensor.py", line 665, in wrapped
    out = f(*tensors)
  File "<string>", line 1, in <lambda>
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 357, in _functionalized_f_helper
    f_outs = fn(*f_args)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 68, in inner_fn
    outs = fn(*args)
  File "torch/_functorch/_aot_autograd/utils.py", line 161, in flat_fn
    tree_out = fn(*args, **kwargs)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 618, in functional_call
    out = PropagateUnbackedSymInts(mod).run(
  File "torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "torch/_functorch/_aot_autograd/traced_function_transforms.py", line 593, in run_node
    result = super().run_node(n)
  File "torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "torch/fx/interpreter.py", line 274, in call_function
    return target(*args, **kwargs)
  File "torch/_ops.py", line 571, in __call__
    return self_._op(*args, **kwargs)
  File "torch/_subclasses/functional_tensor.py", line 380, in __torch_dispatch__
    outs_unwrapped = func._op_dk(
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 744, in __torch_dispatch__
    return self.inner_torch_dispatch(func, types, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 779, in inner_torch_dispatch
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 423, in proxy_call
    r = maybe_handle_decomp(proxy_mode, func, args, kwargs)
  File "torch/fx/experimental/proxy_tensor.py", line 1225, in maybe_handle_decomp
    return CURRENT_DECOMPOSITION_TABLE[op](*args, **kwargs)
  File "torch/_decomp/decompositions.py", line 4322, in scaled_dot_product_flash_attention_for_cpu
    torch._check(
  File "torch/__init__.py", line 1133, in _check
    _check_with(RuntimeError, cond, message)
  File "torch/__init__.py", line 1116, in _check_with
    raise error_type(message_evaluated)
RuntimeError: query must be FP32, FP64, BF16 but got torch.float16

While executing %_scaled_dot_product_flash_attention_for_cpu : [num_users=1] = call_function[target=torch.ops.aten._scaled_dot_product_flash_attention_for_cpu.default](args = (%l_q_, %l_k_, %l_v_), kwargs = {attn_mask: %l_attn_mask_})
Original traceback:
  File "executorch/backends/xnnpack/partition/graphs/sdpa.py", line 34, in forward
    return torch.nn.functional.scaled_dot_product_attention(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119204
Approved by: https://github.com/kit1980
2024-02-05 18:24:53 +00:00
ab613a4019 Revert "refactor lazy init to device-agnostic (#118846)"
This reverts commit 520771d7b35034c96c5b4604ecf8960e6aab856f.

Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11  ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))
2024-02-05 18:06:30 +00:00
124a54ef16 [jit][perf] Reduce lookupInModule overhead. (#119145)
It's inefficient to split the remaining parts of the module name by '.' just to join them back again. Instead it's more idiomatic and efficient to use `maxsplit=1` so that the remaining parts stay intact. This improves the best-case time and space complexity, since the scan can terminate at the first encountered `.` and only 2 parts are returned in the list.
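
A minimal Python sketch of the idea (illustrative only, not the exact `lookupInModule` change):

```python
name = "a.b.c.d"

# Before: split every component, then join the tail back together.
parts = name.split(".")                      # ['a', 'b', 'c', 'd']
head, rest = parts[0], ".".join(parts[1:])   # ('a', 'b.c.d'), builds an extra list + join

# After: stop at the first '.', so the tail stays intact.
head, rest = name.split(".", maxsplit=1)     # ('a', 'b.c.d'), at most 2 pieces
```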

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119145
Approved by: https://github.com/Skylion007
2024-02-05 18:01:00 +00:00
fa8d97776c [aotinductor] Migrate fuse_split_linear_add from dper_pass to AOTI based on predispatch IR (#118983)
Summary: As titled. Added support for fuse_split_linear_add in pregrad passes based on the predispatch IR

Test Plan: TORCH_LOGS=inductor,aot   buck2 run  mode/opt mode/inplace caffe2/test/inductor/fb:test_split_cat_fx_passes_aten_fb

Differential Revision: D53302168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118983
Approved by: https://github.com/kflu, https://github.com/chenyang78
2024-02-05 17:58:42 +00:00
5f9f771711 [DeviceMesh][Test] Remove test_raises_mesh_dim_less_than_2 (#119172)
The test is no longer applicable after we allow 1D slice from 1D mesh. https://github.com/pytorch/pytorch/pull/118895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119172
Approved by: https://github.com/awgu, https://github.com/atalman
2024-02-05 17:34:51 +00:00
d444a3b443 [MPS] fix float32 error on mps, in linalg.matrix_rank and linalg.pinv (#114771)
Fixes #114285

(However, this still hits a NotImplementedError:
```NotImplementedError: The operator 'aten::_linalg_svd.U' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.```)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114771
Approved by: https://github.com/lezcano
2024-02-05 15:36:55 +00:00
a72190fd51 make nanogpt work with both compiled autograd and _LazyGraphModule (#118981)
@xmfan and @fegin reported that _LazyGraphModule ( https://github.com/pytorch/pytorch/pull/117911 ) makes nanogpt training fail with compiled autograd.

We have a repro:  ``` python benchmarks/dynamo/torchbench.py --training --backend=inductor --disable-cudagraphs --accuracy --only nanogpt --repeat 1 --compiled-autograd ```
but it's still mysterious how to trigger the issue with a toy model.

The error message for the failure is https://gist.github.com/shunting314/6402a6388b3539956090b6bc098952fb . In compile_fx we will call `detect_fake_mode`. This function will look for an active FakeTensorMode from both TracingContext and example inputs. The error is triggered because we find different FakeTensorMode from these 2 sources.

Although I don't know what really causes the discrepancy of FakeTensorMode above, the fix here is to force _LazyGraphModule recompilation if compiled autograd is enabled. This does not hurt compilation time most of the time, because when compiled autograd is enabled we will call the graph module in the backward pass anyway: 855d5f144e/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L705)
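
A rough Python sketch of the workaround (names such as `real_recompile` are assumptions about the _LazyGraphModule interface, not a quote of the actual change):

```python
def prepare_graph_module(gm, compiled_autograd_enabled: bool):
    # Force the lazy graph module to materialize its compiled code eagerly, so that
    # compile_fx later sees one consistent FakeTensorMode instead of two conflicting ones.
    if compiled_autograd_enabled and hasattr(gm, "real_recompile"):
        gm.real_recompile()
    return gm
```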

Let me know if we can have a better fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118981
Approved by: https://github.com/jansel
2024-02-05 10:40:06 +00:00
d670dfb7ae Update scatter_reduce_ test with parallel backend check (#118708)
**Summary**
Follow up of https://github.com/pytorch/pytorch/pull/118278, in which the newly added UT `test_scatter_using_atomic_add` failed with the `native parallel backend` as reported in https://github.com/pytorch/pytorch/issues/118518.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118708
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-02-05 08:48:45 +00:00
0348975a87 Set up new logging artifact for SymNode (#119158)
Fixes #113876

Hi, I updated various logging configs and the SymNode module to use the new dedicated logging artifact. This is my first PyTorch PR; I mirrored my changes off of https://github.com/pytorch/pytorch/pull/111808.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119158
Approved by: https://github.com/ezyang
2024-02-05 07:34:54 +00:00
0245000be8 [DeviceMesh] Temporarily disable re-use subgroup (#118940)
Summary:
The re-use subgroup logic is causing GLOO to time out on two internal modelstore tests (relevant tests in the test plan).
We are temporarily disabling re-use subgroup while root-causing so that the internal tests can run again; they are currently omitted, as shown in T176426987.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118940
Approved by: https://github.com/wanchaol
2024-02-05 06:30:00 +00:00
0c3a1c893e [dynamo] Setup the globals for guard_fn without a reference to f_locals (#118447)
UPDATE - I changed the PR because from discussion with @jansel it was clear that someone else was holding on to a reference to f_locals. This PR now solves that problem first. I removed the eval_frame.c part because it was failing tests that use `exec` or `eval` with a weird error like `no no locals found when storing 'math'`. I will debug that in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118447
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #118975, #118420
2024-02-05 05:39:39 +00:00
b8307513e5 [torchelastic][rendezvous] Add option to enable libuv for TCPStore based rendezvous backend (#118944)
Summary:
Expose an option to enable libuv in the TCPStore-based rendezvous backend to allow better scaling.

Libuv support has been added recently and allows scaling for more than 2K nodes.

Test Plan: Unit tests

Differential Revision: D53335860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118944
Approved by: https://github.com/wconstab
2024-02-04 23:11:32 +00:00
5ebed6f1c3 [torch] fix comment typo (#118656)
Summary: as title

Differential Revision: D49841787

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118656
Approved by: https://github.com/Skylion007, https://github.com/zhxchen17
2024-02-04 22:20:09 +00:00
0d5f53a2f9 fix forward test_memory_planning.py (#119109)
Summary: fixes a broken test and also makes it run correctly in fbcode

Test Plan: test

Reviewed By: angelayi

Differential Revision: D53373709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119109
Approved by: https://github.com/angelayi
2024-02-04 21:45:07 +00:00
052e824467 improve CUDACachingAllocator lock contention (#118550)
Summary: NativeCachingAllocator has a global lock, which shows lock contention when one process uses multiple GPUs. The lock is required to look up a Block from a pointer. We can make the lock more fine-grained to reduce the contention.
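
A conceptual Python sketch of the locking change (the real allocator is C++; names and the sharding scheme here are illustrative):

```python
import threading

class CoarseGrainedAllocator:
    """One global lock: pointer lookups for different GPUs all contend on it."""
    def __init__(self):
        self.lock = threading.Lock()
        self.blocks = {}  # ptr -> block metadata

    def lookup(self, ptr):
        with self.lock:
            return self.blocks.get(ptr)

class FineGrainedAllocator:
    """One lock (and block table) per device: lookups on different GPUs no longer contend."""
    def __init__(self, num_devices):
        self.locks = [threading.Lock() for _ in range(num_devices)]
        self.blocks = [{} for _ in range(num_devices)]

    def lookup(self, device, ptr):
        with self.locks[device]:
            return self.blocks[device].get(ptr)
```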

Test Plan: existing unittests, verified on prod models using eight GPUs showing double-digit improvements

Differential Revision: D52493091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118550
Approved by: https://github.com/albanD
2024-02-04 16:45:25 +00:00
b41f3e8df1 [AOTI] Make abi_compatible as default for OSS CI (#119126)
Summary: Introduce an environment variable AOT_INDUCTOR_ABI_COMPATIBLE to control the ABI-compatible mode, and turn it on for OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119126
Approved by: https://github.com/chenyang78
ghstack dependencies: #119125
2024-02-04 15:48:58 +00:00
79b20aec76 [AOTI] Support copy_, _fft_c2c and view_as_real in C shim (#119125)
Summary: These ops exist in GoogleFnet. Also add a Complex fallback for convert_element_type. After this PR, we can enable ABI-compatible mode for the AOTInductor OSS CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119125
Approved by: https://github.com/chenyang78
2024-02-04 15:48:58 +00:00
cee16353db [Dynamo][autograd.Function] Should graph break on stride accesses in backward (#119137)
Fixes #118399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119137
Approved by: https://github.com/oulgen
2024-02-04 09:08:45 +00:00
8f82a44a5b Run device mesh tests with native funcol enabled (#118437)
### Summary

Run the relevant tests in `test/distributed/_tensor/test_dtensor_compile.py` and `test/distributed/test_device_mesh.py` with native funcol enabled, in addition to the existing runs with it disabled.

All tests except `test_tp_compile_comm_reordering` pass. This is expected because the native funcols have slightly different IRs, so the reordering pass needs to be adjusted. This test is disabled for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118437
Approved by: https://github.com/LucasLLC
ghstack dependencies: #118910, #118911
2024-02-04 04:11:11 +00:00
cyy
e3371ff739 Use correct type of indices in ForeachUtils.h (#119116)
Fix a type mismatch detected by MSVC:
```
C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.39.33519\include\xutility(255): warning C4267: 'initializing': conversion from 'size_t' to '_Ty', possible loss of data
        with
        [
            _Ty=int
        ]
C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.39.33519\include\xutility(255): note: the template instantiation context (the oldest one) is
pytorch/aten/src\ATen/native/ForeachUtils.h(363): note: see reference to function template instantiation '_Ty &std::vector<_Ty,std::allocator<_Ty>>::emplace_back<const I&>(const I &)' being compiled
        with
        [
            _Ty=int,
            I=size_t
        ]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119116
Approved by: https://github.com/Skylion007
2024-02-04 04:03:54 +00:00
6620176da7 Add documentation for meta device (#119119)
Fixes https://github.com/pytorch/pytorch/issues/119098

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119119
Approved by: https://github.com/bdhirsh
2024-02-04 01:05:22 +00:00
dab16b6b8e s/supress/suppress/ (#119132)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119132
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-02-04 00:54:14 +00:00
abc09b27b9 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete", but I particularly wanted feedback on whether people like making ValueRanges Generic; distinguishing whether you have an Expr ValueRange or a SympyBoolean ValueRange seems to be a lot of trouble for downstream code. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-02-04 00:19:00 +00:00
3ed9df36a9 Clean up some obsolete TODOs in run_test and several test files (#119113)
* The TODOs in `test/test_nestedtensor.py` have been mitigated; I keep the issue for reference.
* ~~The TODOs in `test/test_ops_fwd_gradients.py` doesn't apply anymore~~
* The TODOs in `run_test.py` to support disabling C++ tests are probably not going to happen.  I have never seen a flaky C++ test that needed to be disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119113
Approved by: https://github.com/kit1980
2024-02-03 23:54:30 +00:00
26a2743162 Fix placeholder tensor is empty for relu in mps (#118965)
Fixes #118845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118965
Approved by: https://github.com/malfet
2024-02-03 23:50:35 +00:00
0ddcb5c3ca Include the documentation on scale arg being a keyword only arg (#119129)
Fixes #117240
@drisspg
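
A small usage example of the documented behavior (`scale` is accepted only as a keyword argument):

```python
import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 4, 8, 16)  # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, scale=1.0 / 16 ** 0.5)  # scale passed by keyword
```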

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119129
Approved by: https://github.com/drisspg
2024-02-03 23:41:06 +00:00
ffae20e594 [BE][MPS] Add dictionaryFromPlaceholders (#119077)
These are convenience methods that create a dictionary from placeholders, making the code more compact.
Also added a `runMPSGraph` overload that takes a Placeholder instead of an output dictionary, as the majority of operators have just one output.
A typical change looks as follows
```patch
-    NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* feeds = @{
-      selfPlaceholder.getMPSGraphTensor() : selfPlaceholder.getMPSGraphTensorData(),
-    };
-    NSDictionary<MPSGraphTensor*, MPSGraphTensorData*>* results =
-        @{outputPlaceholder.getMPSGraphTensor() : outputPlaceholder.getMPSGraphTensorData()};
-    runMPSGraph(stream, cachedGraph->graph(), feeds, results);
+    auto feeds = dictionaryFromPlaceholders(selfPlaceholder);
+    runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
   }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119077
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-02-03 22:07:02 +00:00
2d64fddd48 [dtensor] add op support for nll_loss_forward (#118917)
This is part of the work to support cross entropy in dtensor.

This PR doesn't support nll_loss computation with input sharded on the channel dimension yet. In that case, redistribution to Replicate is needed in sharding propagation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118917
Approved by: https://github.com/wanchaol
2024-02-03 20:08:10 +00:00
4c397e6ec6 [Dynamo] Add correct guards for tracable tensor subclasses (#119110)
Fixes #118896
```
(pt) [ybliang@devgpu002.ash8 ~/local/pytorch (subclass)]$ TORCH_LOGS="+guards" python test/dynamo/test_subclasses.py -k test_torch_dispatch_subclass_guard_recompile
/home/ybliang/local/miniconda3/envs/pt/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
[2024-02-02 16:43:02,186] [0/0] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-02 16:43:02,186] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_type_id(L['w'], 110557008)                           # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'].a, '_dynamo_dynamic_indices') == False         # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'].b, '_dynamo_dynamic_indices') == False         # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:388 in init_ambient_guards
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] ___check_current_backend(139704947520224)                     # _dynamo/output_graph.py:394 in init_ambient_guards
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'].a, Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,187] [0/0] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'].b, Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,206] [0/1] torch._dynamo.guards.__guards: [DEBUG] GUARDS:
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] hasattr(L['w'], '_dynamo_dynamic_indices') == False           # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:388 in init_ambient_guards
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] ___check_current_backend(139704947520224)                     # _dynamo/output_graph.py:394 in init_ambient_guards
[2024-02-02 16:43:02,207] [0/1] torch._dynamo.guards.__guards: [DEBUG] check_tensor(L['w'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[2, 2], stride=[2, 1])  # return torch.add(w, 1.0)  # ata/users/ybliang/pytorch/test/dynamo/test_subclasses.py:923 in fn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119110
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh, https://github.com/yoyoyocmu
2024-02-03 18:12:51 +00:00
7a52455102 [dynamo] Refactor TensorVariable method handling (#119111)
This should slightly improve compile times and be easier to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119111
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-02-03 17:18:19 +00:00
fcf22a853d Enable test_ellipsis_index_2 with Torch dynamo (#118773)
Fix issue #118819

test_ellipsis_index_2 specifically tests properties of torch._numpy.array()
and that a field tensor is being added, hence the overridden imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118773
Approved by: https://github.com/anijain2305, https://github.com/lezcano
2024-02-03 10:33:48 +00:00
1adedc3c86 [decomp] Remove pixel_shuffle from core aten decomps (#118921)
pixel_shuffle is a core aten op
(https://pytorch.org/docs/main/torch.compiler_ir.html#core-aten-ir) so we should not decompose it.

https://github.com/pytorch/pytorch/pull/118239 added a decomp for it which is causing an internal test failure
(https://www.internalfb.com/intern/test/281475090561210/) which cases on the pixel_shuffle operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118921
Approved by: https://github.com/SherlockNoMad, https://github.com/lezcano
2024-02-03 08:21:32 +00:00
4dc53f777b Fix dynamo failure w/ astype (#117952)
The torch "fake" ndarray had some mismatches vs numpy.ndarray which caused test_sparse_to_sparse_compressed to fail under dynamo.

This also fixes (because the test now hits it) a problem where unpacking a sequence with the incorrect number of args would assert in dynamo instead of graph breaking (because it would throw an exception). Added a unit test for this condition.

Fixed:
- torch._numpy._ndarray.astype() (actually used by the test)
- torch._numpy._ndarray.put() (drive-by discovery)
- torch._numpy._ndarray.view() (drive-by discovery)

(burndown item 7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117952
Approved by: https://github.com/yanboliang
ghstack dependencies: #117951
2024-02-03 08:10:15 +00:00
c6c851102f Fix test_compressed_layout_conversions_coverage to check BSC format (#117951)
test_compressed_layout_conversions_coverage verifies torch's conversions between different memory layouts using numpy as a reference. Since numpy doesn't support the BSC format, the test just skipped it. Instead, fake it by using a transposed BSR format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117951
Approved by: https://github.com/zou3519
2024-02-03 08:10:15 +00:00
6c8faf4680 [executorch] Run llama in xplat (#118831)
Summary:
Error running llama in xplat, where the half type isn't part of the c10_mobile targets. See:  D53158320

This diff:
- Creates a `torch_mobile_all_ops_et` target, which is the same as `torch_mobile_all_ops`, except with a preprocessor flag (C10_MOBILE_HALF) to support Half type
- Check C10_MOBILE_HALF in LinearAlgebra.cpp and include it
- Use `torch_mobile_all_ops_et` for executorch, instead of `torch_mobile_all_ops`.

Considerations:
- Using `torch_mobile_all_ops_et` across executorch means that our runtime binary size for xplat aten increases (see test plan for increase amount, thanks tarun292 for the pointer). This may be okay, as aten mode isn't used in production.

Test Plan:
Run language llama in xplat:
```
buck2 run xplat/executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```
And in fbcode:
```
buck2 run fbcode//executorch/examples/models/llama2:main_aten -- --model_path llama-models/very_new_checkpoint_h.pte --tokenizer_path llama-models/flores200sacrebleuspm.bin --prompt 'fr Hello' --eos
```

Test executor_runner size increase with:
```
buck2 build fbcode//executorch/sdk/fb/runners:executor_runner_aten
```
| |original|this diff (+half dtype)|diff|
|---|---|---|---|
|unstripped|214975784|214976472|+688|
|stripped|71373488|71373808|+320|

Differential Revision: D53292674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118831
Approved by: https://github.com/larryliu0820
2024-02-03 08:07:19 +00:00
a64b03a58e Move lr tensor to cuda if needed (#119073)
Fixes https://github.com/pytorch/pytorch/issues/119026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119073
Approved by: https://github.com/eellison
2024-02-03 07:34:33 +00:00
41b63b26c2 [dynamo] Fix incorrect docstring placements in _guards.py. (#119114)
Incorrectly placed docstrings are unavailable when using help() and other tools that access them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119114
Approved by: https://github.com/kit1980
2024-02-03 06:25:54 +00:00
9a8e3b07d7 Remove extra graph breaks (#118987)
Fixes https://github.com/pytorch/pytorch/issues/104053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118987
Approved by: https://github.com/janeyx99
2024-02-03 05:55:09 +00:00
ce40ee8ecd [FSDP] Fixed device_mesh and auto wrap (#119064)
If the user passes `device_mesh`, then we should not forward the process groups to the children during auto wrap and instead just rely on the `device_mesh` argument. This should fix https://github.com/pytorch/pytorch/issues/118906.
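
A hedged usage sketch of the intended pattern (assumes torch.distributed is already initialized and CUDA devices are available):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

mesh = init_device_mesh("cuda", (dist.get_world_size(),))
model = FSDP(
    torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)),
    device_mesh=mesh,                                      # auto-wrapped children rely on the mesh,
    auto_wrap_policy=ModuleWrapPolicy({torch.nn.Linear}),  # not on forwarded process groups
)
```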

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119064
Approved by: https://github.com/wz337
2024-02-03 03:57:29 +00:00
18fc1ca7d9 [MPS][BE] Add native lerp support (#119036)
By implementing `out = self + weight * (end-self)` as an MPS graph
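
For reference, a small CPU example of the identity used here (not MPS-specific):

```python
import torch

start = torch.tensor([1.0, 2.0, 3.0])
end = torch.tensor([5.0, 6.0, 7.0])
weight = 0.25
# torch.lerp computes start + weight * (end - start)
assert torch.allclose(torch.lerp(start, end, weight), start + weight * (end - start))
```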

LERP is tested by `test_output_match_lerp_cpu_float[32|16]` based on OpInfo and 10+ tests from `test_optim.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119036
Approved by: https://github.com/albanD
2024-02-03 02:58:50 +00:00
30d3ff1659 Inline gradcheck functions since they don't have C bindings (#119047)
Gradcheck functions are in Python, so they shouldn't be in `torch_c_binding_in_graph_functions`.
Fixes #118792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119047
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-02-03 02:46:39 +00:00
372e9550bd ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

### This PR
We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and a sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch after this list). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output.
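
A minimal Python sketch of the fallback idea (illustrative, not the ProcessGroupGloo C++ code; assumes the process group is already initialized):

```python
import torch
import torch.distributed as dist

def reduce_scatter_tensor_fallback(output: torch.Tensor, input: torch.Tensor) -> None:
    # All-reduce the full input on every rank, then each rank keeps only its own chunk.
    dist.all_reduce(input, op=dist.ReduceOp.SUM)
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    output.copy_(input.chunk(world_size)[rank])
```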

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
2024-02-03 02:42:47 +00:00
65314a6129 [c10d] add an unit test for unordered destruction of PGs (#119045)
Summary:
We were suspecting that ncclCommsAbort was hung due to a NCCL 2.17 'bug'
triggered by different ranks calling the destructors of different PGs in
different orders. This can be reproed in a NCCL-level test for 2.17.
We need a test case in c10d to constantly check that PGs can be destructed
in different orders.
Test Plan:
Run the test and verify that the printed destruction orders are as expected
```
$  python test/distributed/test_c10d_nccl.py
ProcessGroupNCCLTest.test_close_multi_pg_unordered
NCCL version 2.19.3+cuda12.0
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 0] ProcessGroupNCCL
abort finished.
[rank0]:[W ProcessGroupNCCL.cpp:1128] [PG 1 Rank 0] ProcessGroupNCCL
destructor entered.
[rank0]:[W ProcessGroupNCCL.cpp:1147] [PG 1 Rank 0] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 1] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1128] [PG 2 Rank 1] ProcessGroupNCCL
destructor entered.
[rank1]:[W ProcessGroupNCCL.cpp:1147] [PG 2 Rank 1] ProcessGroupNCCL
aborting communicators, check for 'abort finished' logs or look for
abort hang
[rank0]:[W ProcessGroupNCCL.cpp:1151] [PG 1 Rank 0] ProcessGroupNCCL
abort finished.
[rank1]:[W ProcessGroupNCCL.cpp:1151] [PG 2 Rank 1] ProcessGroupNCCL
abort finished.
.
----------------------------------------------------------------------
Ran 1 test in 18.969s
OK

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119045
Approved by: https://github.com/yifuwang
2024-02-03 02:37:12 +00:00
857508fa36 Change the internal assert to torch_check in torch::nn::functional::InterpolateFuncOptions (#117831)
Fixes #117333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117831
Approved by: https://github.com/malfet
2024-02-03 02:15:11 +00:00
9ffed22391 Document file format returned by torch.save (#118719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118719
Approved by: https://github.com/albanD
2024-02-03 02:11:44 +00:00
2eba82d122 [dynamo] decrease logging level for graph break in higher order op. (#119079)
Fixes https://github.com/pytorch/pytorch/issues/119059.

This hides both logs behind TORCH_LOGS=dynamo. Just logging the exception did not seem very informative, so I put both under log.info(). For the example in the issue, the log now looks like:
```
(pytorch-3.10) ~/local/pytorch$ python test.py
(pytorch-3.10) ~/local/pytorch$
```
```
(pytorch-3.10) ~/local/pytorch$ python test.py
(pytorch-3.10) ~/local/pytorch$ TORCH_LOGS=dynamo python test.py
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing linear /home/yidi/local/pytorch/test.py:267
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,000] [0/0] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,001] [0/0] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,016] [0/0] torch._dynamo.variables.higher_order_ops: [INFO] speculate_subgraph: while introspecting autograd.Function, we were unable to trace function `backward` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-02 14:08:19,016] [0/0] torch._dynamo.variables.higher_order_ops: [INFO] call_method GetAttrVariable(AutogradFunctionContextVariable(Function), needs_input_grad) __getitem__ (ConstantVariable(int),) {}
[2024-02-02 14:08:19,017] [0/0] torch._dynamo.convert_frame: [INFO] Restarting analysis due to _dynamo/symbolic_convert.py:141 in fail_and_restart_analysis
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing linear /home/yidi/local/pytorch/test.py:267
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,017] [0/0_1] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,017] [0/0_1] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,021] [0/0_1] torch.fx.experimental.symbolic_shapes: [INFO] produce_guards
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing forward /home/yidi/local/pytorch/test.py:257
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 272, in <module>
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     y = linear(x)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/test.py", line 268, in linear
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return UseNeedsInputGradFunction.apply(x)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/autograd/function.py", line 572, in apply
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return super().apply(*args, **kwargs)  # type: ignore[misc]
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 748, in _convert_frame
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return func(*args, **kwds)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 478, in transform
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2032, in __init__
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]   File "/home/yidi/local/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-02-02 14:08:19,025] [1/0] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
[2024-02-02 14:08:19,025] [1/0] torch.fx.experimental.symbolic_shapes: [INFO] create_env
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] Function, Runtimes (s)
[2024-02-02 14:08:19,097] torch._dynamo.utils: [INFO] _compile.<locals>.compile_inner, 0.0283
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119079
Approved by: https://github.com/zou3519
2024-02-03 02:10:13 +00:00
d91d21fd6f [submodule kineto] Enable profiler connection to daemon during init for cpu only jobs (#118320)
Fixes #112389 and https://github.com/facebookincubator/dynolog/issues/208

This PR enables profiler initialization for CPU-only use cases. The main goal is to enable on-demand profiling with a daemon when using the CPU-only mode of PyTorch.
* When CUDA is available the profiler is initialized on first CUDA stream creation (or lazily when profiler is run).
* Since the CUDA stream creation callback does not exist in CPU-only PyTorch, the profiler is never initialized on its own.
* Thus the job does not register with Dynolog even when the "KINETO_USE_DAEMON" env variable is set.

Part of the fix is in Kineto https://github.com/pytorch/kineto/pull/861, we point to it in PyTorch.
The change in PyTorch is to correctly set the `cpuOnly` argument.

## TestPlan:

Build PyTorch from source with USE_CUDA=0 so we have a CPU-only build.  Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79`
(See instructions in PyTorch repo)

For the setup we run dynolog daemon in another terminal
```
buck2 run dynolog/src:dynolog  -- --enable_ipc_monitor &
```

Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) , and set the device to 'cpu' inside the code instead of 'cuda'.
```
export KINETO_USE_DAEMON=1
python linear_model_example.py
```
Output shows the profiler registration with dynolog
```
(pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py
INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly =  1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
```

We can also collect a trace using
```
[bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json
Kineto config =
ACTIVITIES_LOG_FILE=/tmp/test.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=500
PROFILE_REPORT_INPUT_SHAPES=false
PROFILE_PROFILE_MEMORY=false
PROFILE_WITH_STACK=false
PROFILE_WITH_FLOPS=false
PROFILE_WITH_MODULES=false
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]}
Matched 1 processes
Trace output files will be written to:
    /tmp/test_1807792.json
```
And trace file contains the trace correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320
Approved by: https://github.com/aaronenyeshi
2024-02-03 01:40:56 +00:00
494c2ec054 [DCP][BE] Let FsspecWriter and FsspecReader inherit from FileSystemWriter and FileSystemReader (#118887)
No logic is changed. However, this PR dramatically reduces the effort to maintain filesystem-like storage backends. As we are going to enable fsspec, this is a must-do BE item.

Differential Revision: [D53318044](https://our.internmc.facebook.com/intern/diff/D53318044/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118887
Approved by: https://github.com/wz337
2024-02-03 01:14:13 +00:00
6b009aceea Enable scaled_mm on sm89 devices (#118881)
Fixes #118703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118881
Approved by: https://github.com/malfet
2024-02-03 00:44:03 +00:00
440b7d5279 [auto_functionalize] Remove mutated_args_name from args (#119050)
`auto_functionalize` currently takes a custom op, a list of mutated argument names, and inputs to the custom op as kwargs. The list of mutated argument names is computed from the schema, and gets created when we're tracing. However, it seems that having the list of mutated argument names is a little unnecessary since we can always recompute it from the schema during runtime.

This also prevents the case where users might incorrectly modify the inputs to this operator, as we will now just recompute it at runtime. This probably won't matter much in practice because inductor will decompose auto_functionalize.
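
An illustrative sketch of recomputing the mutated-argument names from the op schema at runtime (not the actual auto_functionalize implementation):

```python
import torch

def mutated_arg_names(op) -> list:
    # An argument is "mutated" if its schema alias info marks a write.
    return [
        arg.name
        for arg in op._schema.arguments
        if arg.alias_info is not None and arg.alias_info.is_write
    ]

print(mutated_arg_names(torch.ops.aten.index_put_.default))  # e.g. ['self']
```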

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119050
Approved by: https://github.com/zou3519
2024-02-03 00:27:14 +00:00
3aeaa21eb0 Revert "Remove parent device mesh check (#118620)"
This reverts commit 3f1f057adfcd4cef67fff9605a894cb075c02881.

Reverted https://github.com/pytorch/pytorch/pull/118620 on behalf of https://github.com/atalman due to broke periodic linux-focal-cuda11.8-py3.9-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/118620#issuecomment-1924933878))
2024-02-03 00:22:56 +00:00
de6a906093 Expose aggressive_recomputation as an inductor config (#118943)
Summary:
As title.

We found that aggressive_recomputation shows memory savings (7% on the APS COFFEE model) with a 2% QPS loss.

It also gives a very promising signal in our auto-AC experiments: https://docs.google.com/document/d/1S2qgMg1CwAQ4U1Ffuk2epbEOx06ogZhioX2jKCwL7ZQ/edit

 {F1426175073}

Test Plan:
APS COFFEE from silverlakeli
- Zoom of baseline job: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=927380488801910&tab=overview
- Zoom of job with aggressive_recomputation: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=1126815608217470&tab=overview

APS 1100x shrunk version:
- baseline: https://www.internalfb.com/mast/job/aps-yuzhenhuang-afe049505a
- test: https://www.internalfb.com/mast/job/aps-yuzhenhuang-709e41bf0d
Memory from 42.98% -> 41.04%.

Reviewed By: yf225, yuxihu, silverlakeli, richqyz

Differential Revision: D53248057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118943
Approved by: https://github.com/anijain2305, https://github.com/yanboliang
2024-02-03 00:17:03 +00:00
7bbd9befed Improve example for `torch.mode()` (#115308)
Fixes #89820 and improves the documentation.
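
For context, a small example of what `torch.mode` returns:

```python
import torch

a = torch.tensor([[0, 1, 1, 2],
                  [3, 3, 3, 4]])
values, indices = torch.mode(a, dim=1)
print(values)   # tensor([1, 3]) -- the most frequent value in each row
print(indices)  # the index of an occurrence of that value in each row
```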

Co-authored-by: Sam Gross <colesbury@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115308
Approved by: https://github.com/colesbury
2024-02-03 00:13:26 +00:00
c24ffc3f66 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together at first.

Thanks Jason for suggesting a neat way to integrate the two. cpp-wrapper does two-pass codegen right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated for the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-03 00:06:21 +00:00
576383c2eb Add torch check for dtype within bilinear (#118900)
Fixes https://github.com/pytorch/pytorch/issues/117237
Short-term fix: when the dtype does not match, it is reported by the torch check.
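
A hedged repro sketch of the kind of mismatch the new check reports (exact error wording not quoted here):

```python
import torch

m = torch.nn.Bilinear(3, 4, 5)               # float32 weights by default
x1 = torch.randn(2, 3, dtype=torch.float64)  # input dtype does not match the weights
x2 = torch.randn(2, 4)
out = m(x1, x2)  # now fails with a readable error from the torch check instead of a cryptic one
```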

@ezyang a cpp test case is added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118900
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-02-03 00:02:00 +00:00
a4355d6b9a Revert "Add --filter-rank to torchrun: allow logs filtering by rank (#118562)"
This reverts commit 73229b4f931f8cd1799b0905d61e3d8e85157bcd.

Reverted https://github.com/pytorch/pytorch/pull/118562 on behalf of https://github.com/xmfan due to breaks MAST precheck, flag naming conflict ([comment](https://github.com/pytorch/pytorch/pull/118562#issuecomment-1924916601))
2024-02-02 23:56:21 +00:00
63fd6883fd [c10d] logging utility for cpp-python stacktrace (#118924)
A user may not know which line of code called collectives in a big code base. When debugging, we can print a Python/C++ stacktrace in case the user calls ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce``

```
LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: "
                       << get_python_cpp_trace();
```

Output (using _allgather_base as an example): one example Python-side frame is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838``
```
ProcessGroupNCCL::_allgather_base stacktrace: #0 torch::unwind::unwind() from ??:0
#1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0
#2 c10d::get_python_cpp_trace[abi:cxx11]() from :0
#3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0
#4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0
#5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#6 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0
#8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#9 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543
#11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838
#15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75
#18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399
#21 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308
#24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332
#27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448
#30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413
#33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839
#36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520
#39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945
#41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511
#42 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431
#44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494
#45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215
#46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112
#47 inner from /data/users/weif/pytorch/run_fsdp.py:72
#48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#50 run from /data/users/weif/pytorch/run_fsdp.py:76
#51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#53 main from /data/users/weif/pytorch/run_fsdp.py:133
#54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114
#56 <module> from /data/users/weif/pytorch/run_fsdp.py:137
#57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46
#58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134
#59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291
#60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312
#61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208
#62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456
#63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90
#64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357
#65 Py_BytesMain from /usr/local/src/conda/python-3.10.12/Modules/main.c:1090
#66 __libc_start_call_main from ??:0
#67 <unwind unsupported> from ??:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118924
Approved by: https://github.com/kwen2501
2024-02-02 23:49:18 +00:00
a3cec6a7fa [ONNX] Eliminate redundant TODOs (#119060)
Remove TODOs created under the titaiwangms/AllenTiTaiWang/titaiwang handles:

1. Resolved TODOs
2. Turned TODOs into NOTEs when they are not actionable
3. Merged duplicated TODOs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119060
Approved by: https://github.com/kit1980, https://github.com/thiagocrepaldi
2024-02-02 23:37:52 +00:00
454e6b380c [export] Prevent specialization on backends (#118683)
Summary: https://github.com/pytorch/pytorch/issues/118289 shows that sometimes we will decompose into backend-specific operators, causing some specializations. We should probably avoid this by disabling these by default?

Test Plan: CI

Differential Revision: D53241300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118683
Approved by: https://github.com/zhxchen17
2024-02-02 23:33:59 +00:00
db2225da37 [export] fix forward test_lift_unlift (#119090)
Test Plan: fixes test

Reviewed By: zhxchen17

Differential Revision: D53367522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119090
Approved by: https://github.com/kit1980
2024-02-02 23:07:36 +00:00
9fe3693bbb [dynamo] bypass graph break due to masking if inference mode (#119056)
Relax the constraints in https://github.com/pytorch/pytorch/issues/114123 when we're in inference mode.

Test Plan:
See added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119056
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-02-02 22:53:23 +00:00
4d45c68ca6 [fx] fix for subgraph rewriter (#119052)
the semantics of `try_get_attr` are to default to None if the attribute doesn't exist; but we were throwing an exception in `get_submodule`. Catch that exception and return None.
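A minimal sketch of the described fix, with assumed function and argument names (the actual change lives in the FX subgraph rewriter utilities):

```python
import torch.fx

def try_get_attr(gm: torch.fx.GraphModule, target: str):
    try:
        return gm.get_submodule(target)
    except AttributeError:
        # get_submodule raises AttributeError when the target doesn't exist;
        # try_get_attr's contract is to return None instead of propagating it.
        return None
```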

Differential Revision: [D53358747](https://our.internmc.facebook.com/intern/diff/D53358747/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119052
Approved by: https://github.com/angelayi
2024-02-02 22:47:53 +00:00
c908caf92b [DeviceMesh] Allow 1d slice from 1d mesh (#118895)
Fixes [#118851](https://github.com/pytorch/pytorch/issues/118851)

For example, with `mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))`, slicing via `dp_mesh = mesh["dp"]` should still work; it simply returns the 1-D mesh itself without recording a parent mesh.
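Illustrative usage based on the description above; it assumes 8 ranks, an initialized process group, and the `torch.distributed.device_mesh` import path:

```python
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))
dp_mesh = mesh["dp"]  # 1-D slice of a 1-D mesh: returns the mesh itself,
                      # without recording a parent mesh
```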

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118895
Approved by: https://github.com/wanchaol
2024-02-02 22:00:24 +00:00
6379010ebd [dynamo][higher order ops] Remove restore side effects logic (#118420)
The problem was exposed in https://github.com/pytorch/pytorch/pull/118071, where the control flow tests were always recompiling. The issue turned out to be that the same nonlocal variable used in `true_fn` and `false_fn` was getting lifted twice, thus creating two inputs in the main FX graph. Dynamo's tensor guards do not like this because they require all input tensors to be non-aliased.
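A hypothetical repro of this pattern (not the exact test from #118071; the `functorch.experimental.control_flow.cond` import path and all names are illustrative):

```python
import torch
from functorch.experimental.control_flow import cond

def make_fn():
    y = torch.randn(3)  # nonlocal tensor captured by both branches

    def true_fn(x):
        return x + y

    def false_fn(x):
        return x - y

    @torch.compile(backend="eager", fullgraph=True)
    def f(pred, x):
        # Before this PR, the shared closure variable `y` was lifted twice,
        # producing two aliased inputs in the main FX graph.
        return cond(pred, true_fn, false_fn, [x])

    return f

f = make_fn()
out = f(torch.tensor(True), torch.randn(3))
```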

We already have logic, via the side-effects infra, to check whether two different sources (the closure of true_fn and the closure of false_fn) point to the same tensor. But we were restoring side_effects after subtracing the true and false branches, and this is not needed anymore: side_effects tracks both read-only accesses and actual writes to variables. For higher-order ops, any mutation that is not read-only leads to a graph break and safely exits tracing; for read-only side effects, it doesn't matter.

This PR removes the restoring of side_effects, which turns on the logic for checking if two different sources point to the same tensor, and thus lifts the common non local tensor to just once in the main graph.

Related discussion at https://github.com/pytorch/pytorch/issues/113235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118420
Approved by: https://github.com/ydwu4, https://github.com/mlazos, https://github.com/zou3519
ghstack dependencies: #118975
2024-02-02 21:54:22 +00:00
113138aa55 add test cases for GradScaler on CPU (#109994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-02-02 21:49:07 +00:00
426339e4de Add FakeTensor support to torch._utils._rebuild_tensor (#108186)
Partially fixes https://github.com/pytorch/pytorch/issues/105077

Repro:

```python
import tempfile
import torch
from torch._subclasses import fake_tensor

class TheModelClass(torch.nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.fc1 = torch.nn.Linear(5, 10)

    def forward(self, x):
        return self.fc1(x)

with tempfile.NamedTemporaryFile() as state_dict_file:
    # Create state_dict to be loaded later
    model = TheModelClass()
    torch.save(model.state_dict(), state_dict_file.name)

    fake_mode = fake_tensor.FakeTensorMode()
    with fake_mode:
        # This is where the bug is triggered
        state_dict = torch.load(state_dict_file.name)
```

Error:

```bash
Traceback (most recent call last):
  File "issue_gh_torch_105077.py", line 22, in <module>
    state_dict = torch.load(state_dict_file.name)
  File "/opt/pytorch/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/opt/pytorch/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/opt/pytorch/torch/_utils.py", line 205, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/_utils.py", line 184, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
  File "/opt/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1288, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1468, in dispatch
    self.invalidate_written_to_constants(func, flat_arg_fake_tensors, args, kwargs)
  File "/opt/pytorch/torch/_subclasses/fake_tensor.py", line 1733, in invalidate_written_to_constants
    _, new_kwargs = normalize_function(
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 297, in normalize_function
    torch_op_schemas = get_signature_for_torch_op(target)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in get_signature_for_torch_op
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 167, in <listcomp>
    signatures = [_torchscript_schema_to_signature(schema) for schema in schemas]
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 70, in _torchscript_schema_to_signature
    arg_type = _torchscript_type_to_python_type(arg.type)
  File "/opt/pytorch/torch/fx/operator_schemas.py", line 64, in _torchscript_type_to_python_type
    return eval(ts_type.annotation_str, _type_eval_globals)
  File "<string>", line 1, in <module>
NameError: name 'Storage' is not defined
```

This PR adds the ability to create fake tensors during `torch.load` by wrapping the `torch.Tensor.set_` call in a `torch.utils._mode_utils.no_dispatch()` context, which skips the fake-mode dispatcher and thus creates a real tensor. It then calls `fake_mode.from_tensor(t)` to produce the final fake tensor.
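A minimal sketch of that approach, as a hypothetical helper rather than the exact `torch/_utils.py` diff; `fake_mode` is assumed to be the active `FakeTensorMode`:

```python
import torch
from torch.utils._mode_utils import no_dispatch

def rebuild_tensor_under_fake_mode(storage, storage_offset, size, stride, fake_mode):
    with no_dispatch():
        # Bypass the fake-mode dispatcher so set_() builds a real tensor.
        t = torch.tensor([], dtype=storage.dtype, device=storage._untyped_storage.device)
        t.set_(storage._untyped_storage, storage_offset, size, stride)
    # Convert the real tensor into a fake one for the active FakeTensorMode.
    return fake_mode.from_tensor(t)
```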

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108186
Approved by: https://github.com/ezyang
2024-02-02 20:35:38 +00:00
3b41793412 Purge redundant module init tests (#119028)
Fixes #118784

This test file is old and redundant; coverage is maintained in `test_modules.py` via the `test_factory_kwargs` set of tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119028
Approved by: https://github.com/zou3519
2024-02-02 20:17:00 +00:00
a69016a741 Add lowering to special.bessel_j1 (#118992)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118992
Approved by: https://github.com/peterbell10
2024-02-02 20:16:08 +00:00
c7ba5f6c6f [AOTI] Fix a cpp kernel missing arg type issue (#119021)
Summary: The current way of fetching the kernel arg types only works for tensors, not symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119021
Approved by: https://github.com/aakhundov, https://github.com/hl475, https://github.com/khabinov
2024-02-02 20:11:58 +00:00
debc3b3254 Download reports only if they're necessary (#119027)
Previously we were downloading all of (eager311, dynamo38, dynamo311).
Now we just download what's necessary. This is useful for
update_failures.py because the dynamo tests finish much faster than the
eager tests and it only needs the result from the dynamo tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119027
Approved by: https://github.com/jamesjwu
ghstack dependencies: #118874, #118882, #118931
2024-02-02 20:11:01 +00:00
a68cf3ef7d update_failures.py: add option to also remove "skipped" tests (#118931)
Previously, you could run update_failures.py (with a commit hash) and it
would add new expected failures and skips for newly failing tests and
remove expected failures for newly passing tests.

This PR teaches update_failures.py to also remove skips for tests that
are now passing without them.

The way we do this is:
- dynamo_test_failures.py doesn't actually skip tests -- it runs the
  test and then suppresses the signal.
- if the test actually passed, then the test gets skipped with a special
  skip message
- we teach update_failures.py to look for the presence of that skip
  message.

Test Plan:
- Used this to generate https://github.com/pytorch/pytorch/pull/118928
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118931
Approved by: https://github.com/yanboliang
ghstack dependencies: #118874, #118882
2024-02-02 20:11:01 +00:00
1de50f8654 [HigherOrderOp] fix stack trace to report user stack (#118826)
Fixes https://github.com/pytorch/pytorch/issues/111020

For the following code:
```python
import torch
import torch._higher_order_ops.wrap

glob = []

def f(x):
    glob.append(x)
    return x.clone()

@torch.compile(backend='eager', fullgraph=True)
def g(x):
    return torch.ops.higher_order.wrap(f, x)

x = torch.randn(3)
g(x)
```

The stacktrace now becomes:
```
[2024-02-01 15:23:34,691] [0/0] torch._dynamo.variables.higher_order_ops: [WARNING] speculate_subgraph: while introspecting wrap, we were unable to trace function `f` into a single graph. This means that Dynamo was unable to prove safety for this API and will fall back to eager-mode PyTorch, which could lead to a slowdown.
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] Traceback (most recent call last):
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     output = f.call_function(tx, args, sub_kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return super().call_function(tx, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return tx.inline_user_function_return(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return cls.inline_call_(parent, func, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     tracer.run()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     and self.step()
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     getattr(self, inst.opname)(inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return inner_fn(self, inst)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.call_function(fn, args, {})
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.push(fn.call_function(self, args, kwargs))
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return self.obj.call_method(tx, self.name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     return super().call_method(tx, name, args, kwargs)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     tx.output.side_effects.mutation(self)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     self.check_allowed_side_effect(var)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     unimplemented(
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]   File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR]     raise Unsupported(msg)
[2024-02-01 15:23:34,692] [0/0] torch._dynamo.variables.higher_order_ops: [ERROR] torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test.py", line 219, in <module>
    g(x)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 615, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 650, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 155, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 496, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2125, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1227, in call_function
    p_args, p_kwargs, example_value, body_r, treespec, _ = self.create_wrapped_node(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 1190, in create_wrapped_node
    ) = speculate_subgraph(
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 453, in speculate_subgraph
    raise ex
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/higher_order_ops.py", line 381, in speculate_subgraph
    output = f.call_function(tx, args, sub_kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 278, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 86, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 657, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2261, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2370, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1196, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/misc.py", line 583, in call_function
    return self.obj.call_method(tx, self.name, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 330, in call_method
    return super().call_method(tx, name, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/lists.py", line 241, in call_method
    tx.output.side_effects.mutation(self)
  File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 325, in mutation
    self.check_allowed_side_effect(var)
  File "/home/yidi/local/pytorch/torch/_dynamo/side_effects.py", line 157, in check_allowed_side_effect
    unimplemented(
  File "/home/yidi/local/pytorch/torch/_dynamo/exc.py", line 190, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: HigherOrderOperator: Mutating a variable not in the current scope (SideEffects)

from user code:
   File "/home/yidi/local/pytorch/test.py", line 216, in g
    return torch.ops.higher_order.wrap(f, x)
  File "/home/yidi/local/pytorch/test.py", line 211, in f
    glob.append(x)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118826
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-02-02 20:08:01 +00:00
3c0c387429 Support symbolic min/max on unbacked SymInt (#118953)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118953
Approved by: https://github.com/ColinPeppler, https://github.com/aakhundov
2024-02-02 20:01:46 +00:00
f641c55c9b Make torch._dynamo.mark_static work inside graph (#118962)
I livecoded the entire PR authoring process, you can watch it at https://youtu.be/06HuwNR9-uI

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118962
Approved by: https://github.com/yanboliang
2024-02-02 20:01:27 +00:00
29f99a3365 Update XLA commit pin (#118871)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118871
Approved by: https://github.com/albanD
2024-02-02 19:55:04 +00:00
bd8c91efc0 Remove some now-succeeding tests from dynamo_test_failures.py (#118928)
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118928
Approved by: https://github.com/aorenste, https://github.com/anijain2305, https://github.com/yanboliang
2024-02-02 19:49:26 +00:00
bf4e171539 [export] support non-persistent buffers (#118969)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1817

Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.

One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.

This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.

As a side effect, this diff tightened up quite a few sloppy behaviors around state dict handling:
- Tensor attributes were getting promoted to be buffers—bad!
- Tracing through a module not in the children of the root module would add its parameters/buffers to the state dict—bad!

This behavior is unlikely to show up in user code since the model would be totally broken, but did show up in a bunch of tests.

#buildmore

Test Plan:
unit tests
sandcastle

Differential Revision: D53340041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118969
Approved by: https://github.com/guangy10, https://github.com/huydhn, https://github.com/titaiwangms
2024-02-02 19:16:08 +00:00
b5ba80828f [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks.
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many, many skips for that. Honestly, this PR has only further cemented my preference that we should implement the single-tensor and multi-tensor capturable variants together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the generated configs shows that capturable is now included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run shows that it is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 19:13:00 +00:00
8b00e5aa12 [FSDP2] Added pre/post-backward (#118004)
This PR adds the pre- and post-backward logic:
- **Pre-backward hook:** `FSDPState` and `FSDPParamGroup` define this, and `FSDPState` is responsible for registering since its pre-backward should run even if the `FSDPState` does not manage any parameters (in case it is the root).
- **Post-backward hook:** Only `FSDParamGroup` defines this since the post-backward hook reshards parameters and reduce-scatters gradients (functionality only needed with managed parameters). The `FSDPParamGroup` is responsible for registering this.
- **Post-backward final callback:** `FSDPState` defines this, and each `FSDPParamGroup` defines a `finalize_backward()` to call in the final callback.

### Pre-Backward

The pre-backward hook is registered on the module outputs (those that require gradient), and it should run when the first such output has its gradient computed. The hook may run multiple times per backward, once per module forward. Specifically, there will be one `(pre-backward, post-backward)` interval for each of the module's `forward()` calls. This is in contrast with the existing FSDP semantics, which only define a single `(pre-backward, post-backward)` interval equivalent to the union of this FSDP's `(pre-backward, post-backward)` intervals. This avoids spiking memory from having multiple modules left unresharded and avoids some autograd edge cases.

We implement the pre-backward hook by having a flag that is set upon the first call to disable subsequent calls. This flag could be maintained by FSDP, but for a cleaner design, we augment `register_multi_grad_hook` with a `mode="any"` option and use that instead.
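An illustrative sketch of the fire-once registration with `mode="any"`; the hook body and the `module_outputs` tensors are placeholders, not FSDP2's actual code:

```python
import torch
from torch.autograd.graph import register_multi_grad_hook

def pre_backward(grad):
    # With mode="any", this fires once, when the first registered tensor
    # has its gradient computed; FSDP would unshard parameters here.
    pass

module_outputs = [torch.randn(4, requires_grad=True) ** 2 for _ in range(2)]
tensors = [t for t in module_outputs if t.requires_grad]
handle = register_multi_grad_hook(tensors, pre_backward, mode="any")

sum(t.sum() for t in module_outputs).backward()
handle.remove()
```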

### Post-Backward

The post-backward hook is equivalent to a module full backward hook (`nn.Module.register_full_backward_hook`) except it adds pytree logic to work with data structures other than just flat `Tensor` args passed to `nn.Module.forward`. If we were to use `register_full_backward_hook`, then the hook could fire early (before all gradients for the module have been computed). Most internal models use custom data structures as `forward` inputs, and they find that unifying under pytree is an acceptable solution.

Unlike existing FSDP, we are able to reshard the parameters in the post-backward hook _before_ 'concatenating' the autograd-computed gradients, achieving a lower peak memory usage. (Existing FSDP has `SplitWithSizesBackward` that calls a `CatArrayBatched`, and here we have the reduce-scatter copy-in.)

### Final Callback
The final callback runs as a queued callback to the autograd engine, meaning that it runs at the end of backward.

In the future, if we do not want to wait for the reduce-scatter (or similar for CPU offloading), we can augment the final callback. The code is written such that each reduce-scatter can be waited on separately (via CUDA event).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118004
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117950, #117955, #117973, #117975
2024-02-02 19:10:11 +00:00
a688b4b397 Update pointwise concat heuristics (#118453)
This PR updates the heuristics for lowering to pointwise cat to trigger when we have either a small number of arbitrary pointwise inputs (8) or up to 128 pointwise inputs when they correspond to simple pointwise kernels or data movement.
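A hypothetical sketch of the updated heuristic, using the thresholds from the paragraph above; the function and helper names are assumed and do not correspond to Inductor's actual code:

```python
def is_simple_pointwise_or_data_movement(inp) -> bool:
    # Placeholder predicate; the real check inspects how the input is produced.
    return getattr(inp, "is_simple", False)

def should_lower_to_pointwise_cat(inputs) -> bool:
    if len(inputs) <= 8:
        # A small number of arbitrary pointwise inputs is always allowed.
        return True
    # Up to 128 inputs are allowed when each is a simple pointwise kernel
    # or plain data movement.
    return len(inputs) <= 128 and all(
        is_simple_pointwise_or_data_movement(inp) for inp in inputs
    )
```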

This originally came from an internal use case which noticed poor codegen: https://fb.workplace.com/groups/1075192433118967/posts/1365770660727808.

In our initial heuristics for lowering to a masked-loads pointwise concat kernel, we were conservative with the number of inputs we would allow, setting a maximum of 4.

However, I've noticed that we can fuse to pointwise_cat codegen much more aggressively while staying performant.

In the following benchmark I compare foreach and pointwise_cat codegen : https://gist.github.com/eellison/2bf83231f2940d9b9b33eb4721d35e15.

Here is the [csv output](https://gist.github.com/eellison/529da68b326e1d832c26c1dcdb42c313). When `gelu` is applied on neither the prologue nor the epilogue, pointwise concat is faster (this is just the data-movement case); applying gelu on the epilogue does not change this result. When you apply gelu on the prologue, then as the number of inputs increases you start getting register spills with pointwise concat and it gets slower.

![image](https://github.com/pytorch/pytorch/assets/11477974/0d6612b8-d60f-4984-99eb-9b518cd4af74)

![image](https://github.com/pytorch/pytorch/assets/11477974/4dda3341-68f9-4d1d-8334-67d7196371fb)

When I benchmarked with relu instead of gelu, only as inputs got up to 256 did the pointwise and foreach even out.

![image](https://github.com/pytorch/pytorch/assets/11477974/985418f8-ddb8-47c1-baea-ccd9de72cd7f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118453
Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/mlazos
ghstack dependencies: #118452
2024-02-02 18:31:37 +00:00
3a1ae86a93 Fix internal failure D53291154 (#118907)
Fix internal failure D53291154

From alban: the change is breaking because the alpha argument is now keyword-only (via the `*` marker), while it was previously allowed to be positional for the rsub.Scalar overload.
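A conceptual illustration of that breakage, using a plain Python function rather than the actual rsub.Scalar overload:

```python
def rsub_old(self, other, alpha=1):
    return other - alpha * self        # alpha may be passed positionally

def rsub_new(self, other, *, alpha=1):
    return other - alpha * self        # '*' makes alpha keyword-only

rsub_old(1.0, 2.0, 0.5)                # worked before the change
rsub_new(1.0, 2.0, alpha=0.5)          # still works after the change
# rsub_new(1.0, 2.0, 0.5)              # TypeError: too many positional arguments
```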

```
 _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 453, in _fn
    return fn(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/_dynamo/eval_frame.py", line 615, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "torch/_dynamo/convert_frame.py", line 390, in _convert_frame_assert
    return _compile(
  File "python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "torch/_dynamo/convert_frame.py", line 650, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "torch/_dynamo/utils.py", line 248, in time_wrapper
    r = func(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 531, in compile_inner
    out_code = transform_code_object(code, transform)
  File "torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "torch/_dynamo/convert_frame.py", line 155, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/convert_frame.py", line 496, in transform
    tracer.run()
  File "torch/_dynamo/symbolic_convert.py", line 2125, in run
    super().run()
  File "torch/_dynamo/symbolic_convert.py", line 787, in run
    and self.step()
  File "torch/_dynamo/symbolic_convert.py", line 750, in step
    getattr(self, inst.opname)(inst)
  File "torch/_dynamo/symbolic_convert.py", line 469, in wrapper
    return inner_fn(self, inst)
  File "torch/_dynamo/symbolic_convert.py", line 1249, in CALL_FUNCTION_KW
    self.call_function(fn, args, kwargs)
  File "torch/_dynamo/symbolic_convert.py", line 651, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "torch/_dynamo/variables/torch.py", line 614, in call_function
    tensor_variable = wrap_fx_proxy(
  File "torch/_dynamo/variables/builder.py", line 1285, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "torch/_dynamo/variables/builder.py", line 1370, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "torch/_dynamo/utils.py", line 1653, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "torch/_dynamo/utils.py", line 1599, in get_fake_value
    ret_val = wrap_fake_exception(
  File "torch/_dynamo/utils.py", line 1140, in wrap_fake_exception
    return fn()
  File "torch/_dynamo/utils.py", line 1600, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "torch/_dynamo/utils.py", line 1720, in run_node
    raise RuntimeError(fn_str + str(e)).with_traceback(e.__traceback__) from e
  File "torch/_dynamo/utils.py", line 1699, in run_node
    return node.target(*args, **kwargs)
  File "torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1637, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 1975, in dispatch
    return self._dispatch_impl(func, types, args, kwargs)
  File "torch/_subclasses/fake_tensor.py", line 2190, in _dispatch_impl
    r = func(*args, **kwargs)
  File "torch/_ops.py", line 571, in __call__
    return self_._op(*args, **kwargs)
  File "torch/_prims_common/wrappers.py", line 252, in _fn
    result = fn(*args, **kwargs)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118907
Approved by: https://github.com/lezcano
2024-02-02 18:17:34 +00:00
fd000340fd ProcessGroupGloo::allgather_into_tensor_coalesced (#118910)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues, but I'd rather just fix the coverage.

**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.

### This PR

This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`.  This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
2024-02-02 17:53:28 +00:00
70605d150b [quant][pt2] Add move_exported_model_to_train (#113492)
Summary: This is the equivalent API to `model.train()` for
exported models, analogous to `move_exported_model_to_eval`.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_inplace
python test/test_quantization.py TestQuantizePT2E.test_move_exported_model_dropout_bn

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113492
Approved by: https://github.com/jerryzh168, https://github.com/tugsbayasgalan
2024-02-02 17:39:47 +00:00
52b679d415 [BE] Cleanup CircleCI README (#118927)
All of the information there is out-of-date as CI/CD has long migrated to the GitHub Actions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118927
Approved by: https://github.com/kit1980
2024-02-02 17:08:20 +00:00
0e5fe4b3ae [AOTI] Fix a RAIIAtenTensorHandle premature deallocation bug (#118963)
Summary: generate_index_put_fallback currently generates something like the following,

```
AtenTensorHandle tensor_handle_array_1[] = {nullptr, nullptr, arg1_1, wrap_with_raii_handle_if_needed(tmp_tensor_handle_0)};
```

The problem is that wrap_with_raii_handle_if_needed creates a RAIIAtenTensorHandle which only lives during this tmp array initialization. After the initialization is done, the RAIIAtenTensorHandle dies and releases the underlying Tensor, so when tensor_handle_array_1 is later passed to aoti_torch_index_put_out, some of its AtenTensorHandle elements are invalid, causing a segfault.

Differential Revision: [D53339348](https://our.internmc.facebook.com/intern/diff/D53339348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118963
Approved by: https://github.com/aakhundov
2024-02-02 16:49:45 +00:00
53da422582 [export] Move _create_graph_module_for_export to torch/export (#118893)
Summary: I have to keep the torch/_export one to not break executorch...

Test Plan: CI

Reviewed By: avikchaudhuri

Differential Revision: D52842750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118893
Approved by: https://github.com/zhxchen17
2024-02-02 16:40:01 +00:00
b374f8987d [ROCm] Hipify trie re-engineering and adding unit tests (#118433)
Fixes [#117504](https://github.com/pytorch/pytorch/issues/117504)

Re-engineering the Hipify Trie:
(1) Re-engineered the Trie.
(2) Added more documentation and comments for easier understanding.
(3) Created a set of unit tests (class `TestHipifyTrie`) to test the Trie data structure and APIs.

Test:
```
root@xxx:/development/pytorch# pytest test/test_utils.py -k TestHipifyTrie
==================================================================================================== test session starts ====================================================================================================
platform linux -- Python 3.9.18, pytest-7.3.2, pluggy-1.3.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-13.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, shard-0.1.2, hypothesis-5.35.1
collected 11453 items / 11445 deselected / 8 selected
Running 8 items in this shard

test/test_utils.py ........                                                                                                                                                                                           [100%]

============================================================================================ 8 passed, 11445 deselected in 3.84s ============================================================================================
root@xxx:/development/pytorch#
```
Also diffed the contents modified and generated by this tool between the original code and the new hipify_python.py script, and verified there is no difference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2024-02-02 16:04:59 +00:00
65efbf078c Optimize dict keys guard when all the keys are constant (#118855)
We also rename ODICT_KEYS and make it use a list rather than a string.

Split from https://github.com/pytorch/pytorch/pull/118630.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118855
Approved by: https://github.com/peterbell10
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208, #118199, #118535
2024-02-02 14:42:56 +00:00
cdbc29e91a [dynamo,optim] Use the actual sources from the parameters when tracing "params" in an optimizer (#118535)
Fixes the unnecessary guards described at https://github.com/pytorch/pytorch/pull/117983#discussion_r1467622149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118535
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208, #118199
2024-02-02 14:42:56 +00:00
a3770bcf10 Add functools.partial and UserDefinedFunction to dict keys (#118199)
This is tested by `fullgraph=True` in the `test_getattr_dict` test.
I can write a one-off test for both if that's needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118199
Approved by: https://github.com/peterbell10, https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003, #118208
2024-02-02 14:42:35 +00:00
9d592c14eb Don't assume all subclasses of BaseUserFunctionVariable have a fn attribute (#118208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118208
Approved by: https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983, #117625, #118194, #118003
2024-02-02 14:42:06 +00:00
188628d99e [dynamo,easy] Add Typing variable to possible dict keys (#118003)
With this one, the only keys we are not tracing properly in the
(non-skipped) test suite are `OutDtypeHigherOrderVariable()`, and a
couple `UserDefinedObjectVariables`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118003
Approved by: https://github.com/anijain2305, https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #117982, #118098, #117983, #117625, #118194
2024-02-02 14:40:21 +00:00
ecf7d0e8ac Make dict guards amenable to the CSE pass (#118194)
Supersedes https://github.com/pytorch/pytorch/pull/118096 as a much cleaner and simpler solution.

It is difficult to write a test for this one without exposing too much
of the internals. You can see empirically that it works by running
```
TORCHDYNAMO_PRINT_GUARDS=1 TORCH_LOGS=+guards  python test/test_optim.py -k test_can_load_older_state_dict_ASGD_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118194
Approved by: https://github.com/jansel, https://github.com/peterbell10
ghstack dependencies: #117982, #118098, #117983, #117625
2024-02-02 14:38:48 +00:00
eb2bdfae88 Make variables in dict LazyTrackers (not lazily guarded yet) and avoid using DICT_KEYS guard (#117625)
Make variables in dict lazy and remove DICT_KEYS guard.

We build the keys of a dict depth-first and we rely on the guards of
each element in the dict to create the correct guards. This allows us to
remove the rather buggy DICT_KEYS guard and make the guard lazy.
The guards are not completely lazy yet, as we instantiate them in
`_HashableTracker._eq_impl` but it should be possible to make them
truly lazy.

Also, adding new types to the supported types within keys should be less
error prone.

This is marginally less efficient when we graph break, but in turn we
should graph break much less. It also makes the dicts code easier to maintain
(removes `is_hashable_python_var`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117625
Approved by: https://github.com/jansel, https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982, #118098, #117983
2024-02-02 14:38:08 +00:00
75a5c41921 [dynamo,optim] Place guards on the args before assuming they exist (#117983)
This enables the new way of writing guards for dicts. Before we were
doing things like
```
  L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)][3] is L['self'].param_groups[0]['params'][3]
```
without knowing whether `L['self'].param_groups[0][___dict_keys_getitem(L['self'].param_groups[0], 0)]` was a list.

On a different note, I'll probably write a pass to recover the previous
way to place guards on dicts via something like `DICT_KEYS`  as an
optimisation, as it seems relevant for optimisers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117983
Approved by: https://github.com/mlazos
ghstack dependencies: #117982, #118098
2024-02-02 14:37:46 +00:00
b1da929df9 Use SourcelesBuilder in BuiltinVariable (#118098)
This was failing when fetching a dictionary from a module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118098
Approved by: https://github.com/peterbell10, https://github.com/anijain2305
ghstack dependencies: #117982
2024-02-02 14:37:23 +00:00
0f3e20a1b6 Print the malformed guard when there's a guard error. (#117982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117982
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-02-02 14:37:05 +00:00
292243d1aa Automatically pull test reports from CI (#118882)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118882
Approved by: https://github.com/jamesjwu, https://github.com/yanboliang
ghstack dependencies: #118874
2024-02-02 14:18:56 +00:00
0f7954107a Add ability to print histogram as a github issue (#118874)
Adds the ability to print the failures histogram into lines that can be
copy-pasted into a github issue.

I used this to generate https://github.com/orgs/pytorch/projects/43

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118874
Approved by: https://github.com/jamesjwu
2024-02-02 14:18:56 +00:00
520771d7b3 refactor lazy init to device-agnostic (#118846)
# Motivation
This PR extends `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and changes `maybe_initialize_cuda` to `maybe_initialize_device` so lazy initialization still works for CUDA while remaining extensible to other backends.

# Design
We maintain a flag for each backend to manage the lazy initialization state separately.
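
A minimal Python sketch of that per-backend flag, assuming illustrative names (not the actual torch internals):

```
# Hedged sketch only; names are illustrative, not the real torch internals.
_initialized = {}   # one lazy-init flag per backend, e.g. "cuda", "xpu"

def _backend_init(device_type: str) -> None:
    # placeholder for the real backend initialization
    print(f"initializing {device_type}")

def maybe_initialize_device(device_type: str) -> None:
    if not _initialized.get(device_type, False):
        _backend_init(device_type)
        _initialized[device_type] = True

maybe_initialize_device("cuda")   # initializes once
maybe_initialize_device("cuda")   # no-op on the second call
```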

# Additional Context
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
2024-02-02 12:10:39 +00:00
2de327cedc Fixed an illegal memory access in cross entropy loss when using an index that is not a valid class (#117561)

Fixes #117532.

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117561
Approved by: https://github.com/mikaylagawarecki
2024-02-02 11:03:16 +00:00
05ac295177 [export] Fix bug with user input mutations (#118942)
We hit an edge case: when the exported graph contains placeholder nodes whose names conflict with names from aot_export, we don't update the user_inputs_to_mutate in the graph signature correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118942
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-02-02 09:02:04 +00:00
cc46829f96 [Inductor] GEMM shape padding improvements (#118522)
Improvements to shape padding logic in torch/_inductor/pad_mm.py

These changes could lead up to 14% perf improvement for certain Meta internal models in experiments.

Most notably:
  * 1.) Use the aten.const_pad_nd operation to pad tensors in a single op instead of using multiple steps involving intermediate buffers (see the sketch after this list). This appears to be more performant than the previous logic, confirmed by profiling & benchmarking results (Meta internal).
  * 2.) Make many paddings unnecessary by using an explicitly transposed GEMM when either the M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True).
  * 3.) Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case.
  * Add a config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False).
  * Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding.
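
A hedged sketch of idea (1): padding one dimension up to an alignment boundary in a single `constant_pad_nd` call (the alignment value and shapes are illustrative):

```
import torch

def pad_dim_to_multiple(t: torch.Tensor, dim: int, align: int = 8) -> torch.Tensor:
    """Pad `dim` of `t` with zeros up to the next multiple of `align` in one op."""
    pad = (-t.size(dim)) % align
    if pad == 0:
        return t
    # constant_pad_nd takes (left, right) pairs starting from the last dimension
    pads = [0, 0] * (t.dim() - 1 - dim) + [0, pad]
    return torch.constant_pad_nd(t, pads)

a = torch.randn(127, 253)
print(pad_dim_to_multiple(a, dim=1).shape)   # torch.Size([127, 256])
```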

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118522
Approved by: https://github.com/jansel, https://github.com/eellison
2024-02-02 08:50:06 +00:00
cyy
855d5f144e Relax MKL_INT assumption to int64_t (#118946)
When I built PyTorch on Windows with the latest MKL, it reported:
```
sources\pytorch\aten\src\ATen/cpu/vml.h(106): error C2338: static_assert failed: 'MKL_INT is assumed to be int32_t'
```
It should be safe to relax the restriction to int64_t.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118946
Approved by: https://github.com/ezyang
2024-02-02 07:11:47 +00:00
2964170f3a Revert "[optim] Rectify capturable testing and fix bugs! (#118326)"
This reverts commit d947b9d50011ebd75db2e90d86644a19c4fe6234.

Reverted https://github.com/pytorch/pytorch/pull/118326 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there are some relevant failures in trunk d947b9d500, may be a land race ([comment](https://github.com/pytorch/pytorch/pull/118326#issuecomment-1923125676))
2024-02-02 07:08:14 +00:00
4a5a2c6571 Update auto_functionalize schema (#118809)
- Moved the dictionary arguments to the node's kwargs as dicts are not
  valid inputs.
- Inlined the mutated arguments into the output. Originally, the output of
  auto_functionalize was the operator output plus a list of mutated arguments
  (e.g. [op_out1, op_out2, [mutated_arg1, mutated_arg2]]). However, this is not
  easily exportable. Now it will just be [op_out1, op_out2, mutated_arg1, mutated_arg2].

Differential Revision: [D53331040](https://our.internmc.facebook.com/intern/diff/D53331040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118809
Approved by: https://github.com/zou3519
2024-02-02 06:21:43 +00:00
89b7ab671e Protect against modules without __file__ (#117445)
The __file__ special variable is optional, so it should be treated as such.
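
A minimal sketch of the defensive pattern, with an illustrative helper name (not the exact dynamo change):

```
import sys
import types

def module_file(mod: types.ModuleType):
    # builtins, namespace packages, and frozen modules may not define __file__
    return getattr(mod, "__file__", None)

print(module_file(sys))     # builtin module: None
print(module_file(types))   # regular module: path to types.py
```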

Fixes #117109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117445
Approved by: https://github.com/oulgen, https://github.com/yanboliang
2024-02-02 06:06:50 +00:00
3d8c36786b Add device for distributed examples (#118867)
## 🐛 Describe the bug

The following example (`all_reduce`) misses the `device` allocation:
a205e7bf56/torch/distributed/distributed_c10d.py (L2080-L2087)

## Solution

A better example would look like this:
a205e7bf56/torch/distributed/distributed_c10d.py (L3212-L3222)
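
For illustration, a hedged sketch of what the improved example conveys, assuming an already-initialized process group and one GPU per rank:

```
import torch
import torch.distributed as dist

# assumes dist.init_process_group(...) has already been called (e.g. under torchrun)
rank = dist.get_rank()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

tensor = torch.ones(2, device=device)          # allocate on this rank's device, not on CPU
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # sums across all ranks in-place
```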

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118867
Approved by: https://github.com/soulitzer
2024-02-02 05:51:59 +00:00
da5cbb1269 [export] fix for duplicate constant lifting (#118776)
Summary:
Whenever we access a constant, we emit a `get_attr` node for it.

The `lift_constants_pass` was lifting every `get_attr` node unconditionally, even if the same target was already lifted. This diff fixes that.

I also took the liberty of adding some infra to make it easier to unit test passes. GraphBuilder lets you declaratively construct graphs with the right metadata, it's pretty useful for directly inducing the pattern you want to test against.

Test Plan: added unit test

Differential Revision: D53278161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118776
Approved by: https://github.com/angelayi, https://github.com/titaiwangms
2024-02-02 05:51:31 +00:00
32f48e917d [minimizer] Defined traverse (#118889)
Summary:
Add a defined traverse mode for the minimizer.
It takes user-provided start_idx and end_idx, forms a subgraph, and compares results from accelerators vs. CPU.

Differential Revision: D53318292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118889
Approved by: https://github.com/jfix71
2024-02-02 05:50:17 +00:00
3f1f057adf Remove parent device mesh check (#118620)
Removes raising error if a device_mesh has a parent.

The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- https://github.com/pytorch/pytorch/pull/118618
- a series of PRs related to checkpointing with 3D meshes that I will open
We currently monkeypatch around the above, which I am slowly upstreaming.

I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-02-02 05:29:49 +00:00
9cc6422ab6 Revert "[executorch hash update] update the pinned executorch hash (#118936)"
This reverts commit 8cc8cf75f31f7e430ab2918db4a2fb9c7b951024.

Reverted https://github.com/pytorch/pytorch/pull/118936 on behalf of https://github.com/suo due to conflicts with human change ([comment](https://github.com/pytorch/pytorch/pull/118936#issuecomment-1922824471))
2024-02-02 05:05:44 +00:00
8cc8cf75f3 [executorch hash update] update the pinned executorch hash (#118936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118936
Approved by: https://github.com/pytorchbot
2024-02-02 04:10:53 +00:00
497ea17684 Limit reductions into pointwise cat fusion (#118452)
@Chillee observed a regression when fusing the following:
```
        def f(a, b):
            return torch.cat([torch.softmax(a, dim=-1), torch.softmax(b, dim=-1)])
```

This PR limits pointwise concat/masked fusion in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118452
Approved by: https://github.com/jansel
2024-02-02 03:34:50 +00:00
babd6c776d [inductor] skip launching kernels with zero grid in AOTInductor when using backed symints (#118654)
Like #110312 but we also run this check when backed symints are in the grid (e.g. s1 / 512)

### Why?

Let's say we lower a model and generate a GPU kernel grid with symbolic shapes, e.g. `s1 / 512`. If at some point later we run the lowered model with inputs such that `s1 = 0`, then we'll launch the kernel with a `0`-sized grid. This surfaces as `CUDA driver error: invalid argument`.

To avoid this, we check for a `0`-sized grid whenever there are symbolic shapes, which includes both backed and unbacked symints.

This adds non-zero overhead to the CPU. However, in return, we get better reliability when encountering this scenario. This scenario happened when serving an internal model.
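
A toy Python sketch of the guard (the grid expression and launcher are illustrative, not the generated AOTInductor code):

```
class _FakeKernel:
    """Stand-in for a compiled Triton kernel, just to make the sketch runnable."""
    def __getitem__(self, grid):
        return lambda *args: print(f"launch grid={grid} args={args}")

def launch_with_symbolic_grid(kernel, s1, *args):
    grid_x = (s1 + 511) // 512          # e.g. ceil(s1 / 512) from the symbolic shape
    if grid_x == 0:                     # a 0-sized grid is an invalid CUDA launch argument
        return                          # skip the launch entirely
    kernel[(grid_x, 1, 1)](*args)

launch_with_symbolic_grid(_FakeKernel(), 0)      # skipped
launch_with_symbolic_grid(_FakeKernel(), 1024)   # launches with grid (2, 1, 1)
```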

### Test

```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
OK (skipped=3)

$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols

# Before
Error: CUDA driver error: invalid argument
FAILED (errors=2, skipped=3)

# Now
OK (skipped=3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118654
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-02-02 03:19:52 +00:00
946ea47a4f [inductor] Fix an internal test issue (#118903)
Summary: test_add_complex4, introduced in https://github.com/pytorch/pytorch/pull/117929, fails internally because of a cpp compilation issue for CPU. Specify the right device in the test instead.

Differential Revision: [D53333919](https://our.internmc.facebook.com/intern/diff/D53333919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118903
Approved by: https://github.com/clee2000
2024-02-02 03:18:12 +00:00
8b729fb826 [ez] Fix CI log file piping error (#118807)
Fixes https://github.com/pytorch/pytorch/issues/118764

Example log https://github.com/pytorch/pytorch/actions/runs/7737363970/job/21097159160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118807
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-02-02 03:07:56 +00:00
d947b9d500 [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we don't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks
3. The most recent change in Adamax only adds capturable for foreach but will silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many many many skips for that. Honestly, this PR has only further cemented my preference that we should just do the single-tensor and multi-tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs shows that capturable is included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 02:02:58 +00:00
08472a4fd5 [dtensor] add op support for aten.gather.default (#118513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118513
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-02-02 01:48:21 +00:00
8ca8729321 [PT-Vulkan][EZ] Adjust string-report width (#118914)
## Before: P1148506541

Some of the shader names are now too long.
```
Kernel Name              Workgroup Size             Duration (ns)
===========              ==============               ===========
vulkan.nchw_to_image     {500, 500, 1}                    4322188
vulkan.nchw_to_image     {500, 500, 1}                    4322240
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1189240
vulkan.zero              {1, 1, 1}                           3744
vulkan.convert_channels_to_width_packed{125, 500, 1}                    1265680
```

## After: P1148506671

Now it's just right; `convert_channels_to_height_packed` is the longest shader name in the codebase.
```
Kernel Name                             Workgroup Size             Duration (ns)
===========                             ==============               ===========
vulkan.nchw_to_image                    {500, 500, 1}                    4327232
vulkan.nchw_to_image                    {500, 500, 1}                    4327960
vulkan.convert_channels_to_height_packed{500, 125, 1}                    1190540
vulkan.zero                             {1, 1, 1}                           3744
vulkan.convert_channels_to_width_packed {125, 500, 1}                    1287468
```

Differential Revision: [D53293924](https://our.internmc.facebook.com/intern/diff/D53293924/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118914
Approved by: https://github.com/liuk22
2024-02-02 01:43:48 +00:00
7e1ac59016 [pytorch][vulkan] add 1d tensor support for linear (#118690)
Summary: Vulkan Linear op doesn't support 1d tensors. We can unsqueeze 1d tensors to 2d to unblock the functionality.
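
A tiny Python sketch of the unsqueeze idea (conceptual only; the actual change is in the Vulkan C++ op):

```
import torch
import torch.nn.functional as F

def linear_1d(x: torch.Tensor, weight: torch.Tensor, bias=None) -> torch.Tensor:
    assert x.dim() == 1
    out = F.linear(x.unsqueeze(0), weight, bias)  # (in,) -> (1, in) -> (1, out)
    return out.squeeze(0)                         # back to (out,)

x = torch.randn(8)
w = torch.randn(4, 8)
print(linear_1d(x, w).shape)   # torch.Size([4])
```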

Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*linear_*"`
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *linear_*
[==========] Running 11 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 11 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.linear_1d_small
[       OK ] VulkanAPITest.linear_1d_small (319 ms)
[ RUN      ] VulkanAPITest.linear_1d_large
[       OK ] VulkanAPITest.linear_1d_large (64 ms)
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (129 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (51 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (0 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (6 ms)
[----------] 11 tests from VulkanAPITest (578 ms total)

[----------] Global test environment tear-down
[==========] 11 tests from 1 test suite ran. (578 ms total)
[  PASSED  ] 11 tests.
```

Differential Revision: D53243201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118690
Approved by: https://github.com/jorgep31415, https://github.com/liuk22
2024-02-02 01:35:45 +00:00
796278b57e Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit 20484a193626ef72e0b3f35914f17deb2a89b8fc.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to broke linux-focal-rocm5.7-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1922613135))
2024-02-02 01:19:19 +00:00
9153174cd1 [pt-vulkan] Introduce SharedObject class to ComputeGraph (#118756)
## Context

This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).

This changeset builds upon the [previous PR enabling resource aliasing](https://github.com/pytorch/pytorch/pull/118436) and introduces the `SharedObject` class to `ComputeGraph`, which manages resource aliasing in graph execution mode. `SharedObject` tracks which `vTensor` values in a `ComputeGraph` share the same backing memory, and provides functionality to aggregate memory requirements and bind users to the same memory allocation.

## Notes for Reviewers

The `SharedObject` class is introduced in `Graph.h`. It's fairly simple and provides three functions:

* `add_user()` which adds a `ValueRef` to the list of users of the `SharedObject`, and updates the aggregate memory requirements with the memory requirements of the new user
* `allocate_memory()` creates a `VmaAllocation` with the aggregated memory requirements
* `bind_users()` iterates over the `users` of the `SharedObject` and binds each `vTensor`'s underlying resource to the memory associated with the `SharedObject`.

As for how `SharedObject` is used in `ComputeGraph`:

* `add_tensor()` now has an additional argument `shared_object_idx` which, if `>0`, will construct a `vTensor` without any backing memory and add the new `vTensor` to the `SharedObject` at `shared_object_idx`
* `encode_execute()` will first iterate through the `SharedObject`s of the graph and allocate + bind users before recording the command buffer.

Differential Revision: [D53271486](https://our.internmc.facebook.com/intern/diff/D53271486/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118756
Approved by: https://github.com/jorgep31415, https://github.com/yipjustin
2024-02-02 01:19:00 +00:00
a5a63db3bf add Half support for flash attention on CPU (#118368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118368
Approved by: https://github.com/jgong5, https://github.com/Valentine233, https://github.com/drisspg
ghstack dependencies: #118367
2024-02-02 01:08:39 +00:00
838c1c553e Add back recompile test (#118905)
Adds back a test that was skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118905
Approved by: https://github.com/janeyx99
2024-02-02 00:51:01 +00:00
4b59bfe8e5 [CI] Filter should not fail if pr_body is empty (#118934)
Otherwise it will fail with `TypeError: argument of type 'NoneType' is not iterable` (see https://github.com/pytorch/pytorch/actions/runs/7748725174/job/21131915226 for example)

```
% gh api /repos/pytorch/pytorch/issues/118927|
{
  "url": "https://api.github.com/repos/pytorch/pytorch/issues/118927",
  ...
  "body": null,
  ...
  "state_reason": null
}
```
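
A hedged sketch of the kind of guard the title describes (names are illustrative; this is not the actual CI filter script):

```
def has_keyword(pr: dict, keyword: str) -> bool:
    body = pr.get("body") or ""     # GitHub returns null when the PR description is empty
    return keyword in body          # safe; `keyword in None` raises TypeError

print(has_keyword({"body": None}, "ciflow"))            # False instead of crashing
print(has_keyword({"body": "ciflow/trunk"}, "ciflow"))  # True
```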

TODO: Can we add a test for it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118934
Approved by: https://github.com/clee2000, https://github.com/seemethere, https://github.com/huydhn
2024-02-02 00:49:20 +00:00
08d90a1ea9 Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
Info about super in dynamic classes:
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Calling super(TestCase) actually calls TestCase's parent's functions, bypassing TestCase's own functions.
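
A tiny self-contained illustration of that pitfall (class names are made up):

```
class Base:
    def setUp(self):
        print("Base.setUp")

class TestCase(Base):
    def setUp(self):
        print("TestCase.setUp")

# dynamically created subclass, as the test file does with type(...)
Derived = type("Derived", (TestCase,), {})

d = Derived()
super(TestCase, d).setUp()   # prints "Base.setUp": lookup starts *after* TestCase,
                             # so TestCase's own override is bypassed
```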

Mainly doing this because it's making disable bot spam

Test: checked locally and verified that https://github.com/pytorch/pytorch/issues/117954 actually got skipped

Logs for `inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unbacked_index_select_cuda`
https://ossci-raw-job-status.s3.amazonaws.com/log/21083466405
AFAIK this PR doesn't actually cause the test to fail; it just surfaces the error, since the mem leak check wasn't running previously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118586
Approved by: https://github.com/huydhn
2024-02-02 00:40:37 +00:00
7c609f01ff [PT-Vulkan] aten::conv1d - support any batch size (#118834)
Completes `aten::conv1d` implementation.

See D53204673 for full context.

Differential Revision: [D53253625](https://our.internmc.facebook.com/intern/diff/D53253625/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118834
Approved by: https://github.com/yipjustin
ghstack dependencies: #118833
2024-02-01 23:53:00 +00:00
dc4779b010 Split out fake_impls from fake_tensor (#118878)
The motivation is that fake_tensor is marked as an uninteresting file for the purposes of backtraces, but operator implementations in fake tensor are interesting and I do want them reported.

How did I decide whether or not to move helper functions? It was kind of random, but if they weren't used in fake tensor generally, I moved them over.

There are no functional code changes, so you only need to review the import changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118878
Approved by: https://github.com/eellison
2024-02-01 23:50:56 +00:00
844a76ebe8 [MPS][BE] Remove stale TODO (#118902)
And use convenient methods

TODO was added by an accidental copy-n-paste of code from https://github.com/pytorch/pytorch/pull/82315 into  https://github.com/pytorch/pytorch/pull/88532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118902
Approved by: https://github.com/kit1980
2024-02-01 23:43:23 +00:00
a16df1d85f [Dynamo] graph break on isinstance calls if we don't know the type (#118778)
If we can't figure out the python type of a VariableTracker, then the
isinstance call should graph break (instead of raising an error).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118778
Approved by: https://github.com/ydwu4
ghstack dependencies: #118768
2024-02-01 23:18:10 +00:00
39aab55c1c Add myself to CODEOWNERS for serialization-related files (#118892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118892
Approved by: https://github.com/albanD
2024-02-01 23:14:04 +00:00
46ef73505d Clarify how to get extra link flags when building CUDA/C++ extension (#118743)
Make it a bit more explicit how one passes linker arguments to the build, and point to the superclass documentation.
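
For illustration, a hedged `setup.py` sketch showing where extra linker flags can be passed (the flag values and file names are made up):

```
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="my_ext",
    ext_modules=[
        CUDAExtension(
            name="my_ext",
            sources=["my_ext.cpp", "my_ext_kernel.cu"],
            extra_link_args=["-Wl,--no-as-needed", "-lcublas"],  # forwarded to the linker
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```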
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118743
Approved by: https://github.com/ezyang
2024-02-01 22:35:25 +00:00
dbba1d4bf5 Revert "Some minor type stub improvements (#118529)"
This reverts commit c978f38bd4aedeff4ee9ae693349217daea01412.

Reverted https://github.com/pytorch/pytorch/pull/118529 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/118529#issuecomment-1922362331))
2024-02-01 22:18:36 +00:00
d4a94ad041 [ONNX] Fix upsample_bilinear2d decomp skip with output shape (#118823)
The previous output size missed the first two dimensions.
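A hedged sketch of the shape fix being described (the helper name is illustrative):

```
def full_output_size(input_shape, output_hw):
    # upsample_bilinear2d outputs (N, C, H_out, W_out); keep batch and channel dims
    n, c = input_shape[:2]
    return [n, c, *output_hw]

print(full_output_size((2, 3, 16, 16), (32, 32)))   # [2, 3, 32, 32]
```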
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118823
Approved by: https://github.com/titaiwangms
2024-02-01 22:04:35 +00:00
6692f2c91e [no ci] Add myself to MPS codeowners (#118904)
I got pinged on every other PR anyway, so just a means to automate the process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118904
Approved by: https://github.com/albanD
2024-02-01 21:52:15 +00:00
6929322a28 [PT-Vulkan] aten::conv1d - support any channel-group combo (#118833)
## Main

Part of completing `aten::conv1d`'s implementation. See D53204673 for full context.

This diff relaxes the constraint
```
c_in = c_out = groups
```
to support any legal combination of c_in, c_out, groups.

From the [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html), both c_in and c_out must be divisible by groups. Apart from that, any combo is now fair game.

## Additional

Improved GLSL comments and variable names, since more indices yield more headaches.

Differential Revision: [D53248767](https://our.internmc.facebook.com/intern/diff/D53248767/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118833
Approved by: https://github.com/yipjustin
2024-02-01 21:46:01 +00:00
61b572ed56 [inductor] more accurate throughput calculations for kernel benchmarks (#118858)
Our current throughput calculations for kernel benchmarks have some issues,
particularly when we slice inputs in the kernel. In such cases, we count
the original inputs as part of the memory traffic passed across the kernel.
This is incorrect because it may result in a much larger throughput
calculation, which can even exceed the theoretical bandwidth.

Instead, we should only count the size of the "slices" that contribute to
the actual memory traffic.
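
A toy illustration of the point (shapes are arbitrary):

```
import torch

x = torch.randn(1024, 1024)
view = x[:, :128]                                    # the kernel only reads this slice

full_bytes = x.numel() * x.element_size()            # what the old calculation counted
actual_bytes = view.numel() * view.element_size()    # what should be counted

print(full_bytes, actual_bytes)   # 4194304 vs 524288; throughput = actual_bytes / runtime
```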

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118858
Approved by: https://github.com/jansel
2024-02-01 21:42:14 +00:00
20484a1936 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so the two did not work together at first.

Thanks Jason for suggesting a neat way to integrate these two. cpp-wrapper does two codegen passes right now. For the first pass, we still generate multi-kernel code and run it; for the second pass, we load the cubin file for the faster kernel directly. The multi-kernel Python code is not generated for the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-01 21:29:02 +00:00
54668ad6dc Cleanup max cuda device (#118779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118779
Approved by: https://github.com/ezyang
2024-02-01 21:11:28 +00:00
f63dc9a21d s/DIRECLTY/DIRECTLY/ (#118877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118877
Approved by: https://github.com/albanD
2024-02-01 20:25:58 +00:00
923a7c7572 add test elipsis to dynamo test functions (#118754)
Add tests to ensure the bug reported in #117563 does not regress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118754
Approved by: https://github.com/anijain2305
2024-02-01 19:05:01 +00:00
318e6ff40e Fix __name__ on a reconstructed NestedUserFunctionVariable (#118768)
```
def f():
    def g():
        return ()

    print(g.__name__)

f()
```

The following script should print `g` (with or without torch.compile),
but prints `f.<locals>.g` with torch.compile.

The problem looks like we use the co_qualname when reconstructing the
NestedUserFunctionVariable. I switched this over to use the co_name.
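
For reference, the distinction between the two code-object attributes (`co_qualname` is available on Python 3.11+):

```
def f():
    def g():
        return ()
    return g

code = f().__code__
print(code.co_name)        # 'g'
print(code.co_qualname)    # 'f.<locals>.g'  (Python 3.11+)
```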

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118768
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-02-01 18:59:01 +00:00
b0e65dd1b4 Fix TCP Store Windows (#118860)
In https://github.com/pytorch/pytorch/pull/107607 a new Validate flow was added; however, on Windows it was not calling addMiscellaneousSocket.
Added missing call to addMiscellaneousSocket on Windows.

Fixes #118737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118860
Approved by: https://github.com/awgu, https://github.com/malfet
2024-02-01 18:46:18 +00:00
df048f4da4 Revert "[RELAND] Remove deprecated fbgemm operators (#112153)"
This reverts commit 19e8ba95e535cd73d3eb37849f383ca8bab58603.

Reverted https://github.com/pytorch/pytorch/pull/112153 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112153#issuecomment-1921965780))
2024-02-01 18:35:19 +00:00
0f7e63620f CUDA fast path for split_with_sizes_copy.out (#117203)
### Motivation
In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` stand for values in params `A`, `B`, `C`):

All-gather output:
```
AAAABBBCCAAAABBBCC
```

After all-gather-copy-out:
```
AAAAAAAA  BBBBBB  CCCC
```

The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today.

We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515). This PR aims to incorporate the optimizations to appropriate ATen ops (as suggested by @albanD).

### all-gather-copy-out via Composing ATen Ops

Carrying out the op out via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows:

Reshape all-gather output as (world_size, -1):
```
AAAABBBCC
AAAABBBCC
```

`split_with_sizes` + `_foreach_copy_`:
```
AAAA BBB CC
AAAA BBB CC
```
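
In code, the baseline composition described above looks roughly like this (a hedged sketch; the sizes are illustrative and the real inputs live on GPU):

```
import torch

world_size, sizes = 2, [4, 3, 2]                   # per-rank numels of params A, B, C
allgather_out = torch.randn(world_size * sum(sizes))
outs = [torch.empty(world_size * s) for s in sizes]

# view as (world_size, -1), split per parameter, then batch-copy into the outputs
srcs = allgather_out.view(world_size, -1).split_with_sizes(sizes, dim=1)
torch._foreach_copy_([o.view(world_size, -1) for o in outs], list(srcs))
```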

However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons:
- The approach requires materializing `O(num_params)` intermediate views, which induces a large amount of CPU overhead when `num_params` is high.
- `_foreach_copy_` uses the same block size for all tensors, leading to waste for small tensors and an insufficient thread count for large tensors. This means low effective occupancy.
- `_foreach_copy_` dispatches multiple kernels for typical all-gather-copy-out problem sizes. This further lowers the effective occupancy.
- Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization opportunities in such workloads.

### PR
Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details.

### Benchmarks
The benchmarks are conducted on a set of representative problems sizes on an A100. CPU overhead and GPU execution time is measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated with GPU execution time.

Compared to the baseline, we observe 3x-10x higher throughput depending on the problem size. We also observe lower CPU overhead across the board.

Baseline:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 10.664)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.495 GB/s (gpu ms/iter: 6.398, cpu ms/iter 0.170)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147)
```
New kernel:
```
num_params=150   world_size=8     mixed=True    Param size: 0.059 GB    Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54    world_size=8     mixed=True    Param size: 1.453 GB    Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54    world_size=8     mixed=True    Param size: 0.512 GB    Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50    world_size=8     mixed=True    Param size: 0.200 GB    Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3     world_size=8     mixed=True    Param size: 0.983 GB    Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9     world_size=8     mixed=True    Param size: 0.802 GB    Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3     world_size=8     mixed=True    Param size: 1.573 GB    Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9     world_size=8     mixed=True    Param size: 2.248 GB    Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138)
num_params=150   world_size=128   mixed=True    Param size: 0.064 GB    Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996)
num_params=54    world_size=128   mixed=True    Param size: 1.458 GB    Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289)
num_params=54    world_size=128   mixed=True    Param size: 0.515 GB    Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264)
num_params=50    world_size=128   mixed=True    Param size: 0.203 GB    Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249)
num_params=3     world_size=128   mixed=True    Param size: 0.983 GB    Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078)
num_params=9     world_size=128   mixed=True    Param size: 0.802 GB    Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104)
num_params=3     world_size=128   mixed=True    Param size: 1.573 GB    Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=True    Param size: 2.248 GB    Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099)
num_params=150   world_size=1024  mixed=True    Param size: 0.202 GB    Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616)
num_params=54    world_size=1024  mixed=True    Param size: 1.524 GB    Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276)
num_params=54    world_size=1024  mixed=True    Param size: 0.575 GB    Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278)
num_params=50    world_size=1024  mixed=True    Param size: 0.246 GB    Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245)
num_params=3     world_size=1024  mixed=True    Param size: 1.007 GB    Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=True    Param size: 0.818 GB    Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102)
num_params=3     world_size=1024  mixed=True    Param size: 1.611 GB    Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=True    Param size: 2.248 GB    Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106)
num_params=150   world_size=8     mixed=False   Param size: 0.035 GB    Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656)
num_params=54    world_size=8     mixed=False   Param size: 0.961 GB    Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264)
num_params=54    world_size=8     mixed=False   Param size: 0.282 GB    Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279)
num_params=50    world_size=8     mixed=False   Param size: 0.149 GB    Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274)
num_params=3     world_size=8     mixed=False   Param size: 0.655 GB    Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077)
num_params=9     world_size=8     mixed=False   Param size: 0.634 GB    Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112)
num_params=3     world_size=8     mixed=False   Param size: 1.049 GB    Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081)
num_params=9     world_size=8     mixed=False   Param size: 1.711 GB    Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116)
num_params=150   world_size=128   mixed=False   Param size: 0.038 GB    Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632)
num_params=54    world_size=128   mixed=False   Param size: 0.963 GB    Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294)
num_params=54    world_size=128   mixed=False   Param size: 0.283 GB    Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286)
num_params=50    world_size=128   mixed=False   Param size: 0.151 GB    Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255)
num_params=3     world_size=128   mixed=False   Param size: 0.655 GB    Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074)
num_params=9     world_size=128   mixed=False   Param size: 0.634 GB    Copy bandwidth: 765.694 GB/s (gpu ms/iter: 0.829, cpu ms/iter 0.094)
num_params=3     world_size=128   mixed=False   Param size: 1.049 GB    Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075)
num_params=9     world_size=128   mixed=False   Param size: 1.711 GB    Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105)
num_params=150   world_size=1024  mixed=False   Param size: 0.122 GB    Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668)
num_params=54    world_size=1024  mixed=False   Param size: 1.000 GB    Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274)
num_params=54    world_size=1024  mixed=False   Param size: 0.318 GB    Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276)
num_params=50    world_size=1024  mixed=False   Param size: 0.185 GB    Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262)
num_params=3     world_size=1024  mixed=False   Param size: 0.671 GB    Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078)
num_params=9     world_size=1024  mixed=False   Param size: 0.645 GB    Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109)
num_params=3     world_size=1024  mixed=False   Param size: 1.074 GB    Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080)
num_params=9     world_size=1024  mixed=False   Param size: 1.711 GB    Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203
Approved by: https://github.com/albanD, https://github.com/awgu
ghstack dependencies: #118512
2024-02-01 18:23:01 +00:00
68f9c28e00 Don't make default arguments dynamic (#118772)
Noticed this while working on
https://github.com/pytorch/pytorch/issues/114590

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118772
Approved by: https://github.com/anijain2305
2024-02-01 18:11:57 +00:00
24dd9f42ce [MPS] Fix use_metal_mm condition (#118830)
One should look not only at the stride sizes but at the dimensions as well, as the strides of `torch.rand(65536, 1)` are `(1, 1)`.

Extend test to account for this situation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118830
Approved by: https://github.com/huydhn
2024-02-01 17:53:42 +00:00
3e79ef6db8 Complete decomposition for aten.round (#118635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118635
Approved by: https://github.com/peterbell10
2024-02-01 17:14:44 +00:00
0010b6145e Reduce register usage of fused adam(w) (#118361)
Part of #117872

| branch | cpu time avg (ms) | cuda time avg (ms) |
|--------|--------------|---------------|
| [main](eebe7e1d37f1baa995c694d540cc2fc98884fa18) | 13.430 | 144.117 |
| pr                                               | 13.371 | 49.655  |

Used the torch profiler to measure the avg perf of 20 iterations.
Model is openlm-research/open_llama_7b_v2 (script is [here](https://gist.github.com/crcrpar/ca951d4e7f3e1c771d502135b798f0d1)).

---

PR
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        5.789s        46.42%        5.789s     289.456ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        36.02%        3.119s        67.19%        5.819s     290.958ms       0.000us         0.00%        2.586s     129.276ms      48.00 Kb      -1.47 Mb           0 b    -504.23 Gb            20
                                               aten::mm         2.57%     222.681ms         8.80%     762.415ms      56.475us        2.501s        20.05%        2.501s     185.255us           0 b           0 b     441.39 Gb     441.39 Gb         13500
       autograd::engine::evaluate_function: MmBackward0         0.10%       8.600ms         8.17%     707.935ms     157.319us       0.000us         0.00%        1.625s     361.098us           0 b           0 b     198.65 Gb    -135.03 Gb          4500
                                            MmBackward0         0.39%      33.896ms         7.99%     692.035ms     153.786us       0.000us         0.00%        1.601s     355.710us           0 b           0 b     330.84 Gb    -248.00 Mb          4500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        1.007s         8.07%        1.007s      50.329ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     837.000us         3.36%     290.610ms      14.530ms       0.000us         0.00%     993.235ms      49.662ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.22%      18.825ms         3.35%     289.773ms      14.489ms       0.000us         0.00%     993.235ms      49.662ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.12%      10.823ms         3.09%     267.428ms      13.371ms     993.095ms         7.96%     993.095ms      49.655ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     993.095ms         7.96%     993.095ms     154.207us           0 b           0 b           0 b           0 b          6440
                                           aten::matmul         0.19%      16.140ms         1.73%     149.869ms      33.304us       0.000us         0.00%     876.000ms     194.667us           0 b           0 b     107.46 Gb           0 b          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     835.374ms         6.70%     835.374ms     185.639us           0 b           0 b           0 b           0 b          4500
                                           aten::linear         0.27%      23.268ms         1.97%     170.227ms      37.828us       0.000us         0.00%     776.278ms     172.506us           0 b           0 b     107.46 Gb      12.17 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     707.074ms         5.67%     707.074ms     183.180us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.31%     113.614ms         5.14%     445.405ms      22.125us     552.421ms         4.43%     552.780ms      27.459us     256.32 Kb     256.21 Kb     420.38 Gb     419.88 Gb         20131
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     442.209ms         3.55%     442.209ms     138.190us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.25%      21.336ms         5.00%     432.976ms      74.651us       0.000us         0.00%     398.627ms      68.729us           0 b           0 b     -45.71 Gb    -252.76 Gb          5800
                                             aten::add_         0.37%      31.975ms         7.19%     622.433ms      53.658us     391.957ms         3.14%     391.957ms      33.789us           0 b           0 b      -4.35 Gb      -4.35 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     345.037ms         2.77%     345.037ms     265.413us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.41%      35.727ms        20.62%        1.786s     146.503us     342.386ms         2.75%     342.386ms      28.092us      48.00 Kb      48.00 Kb     -56.00 Mb     -56.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 8.661s
Self CUDA time total: 12.472s
```

main
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        7.671s        42.31%        7.671s     383.529ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        28.85%        3.050s        72.83%        7.700s     385.009ms       0.000us         0.00%        4.474s     223.678ms      48.00 Kb      -1.48 Mb           0 b    -504.45 Gb            20
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        2.896s        15.97%        2.896s     144.787ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     819.000us         2.75%     291.024ms      14.551ms       0.000us         0.00%        2.882s     144.125ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.17%      18.291ms         2.74%     290.205ms      14.510ms       0.000us         0.00%        2.882s     144.125ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.10%      10.893ms         2.54%     268.602ms      13.430ms        2.882s        15.90%        2.882s     144.117ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us        2.882s        15.90%        2.882s     447.570us           0 b           0 b           0 b           0 b          6440
                                               aten::mm         2.05%     217.136ms         7.21%     762.211ms      56.460us        2.499s        13.78%        2.499s     185.075us           0 b           0 b     441.37 Gb     441.37 Gb         13500
       autograd::engine::evaluate_function: MmBackward0         0.07%       7.179ms         6.77%     715.673ms     159.038us       0.000us         0.00%        1.624s     360.812us           0 b           0 b     198.65 Gb    -134.64 Gb          4500
                                            MmBackward0         0.32%      34.257ms         6.62%     700.088ms     155.575us       0.000us         0.00%        1.600s     355.460us           0 b           0 b     330.59 Gb    -628.00 Mb          4500
                                           aten::matmul         0.15%      15.892ms         1.32%     139.597ms      31.022us       0.000us         0.00%     874.861ms     194.414us           0 b           0 b     107.46 Gb           0 b          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     834.631ms         4.60%     834.631ms     185.474us           0 b           0 b           0 b           0 b          4500
                                           aten::linear         0.21%      22.460ms         1.51%     159.620ms      35.471us       0.000us         0.00%     774.772ms     172.172us           0 b           0 b     107.46 Gb      11.88 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     705.996ms         3.89%     705.996ms     182.901us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.06%     112.529ms         4.28%     452.473ms      22.488us     552.242ms         3.05%     552.266ms      27.447us     255.90 Kb     255.88 Kb     413.93 Gb     413.90 Gb         20121
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     441.514ms         2.44%     441.514ms     137.973us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.19%      20.517ms         4.18%     442.189ms      76.239us       0.000us         0.00%     398.552ms      68.716us           0 b           0 b     -45.57 Gb    -251.17 Gb          5800
                                             aten::add_         0.30%      31.703ms         6.01%     635.030ms      54.744us     391.897ms         2.16%     391.897ms      33.784us           0 b           0 b      -5.71 Gb      -5.71 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     344.972ms         1.90%     344.972ms     265.363us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.33%      34.415ms        34.75%        3.674s     301.437us     342.661ms         1.89%     342.661ms      28.115us      80.00 Kb      80.00 Kb    -240.00 Mb    -240.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 10.574s
Self CUDA time total: 18.129s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118361
Approved by: https://github.com/janeyx99
2024-02-01 17:04:10 +00:00
b73a2b7795 [ait] inspect get_attr nodes for _decline_if_input_dtype (#118760)
Summary:
previously get_attr nodes were skipped, but for example:

%mul_240 : [num_users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.mul](args = (), kwargs = {input: %_fx_const_folded_attrs_13, other: %add_143})

where %_fx_const_folded_attrs_13 is int64 but %add_143 is float, which causes issues if the node is skipped, e.g. "unsupported dtype='int64' for alignments"

Differential Revision: D53273467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118760
Approved by: https://github.com/khabinov
2024-02-01 15:56:15 +00:00
ff9ce94489 Create empty host tensor for privateuseone (#118854)
For the H2D copy of local_used_map_ on the privateuseone device, reuse the CUDA logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118854
Approved by: https://github.com/ezyang
2024-02-01 15:32:55 +00:00
d790c1dca6 [CUDA][cuDNN][TF32] Misc TF32 updates (#118781)
Twiddle some thresholds that don't seem to play nice with sm90.

CC @tinglvv @nWEIdia @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118781
Approved by: https://github.com/ezyang
2024-02-01 15:32:50 +00:00
687946eea1 [FSDP2] Added reduce-scatter (#117975)
This PR adds the FSDP reduce-scatter (the copy-in/reduce-scatter collective/view-out).
- We use gradient pre- and post-divide factors like existing FSDP (mainly for fp16 reduction); a minimal sketch follows this list.
- We use a separate CUDA stream for the reduce-scatter to conveniently handle additional kernels surrounding the collective as a separate 'thread of execution' (e.g. pre/post-divide and later the D2H gradient offload).
- ~~The implementation in this PR is more complicated to _try_ to reduce CPU overhead by using `torch.split` instead of a Python for-loop. The challenge comes from the fact that the autograd-computed unsharded gradients do not have padding. We prefer to not do an intermediate padding step and instead directly copying to the big reduce-scatter input.~~ For simplicity, I changed the implementation to include intermediate padding steps, as it can still achieve ~250 GB/s, and it avoids any `O(NP)` tensor materialization for world size `N` and `P` `nn.Parameter`s.
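
A minimal sketch of the pre/post-divide idea from the first bullet (function name and factor choices are illustrative assumptions, not the FSDP2 internals):

```python
import torch
import torch.distributed as dist

def reduce_scatter_grad(rs_input: torch.Tensor, rs_output: torch.Tensor, world_size: int, group=None):
    # Split the overall 1/world_size averaging factor into two smaller divisions so that
    # intermediate fp16 values stay in range during the reduction.
    predivide = float(world_size) ** 0.5
    postdivide = world_size / predivide
    rs_input.div_(predivide)
    dist.reduce_scatter_tensor(rs_output, rs_input, op=dist.ReduceOp.SUM, group=group)
    rs_output.div_(postdivide)
```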

<details>
<summary> Recall: Copy-in/All-Gather/Copy-Out Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>

<details>
<summary> Copy-in/Reduce-Scatter/View-Out Example </summary>

Suppose we have 2 gradients with shapes `(3, 3)` (denoted with `a`s when not-yet-reduced and `A`s after reduced) and `(2, 2)` (denoted with `b`s and `B`s similarly) and 2 ranks, where `E` represents empty:
```
Given from autograd:
(3, 3): aaaaaaaaa
(2, 2): bbbb

Unsharded gradients/reduce-scatter inputs (no padding!):
Rank 0: aaaaaaaaa, bbbb
Rank 1: aaaaaaaaa, bbbb

Each rank allocate group's reduce-scatter input:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: aaaaaabbaaaEEEbb
Rank 1: aaaaaabbaaaEEEbb

Each rank reduce-scatter:
Rank 0: AAAAAABBAAAEEEBB
Rank 1: AAAAAABBAAAEEEBB

Each rank view-out:
Rank 0: AAAAAA, BB
Rank 1: AAA, BB
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117975
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955, #117973
2024-02-01 15:21:37 +00:00
9c2b43cc50 [inductor] Handle special values correctly in ir.Scan codegen (#118788)
Special values (`NaN`/`+/-Inf`) are not handled correctly during codegen for `ir.Scan` nodes. This
is a fairly minor bugfix that has not come up so far, since the only two scan
ops with lowerings use "normal" values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118788
Approved by: https://github.com/peterbell10
2024-02-01 14:54:20 +00:00
221747507d Revert "[export] support non-persistent buffers (#118612) (#118722)"
This reverts commit a43c28368c184ba1bf964f4fb99bec300917e2f4.

Reverted https://github.com/pytorch/pytorch/pull/118722 on behalf of https://github.com/atalman due to broke linux-jammy-py3-clang12-executorch ([comment](https://github.com/pytorch/pytorch/pull/118722#issuecomment-1921484565))
2024-02-01 14:39:29 +00:00
4a5a3bcc89 Revert "fused adam(w): Reduce register usage (#117872)"
This reverts commit b8e71cf3022e701604ea1f0c381c0b9ccf8743be.

Reverted https://github.com/pytorch/pytorch/pull/117872 on behalf of https://github.com/janeyx99 due to This was not intended to be merged ([comment](https://github.com/pytorch/pytorch/pull/117872#issuecomment-1921425677))
2024-02-01 14:15:00 +00:00
a1dd367716 Fixed error in bicubic upsampling aa=false for uint8 input (#118389)
Description:
- Fixed error in bicubic upsampling aa=false for uint8 input. This is seen in the test suite:
```diff
- self.assertLess(diff.max(), 15)
+ self.assertLess(diff.max(), 5)
```
While reducing the input range, we do not fully remove the clipping effect, which is why the threshold is 5 and not around 1.

- Renamed methods
- The error is mostly visible for upsampling (smaller -> larger) mode on the boundary values

More details on the bug:
For uint8 input and antialiasing=False we use a separable algorithm (using temp buffers and interpolating dimensions one by one) where interpolation weights and input indices are computed and stored using index ranges: `index_min` and `index_size`; weights outside of `index_size` are zero. For example, an output point can have index_min=10, index_size=4 and 4 non-zero weights, so the output value is computed as
```
out_value = sum([src[i + index_min] * w for i, w in zip(range(4), weights) ])
```
When computing index ranges and weights for output points near the boundaries, we clamp `index_min` between 0 and input_size, so `index_size` becomes smaller than 4. This approach is OK for antialiasing=True but is not correct for antialiasing=False, where the weights end up computed incorrectly:
```
-- output index i= 0
regular float32 approach:
source indices: [-2, -1, 0, 1] -> outbounded values are clamped to boundaries -> [0, 0, 0, 1]
interp weights: [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]

separable uint8 approach:
source indices coming from index ranges (min, size): [0, 1]
incorrect interp weights computed with current implementation : [1.1764705882352944, -0.17647058823529432, 0.0, 0.0]
fixed interp weights in the PR: [1.108, -0.1080000000000001, 0.0, 0.0]
Note: the weight corresponding to source index 0 is 1.108 = -0.07200000000000006 + 0.4600000000000001 + 0.72, and the weight corresponding to source index 1, -0.1080000000000001, is the same as in the f32 approach.
```
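
A standalone toy check of the weight folding described above (pure Python illustration, not the C++ implementation): clamping out-of-bound source indices to the boundary means their float32 weights should be accumulated onto the boundary index.

```python
f32_indices = [-2, -1, 0, 1]
f32_weights = [-0.07200000000000006, 0.4600000000000001, 0.72, -0.1080000000000001]

folded = {}
for i, w in zip(f32_indices, f32_weights):
    # indices below 0 are clamped to the boundary index 0, so their weights fold onto it
    folded[max(i, 0)] = folded.get(max(i, 0), 0.0) + w

print(folded)  # {0: ~1.108, 1: -0.1080000000000001}, matching the fixed weights above
```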

Quick benchmark to ensure perfs no regression:

```
[------------------------------------------------------------------------------------ Resize ------------------------------------------------------------------------------------]
                                                                               |  torch (2.3.0a0+gitfda85a6) PR  |  torch (2.3.0a0+git0d1e705) Nightly  |  Speed-up: PR vs Nightly
1 threads: -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_first bilinear (400, 400) -> (224, 224) aa=False  |        440.996 (+-2.044)        |          470.824 (+-5.927)           |      1.068 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (224, 224) aa=False   |        463.565 (+-1.519)        |          497.231 (+-10.825)          |      1.073 (+-0.000)
      3 torch.uint8 channels_first bilinear (400, 400) -> (700, 700) aa=False  |       1717.000 (+-28.589)       |         1915.570 (+-43.397)          |      1.116 (+-0.000)
      3 torch.uint8 channels_first bicubic (400, 400) -> (700, 700) aa=False   |       1801.954 (+-22.391)       |         1981.501 (+-37.034)          |      1.100 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (224, 224) aa=False   |        199.599 (+-0.851)        |          196.535 (+-3.788)           |      0.985 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (224, 224) aa=False    |        243.126 (+-0.681)        |          240.695 (+-2.306)           |      0.990 (+-0.000)
      3 torch.uint8 channels_last bilinear (400, 400) -> (700, 700) aa=False   |        686.270 (+-2.870)        |          687.769 (+-17.863)          |      1.002 (+-0.000)
      3 torch.uint8 channels_last bicubic (400, 400) -> (700, 700) aa=False    |        899.509 (+-5.377)        |          899.063 (+-9.001)           |      1.000 (+-0.000)

Times are in microseconds (us).
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118389
Approved by: https://github.com/NicolasHug
ghstack dependencies: #118388
2024-02-01 14:14:32 +00:00
cyy
8b140da804 Use MKL_INT in MKL wrapper interfaces (#118734)
I encountered the following error when building PyTorch with MKL on Windows:

```
pytorch\aten\src\ATen\native\mkl\LinearAlgebra.cpp(74): error C2664: “void cblas_sgemm_batch(const CBLAS_LAYOUT,const CBLAS_TRANSPOSE *,const CBLAS_TRANSPOSE *,const __int64 *,const __int64 *,const __int64 *,const float *,const float **,const __int64 *,const float **,const __int64 *,const float *,float **,const __int64 *,const __int64,const __int64 *) noexcept”: cannot convert argument 4 from "const int *" to "const __int64 *"
pytorch\aten\src\ATen\native\mkl\LinearAlgebra.cpp(74): note: the pointed-to types are unrelated; the conversion requires a reinterpret_cast, C-style cast or parenthesized function-style cast
C:\Program Files (x86)\Intel\oneAPI\2024.0\include\mkl_cblas.h(550): note: see declaration of 'cblas_sgemm_batch'
```
This was because MKL_INT was defined as int64_t on this build while the wrapper interfaces used int. This PR switches the wrapper interfaces to MKL_INT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118734
Approved by: https://github.com/ezyang
2024-02-01 13:32:28 +00:00
a205e7bf56 [3/4] Intel GPU Runtime Upstreaming for Device (#116850)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in the Python frontend (a brief usage sketch follows the list), including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
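
A brief usage sketch of a few of the public APIs above (assuming a build with XPU support compiled in; otherwise `is_available()` returns False):

```python
import torch

if torch.xpu.is_available():
    print(torch.xpu.device_count())
    print(torch.xpu.get_device_name(0))
    torch.xpu.set_device(0)
    print(torch.xpu.current_device())
```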

# Additional Context
We will implement support for lazy initialization in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-02-01 12:31:26 +00:00
eaa45f47f8 [sigmoid] fix for torchbind serialization (#118791)
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.

This serializes in a different format than TorchScript does, which uses the TS C++ pickler.

The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.

Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```

Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.

Reviewed By: SherlockNoMad

Differential Revision: D53248454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
2024-02-01 10:09:07 +00:00
0dc15ff674 [reland][export] Fix graph signature for primitive outputs (#118818)
Summary: Reland of D53233649/https://github.com/pytorch/pytorch/pull/118655. Previously I didn't realize there was a use-case of a torchbind object as an input to the graph, so I didn't mark `CustomObjArgument` as a valid input, which broke [this test](a43c28368c/test/export/test_torchbind.py (L81)). Somehow the initial CI did not catch it, but hud was sad so that PR was reverted. So now I added `CustomObjArgument` as valid input [here](https://github.com/pytorch/pytorch/pull/118818/files#diff-92420f977c3a02b2deadf6752ce4a9ee601c20612a1a13cc365252eb09410edbR298).

Test Plan: CI

Reviewed By: tarun292

Differential Revision: D53288445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118818
Approved by: https://github.com/ydwu4
2024-02-01 09:59:05 +00:00
b8e71cf302 fused adam(w): Reduce register usage (#117872)
As per title, reducing register usage for better occupancy.

Changes are:
- use 32bit indexing if possible
- convert some arguments of fused adam(w) functor to its template parameters
- give `const` to some arguments

Tables below are before/after of adamw for sm90 with / without amsgrad enabled.

### without amsgrad
| dtype | main | this PR |
|-------|------|---------|
| bf16  | 79   | 64      |
| fp16  | 82   | 64      |
| fp32  | 126  | 64      |
| fp64  | 128  | 109     |

### with amsgrad
| dtype | main | this PR |
|-------|------|---------|
| bf16  | 124  | 74      |
| fp16  | 124  | 74      |
| fp32  | 123  | 76      |
| fp64  | 128  | 121     |

---

`AdamW(..., fused=True)` with llama-2 bf16 on H100 improved to a CUDA avg time of 49.935ms from 126.648ms according to torch profiler.

This PR:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        5.878s        46.47%        5.878s     293.918ms           0 b           0 b           0 b           0 b            20
                                               aten::mm         2.57%     224.777ms         8.50%     741.993ms      54.962us        2.591s        20.48%        2.591s     191.910us           0 b           0 b     441.39 Gb     441.39 Gb         13500
                                          ProfilerStep*        31.64%        2.763s        67.67%        5.910s     295.485ms       0.000us         0.00%        2.551s     127.547ms      48.00 Kb      -1.44 Mb           0 b    -506.38 Gb            20
       autograd::engine::evaluate_function: MmBackward0         0.13%      11.349ms         7.90%     690.160ms     153.369us       0.000us         0.00%        1.726s     383.544us           0 b           0 b     198.65 Gb    -137.53 Gb          4500
                                            MmBackward0         0.45%      38.959ms         7.68%     670.399ms     148.978us       0.000us         0.00%        1.693s     376.326us           0 b           0 b     332.81 Gb       2.26 Gb          4500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        1.012s         8.00%        1.012s      50.617ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     846.000us         3.39%     296.240ms      14.812ms       0.000us         0.00%     998.876ms      49.944ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.26%      23.113ms         3.38%     295.394ms      14.770ms       0.000us         0.00%     998.876ms      49.944ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.13%      11.000ms         3.08%     268.545ms      13.427ms     998.705ms         7.89%     998.705ms      49.935ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us     998.705ms         7.89%     998.705ms     155.078us           0 b           0 b           0 b           0 b          6440
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     872.287ms         6.90%     872.287ms     193.842us           0 b           0 b           0 b           0 b          4500
                                           aten::matmul         0.19%      16.721ms         1.82%     159.130ms      35.362us       0.000us         0.00%     864.840ms     192.187us           0 b           0 b     107.46 Gb           0 b          4500
                                           aten::linear         0.28%      24.641ms         2.09%     182.129ms      40.473us       0.000us         0.00%     765.554ms     170.123us           0 b           0 b     107.46 Gb      12.46 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     690.729ms         5.46%     690.729ms     178.945us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.36%     118.465ms         4.89%     427.071ms      21.225us     549.580ms         4.34%     549.697ms      27.320us     224.03 Kb     223.96 Kb     413.51 Gb     413.36 Gb         20121
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     484.455ms         3.83%     484.455ms     151.392us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.27%      23.176ms         4.63%     404.534ms      69.747us       0.000us         0.00%     406.155ms      70.027us           0 b           0 b     -46.01 Gb    -257.12 Gb          5800
                                             aten::add_         0.39%      34.186ms         7.22%     630.849ms      54.384us     394.402ms         3.12%     394.402ms      34.000us           0 b           0 b      -6.68 Gb      -6.68 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     366.653ms         2.90%     366.653ms     282.041us           0 b           0 b           0 b           0 b          1300
                                            aten::copy_         0.41%      35.934ms        20.61%        1.800s     147.691us     341.572ms         2.70%     341.572ms      28.025us      48.00 Kb      48.00 Kb     -40.00 Mb     -40.00 Mb         12188
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 8.733s
Self CUDA time total: 12.651s

AdamW.step <FunctionEventAvg key=AdamW.step self_cpu_time=846.000us cpu_time=14.812ms  self_cuda_time=0.000us cuda_time=49.944ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=23.113ms cpu_time=14.770ms  self_cuda_time=0.000us cuda_time=49.944ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=1.012s cuda_time=50.617ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>

```

Main
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.00%       0.000us         0.00%       0.000us       0.000us        7.354s        42.89%        7.354s     367.698ms           0 b           0 b           0 b           0 b            20
                                          ProfilerStep*        28.22%        2.875s        72.48%        7.384s     369.184ms       0.000us         0.00%        4.067s     203.325ms      48.00 Kb      -1.48 Mb           0 b    -508.04 Gb            20
                                               aten::mm         2.24%     228.499ms         7.13%     726.223ms      53.794us        2.563s        14.95%        2.563s     189.873us           0 b           0 b     441.39 Gb     441.39 Gb         13500
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us        2.546s        14.85%        2.546s     127.304ms           0 b           0 b           0 b           0 b            20
                                             AdamW.step         0.01%     821.000us         2.87%     292.871ms      14.644ms       0.000us         0.00%        2.533s     126.654ms           0 b           0 b           0 b           0 b            20
                              Optimizer.step#AdamW.step         0.22%      22.801ms         2.87%     292.050ms      14.602ms       0.000us         0.00%        2.533s     126.654ms           0 b           0 b           0 b           0 b            20
                                    aten::_fused_adamw_         0.11%      11.332ms         2.61%     265.853ms      13.293ms        2.533s        14.77%        2.533s     126.648ms           0 b           0 b           0 b           0 b            20
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us        2.533s        14.77%        2.533s     393.315us           0 b           0 b           0 b           0 b          6440
       autograd::engine::evaluate_function: MmBackward0         0.13%      13.342ms         6.73%     685.250ms     152.278us       0.000us         0.00%        1.706s     379.209us           0 b           0 b     198.65 Gb    -138.02 Gb          4500
                                            MmBackward0         0.38%      38.974ms         6.52%     664.652ms     147.700us       0.000us         0.00%        1.675s     372.113us           0 b           0 b     333.59 Gb       2.75 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nt_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     859.515ms         5.01%     859.515ms     191.003us           0 b           0 b           0 b           0 b          4500
                                           aten::matmul         0.16%      16.431ms         1.49%     152.052ms      33.789us       0.000us         0.00%     856.839ms     190.409us           0 b           0 b     107.46 Gb           0 b          4500
                                           aten::linear         0.23%      23.703ms         1.72%     174.862ms      38.858us       0.000us         0.00%     758.995ms     168.666us           0 b           0 b     107.46 Gb      12.21 Gb          4500
sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     682.302ms         3.98%     682.302ms     176.762us           0 b           0 b           0 b           0 b          3860
                                              aten::mul         1.16%     117.854ms         4.12%     420.100ms      20.892us     544.045ms         3.17%     544.157ms      27.062us     240.38 Kb     240.34 Kb     419.45 Gb     419.29 Gb         20108
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize128...         0.00%       0.000us         0.00%       0.000us       0.000us     479.767ms         2.80%     479.767ms     149.927us           0 b           0 b           0 b           0 b          3200
      autograd::engine::evaluate_function: MulBackward0         0.27%      27.303ms         3.95%     402.627ms      69.418us       0.000us         0.00%     403.020ms      69.486us           0 b           0 b     -45.56 Gb    -257.26 Gb          5800
                                             aten::add_         0.32%      32.543ms         6.08%     619.248ms      53.383us     393.242ms         2.29%     393.242ms      33.900us           0 b           0 b      -6.21 Gb      -6.21 Gb         11600
sm90_xmma_gemm_bf16bf16_bf16f32_f32_nn_n_tilesize256...         0.00%       0.000us         0.00%       0.000us       0.000us     363.245ms         2.12%     363.245ms     279.419us           0 b           0 b           0 b           0 b          1300
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     338.460ms         1.97%     338.460ms      29.228us           0 b           0 b           0 b           0 b         11580
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 10.187s
Self CUDA time total: 17.145s

AdamW.step <FunctionEventAvg key=AdamW.step self_cpu_time=821.000us cpu_time=14.644ms  self_cuda_time=0.000us cuda_time=126.654ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=22.801ms cpu_time=14.602ms  self_cuda_time=0.000us cuda_time=126.654ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
Optimizer.step#AdamW.step <FunctionEventAvg key=Optimizer.step#AdamW.step self_cpu_time=0.000us cpu_time=0.000us  self_cuda_time=2.546s cuda_time=127.304ms input_shapes= cpu_memory_usage=0 cuda_memory_usage=0>
```

Script I used: https://gist.github.com/crcrpar/ca951d4e7f3e1c771d502135b798f0d1

<!--

## adamw

### This PR

```console
$ cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_impl.sm_70.cubin
Extracting ELF file    2: fused_adamw_impl.sm_80.cubin
Extracting ELF file    3: fused_adamw_impl.sm_90.cubin
$ cuobjdump -res-usage fused_adamw_impl.sm_90.cubin | cu++filt

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:8 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:64 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)4, (at::native::ADAM_MODE)1, (bool)0>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:109 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3952 TEXTURE:0 SURFACE:0 SAMPLER:0
```

### Main

```console
$ cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_impl.cu.1.sm_70.cubin
Extracting ELF file    2: fused_adamw_impl.cu.2.sm_80.cubin
Extracting ELF file    3: fused_adamw_impl.cu.3.sm_90.cubin
$ cuobjdump -res-usage fused_adamw_impl.cu.3.sm_90.cubin | cu++filt

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:79 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:82 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:126 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)4>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)4>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:128 STACK:40 SHARED:0 LOCAL:0 CONSTANT[0]:3945 TEXTURE:0 SURFACE:0 SAMPLER:0
```

## adamw & amsgrad
### This PR
```console
root@1a5180b041f7:/opt/pytorch/pytorch# cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_amsgrad_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_amsgrad_impl.sm_70.cubin
Extracting ELF file    2: fused_adamw_amsgrad_impl.sm_80.cubin
Extracting ELF file    3: fused_adamw_amsgrad_impl.sm_90.cubin
root@1a5180b041f7:/opt/pytorch/pytorch# cuobjdump -res-usage fused_adamw_amsgrad_impl.sm_90.cubin

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:74 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:74 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:76 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<long, at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)5, (at::native::ADAM_MODE)1, (bool)1>, const float *, double, double, double, double, double, bool, const float *, const float *>(T1, T2, T3, T4...):
  REG:121 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3904 TEXTURE:0 SURFACE:0 SAMPLER:0
```

### Main
```console
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/fused_adamw_amsgrad_impl.cu.o -xelf all
Extracting ELF file    1: fused_adamw_amsgrad_impl.cu.1.sm_70.cubin
Extracting ELF file    2: fused_adamw_amsgrad_impl.cu.2.sm_80.cubin
Extracting ELF file    3: fused_adamw_amsgrad_impl.cu.3.sm_90.cubin
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump -res-usave fused_adamw_amsgrad_impl.cu.3.sm_90.cubin
cuobjdump fatal   : Unknown option 'res-usave'
root@7c40321796bc:/opt/pytorch/pytorch# cuobjdump -res-usage fused_adamw_amsgrad_impl.cu.3.sm_90.cubin

Resource usage:
 Common:
  GLOBAL:3
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::BFloat16, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:124 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<c10::Half, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:124 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<float, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:123 STACK:0 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorListMetadata<(int)5>, at::native::<unnamed>::FusedAdamMathFunctor<double, (int)5>, float *, double, double, double, double, double, bool, bool, float *, float *, at::native::ADAM_MODE>(T1, T2, T3...):
  REG:128 STACK:40 SHARED:0 LOCAL:0 CONSTANT[0]:3897 TEXTURE:0 SURFACE:0 SAMPLER:0
```

-->
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117872
Approved by: https://github.com/janeyx99
2024-02-01 09:34:50 +00:00
eba4bd6b86 Updated test_upsamplingBiMode2d_consistency (#118388)
Description:
- Lowered error thresholds and added an input range for bicubic to expose the inconsistency in the implementation of bicubic aa=false upsampling (smaller -> larger) for uint8 input dtype
- Updated outdated comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118388
Approved by: https://github.com/NicolasHug
2024-02-01 09:22:23 +00:00
7e0ea0d5df [export] Only deepcopy graph in unlift (#118821)
Summary: We only need to deepcopy the graph, because we're modifying it by unlifting its parameter/buffer inputs; we don't need to deepcopy the graph module's state/contents. Deepcopying the whole graph module causes an error when it contains an ExecuTorch LoweredModule which stores tensors.

Test Plan: Fixes the following diff

Differential Revision: D53290077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118821
Approved by: https://github.com/tugsbayasgalan
2024-02-01 09:00:22 +00:00
4fc4f5eb06 [Dynamo] Support tensor is not tensor (#118840)
Fixes Meta internal use case.
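
A minimal illustration of the now-supported pattern (the internal model is not public, so this is an assumed shape of the use case):

```python
import torch

@torch.compile
def f(x, y):
    if x is not y:        # identity comparison between tensors inside compiled code
        return x + 1
    return x * 2

a = torch.ones(2)
print(f(a, a))                # aliasing inputs take the `else` branch
print(f(a, torch.ones(2)))    # distinct tensors take the `is not` branch
```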

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118840
Approved by: https://github.com/yf225
2024-02-01 07:32:43 +00:00
a1280f0cc6 Add an OpInfo test for split_with_sizes_copy (#118512)
Adding an `OpInfo` test for `split_with_sizes_copy` so we can use it to test [CUDA fast path for split_with_sizes_copy.out](https://github.com/pytorch/pytorch/pull/117203). Since the `OpInfo` test doesn't exist yet and introducing it requires modifications to the `CompositeExplicitAutograd` impl, we are adding the `OpInfo` test in a separate PR to establish a healthy baseline.
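
For reference, a small standalone usage sketch of the op under test (not part of the PR itself):

```python
import torch

x = torch.arange(10.)
outs = torch.split_with_sizes_copy(x, [3, 3, 4], dim=0)
print([o.tolist() for o in outs])  # [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
```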

Changes made:
- Registered a batching rule for `split_with_sizes_copy`.
- Registered a decomposition for `split_with_sizes_copy`.
- Registered a DTensor prop rule for `split_with_sizes_copy`.
- Added required dtype and device checks to the composite impl.
- Added output resize to the composite impl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118512
Approved by: https://github.com/albanD
2024-02-01 07:09:27 +00:00
2b48891e62 [AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765)
Summary:
Add Runtime Constant-folding for AOTInductor.
This also includes the invocation of constant folding at load time.

The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules; one of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered and codegen-ed as usual and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (the part that depends on the user input). For the main module, compared with a normal lowering, we take in one additional component: the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module.
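
A conceptual toy sketch of the split (illustrative only, not the AOTInductor pass): the constant module is evaluated once, e.g. at load time, and the main module consumes its result on every call.

```python
import torch

class ConstantModule(torch.nn.Module):        # input-independent subgraph
    def __init__(self, w: torch.Tensor):
        super().__init__()
        self.w = torch.nn.Parameter(w)
    def forward(self):
        return self.w @ self.w.t() + 1.0

class MainModule(torch.nn.Module):            # depends on user input + folded constant
    def forward(self, x, folded_const):
        return x @ folded_const

const_mod, main_mod = ConstantModule(torch.randn(4, 4)), MainModule()
folded = const_mod()                          # constant folding, done once
for x in (torch.randn(2, 4), torch.randn(3, 4)):
    print(main_mod(x, folded).shape)          # the folded result is reused across calls
```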

Test Plan: Unit tests included in commit.

Differential Revision: D53274382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
2024-02-01 04:54:25 +00:00
b97ab47619 [pytorch][ao] Update PerChannelMinMaxObserver default _load_from_state_dict (#118659)
Summary:
When `version` is missing in the metadata, use `min_val/max_val` as keys instead of `max_vals/min_vals`

## Reasons
1. It's been almost 2 years since this change D30003700, which means now most checkpoints are using the `max_val/min_val` keys

2. Most checkpoint dumps produced via `model.state_dict()` don't have version info, which leads to a spurious `missing keys` error when loading the state_dict
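
A small sketch of the situation in reason 2 (a standalone illustration; the exact set of extra keys is an assumption): current checkpoints store `min_val`/`max_val`, so the fallback should expect those keys when no version is recorded.

```python
import torch
from torch.ao.quantization.observer import PerChannelMinMaxObserver

obs = PerChannelMinMaxObserver()
obs(torch.randn(4, 8))                 # observe a batch so min/max get populated
print(sorted(obs.state_dict().keys())) # includes 'min_val' and 'max_val' (plus e.g. 'eps')
```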

Test Plan: CI

Differential Revision: D53233012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118659
Approved by: https://github.com/jerryzh168
2024-02-01 04:39:31 +00:00
526701cfb7 [executorch hash update] update the pinned executorch hash (#118698)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118698
Approved by: https://github.com/pytorchbot
2024-02-01 03:39:50 +00:00
45d2dff844 [easy] Enable test_neg_view for 5D SampleInput for torch.nn.functional.linear (#118815)
Fixes #117854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118815
Approved by: https://github.com/malfet
2024-02-01 03:26:45 +00:00
adff335095 [vision hash update] update the pinned vision hash (#118825)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118825
Approved by: https://github.com/pytorchbot
2024-02-01 03:14:16 +00:00
9b28621369 [FSDP2] Added forward unshard/wait for unshard/reshard (#117973)
This PR adds the all-gather and free logic required for forward.
- We define the logical all-gather as two ops: (1) unshard and (2) wait for unshard. This abstraction allows capturing both implicit forward prefetching (using multiple streams and `async_op=False`) and explicit forward prefetching (using `async_op=True`).
- Symmetrically, we define the reshard op to free the unsharded parameters.

Some other notes:
- The `FSDPParamGroup` and its `FSDPParam`s transition their sharded states together. This invariant allows us to reason about the parameters by group rather than individually with respect to whether they are sharded or unsharded.

---

### How Does the Overlap Work for All-Gather?

For context, the all-gather consists of three steps: (1) copy-in, (2) all-gather collective, and (3) copy-out.

<details>
<summary> Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>

`dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream before running the collective. `async_op=False` means that the function waits on the work, having the current stream wait for the NCCL stream before returning. `async_op=True` means it returns the `Work` object, which the user can wait on later.

#### Implicit Prefetching
Implicit prefetching achieves communication/computation overlap without changing the CPU issue order:
- We use separate streams for copy-in and for issuing the `dist.all_gather_into_tensor()`. The copy-in stream allows us to overlap the copy-in with all-gather/reduce-scatter in backward, and the all-gather stream allows us to overlap the all-gather with forward compute (issued before it).
     - Because `dist.all_gather_into_tensor()` always has the PG's NCCL stream wait for the current stream, we need this "dummy" all-gather stream to prevent the all-gather from waiting on the forward compute with which it should overlap.
     - Without the separate copy-in stream, we cannot overlap all-gather copy-in with all-gather in forward.
- We copy-out in the default stream after having the default stream wait for the all-gather. This means that the autograd leaves are allocated in the default stream and autograd will not call `recordStream`.

Implicit prefetching does not require knowing the execution order ahead of time. However, when overlapping the next all-gather with the current compute, there may be a gap from the CPU thread issuing the current compute. If the CPU thread can run ahead, then this is not an issue.
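
A rough stream-level sketch of the implicit path (stream and function names are illustrative assumptions, not the FSDP2 internals):

```python
import torch
import torch.distributed as dist

copy_in_stream = torch.cuda.Stream()
all_gather_stream = torch.cuda.Stream()

def unshard(ag_input: torch.Tensor, ag_output: torch.Tensor, group) -> torch.cuda.Stream:
    with torch.cuda.stream(copy_in_stream):
        staged = ag_input.clone()                      # stand-in for the real copy-in
    all_gather_stream.wait_stream(copy_in_stream)
    with torch.cuda.stream(all_gather_stream):
        # The NCCL stream waits on all_gather_stream here, not on the default compute
        # stream, so previously issued forward compute can overlap with the collective.
        dist.all_gather_into_tensor(ag_output, staged, group=group)
    return all_gather_stream

def wait_for_unshard(stream: torch.cuda.Stream) -> None:
    torch.cuda.current_stream().wait_stream(stream)    # then copy-out in the default stream
```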

#### Explicit Prefetching
Explicit prefetching achieves communication/computation by changing the CPU issue order, namely by reordering the all-gather to be before the compute with which it should overlap.
- Because we reorder, we do not need any separate streams, and we can use `async_op=True` for overlap.
- We can expose this explicit prefetching as a module-level `unshard()` op (e.g. `module.unshard(async_op: bool)`, and we can use it as a primitive for implementing the explicit forward prefetching in existing FSDP.

Explicit prefetching requires knowing the execution order.

---

Disclaimer: The testing is relatively lighter in this PR. I did not want to spend too much time writing new forward-only tests. The stream usage will be exercised thoroughly once we have backward too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117973
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
ghstack dependencies: #117950, #117955
2024-02-01 03:08:13 +00:00
8d6e34b21b Add verbose option to failures histogram (#118757)
Sample output: https://gist.github.com/jamesjwu/cc80d7da305add0a69c5e39aae09a077
Using directories from https://hud.pytorch.org/pr/118597:
eager_tests: [linux-focal-py3.11-clang10 / test (default, 1, 3, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/7716582714/job/21034340833)
dynamo_tests: [linux-focal-py3.11-clang10 / test (dynamo, 1, 3, linux.2xlarge)](https://github.com/pytorch/pytorch/actions/runs/7716582714/job/21034342747)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118757
Approved by: https://github.com/zou3519
2024-02-01 02:46:36 +00:00
499f31d40b [dynamo] use par_style = "xar" in minifier targets file (#118603)
For internal usage, par_style="xar" is needed in order for certain build
modes to work with triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118603
Approved by: https://github.com/williamwen42
2024-02-01 02:42:26 +00:00
a43c28368c [export] support non-persistent buffers (#118612) (#118722)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1769

Basic support for non-persistent buffers, which are buffers that do not show up in the state dict.

One weird twist is that most of our other systems (FX, aot_export, dynamo) have completely buggy handling of non-persistent buffers. I tried to go on a wild goose chase to fix them all, but it got to be too much. So I introduced some sad rewrite passes in `_export` to make the final state dict correctly align with the original module's state dict.

This exposed some bugs/ambiguous handling of parameters/buffers in existing test code. For example, `TestSaveLoad.test_save_buffer` traced over a module that was not in the root module hierarchy and caused some weird behavior. I think we should error explicitly on use cases like this: https://github.com/pytorch/pytorch/issues/118410. For now I just rewrote the tests or skipped them.

Test Plan: added a unit test

Differential Revision: D53253905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118722
Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi
2024-02-01 00:36:09 +00:00
4cba1dd0c3 [submodule] Update cudnn_frontend to v1.0.3 (#118782)
# Summary
Updates cudnn frontend to the tagged 1.0.3 version

submodule

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118782
Approved by: https://github.com/malfet
2024-02-01 00:35:03 +00:00
suo
2f79a7bf9e [export] make spec comparison indifferent to fx collections (#118718)
Treat immutable_dict as dict and immutable_list as list. Some executorch tests were tripped up by the previous behavior.

Differential Revision: [D53252679](https://our.internmc.facebook.com/intern/diff/D53252679/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118718
Approved by: https://github.com/zhxchen17
2024-02-01 00:10:49 +00:00
6c67f3333a [Inductor] Skip triton templates for mixedmm on SM70- (#118591)
As it results in numerical errors, see https://github.com/pytorch/pytorch/issues/117144

Fixes https://github.com/pytorch/pytorch/issues/117144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118591
Approved by: https://github.com/jansel
2024-01-31 23:30:45 +00:00
da4b4d961e Support printing storage while FakeTensorMode is enabled (#118780)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118780
Approved by: https://github.com/thiagocrepaldi, https://github.com/eellison
2024-01-31 23:10:47 +00:00
30f43e3d89 [ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710)
Prior to ONNX export, the model is deepcopied to avoid modifications that may affect later performance profiling. However, this increases the memory requirement on the device.
This PR modifies the script to deepcopy and export the model on another device when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118710
Approved by: https://github.com/thiagocrepaldi
2024-01-31 23:03:39 +00:00
21ce53b9c5 Add inf norm support for _foreach_norm (#118441)
Fixes #117803
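
Usage sketch of the new support (assuming `float('inf')` is accepted as the `ord` argument after this change):
```
import torch

tensors = [torch.randn(3, 3) for _ in range(4)]
inf_norms = torch._foreach_norm(tensors, float("inf"))  # one norm per tensor
expected = [t.abs().max() for t in tensors]              # inf-norm == max abs
torch.testing.assert_close(list(inf_norms), expected)
```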

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118441
Approved by: https://github.com/mlazos
2024-01-31 22:58:51 +00:00
e87ac82c98 Fix missing default dim param in weight norm interface decomp (#118762)
Fix for https://github.com/pytorch/pytorch/issues/118742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118762
Approved by: https://github.com/ezyang, https://github.com/shunting314
2024-01-31 22:10:10 +00:00
e426924c19 Change classification to beta for TORCH_LOGS (#118682)
Changes classification of TORCH_LOGS to beta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118682
Approved by: https://github.com/svekars
2024-01-31 21:50:55 +00:00
fb391a016d Test that optimizers are running cudagraphs (#118716)
Updates compiled optimizer tests to ensure that cudagraphs is running when on cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118716
Approved by: https://github.com/eellison
2024-01-31 21:34:23 +00:00
8dee7b7a16 Add TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED (#118750)
This allows us to request extended (including C++ backtrace) information
whenever a specific guard occurs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118750
Approved by: https://github.com/aakhundov
2024-01-31 21:16:27 +00:00
c978f38bd4 Some minor type stub improvements (#118529)
I was just playing around with improving the typing of symbolic_shapes. The PR is not "complete" but I in particular wanted to get feedback on whether or not people liked making ValueRanges Generic; it seems that distinguishing if you have an Expr ValueRange or a SympyBoolean ValueRange is a lot of trouble for downstream. Using TypeGuard, we can perform refinements on the generic parameter inside methods, although we still have to cast back to ValueRange[T] due to https://github.com/python/mypy/issues/14425#issuecomment-1914852707
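
A small illustration of the TypeGuard-based refinement described here, shown with a free-function guard for simplicity (the actual class in torch has more machinery; `typing.TypeGuard` needs Python 3.10+ or `typing_extensions`):
```
from typing import Generic, TypeGuard, TypeVar, Union

import sympy
from sympy.logic.boolalg import Boolean

T = TypeVar("T", bound=Union[sympy.Expr, Boolean])

class ValueRanges(Generic[T]):
    def __init__(self, lower: T, upper: T) -> None:
        self.lower, self.upper = lower, upper

def is_bool_range(vr: ValueRanges) -> TypeGuard[ValueRanges[Boolean]]:
    # After a call site checks `if is_bool_range(vr): ...`, the type checker
    # treats `vr` as ValueRanges[Boolean] inside that branch.
    return isinstance(vr.lower, Boolean)
```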

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118529
Approved by: https://github.com/Skylion007
2024-01-31 20:56:56 +00:00
5ced432a0d Revert "[export] Fix graph signature for primitive outputs (#118655)"
This reverts commit 680cc6b17ab3f318c0da6177646afe6700152327.

Reverted https://github.com/pytorch/pytorch/pull/118655 on behalf of https://github.com/atalman due to broke TestExportTorchbind.test_input test ([comment](https://github.com/pytorch/pytorch/pull/118655#issuecomment-1919940598))
2024-01-31 20:55:46 +00:00
a768a50a55 Re-enable test_nan_to_num (#118711)
Resolve TODO and re-enable as https://github.com/pytorch/pytorch/issues/82763 is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118711
Approved by: https://github.com/peterbell10
2024-01-31 20:01:10 +00:00
9391af9796 Merging heuristics (#118029)
Every day I move closer and closer to just using numbers

* number of heuristics that marked it as high, probable, low, none etc
* order of heuristics in the `__init__` file as well as how the heuristic ordered the tests
* put heuristics historical edited files and profiling as not trial mode
* briefly sanity checked that all shards of the larger test files (ex test_ops) exist and there are no dups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118029
Approved by: https://github.com/huydhn
2024-01-31 20:00:10 +00:00
3280fdb883 [FSDP2] Added _to_kwargs root forward input cast (#117955)
This PR adds a `_to_kwargs()` call on the FSDP root module's forward inputs to move them to `device` similar to DDP.
39df084001/torch/nn/parallel/distributed.py (L1426-L1427)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117955
Approved by: https://github.com/weifengpy
ghstack dependencies: #117950
2024-01-31 19:51:32 +00:00
d33f9dcefe [FSDP2] Added all-gather and unsharded parameter (#117950)
This PR adds the FSDP all-gather (the copy-in/all-gather collective and the copy-out) and the unsharded parameter concept to `FSDPParam`. This is to prepare for being able to run the forward pass.
- We implement all-gather as two functions: `foreach_all_gather` (copy-in/all-gather collective) and `foreach_all_gather_copy_out` (copy-out).
    - In the future, there will be two paths: `async_op=True` in the default stream for explicit prefetching and `async_op=False` in separate streams for implicit prefetching.
    - In the future, we will use `torch.split_with_sizes_copy` in the copy-out when it has the CUDA fast path.
    - We have the functions operate on `List[FSDPParam]` instead of passing the `torch.Tensor` and metadata mainly so that the `all_gather_input` can be computed under the `all_gather_copy_in_stream`. Since the two functions are specific to FSDP, I did not see motivation for avoiding this at the cost of entering/exiting the `all_gather_copy_in_stream` context twice (which incurs some CPU overhead).
- The `init_all_gather_output()` and `init_unsharded_parameter()` functions may seem unintuitive. The reason we initialize them once and write to them in-place thereafter is for autograd. See the note `[Note: FSDP and autograd]` in the code.
- We expand our 'FSDP tensors' definition to include the all-gather input and all-gather output in addition to the sharded and unsharded parameters. This distinction might seem unnecessary or pedantic, but it enables a language for describing pre- and post-all-gather transformations.
- We use the `_unsafe_preserve_version_counters` context when copying out because otherwise autograd will complain of a version mismatch in backward due to writing to the leaf tensors. (An alternative would be to use `.data`, but we are avoiding that 😄 .)

---

<details>
<summary> Copy-in/All-Gather/Copy-Out Example </summary>

Suppose we have 2 parameters with shapes `(3, 3)` (denoted with `A`s) and `(2, 2)` (denoted with `B`s) and 2 ranks, where `P` represents padding and `E` represents empty:
```
Given:
(3, 3): AAAAAAAAA
(2, 2): BBBB

Sharded parameters/all-gather inputs:
Rank 0: AAAAAA, BB
Rank 1: AAAPPP, BB

Each rank allocate group's all-gather output:
EEEEEEEEEEEEEEEE
Each rank copy-in:
Rank 0: AAAAAABBEEEEEEEE
Rank 1: EEEEEEEEAAAPPPBB

Each rank all-gather:
Rank 0: AAAAAABBAAAPPPBB
Rank 1: AAAAAABBAAAPPPBB

Each rank copy-out:
Rank 0: AAAAAAAAAPPP, BBBB
Rank 1: AAAAAAAAAPPP, BBBB
```
</details>
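
A single-process sketch that mirrors the example above (2 simulated ranks, shapes `(3, 3)` and `(2, 2)`); it is for illustration only and elides the real collective, padding metadata, and stream handling:
```
import torch

world_size = 2
params = [torch.arange(9.0).reshape(3, 3), torch.arange(4.0).reshape(2, 2)]

def all_gather_input(rank):
    # Flattened dim-0 shard of each parameter, padded to a uniform chunk size.
    pieces = []
    for p in params:
        chunk_rows = (p.size(0) + world_size - 1) // world_size
        shard = p[rank * chunk_rows : (rank + 1) * chunk_rows].reshape(-1)
        pad = chunk_rows * p[0].numel() - shard.numel()
        pieces.append(torch.cat([shard, shard.new_zeros(pad)]))
    return torch.cat(pieces)

# "All-gather": every rank ends up with the concatenation of all ranks' inputs.
all_gather_output = torch.cat([all_gather_input(r) for r in range(world_size)])

# Copy-out: split each rank's segment back into per-parameter (padded) chunks.
chunk_numels = [
    (p.size(0) + world_size - 1) // world_size * p[0].numel() for p in params
]
per_rank = all_gather_output.chunk(world_size)
per_param = [
    torch.cat([seg.split(chunk_numels)[i] for seg in per_rank])
    for i in range(len(params))
]
print([t.numel() for t in per_param])  # [12, 4]: AAAAAAAAAPPP and BBBB
```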

---

For context, we use the copy-in/all-gather/copy-out strategy instead of NCCL group coalescing for two reasons:
1. One large NCCL all-gather is still noticeably faster than several NCCL all-gathers using group coalescing of the same total bytes (even after NCCL 2.18.3). We prefer to trade off extra device-to-device copies (using GPU high-bandwidth memory) to save communication time, which does not improve as quickly from hardware generation to generation.
2. Copying out of the all-gather buffer tensor simplifies multi-stream memory handling because there is a constant number of such all-gather tensors alive at once. (The copy-out is done in the default/compute stream.) If we directly used the all-gather tensor memory for computation, then the number of such alive tensors is linear in the module depth and hence dependent on the particular model.

---

Disclaimer: This PR has some extraneous code, but I did not want to simplify too much since that code will be added back soon anyway (e.g. for overlapping, mixed precision, and ZeRO++). Hopefully it does not hinder code review too much.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117950
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-01-31 19:51:32 +00:00
483001e846 Revert "Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)"
This reverts commit f2682e75e6fd735c4a84afe59eafd541f7643f4a.

Reverted https://github.com/pytorch/pytorch/pull/118586 on behalf of https://github.com/atalman due to Broke slow tests ([comment](https://github.com/pytorch/pytorch/pull/118586#issuecomment-1919810802))
2024-01-31 19:44:29 +00:00
649f2e3000 Fix for out of bounds registers_ access in mobile TorchScript interpreter (#110300)
Summary:
The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds.

This change ensures that all registers_ accesses are in bounds or an exception will be thrown.

Test Plan: contbuild + OSS signals

Differential Revision: D49748737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300
Approved by: https://github.com/malfet, https://github.com/kimishpatel
2024-01-31 19:40:02 +00:00
8026534a2f Add torch.complex128 and torch.complex32 to DTYPE_TO_ATEN dictionary. (#117929)
Fixes #117370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117929
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-01-31 19:34:58 +00:00
82b6ee5a2a Fix build error in ppc64le (#118516)
...
from /home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/test/vec_test_all_types.cpp:1:
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h: In member function 'bool at::vec::DEFAULT::Vectorized::has_inf_nan() const':
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:244:36: error: no matching function for call to 'at::vec::DEFAULT::Vectorized::_isinf(float&) const'
  244 |   if(_isnan(_vec0[i]) || _isinf(_vec0[i])) {
      |                          ~~~~~~^~~~~~~~~~
/home/vinithav/pytorch-build/forks/myforks/jan23/pytorch/aten/src/ATen/cpu/vec/vec256/vsx/vec256_float_vsx.h:237:21: note: candidate: 'at::vec::DEFAULT::Vectorized at::vec::DEFAULT::Vectorized::_isinf() const'
...

Started breaking from 29516bd2a0.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118516
Approved by: https://github.com/ezyang
2024-01-31 19:33:57 +00:00
aca41a3a74 [optim] lbfgs: handle complex params as independent real params (#118184)
Ref: #86340

Fixes #118148

This fixes LBFGS for complex parameters. Complex parameters are handled as R^2.
I also added a test, unfortunately, due to the closure required, I could not use the existing `_test_complex_optimizer` used for all other optimizers.
LBFGS is special, as it will call the objective function multiple times internally, so I felt a one-off test for LBFGS was justifiable.
We will test if each step taken internally by the optimizer is the same for R^2 and complex parameters.
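
A minimal sketch of the "complex parameters as R^2" idea (assumed handling; the optimizer step then only ever sees real-valued views):
```
import torch

p = torch.randn(4, dtype=torch.complex64, requires_grad=True)

# view_as_real exposes a complex (n,) tensor as a real (n, 2) view, so each
# complex parameter behaves like two independent real parameters.
real_view = torch.view_as_real(p)
print(real_view.shape)  # torch.Size([4, 2])
print(real_view.dtype)  # torch.float32
```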

Let me know if the approach is ok, thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118184
Approved by: https://github.com/janeyx99
2024-01-31 19:24:16 +00:00
82b0341af3 s/verison/version/ (#118749)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118749
Approved by: https://github.com/malfet, https://github.com/albanD
2024-01-31 19:23:55 +00:00
41dfd0e063 Update Dynamo passrate/histogram scripts (#118752)
Changelog:
- Don't count running PYTORCH_TEST_WITH_DYNAMO=1 on dynamo/ tests in the pass
rate. This was a bug (we were counting all of these as failing, but in
reality, most of these pass). The net effect is that the passrate is (artificially)
6% higher.
- Have the histogram script filter out skips based on the passrate metric.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118752
Approved by: https://github.com/jamesjwu
2024-01-31 19:15:17 +00:00
99b69e1ffb add PrivateUse1 device support in function options_from_string. (#118627)
add PrivateUse1 device support in function options_from_string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118627
Approved by: https://github.com/soulitzer
2024-01-31 18:52:58 +00:00
7aff92c838 [torch] Expose dynamic_shapes api at multiple levels (#118695)
Summary: Exposes `dynamic_shapes` api at multiple levels so it's easier to replace the old API `dynamic_dim()` with the new API `Dim()`.

Test Plan: CI

Differential Revision: D53246409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118695
Approved by: https://github.com/ydwu4
2024-01-31 18:50:01 +00:00
6bd1807ae9 enable mkl_gemm_f16f16f32 in cpublas::gemm (#118367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118367
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-31 18:37:42 +00:00
81d12846dc Add decomp for pixel_shuffle/unshuffle (#118239)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118239
Approved by: https://github.com/peterbell10
2024-01-31 18:34:21 +00:00
81b55f58ce Matmul decide should_fold using has_out instead of grad_mode (#118617)
Fixes https://github.com/pytorch/pytorch/issues/118548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118617
Approved by: https://github.com/lezcano
2024-01-31 18:34:16 +00:00
a5a0fdcae9 Remove some unnecessary skipIfTorchDynamo (#118725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118725
Approved by: https://github.com/bdhirsh
2024-01-31 18:18:17 +00:00
680cc6b17a [export] Fix graph signature for primitive outputs (#118655)
Summary:
Now that we allow primitive outputs, we need to fix how the graph
signature outputs user_outputs

Test Plan: CI

Differential Revision: D53233649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118655
Approved by: https://github.com/tarun292
2024-01-31 18:00:02 +00:00
8455447972 Support builtin callable with object arguments in dynamo (#118678)
Fix issue #117556
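
A small repro-style sketch of what now works (assuming the fixed behavior; the object type and backend are illustrative):
```
import torch

class Widget:
    pass  # not callable: no __call__

@torch.compile(backend="eager", fullgraph=True)
def f(x, obj):
    # callable(obj) on an arbitrary object instance is now evaluated at
    # trace time instead of causing a graph break.
    return x + 1 if callable(obj) else x - 1

print(f(torch.ones(3), Widget()))  # tensor([0., 0., 0.])
```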

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118678
Approved by: https://github.com/anijain2305
2024-01-31 17:54:08 +00:00
68c3cb7594 s/fialure/failure/ (#118744)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118744
Approved by: https://github.com/peterbell10
2024-01-31 17:42:44 +00:00
suo
5586d7797e fix up batchnorm folding in pt2 quant (#118720)
Changes to how attributes are structured messed this pass up, fix it

Differential Revision: [D53253601](https://our.internmc.facebook.com/intern/diff/D53253601/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118720
Approved by: https://github.com/SherlockNoMad
2024-01-31 17:40:47 +00:00
4a677da36b Add more triton kernel mutation tracking tests (#118691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118691
Approved by: https://github.com/aakhundov
ghstack dependencies: #118676, #118595
2024-01-31 17:38:17 +00:00
b4f4fd0c28 Parse and handle functions in TTIR (#118595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118595
Approved by: https://github.com/aakhundov
ghstack dependencies: #118676
2024-01-31 17:38:17 +00:00
1bf9ddf130 add test_truth (#118597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118597
Approved by: https://github.com/anijain2305
2024-01-31 15:10:58 +00:00
1128cf96f0 [AOTI] Support _embedding_bag in C shim (#118706)
Summary: At some point I will stop manually adding ops to the C shim and use torchgen to generate that code instead. For the near term, I need to add a few more in order to switch the AOTInductor dashboard run.

Differential Revision: [D53249074](https://our.internmc.facebook.com/intern/diff/D53249074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118706
Approved by: https://github.com/frank-wei, https://github.com/aakhundov
ghstack dependencies: #118704, #118705
2024-01-31 15:02:40 +00:00
8db8ff652c [AOTI] Add aoti_torch_view_dtype in C shim (#118705)
Summary: Support ir.ComplexView in the ABI-compatible codegen

Differential Revision: [D53249039](https://our.internmc.facebook.com/intern/diff/D53249039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118705
Approved by: https://github.com/frank-wei
ghstack dependencies: #118704
2024-01-31 14:42:29 +00:00
dd52939438 [inductor] Refactor ir.ComplexView (#118704)
Summary: Make ir.ComplexView a subclass of ir.FallbackKernel, to unify the codegen logic

Differential Revision: [D53248972](https://our.internmc.facebook.com/intern/diff/D53248972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118704
Approved by: https://github.com/frank-wei
2024-01-31 14:42:29 +00:00
35f3ccffd4 [Cutlass 3.3.0 submodule upgrade] (#118629)
Cutlass 3.3 offers the following improvements:

- Adds support for mixed-precision GEMMs on Hopper and Ampere
- Adds support for < 16B aligned GEMMs on Hopper
- Enhancements to EVT
- Enhancements to the Python interface
- Enhancements to sub-byte type handling in CuTe
- Several other bug fixes and performance improvements; minor doc update
Test Plan:

CI ( ciflow/trunk, ciflow/inductor )
pytest test/inductor/test_max_autotune.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118629
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/khabinov
2024-01-31 13:53:58 +00:00
c3a3e61bcb Resolve TODO in test_slice_mutation2 (#118712)
As https://github.com/pytorch/pytorch/issues/94693 has been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118712
Approved by: https://github.com/peterbell10
2024-01-31 08:26:22 +00:00
9afd539075 [sigmoid] update serialization to include custom objs (#118684)
Summary: Update the serialization code to handle custom objs.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//sigmoid/frontend/test_gpu:serializer_test

Differential Revision: D53139356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118684
Approved by: https://github.com/angelayi, https://github.com/suo
2024-01-31 08:23:34 +00:00
56718cab8d Unskip test_complex_type_conversions (#118694)
Resolve TODO and unskip test_complex_type_conversions as real and imag have been implemented for complex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118694
Approved by: https://github.com/huydhn
2024-01-31 08:04:15 +00:00
73229b4f93 Add --filter-rank to torchrun: allow logs filtering by rank (#118562)
Addresses issue https://github.com/pytorch/pytorch/issues/117383

The implementation exposes `--filter-ranks`, which controls, by rank, which log files we pass to `TailLog` (used in torchrun to determine which logs to output to stdout/stderr).

## Behavior
### with --tee
Currently --tee is implemented as --redirect to a file, which is then streamed to the console using `tail`. When --tee is specified, file logs are unaffected and we only filter the output to the console.

### with --redirect
When --redirect is specified without --tee, nothing is logged to console, so we no-op.

### with neither
When neither --tee nor --redirect is specified, torchrun uses the empty string "" to indicate logging to the console. We intercept this empty string and redirect it to "/dev/null" so that nothing is printed to the console.

The api also allows a per-rank configuration for --tee and --redirect, and is also supported by this filter implementation.

## Usage
### without --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --filter_ranks=0 t.py
hello from rank 0 python
DEBUG: TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
-------------  ------  -----------------------  ---------  --------
placeholder    l_x_    L_x_                     ()         {}
call_function  mul     <built-in function mul>  (l_x_, 5)  {}
output         output  output                   ((mul,),)  {}
...
```
### with --tee
```
> TORCH_LOGS_FORMAT="%(levelname)s: %(message)s" TORCH_LOGS="graph" torchrun --standalone --nproc_per_node=2 --role rank --tee 3 --filter_ranks=0 t.py
[rank0]:hello from rank 0 python
[rank0]:DEBUG: TRACED GRAPH
[rank0]: __compiled_fn_0 <eval_with_key>.0 opcode         name    target                   args       kwargs
[rank0]:-------------  ------  -----------------------  ---------  --------
[rank0]:placeholder    l_x_    L_x_                     ()         {}
[rank0]:call_function  mul     <built-in function mul>  (l_x_, 5)  {}
[rank0]:output         output  output                   ((mul,),)  {}
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118562
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-01-31 07:40:01 +00:00
995f69623d Add Silu to Dtensor Pointwise ops (#118702)
# Summary
Adds silu to the supported list, needed for llama2 mlp support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118702
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2024-01-31 06:17:36 +00:00
74f4947caf Fix addmm over empty tensors and broadcastable input (#118619)
Fixes https://github.com/pytorch/pytorch/issues/118131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118619
Approved by: https://github.com/albanD
2024-01-31 05:40:25 +00:00
2d37a046e7 [export] Enforce serialization BC/FC with updater script. (#118424)
Summary:
This diff implements a mechanism for safely updating the torch.export serialization schema, aka schema.py, which is the API surface with the strongest compatibility guarantee.

The diff is consist of 3 changes:
- Added a script to "build" or "materialize" schema.py into a platform-neutral format (yaml), which serves as the committed form of the serialization schema.
- Added unittest to compare against schema.py and schema.yaml, so that it forces developers to execute the updater script when there is mismatch between two files.
- Added a checker inside the updater script, so that every compatible change results in a minor version bump and every incompatible change results in a major version bump.

torch.export's serialization BC/FC policy is (tentatively) documented here: https://docs.google.com/document/d/1EN7JrHbOPDhbpLDtiYG4_BPUs7PttpXlbZ27FuwKhxg/edit#heading=h.pup7ir8rqjhx , we will update the

As noted in the code doc, people should be able to run the following command to update schema properly from now on:

```
    python scripts/export/update_schema.py --prefix <path_to_torch_development_diretory>
or
    buck run caffe2:export_update_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/
```

Test Plan:
buck test mode/opt caffe2/test:test_export -- -r test_schema
buck run caffe2:update_export_schema -- --prefix /data/users/$USER/fbsource/fbcode/caffe2/

Differential Revision: D52971020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118424
Approved by: https://github.com/angelayi
2024-01-31 05:37:58 +00:00
697ca4f292 Preliminary DeviceMesh + native c10d functional integration (#118423)
### Summary
- Added `group_name` as the third field in `dim_group_infos`.
- `DeviceMeshTest` now runs both w/ and w/o `_USE_NATIVE_C10D_FUNCTIONAL=1` in CI.

### Other fixes
- Convert `reduceOp` to lower case before passing it into c10d_functional ops.
- Added a finalizer to handle unwaited collectives (this mirrors the treatment for Python functional collective ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118423
Approved by: https://github.com/wanchaol, https://github.com/LucasLLC, https://github.com/wconstab
2024-01-31 04:36:12 +00:00
e3cde68534 [FSDP2] Added initial _lazy_init and FQNs for debugging (#117881)
This PR adds the initial `_lazy_init`. Lazy initialization marks the point when the FSDP structure is finalized and is typically the beginning of the first forward. This would be after any meta-device initialization.
- Lazy initialization is distinct from construction time because when processing `fully_shard(module)`, there is no way to know whether a parent of `module` will have `fully_shard` applied as well. This is a consequence of `fully_shard` having to be applied bottom-up.
- At lazy initialization, we now have the concept of a _root_. The root FSDP module is the one whose `forward` runs first and ends last (and hence similarly for its backward). Having a single root simplifies handling logic that should only run "once per forward/backward/iteration". We may consider relaxing this in the future, but it will add more complexity to the design.
- Once we have a root, we can define _fully-qualified names_ (FQNs) for both parameters and modules. To aid debugging, we store `_param_fqn` and `_module_fqn` on `FSDPParam` and `FSDPParamGroup`, respectively. Note that we can have a unique `_module_fqn` for `FSDPParamGroup` since we currently assume a 1:1 relationship.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117881
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867, #117877
2024-01-31 03:38:53 +00:00
f7ae454003 [vision hash update] update the pinned vision hash (#118700)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118700
Approved by: https://github.com/pytorchbot
2024-01-31 03:10:52 +00:00
6d7cfb5c3f [audio hash update] update the pinned audio hash (#118699)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118699
Approved by: https://github.com/pytorchbot
2024-01-31 03:10:48 +00:00
0a7e2ce0e1 [PT-Vulkan] aten::conv1d - support any stride, padding, dilation (#118660)
Summary:
This diff stack builds on yipjustin's initial special-case implementation: D50914117.

That special-case only covers
```
strides = 1
padding = 0
dilation = 1
in_channels = out_channels = groups
n = 1
```

Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (a0b8b9b7f)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/conv1d.glsl
File changed: fbcode//caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/conv1d.glsl
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/ebb61796-c71d-4e0c-8148-de1eb67b5d4c
Network: Up: 10KiB  Down: 53MiB  (reSessionID-5f852cf6-9bf1-4c73-a471-4c121b53ed62)
Jobs completed: 16. Time elapsed: 21.6s.
Cache hits: 43%. Commands: 7 (cached: 3, remote: 0, local: 4)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (136 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (35 ms)
[----------] 2 tests from VulkanAPITest (172 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (172 ms total)
[  PASSED  ] 2 tests.
```

Reviewed By: yipjustin

Differential Revision: D53204673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118660
Approved by: https://github.com/yipjustin
2024-01-31 01:49:09 +00:00
suo
68a75d4539 [lint] remove merge_base_with from .lintrunner.toml (#118677)
This setting is problematic in fbcode, where the expected behavior is to match `arc lint`, which has a behavior much like running `lintrunner` without a `--merge-base-with` argument.

Let's try removing this. I also updated the CI message to encourage people to run with `-m origin/main`, which should hopefully cut down on confusion in the absence of defaulting to that behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118677
Approved by: https://github.com/PaliC
2024-01-31 00:53:58 +00:00
07a7feca74 [FSDP2] Sharded parameter in FSDPParam (#117877)
This PR adds logic to shard the managed parameters on dim-0. This is like `distribute_tensor()` with two differences:
1. `distribute_tensor()` today cannot accept a `DTensor` and reshard it to the parent mesh (https://github.com/pytorch/pytorch/issues/116101).
2. `DTensor` does not pad its local shard on any `Shard` dimensions (https://github.com/pytorch/pytorch/issues/113045).

As such, the `FSDPParam._init_sharded_param()` derives the global `DTensor` metadata itself and pads the local tensor on dim-0. The padding helps make the all-gather copy-in more efficient since the all-gather buffer will require padding.
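
A rough sketch of the dim-0 shard-and-pad step (illustrative helper only; the real code also constructs the DTensor and its global metadata):
```
import torch

def shard_dim0_with_padding(param: torch.Tensor, rank: int, world_size: int):
    dim0 = param.size(0)
    padded_dim0 = (dim0 + world_size - 1) // world_size * world_size
    chunk = padded_dim0 // world_size
    padded = param.new_zeros(padded_dim0, *param.shape[1:])
    padded[:dim0].copy_(param)
    # Every rank's local shard has the same shape, which keeps the
    # all-gather copy-in simple and avoids per-rank special cases.
    return padded[rank * chunk : (rank + 1) * chunk]

local = shard_dim0_with_padding(torch.randn(3, 3), rank=1, world_size=2)
print(local.shape)  # torch.Size([2, 3]); the last row is padding on rank 1
```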

---

Some details:
- We free the original parameter manually after constructing the sharded parameter. This lowers the peak memory during construction time slightly (since not _all_ parameters in the group must be sharded before the original parameters are freed) and is not strictly necessary.
- We bypass `nn.Module.__setattr__` because the checks are slow and unnecessary. The drawback is that we would ignore a user-defined override of `__setattr__`; however, since we have never encountered this in practice, I am okay with this. Notably, user calls to `setattr` would still use the override; FSDP only uses `setattr` as a mechanism for switching between sharded and unsharded parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117877
Approved by: https://github.com/wanchaol
ghstack dependencies: #118525, #117814, #117867
2024-01-31 00:44:19 +00:00
cyy
4a019047ad Enable nested namespace check in clang-tidy (#118506)
It is time to enable nested namespaces in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118506
Approved by: https://github.com/albanD
2024-01-31 00:32:35 +00:00
1b03423526 [meta registration] fix _efficient_attention_forward for jagged inputs (#118657)
Fixes the meta registration for the logsumexp output, whose shape should
be defined by the size of the offsets tensor when it exists.

644f64f2d1/aten/src/ATen/native/transformers/cuda/attention.cu (L1045)

Differential Revision: [D53234217](https://our.internmc.facebook.com/intern/diff/D53234217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118657
Approved by: https://github.com/YuqingJ
2024-01-31 00:11:39 +00:00
6fa162e681 Reland: [aotinductor] Replicate split_cat from torch IR to predispatch IR" (#118590)
Summary:
This is part of the pass migration effort. The final target is removing the acc tracer in AOTI.
In this diff, I did a few things:
1. copy and modify the `fx_passes/split_cat.py` passes based on predispatch IR.
2. verify the correctness by copying the `test_split_cat_fx_passes.py` and create a new file `test_split_cat_fx_passes_aten_fb.py` which is executed in AOTI and checked the counters
3. create a util function to execute the pass and compare the before/after graphs, giving users more information such as the pass's effect and the time spent. It will create logs like
```
[2024-01-25 20:26:48,997] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 0, save before/after graph to /tmp/tmpvlpwrklp, graph before/after are the same = False, time elapsed = 0:00:00.001585
[2024-01-25 20:26:49,000] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 1, save before/after graph to /tmp/tmpz_onjfeu, graph before/after are the same = False, time elapsed = 0:00:00.001873
[2024-01-25 20:26:49,002] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 2, save before/after graph to /tmp/tmpgkck8yko, graph before/after are the same = True, time elapsed = 0:00:00.000269
[2024-01-25 20:26:49,007] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 3, save before/after graph to /tmp/tmpquenq06y, graph before/after are the same = False, time elapsed = 0:00:00.003621
[2024-01-25 20:26:49,009] torch._inductor.utils: [INFO] [Pre grad(predispatch IR)]Apply split_cat, index: 4, save before/after graph to /tmp/tmpi8fia0dv, graph before/after are the same = True, time elapsed = 0:00:00.000190
```

Differential Revision: D53171027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118590
Approved by: https://github.com/kflu, https://github.com/khabinov, https://github.com/chenyang78
2024-01-31 00:09:46 +00:00
7761ceb6b3 Fix a bug with python lambda capture (#118676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118676
Approved by: https://github.com/jamesjwu, https://github.com/aakhundov
2024-01-30 23:59:07 +00:00
616e9dbed8 add torch.float64 precision support to the transformer test suite in TP/SP (#116436)
This PR (as a followup to #115530) resolves previous issues of not passing `assertEqual()` tests (with small error) when comparing outputs from the single-gpu model and the distributed model, under certain input/model sizes or when certain operations (e.g. weight-tying) are enabled. This is done by simply enabling higher precision computation using `dtype=torch.float64`.

What is not tested: whether or not distributed model training convergence rate is affected using just `torch.float32` precision.

Test plan:
TP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
TP+SP: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116436
Approved by: https://github.com/wanchaol
2024-01-30 23:50:29 +00:00
1f376b3b24 Fix lint after #117814 (#118689)
Forward fix after PR #117814 to make lint green again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118689
Approved by: https://github.com/awgu, https://github.com/huydhn
2024-01-30 23:46:27 +00:00
1e78dc95a4 Fix/Temporarily disable tests broken due to triton version mismatch (#118661)
Summary:
These tests were broken because internal Triton is 2.2 whereas external is 3.0.

Will update after internal version catches up.

Test Plan: CI

Differential Revision: D53231204

Co-authored-by: Oguz Ulgen <oulgen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118661
Approved by: https://github.com/oulgen
2024-01-30 23:06:35 +00:00
2f7839e6db register decomposition for rsub in torch._refs (#118288)
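The decomposition itself is the simple identity `rsub(a, b, alpha) = b - alpha * a`; a hedged sketch in the style of a `torch._refs` decomposition (not the exact registered source):
```
import torch

def rsub(a, b, *, alpha=1):
    # torch.rsub(input, other, alpha) computes other - alpha * input
    return b - alpha * a

x, y = torch.randn(3), torch.randn(3)
torch.testing.assert_close(rsub(x, y, alpha=2.0), torch.rsub(x, y, alpha=2.0))
```
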
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118288
Approved by: https://github.com/lezcano
ghstack dependencies: #118398
2024-01-30 22:18:15 +00:00
04ded1399d Fix signatures of torch.{add, sub, mul} (#118398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118398
Approved by: https://github.com/lezcano
2024-01-30 22:18:15 +00:00
6ea233a14c [FSDP2] Added initial FSDPParamGroup, FSDPParam, ParamModuleInfo (#117867)
This PR adds the initial `FSDPParamGroup` and `FSDPParam` classes, and it focuses on the `ParamModuleInfo` data structure.

- `ParamModuleInfo` has the info needed to `setattr` a managed parameter, where it must account for shared parameters and shared modules.
    ```
    # Shared parameter
    lin1.weight = lin2.weight

    # Shared module
    mlp.lin1 = mlp.lin2
    ```
- In order for FSDP to find shared modules' parameters, we must use `remove_duplicate=False`. See https://github.com/pytorch/pytorch/pull/99448/ for the original context. Finding shared modules' parameters is not necessary for the `setattr` logic, but in case we need it in the future (like for existing FSDP's state dict), we include that info for now.
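
A quick illustration of why `remove_duplicate=False` matters for shared parameters:
```
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
mlp[1].weight = mlp[0].weight  # shared parameter

# The default de-duplicates, so the shared weight shows up only once:
assert len(list(mlp.named_parameters())) == 3
# remove_duplicate=False surfaces it under both owning modules:
assert len(list(mlp.named_parameters(remove_duplicate=False))) == 4
```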

With this PR, we see the general system architecture:
- 1 `module` : 1 `fully_shard`
- 1 `fully_shard` : 1 `FSDPParamGroup`
- 1 `FSDPParamGroup` : k `FSDPParam`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117867
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #118525, #117814
2024-01-30 22:07:59 +00:00
ae6233ec47 [FSDP2] Added mesh arg, FSDPState, move to device (#117814)
Squashed to include https://github.com/pytorch/pytorch/pull/117861, https://github.com/pytorch/pytorch/pull/117852

---

This PR adds `_get_managed_modules()` to determine which modules a `fully_shard(module)` call manages. The rule is defined as:
> `fully_shard(module)` manages all modules in `module.modules()` except those already managed by a nested `fully_shard()` or a nested non-composable API (e.g. `replicate()` or TorchRec).

Practically, this can be implemented as a graph search from `module` that does not proceed into any module with `fully_shard` or a non-composable API applied. Because the non-composable APIs follow the same rule, this rule is correct inductively.

---

This PR adds `_get_managed_states(managed_modules)` to return the managed parameters and buffers given the managed modules.
- Without an extra mechanism to ignore specific parameters or buffers, the rule currently is simply to get the directly managed state (i.e. parameters/buffers) from each managed module while de-duplicating shared ones.
- However, we prefer this translation from managed modules to managed states to accommodate ignoring specific states in the future (which has appeared in various open-source use cases).

---

This PR adds the `mesh` argument to `fully_shard` and some helper data structures specific to FSDP/HSDP that pre-compute useful info like rank/world size for each mesh dim.
- The `mesh` defines the FSDP/HSDP algorithm. 1D mesh means FSDP, and 2D mesh means HSDP, where we assume sharding on the last dimension.
    - We can revisit the HSDP sharding-dim assumption if needed in the future.
- The default (if `mesh is None`) is that `fully_shard` calls `init_device_mesh` following the global process group.
- The helper data structures are the various `*MeshInfo`s. I included up to the `HSDPMeshInfo` even though it will not be immediately used to show the spirit of it. We want to tag both the shard and replicate dims.
- The `mesh_info` variable in `fully_shard` is not used for now. It will be passed downstream in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117814
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #118525
2024-01-30 22:05:16 +00:00
7aa4b35b75 [FSDP2][Reland] Introduced initial fully_shard frontend (#118525)
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module (a minimal sketch of this swap follows the list below).
    - We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
    - We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
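
A stripped-down sketch of the dynamic class swap mentioned above (illustrative only; the real `fully_shard` also installs the FSDP state object and caches the generated class per module type):
```
import torch.nn as nn

class FSDP:
    def unshard(self):
        ...  # example of a method FSDP adds onto the wrapped module

def swap_class(module: nn.Module) -> nn.Module:
    cls = type(f"FSDP{type(module).__name__}", (FSDP, type(module)), {})
    module.__class__ = cls  # e.g. Linear becomes FSDPLinear
    return module

lin = swap_class(nn.Linear(2, 2))
print(type(lin).__name__)          # FSDPLinear
print(isinstance(lin, nn.Linear))  # True
```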

**Reland details:** I removed `test/distributed/_composable/fsdp/_test_fully_shard_common.py` and moved its contents to the existing `torch/testing/_internal/common_fsdp.py`, which is already a target for internal tests.

Differential Revision: [D53187509](https://our.internmc.facebook.com/intern/diff/D53187509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118525
Approved by: https://github.com/wanchaol
2024-01-30 22:05:16 +00:00
48f876143a Fix missing permission in create release workflow (#118681)
Fixes https://github.com/pytorch/pytorch/actions/runs/7715417683/job/21029944543
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118681
Approved by: https://github.com/clee2000, https://github.com/seemethere, https://github.com/atalman, https://github.com/malfet
2024-01-30 22:02:30 +00:00
1aa836f502 Dont fuse write into read if indexing differs (#118210)
Fix for https://github.com/pytorch/pytorch/issues/101950, https://github.com/pytorch/pytorch/issues/94693

Similar to inplacing a kernel only fuse a write after a read of the same tensor if the write and read have same indexing formula. I did a perf test and it was neutral.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118210
Approved by: https://github.com/jansel
2024-01-30 21:55:27 +00:00
82a7460b67 [quant][bc-breaking] Turn on fold_quantize by default (#118605)
Summary:
Previously, by default we did not generate a quantized weight; that is, we had an fp32 weight and
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model

After this PR, we'll produce a graph with int8 weight by default after convert_pt2e:
`int8 weight -> dq -> linear -> ...`

We'll remove the fold_quantize flag in the next PR

Test Plan: CI

Differential Revision: D51730862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118605
Approved by: https://github.com/andrewor14
2024-01-30 21:42:29 +00:00
ba1be17733 Remove voznesenskym from the list of autoreviewers (#118680)
Mitigates the failures of "Auto Request Review" workflow:
```
Requesting review to ezyang, albanD, miladm, voznesenskym, antoniojkim, SherlockNoMad
Error: HttpError: Reviews may only be requested from collaborators. One or more of the users or teams you specified is not a collaborator of the pytorch/pytorch repository.
```
https://github.com/pytorch/pytorch/actions/runs/7716852492/job/21034629665?pr=118669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118680
Approved by: https://github.com/clee2000
2024-01-30 21:35:38 +00:00
f2682e75e6 Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
Info about super in dynamic classes:
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Calling `super(TestCase, self)` actually dispatches to TestCase's parent's methods, bypassing TestCase's own overrides
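
A small illustration of the pitfall (standard Python behavior, shown with stand-in class names):
```
class Base:
    def ping(self):
        return "Base"

class TestCase(Base):
    def ping(self):
        return "TestCase"

def ping(self):
    # Zero-argument super() fails here (no __class__ cell outside a class
    # body), and super(TestCase, self) skips TestCase's own override.
    return super(TestCase, self).ping()

Dynamic = type("Dynamic", (TestCase,), {"ping": ping})
print(Dynamic().ping())  # "Base" -- TestCase.ping is bypassed
```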

Mainly doing this because it's making disable bot spam

Test: checked locally and check that https://github.com/pytorch/pytorch/issues/117954 actually got skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118586
Approved by: https://github.com/huydhn
2024-01-30 21:34:05 +00:00
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
e332653eb3 [inductor] Use at::detail::empty_strided_* in cpp_wraper mode (#118490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118490
Approved by: https://github.com/desertfire
2024-01-30 21:03:19 +00:00
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
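
For illustration, the pattern the rule rewrites and its main caveat:
```
keys = ["conv", "linear", "embedding"]

# dict.fromkeys is clearer (and faster) than a comprehension when every key
# gets the same static value.
flags = dict.fromkeys(keys, False)
assert flags == {k: False for k in keys}

# Caveat: a mutable default is shared by all keys.
shared = dict.fromkeys(keys, [])
shared["conv"].append(1)
assert shared["linear"] == [1]  # same list object everywhere
```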

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
e33e88e5bc Add separate logging target for cudagraphs (#118329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118329
Approved by: https://github.com/mlazos
2024-01-30 20:16:51 +00:00
e180218949 [c10d] Log the last enqueued and completed collective (#118582)
Summary:
During debugging of some timeouted jobs, I found it difficult to
identify which rank is at fault eventhough we have logs of many ranks
reporting timeout on a specific collective seq.

If we can also report lastEqueuedSeq and lastCompletedSeq, it would be
much easier to identify,
1. whether a rank has not even join a collective call (not enqueued)
2. Or it has joined the collective call, but not completed.

For the 1st case, it is most likely a problem in the user's code;
for the 2nd case, it could be a lower-layer issue.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582
Approved by: https://github.com/wconstab
2024-01-30 20:13:55 +00:00
9247641f34 [PT-Vulkan] aten::unsqueeze - nit optimization (#118575)
Summary:
While learning Vulkan shaders, I realized one of the branches can be easily optimized.

The relevant branch is only taken when we unsqueeze along `dim == 1` for 3D tensors.
1. There's an unnecessary for-loop.
2. There's an unnecessary dependency on the output tensor's number of channels.

## CPU Tensor
```
3D->4D: (c, h, w) -> (c, 0, h, w)
```
## GPU Texture
```
3D->4D: (w, h, c/4)[c%4] -> (w, h, c)[0]
```

Note the GPU Texture's output is always at `[0]` and the output tensor's number of channels is always 1.

We are currently writing the same value `v[p]` to all elements of the texel `out_texel`, but we need only write it to `out_texel[0]`:

Test Plan:
```
[jorgep31415@161342.od /data/sandcastle/boxes/fbsource (ca3b566bc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/2c7f1365-e004-41a0-9201-473929a2738a
Network: Up: 174B  Down: 0B  (reSessionID-c54d25da-f44b-49f7-8bfd-1db4eee50f6d)
Jobs completed: 6. Time elapsed: 14.4s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_0dto1d_dim0
[       OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (60 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (132 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (20 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (66 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (3 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (19 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (307 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (307 ms total)
[  PASSED  ] 10 tests.
[
```

Differential Revision: D53189637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118575
Approved by: https://github.com/yipjustin
2024-01-30 20:01:18 +00:00
suo
d0627cc2af [export] do not rewrite state dict when unlifting (#118611)
This is Very Bad; changing state dict keys violates one of the key contracts we have, which is "do not mess with the state dict".

Change unlift to use a similar `_assign_attr` approach that fx.GraphModule and unflatten do.

Also took the opportunity to improve the interface of `_assign_attr` to be more general.

Differential Revision: [D53139277](https://our.internmc.facebook.com/intern/diff/D53139277/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118611
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609, #118610
2024-01-30 19:14:19 +00:00
suo
be90ab7efd [export] do not unlift cond/map submodules (#118610)
I don't think we should be unlifting HOO submodules.

What is the contract of unlifting? It is: restore the original calling convention of the module, undoing the transformation in which we lift parameters, buffers, and constants to inputs in the graph.

Unlifting does *not* make any guarantees about what's going on inside the module. It's still a flat module. So why should we unlift the cond/map submodules? It doesn't have anything to do with the contract stated above; it's some internal stuff that doesn't affect how the module will be called.

Further, this code as written modifies the state dict; adding a new buffer that is actually duplicate of a previous buffer. Modifying the state dict from the original eager module is never correct.

Differential Revision: [D53160713](https://our.internmc.facebook.com/intern/diff/D53160713/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118610
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608, #118609
2024-01-30 19:14:18 +00:00
suo
4ee8aa6028 [export] adopt KeyPath API in nonstrict mode (#118609)
This PR rewrites two paths to use the newly-added keypaths API in pytree:
First: we were hand-rolling a tree_map during fakification because we wanted to track sources. This PR uses keypaths instead, which can do the same thing without needing custom code.

Second: our constraint error formatting was referencing placeholder names in error messages. These placeholder names are not otherwise user-visible, so they are super confusing to users (e.g. "which input does arg1_3 correspond to?"). This diff uses the `keystr` API to format the error message.

This necessitated some small refactors—generating the keystr is expensive so doing it in an f-string was very bad.

It can also be further improved—we can inspect the signature so that instead of `*args[0]` we can give people the actual argument name, which would be the ideal UX. But leaving that for later.
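
A tiny sketch of the keypath usage in question (assuming `tree_flatten_with_path` and `keystr` from `torch.utils._pytree`; the exact helpers touched here may differ):
```
from torch.utils import _pytree as pytree

inputs = {"x": 1, "nested": [2, 3]}
leaves_with_paths, _spec = pytree.tree_flatten_with_path(inputs)
for keypath, leaf in leaves_with_paths:
    # keystr turns the path into a readable string like "['nested'][0]",
    # which can replace opaque placeholder names in error messages.
    print(pytree.keystr(keypath), leaf)
```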

Differential Revision: [D53139358](https://our.internmc.facebook.com/intern/diff/D53139358/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118609
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607, #118608
2024-01-30 19:14:11 +00:00
suo
ca090b2c77 [export] do not use tree_flatten_spec (#118608)
tree_flatten_spec is bad; it isn't synced up with `register_pytree_node` so it will not handle arbitrary custom pytrees. It's also not really maintained.

We only use it for two purposes:
- To retain kwarg ordering stability, so that if the user passes in kwargs in a different order things will still work.
- To do "structural" checks that ignore types.

In both cases, tree_flatten_spec is probably *not* the ideal way to implement the desired behavior.

## kwargs ordering
- tree_flatten_spec overwrites the behavior of ALL dictionaries, not just kwargs. This is not correct, dictionary ordering is meaningful in Python, and it's pretty trivial to write a program that relies on dict ordering.
- For kwargs, we do sort of expect that the order in which arguments are passed shouldn't matter. BUT there is one exception: `**kwargs`. In fact, [PEP 468](https://peps.python.org/pep-0468/) was introduced specifically to clarify that ordering does matter when the function being called uses `**kwargs`.

In this diff I introduce a utility function that *only* reorders kwargs. This gets us most of the way to correct—dicts are no longer reordered, but kwargs can be passed in any order.
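
A hypothetical sketch of such a helper (the real utility and its error handling may differ):
```
def reorder_kwargs(kwargs: dict, recorded_order: list) -> dict:
    # Reorder only the top-level kwargs to the order recorded at export
    # time; values (including nested dicts) are left untouched.
    missing = [k for k in recorded_order if k not in kwargs]
    extra = [k for k in kwargs if k not in recorded_order]
    if missing or extra:
        raise TypeError(f"kwargs mismatch: missing={missing}, extra={extra}")
    return {k: kwargs[k] for k in recorded_order}

print(list(reorder_kwargs({"b": 2, "a": 1}, ["a", "b"])))  # ['a', 'b']
```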

A "fully correct" solution would need fix the corner case from PEP468. We could detect whether the top-level fn being traced uses `**kwargs` (via `inspect`), then serialize a flag for it. In ExportedProgram, we would check that flag and only re-order if `**kwargs` was unused; otherwise error if the key order doesn't match. This is a super corner case though, so I'll file it as a followup task.

## structural equivalence checking

This is another use case, where again `tree_flatten_spec` is too broad. Generally we want to treat a precise two types as the same, not override the behavior of comparison generally. So I introduce an `is_equivalent` util for this purpose.

Differential Revision: [D53168420](https://our.internmc.facebook.com/intern/diff/D53168420/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118608
Approved by: https://github.com/zhxchen17
ghstack dependencies: #118607
2024-01-30 19:14:04 +00:00
bc9642f578 Skip more tests under rocm (#118624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118624
Approved by: https://github.com/aakhundov
2024-01-30 19:06:06 +00:00
e6e7d7f26b [pt-vulkan] Introduce MemoryAllocation class and enable deferred allocation and resource aliasing (#118436)
## Context

This changeset is part of a stack that enables memory planning (i.e. sharing memory between intermediate tensors) in the PyTorch Vulkan Compute API. Note that Memory Planning can only be used via the ExecuTorch delegate (currently a WIP) and not Lite Interpreter (which does not collect metadata regarding tensor lifetimes).

This changeset enables [resource aliasing](https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/resource_aliasing.html), a technique that allows two resources (i.e. `VkImage`s or `VkBuffer`s) to bind to the same memory allocation. This is the core feature that allows memory planning to be implemented in PyTorch Vulkan.

## Notes for Reviewers

At a high level, this changeset introduces the `MemoryAllocation` struct which represents a raw `VmaAllocation`. `VulkanImage` and `VulkanBuffer` have been updated to store a `MemoryAllocation` member instead of the raw handle of a `VmaAllocation`.

`vTensor`, `VulkanImage`, and `VulkanBuffer` constructors now have an `allocate_memory` argument which controls whether memory should be allocated on construction. If `false`, then memory must be allocated separately and bound later using `bind_allocation()` before the resource can be used.

Internal:

## Notes for Internal Reviewers

Please refer to [this design doc](https://docs.google.com/document/d/1EspYYdkmzOrfd76mPH2_2BgTDt-sOeFnwTkV3ZsFZr0/edit?usp=sharing) to understand how memory planning will work end-to-end in the Vulkan Delegate.

Differential Revision: [D53136249](https://our.internmc.facebook.com/intern/diff/D53136249/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118436
Approved by: https://github.com/yipjustin
2024-01-30 19:03:55 +00:00
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45ef53747e2eefffd65d91ce840b431b.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
suo
6511811ebb [export] preserve metadata during nonstrict tracing (#118607)
Previously, nonstrict tracing would wipe the metadata of graph modules, because the wrapper class we're using was not detected as a GraphModule and thus metadata preservation was not turned on.

Differential Revision: [D53139354](https://our.internmc.facebook.com/intern/diff/D53139354/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118607
Approved by: https://github.com/zhxchen17
2024-01-30 18:27:52 +00:00
644f64f2d1 [c10d] added docstrings and tests for src / dst (#118593)
Follow-up to https://github.com/pytorch/pytorch/pull/118359: whether ``src`` and ``dst`` are based on the global pg or the sub pg
* update c10d docstrings: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument (see the sketch after this list)
* communication ops with ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``
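
For illustration, a minimal sketch of the clarified ``dst`` semantics, assuming an already-initialized default process group with 4 ranks:

```python
import torch
import torch.distributed as dist

subgroup = dist.new_group(ranks=[2, 3])  # every rank must call new_group
if dist.get_rank() in (2, 3):
    t = torch.ones(1)
    # dst is interpreted as a *global* rank even though a subgroup is passed,
    # i.e. this reduces onto global rank 2, not "rank 0 of the subgroup"
    dist.reduce(t, dst=2, group=subgroup)
```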

3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use the ``device`` argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118593
Approved by: https://github.com/wconstab
2024-01-30 17:47:58 +00:00
19e8ba95e5 [RELAND] Remove deprecated fbgemm operators (#112153)
These operators are not used and have been deprecated since #72690
(Feb 2022).

BC-breaking message:

`TorchScript` models that were exported with the deprecated
`torch.jit.quantized` API will no longer be loadable, as the required
internal operators have been removed.
Please re-export your models using the newer `torch.ao.quantization` API
instead.
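
A rough sketch of the re-export path with the newer API, using dynamic quantization as an example (illustrative only; the right workflow depends on how the model was originally quantized):

```python
import torch
import torch.ao.quantization as aoq

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
qmodel = aoq.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
scripted = torch.jit.script(qmodel)  # TorchScript export of the torch.ao-quantized module
torch.jit.save(scripted, "model_quantized.pt")
```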
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112153
Approved by: https://github.com/jerryzh168
2024-01-30 16:32:37 +00:00
2327879fb6 Add lowering to special.bessel_j0 (2nd try) (#118565)
This PR is a copy of https://github.com/pytorch/pytorch/pull/118464 that was merged without using pytorchbot. Sorry for the noise!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118565
Approved by: https://github.com/peterbell10
2024-01-30 15:26:59 +00:00
fbf92500fb enable privateuseone to perform streaming backward (#117111)
Fixes #116957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117111
Approved by: https://github.com/soulitzer
2024-01-30 15:13:31 +00:00
15702a8027 Fix lnit after #118533 (#118633)
Fixes lint after https://github.com/pytorch/pytorch/pull/118533
Adds ignore ``possibly-undefined`` to more places

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118633
Approved by: https://github.com/DanilBaibak
2024-01-30 14:07:16 +00:00
827949cef2 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEWithLogits, I found that the `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

This is 35-40% faster, probably thanks to the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reducing kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.
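
For context, the identity being exploited, checked with a small snippet (assumes the default ``reduction='mean'`` and no ``pos_weight``):

```python
import torch
import torch.nn.functional as F

x, y = torch.randn(8), torch.rand(8)
# 1 - sigmoid(x) == sigmoid(-x), so the loss can be written entirely via logsigmoid
ref = F.binary_cross_entropy_with_logits(x, y)
via_logsigmoid = -(y * F.logsigmoid(x) + (1 - y) * F.logsigmoid(-x)).mean()
torch.testing.assert_close(ref, via_logsigmoid)
```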

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/malfet
2024-01-30 13:24:13 +00:00
e5bb527d3e [inductor][cpp] support scalar value in vec reduction (#118511)
Fix https://github.com/pytorch/pytorch/issues/118379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118511
Approved by: https://github.com/leslie-fang-intel, https://github.com/lezcano, https://github.com/jansel
2024-01-30 13:07:43 +00:00
91690983ff [easy] Faster empty LIST_LENGTH guard (#118542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118542
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-01-30 13:02:18 +00:00
64efec9953 Port FakeProcessGroup to cpp (#118426)
### Summary
Native functional collective ops require the backend to be implemented in C++. Porting `FakeProcessGroup` to cpp so that it can also work for native functional collective ops.

The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now passes with `_USE_NATIVE_C10D_FUNCTIONAL=1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426
Approved by: https://github.com/wanchaol
ghstack dependencies: #113057
2024-01-30 11:40:13 +00:00
da0635d17c Add pytorch-distributed justknobs helper (#118568)
Summary:
Sets up a helper that checks any JKs relevant to pytorch distributed,
and propagates their values to ENV.

Test Plan: Added unit test

Differential Revision: D53192406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118568
Approved by: https://github.com/zdevito
2024-01-30 08:13:52 +00:00
3ecc2f3a0d [PT2][Runtime Numeric Check] Fix compatibility issue (#118578)
Summary: Titled

Test Plan: WIP

Differential Revision: D53196722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118578
Approved by: https://github.com/jackiexu1992
2024-01-30 08:04:27 +00:00
b7c8485704 refactor mm_plus_mm check to pattern match (#118456)
Fixes #103101

replace #103253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118456
Approved by: https://github.com/jansel
2024-01-30 07:48:06 +00:00
c7af626a26 [c10d] allow nonblocking wrap of ncclCommInitRankConfig (#118256)
resolve #117749

Summary:
Updated the PR with the following intentions:

1. identify eagerMode init (as opposed to lazy init), in which case we will create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do other work (e.g., model init) between invoking init_process_group and their first collective call.
3. c10d would guarantee/wait for communicators to be initialized before issuing the first collective call.
4. For NCCL collective calls, the contract between Python users and c10d is not changed much from blocking calls (c10d would wait for the NCCL call to reach ncclSuccess, or time out, whichever happens first).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
2024-01-30 06:23:20 +00:00
e632d0c0dc Break Triton MutationTests to one kernel per test (#118553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118553
Approved by: https://github.com/aakhundov
ghstack dependencies: #118588
2024-01-30 06:17:55 +00:00
eqy
4a48899b6e [CUDA][complex] Define LIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS in CUDA build (#117061)
An upcoming CUDA release will migrate to CCCL, and we need this define to preserve current complex behavior: https://nvidia.github.io/libcudacxx/standard_api/numerics_library/complex.html

CC @miscco @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117061
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-30 06:11:31 +00:00
c203d88795 Skip mutation tests on rocm (#118588)
Fixes #118585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118588
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-01-30 05:46:54 +00:00
fe07851173 [CUDA][TF32][functorch] Also disable TF32 for vjp and jvp tests (#118592)
CC @zou3519
Appears to be the same issue as https://github.com/pytorch/pytorch/issues/86798
Seen surfacing on >= sm80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118592
Approved by: https://github.com/zou3519
2024-01-30 05:34:20 +00:00
8be6dee14b [inductor] Fix codegen bug with Native Triton kernels with ReinterpretView args (#118569)
Summary:
### Context

It's possible for the args of a user-defined Triton kernel to be codegen-ed twice. But this only happens if the arg is a `ReinterpretView`.
* First via `arg.codegen_reference()` in `define_user_defined_triton_kernel()`
* Second in `self.codegen_kwargs()`.

When using `abi_compatible=True`, the duplicate codegen will look like the code below. The issue in the code is that one of the Tensors, internal to the graph, isn't properly freed. This scenario was eventually exposed as a memory leak when we re-ran an AOTInductor model many times and observed `memory.used` increase after each iteration.
```
auto tmp_tensor_handle_0 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
auto tmp_tensor_handle_1 = reinterpret_tensor_wrapper(buf1, 2, int_array_0, int_array_1, 0L);
...
// There's no wrap_with_raii_handle_if_needed() for tmp_tensor_handle_0.
// And there's no reference to tmp_tensor_handle_0.
// Thus, tmp_tensor_handle_0 is left as an AtenTensorHandle which isn't
// automatically cleaned-up like RAIIAtenTensorHandle
CUdeviceptr var_6;
aoti_torch_get_data_ptr(wrap_with_raii_handle_if_needed(tmp_tensor_handle_1), reinterpret_cast<void**>(&var_6));
void* kernel_args_var_2[] = {..., &var_6, ...};
launchKernel(kernels.add_kernel_0, ..., kernel_args_var_2);
```

### Solution
We just need the arg's buffer name when creating the `TensorArg` in `define_user_defined_triton_kernel()`. Thus, just return the buffer's name and avoid any potential side-effects with `arg.codegen_reference()`.

Test Plan:
### Inspect device memory allocated
```
# Before diff
0 device memory 2048
1 device memory 2560
2 device memory 3072
3 device memory 3584
4 device memory 4096
5 device memory 4608

# With diff (memory usage doesn't grow)
0 device memory 1536
1 device memory 1536
2 device memory 1536
3 device memory 1536
4 device memory 1536
5 device memory 1536
```

Reviewed By: jingsh, tissue3

Differential Revision: D53190934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118569
Approved by: https://github.com/oulgen
2024-01-30 05:19:32 +00:00
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
5dfcf07449 Reland PR117393 [inductor/fb] log config dict when compilation finishes (#118552)
Summary: Reverted due to merge conflict

Differential Revision: D53188124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118552
Approved by: https://github.com/mengluy0125
2024-01-30 04:34:22 +00:00
dcc077eea2 [executorch hash update] update the pinned executorch hash (#118594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118594
Approved by: https://github.com/pytorchbot
2024-01-30 03:49:49 +00:00
0d47f6a44f [ez][inductor] fix a typo in should_pad_bench (#118598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118598
Approved by: https://github.com/eellison
2024-01-30 03:49:44 +00:00
135f785d77 [audio hash update] update the pinned audio hash (#118338)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118338
Approved by: https://github.com/pytorchbot
2024-01-30 03:44:00 +00:00
ff0cb38693 [vision hash update] update the pinned vision hash (#118340)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118340
Approved by: https://github.com/pytorchbot
2024-01-30 03:15:16 +00:00
2eefbc02a0 [ez] Discover tests without importing torch (#118574)
Moves test discovery into a file that doesn't import torch, so test listing can be done without having torch installed.

Helpful when you don't have torch installed (aka me when I'm feeling lazy)
I want to move TD into its own job that doesn't need to wait for the build to finish, so this is part of that.

The first commit is nothing more than a copy-paste of the selected functions/vars into a new file; the second commit has various changes that should be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118574
Approved by: https://github.com/huydhn
2024-01-30 03:02:29 +00:00
eb9905be5d [export] Remove the branch for skipping verifier. (#118139)
Summary:
We used to skip the verifier when the signature object is not the "correct" one (usually from some deprecated frontend). This was very useful when we wanted to pay a small cost to enable the verifier path to be called everywhere for torch export.

Now I believe no tests are relying on this behavior so we should remove this weird branch.

Test Plan: CI

Differential Revision: D53024506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118139
Approved by: https://github.com/suo
2024-01-30 02:58:03 +00:00
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
126c1621ce Add Support for CausalBias to torch compile (#116071)
Fixes #115363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116071
Approved by: https://github.com/mlazos
2024-01-30 02:22:48 +00:00
67d8db9252 Remove semicolon after return_from_mutable_noop_redispatch (#118538)
[`return_from_mutable_noop_redispatch`](65f8276bc6/torchgen/gen_functionalization_type.py (L477)) calls
[`return_str`](65f8276bc6/torchgen/gen_functionalization_type.py (L159-L166)). `return_str`'s output includes `;` so I think the semicolon after the callsite of `return_from_mutable_noop_redispatch` is not needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118538
Approved by: https://github.com/colesbury
2024-01-30 02:22:42 +00:00
0ed24cb1af [export] comments about runtime_var_to_range. (#118539)
Summary: Add some comments in case we forget what runtime_var_to_range means

Test Plan: eyes

Differential Revision: D53186114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118539
Approved by: https://github.com/suo
2024-01-30 02:07:34 +00:00
b1f8b6b8fc Forward Fix accidental removal of import (#118572)
Summary:
This Diff is a forward fix for this PR: https://github.com/pytorch/pytorch/pull/114689

Where I accidentally removed the old import from backends/cuda.

Test Plan: Verified on the failing revert diff and it did indeed fix the issue

Reviewed By: DanilBaibak

Differential Revision: D53193454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118572
Approved by: https://github.com/DanilBaibak
2024-01-30 02:07:19 +00:00
460950d3aa [Nested Tensor] Support ragged_idx != 1 on aten::is_same_size, aten::_to_copy (#118442)
is_same_size is needed internally; `_to_copy` should be easy because it doesn't support new layouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118442
Approved by: https://github.com/cpuhrsch
2024-01-30 01:32:51 +00:00
6c9f72156e Fix constant folding bug with sym size tensor (#118411)
When a constant-folded SymInt was used to construct a tensor that was then constant folded, we had previously tried to use the sympy symbol, which would error (the call should take a SymInt, not a symbol).

Fix by recording the observed size during constant folding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118411
Approved by: https://github.com/ezyang
2024-01-30 01:26:51 +00:00
aef820926c Add some tests for 3d channels last (#118283)
Part of a multi-PR work to fix #59168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118283
Approved by: https://github.com/albanD
2024-01-30 01:26:47 +00:00
bacbad5bc9 add GradScaler on CPU (#109993)
Step 2 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-29 23:42:35 +00:00
796d270392 [easy] Fix small typo in register_state_dict_pre_hook doc (#118571)
Fixed https://github.com/pytorch/pytorch/pull/112674#issuecomment-1912849827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118571
Approved by: https://github.com/janeyx99, https://github.com/albanD
2024-01-29 23:18:12 +00:00
413a434846 [export] Convert all export tests to .module() (#118425)
Test Plan: CI

Differential Revision: D53075379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118425
Approved by: https://github.com/suo
2024-01-29 23:06:54 +00:00
ca7cbf1226 Add memory_format to typehints of Tensor.cpu and Tensor.cuda (#118392)
Fixes #118501

The issue is that mypy complains if users use memory_format with Tensor.cpu/Tensor.cuda in their code.

This adds the missing memory_format to the type hints of both functions.
I believe there is no test infrastructure for type hints....
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118392
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-29 22:56:34 +00:00
e1cbf6dff5 Use SEQUENTIAL posix_fadvise on mmapped files (#117805)
In theory this tells the system that we will access the file sequentially which allows prefetching future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).

Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).

With this, they run at ~1.5 GB/s which is still bad but better than before!

It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.

All of the numbers above are taken on Samsung 990 Pro SSD, on Linux kernel 6.5 with FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open` which uses UntypedTensor.from_file to load the file into memory, which in turn uses MapAllocator.cpp.

I felt safe doing this change unconditionally but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedTensor. Note that the fadvise API is not available on macOS.
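
For reference, the same advisory call expressed with Python's ``os`` module (the PR itself makes the equivalent call in MapAllocator.cpp; the file name below is just an example, and the API is Unix-only):

```python
import os

fd = os.open("model.safetensors", os.O_RDONLY)
# len=0 means "to the end of the file"; SEQUENTIAL roughly doubles read-ahead on Linux
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
# ... mmap or read the file sequentially here ...
os.close(fd)
```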
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
2024-01-29 22:38:00 +00:00
67c6152f4e [HigherOrderOp] support while_loop in dynamo (#116913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116913
Approved by: https://github.com/zou3519
2024-01-29 22:32:32 +00:00
e3d7a19f73 [CI] add wait for /orig branch in mergeability check (#118576)
---

Test runs:
* [happy path](https://github.com/pytorch/pytorch/actions/runs/7702614677/job/20991275431?pr=118576) (this PR)
* [waiting for the hardcoded branch name](https://github.com/izaitsevfb/pr-head-test/actions/runs/7702386966/job/20990584514#step:3:33) in a separate repo (step succeeded after the branch was manually pushed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118576
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-29 22:10:50 +00:00
a40be5f4dc Autograd doc cleanup (#118500)
I don't think we'll realistically go through deprecation for these now since there are a couple of uses of each online. So document appropriately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118500
Approved by: https://github.com/soulitzer
2024-01-29 21:51:33 +00:00
fc5cde7579 [dynamo] constant fold torch.cuda.get_device_properties to avoid graph break (#118422)
Before this PR, we had a graph break for code like this:
```python
    def test_get_device_properties_tensor_device(a):
        x = a.to("cuda")
        prop = torch.cuda.get_device_properties(x.device)
        if prop.major == 8:
            return x + prop.multi_processor_count
        return x + prop.max_threads_per_multi_processor
```
This PR constant folds the torch.cuda.get_device_properties and we'll get a following dynamo graph:
```python
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]     def forward(self, L_a_ : torch.Tensor):
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         l_a_ = L_a_
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:544 in test_get_device_properties_tensor_device, code: x = a.to("cuda")
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         x = l_a_.to('cuda');  l_a_ = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:547 in test_get_device_properties_tensor_device, code: return x + prop.multi_processor_count
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         add = x + 108;  x = None
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]         return (add,)
[2024-01-26 13:28:13,253] [0/0] torch._dynamo.output_graph.__graph: [DEBUG]
```

The signature of get_device_properties is:
```python
def get_device_properties(device: _device_t) -> _CudaDeviceProperties:
```
I think it's safe to constant fold get_device_properties():
1. torch.cuda.get_device_properties(tensor.device). In this case, tensor.device.index is guarded in _check_tensor
2. torch.cuda.get_device_properties(device_int_id). We don't expect the GPU properties for a particular index to change during a torch.compile run, and it makes sense to specialize the properties for a concrete device_int_id.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118422
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-01-29 20:26:40 +00:00
f99adbb4ec [inductor] Remove ROCm xfail on test_cum{sum,prod}_zero_dim (#118558)
Fixes #118540, fixes #118541

Since the zero-dim case reduces to a pointwise operation, we don't fall back on
ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118558
Approved by: https://github.com/malfet
2024-01-29 20:23:40 +00:00
6591741183 [dynamo] support inference_mode with no arguments (#118427)
Before this PR, we got an error for the following code:
```python
def k(x):
    with torch.inference_mode():
        x = x + 1
        return x

torch.compile(k, backend="eager", fullgraph=True)(x)
```
error message:
```
Traceback (most recent call last):
....
    return InferenceModeVariable.create(tx, args[0].as_python_constant())
torch._dynamo.exc.InternalTorchDynamoError: list index out of range
```

This PR supports the case where torch.inference_mode is not provided any argument (i.e. it defaults to True).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118427
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-01-29 20:20:26 +00:00
e0d04b7119 [Caffe2] Fix bug in str on wide types (#117531)
Summary:
The current implementation of `str` passes wide types (`wchar_t`, `wchar_t*`, `std::wstring`) directly to `std::ostringstream`. This has the following behavior:

 - C++17, `wchar_t` & `wchar_t *`: print the integer representation of the character or the pointer. This is unexpected and almost certainly a (runtime) bug.
 - C++17, `std::wstring`: compile-time error.
 - C++20, all of the above: compile-time error.

To fix the bug and to enable C++20 migration, this diff performs narrowing on these wide types (assuming UTF-16 encoding) before passing them to `std::ostringstream`. This fixes both the C++20 compile time errors and the C++17 runtime bugs.

This bug surfaced in enabling C++20 windows builds, because windows specific caffe2 code uses `TORCH_CHECK` with wide strings, which references `str` for generating error messages.

Test Plan: CI & https://godbolt.org/z/ecTGd8Ma9

Differential Revision: D52792393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117531
Approved by: https://github.com/malfet
2024-01-29 20:11:37 +00:00
68b18dc2a2 [DeviceMesh] Removed print of self._dim_group_infos (#118527)
This print seems to have accidentally been merged in. It is a bit verbose during unit tests, so this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118527
Approved by: https://github.com/wz337
2024-01-29 19:14:25 +00:00
bb55970e5b Revert "Add justknobs env helper for pytorch distributed (#118451)"
This reverts commit 4d1bb2175a49e9b4440085a3dc2e2b211e5cf99e.

Reverted https://github.com/pytorch/pytorch/pull/118451 on behalf of https://github.com/wconstab due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/118451#issuecomment-1915369013))
2024-01-29 19:01:05 +00:00
0288db3120 [DCP] Removes Checkpoint Wrapped Prefix from state dict fqns (#118119)
Fixes #117399

~~Soliciting some early feedback here.~~

~~Do we happen to know if there already some tests that cover this case or would it make sense to add? @fegin , @wz337~~

Edit: Added tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118119
Approved by: https://github.com/fegin
2024-01-29 18:18:52 +00:00
fb11354594 Revert "[c10d] relax the nccl error check for nonblocking mode (#118254)"
This reverts commit 993e4f3911856be3a93746f6ed6a13f25de6ff65.

Reverted https://github.com/pytorch/pytorch/pull/118254 on behalf of https://github.com/clee2000 due to has internal failures D53170606 ([comment](https://github.com/pytorch/pytorch/pull/118254#issuecomment-1915267786))
2024-01-29 17:56:40 +00:00
3011a4406f [BE][GHF] Do not hardcode default branch name (#118530)
Instead rely on `GitHubPR.default_branch()` which is the name of the repo's default branch.

Do not pass the branch name when `merge_changes` is called, as it is set to the default branch inside the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118530
Approved by: https://github.com/clee2000
2024-01-29 17:18:23 +00:00
65f8276bc6 add an option to specify custom addr2line binary (#118328)
There is a need for users to pick their own addr2line binary in their deployment, for reasons like the default addr2line being too slow. This option allows users to quickly experiment with other alternatives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118328
Approved by: https://github.com/zdevito, https://github.com/aaronenyeshi
2024-01-29 16:36:38 +00:00
abe3c55a6a Update DDP dynamo debug docs (#118295)
Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295
Approved by: https://github.com/LucasLLC, https://github.com/wanchaol
2024-01-29 14:58:26 +00:00
f9971daaee Fix divergence between internal + external (#118509)
D53049807 and https://github.com/pytorch/pytorch/pull/118197 got out of sync somehow

Fixing externally since I'm pretty sure the internal version is correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118509
Approved by: https://github.com/malfet
2024-01-29 14:53:50 +00:00
04c1df651a [inductor][cpp] enable vectorization with constant bool (#118380)
Related model DebertaForQuestionAnswering etc. For DebertaForQuestionAnswering, single thread, measured on ICX:
Before: 0.990x, After: 1.043x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118380
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-01-29 13:31:22 +00:00
ee3dfbbe47 [Inductor] Fix Argmax codegen with Nan input (#118358)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/118266: currently `torch.argmax` and `torch.argmin` have different return values between eager and the Inductor cpp backend when inputs contain `NaN` values. Align the cpp backend results to eager by reusing the compare function.
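
For context, a tiny repro of the eager semantics the cpp backend is aligned to (in eager, NaN wins the comparison, so argmax reports the NaN position):

```python
import torch

x = torch.tensor([1.0, float("nan"), 3.0])
print(torch.argmax(x))  # tensor(1) in eager; the Inductor cpp backend now matches
```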

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_argmin_cpu_only
python -u -m pytest -s -v test_cpu_repro.py -k test_argmax_argmin_with_nan_value
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118358
Approved by: https://github.com/lezcano, https://github.com/jgong5, https://github.com/jansel
2024-01-29 09:09:46 +00:00
41dfdde9f5 Handle some numpy functions with out arguments correctly in dynamo (#118248)
Dynamo creates Tensors when tracing through numpy ufuncs like np.sin, np.minimum, etc., and under `torch.compile` np functions generally return Tensors at runtime. However, when normalizing `out` arguments we currently require that the input is an ndarray. This causes assertion errors when running torch.compile on any numpy function with an out argument:
```
    def test_numpy_ufunc_out(self):
        @torch.compile(backend="eager")
        def foo():
            x = np.arange(5)
            out = np.empty((x.shape[0], x.shape[0]))
            res_out = np.sin(x, out=out)
            assert res_out is out
        foo()
```
Failure with stack trace: https://gist.github.com/jamesjwu/68e217638d735678b3de968584dba23f

Instead, we can wrap tensors in an ndarray in normalize_outarray to handle the case correctly. Fixing this resolves ~220 tests under dynamo_test_failures, but also exposes a followup bug.

In the presence of a graph break, ndarrays don't preserve their id, which can affect assertions and `is` checks between numpy arrays:
```
     def test_x_and_out_broadcast(self, ufunc):
        x = self.get_x(ufunc)
        out = np.empty((x.shape[0], x.shape[0]))

        x_b = np.broadcast_to(x, out.shape)
        # ufunc is just np.sin here
        res_out = ufunc(x, out=out)
        res_bcast = ufunc(x_b)
        # passes
        assert res_out is out
        graph_break()
        # fails
        assert res_out is out
```
Regular tensors preserve their id because Dynamo caches their example tensor values across a graph break. However, with ndarrays, we only store their converted tensor values, and construct new ndarrays around those values:
eebe7e1d37/torch/_dynamo/variables/builder.py (L1083)
Added a test with expected failure to showcase this — we can then fix that issue separately.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118248
Approved by: https://github.com/lezcano
2024-01-29 09:09:21 +00:00
4d1bb2175a Add justknobs env helper for pytorch distributed (#118451)
Summary:
Adds a JK killswitch check and configures the env for enabling pytorch
nccl flight recorder.  Note- this only enables recording events in memory, not
dumping them.

Test Plan: CI test

Reviewed By: zdevito

Differential Revision: D52920092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118451
Approved by: https://github.com/malfet
2024-01-29 08:57:16 +00:00
41902a6ebc [dynamo] Optimize is_tracing checks (#118474)
benchmarks/dynamo/microbenchmarks/overheads.py
- before: 10.4us
- after: 9.9us

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118474
Approved by: https://github.com/yanboliang
2024-01-29 08:31:26 +00:00
eba240afcb Revert "[FSDP2] Introduced initial fully_shard frontend (#117776)"
This reverts commit 316579e30ce820cb5f431e6bb816a882db918b38.

Reverted https://github.com/pytorch/pytorch/pull/117776 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/117776#issuecomment-1914121167))
2024-01-29 07:38:41 +00:00
e6f3a4746c include a print for _get_cuda_arch_flags (#118503)
Related to #118494, it is not clear to users that the default behavior is to include **all** feasible archs (if the 'TORCH_CUDA_ARCH_LIST' is not set).

In these scenarios, a user may experience a long build time. This adds a print statement to reflect that behavior. [A `verbose` arg is not available, and it does not feel necessary to add one to this function and all its parent functions...]

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118503
Approved by: https://github.com/ezyang
2024-01-29 07:03:56 +00:00
47b5a6b05d [Dynamo] Analyze triton kernels via tracing to determine mutations (#117300)
This PR adds TTIR lexing and parsing in order to analyze which of the user defined triton kernel inputs are mutated.

Differential Revision: [D53165999](https://our.internmc.facebook.com/intern/diff/D53165999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117300
Approved by: https://github.com/jansel
2024-01-29 06:37:08 +00:00
2951bbf0f7 Add some type annotations to torch._inductor.codegen.wrapper (#118491)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118491
Approved by: https://github.com/Skylion007
2024-01-29 06:17:27 +00:00
5f59d0c748 [C10D] Disarm PGNCCL Heartbeat Monitor to gather data (#118344)
Summary:
Leave monitoring thread 'running' in log-only mode. Use the kill logs to
correlate with actual job outcomes (e.g. does stuck job detector agree?)

Later, re-enable (using a justknobs knob this time)

Test Plan: CI

Differential Revision: D53108142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118344
Approved by: https://github.com/shuqiangzhang, https://github.com/yifuwang, https://github.com/malfet, https://github.com/kwen2501
2024-01-29 06:09:36 +00:00
890d8e6692 [executorch hash update] update the pinned executorch hash (#118502)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118502
Approved by: https://github.com/pytorchbot
2024-01-29 03:45:45 +00:00
0d9aff2523 Removed unused “device” argument in torch.frombuffer() #118273 (#118439)
Fixes #118273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118439
Approved by: https://github.com/albanD
2024-01-28 22:01:49 +00:00
acc700739e Upgrade mypy version to 1.8.0 (#118481)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118481
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475, #118479, #118480
2024-01-28 19:22:37 +00:00
338596dfbc Forbid follow_imports = skip from mypy.ini (#118480)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118480
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475, #118479
2024-01-28 19:22:37 +00:00
119b66ba16 Use strict to toggle strict options in MYPYSTRICT (#118479)
As we force a specific version of mypy, it's OK to use the agglomerated flag.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118479
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469, #118475
2024-01-28 19:22:22 +00:00
ecca533872 Use dmypy instead of mypy in lintrunner (#118475)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118475
Approved by: https://github.com/suo
ghstack dependencies: #118414, #118418, #118432, #118467, #118468, #118469
2024-01-28 13:42:06 +00:00
cad79bd0bb Remove follow_imports = skip from sympy (#118469)
dmypy silently ignores follow_imports = skip, so to get parity between
dmypy and mypy we have to suck it up and type: ignore all of the sympy
typing problems.

The suppressions were added automatically with the following script generated by GPT-4:

```
import re

# Read the error file
with open("error_file.txt", "r") as f:
    errors = f.readlines()

# Parse the lines with errors and error types
error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

# Insert ignore comments in the source files
for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118469
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467, #118468
2024-01-28 13:38:38 +00:00
59b4d2cd40 [mypy] Remove colorama ignore_missing_imports (#118468)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118468
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432, #118467
2024-01-28 13:38:38 +00:00
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
2ed0af2bde [executorch hash update] update the pinned executorch hash (#118477)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118477
Approved by: https://github.com/pytorchbot
2024-01-28 03:56:11 +00:00
9d5b950bdd [BE][Easy]: Update ruff to 0.1.14 (#118466)
Updates ruff to 0.1.14 which has some more autofixes, bugfixes, and fixes some false positives. Full changelog found here: https://github.com/astral-sh/ruff/releases/tag/v0.1.14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118466
Approved by: https://github.com/ezyang
2024-01-27 23:44:25 +00:00
ca1d70632d [14/N][Dynamo] Make trace_rules.lookup only handle function + callable type (#118366)
Step by step changes to unblock #118264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118366
Approved by: https://github.com/angelayi
2024-01-27 23:02:44 +00:00
62c1e4a578 Added missing CircularPad*d references so the docs are actually built. (#118465)
Fixes #118429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118465
Approved by: https://github.com/Skylion007
2024-01-27 22:39:01 +00:00
2728c9137d [easy][AOT] Fix shortcut path for simple tuple/list spec (#118460)
`type(self.spec)` is always `TreeSpec` and the condition is always `False`. This PR changes it to `self.spec.type`, which is the type of tree that the spec represents.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118460
Approved by: https://github.com/Skylion007
2024-01-27 19:04:12 +00:00
1460334436 [quant] Remove deprecated torch.jit.quantized APIs (#118406)
The `torch.jit.quantized` interface has been deprecated since #40102 (June 2020).

BC-breaking message:

All functions and classes under `torch.jit.quantized` will now raise an error if
called/instantiated. This API has long been deprecated in favor of
`torch.ao.nn.quantized`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118406
Approved by: https://github.com/jerryzh168
2024-01-27 18:32:45 +00:00
d03173e88c Unify MYPYINDUCTOR and MYPY (#118432)
The original motivation for MYPYINDUCTOR was a faster type checking configuration that only checked a subset of files. With the removal of `follow_imports = ignore`, we are now able to use dmypy to do fast incremental typechecking, eliminating the need for this.

Perhaps erroneously, when I tee'ed up this PR I elected to delete the `follow_imports = skip` designations in the mypy-inductor.ini. This led to a number of extra type error suppressions that I manually edited. You will need to review.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118432
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418
2024-01-27 17:23:20 +00:00
42062e2622 [pytree][BE] update treespec is_leaf() access (#116371)
Change `isinstance(treespec, LeafSpec) -> treespec.is_leaf()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116371
Approved by: https://github.com/zou3519
2024-01-27 11:44:57 +00:00
26473460a4 [ET-Vulkan] ExecuTorch Vulkan floor_div (#118428)
Summary: Add a new operator "floor_div" to ET-Vulkan.

Test Plan:
```
[yipjustin@7777.od ~/fbcode (b32108c6c)]$ buck2 test fbcode//executorch/backends/vulkan/test:test_vulkan_delegate --
File changed: fbcode//executorch/backends/vulkan/test/test_vulkan_delegate.py
Buck UI: https://www.internalfb.com/buck2/90290e5b-d47e-4cac-bc63-9939cc210d1f
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3940649890839142
Network: Up: 2.8KiB  Down: 0B  (reSessionID-e7425cc1-0987-46d8-a7bf-418a660bee5b)
Jobs completed: 19. Time elapsed: 42.6s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 7. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: SS-JIA

Differential Revision: D53072722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118428
Approved by: https://github.com/SS-JIA
2024-01-27 11:20:52 +00:00
eqy
8d790abab9 [NCCL][c10d] Log failing pointer if deregistration fails (#118455)
For debugging convenience

CC @minsii @Aidyn-A @syed-ahmed @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118455
Approved by: https://github.com/wconstab
2024-01-27 11:03:02 +00:00
dabb90f2a4 Revert "[Exception] [6/N] Remove use of torch::TypeError (#117964)"
This reverts commit 87335fabaeca41f9721ba5d5eb7eafcf70b7afad.

Reverted https://github.com/pytorch/pytorch/pull/117964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117964#issuecomment-1913079096))
2024-01-27 08:44:34 +00:00
suo
bb6eba189f [export][ez] remove unused argument from InterpreterModule (#118364)
small thing I noticed

Differential Revision: [D53113926](https://our.internmc.facebook.com/intern/diff/D53113926/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118364
Approved by: https://github.com/angelayi
2024-01-27 06:46:01 +00:00
89a1175e0e Upgrade mypy python_version to 3.11 (#118418)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118418
Approved by: https://github.com/albanD
ghstack dependencies: #118414
2024-01-27 06:10:46 +00:00
978faf1fa2 Use an op counter to decide when to realize a kernel (#117030)
Instead of checking the number of bytes in the string representation
of the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117030
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-01-27 05:28:46 +00:00
800e2e823f Add compilable foreach RAdam support (#117912)
Fixes https://github.com/pytorch/pytorch/issues/117807

This brings the number of supported optimizers with `torch.compile` to 11/13 (!)
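
A minimal sketch of the pattern this enables, compiling the foreach RAdam step with ``torch.compile`` (illustrative only; real training loops will differ):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.RAdam(model.parameters(), lr=1e-3, foreach=True)

@torch.compile
def step():
    opt.step()

model(torch.randn(2, 4)).sum().backward()
step()
opt.zero_grad()
```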

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117912
Approved by: https://github.com/janeyx99
2024-01-27 04:32:27 +00:00
fe10b1800f LazyGraphModule (#117911)
I feel it's easier to open a new PR rather than iterating on the previous PR (https://github.com/pytorch/pytorch/pull/105257 ) since this is more like a rewrite.

In this PR, instead of changing GraphModule directly, which can easily cause BC issues, I create a LazyGraphModule class as Zachary & Jason suggested in comments on the previous PR.

The difference between LazyGraphModule and GraphModule is mainly in how recompilation of the graph module happens. In GraphModule the recompilation happens 'eagerly': constructing a GraphModule causes the recompilation. In LazyGraphModule, we just mark the module as needing recompilation; the real recompilation only happens when absolutely required (e.g. calling the forward method, accessing the code property, etc.). In a lot of cases in torch.compile, the real recompilation is eventually not triggered at all. This can save a few seconds of compilation time.

By default, GraphModule rather than LazyGraphModule is used. `use_lazy_graph_module(True)` context manager can be used to pick LazyGraphModule instead. This has been applied to the torch.compile stack.
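
A usage sketch based on the description above (the import path is an assumption; check where the context manager actually landed):

```python
import torch
from torch.fx._lazy_graph_module import _use_lazy_graph_module  # assumed location

def f(x):
    return torch.sin(x) + 1

with _use_lazy_graph_module(True):
    compiled = torch.compile(f, backend="aot_eager")
    compiled(torch.randn(4))  # intermediate GraphModules defer recompile() until needed
```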

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117911
Approved by: https://github.com/jansel
2024-01-27 04:10:18 +00:00
70699a6357 [C10D] Add tests for gather and gather_object with subgroup (#118359)
Addresses #118337 somewhat- we probably need to update docs. Let's first
confirm what behavior we want.

Identifies a couple of confusing things
1) 'dst' arg for many collectives is always in 'global' rank regardless
   of whether a subgroup is passed in.  This needs a doc update
2) gather_object has a strong dependency on setting the cuda device;
   could we make that smoother?
3) gather_object also should be happy with an empty list on the dst
   side, imo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118359
Approved by: https://github.com/weifengpy
2024-01-27 04:08:56 +00:00
28625d746f [executorch hash update] update the pinned executorch hash (#118443)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118443
Approved by: https://github.com/pytorchbot
2024-01-27 04:08:49 +00:00
993e4f3911 [c10d] relax the nccl error check for nonblocking mode (#118254)
resolve https://github.com/pytorch/pytorch/issues/117749

Summary:
This is the first step to enable NCCL nonblocking mode.

In NCCL nonblocking mode,  ncclInProgress is an expected return value
when checking communicators. Without this relaxation, the watchdog thread
would throw NCCL errors during work checking even though this result is expected.

Test Plan:
Set nonblocking mode in unit tests, and make sure all existing NCCL
tests pass
Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254
Approved by: https://github.com/kwen2501
2024-01-27 03:49:00 +00:00
40c08795b0 [JIT] python IR bindings: consolidate tests, add short docs in OVERVIEW.md (#118319)
Document the existence of python IR bindings; quick comments about it; and consolidate tests in one file to serve as examples to users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118319
Approved by: https://github.com/eellison
2024-01-27 03:11:51 +00:00
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
af1338bfbf fix escape nested comments in C++ (#117882)
Fixes #115243, as it is tricky to deal with the nested comment in doxygen + sphinx. Change 6 below is adopted as the fix. All other changes do not work.

After adopting change 6, I realized the original
`torch::optim::SGD sgd(0.9);` is not a correct call to the SGD constructor, and
modified it to the correct one:
`torch::optim::SGD sgd(model->parameters(), 0.9);`

- Original in [link](https://pytorch.org/cppdocs/api/function_namespacetorch_1ad98de93d4a74dd9a91161f64758f1a76.html#exhale-function-namespacetorch-1ad98de93d4a74dd9a91161f64758f1a76): `///   torch::optim::SGD sgd(/*lr=*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/0054b355-4925-4112-93b4-9385fdc34bb9)

- Change 1, this solution is referenced from [here](https://stackoverflow.com/questions/24978463/doxygen-escape-nested-comments-in-c): `///   torch::optim::SGD sgd(/&zwj;* lr= *&zwj;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/77ff2d18-3097-4265-8dcd-31d78acb9c6e)

- Change 2: `///   torch::optim::SGD sgd(/* lr= *//* 0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/b520f8de-ead7-4009-b0fb-f4517daba077)

- Change 3: `///   torch::optim::SGD sgd(/\*lr=\*/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/07e9e608-4640-43c0-994a-37983b803003)

- Change 4: `///   torch::optim::SGD sgd(/&lowast; lr= &lowast;/0.9);`
![image](https://github.com/pytorch/pytorch/assets/7495155/121e55c5-0802-4ff3-bbd7-3521e1299d94)

- Change 5:
```
/// \rst
/// .. code-block:: cpp
///
///   torch::nn::Linear model(3, 4);
///   torch::load(model, "model.pt");
///   \verbatim
///   torch::optim::SGD sgd(/*lr=*/0.9);
///   \endverbatim
///   std::istringstream stream("...");
///   torch::load(sgd, stream);
///
///   auto tensor = torch::ones({3, 4});
///   torch::load(tensor, "my_tensor.pt");
/// \endrst
```
![image](https://github.com/pytorch/pytorch/assets/7495155/e675f551-e939-4be8-b24a-e2e53377dd08)

- Change 6: `///   torch::optim::SGD sgd(0.9);  // 0.9 is the learning rate`
![image](https://github.com/pytorch/pytorch/assets/7495155/ecf0adc4-9b0b-4aef-b0bc-72d4b17c45fa)
![image](https://github.com/pytorch/pytorch/assets/7495155/01bf5d5b-8450-4599-8c9a-00204ab56119)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117882
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-01-27 02:37:23 +00:00
5b31516008 [dynamo] inline torch.jit._unwrap_optional (#118434)
Before this PR, torch.jit._unwrap_optional was in the skipfile list, thus causing a graph break. Checking its implementation, it's just a normal Python function [here](ff8e33556e/torch/jit/_script.py (L1681-L1683)):
```python
def _unwrap_optional(x):
    assert x is not None, "Unwrapping null optional"
    return x
```
We could safely inline it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118434
Approved by: https://github.com/yanboliang
2024-01-27 02:22:14 +00:00
4aa1f994be [dynamo][assume_constant_result] Dont put symbolic guards for assume_constant_result (#118430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118430
Approved by: https://github.com/ydwu4
2024-01-27 01:56:14 +00:00
838d3620cd [NCCL PG] log NCCL comm at creation and abort (#118335)
Summary: It helps correlate NCCL PG with corresponding NCCL comm in separate logs.

Differential Revision: D53107647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118335
Approved by: https://github.com/wconstab
2024-01-27 01:43:53 +00:00
80cb6db90d [CUDA] [CI] Disable flash attention for sm87 architecture when the head dim > 192 (#117678)
Head dim > 192 requires A100/H100 (sm80 or sm90) per TORCH_CHECK [here](0c26565d5d/aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp (L760)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117678
Approved by: https://github.com/eqy, https://github.com/malfet
2024-01-27 01:22:47 +00:00
7cc7bf9dda [GHF] Add co-authors to PR (#118347)
Mention co-authors in PR body

Modify the `CommitAuthors` query to include the first two commit `authors`, which makes sure that authors from suggested commits are recognized.

Test plan: CI + check `get_authors()` on a few PRs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118347
Approved by: https://github.com/kit1980
2024-01-27 01:02:49 +00:00
4d771c56de [xnnpack] Move x86 flags to platform_compiler_flags (#117923)
Summary:
AVX extension flags are x86-specific, and clang-18 has started to error on them when building targets that are not x86. I couldn't find the upstream change that made these flags an error, but it's fairly clear that these flags do not apply to all architectures.

Most of the flags are already defined in `platform_compiler_flags`. The changes made are:
* Gate the flags under `compiler_flags` with `selects`
* If flags weren't defined in `platform_compiler_flags`, define them there as well
* Remove the `^` and `$` in the platform regex. Not all flavors start with the platform (e.g. `android-x86_64`).
* Some minor formatting changes were also included here.

Test Plan:
Atop D52741786,
```
buck2 build --flagfile 'arvr/mode/android/apk/linux/opt'  '//arvr/projects/mixedreality/android/ocean_passthrough_service:ocean_passthrough_mrservice_dev'
```

Differential Revision: D52856224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117923
Approved by: https://github.com/mcr229
2024-01-26 23:41:06 +00:00
ff8e33556e Enables load balancing duplicates in DCP (#116469)
Enables the deduplication of saved entries by load balancing duplicates across ranks.

Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in **~3 seconds on 8 ranks**. Before this PR, the same operation was measured at ~19 seconds.

```
# Note: Model, rank_0_print, _patch_model_state_dict and _FileSystemCheckpointer
# are helpers used in the author's benchmark and are not defined in this snippet.
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def run(local_rank, world_size, param_size, num_params, work_dir):

    os.environ["RANK"] = str(local_rank)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size)

    model = Model(param_size=param_size, num_params=num_params)
    model = DistributedDataParallel(model, gradient_as_bucket_view=True)
    _patch_model_state_dict(model)

    sz = sum(t.nelement() * t.element_size() for t in model.parameters())
    rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB")
    rank_0_print("Saving the model with DCP...")

    checkpointer = _FileSystemCheckpointer(
        f"{args.work_dir}/dcp",
        sync_files=False,
        single_file_per_rank=False,
        thread_count=1
    )

    begin_ts = time.monotonic()
    checkpointer.save(state_dict={"model": model})
    end_ts = time.monotonic()
    rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP")
```

Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469
Approved by: https://github.com/fegin, https://github.com/wz337
2024-01-26 22:34:14 +00:00
b95c45fbf7 add stack trace to device skip (#118112)
Log the stack trace of the offending CPU use if it causes cudagraphs to be disabled. Also refactor `disable_cudagraphs: bool` and `disable_cudagraphs_reason: str -> Optional[str]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118112
Approved by: https://github.com/bdhirsh
2024-01-26 22:33:48 +00:00
b256b7b348 Add way to actually delete a torch.library.Library object (#118318)
Relying on object lifetimes in Python is a bad idea due to reference
cycles. Previously, when a torch.library.Library object gets destroyed,
it clears all the registrations associated with it, but it's unclear
when it actually gets destroyed due to the existence of refcycles.

This PR:
- adds torch::Library::clear(), which deterministically releases all of
  the RAII registration handles of the torch::Library object
- adds a new `torch.library._scoped_library` context manager, which creates
  a library and cleans it up at the end of the scope using the previous item.
  All tests (unless they already handle library lifetimes) should use
  this new API
- Rewrites some flaky tests to use `_scoped_library`.

In the future we'll probably migrate all of our torch.library tests to
use `_scoped_library`, but that's kind of annoying because we have
multiple thousands of LOC

I'm hoping this will deflake those tests; we'll see.
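
A minimal usage sketch of the new context manager (the namespace, schema, and assertions below are illustrative assumptions, not code from this PR): registrations made inside the scope are released deterministically when the scope exits, so they cannot leak into other tests.

```python
import torch

def test_my_op():
    # The library and all registrations made through it are cleared when the
    # with-block exits, regardless of Python object lifetimes.
    with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
        lib.define("foo(Tensor x) -> Tensor")
        lib.impl("foo", lambda x: x.clone(), "CPU")
        x = torch.randn(3)
        assert torch.equal(torch.ops.mylib.foo(x), x)
    # Outside the scope, mylib::foo is no longer registered.
```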
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318
Approved by: https://github.com/albanD
2024-01-26 22:30:51 +00:00
f129e3fe03 [inductor] Handle cum{sum,prod} on zero-dim tensors (#117990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117990
Approved by: https://github.com/lezcano
2024-01-26 22:21:42 +00:00
074ac822d5 [ONNX] Skip empty input test case in aten_mm (#118413)
Fixes #117718
Fixes #117725

It's actually a known issue in https://github.com/microsoft/onnxscript/pull/586, and we already exclude the empty-input test cases in aten_matmul. This PR follows that skip and adds aten_mm as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118413
Approved by: https://github.com/thiagocrepaldi
2024-01-26 22:06:57 +00:00
eee63ac845 [dynamo] move torch._C._get_cublas_allow_tf32 to constant_fold_functions (#118342)
Previously, I created a value match for torch._C._get_cublas_allow_tf32; it should just be in constant_fold_functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118342
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #118236
2024-01-26 22:00:00 +00:00
d41cfc92e6 [CI] simplify mergeability check workflow (#118415)
Test run:
https://github.com/pytorch/pytorch/actions/runs/7673050632/job/20914851421?pr=118415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118415
Approved by: https://github.com/PaliC, https://github.com/huydhn
2024-01-26 21:45:24 +00:00
84251d1d71 [ez] Windows log printing + save successful test logs (#118124)
When doing `print(f.read().decode(...))` it prints an extra newline, so manually splitlines and strip to see if that helps.

My guess is Windows line-ending differences.

Also always save log file regardless of success or failure

See 476b81a9bf for what it looks like now

Swapped to opening in text mode instead of binary, seems to be ok now.

42483193bf024983060a234dc0262f4840aef4b8 for example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118124
Approved by: https://github.com/huydhn
2024-01-26 21:14:25 +00:00
5c56822be2 [export] Various fixes to .module() (#118272)
Summary: While turning on .module() for all the export tests, I uncovered some bugs with .module() and while fixing them I ended up rewriting some of the code... Some of the bugs were:

* bad kwargs support on the unlifted module
* no support for user input mutations
* (at the commit hash I was working off of) no support for custom objects
* there were no tests on unlifting weights from cond/map submodules

Test Plan: CI

Differential Revision: D53075380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118272
Approved by: https://github.com/suo
2024-01-26 21:05:07 +00:00
2ed1b1747a Fix Auto Functionalize to handle specified default values (#118331)
Summary: When there were optionals with specified default values, the code was improperly handling the number of parameters, causing `IndexError: tuple index out of range`.

Test Plan: New tests.

Reviewed By: zou3519

Differential Revision: D53095812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118331
Approved by: https://github.com/zou3519
2024-01-26 20:31:38 +00:00
07499074bb Increasing session duration for AWS credentials for _rocm-test.yml (#118412)
The workflow _rocm-test.yml needs a longer session duration for its AWS role keys.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118412
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2024-01-26 19:32:24 +00:00
939008a268 Fix RuntimeError: NYI: Named tensors are not supported with the tracer (#118393)
This PR relands #108238 that was closed as stale due to CLA issues and also because the CI check has marked the PR as not mergeable.

Repro 1:

```python
import torch

def f(x):
    return x[x > 0]

jf = torch.jit.trace(f, torch.tensor(2., device="cuda"))
```
Error:

```bash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace
    traced = torch._C._create_function_from_trace(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<stdin>", line 2, in f
RuntimeError: NYI: Named tensors are not supported with the tracer
```

Repro2:

```python
import torch
import torch.nn.functional as F
from torch import nn
import copy

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs):
        x = copy.deepcopy(inputs) # RuntimeError: NYI: Named tensors are not supported with the tracer
        x = F.relu(x)
        return x

model = Net()
images = torch.randn(8, 28, 28)
torch.jit.trace(model, images)
```

Error 2:

```bash
Traceback (most recent call last):
  File "/opt/pytorch/test_deepcopy.py", line 18, in <module>
  File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace
    return trace_module(
           ^^^^^^^^^^^^^
  File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module
    module._c._create_method_from_trace(
  File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward
    result = self.forward(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/test_deepcopy.py", line 12, in forward
    x = F.relu(x)
        ^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/opt/pytorch/torch/storage.py", line 126, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NYI: Named tensors are not supported with the tracer
```

----
 #48054 RuntimeError: NYI: Named tensors are not supported with the tracer
 #49538 jit tracer doesn't work with unflatten layer
 #31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.
   - This bug was closed but still exists; multiple comments on it are still reporting the error. This is addressed here.

Likely fixes the following issues (but untested)

 #63297 Named tensor in tracer
 #2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx

Fix zero-dimensional tensors when used with jit.trace. They are currently assigned an empty set for names `{}`; this is not the same as "no name", so jit.trace bails with
  "NYI: Named tensors are not supported with the tracer".
This happens when trying to save a non-trivial model as ONNX, but the simplest repro I have seen is #48054 above, which has been added as test/jit/test_zero_dim_tensor_trace.py.

Test plan:
  New unit test added
  Broken scenarios tested locally
  CI

Fixes #48054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393
Approved by: https://github.com/zou3519
2024-01-26 19:31:23 +00:00
bfbb8d8220 Don't manually invoke atexit exit handlers in tests (#118409)
Fixes https://github.com/pytorch/pytorch/issues/104098

This is a bad idea because it runs all the exit handlers and messes with
global state that is necessary for other tests to run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118409
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
ghstack dependencies: #118152, #118309
2024-01-26 19:11:19 +00:00
728789d850 Deflake stream tests, part 2 (#118391)
I missed these the first time around, some more streams need to be
synchronized.

Fixes https://github.com/pytorch/pytorch/issues/112694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118391
Approved by: https://github.com/ydwu4, https://github.com/yanboliang
2024-01-26 19:10:53 +00:00
e696fa1ee7 [tp] enable rowwise embedding sharding in RowwiseParallel (#118242)
As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style, and adds tests to ensure it's working as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
2024-01-26 19:01:24 +00:00
dc8357b397 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.

The MaskPartial placement has the potential to support other ops'
sharding computation that requires a mask for semantic correctness.
It currently lives in the embedding ops, but we can move it to a
common place if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 19:01:24 +00:00
910b49c48b [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites the sharded embedding rule to use OpStrategy instead of the
rule, one step further toward getting rid of rules and consolidating the
embedding operator implementation, in preparation for the rowwise embedding
implementation, which will come in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-26 19:01:15 +00:00
25f72194e8 Realize inputs to DynamicScalar before unwrapping storage (#118125)
Fixes https://github.com/pytorch/pytorch/issues/118102

Unfortunately, the test still fails due to an unrelated problem https://github.com/pytorch/pytorch/issues/117665

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118125
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #117862
2024-01-26 18:08:03 +00:00
96d94f574e Fix several bugs related to unbacked SymInt codegen in inductor (#117862)
Let me tell you, this was a *journey.*

* When we repropagate through the FX interpreter in AOTAutograd, this will reallocate unbacked SymInts. We can eliminate all of these fresh allocations by appropriately asserting equalities on them and setting up replacements. See also https://github.com/pytorch/pytorch/issues/111950
* The `inner_fn` of Loops can contain references to unbacked SymInts. We must collect them to prevent DCE.
* Export naughtily accessed `_expr` when it should have accessed `expr` on SymNode. Fixed two sites of this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117862
Approved by: https://github.com/bdhirsh
2024-01-26 18:08:03 +00:00
89a0b1df51 fix lint for cudnn codes (#117091)
Fixes the lint issue described in https://github.com/pytorch/pytorch/pull/116759

@albanD Please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117091
Approved by: https://github.com/albanD
2024-01-26 17:53:22 +00:00
2842d3c9d3 [Nested Tensor] view: basic support for ragged_idx != 1 and _unsafe_view (#118317)
Use case: `_unsafe_view` is used in aot_autograd to create a view that doesn't register as a view:

eebe7e1d37/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py (L470-L476)

If a transposed nested tensor (i.e. NT with ragged_idx != 1) encounters this code path, it previously would fail for two reasons: 1) because `_unsafe_view` isn't registered, and 2) because ragged_idx != 1 is not supported. This PR adds support for `_unsafe_view` (completely reusing the implementation of `view`; this just registers `_unsafe_view` as another op using the same implementation). It also adds support for ragged_idx != 1, but only for trivial cases where inp._size == size (the use case used by aot_autograd).

Tests: verify that the result of `_unsafe_view` doesn't have a `_base`, and that simple views on transposed NTs work.

Differential Revision: [D53096814](https://our.internmc.facebook.com/intern/diff/D53096814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118317
Approved by: https://github.com/soulitzer
2024-01-26 17:29:37 +00:00
533637d9a3 Revert "Check if enable inside run call (#118101)"
This reverts commit 2abb812a78c0d3976e6eb10114716bcb163480ca.

Reverted https://github.com/pytorch/pytorch/pull/118101 on behalf of https://github.com/clee2000 due to broke periodic multigpu test some how 6fc015fedc ([comment](https://github.com/pytorch/pytorch/pull/118101#issuecomment-1912357321))
2024-01-26 16:41:56 +00:00
f1aef2c094 Don't check is_conj for _refs.linalg.svd (#117972)
The flag is not correctly set when PyTorch is compiled with GPU support resulting in failures in
`test_ops.py::test_python_ref_meta__refs_linalg_svd_cpu_complex`.

Use a similar approach to test_meta and skip the check for this function.

Workaround for #105068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117972
Approved by: https://github.com/lezcano
2024-01-26 15:24:29 +00:00
af8f37c2b6 Revert "Use SEQUENTIAL posix_fadvise on mmapped files (#117805)"
This reverts commit 401aa1a1deaee19909c957d7d56d91341018b4dc.

Reverted https://github.com/pytorch/pytorch/pull/117805 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/117805#issuecomment-1912204403))
2024-01-26 14:59:58 +00:00
cyy
6da0e7f84b [Clang-tidy header][17/N] Apply clang-tidy on headers in torch/csrc/cuda (#117829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117829
Approved by: https://github.com/albanD
2024-01-26 13:33:24 +00:00
8ff55c7e68 Clarified sampling process of torch.randn for complex dtypes. (#118315)
Fixes #118269.

Clarified the docs of `torch.randn` and `torch.randn_like`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118315
Approved by: https://github.com/lezcano
2024-01-26 13:05:19 +00:00
b66c4eda61 [Inductor] Add Thread Number Checker in scatter_reduce_ fallback for CPP backend (#118278)
**Summary**
Follow-up of https://github.com/pytorch/pytorch/pull/108220, which improves the performance of `basic_gnn_gin`, `basic_gnn_sage` and `basic_gnn_gcn` in the multi-thread test cases. However, it causes a performance regression for these 3 models in the single-thread test case, as reported in https://github.com/pytorch/pytorch/issues/117740. Fix the single-thread issue in this PR by adding a thread-number check to decide whether to fall back to `scatter_reduce_` or not.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_scatter_using_atomic_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118278
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-01-26 12:43:25 +00:00
0857a3a753 [c10d_functional] fix an issue where mutation on views fails in inductor (#118333)
`_CollectiveKernel.create_inplace` expresses mutation with the newly introduced `MutationOutput` which requires the `layout` of the input. Currently, there's a bug where if the input is a view, `inp.layout` fails. This PR fixes the issue by unwrapping the input if it's a view.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118333
Approved by: https://github.com/wanchaol
2024-01-26 11:13:30 +00:00
4d0b471389 fix key error in pre_grad fx_passes_numeric_check (#118325)
Summary:
```
I0125 121749.865 pyper_config_utils.py:8225] torchdynamo pyper config = TorchDynamoConfig(backend='inductor', optimize_ddp=False, log_compile_graph=False, inductor_config=TorchInductorConfig(enable_cudagraph=False, max_autotune=False, max_autotune_pointwise=True, max_autotune_gemm=False, search_autotune_cache=False, autotune_in_subproc=False, aggressive_fusion=False, shape_padding=True, permute_fusion=False, epilogue_fusion_first=False, debug=True, triton=None, trace_enabled=False, log_kernel_source=False, split_cat_fx_passes=False, group_fusion=False, batch_fusion=False, coordinate_descent_tuning=False, coordinate_descent_check_all_directions=False, coordinate_descent_search_radius=1, layout_optimization=True, pre_grad_fusion_options={}, post_grad_fusion_options={}, max_pointwise_cat_inputs=4, fx_passes_numeric_check={}), automatic_dynamic_shapes=True)
```
In trainer
```
I0125 12:58:51.832000 4011.139732263132160 torchdynamo_wrapper.py:291  trainer:0:1 ] [pt2] creating torchdynamo backend wrapper with settings TorchDynamoConfig(backend='inductor', optimize_ddp=False, log_compile_graph=False, inductor_config=TorchInductorConfig(enable_cudagraph=False, max_autotune=False, max_autotune_pointwise=True, max_autotune_gemm=False, search_autotune_cache=False, autotune_in_subproc=False, aggressive_fusion=False, shape_padding=True, permute_fusion=False, epilogue_fusion_first=False, debug=True, triton=None, trace_enabled=False, log_kernel_source=False, split_cat_fx_passes=False, group_fusion=False, batch_fusion=False, coordinate_descent_tuning=False, coordinate_descent_check_all_directions=False, coordinate_descent_search_radius=1, layout_optimization=True, pre_grad_fusion_options={}, post_grad_fusion_options={}, max_pointwise_cat_inputs=4, fx_passes_numeric_check={}), automatic_dynamic_shapes=True) #ai_training_job_id="febe34d9-b2fb-493e-a5cc-6a0b1dc85ad4" #ai_training_local_rank="1" #ai_training_role_rank="1" #mast_job_attempt="2" #mast_job_name="f525072920-TrainingApplication"
...
if config.fx_passes_numeric_check["pre_grad"]:
```

https://www.internalfb.com/diff/D52826442?dst_version_fbid=1115735309429172&transaction_fbid=682438900759710

https://www.internalfb.com/diff/D51838043?dst_version_fbid=336373395892373&transaction_fbid=349901787874069

This diff first fixes the key error to restore broken tests.  Its pyper changes can be addressed later.

https://www.internalfb.com/code/fbsource/[72c19313ed73]/fbcode/caffe2/torch/_inductor/config.py?lines=142-147

Test Plan: buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_mimo_cmf_deterministic_ne_pt2_training_platform__canary_offline_training-launcher -- --build-fbpkg --run-disabled --tests test

Reviewed By: yusuo

Differential Revision: D53102344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118325
Approved by: https://github.com/mengluy0125
2024-01-26 11:02:12 +00:00
8dd1be49b7 [Inductor] Use sleef implementation for CPP backend acosh codegen (#118350)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/118267. The current cpp backend uses `f"({x} + ({x}*{x} - {vec_one}).sqrt()).log()"` to calculate `acosh`; the issue happens when the input is a large negative value like `-910685.8125`. In this case, `(x*x - 1).sqrt() + x` equals 0, and `0.log()` returns `-inf`. However, based on the documentation (https://pytorch.org/docs/stable/generated/torch.acosh.html), negative inputs should return `NaN`. Use the sleef acosh implementation to fix this issue.
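
A small numeric check of the cancellation described above (the input value is the one from the linked issue): the naive log-based formula collapses to `-inf` for large negative inputs, while `torch.acosh` correctly returns `NaN`.

```python
import torch

x = torch.tensor(-910685.8125)
# sqrt(x*x - 1) is approximately -x for large negative x, so the sum cancels to 0
naive = torch.log(x + torch.sqrt(x * x - 1.0))
print(naive)           # tensor(-inf)
print(torch.acosh(x))  # tensor(nan) -- acosh is undefined for inputs < 1
```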

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_acosh_with_negative_large_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118350
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-26 10:19:40 +00:00
2ea38498b0 [FSDP][BE] Only show state_dict log when the debug level is detail (#118196)
As title

Differential Revision: [D53038704](https://our.internmc.facebook.com/intern/diff/D53038704/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118196
Approved by: https://github.com/rohan-varma, https://github.com/wz337
ghstack dependencies: #118197, #118195
2024-01-26 09:52:36 +00:00
4f4e61bb75 [DCP] Add tests to demonstrate DCP checkpoint conversion (#117773)
As title

Differential Revision: [D52854759](https://our.internmc.facebook.com/intern/diff/D52854759/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117773
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #116248, #117772
2024-01-26 09:39:10 +00:00
644bc69530 [DCP] Allow users to save and load without creating storage reader and writer (#117772)
Right now the DCP API requires users to create a StorageWriter and StorageReader for every API call. This PR allows users to pass only the checkpointer_id (a path) and use it to read/write a checkpoint without creating a StorageReader and Writer.

Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772
Approved by: https://github.com/wz337
ghstack dependencies: #116248
2024-01-26 09:08:35 +00:00
fc30bd3b7b Revert "[dtensor] rewrite embedding ops using op strategy (#118079)"
This reverts commit e599a0879684abedec2a28b08b822fd4a4219105.

Reverted https://github.com/pytorch/pytorch/pull/118079 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
bfb5e7642e Revert "[dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)"
This reverts commit 8cc02b46c33b5192289e4cf64fa55d685127bfb8.

Reverted https://github.com/pytorch/pytorch/pull/118080 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
bc67f87559 Revert "[tp] enable rowwise embedding sharding in RowwiseParallel (#118242)"
This reverts commit 7a9012d7e847a6265e70873e9baab70838edd601.

Reverted https://github.com/pytorch/pytorch/pull/118242 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/118079#issuecomment-1911681293))
2024-01-26 08:47:14 +00:00
2c9a90cde6 [ROCm] backward compatible type enums (#118137)
Fixes builds of pytorch using unreleased ROCm packages that are missing type enums introduced in ROCm 6.0 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118137
Approved by: https://github.com/xw285cornell, https://github.com/anupambhatnagar
2024-01-26 08:40:13 +00:00
f8e14f3b46 [PyTorch][Vulkan] Clean up aten::stack (#118314)
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add `tensor.dim() == 0` tests.
3. Address `readability-container-size-empty` and `performance-unnecessary-copy-initialization` linter errors.

Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (1d0b920e0)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*stack*"
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/ops/Unsqueeze.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
3 additional file change events
Buck UI: https://www.internalfb.com/buck2/98bb3bfa-a1d1-440e-8724-b4990c9cc7ca
Network: Up: 1.4MiB  Down: 377KiB  (reSessionID-6eccf420-3951-4942-9350-998803589b8d)
Jobs completed: 17. Time elapsed: 42.6s.
Cache hits: 38%. Commands: 8 (cached: 3, remote: 0, local: 5)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *stack*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.stack_invalid_inputs
[       OK ] VulkanAPITest.stack_invalid_inputs (27 ms)
[ RUN      ] VulkanAPITest.stack_0d
[       OK ] VulkanAPITest.stack_0d (28 ms)
[ RUN      ] VulkanAPITest.stack_1d
[       OK ] VulkanAPITest.stack_1d (1 ms)
[ RUN      ] VulkanAPITest.stack_2d
[       OK ] VulkanAPITest.stack_2d (148 ms)
[ RUN      ] VulkanAPITest.stack_3d
[       OK ] VulkanAPITest.stack_3d (354 ms)
[----------] 5 tests from VulkanAPITest (561 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (561 ms total)
[  PASSED  ] 5 tests.
```

Reviewed By: copyrightly, liuk22

Differential Revision: D53071188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118314
Approved by: https://github.com/liuk22
2024-01-26 04:28:06 +00:00
2b1ee9be7a [executorch hash update] update the pinned executorch hash (#118339)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118339
Approved by: https://github.com/pytorchbot
2024-01-26 04:26:38 +00:00
0c5da6100f [PyTorch][Vulkan] Clean up aten::unsqueeze (#118311)
Summary:
After D50347338, we already support zero-dim tensor input, which was my original task. As a result, this diff doesn't add or change functionality; it just cleans up the following:
1. Fix TORCH_CHECK to only allow `tensor.dim() <= 3`. Previously, it was a no-op since it didn't use `&&`.
2. Add 0->1 `tensor.dim()` tests.
3. Remove `dim == 0` case from shader since that path is never executed. The `cpp` code sends the input to `submit_copy` instead.

Test Plan:
Tested on OD.
```
[jorgep31415@29786.od /data/sandcastle/boxes/fbsource (c66693c95)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
File changed: fbcode//caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
File changed: fbsource//xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp
File changed: fbsource//xplat/caffe2/aten/src/ATen/native/vulkan/glsl/unsqueeze.glsl
Buck UI: https://www.internalfb.com/buck2/16cf8f59-e535-493b-b123-5952ef8f1453
Network: Up: 21KiB  Down: 1.4MiB  (reSessionID-1219eefd-e78b-4bfd-aef8-8e4b38da82f8)
Jobs completed: 8. Time elapsed: 37.8s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 1, local: 2)
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 10 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 10 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_0dto1d_dim0
[       OK ] VulkanAPITest.unsqueeze_0dto1d_dim0 (61 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (0 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (110 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (16 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (58 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (16 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 10 tests from VulkanAPITest (270 ms total)

[----------] Global test environment tear-down
[==========] 10 tests from 1 test suite ran. (270 ms total)
[  PASSED  ] 10 tests.

```

Also, to improve my confidence in unit tests, I modified [force_flush.py](https://www.internalfb.com/code/fbsource/[6e606c6f62dafd2121e78ffe14ae12f1b6d8d405]/fbcode/wearables/camera/ml/pytorch_vulkan_native/demo/force_flush.py) to run several combinations of `aten::unsqueeze` on OD.

Verified these work as expected.
```
torch.zeros([])
torch.randn([])
torch.rand([])
torch.ones([])
torch.tensor(0, dtype=torch.float)
```

Found that Vulkan in general does not support the following. That's ok though since it's technically a 1d tensor which is not part of my task.
```
torch.tensor([])
```

Differential Revision: D53071189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118311
Approved by: https://github.com/liuk22
2024-01-26 04:22:54 +00:00
8467de4e97 Fix kaiser_window for lower precision data types on CPU (#117345)
Fixes #117230.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117345
Approved by: https://github.com/jgong5, https://github.com/soumith
2024-01-26 03:26:12 +00:00
eqy
ef29fe745f [CUDA] Add missing TF32 annotation to test_uint4x2_mixed_mm (#118143)
Addresses numerical mismatches seen on architectures with TF32.

CC @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118143
Approved by: https://github.com/nWEIdia, https://github.com/jansel
2024-01-26 03:23:22 +00:00
b599f5608c Fix mergeability check for ghstack PRs (#118258)
# Changes
* introduce `--check-mergeability` trymerge flag that attempts to merge PR locally, using the same merge logic as the mergebot, but requires just a read-only `GITHUB_TOKEN` and git repo.
* change mergeability workflow to utilize the new --check-mergeability logic

# Alternatives considered

1.
> Rewrite `https://github.com/pytorch/test-infra/actions/workflows/pr-dependencies-check.yml` to correctly support partially merged ghstacks.

That would be a slightly better approach, but ROI is lower, as it requires reimplementing trymerge logic and additional effort to consolidate the codebase (trymerge lives in pytorch repo).

`pr-dependencies-check.yml` still produces human-readable results for partially merged ghstack prs (even if it falsely reports them as non-mergeable).

2.

> Instead of introducing new trymerge flag, use existing flags, including `--dry-run`.

That didn't work, as no combination of existing flags skips the rule checks and ROCKSET lookups.

# Testing

1. Manual testing  `trymerge.py --check-mergeability`  on the regular and ghstack PRs:

```
export GITHUB_TOKEN=
export GIT_REPO_DIR=`pwd`
export GITHUB_REPOSITORY=pytorch/pytorch
export GIT_REMOTE_URL=https://github.com/pytorch/pytorch

# Test 1 (2 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  117862
Skipping 1 of 2 PR (#117859) as its already been merged

echo $?
0

# Test 2 (3 prs, 1 is closed)
python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  118125
Skipping 1 of 3 PR (#117859) as its already been merged

echo $?
0

# Test 3 (3 prs, intentional conflicts introduced into `main`):

python3 ../pytorch/.github/scripts/trymerge.py --check-mergeability  118125
Skipping 1 of 3 PR (#117859) as its already been merged
stdout:
Auto-merging torch/_inductor/ir.py
Auto-merging torch/_inductor/lowering.py
CONFLICT (content): Merge conflict in torch/_inductor/lowering.py
error: could not apply 66ba5b8792f... Realize inputs to DynamicScalar before unwrapping
...
RuntimeError: Command `git -C /Users/ivanzaitsev/pytorch2 cherry-pick -x 66ba5b8792fa076c4e512d920651e5b6b7e466f4` returned non-zero exit code 1
```

2.  Workflow run:
https://github.com/pytorch/pytorch/actions/runs/7660736172/job/20878651852?pr=118258

<img width="516" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/28fbf0d2-ac2a-4518-b41d-b32b41373747">
<img width="621" alt="image" src="https://github.com/pytorch/pytorch/assets/108101595/ddbf8566-a417-43ec-9d0e-f623f4a71313">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118258
Approved by: https://github.com/PaliC, https://github.com/huydhn
2024-01-26 03:15:56 +00:00
4e456fd95b [AOTI] Support scalar to tensor in the ABI-compatible mode (#118024)
Differential Revision: [D53019485](https://our.internmc.facebook.com/intern/diff/D53019485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118024
Approved by: https://github.com/ezyang
2024-01-26 03:15:05 +00:00
66c3152e36 [CI] Build docker on larger runners (#118167)
Otherwise it takes 1+h to build CUDA12.1 docker
- Limit UCC builds to just sm_52(M60) and sm_86(A10G), which I think has the biggest impact
- Replace hardcoded `-j6` build parallelism with more dynamic `-j$[$(nproc) - 2]`
- Remove redundant check about Ubuntu-14.04
- Added `DOCKER_BUILDKIT` to parallelize the builds

As a result, docker build time drops from over 1 hour to 35 min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118167
Approved by: https://github.com/huydhn
2024-01-26 02:28:25 +00:00
385d8b32fc Update PocketFFT submodule (#118348)
Accidentally downgraded by force merge of https://github.com/pytorch/pytorch/pull/117804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118348
Approved by: https://github.com/kit1980
2024-01-26 02:01:06 +00:00
3cdd4e236e [inductor][easy] dump triton kernel names in the log (#118313)
This may help debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118313
Approved by: https://github.com/desertfire
2024-01-26 02:00:04 +00:00
7a9012d7e8 [tp] enable rowwise embedding sharding in RowwiseParallel (#118242)
As titled, this PR enables rowwise embedding sharding in the
RowwiseParallel style, and adds tests to ensure it's working as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118242
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079, #118080
2024-01-26 01:36:24 +00:00
8cc02b46c3 [dtensor] implement dim-0 (row) embedding sharding with MaskPartial (#118080)
This PR adds support for rowwise sharded embedding by adding a
MaskPartial placement that inherits from the default partial placement
and overrides the Partial contracts to construct the mask and release
the mask after the reduction.

The MaskPartial placement has the potential to support other ops'
sharding computation that requires a mask for semantic correctness.
It currently lives in the embedding ops, but we can move it to a
common place if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118080
Approved by: https://github.com/tianyu-l
ghstack dependencies: #118079
2024-01-26 01:36:24 +00:00
3d062f9abe Revert "[pytorch][kineto] log process group config in distributed info (#117774)"
This reverts commit 9c1348feb3de872f7cabd807abbc228e7192cd46.

Reverted https://github.com/pytorch/pytorch/pull/117774 on behalf of https://github.com/aaronenyeshi due to This diff is breaking internal jobs, but has been internally reverted ([comment](https://github.com/pytorch/pytorch/pull/117774#issuecomment-1911251092))
2024-01-26 01:10:31 +00:00
6596a3f23d [Export] Remove ScriptObjectMeta (#118241)
Summary: As title. Use CustomObjArgument as ScriptObjectMeta

Test Plan: CIs

Reviewed By: zhxchen17

Differential Revision: D53062230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118241
Approved by: https://github.com/zhxchen17
2024-01-26 00:37:19 +00:00
401aa1a1de Use SEQUENTIAL posix_fadvise on mmapped files (#117805)
In theory this tells the system that we will access the file sequentially, which allows prefetching of future blocks. In practice it doubles the read-ahead size on Linux (which effectively doubles the read sizes).

Without this, CUDA uploads of files that aren't already in FS cache, using mmapped files (safetensors) as source, run at ~1 GB/s (from an SSD that has ~7 GB/s read speed...).

With this, they run at ~1.5 GB/s which is still bad but better than before!

It is possible to increase the read performance further by touching the pages from multiple threads; in fact, when the tensors loaded from the file are used by the CPU, we get fairly good load performance (~5 GB/s), which appears to be because multiple threads page fault and trigger more concurrent reads which improves SSD read throughput... however, this is not the case for CUDA uploads, and it is difficult to make that change in a generic way because it's unclear what the usage pattern of the input file is going to be.

All of the numbers above are taken on Samsung 990 Pro SSD, on Linux kernel 6.5 with FS cache cleared between every attempt to load a file. The file is loaded via `safetensors.safe_open` which uses UntypedTensor.from_file to load the file into memory, which in turn uses MapAllocator.cpp.

I felt safe doing this change unconditionally but please let me know if you'd like to see a separate allocator flag for this, propagated through to UntypedTensor. Note that the fadvise API is not available on macOS.
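
For reference, a minimal illustration of the hint being described, using Python's `os` module, which exposes the same POSIX call (Linux/Unix only; the file name below is just an assumed example):

```python
import mmap
import os

fd = os.open("model.safetensors", os.O_RDONLY)  # assumed example file
# Tell the kernel we will read sequentially so it can use a larger read-ahead window.
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
size = os.fstat(fd).st_size
buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)  # subsequent page faults benefit from read-ahead
```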
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117805
Approved by: https://github.com/mikaylagawarecki
2024-01-26 00:26:57 +00:00
de9ddd19a5 Various CI settings (#117668)
Test [ci-verbose-test-logs] (this worked: the test logs print while running, are interleaved, and are really long).

Settings for no timeout (step timeout still applies, only gets rid of ~30 min timeout for shard of test file) and no piping logs/extra verbose test logs (good for debugging deadlocks but results in very long and possibly interleaved logs).

Also allows these to be set via the PR body if the label name is in brackets, e.g. [label name], as in the test above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117668
Approved by: https://github.com/huydhn
2024-01-26 00:17:29 +00:00
8c167f9fc3 [CMake] Explicitly error out if CuDNN older than 8.5 (#118235)
Also update README.md
Fixes https://github.com/pytorch/pytorch/issues/118193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118235
Approved by: https://github.com/zou3519
2024-01-25 23:41:04 +00:00
71757093c5 [dynamo] avoid graph break on torch.backends.cuda.matmul.allow_tf32 (#118236)
Before the PR, we have a graph break for the following test:
```python
    def test_cublas_allow_tf32(x):
        if torch.backends.cuda.matmul.allow_tf32:
            return x.sin() + 1

        return x.cos() - 1
```

In this PR, we first add "torch.backends.cuda" to MOD_INLINELIST to trace through the python binding and get the actual call torch._C._get_cublas_allow_tf32, where it's already a TorchInGraphVariable. Because _get_cublas_allow_tf32 is accessing the same variable as at::globalContext().allowTF32CuBLAS(), which is guarded by dynamo as a global state [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/dynamo/guards.cpp#L443), we could safely assume it returns a ConstantVariable during tracing.

After this pr, we get the following graph:
```python
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_x_ : torch.Tensor):
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /home/yidi/local/pytorch/test/dynamo/test_functions.py:515 in test_cublas_allow_tf32, code: return x.cos() - 1
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         cos = l_x_.cos();  l_x_ = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sub = cos - 1;  cos = None
[2024-01-24 15:31:01,501] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (sub,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118236
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-01-25 23:40:23 +00:00
b5c9623835 [export] Add node meta into UnflattenedModule (#118138)
Summary: Reland of #117686

Test Plan: CI

Differential Revision: D53012028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118138
Approved by: https://github.com/zhxchen17
2024-01-25 23:37:41 +00:00
a93940b5db [export] Allow constant outputs + None input/outputs (#117894)
Added support for constant outputs. We will just embed the constant directly into the output, like `return (x, 1)`.
Also adds support for None inputs/outputs. For None inputs, we address them the same way we do constants: a placeholder with no users is inserted into the graph, and the None is embedded into whatever operator uses it. For None outputs, we also address them the same way we do constants: we embed the None into the output, like `return (x, None)`.

Differential Revision: D52881070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117894
Approved by: https://github.com/zhxchen17
2024-01-25 23:37:34 +00:00
24133e44b1 Fix return type hint for list types (#118238)
All single element list types are `Tensor[]` so they will always be Tuple.
I don't know of any way to easily access the pyi type and compare that to a real run so no testing here :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118238
Approved by: https://github.com/ezyang
2024-01-25 23:35:20 +00:00
52c5803088 [NestedTensor] Support ragged_idx != 1 in pointwise ops (#118157)
This PR allows pointwise ops to operate on tensors with ragged_idx != 1. It does this by passing the ragged_idx metadata into the construction of the returned NestedTensor when computing pointwise ops. The assumption is that pointwise ops can operate directly on the values tensors, and the resulting tensor should have all the same metadata properties as the input tensors. For binary ops, a test is added to verify that two tensors with different ragged_idx cannot be added.

Previously:
* unary pointwise ops would error out when performed on nested tensors with ragged_idx != 1
* binary pointwise ops would produce tensors with nonsense shapes

Differential Revision: [D53032641](https://our.internmc.facebook.com/intern/diff/D53032641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118157
Approved by: https://github.com/jbschlosser
2024-01-25 23:34:15 +00:00
91d5f94f85 [FSDP] Idempotent reshard (#117997)
Address the assertion error "Expects storage to be allocated" by making reshard idempotent (https://github.com/pytorch/pytorch/issues/117510).

```pytest test/distributed/fsdp/test_fsdp_fine_tune.py -k test_parity_with_non_frozen_fsdp```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117997
Approved by: https://github.com/awgu
2024-01-25 23:29:23 +00:00
b10b08227a Passes process group to _all_gather_keys in dcp.load (#118301)
As title

Fixes #118277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118301
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-01-25 23:07:57 +00:00
02a411d4a6 [mergebot] Dry run for labels + easier to read Dr CI result (#118240)
Enable dry run for labels so we can run trymerge locally with dry run without actually affecting the PR.

Make Dr.CI results easier to read (previously a massive json dump, now just the job names + ids, in a nicer format)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118240
Approved by: https://github.com/huydhn
2024-01-25 23:06:43 +00:00
26f1da0b1b Fix node traversal when setting up stacktrace preservation hooks (#118252)
We only want to traverse over each node in the graph exactly once, and we do that by inserting nodes into the "seen" set. The issue is that we forget to check the "seen" set when inserting the root nodes. Typically that is not a problem, because the root nodes come from the different outputs and thus usually correspond to different nodes. With split_with_sizes, though, all of the outputs correspond to the same node, and this leads to the node being iterated over 3 times, and 3 sets of hooks being attached to the same node.
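
A generic sketch of the corrected traversal pattern (illustrative only, not the actual autograd hook code): duplicate roots must also go through the "seen" check, otherwise the same node is visited, and hooked, more than once.

```python
def iter_nodes_once(roots):
    seen = set()
    stack = []
    for root in roots:  # e.g. the grad_fn of each output; may repeat for split_with_sizes
        if root is not None and root not in seen:  # the check that was missing for roots
            seen.add(root)
            stack.append(root)
    while stack:
        node = stack.pop()
        yield node  # attach hooks exactly once per node
        for child, _ in getattr(node, "next_functions", ()):
            if child is not None and child not in seen:
                seen.add(child)
                stack.append(child)
```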

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118252
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234, #118249
2024-01-25 22:56:20 +00:00
b8bd3bb30a Fix aot_autograd seq_nr logic (#118249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118249
Approved by: https://github.com/zou3519
ghstack dependencies: #117552, #118234
2024-01-25 22:56:20 +00:00
3c77a3ed03 export ATen/native/sparse/*.h (#118274)
We are trying to adapt `SparsePrivateUse1` in our code. However, I found that the sparse stub (`sparse_stup`) has not been exposed yet, which makes it impossible for me to implement and register the stub. I hope that the header files in this directory can be exposed. @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118274
Approved by: https://github.com/ezyang
2024-01-25 22:47:39 +00:00
fae569b4f2 [dynamo] avoid graph break on tensor.element_size() (#118229)
Before this PR, for the following code, we have a graph break `torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor int call_method element_size`
```python
import torch
def f(x):
  return x.sin().element_size() + x.sin()

x = torch.randn(2, 2)
torch.compile(f, backend="eager", fullgraph=True)(x)
```
After this PR, we got the following graph, where element_size() is baked in as a constant.
```python
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_x_ : torch.Tensor):
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: /home/yidi/local/pytorch/test.py:4 in f, code: return x.sin().element_size() + x.sin()
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sin = l_x_.sin()
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sin_1 = l_x_.sin();  l_x_ = None
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add = 4 + sin_1;  sin_1 = None
[2024-01-24 13:49:02,814] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118229
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/anijain2305
2024-01-25 22:28:37 +00:00
bd6bf97ea5 stop using torch.Tensor in dynamo/test_export_mutations.py (#118287)
This causes test flakiness, because torch.Tensor allocates a Tensor with
uninitialized memory.
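
A tiny illustration of the point above (purely for context): the bare `torch.Tensor` constructor returns uninitialized memory, so its contents are nondeterministic, whereas explicit factory functions give reproducible values.

```python
import torch

flaky = torch.Tensor(2, 3)   # uninitialized: arbitrary garbage values
stable = torch.zeros(2, 3)   # deterministic contents
random = torch.randn(2, 3)   # random, but drawn from a controlled RNG
```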

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118287
Approved by: https://github.com/ydwu4
2024-01-25 22:21:41 +00:00
f7f7283ec7 Skip test_none_names_refcount under Dynamo-wrapped CI (#118309)
Fixes https://github.com/pytorch/pytorch/issues/117716
Dynamo does some things that modifies the refcount. Skipping this test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118309
Approved by: https://github.com/ydwu4, https://github.com/yanboliang, https://github.com/albanD
ghstack dependencies: #118152
2024-01-25 22:21:22 +00:00
4e45d791e7 Remove set_ exclusion in FakeTensor dispatch cache (#118154)
Summary: Now that set_ is marked as a view op, this special case is no longer necessary

Test Plan: CI exposed the need for this special case in the first place, so I think we can just rely on the existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118154
Approved by: https://github.com/bdhirsh
2024-01-25 21:54:36 +00:00
13bdd6c4e2 Revert "[Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)"
This reverts commit 3221585af0f78cee20f1fb739e140ab59a517ee1 as this
commit was already landed as 83581f91ca9c3b78b0f8dc3a0a2c1cb229d20e99.
2024-01-25 13:41:39 -08:00
ea851eb027 Uses Serial Loader for DCP.save when more then one thread is used. (#118114)
The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.

Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks.

Before this PR
9.475 s
8 threads

After this PR
1.632 s
8 threads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
2024-01-25 21:11:16 +00:00
708e6241ed Fix sympy_subs to preserve integer and non-negative properties. (#118150)
This diff introduces the following changes:
1. Fix sympy_subs to preserve the integer and non-negative properties of the replaced symbol when the replacement is a string.
Why is this needed?
I was compiling the expression
 x*abs(y)  where y = -2.
 What happens is that this expression is passed as ``s1*abs(s0)``, then s0 is replaced with ks0 by a call to sympy_subs,
 but sympy_subs used to replace s0 (integer=False, nonnegative=False) with ks0 (integer=True, nonnegative=True),
 resulting in ``x*abs(ks0) = x*ks0``, which is wrong (see the small sympy sketch after this list).

2. Rename sympy_symbol to sympy_index_symbol to make it explicit.
3. Add an assertion that the replaced expression is not passed as a string but is always a sympy expression.
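
A small sympy check of the property issue described in item 1 (the symbol names are illustrative): replacing a generic symbol with one declared integer and non-negative lets sympy simplify `Abs(ks0)` to `ks0`, silently dropping the absolute value even though the original value could be negative (e.g. -2).

```python
import sympy

s0 = sympy.Symbol("s0")
s1 = sympy.Symbol("s1")
expr = s1 * sympy.Abs(s0)

ks0_plain = sympy.Symbol("ks0")
ks0_typed = sympy.Symbol("ks0", integer=True, nonnegative=True)

print(expr.subs(s0, ks0_plain))  # s1*Abs(ks0) -- sign information preserved
print(expr.subs(s0, ks0_typed))  # ks0*s1      -- Abs dropped because ks0 claims nonnegativity
```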

Fixes https://github.com/pytorch/pytorch/issues/117757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118150
Approved by: https://github.com/ezyang
2024-01-25 20:54:55 +00:00
2de24c11f6 [inductor] Slightly faster memory allocation on CUDA (#118255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118255
Approved by: https://github.com/peterbell10
ghstack dependencies: #118065, #118070, #118171
2024-01-25 20:49:14 +00:00
3e76a0e9c2 Install an excepthook which annotates exceptions with rank information when distributed is initialized (#118190)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118190
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-01-25 20:43:18 +00:00
1565d58ad9 [inductor] correctly generate grid info for benchmark_kernel (#118202)
Previously, we generated the grid argument with tree.numel for
a benchmark TritonKernel. This was not correct, because it
didn't match the launch config used for profiling and running.

This PR fixed the issue by emitting the grid value computed
by the kernel's grid_fn, which is used by the profiler and
the kernel's runner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118202
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-01-25 20:37:44 +00:00
b47cf4182e Fix support non tensor inputs to operator.pos function (#118251)
Fixes #118231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118251
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-01-25 20:37:40 +00:00
476b744e23 [AOTI] Forward fix https://github.com/pytorch/pytorch/pull/117989 (#118291)
Summary: https://github.com/pytorch/pytorch/pull/117989 disabled use_thread_local_cached_output_tensor for CUDA, but that is not always correct, because we can still have CPU tensors when running CUDA models.

Differential Revision: D53089956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118291
Approved by: https://github.com/Skylion007, https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-25 20:30:17 +00:00
1f6aa4b336 [mypy] Enable follow_imports = normal for mypy-torch.backends.* (#116311)
Summary:

Test Plan:

```
lintrunner --take MYPYINDUCTOR --all-files
ok No lint issues.

lintrunner -a
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311
Approved by: https://github.com/int3
2024-01-25 20:17:22 +00:00
3221585af0 [Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)
With this PR, if environment variable `ONNXRT_DUMP_PATH` is set, the backend onnxrt dumps every onnx it creates as well as the graph_module stored as a text file. This allows users to see what onnx file is generated when this backend is used.
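
A hedged usage sketch based on the description above (the dump-path value is an assumed example; the `onnxrt` backend requires onnxruntime to be installed):

```python
import os
import torch

os.environ["ONNXRT_DUMP_PATH"] = "/tmp/onnxrt_dump/model"  # assumed to act as a file prefix

def f(x):
    return torch.sin(x) + 1.0

compiled = torch.compile(f, backend="onnxrt")
compiled(torch.randn(4))  # the ONNX models built by the backend are dumped alongside the graph_module text
```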

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117551
Approved by: https://github.com/thiagocrepaldi, https://github.com/wschin
2024-01-25 20:00:14 +00:00
9768f73cb2 [AOTI] Skip test_index_put_with_none_index on rocm (#118290)
Summary: The test was added in https://github.com/pytorch/pytorch/pull/118187 and is failing on rocm.

Differential Revision: [D53089729](https://our.internmc.facebook.com/intern/diff/D53089729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118290
Approved by: https://github.com/DanilBaibak
2024-01-25 19:36:00 +00:00
83581f91ca [Dynamo, ONNX] use environment variable ONNXRT_DUMP_PATH to dump onnx models created by onnxrt backend (#117551)
With this PR, if environment variable `ONNXRT_DUMP_PATH` is set, the backend onnxrt dumps every onnx it creates as well as the graph_module stored as a text file. This allows users to see what onnx file is generated when this backend is used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117551
Approved by: https://github.com/thiagocrepaldi, https://github.com/wschin
2024-01-25 18:53:41 +00:00
bb3db079b1 [Export] Introduce class_fqn into CustomObjArgument (#118158)
Summary:
Class FQN is needed when unpacking a CustomObj instance.
For all other Arguments, e.g. Tensor, TensorList, SymInt, we always know their exact type; however, CustomObjArgument had an opaque type.
Adding this field also helps unveil the type of this opaque object.

Test Plan: CI

Differential Revision: D53029847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118158
Approved by: https://github.com/zhxchen17
2024-01-25 18:44:25 +00:00
fed0f2946f [FSDP][BE] Fix optim_state_dict_to_load doc errors (#118195)
As title

Differential Revision: [D53038703](https://our.internmc.facebook.com/intern/diff/D53038703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118195
Approved by: https://github.com/rohan-varma, https://github.com/wz337
ghstack dependencies: #118197
2024-01-25 18:29:04 +00:00
01388d0790 [dynamo] Slightly better error message if key not in dict (#117902)
Was debugging an export issue, and currently when `key` does not exist in `self.items`, the error message is
```
  File "/opt/pytorch/torch/_dynamo/variables/dicts.py", line 208, in getitem_const
    return self.items[key]
           ~~~~~~~~~~^^^^^
torch._dynamo.exc.InternalTorchDynamoError: <torch._dynamo.variables.dicts.ConstDictVariable._HashableTracker object at 0x7fd7697cbf90>
```
This PR changes it to be the following.
```
File "/data/users/angelayi/pytorch/torch/_dynamo/variables/dicts.py", line 199, in getitem_const
    raise KeyError(arg.value)
torch._dynamo.exc.InternalTorchDynamoError: shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117902
Approved by: https://github.com/williamwen42
2024-01-25 18:13:40 +00:00
e1f9eca113 [DeviceMesh] Reuse sub_group pg if exists (#115716)
Currently, we create a new_group for the sub_group pg during mesh initialization. This PR changes that so we will:
1) re-use the sub_group pg if it exists,
2) create a new sub_group pg if it does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115716
Approved by: https://github.com/wanchaol
2024-01-25 18:07:16 +00:00
a289dba7b1 Add missing cuda libraries for context_gpu_test (#117493)
This adds some missing cuda (curand and cublas) libraries that are required for the context_gpu_test to link.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117493
Approved by: https://github.com/ezyang
2024-01-25 18:04:23 +00:00
eb054cc012 Revert "Fix Auto Functionalize to handle specified default values (#118035)"
This reverts commit 2d7a360911fb7b27be82c51ca86b4b34b6f1b087.

Reverted https://github.com/pytorch/pytorch/pull/118035 on behalf of https://github.com/zou3519 due to needs internal changes, reverting so we can land via co-dev ([comment](https://github.com/pytorch/pytorch/pull/118035#issuecomment-1910706841))
2024-01-25 17:53:15 +00:00
8810fdd21e fsdp: Unit test for ModuleWrapPolicy as a Callable (#117395)
We use `_or_policy` as a `Callable` to wrap a `ModuleWrapPolicy` instance as a `Callable`.

Fixes https://github.com/pytorch/pytorch/issues/109266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117395
Approved by: https://github.com/wconstab
2024-01-25 17:40:06 +00:00
c1e0674485 [DCP][BC] Remove the dependency on _shard.TensorProperties (#116248)
ShardedTensor is in maintenance mode and is going to be deprecated. DCP's metadata should not rely on any definitions in ShardedTensor. This PR creates a replica of TensorProperties in DCP and removes the dependency on _shard.TensorProperties

Differential Revision: [D52357732](https://our.internmc.facebook.com/intern/diff/D52357732/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116248
Approved by: https://github.com/wconstab, https://github.com/LucasLLC, https://github.com/wz337
2024-01-25 17:24:16 +00:00
316579e30c [FSDP2] Introduced initial fully_shard frontend (#117776)
This PR introduces the initial `fully_shard` frontend without any distributed logic that will be built into per-parameter-sharding FSDP.
- We design `fully_shard` to be a _module-level_ API (taking in an `nn.Module`), e.g. as opposed to a tensor-level one.
- We define a `FSDP` class and use a dynamic class swap, setting `module.__class__` to a newly created class that subclasses `FSDP` and `type(module)`, to allow FSDP to override and add methods on the module (see the sketch after this list).
    - We name this class as `FSDP<type(module)>`, e.g. `FSDPLinear` for `Linear`.
    - We disable the `deepcopy` because the state object inserted on the module will not be trivially `deepcopy`-able.
- Calling `fully_shard(module)` inserts a state object on `module` but not any of its children. This state object will be used for any FSDP-specific state.
- We raise an error on `ModuleList` or `ModuleDict` since they do not implement `forward()`, and FSDP will rely on `forward()` to insert logic (https://github.com/pytorch/pytorch/issues/113794).
- In the future, we will deprecate the existing `fully_shard` that calls into the same backend logic as `FullyShardedDataParallel` as there is no adoption for that and we prefer to reuse that name.
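
A generic sketch of the dynamic class-swap pattern described in the list above (not the actual FSDP2 implementation; `unshard` is a purely illustrative method name):

```python
import torch.nn as nn

class FSDP:
    def unshard(self):           # illustrative extra method added by the wrapper class
        print("unsharding", type(self).__name__)

def fully_shard(module: nn.Module) -> nn.Module:
    # Create FSDP<type(module)> on the fly and swap the instance's class in place,
    # so isinstance(module, type(module)) still holds while new methods appear.
    new_cls = type(f"FSDP{type(module).__name__}", (FSDP, type(module)), {})
    module.__class__ = new_cls
    return module

lin = fully_shard(nn.Linear(4, 4))
print(type(lin).__name__)        # FSDPLinear
print(isinstance(lin, nn.Linear))  # True
lin.unshard()                    # method provided by the swapped-in class
```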

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117776
Approved by: https://github.com/wconstab, https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #117994, #118186, #117984
2024-01-25 17:22:07 +00:00
4f78869c18 [state_dict] Calls wait() for the DTensor to_local() result (#118197)
See the discussion in https://github.com/pytorch/pytorch/pull/117799.

There are some issues when returning an AsyncCollectiveTensor (haven't found the
root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118197
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-01-25 17:14:08 +00:00
817debeb89 [inductor] Slightly faster memory allocation on CPU (#118171)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `12.2us`
- After `10.5us`

This is inspired by a2c17a2b00 -- but in Python rather than C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118171
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #118065, #118070
2024-01-25 16:54:57 +00:00
d6b556bd98 Added "any" mode to register_multi_grad_hook (#117984)
This is a re-open of https://github.com/pytorch/pytorch/pull/115628/. This PR adds an `"any"` option to `register_multi_grad_hook` that runs the hook when the gradient of _any_ of the input tensors is computed. The existing functionality is folded under the default `"all"` mode.

The multi-threaded test case is based on the existing one for `register_multi_grad_hook`. I would appreciate a closer look on that. ~~I am not sure about the hook signature (i.e. why we see two gradients in the hook that runs instead of just one, as [`register_hook`](https://pytorch.org/docs/stable/generated/torch.Tensor.register_hook.html) docs suggest).~~ It was because I was iterating over the 2 elements in the single tensor 😢 .

I did not update the `notes/autograd.rst`, which currently has a [blurb](https://pytorch.org/docs/stable/notes/autograd.html#special-hooks) on `register_multi_grad_hook`.
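
A small usage sketch of the new mode, assuming the `"any"`-mode hook receives the single gradient that triggered it (the exact callback signature of the final API may differ from this illustration):

```python
import torch

a = torch.randn(2, requires_grad=True)
b = torch.randn(2, requires_grad=True)

def hook(grad):
    # Fires as soon as the gradient of *any* registered tensor is computed,
    # instead of waiting for all of them as the default "all" mode does.
    print("got a gradient of shape", grad.shape)

handle = torch.autograd.graph.register_multi_grad_hook((a, b), hook, mode="any")
(a * 2).sum().backward()  # only `a` participates in the graph, yet the hook still runs
handle.remove()
```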

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117984
Approved by: https://github.com/soulitzer
ghstack dependencies: #117994, #118186
2024-01-25 16:25:52 +00:00
173777461c expose nested tensor header file (#117956)
This PR exposes nested-tensor-related header files, which will make it easier for others to develop nested-tensor kernels in extension modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117956
Approved by: https://github.com/ezyang
2024-01-25 15:53:10 +00:00
865945cc1f Convert requires_cuda to full decorator (#118281)
Don't require using it as `@requires_cuda()`; use `@requires_cuda` instead. There is no need for the partial function to be invoked many times.

Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise
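
An illustrative sketch of the decorator change, using unittest.skipIf as a stand-in for the real helper (this is not the actual PyTorch test-utility source):

```python
import unittest
import torch

# Before: a zero-argument factory, so call sites needed "@requires_cuda()".
def requires_cuda_factory():
    return unittest.skipIf(not torch.cuda.is_available(), "requires CUDA")

# After: the name is bound directly to the decorator, so "@requires_cuda" works.
requires_cuda = unittest.skipIf(not torch.cuda.is_available(), "requires CUDA")

class MyTest(unittest.TestCase):
    @requires_cuda
    def test_on_gpu(self):
        self.assertTrue(torch.cuda.is_available())
```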

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
2024-01-25 15:50:21 +00:00
87fb8b6218 [DTensor] Relaxed to_local requires_grad warning (#118186)
The existing warning in `DTensor.__new__()` checks `if requires_grad != local_tensor.requires_grad:` and warns with:

> To construct DTensor from `torch.Tensor`, it's recommended to use `local_tensor.detach()` and make `requires_grad` consistent.

Calling `local_tensor.detach()` returns a `Tensor` with `requires_grad=False`, so the error message refers to the case where `local_tensor.requires_grad is True` but the user passed `requires_grad=False` to `to_local()`.

However, there is the converse case, where `local_tensor.requires_grad is False` but the user passed `requires_grad=True`. In this case, the original `if requires_grad != local_tensor.requires_grad:` check succeeds, and the warning is emitted. However, the warning message does not apply in that case.

This can happen via `_prepare_output_fn` -> `redistribute` -> `Redistribute.forward()`, where `output.requires_grad is False` but it passes `requires_grad=input.requires_grad` which can be `True`.

We should not warn in this case since `Redistribute.forward()` is our own framework code, so I was proposing to relax the warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118186
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #117994
2024-01-25 15:49:32 +00:00
a5230e6019 [ez][docs] Fixed render of tensors in backward (#117994)
Before:
<img width="851" alt="Screenshot 2024-01-22 at 2 03 49 PM" src="https://github.com/pytorch/pytorch/assets/31054793/a71111ab-c7c4-4af5-a996-cbd42bcc8326">

After:
![Screenshot 2024-01-23 at 7 13 40 PM](https://github.com/pytorch/pytorch/assets/31054793/36db28a0-a96f-434c-a93f-fe78aff1e035)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117994
Approved by: https://github.com/soulitzer, https://github.com/weifengpy
2024-01-25 15:49:32 +00:00
8f973038d5 Update update_failures.py given feedback (#118237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118237
Approved by: https://github.com/drisspg
2024-01-25 15:42:01 +00:00
b5b36cf0c4 Fix failure of test_dynamo_distributed & test_inductor_collectives (#117741)
When CUDA is not available `c10d.init_process_group("nccl"...)` will fail with
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Hence add a corresponding skip marker to the classes deriving from DynamoDistributedSingleProcTestCase next to the `requires_nccl` marker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117741
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-25 13:25:36 +00:00
ee1dbb2acf [AOTI] Fix a None as index codegen issue (#118187)
Summary: Fix a ABI-compatible codegen issue when index_put has None in its indices.

Differential Revision: [D53047489](https://our.internmc.facebook.com/intern/diff/D53047489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118187
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168, #118169
2024-01-25 11:53:44 +00:00
d1e661a1ce [AOTI] Add _scaled_dot_product_efficient_attention to C shim (#118169)
Summary: _scaled_dot_product_efficient_attention is used in some TIMM models

Differential Revision: [D53032358](https://our.internmc.facebook.com/intern/diff/D53032358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118169
Approved by: https://github.com/chenyang78
ghstack dependencies: #118168
2024-01-25 11:53:44 +00:00
5c7a18c5cb [AOTI] Refactor shim_common.cpp (#118168)
Summary: Use new_tensor_handle to reduce code repetition

Differential Revision: [D53032353](https://our.internmc.facebook.com/intern/diff/D53032353)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118168
Approved by: https://github.com/chenyang78
2024-01-25 11:53:29 +00:00
4b4e6550f2 Update oneDNN build option for older systems (#118057)
Fixes [#116623](https://github.com/pytorch/pytorch/issues/116623).

As we discussed in https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900406773 and https://github.com/pytorch/pytorch/issues/116623#issuecomment-1900825829, we update oneDNN build option to support older systems and document we only support CPUs with SSE4.1+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118057
Approved by: https://github.com/malfet
2024-01-25 11:34:51 +00:00
eebe7e1d37 Migrate update-viablestrict to test-infra (#118163)
In https://github.com/pytorch/test-infra/pull/4905, so that ExecuTorch can use the same GHA on their CI.

### Testing

https://github.com/pytorch/pytorch/actions/runs/7634906738/job/20799502532#step:2:15480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118163
Approved by: https://github.com/clee2000
2024-01-25 07:07:34 +00:00
357a06f7c9 [ONNX] Fix type promotion pass (#118246)
Currently, when `node.meta["val"]` is `torch.Sym*`, its `hint` [is extracted](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L86)) and used in type promotion. However, it will [override](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L1409)) dynamic shape information carried in `node.meta["val"]` during [type propagation](61865205b6/torch/onnx/_internal/fx/passes/type_promotion.py (L1401)) and the FX graph seen in `onnxrt` always carries static shapes. Let's use `torch.Sym*` directly so that the type promotion propagates and stores dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118246
Approved by: https://github.com/titaiwangms
2024-01-25 07:04:18 +00:00
2c6a233c45 Report the type of a tensor in wrap_to_fake (#118220)
This could help diagnose why a tensor wasn't considered static.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118220
Approved by: https://github.com/albanD, https://github.com/bdhirsh
ghstack dependencies: #118215, #118217
2024-01-25 06:53:12 +00:00
8b95fb4eb8 Add stack trace to "start tracing" log (#118217)
When debugging problems on unfamiliar model code, I often want to know
"how did I end up in this compiled region."  Printing the stack trace at
tracing start lets me find out this information.

Looks like this:

```
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing f /data/users/ezyang/c/pytorch/b.py:3
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO] Stack (most recent call last):
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/b.py", line 9, in <module>
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     f(torch.randn(5))
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/eval_frame.py", line 437, in _fn
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/eval_frame.py", line 601, in catch_errors
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return callback(frame, cache_entry, hooks, frame_state)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 743, in _convert_frame
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     result = inner_convert(frame, cache_entry, hooks, frame_state)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 386, in _convert_frame_assert
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return _compile(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 645, in _compile
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/utils.py", line 248, in time_wrapper
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     r = func(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 526, in compile_inner
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     out_code = transform_code_object(code, transform)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     transformations(instructions, code_options)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 151, in _fn
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     return fn(*args, **kwargs)
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/convert_frame.py", line 473, in transform
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     tracer = InstructionTranslator(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/symbolic_convert.py", line 2030, in __init__
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     _step_logger()(
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]   File "/data/users/ezyang/c/pytorch/torch/_dynamo/logging.py", line 55, in log
[2024-01-24 12:07:11,819] [0/1] torch._dynamo.symbolic_convert: [INFO]     logger.log(level, "Step %s: %s", step, msg, **kwargs)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118217
Approved by: https://github.com/albanD
ghstack dependencies: #118215
2024-01-25 06:53:12 +00:00
2a178dade8 Augment create_symbol with user/infra backtrace fragment (#118215)
Looks like this:

```
[2024-01-24 11:59:41,656] [0/1] torch.fx.experimental.symbolic_shapes: [INFO] create_symbol s0 = 5 for L['x'].size()[0] [2, 9223372036854775806] at b.py:5 in f (_dynamo/variables/builder.py:1788 in <lambda>)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118215
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2024-01-25 06:53:12 +00:00
514159ddcb Add torch_dynamo to resume_in for ease of debugging (#118201)
resume_in_* code objects show up in user backtraces when failures occur
in code that has been Dynamo processed.  It is obvious to me, a PT2
developer, that these are generated by PT2, but it is NOT obvious to a
non-core dev that this has happened.  Add an extra torch_dynamo
breadcrumb to help get people to the right place.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118201
Approved by: https://github.com/albanD
2024-01-25 06:52:17 +00:00
5a83c47d98 [vision hash update] update the pinned vision hash (#117594)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117594
Approved by: https://github.com/pytorchbot
2024-01-25 05:33:01 +00:00
e0903b0720 [executorch hash update] update the pinned executorch hash (#118040)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118040
Approved by: https://github.com/pytorchbot
2024-01-25 05:27:53 +00:00
e5e9f390be [dynamo] Optimize overheads from _TorchDynamoContext (#118070)
Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `18.1us`
- After `12.2us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118070
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #118065
2024-01-25 05:04:56 +00:00
a40951defd [C10D] Fix nccl flightrecorder ignored dump timeout (#118142)
Don't call future.get() unless it's ready, because it waits.
Also, refactor the code a bit for simplicity.

We should do a follow-on PR to clean up the timeouts further, but this
should fix the glaring timeout bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118142
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118044, #118046, #118047
2024-01-25 04:25:36 +00:00
87335fabae [Exception] [6/N] Remove use of torch::TypeError (#117964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117964
Approved by: https://github.com/albanD
2024-01-25 03:35:58 +00:00
67300a11cb Support custom autograd Function forward AD return non-Tensor in forward (#118234)
Fixes https://github.com/pytorch/pytorch/issues/117491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118234
Approved by: https://github.com/albanD
ghstack dependencies: #117552
2024-01-25 03:24:29 +00:00
2d7a360911 Fix Auto Functionalize to handle specified default values (#118035)
Summary: When there were optionals with specified default values, the code improperly handled the number of parameters, causing `IndexError: tuple index out of range`.

Test Plan: new tests

Differential Revision: D52977644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118035
Approved by: https://github.com/williamwen42
2024-01-25 01:22:12 +00:00
4a49e2b52d refactoring (#118111)
No real changes, just moving mutation checking skip to a helper file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118111
Approved by: https://github.com/bdhirsh
ghstack dependencies: #118110
2024-01-25 00:36:46 +00:00
4448f2a49d Log stack trace of mutated idx reland (#118110)
Relanding of https://github.com/pytorch/pytorch/pull/117720 with a fixed `next(iter(dict.values()))` instead of `next(dict.values())` and a corresponding test that would have caught the problem (as well as a type annotation that also would have).
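
For context, a tiny illustration of the bug class fixed here: `dict.values()` returns a view, not an iterator, so it must be wrapped in `iter()` before `next()` can be called.

```python
d = {"a": 1, "b": 2}

try:
    next(d.values())            # TypeError: 'dict_values' object is not an iterator
except TypeError as e:
    print(e)

print(next(iter(d.values())))   # 1, the corrected form used in this reland
```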

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118110
Approved by: https://github.com/bdhirsh
2024-01-25 00:30:03 +00:00
5b819d9ef0 Properly move retains_grad hook on in-place over view for base (#117552)
Fixes https://github.com/pytorch/pytorch/issues/117366
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117552
Approved by: https://github.com/albanD
2024-01-25 00:27:13 +00:00
9c1348feb3 [pytorch][kineto] log process group config in distributed info (#117774)
Summary: Process group config is essential for analyzing collective patterns. We have already added this to Execution Trace; now we expose the same information in Kineto as well.

Test Plan: Tested in HPC

Differential Revision: D52882292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117774
Approved by: https://github.com/wconstab, https://github.com/aaronenyeshi
2024-01-25 00:08:10 +00:00
89530c8590 [dynamo] Test for using torch.nn when replay_records are enabled (#116215)
This adds a reproducer for a failure that has since been fixed in main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116215
Approved by: https://github.com/jansel
ghstack dependencies: #116230, #116214
2024-01-24 23:42:35 +00:00
7c33ce7702 [CI] Install dill in ci (#116214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116214
Approved by: https://github.com/malfet
ghstack dependencies: #116230
2024-01-24 23:42:35 +00:00
b53cc6cf8d [dynamo] Fix test_replay_record.py (#116230)
This test isn't run in CI because the CI runners don't have dill installed.
This fixes the tests so they run for me locally, and in the next PR I add
dill to the CI so we can test it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116230
Approved by: https://github.com/jansel
2024-01-24 23:42:35 +00:00
61865205b6 Deflake Dynamo stream tests (#118205)
streams need to be synchronized, otherwise, there is undefined behavior.
This PR adds the necessary synchronization. This exposed some bugs
(https://github.com/pytorch/pytorch/issues/118204), so I just marked the
tests as expectedFailure.

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118205
Approved by: https://github.com/yanboliang
2024-01-24 23:31:47 +00:00
5e0ef84b01 [dynamo] Refactor install_global_once, remove usages of install_global_unsafe (#118100)
We split install_global_once into two APIs:
- `install_global_by_id(prefix, value) -> name`: installs a global if it hasn't
been installed yet
- `install_global(prefix, value) -> name`: always installs the global (and
  generates a unique name for it)
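
An illustrative sketch of the contract of the two helpers (not the actual Dynamo implementation):

```python
import itertools

_globals = {}
_counter = itertools.count()
_installed_by_id = {}

def install_global(prefix, value):
    # Always installs under a fresh, unique name.
    name = f"{prefix}_{next(_counter)}"
    _globals[name] = value
    return name

def install_global_by_id(prefix, value):
    # Installs at most once per (prefix, value identity) and reuses the name.
    key = (prefix, id(value))
    if key not in _installed_by_id:
        _installed_by_id[key] = install_global(prefix, value)
    return _installed_by_id[key]

fn = print
assert install_global_by_id("hook", fn) == install_global_by_id("hook", fn)
assert install_global("hook", fn) != install_global("hook", fn)
```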

Then, we refactor most callsites of `install_global_unsafe` to one of
the previous. Some callsites cannot be refactored because we create the
global name first, do a lot of stuff with it, and then install it.

This fixes more test flakiness.

Test Plan:
- Existing tests; I can't reliably repro the flakiness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118100
Approved by: https://github.com/ezyang, https://github.com/mlazos
2024-01-24 23:25:44 +00:00
2abb812a78 Check if enable inside run call (#118101)
In theory this way we never have to worry about subclasses calling super().setUp() ever again

Also, dynamically creating classes (ex via type in instantiate_device_type_tests) makes super() calls a bit odd
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118101
Approved by: https://github.com/huydhn
2024-01-24 22:38:41 +00:00
dba160e676 [13/N][Dynamo] Refactor torch ctx manager classes check out of trace_rules.lookup (#118130)
I'm going to merge inline/skip/allow_in_graph check into ```trace_rules.lookup```, so it's better to make it only handle function types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118130
Approved by: https://github.com/williamwen42
2024-01-24 22:33:41 +00:00
4e29f01bf2 Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689)
# Summary
Simplification of Backend Selection

This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`, which also changes the API.

With `sdp_kernel`, one would specify the backend choice by negating the kernels they did not want to run. The purpose of this backend manager was only to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations.

Problems:
- This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, if users are seeking out this context manager, they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cuDNN backend, and this API makes it so that more implementations will need to be turned off if a user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat unintuitive that this backend manager lives in backends/cuda/__init__ when it now also controls the CPU fused kernel behavior. I think centralizing it in the attention namespace will be helpful.

Other concerns:
- Typically, backends (kernels) for operators are entirely hidden from users and are implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)?

A nice side effect is that, now that we aren't using the `BACKEND_MAP` in test_transformers, many dynamo failures are passing for CPU tests.
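
A usage sketch comparing the two APIs, based on the description above (running the fused backends requires a suitable CUDA device; enum and argument names follow `torch.nn.attention`):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 4, 8, 16, device="cuda", dtype=torch.float16) for _ in range(3))

# Old, now-deprecated style: disable everything except the backend you want.
with torch.backends.cuda.sdp_kernel(enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)

# New style: name the backend you want to run directly.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```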

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
2024-01-24 22:28:04 +00:00
77186af028 [DTensor][BE] re-enable test_dtensor_ops in CPU CI (#118134)
**Test**
`pytest test/distributed/_tensor/test_dtensor_ops.py`
This only runs CPU test and completes in 1 minute on local.
<img width="3002" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/bfbcaff0-2581-41a7-817d-f68e4041b8b1">

CI Run: https://hud.pytorch.org/pr/pytorch/pytorch/118134
Search for "distributed" test and click any of them. Then search for "test_dtensor_ops". Saw successful run of `test_dtensor_ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118134
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/wanchaol
ghstack dependencies: #117726, #118132
2024-01-24 22:11:51 +00:00
e6288820e3 Revert "Update triton ROCm version to 6.0" (#118179)
Reverting [this commit](https://github.com/pytorch/pytorch/pull/117433) due to failures observed in wheel environment e.g:
```
ImportError: /tmp/torchinductor_root/triton/0/ebfa57c0b7b95873c96cad6f9bca148d/hip_utils.so: undefined symbol: hipGetDevicePropertiesR0600
```

Will revert for now and investigate and aim to re-land this as part of https://github.com/pytorch/pytorch/pull/116270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118179
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-01-24 22:01:27 +00:00
af9b6fa04e Revert "Check if enable inside run call (#118101)"
This reverts commit 6fc015fedc96e532da756e9408fcedb9c81a423f.

Reverted https://github.com/pytorch/pytorch/pull/118101 on behalf of https://github.com/clee2000 due to possibly causing failures on b025e5984ce30eed10df0cc89111e88983d823d3 ([comment](https://github.com/pytorch/pytorch/pull/118101#issuecomment-1908940940))
2024-01-24 21:26:35 +00:00
15608d8cb4 Add guardrails preventing complex params in LBFGS & SparseAdam (#118161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118161
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #118160
2024-01-24 21:22:47 +00:00
17ecd1e9cd Migrate test_complex_optimizer to OptimizerInfo (#118160)
This PR does what it says and more.

1. We increase coverage by a LOT! Previously, complex was not tested for many many configs, including foreach + maximize at the same time. Or the fused impls. Or just random configs people forgot about.
2. I rearranged the maximize conditional and the _view_as_real to preserve list-ness. This is needed for _view_as_real to function properly, I did add a comment in the Files Changed. This new order also just...makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
2024-01-24 21:22:47 +00:00
6978c3ddf3 Removes an Incorrect Type Specification from AdaptiveMaxPool1d (#118162)
The return type for the forward pass of nn.AdaptiveMaxPool1d is specified to be Tensor, but if self.return_indices is True, the result type should be tuple[Tensor, Tensor].

For users trying to trace/script this function with indices, the incorrect typing is problematic.
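
A quick illustration of the behavior behind the type fix:

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool1d(output_size=2, return_indices=True)
out, indices = pool(torch.randn(1, 3, 8))  # a (values, indices) tuple, not a single Tensor
print(out.shape, indices.shape)            # torch.Size([1, 3, 2]) torch.Size([1, 3, 2])
```
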
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118162
Approved by: https://github.com/albanD
2024-01-24 20:31:02 +00:00
821b2c543c [AOTI] Support .item() in the ABI-compatible mode (#117989)
Summary:

Differential Revision: [D52965076](https://our.internmc.facebook.com/intern/diff/D52965076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117989
Approved by: https://github.com/ezyang, https://github.com/chenyang78
2024-01-24 20:17:59 +00:00
2f6fc33c20 Move skip sets into a new file. (#118032)
This PR moves the skip sets that lived in benchmarks/dynamo/torchbench.py into a more
readable YAML file, so that it is consumable from other projects (e.g. XLA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118032
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-01-24 19:22:01 +00:00
e599a08796 [dtensor] rewrite embedding ops using op strategy (#118079)
This PR rewrites the sharded embedding rule to use OpStrategy instead of a rule, one step further toward getting rid of rules and consolidating the embedding
operator implementation. This prepares for the rowwise embedding implementation, which will come in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118079
Approved by: https://github.com/tianyu-l
2024-01-24 19:12:12 +00:00
b025e5984c Get Device instance with correct type when privateuse1 backend is registered (#117966)
Fixes #ISSUE_NUMBER
If a privateuse1 backend is registered, let torch.device return the corresponding Device instance when only an index is given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117966
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-24 19:03:18 +00:00
6fc015fedc Check if enable inside run call (#118101)
In theory this way we never have to worry about subclasses calling super().setUp() ever again

Also, dynamically creating classes (ex via type in instantiate_device_type_tests) makes super() calls a bit odd
https://stackoverflow.com/questions/71879642/how-to-pass-function-with-super-when-creating-class-dynamically
https://stackoverflow.com/questions/43782944/super-does-not-work-together-with-type-supertype-obj-obj-must-be-an-i

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118101
Approved by: https://github.com/huydhn
2024-01-24 18:51:05 +00:00
fc135454ca [PT2][Optimus][Reliability]Fix a bug in gradients computation for runtime numeric check (#118105)
Summary:
We observed the following error when launching the e2e AFOC model test:
```
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
f524190245

Differential Revision: D53011463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118105
Approved by: https://github.com/jackiexu1992
2024-01-24 18:45:10 +00:00
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
6e78592cbb Added type checking for ExportedProgram (#117231)
Fixes #116952

Added type checking for ExportedProgram in the save function. Please review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117231
Approved by: https://github.com/avikchaudhuri
2024-01-24 18:24:44 +00:00
af1ebc45d3 [quant][pt2e] Add fold_quantize=True for all convert_pt2e calls (#117797)
Summary: In preparation for enabling fold_quantize=True by default

Test Plan: CI

Differential Revision: D52879612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117797
Approved by: https://github.com/andrewor14
2024-01-24 17:54:13 +00:00
90b3cf33ac [C10] Make Scalar constructable from longs (#118149)
On Linux and Mac `int64_t` is an alias to either `long` (Linux) or  `long long` (Mac)

Because of that, attempt to construct `c10::Scalar` from the other type will fail with `conversion from ‘long long int’ to ‘c10::Scalar’ is ambiguous`.

I.e. attempt to compile:
```cpp
int main() {
  c10::Scalar s = 1L;
}
```
on MacOS failed with:
```
foo.cpp:3:15: error: conversion from 'long' to 'c10::Scalar' is ambiguous
  c10::Scalar s = 1L;
              ^   ~~
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
      DEFINE_IMPLICIT_CTOR)
      ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:59:7: note: candidate constructor
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:62:3: note: candidate constructor
  Scalar(uint16_t vv) : Scalar(vv, true) {}
  ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:63:3: note: candidate constructor
  Scalar(uint32_t vv) : Scalar(vv, true) {}
  ^
/Users/nshulga/git/pytorch/pytorch/torch/include/c10/core/Scalar.h:64:3: note: candidate constructor
  Scalar(uint64_t vv) {
  ^

```

Prevent this by providing the missing constructors when needed. Alas, one cannot use SFINAE, as template constructors on Scalar mess up a lot of implicit conversions, so I use `static_asserts` to detect early on whether the premise for constructing this class holds.

Add ScalarTest::LongsAndLongLongs that is essentially a compile time test

Discovered while trying to enable AOTI on MacOS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118149
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #118077, #118076
2024-01-24 17:32:29 +00:00
880f9bb57e Remove xfails for consistently succeeding tests (#118152)
Fixes https://github.com/pytorch/pytorch/issues/117786, https://github.com/pytorch/pytorch/issues/117785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118152
Approved by: https://github.com/yanboliang
2024-01-24 15:47:55 +00:00
bd99115276 [AOTI] Enable for MacOS (#118076)
- Add `darwin` to the list of supported platform
- Add `#include <sstream>` to `aoti_runtime/model.h`
- Refactor Linux specific constant compilation logic to `_compile_consts_linux`
- Add `_compile_consts_darwin` that converts consts to .S file that is linked into a shared library
   - Patch file using magic to avoid converting bytes to large hexadecimal string
- Generate integer constants with `LL` suffix on MacOS (corresponds to int64_t definition)
- Enable test_aot_inductor.py tests on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118076
Approved by: https://github.com/desertfire
ghstack dependencies: #118077
2024-01-24 14:24:05 +00:00
a545ebc870 Switched macOS runners type to macos-m1-stable (#117651)
Switched macOS runners type to `macos-m1-stable`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117651
Approved by: https://github.com/huydhn
2024-01-24 11:55:13 +00:00
12662f4d95 [dynamo] add username in debug path (#117820)
Summary: Not including a username in the debug path may cause conflicts and permission errors when people share a dev server.

bypass-github-pytorch-ci-checks

Test Plan: ci

Differential Revision: D52895486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117820
Approved by: https://github.com/kflu, https://github.com/DanilBaibak
2024-01-24 10:14:20 +00:00
7d396918c6 [Inductor] Fix argument unused during compilation warning (#118077)
By not passing linker flag if `compile_only` is set to `True`
Before that change every invocation of AOTI compiler resulted in emitting at least 4 warnings:
```
clang: warning: -lomp: 'linker' input unused [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-shared' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-undefined dynamic_lookup' [-Wunused-command-line-argument]
clang: warning: argument unused during compilation: '-L/Users/nshulga/miniforge3/lib' [-Wunused-command-line-argument]
```
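
A simplified sketch of the fix (not the actual Inductor build code): linker-only flags are emitted only when we actually link.

```python
def build_command(compile_only: bool) -> list[str]:
    cmd = ["clang++", "-O3", "model.cpp"]
    if compile_only:
        cmd += ["-c", "-o", "model.o"]  # compile step only: no linker flags, no warnings
    else:
        cmd += ["-shared", "-undefined", "dynamic_lookup", "-lomp", "-o", "model.so"]
    return cmd

print(" ".join(build_command(compile_only=True)))
print(" ".join(build_command(compile_only=False)))
```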

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118077
Approved by: https://github.com/desertfire
2024-01-24 09:52:16 +00:00
50ead5d8ae [fx] add an option to not retrace when doing op fusion (#118120)
Summary: If the given model is already a graph module, we want to skip retracing in some cases.

Test Plan: CI

Differential Revision: D53018283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118120
Approved by: https://github.com/zyan0
2024-01-24 09:41:26 +00:00
c5702a0891 [dynamo] Optimize BACKEND_MATCH guard (#118065)
As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `22.5us`
- After `18.1us`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065
Approved by: https://github.com/ydwu4
2024-01-24 07:47:52 +00:00
ed0ec2e0be Remove dynamo runner's dependency on distributed build (#117903)
This lets us bisect faster without needing to rebuild the distributed module. We remove the annotation to avoid the flake8 undefined-name lint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117903
Approved by: https://github.com/xuzhao9
2024-01-24 06:51:14 +00:00
725f4b58ac Cache dfs path in propose_partitions and re-use that later when trying to find cycles in the graph (#115943)
Summary:
This diff introduces a caching mechanism to improve the performance of the partitioner in PyTorch. The changes involve adding a cache to store the DFS path of each node in the graph, which can be reused later when trying to find cycles in the graph.

This shows significant improvements for edge use cases: an ASR model with around 6000+ nodes used to take 26 minutes to partition, and after this change it takes around 8 minutes.
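
A generic sketch of the caching idea, assuming a DAG and a reachability-based cycle check (this is not the actual partitioner code):

```python
from functools import lru_cache

graph = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}

@lru_cache(maxsize=None)
def reachable(node):
    # Memoized DFS: the reachable set of each node is computed once and reused.
    out = set()
    for succ in graph[node]:
        out.add(succ)
        out |= reachable(succ)
    return frozenset(out)

def would_create_cycle(src, dst):
    # Adding an edge src -> dst creates a cycle iff src is already reachable from dst.
    return src in reachable(dst)

print(would_create_cycle("a", "c"))  # False
print(would_create_cycle("a", "d"))  # True, since d -> a already exists
```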

Test Plan: Relying on the existing ExecuTorch CI tests that heavily use this partitioning mechanism and also tested out locally via Bento notebooks.

Differential Revision: D51289200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115943
Approved by: https://github.com/SherlockNoMad
2024-01-24 05:30:11 +00:00
d59c2d6e05 [dtensor] refactor partial redistribution logic (#113334)
This PR:

* Moves the remaining placement transforms, specifically the Partial-related logic, from redistribute.py to placement_types
* Redefines the Partial interface to make things more consistent, and adds
  docs about the transformation relationships

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113334
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #118078
2024-01-24 04:56:16 +00:00
03205ff3ba [dtensor] make local_shard_size_on_dim be staticmethod (#118078)
As titled, this is so that we can use it in cases where we don't need
to construct a Shard placement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118078
Approved by: https://github.com/XilunWu
2024-01-24 04:56:16 +00:00
8d49737f2b [CUDA][Complex] Bump thresholds for conv3d (#118151)
Seeing a 1/1000 numerical mismatch

CC @coyotelll

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118151
Approved by: https://github.com/ezyang
2024-01-24 04:18:31 +00:00
46c228f0e2 [DTensor][BE] rename PlacementStrategy.output_spec to output_specs since now we support a tuple of DTensorSpec as output (#116437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116437
Approved by: https://github.com/wanchaol
2024-01-24 03:33:58 +00:00
26968cefb0 [DTensor][fix] re-enable [add]mm tensor test (#118132)
**Summary**
Re-enable tests that were disabled in #118045 as #117726 fixed the empty tensor case for DTensor [add]mm.

**Test Plan**
`pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118132
Approved by: https://github.com/malfet
ghstack dependencies: #117726
2024-01-24 03:17:18 +00:00
155f27a97b [DTensor][fix] fix is_tensor_shardable to correctly handle Replicate placement (#117726)
**Summary**
Previously, the DTensor sharding-plan filter (i.e. `is_tensor_shardable()`) could not correctly handle the case where the input `DTensor` has a dimension of size 0. The filter should return `True` if the sharding placement on that dimension is `Replicate`, even though `tensor dim < num of shards` on that dimension (in which case `tensor dim == 0` and `num of shards == 1`).

In this PR we also noticed a behavior discrepancy of `torch.addmm`. See #118131

**Test Plan**
```
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k addmm
pytest test/distributed/_tensor/test_dtensor_ops.py -s -k mm_cpu_float32
CUDA_VISIBLE_DEVICES="" pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
pytest test/distributed/_tensor/test_matrix_ops.py -s -k empty_operand
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117726
Approved by: https://github.com/wanchaol
2024-01-24 03:17:18 +00:00
e9c240670f [sigmoid] Add canonicalized IR as an option. (#116758)
Summary: As titled, the "canonical" flag is added to the sigmoid serializer so that we can optionally "normalize" the IR to give stable names and orderings to IR nodes, which helps when comparing IR definitions.

Test Plan: buck run @//mode/opt //aps_models/ads/config_model_authoring/stability:cli export-generated-module-state-command

Differential Revision: D52431965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116758
Approved by: https://github.com/avikchaudhuri
2024-01-24 03:11:25 +00:00
21e8546b11 [inductor][fx] Fix broadcast_tensors with unbacked symints when translation validation is off (#118066)
## Context
This is an example that runs into an AssertionError while lowering in Inductor.
```
# While lowering, b will be expanded because b.size(1) == 1.
a = torch.zeros([u0, 512])
b = torch.ones([u0, 1])
return a * b
```

Below is the tail end of the stack trace. Here are the important bits:
1. In _inductor/sizevars.py, we'll call `self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)`.
2. This leads to the creation of a `ShapeEnvEvent` with an FX node via `kwargs={"fx_node": V.graph.current_node}` ([see](0c9b513470/torch/fx/experimental/recording.py (L245-L247))).
3. Eventually, we try to call `maybe_convert_node()` but it expects translation validation to be on ([see](0c9b513470/torch/fx/experimental/recording.py (L118-L121))).
```
  File "pytorch/torch/_inductor/lowering.py", line 221, in transform_args
    for i, x in zip(indices, broadcast_tensors(*[args[i] for i in indices])):
  File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "pytorch/torch/_inductor/lowering.py", line 676, in broadcast_tensors
    x = expand(x, target)
  File "pytorch/torch/_inductor/lowering.py", line 294, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "pytorch/torch/_inductor/lowering.py", line 793, in expand
    return TensorBox(ExpandView.create(x.data, tuple(sizes)))
  File "pytorch/torch/_inductor/ir.py", line 1871, in create
    new_size = cls._normalize_size(x, new_size)
  File "pytorch/torch/_inductor/ir.py", line 1862, in _normalize_size
    new_size[i] = V.graph.sizevars.expect_equals(
  File "pytorch/torch/_inductor/sizevars.py", line 338, in expect_equals
    self.expect_true(sympy.Eq(left, right), msg=msg)
  File "pytorch/torch/_inductor/sizevars.py", line 333, in expect_true
    self.shape_env.defer_runtime_assert(expr, msg, fx_node=V.graph.current_node)  # (1) is here
  File "pytorch/torch/fx/experimental/recording.py", line 257, in wrapper
    return event.run(self)   # (2) happens right before this
  File "pytorch/torch/fx/experimental/recording.py", line 155, in run
    replacearg(index=3, key="fx_node", fn=maybe_convert_node)
  File "pytorch/torch/fx/experimental/recording.py", line 138, in replacearg
    kwargs[key] = fn(kwargs[key])
  File "pytorch/torch/fx/experimental/recording.py", line 128, in maybe_convert_node
    assert hasattr(shape_env, "name_to_node")  # (3) is here
```

## Approach
Since [translation validation](c6be5d55a5/torch/fx/experimental/validator.py (L574)) may not be on during Inductor lowering, we can check if that's True and return the FX node's name in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118066
Approved by: https://github.com/ezyang, https://github.com/peterbell10
2024-01-24 03:07:30 +00:00
41a56f7828 Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955)
This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581,  0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581,  0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.

57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)

When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.

The TensorImpl should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
2024-01-24 01:40:18 +00:00
fc30c4d769 Migrate forloop directional tests to OptimizerInfo (#117410)
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.

**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test.
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old bug that has since been fixed.

What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
- Testing for param groups --> which all just distinguish between lrs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
2024-01-24 01:28:40 +00:00
5b671ce486 [dynamo] fix typo in 3.11 resume_execution.py (#118108)
whoopsie

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118108
Approved by: https://github.com/angelayi, https://github.com/zou3519
2024-01-24 00:59:04 +00:00
b7b1affe97 Add half specializations for load of sum (#106454)
Add half specializations for load of sum

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106454
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-24 00:35:20 +00:00
c0732c8d5e [Dynamo] Add complex to literal constant (#117819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117819
Approved by: https://github.com/zou3519
2024-01-23 23:46:46 +00:00
cd084c4909 Add TensorIteratorConfig::add_const_input to avoid COW materialize (#118053)
Part of #97856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053
Approved by: https://github.com/ezyang
2024-01-23 22:32:39 +00:00
abd759d50d [fx] Add hooks to intercept node replacements. (#117825)
Summary: Adds an experimental API to FX GraphModule that places "hooks" every time we change or replace nodes in a graph, so that we can properly update the new name in the graph signature and potentially other places.

Test Plan:
buck test mode/opt  -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform

buck test mode/opt caffe2/test:test_export -- -r test_replace_hook

Differential Revision: D52896531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825
Approved by: https://github.com/avikchaudhuri
2024-01-23 22:28:40 +00:00
b369888bec Replace constraints with dynamic_shapes in caffe2/test/cpp & torchrec/distributed/tests/test_pt2 (#118026)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `caffe2/test/cpp` and `torchrec/distributed/test/test_pt2`.

Test Plan: CI

Differential Revision: D52977354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118026
Approved by: https://github.com/chenyang78
2024-01-23 22:15:15 +00:00
6ac284122b [Memory Snapshot] Track context for SEGMENT_FREE and SEGMENT_UNMAP (#118055)
Summary: Show the stack when SEGMENT_FREE and SEGMENT_UNMAP occur. This may be useful for debugging, for example when empty_cache() causes a segment to be freed. If the free context is unavailable, fall back to the segment allocation stack.
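A usage sketch of the snapshot tooling this touches (the underscore-prefixed APIs are private; the file name is arbitrary):

```python
import torch

torch.cuda.memory._record_memory_history()      # record allocator events with stacks
x = torch.randn(1024, 1024, device="cuda")
del x
torch.cuda.empty_cache()                         # may emit SEGMENT_FREE / SEGMENT_UNMAP
# With this change those events carry the freeing stack (or, if unavailable,
# the segment's allocation stack) in the dumped snapshot.
torch.cuda.memory._dump_snapshot("snapshot.pickle")
```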

Test Plan: CI

Differential Revision: D52984953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118055
Approved by: https://github.com/zdevito
2024-01-23 21:48:57 +00:00
c6930aad46 Update Triton pin (#117873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117873
Approved by: https://github.com/shunting314, https://github.com/malfet
2024-01-23 21:05:30 +00:00
13d2cdffa2 Remove optimizer.step patching for profiler hook (#115772)
1. I'd like to remove the patching that avoids the profiler hook, but it adds an additional graph break due to nested wrappers. #117767 if interested, see (internal only) paste for [before](P996529232) and [after](P997507449) this PR.

```
I've locally run perf benchmarks for yolov3: Before the speedup is 4.183x, and after it is 4.208x.
I've also run it for resnet50: before, speedup is 3.706x and now it is 3.924x.
```

2. @mlazos I now unwrap twice in the dynamo and inductor tests. This feels like we're testing deficiently--should we add tests to test that tracing through the profiler hook and the use_grad hook are functioning according to expectations (I know there's at least one graph break in one).
3. There's a strange memory thing going on...what is happening? This has been resolved with @voznesenskym's [change](https://github.com/pytorch/pytorch/pull/116169). (for details see below)

<details>
This PR will fail the test_static_address_finalizer test due to a mysterious thing that is happening (idk what, but maybe the dynamo cache or a frame _expecting_ the patching to have been done).

There is no Python refcycle, as the backrefs for `p_ref()` look like:
![image](https://github.com/pytorch/pytorch/assets/31798555/4d6cbf50-3924-4efe-b578-d93389eebec8)
(so 5 backrefs but none of them python)

And the refs:
![image](https://github.com/pytorch/pytorch/assets/31798555/25e01105-bcb9-44ca-997a-2cf1670a6d42)
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115772
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-23 20:15:41 +00:00
77705e7486 [dtensor] fix unnecessary redistribute in new_factory_strategy (#118037)
**Summary**
Previously, assuming `x` is a DTensor with non-replicate placement, calling `x.new_full` would create a replicated (but unused) copy of `x`, incurring unnecessary communications. This PR fixes the issue.

**Test**
`python test/distributed/_tensor/test_tensor_ops.py -k test_new_full`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118037
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-01-23 19:35:43 +00:00
58e7ec5843 Revert "Log stack trace of mutated idx (#117720)"
This reverts commit 365c7a292fedbf776014b878849ebd3dcb7463f0.

Reverted https://github.com/pytorch/pytorch/pull/117720 on behalf of https://github.com/eellison due to cause of https://github.com/pytorch/pytorch/issues/118104 ([comment](https://github.com/pytorch/pytorch/pull/117720#issuecomment-1906693119))
2024-01-23 18:40:20 +00:00
364728b27b Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before yours segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-23 18:39:30 +00:00
5ec2d7959d Revert "[ez] Provide a slightly better error message if process times out (#117865)"
This reverts commit 5538b37a065e5a68c3fb9d1f8eaa3e4fd12fd0b8.

Reverted https://github.com/pytorch/pytorch/pull/117865 on behalf of https://github.com/clee2000 due to Does not play nice with retry_shell, which expects timeoutexpired, but i cant control the error message of that ([comment](https://github.com/pytorch/pytorch/pull/117865#issuecomment-1906640922))
2024-01-23 18:13:41 +00:00
6784594532 Fix sparse windows on CPU with MKL (#102604)
Fix https://github.com/pytorch/pytorch/issues/97352.
This PR changes the way the linking to Intel MKL is done and updates MKL on Windows to mkl-2021.4.0.
Both conda and pip provide MKL package versions that you can link against dynamically: mkl-devel contains the static versions of the libraries and MKL contains the DLLs needed at runtime. MKL DLLs and static libs starting with 2021.4.0 have the version in their names (for MKL 2023 we have mkl_core.2.dll and for 2021.4.0 we have mkl_core.1.dll), so it is possible to have multiple versions installed and it will work properly.
For the wheel build I added a dependency on the MKL wheel, for conda a dependency on the conda MKL package, and for libtorch I copied the MKL binaries into libtorch.
In order to test this PR I had to use a custom builder https://github.com/pytorch/builder/pull/1467

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102604
Approved by: https://github.com/IvanYashchuk, https://github.com/malfet
2024-01-23 17:41:18 +00:00
7598a4efdc [ROCm] Disable MIOpen for empty tensors for RNN (#117672)
Some MIOpen RNN functions (lstm, rnn, gru) can't work with empty tensors and return the error "MIOpen Error: Lengths must be > 0".
This PR disables MIOpen for empty tensors and forces the use of the native methods.
The solution is based on the condition used for CUDNN 3a52147cc5/aten/src/ATen/native/TensorProperties.cpp (L91)
It also fixes [test_nn.py::TestNN::test_RNN_input_size_zero](29fa6fbc4e/test/test_nn.py (L4592)) on ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117672
Approved by: https://github.com/cpuhrsch
2024-01-23 17:30:18 +00:00
0c9b513470 [Export] Fix serialize_metadata (#118031)
Summary: As title.

Test Plan: CI

Differential Revision: D52979069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118031
Approved by: https://github.com/zhxchen17
2024-01-23 17:03:04 +00:00
9ebaa27922 Fix types.MethodDescriptorType related bug in dynamo (#118041)
Methods that were `types.MethodDescriptorType` were failing because the `tensor.method()` to `method(tensor)` conversion was dropping the tensor and just calling `method`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118041
Approved by: https://github.com/yanboliang
ghstack dependencies: #118000
2024-01-23 16:11:38 +00:00
3b38f7b266 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 16:11:38 +00:00
3ec4f00316 [inductor] Allow reinplacing functionalized scatter ops (#116899)
This expands the reinplacing pass to allow reinplacing view-scatter operations.
e.g. if our python code is:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```
this generates a functionalized graph like:
```python
a = view1(inp)
a_updated = view2_scatter(a, src)
inp_updated = view1_scatter(inp, a_updated)
```

First, the `canonicalize_view_scatter_ops` step rewrites the functionalized graph
in the form:
```python
inp_updated = _generalized_scatter(inp, src, [view1, view2])
a_updated = view1(inp_updated)
```

I then register `_generalized_scatter` as a normal inplacable op which can be
handled by the pre-existing mechanism. Since we've fused the two scatter ops into one,
the reinplacing pass sees only one user of `inp` which allows the entire operation to be
reinplaced  if desired (and I add heuristics that sometimes choose not to reinplace).

Finally, there is a decomposition step which decomposes out-of-place or in-place
`_generalized_scatter` operations either back into view_scatter operations, or
into the version with mutations. When introducing mutations, the reinplaced
version is equivalent to the original mutation:
```
a = view1(inp)
b = view2(a)
b.copy_(src)
```

Or when out-of-place we end up with a minor restructuring of the graph:
```
a = view1(inp)
tmp = view2_scatter(a, src)
inp_updated = view1_scatter(inp, tmp)
a_updated = view1(inp_updated)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116899
Approved by: https://github.com/lezcano
ghstack dependencies: #116898, #117121
2024-01-23 15:31:28 +00:00
5502a63b22 [inductor] Allow reinplacing before meta-only users (#117121)
Currently if you have the code:
```python
idx = torch.arange(10, device=x.device)
src = torch.ones(10, dtype=x.dtype, device=x.device)
x.index_put_((idx,), src)
expand = x.expand((2, x.shape[0]))
```

The `index_put_` cannot be reinplaced under dynamic shapes due to the user
`aten.sym_size(x, 0)` however since this function only looks at the tensor
metadata, it is actually fine to reinplace.

Here I ignore these operators in the analysis of the reinplacing pass, so
reinplacing can happen under dynamic shapes as well. I also handle cases
where views are created just to be fed to `sym_size`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117121
Approved by: https://github.com/lezcano
ghstack dependencies: #116898
2024-01-23 15:31:28 +00:00
eb0fcab421 [inductor] Move reinplace pass to its own file (#116898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116898
Approved by: https://github.com/lezcano
2024-01-23 15:31:28 +00:00
e309d6fa1c Better unsupported op error message (#117770)
Previously, if someone wrote a python abstract impl but didn't import
the module it is in, then we would raise an error message suggesting
that the user needs to add an abstract impl for the operator.

In addition to this, we suggest that the user try importing the module
associated with the operator in the pystub (it's not guaranteed that
an abstract impl does exist) to avoid confusion.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117770
Approved by: https://github.com/ydwu4, https://github.com/williamwen42
2024-01-23 15:05:16 +00:00
4d625c1c92 [AOTI] Fix a bug in the torch._export.aot_load API (#118039)
Summary:
tree_flatten_spec should use args instead of *args

clone of https://github.com/pytorch/pytorch/pull/117948 but with some fbcode specific changes

Test Plan: CI

Differential Revision: D52982401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118039
Approved by: https://github.com/angelayi
2024-01-23 14:54:02 +00:00
bff348b28f [AOTI] Add missing include to model.h (#118075)
At least if one tries to compile the AOTI code on Darwin, compilation
fails with an "implicit instantiation of undefined template" error:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3:
/Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>'
  std::stringstream ss;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075
Approved by: https://github.com/desertfire
ghstack dependencies: #118074
2024-01-23 14:34:00 +00:00
2963e85a3f [EZ][AOTI] Fix typos (#118074)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118074
Approved by: https://github.com/desertfire
2024-01-23 14:34:00 +00:00
ae459c5809 Don't use private accessor on SymNode to get _expr (#118007)
This materially impacts https://github.com/pytorch/pytorch/pull/117862
split this out for testing

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118007
Approved by: https://github.com/tugsbayasgalan
2024-01-23 14:29:19 +00:00
73c9be1395 Don't use private accessor on SymNode to get _expr (round 2) (#118013)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118013
Approved by: https://github.com/tugsbayasgalan
2024-01-23 14:29:12 +00:00
905a7cc340 [ROCm] skip test_eager_transforms.py test_compile_vmap_hessian_cuda (#118009)
Memory leak detected on ROCm.  Skip until it can be addressed.

PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test_eager_transforms.py -k test_compile_vmap_hessian_cuda

See #117642 for moving rocm CI to unstable due to this test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118009
Approved by: https://github.com/jeanschmidt
2024-01-23 09:57:18 +00:00
4cfd16cb6d [Inductor] optimize transpose_mxn with bf16 data type (#117958)
**Summary**
Add a vectorized implementation of `transpose_mxn` for the BFloat16 data type when the matrix size is 16x16 or 32x32, as observed in Stable Diffusion BF16.

**TestPlan**
```
python -u -m pytest -s -v test_cpu_repro.py -k test_transpose_mxn_16_16_bf16_fp16
python -u -m pytest -s -v test_cpu_repro.py -k test_transpose_mxn_32_32_bf16_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117958
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-23 09:43:35 +00:00
40890ba8e7 [CI] Add python test skip logic for XPU (#117621)
Add python test skip logic for XPU

For testing purposes, cherry-pick #116833 & #116850 first; the xpu test passed https://github.com/pytorch/pytorch/actions/runs/7566746218/job/20604997985?pr=117621. Revert them now.

Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117621
Approved by: https://github.com/huydhn
2024-01-23 08:20:42 +00:00
455bba38f4 [C10D] Make Flight Recorder report time_created in ns (#118047)
Addresses (6) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118047
Approved by: https://github.com/zdevito
ghstack dependencies: #118044, #118046
2024-01-23 08:18:08 +00:00
5df92a9244 [C10D] Add version tag to NCCL Flight Recorder Dump (#118046)
Addresses (3) from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118046
Approved by: https://github.com/zdevito
ghstack dependencies: #118044
2024-01-23 08:18:08 +00:00
dace1fda2e [C10D] Make NCCL Flight Recorder dump produce a dict (#118044)
Putting the list of entries into a particular key of a top-level dict
paves the way for adding other metadata as other top level keys.

Addresses 1 and 2 from #117883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118044
Approved by: https://github.com/zdevito
2024-01-23 08:18:08 +00:00
28c8a07b4d add mask_convert_to_lp to support bool->fp16/bf16 convert (#117830)
Fix
https://github.com/pytorch/pytorch/issues/117624
https://github.com/pytorch/pytorch/issues/117627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-23 07:52:43 +00:00
6049998971 [C10D] Finer-grain nccl heartbeat, avoid false positive hangs (#118016)
Summary:
Previously, the heartbeat was incremented once per finishing a for loop over a list
of in-progress work items, under the assumption that either the processing
would be predictably quick, or it would hang completely.

In fact, there can be CUDA API contention that causes the processing of work items
to slow down arbitrarily but not truly deadlock.  To guard against this, we
bump the heartbeat at the smallest unit of progress: one work item being
successfully processed.
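A Python-flavored sketch of the change (the real code is the C++ ProcessGroupNCCL watchdog; `work_items` and `heartbeat` are hypothetical stand-ins):

```python
def watchdog_pass(work_items, heartbeat):
    # Iterate over the in-progress work list, retiring whatever has finished.
    for work in list(work_items):
        if work.is_completed():
            work_items.remove(work)
            heartbeat.increment()  # new: bump per successfully processed work item
    # old behavior: heartbeat.increment() only here, once per full pass, so a
    # slow-but-progressing pass was indistinguishable from a hang
```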

Test Plan: CI

Differential Revision: D52973948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
2024-01-23 07:25:18 +00:00
a8978d3676 [dynamo] Add size(), get_coordinate() support for DeviceMesh in dynamo (#117710)
Summary: This fix is part of: https://github.com/pytorch/pytorch/issues/117670

Test Plan: Unit tetst and CI

Differential Revision: D52857348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117710
Approved by: https://github.com/wconstab, https://github.com/yanboliang, https://github.com/wanchaol, https://github.com/anijain2305
2024-01-23 07:10:52 +00:00
bb28965924 Revert "Remove skips for passing tests (#118000)"
This reverts commit 3c339b5b21fdbd530f82765f84bcabde8266d3e0.

Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))
2024-01-23 06:10:25 +00:00
suo
d84173c025 [export] fix unlifting of custom class constants (#117979)
we didn't have a test covering this case, add one.

Aside: we should invest in actually unit testing the lifting/unlifting passes, both separately and also against each other. I have a diff cooking for that.

Differential Revision: [D52962180](https://our.internmc.facebook.com/intern/diff/D52962180/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117979
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #115222, #117978
2024-01-23 05:51:00 +00:00
suo
7b0979ef8e [export] fixes to unflatten + custom obj composition (#117978)
The test I added for this didn't actually enable torchbind tracing, oops. Fix that and fix the issues that cropped up.

Differential Revision: [D52962205](https://our.internmc.facebook.com/intern/diff/D52962205/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117978
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #115222
2024-01-23 05:50:41 +00:00
e056cf5507 [ac][pattern matcher] Do not percolate tags beyond the inputs of matched portion (#118034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118034
Approved by: https://github.com/yf225
2024-01-23 05:02:32 +00:00
3708f2608e [DTensor] Skip [add]mm empty tensor test (#118045)
As DTensor does not support multiplication of [4,0] and [0,4] matrices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118045
Approved by: https://github.com/yf225, https://github.com/wanchaol
2024-01-23 04:08:11 +00:00
0036385b55 [Inductor][Reliability] Add runtime numeric check for pt2 Optimus in the pre grad pass (#115142)
Summary: Titled

Test Plan:
# local reproduce
Patch `icfg.fx_passes_numeric_check["pre_fx_passes"] = True`
```
buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
P965217137

# MC candidates
### FIRST + CMF
f520754604
P1056796962
### ICVR
f520816217
P1056839342
### IG_CTR
f520819178
P1056903302
### MAI
f520823559
P1057712009
### AFOC
f520822438
P1057760058
### DPA
f520826815
P1057808574
### How the runtime numeric check catches [SEVs](https://docs.google.com/document/d/1WOtlbgCBbmU1klK1LiGSO0lYf_7mtSP4nAdvhQHM0JE/edit#heading=h.k61fy2rhaijp)
bug fix diff: D51378532
### CMF+(FIRST)
f509587388
P1058305139
by running the numeric check, we can catch the forward loss differences (e.g., diffing(https://www.internalfb.com/intern/diffing/?paste_number=1058293804))
https://pxl.cl/4bQDG

f501760099
P1058400691
by running the numeric check, we can catch the forward loss differences (e.g., diffing(https://www.internalfb.com/intern/diffing/?paste_number=1058412054))
https://pxl.cl/4bQMw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115142
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2024-01-23 03:56:50 +00:00
3c339b5b21 Remove skips for passing tests (#118000)
These tests were already passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000
Approved by: https://github.com/yanboliang
2024-01-23 03:41:23 +00:00
4646d0e1b2 Update xla.txt (#117999)
XLA CI is currently broken in PyTorch. I think there are 2 reasons causing that:
1. There is an offending PyTorch PR c393b2f1ee. Han is working on a fix in https://github.com/pytorch/xla/pull/6345
2. The commit that the PyTorch pin points to, 2990cb38c17e06d0dbe25437674ca40130d76a8f, was not a valid commit. I think this is because we tried to help them land a breaking PR in https://github.com/pytorch/xla/pull/6307; however, we did a rebase which made that commit vanish, so now the CI fails with
```
fatal: reference is not a tree: 2990cb38c17e06d0dbe25437674ca40130d76a8f
585
```
Let me first update the pin to master so it at least runs some tests; this way we can discover if there are any additional issues. I will rebase after @qihqi 's fix passes all CI

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117999
Approved by: https://github.com/clee2000
2024-01-23 03:36:32 +00:00
fed45aee54 Replace invoking self.value if there is a user defined init, avoiding arbitrary code execution (#117818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117818
Approved by: https://github.com/ezyang
2024-01-23 03:14:58 +00:00
dc1b9d758e Update passrate calculation script to skip inductor and export (#118030)
We don't want to count running test/inductor/ and test/export/ with
PYTORCH_TEST_WITH_DYNAMO=1 as a part of the pass rate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118030
Approved by: https://github.com/ydwu4
ghstack dependencies: #117998
2024-01-23 02:33:57 +00:00
162f643090 Script to generate failures histogram (#118008)
Generates something that looks like
https://gist.github.com/zou3519/43aa8ef28a327bd68cfbac83d84c0999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118008
Approved by: https://github.com/yanboliang, https://github.com/oulgen
2024-01-23 02:28:55 +00:00
af7cd5c32a [Dynamo] Install module globals per output_graph (#117998)
Fixes https://github.com/pytorch/pytorch/issues/117851

In tests, we ran into an issue where:
- In frame A, Dynamo would install a global
- We call reset()
- reset() did not delete the installed global due to a refcycle
- In frame B, Dynamo would re-use the same global
- Python gc ran, deleting the installed global, leading to the compiled
  version of frame B raising NameNotFound

This PR changes the following:
- module globals are now installed at a per-frame basis.
- renames install_global to install_global_unsafe: if the names are not
  unique and end up being re-used across frames, then we've got trouble.

Test Plan:
- I tested that this got rid of the test flakiness locally. I'm not sure
  how to easily write a test for this, because I don't actually know
  what the refcycle in the above is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117998
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-01-23 02:28:02 +00:00
a85fd20d45 [ONNX] Improve support to mmap for ONNXProgram.save (#117863)
Currently, when the user passes a model state_dict which is not a file,
ONNXProgram.save calls torch.save along with io.BytesIO, which does not
support memory mapping. That causes the file stream to be fully allocated in
memory.

This PR removes the torch.save call and passes the dict directly to the
serializer. This is beneficial for the scenario where model_state_dict
is generated by torch.load(..., mmap=True), as the state dict will be
mapped in memory instead of fully loaded into memory.

This PR leverages https://github.com/pytorch/pytorch/pull/102549
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117863
Approved by: https://github.com/wschin
2024-01-23 02:00:00 +00:00
052860294f Replace constraints with dynamic_shapes in export-to-executorch tutorial (#117916)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in export-to-executorch tutorial.

Test Plan: CI

Differential Revision: D52932772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117916
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2024-01-23 01:17:19 +00:00
d810b10232 Add beta1 support to CyclicLR momentum (#113548)
Fixes #73910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113548
Approved by: https://github.com/janeyx99
2024-01-23 01:16:58 +00:00
d01ba4e94e enable fp8 cast for inductor CPU (#117737)
Enable FP8 cast for this issue https://github.com/pytorch/pytorch/issues/117119.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117737
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-23 01:16:15 +00:00
d8420c0b0c [Nested Tensor]Add helper functions to set max_seqlen/min_seqlen directly (#117815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117815
Approved by: https://github.com/soulitzer
2024-01-23 01:00:45 +00:00
a27a6e8cf1 [ROCm] skip test_sparse_csr test_triton_bsr_softmax_cuda (#118006)
The tests were taking too long and leading to CI timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118006
Approved by: https://github.com/huydhn
2024-01-23 00:09:42 +00:00
c6be5d55a5 Migrate param_group testing to OptimizerInfo (#117675)
Today, our param_group testing does the equivalent of giving weight and bias different optimizer hyperparams and then checking that the overall result moves in the right direction based on maximize.

This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but is more methodical as the test is a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is basically 0 in the default group, and ensure that the param in the default group doesn't move while the loss does.

Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSProp also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was just testing for direction and our new tests test stronger guarantees.

The leftover param group configs are used in conjunction with LRSchedulers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
2024-01-22 23:48:46 +00:00
d280b6ae58 Ensure that deleter is called even for a no-data tensor. (#117418)
Summary:

When using a custom deleter, InefficientStdFunctionContext was using a
std::unique_ptr<> to store the pointer and call the deleter, but this failed to
call the deleter if the pointer was null. Since we have a separate holder class
anyway, take out the std::unique_ptr<> and call the deleter directly.

Fixes #117273

Test Plan:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117418
Approved by: https://github.com/wjakob, https://github.com/yanboliang
2024-01-22 23:27:27 +00:00
cef5b93f28 [ez] Serial when NUM_PROCS is 1 (#117977)
Makes it easier to understand what's going on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117977
Approved by: https://github.com/huydhn
2024-01-22 23:11:41 +00:00
f9fca33baf [codemod][highrisk] Fix shadowed variable in caffe2/caffe2/onnx/onnx_exporter.cc (#117996)
Summary:
Our upcoming compiler upgrade will require us not to have shadowed variables. Such variables have a _high_ bug rate and reduce readability, so we would like to avoid them even if the compiler was not forcing us to do so.

This codemod attempts to fix an instance of a shadowed variable. Please review with care: if it's failed the result will be a silent bug.

**What's a shadowed variable?**

Shadowed variables are variables in an inner scope with the same name as another variable in an outer scope. Having the same name for both variables might be semantically correct, but it can make the code confusing to read! It can also hide subtle bugs.

This diff fixes such an issue by renaming the variable.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: igorsugak

Differential Revision: D52582853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117996
Approved by: https://github.com/PaliC, https://github.com/kit1980, https://github.com/malfet
2024-01-22 22:57:06 +00:00
b901999350 [inductor] For View.create(x, sizes) call realize_input() instead of realize() when handling unbacked symints (#117013)
# Context
Let's say we do `View.create(x, sizes)` where `x` is a `SliceView` and `sizes` contains unbacked symints e.g. `sizes = [i14, 256]`. Then we'll run ([this code](7e37f63e5e/torch/_inductor/ir.py (L2058-L2071))) where we:
1. Call `x.realize()` -- SliceView(Pointwise) -> SliceView(ComputedBuffer).
2. Retrieve storage & layout via `as_storage_and_layout(x)`
3. Calculate `new_layout` based off layout & `new_sizes`
3. `return ReinterpretView(storage, new_layout)`
However, (2) will raise `NotImplementedError` ([see](7e37f63e5e/torch/_inductor/ir.py (L1704-L1731))) since `x` is a `SliceView` and that isn't supported.

Thus, I tried adding support for `SliceView` in `as_storage_and_layout`. This worked for my case, but if instead `sizes` had backed symints e.g. `sizes=[s0, 256]` then some benchmarked models lost accuracy.
```
    if isinstance(x, SliceView):
        return as_storage_and_layout(
            x.data,
            freeze=freeze,
            want_contiguous=want_contiguous,
            stride_order=stride_order,
        )
```

So instead of the above, I tried unwrapping the `SliceView` via `x = x.unwrap_view()`. This works for my use case and passes CI but I'm not entirely sure why. If we unwrap our `SliceView` and create a `ReinterpretView`, I'd assume we'd lose the reindexer from `SliceView`. ~~But maybe we can re-create the same indexing from the `ReinterpretView`'s strides?~~ edit: we do lose vital information (like offset) when you release your `SliceView` and create a `ReinterpretView`, so that's a no-go.

Moving onto the final version of this PR. We call `ExternKernel.realize_input()` (feels a bit weird to use `ExternKernel` but it's exactly what I need). It will go ahead and handle our `SliceView` case ([see](a468b9fbdf/torch/_inductor/ir.py (L3733-L3739))) by converting it to a `ReinterpretView` with the correct offset.

# Test
```
$ python test/inductor/test_unbacked_symints.py
..
----------------------------------------------------------------------
Ran 10 tests in 20.813s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117013
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-01-22 22:34:10 +00:00
f96b7d06d7 [export] skip export tests when test with dynamo in ci (#117988)
Fixes https://github.com/pytorch/pytorch/issues/117947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117988
Approved by: https://github.com/suo, https://github.com/zou3519
2024-01-22 22:14:36 +00:00
c14751b6cf Remove extraneous [[fallthrough]] in ivalue.cpp (#117985)
Test Plan: Sandcastle

Differential Revision: D52963965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117985
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-01-22 21:54:39 +00:00
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c070c3a9713893659377e20e147c125f6.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
792dfa7e16 Allow dynamic shapes of tuple type for inputs of dataclass type (#117917)
Summary:
In `torch.export.export(f, args, kwargs, ..., dynamic_shapes=None, ...)`, `dataclass` is an acceptable type of input (for args and kwargs). The `dynamic_shapes` of a `dataclass` input needs to be the same `dataclass` type, with each tensor attribute replaced by the `dynamic_shapes` of the corresponding tensor. (https://github.com/pytorch/pytorch/blob/main/torch/export/dynamic_shapes.py#L375)

However, some `dataclass` may have limitations on the types of attributes (e.g., having to be tensors) such that the same `dataclass` cannot be constructed for dynamic shapes.

For an input of `dataclass` type, this task enables a `dynamic_shapes` of a tuple type that specifies dynamic shape specifications for each tensor of the input in the same order as the input dataclass type's flatten_fn (https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py#L103)
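A hedged sketch of the new spelling (the dataclass, its pytree registration, and the module are illustrative, not from the PR):

```python
import dataclasses
import torch
import torch.utils._pytree as pytree

@dataclasses.dataclass
class Inp:
    x: torch.Tensor
    y: torch.Tensor

# Register the dataclass as a pytree node so export flattens it in (x, y) order.
pytree.register_pytree_node(
    Inp,
    lambda d: ((d.x, d.y), None),
    lambda children, _: Inp(*children),
)

class M(torch.nn.Module):
    def forward(self, inp: Inp):
        return inp.x + inp.y

batch = torch.export.Dim("batch")
# The spec for the dataclass input no longer has to be another Inp instance;
# a tuple ordered like the flattened tensors (x, then y) is accepted.
ep = torch.export.export(
    M(),
    (Inp(torch.randn(4, 8), torch.randn(4, 8)),),
    dynamic_shapes=(({0: batch}, {0: batch}),),
)
```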

Test Plan: buck test //caffe2/test:test_export

Differential Revision: D52932856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117917
Approved by: https://github.com/avikchaudhuri
2024-01-22 21:50:28 +00:00
4df65bf51b Optimize recursive_add_node in fx splitter (#117969)
Summary: The `FxNetAccFusionsFinder.recursive_add_node` function can run into exponential complexity when applied to an fx graph with multiple densely connected layers of nodes. Here we add a `visited` set, which reduces the worst-case complexity to linear.
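The idea, as a generic sketch (names are illustrative, not the actual splitter code):

```python
def collect_reachable(node, deps, visited=None):
    """Walk `deps` (node -> iterable of upstream nodes) visiting each node once.

    Without `visited`, a densely connected layered graph revisits shared
    ancestors exponentially many times; with it, the traversal is linear
    in the number of nodes and edges.
    """
    if visited is None:
        visited = set()
    if node in visited:
        return visited
    visited.add(node)
    for parent in deps.get(node, ()):
        collect_reachable(parent, deps, visited)
    return visited
```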

In the internal MRS models with the densely connected layer structure, this fix reduces the fx split time from forever to < 100ms, hence unblocking the internal enablement.

P.S. As much as I want to add a unit test, I can't find any existing tests for the `_SplitterBase` infra. Happy to add one if pointed to where. Thanks!

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D52951321](https://our.internmc.facebook.com/intern/diff/D52951321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117969
Approved by: https://github.com/oulgen, https://github.com/khabinov
2024-01-22 21:49:36 +00:00
86e8551446 [dtensor] switch softmax forward ops to OpStrategy (#117723)
**Summary**
This PR switches the softmax and log_softmax ops to use OpStrategy instead of rules. This PR also adds support when the softmax dimension is sharded -- a replication is performed before computation.

**Test**
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_fwd`
`python test/distributed/_tensor/test_math_ops.py -k test_softmax_with_bwd`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117723
Approved by: https://github.com/XilunWu
2024-01-22 21:26:48 +00:00
fdac55c35d Added example regarding weight_decay distinction with per-parameter API (#117436)
Added new example and description regarding per-parameter `weight_decay` distinction for bias and non-bias terms.
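The kind of per-parameter distinction being documented, sketched with an arbitrary module and hyperparameters:

```python
import torch

model = torch.nn.Linear(10, 10)
# Apply weight decay to the weight matrix but not to the bias term.
optimizer = torch.optim.SGD(
    [
        {"params": [model.weight], "weight_decay": 1e-4},
        {"params": [model.bias], "weight_decay": 0.0},
    ],
    lr=0.1,
)
```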

Fixes #115935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117436
Approved by: https://github.com/janeyx99
2024-01-22 21:26:02 +00:00
b14d57ceda Replace constraints with dynamic_shapes in scripts/sijiac/prototypes and test/inductor (#117915)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `scripts/sijiac/prototypes` and `test/inductor`.

Test Plan: buck test mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor

Differential Revision: D52931743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117915
Approved by: https://github.com/angelayi
2024-01-22 21:24:03 +00:00
95a6866220 Migrate fused optim load_state_dict to OptimizerInfo (#117890)
The new tests look like:

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (29f899ef)]$ python test/test_optim.py -v -k test_cpu_load_state_dict
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
test_cpu_load_state_dict_impl_capturable_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_fused_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_cpu_load_state_dict_impl_capturable_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_capturable_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_capturable_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... skipped 'SGD does not currently support capturable'
test_cpu_load_state_dict_impl_fused_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_fused_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_cpu_load_state_dict_impl_fused_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 12 tests in 12.865s

OK (skipped=6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117890
Approved by: https://github.com/albanD
2024-01-22 21:14:38 +00:00
9a2c8f644b Mark DynamicShapesExportTests::test_retracibility_dynamic_shapes as slow (#117896)
Mark `dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_retracibility_dynamic_shapes` explicitly as slow

I cannot figure out what the correct way to do this is

Tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117896
Approved by: https://github.com/huydhn
2024-01-22 21:12:03 +00:00
903e1913ff Rename unbacked SymInt prefix to u (#117859)
Currently, it conflicts with Inductor's naming convention for index
variables

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117859
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/avikchaudhuri
2024-01-22 20:53:47 +00:00
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
c170fbd309 [dtensor] refactor redistribute and fix uneven sharding redistribution (#115525)
This PR:
- refactors the redistribute implementation logic to make it more
sound, by figuring out the transform information first and then applying
the transformations step by step; we also cache the decisions so that they
can be reused
- for uneven sharding, refactors the uneven sharding logic and uses a logical
  shape concept for each transform information to fix the uneven sharding
  multi-mesh redistribute bug

fixes https://github.com/pytorch/pytorch/issues/115310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115525
Approved by: https://github.com/XilunWu
2024-01-22 18:57:44 +00:00
2bb2cc0b71 [tp] add clarification to doc and improve TP examples (#117618)
This PR adds a clarification about the evenly-sharded assumption in the main
TP doc and improves the TP examples by adding device mesh constructions

fixes https://github.com/pytorch/pytorch/issues/100044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117618
Approved by: https://github.com/wconstab, https://github.com/awgu
2024-01-22 18:56:50 +00:00
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
56ef5afdee [dynamo] Add more dynamo call_methods and getattr support or Placement (#117733)
Summary:
Explained by title.
This fix is part of: https://github.com/pytorch/pytorch/issues/117670

Test Plan:
Unit tetst and CI
- Unit test: `buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:dtensor_compile -- test_placement_compile`

Differential Revision: D52863073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117733
Approved by: https://github.com/yanboliang
2024-01-22 18:22:54 +00:00
suo
f612e96180 [export] set proper fqn in lift constant tensor pass (#115222)
See comments: previously we were populating the lifted constant in the buffer list without an FQN, which messed up unflattening.

Differential Revision: [D50568062](https://our.internmc.facebook.com/intern/diff/D50568062/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115222
Approved by: https://github.com/tugsbayasgalan
2024-01-22 18:13:49 +00:00
80cf0ce153 Enhance torch.vmap support from inside torch.compile (#116050)
This work rewrites vmap support in torch.compile by inlining most of
the frames into the existing FX graph. It also unlocks PyTorch support for
features that were previously missing, such as keyword args.
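A small sketch of what this enables (backend and function are arbitrary; the keyword argument to the vmapped function is the previously missing piece):

```python
import torch

def pointwise(x, scale=1.0):
    return torch.sin(x) * scale

@torch.compile(backend="eager")
def f(x):
    # vmap is traced by inlining its frames into the FX graph,
    # including the keyword argument passed to the mapped function.
    return torch.vmap(pointwise)(x, scale=2.0)

out = f(torch.randn(4, 3))
```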

Fixes: https://github.com/pytorch/pytorch/issues/114306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116050
Approved by: https://github.com/zou3519
2024-01-22 17:53:45 +00:00
b2a3d6ba0d [exportdb] Remove torch/fb/exportdb (#117866)
Summary: This has already been moved to torch/_export/db

Test Plan: no tests? I think?

Reviewed By: avikchaudhuri

Differential Revision: D52875607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117866
Approved by: https://github.com/ydwu4
2024-01-22 17:41:33 +00:00
a359afbc3f Make and/or on uint8 tensors properly return 0x00 or 0x01 (#117827)
Fixes https://github.com/pytorch/pytorch/issues/117215

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117827
Approved by: https://github.com/albanD
2024-01-22 17:30:22 +00:00
c6c54df81b Fix incorrect type hints of Module.to (#117937)
Fixes #117936

While #113647 fixed the issue that `device` did not accept strings, it did not get the type hints fully correct. This PR removes the `str` variants from the type hints for the `dtype` parameter(s) in all overloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117937
Approved by: https://github.com/albanD
2024-01-22 16:47:30 +00:00
60519fa3b7 change master to main in datapipes readme (#117919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117919
Approved by: https://github.com/albanD
2024-01-22 16:29:41 +00:00
86b4b27e26 [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/awgu
2024-01-22 15:46:35 +00:00
8dc421a6b4 Revert "accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)"
This reverts commit 03b12e56c758431df6f95075ce3a0113ccaeb3f9.

Reverted https://github.com/pytorch/pytorch/pull/115539 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/115539#issuecomment-1904157729))
2024-01-22 14:48:35 +00:00
cyy
c3780010a5 Remove calls of c10::guts::void_t (#117942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117942
Approved by: https://github.com/Skylion007
2024-01-22 06:12:37 +00:00
3580e5d407 [executorch hash update] update the pinned executorch hash (#117953)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117953
Approved by: https://github.com/pytorchbot
2024-01-22 04:34:44 +00:00
cyy
39df084001 [Clang-tidy header][16/N] Enable clang-tidy on headers in torch/csrc/autograd (#117821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117821
Approved by: https://github.com/Skylion007
2024-01-22 00:52:56 +00:00
cyy
3baade4425 Remove calls of c10::guts::conjunction,c10::guts::disjunction,c10::guts::negation (#117926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117926
Approved by: https://github.com/Skylion007
2024-01-22 00:35:42 +00:00
02209b5880 Revert "[docs] start a new FSDP notes doc (#117323)"
This reverts commit 7f474da6bcc735cde5ef1417dc28472769307f5d.

Reverted https://github.com/pytorch/pytorch/pull/117323 on behalf of https://github.com/awgu due to broke docs ([comment](https://github.com/pytorch/pytorch/pull/117323#issuecomment-1902740900))
2024-01-21 19:47:27 +00:00
suo
c393b2f1ee [export] require Module to be passed to export (#117528)
This PR changes torch.export to require an nn.Module as input, rather than taking an arbitrary callable.

The rationale for this is that we have several invariants on the ExportedProgram that are ambiguous if the top-level object being traced is a function:
1. We "guarantee" that every call_function node has an `nn_module_stack` populated.
2. We offer ways to access the state_dict/parameters/buffers of the exported program.

We'd like torch.export to offer strong invariants—the value proposition of export is that you can trade flexibility for stronger guarantees about your model.

An alternative design would be to implicitly convert the top-level function into a module, rather than require that the user provide a module. I think that's reasonable (it's what we did in TorchScript), but in the spirit of being explicit (another design tenet of export) I avoid that here.
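Under the new requirement, a plain function can still be exported by wrapping it in a module explicitly (a minimal sketch):

```python
import torch

def fn(x):
    return x.relu() + 1

class Wrapper(torch.nn.Module):
    def forward(self, x):
        return fn(x)

# torch.export.export now expects an nn.Module as its first argument.
ep = torch.export.export(Wrapper(), (torch.randn(3),))
```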

Differential Revision: [D52789321](https://our.internmc.facebook.com/intern/diff/D52789321/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117528
Approved by: https://github.com/thiagocrepaldi, https://github.com/zhxchen17, https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2024-01-21 19:36:13 +00:00
3ee092f75b VSX: Fix overflow in complex division (#116972)
For large complex values the division produces inf or NaN values, which leads other functions to produce them too,
e.g. `torch._refs.sgn` used in a test.
Example:
```
$ python -c 'import torch; print(torch._refs.sgn(torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32))))'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])

$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(t / t.abs())'
tensor([-0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj, -0.+nanj])
```
Implement the same algorithm as used in numpy and x86 (#93277)

The reason is that for a tensor with a component of `1e20`, the abs-squared value used in the division contains a term `1e20 * 1e20`, which overflows the dynamic range of float32 (3e38) and yields an "inf", so the division yields "nan".
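A scalar Python sketch of the scaled division being adopted (the actual change is in the vectorized VSX C++ kernel); it avoids forming the squared magnitude directly:

```python
def safe_complex_div(ar, ai, br, bi):
    """(ar + ai*j) / (br + bi*j) without computing br*br + bi*bi."""
    if abs(br) >= abs(bi):
        r = bi / br            # |r| <= 1, so nothing below overflows
        d = br + bi * r
        return (ar + ai * r) / d, (ai - ar * r) / d
    else:
        r = br / bi
        d = br * r + bi
        return (ar * r + ai) / d, (ai * r - ar) / d

# (-501 - 1e20j) / 1e20: the naive float32 form overflows to inf/nan,
# the scaled form stays finite.
print(safe_complex_div(-501.0, -1e20, 1e20, 0.0))  # (-5.01e-18, -1.0)
```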

Output after change:
```
$ python -c 'import torch; t = torch.complex(torch.tensor([-501]*16, dtype=torch.float32), torch.tensor([-1e20]*16, dtype=torch.float32)); print(torch._refs.sgn(t), t.sgn(), t / t.abs())'
tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j]) tensor([-5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j,
        -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j, -5.0100e-18-1.j])
```

CC @quickwritereader who wrote the initial code and @VitalyFedyunin who was involved in the initial review and @lezcano who reviewed #93277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116972
Approved by: https://github.com/lezcano
2024-01-21 19:21:13 +00:00
afabed6ae6 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-21 18:47:01 +00:00
41556324a9 [cpp_wrapper] Change CppWrapperCodeCache to use faster python binding (#117693)
Summary: Using faster binding following https://github.com/pytorch/pytorch/pull/117500. torch.utils.cpp_extension.load_inline builds a lot of things and is very slow. With this change, later we can further reduce the included header files using the ABI-compatible mode and thus further speed up the compilation.

Result:
```
python test/inductor/test_cuda_cpp_wrapper.py -k test_relu_cuda_cuda_wrapper

Before: Ran 1 test in 32.843s
After: Ran 1 test in 26.229s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117693
Approved by: https://github.com/jansel
2024-01-21 16:07:52 +00:00
7f474da6bc [docs] start a new FSDP notes doc (#117323)
As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion.

I hope I did the RST right, I haven't done RST in a while.

- The first section is Andrew's words verbatim + formatting
- The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better.

tagging @albanD as requested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323
Approved by: https://github.com/albanD, https://github.com/awgu
2024-01-21 15:11:24 +00:00
b50ccad86e [BE]: Add type alias typing annotation to prims_common (#117928)
Explicitly mark unions assignments as type aliases to make it easier for static type checkers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117928
Approved by: https://github.com/ezyang
2024-01-21 14:26:59 +00:00
df4e3d9d08 Document OpsHandler protocol (#117790)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117790
Approved by: https://github.com/jansel
2024-01-21 07:20:53 +00:00
eqy
8f7caaee67 [cuDNN] Fix cuDNN version parsing against future versions of cuDNN (#117908)
Remove the unnecessary dependence on assuming a fixed number of digits per field

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117908
Approved by: https://github.com/cpuhrsch
2024-01-21 05:00:01 +00:00
fbd1d567ed [inductor] Fix CPP wrapper codegen for ExternKernel args (#117931)
Summary: We see IR nodes `repr`-ed directly in the CPP wrapper codegen. Recently, this issue has been fixed for the Python wrapper codegen in D52899373 (https://github.com/pytorch/pytorch/pull/117838). Here we extend the fix to CPP wrapper codegen / AOTInductor.

Test Plan:
New unit tests. In OSS:

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_multi_output_arg
```

```
python test/inductor/test_aot_inductor.py -k test_triton_kernel_extern_kernel_arg
```

Differential Revision: D52936248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117931
Approved by: https://github.com/oulgen, https://github.com/chenyang78, https://github.com/desertfire
2024-01-21 04:58:56 +00:00
fa1e89b337 Ban mutation on dropout outputs in export (#117879)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117879
Approved by: https://github.com/ezyang
ghstack dependencies: #117811
2024-01-21 04:53:40 +00:00
949a76a7f0 [executorch hash update] update the pinned executorch hash (#117899)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117899
Approved by: https://github.com/pytorchbot
2024-01-21 04:19:27 +00:00
suo
2ae66ddba0 [export] fix test ownership (#117886)
as title

Differential Revision: [D52924188](https://our.internmc.facebook.com/intern/diff/D52924188/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117886
Approved by: https://github.com/ydwu4
2024-01-21 01:18:16 +00:00
bad5e1e0bb [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op hardswish (#117489)
**Summary**
Enable the fusion pattern `QConv2d -> hardswish`, lowering `hardswish` as a `QConv2d` post operator.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_hardswish
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117489
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #117487, #117488
2024-01-21 00:01:32 +00:00
05ef2030ea [c10d] Add logs for NCCL Comm Abort call (#117868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117868
Approved by: https://github.com/kwen2501
2024-01-20 21:34:13 +00:00
2de3474711 Simplify kwargs propagation in __call__. (#117880)
In case no keyword arguments are passed, `**kwargs` would expand just fine without the need for extra overhead of `or {}`. In addition to reducing boilerplate, this also comes with a small perf improvement:
```
In [1]: def null(*args, **kwargs):
   ...:     pass
   ...:

In [2]: def call1(*args, **kwargs):
   ...:     return null(*args, **(kwargs or {}))
   ...:

In [3]: def call2(*args, **kwargs):
   ...:     return null(*args, **kwargs)
   ...:

In [4]: %timeit call1()
145 ns ± 2.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %timeit call2()
118 ns ± 2.14 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit call1()
147 ns ± 6.19 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: %timeit call2()
117 ns ± 0.846 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117880
Approved by: https://github.com/Skylion007
2024-01-20 19:29:35 +00:00
50633620b2 sympy.Symbol is a subclass of sympy.Expr (#117857)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117857
Approved by: https://github.com/peterbell10
2024-01-20 18:09:44 +00:00
af831415a8 fix cpp backend relu codegen with inf input (#117622)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/117544.
For the CPP backend, `ReLU` currently codegens to `f"{x} * ({x}>0)"` in `CppOverrides`. The result mismatches eager when the input has `inf`, since `inf * 0` results in `nan` based on [IEEE_754](https://en.wikipedia.org/wiki/IEEE_754). Change the codegen to `f"std::max({x}, decltype({x})(0))"` to align with the eager implementation as in 1deb75b584/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L392)
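A quick demonstration of the mismatch (eager relu stands in for the expected semantics):

```python
import torch

x = torch.tensor([float("-inf"), float("inf"), 1.0])
print(x * (x > 0))            # tensor([nan, inf, 1.])  old codegen: -inf * 0 -> nan
print(torch.clamp_min(x, 0))  # tensor([0., inf, 1.])   max(x, 0), matching eager relu
print(torch.relu(x))          # tensor([0., inf, 1.])
```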

**TestPlan**
```
python -u -m pytest test_cpu_repro.py -k test_relu_with_inf_value
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117622
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-01-20 13:28:03 +00:00
4bf481fb1b Fix inductor pattern match error for qlinear with bmm (#117633)
Summary:

PR https://github.com/pytorch/pytorch/pull/116599 converts `bmm` to `qlinear` when the input dim exceeds 2 and the input is not contiguous. However, there is an error when checking the weight size because the permute op is not considered.

Test Plan:
python test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous

Fixes: -

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117633
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-01-20 12:26:26 +00:00
0ae952db76 enable mkldnn bf32 matmul (#116015)
### Testing
FP32 matmul vs. mkldnn BF32 matmul  on SPR

single core:

Input | BF32   / ms | FP32  /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:
Input | BF32   / ms | FP32 /   ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True |1129.204 | 1708.912 |  1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915  | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 |  8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128  | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-20 09:30:23 +00:00
aaae2d8bb6 Add compilable and capturable foreach adamax with tests (#117835)
Based off of https://github.com/pytorch/pytorch/pull/110345

Fixes https://github.com/pytorch/pytorch/issues/117812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835
Approved by: https://github.com/janeyx99
2024-01-20 05:29:05 +00:00
suo
e732adf0a7 [pytree] add access api (#117771)
This PR introduces an API to use KeyPaths to actually access values on pytrees.

Differential Revision: [D52881260](https://our.internmc.facebook.com/intern/diff/D52881260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117771
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2024-01-20 04:03:26 +00:00
a1b3b5748f [Pytorch][Vulkan] Create context for conv1d (#117780)
Summary:
`conv1d` has two arguments `weight` and `bias` which are stored as constant tensors on the CPU and transferred to the GPU at every inference call. We create a context for this operator to avoid repeatedly passing them. Specifically, we
- created `Conv1dPackedContext`, `create_conv1d_context` and `run_conv1d_context` in `Convolution.h` and `Convolution.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`

Test Plan:
## Numerical test
```
[luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef
Network: Up: 0B  Down: 0B
Jobs completed: 4. Time elapsed: 0.6s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (159 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (57 ms)
[----------] 2 tests from VulkanAPITest (217 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (217 ms total)
[  PASSED  ] 2 tests.
```

Full test result in P1053644934, summary as below
```
[----------] 419 tests from VulkanAPITest (28080 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (28080 ms total)
[  PASSED  ] 418 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `conv1d` and traced it as below
```
# Define a simple model that uses conv1d
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1d = nn.Conv1d(16, 33, 3)

    def forward(self, x):
        return self.conv1d(x)

# Create an instance of the model
model = MyModel()

# Create a dummy input tensor for tracing
input_tensor = torch.randn(20, 16, 50)

# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer

vulkan_model = mobile_optimizer.optimize_for_mobile(
    traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Next we can print the graph of the `vulkan_model` as `print(vk_model.graph)`
- before this diff: `conv1d` was used
```
graph(%self.1 : __torch__.___torch_mangle_16.MyModel,
      %x : Tensor):
  %60 : Device = prim::Constant[value="cpu"]()
  %self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %37 : bool = prim::Constant[value=0]()
  %36 : NoneType = prim::Constant()
  %59 : Device = prim::Constant[value="vulkan"]()
  %self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
  %7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0
  %18 : int[] = prim::Constant[value=[1]]()
  %19 : int[] = prim::Constant[value=[0]]()
  %39 : Tensor = aten::to(%x, %59, %36, %37, %37)
  %20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7)
  %58 : Tensor = aten::to(%20, %60, %36, %37, %37)
  return (%58)
```
- after this diff: `conv1d` was replaced with `run_conv1d_context`
```
graph(%self.1 : __torch__.___torch_mangle_6.MyModel,
      %x : Tensor):
  %85 : Device = prim::Constant[value="cpu"]()
  %51 : bool = prim::Constant[value=0]()
  %50 : NoneType = prim::Constant()
  %84 : Device = prim::Constant[value="vulkan"]()
  %53 : Tensor = aten::to(%x, %84, %50, %51, %51)
  %prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1)
  %22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0)
  %83 : Tensor = aten::to(%22, %85, %50, %51, %51)
  return (%83)
```

Differential Revision: D52865379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780
Approved by: https://github.com/yipjustin
2024-01-20 02:35:32 +00:00
10923f8720 Revert "[inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)"
This reverts commit 1967394690f144a7ba1717eccec977286cafe2da.

Reverted https://github.com/pytorch/pytorch/pull/117298 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing in MacOS 1967394690, may be due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/117298#issuecomment-1901594120))
2024-01-20 02:14:58 +00:00
94f0472579 [Quant] [PT2] Add Hardswish into X86InductorQuantizer Conv2d Unary Annotation (#117488)
**Summary**
Add `hardswish`  into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117488
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #117487
2024-01-20 01:37:33 +00:00
1967394690 [inductor][custom ops] Add tag to custom ops to preserve stride orders in inductor (#117298)
fixes #116715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117298
Approved by: https://github.com/eellison
2024-01-20 01:37:28 +00:00
181e6dafd0 [MPS] Fix linear for 5D tensors (#117837)
torch.nn.Linear crashes with an internal assert if invoked with 5D tensors,
due to a bug in the MPS framework, i.e. invoking
```swift
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let x = graph.constant(1, shape: [2, 1, 2, 1, 2], dataType: .float32)
let y = graph.constant(1, shape: [2, 3], dataType: .float32)
let z = graph.matrixMultiplication(primary: x, secondary: y, name: nil)
let device = MTLCreateSystemDefaultDevice()!
let buf = device.makeBuffer(length: 48)!
let td = MPSGraphTensorData(buf, shape: [2, 1, 2, 1, 3], dataType: .int32)
let cmdBuf = MPSCommandBuffer(from: device.makeCommandQueue()!)
graph.encode(to: cmdBuf, feeds: [:], targetOperations: nil, resultsDictionary: [z:td], executionDescriptor: nil)
cmdBuf.commit()
```
crashes with
```
AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayIdentity.mm:813: failed assertion `New volume: 4 should match old volume: 8 [reshapeWithCommandBuffer] MPSNDArrayIdentity.'
zsh: abort      ./build/matmul
```

Work around the issue by flattening the forward and backward tensors if the number of dimensions is greater than 4

Add regression tests to Linear opinfo samples
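For reference, a minimal Python-side repro of the failing case (assumes an MPS-capable Apple Silicon machine); with the fix the >4-D input is flattened before the MPS matmul and reshaped back:

```python
import torch

linear = torch.nn.Linear(2, 3).to("mps")
x = torch.randn(2, 1, 2, 1, 2, device="mps")
print(linear(x).shape)  # torch.Size([2, 1, 2, 1, 3])
```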

Fixes https://github.com/pytorch/pytorch/issues/114942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117837
Approved by: https://github.com/janeyx99
2024-01-20 01:19:19 +00:00
d4cc1c5bff Add new pattern matchers for SDPA (#113004)
Add two new pattern matchers to enable SDPA in more models.

- Pattern 14: `BertLarge`
- Pattern 15: `DistilBert`

Perf on SPR:

![Perf on SPR](https://github.com/pytorch/pytorch/assets/23010269/f0813343-c9e8-4fd4-9fa0-d0e67e1d57af)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113004
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-20 00:46:46 +00:00
8f91a53e9a Add environment for close-nonexistent-disable-issues (#117885)
Made a new environment called rockset-read-only that has a read-only API key for Rockset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117885
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-01-19 23:45:46 +00:00
3c1498d117 [ONNX] Add bfloat16 support for scaled_dot_product_attention (#117878)
Using ONNX opset 14, the aten scaled_dot_product_attention operator can be implemented with bfloat16 support because Add-14 supports bfloat16.

This PR simply adds bfloat16 to the list of supported types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117878
Approved by: https://github.com/BowenBao
2024-01-19 23:24:44 +00:00
f684e44fd6 Revert "Reduce pytest prints (#117069)"
This reverts commit 40dbd567e04483c671f9c897171bf9d1e7162b68.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to need to handle timeout expired better ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1901270953))
2024-01-19 23:07:51 +00:00
5538b37a06 [ez] Provide a slightly better error message if process times out (#117865)
Just a slightly clearer error message
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117865
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 22:58:00 +00:00
29f899ef87 [pytorch][vulkan] cumsum dim <= 1 (#117580)
Summary:
Following the implementation of Softmax, striding over the texture differently based on the desired dimension.

Softmax performs a similar operation as cumsum (generally called "scan") iterating over all items in a dimension, but cumsum only needs to iterate once to collate the sum, compared to softmax which needs to iterate multiple times to collect the max and denominator for the final calculation.

Similar to the softmax implementation, there are likely opportunities to optimize, but this gets all dims < 4 functional first.

Test Plan:
`LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck2 run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*cumsum*"`:
```
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *cumsum*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.cumsum_1d
[       OK ] VulkanAPITest.cumsum_1d (93 ms)
[ RUN      ] VulkanAPITest.cumsum_2d
[       OK ] VulkanAPITest.cumsum_2d (74 ms)
[ RUN      ] VulkanAPITest.cumsum_3d
[       OK ] VulkanAPITest.cumsum_3d (105 ms)
[ RUN      ] VulkanAPITest.cumsum_4d
[       OK ] VulkanAPITest.cumsum_4d (73 ms)
[----------] 4 tests from VulkanAPITest (346 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (346 ms total)
[  PASSED  ] 4 tests.
```

Differential Revision: D52814000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117580
Approved by: https://github.com/yipjustin
2024-01-19 21:52:48 +00:00
dd6c0f6844 Trim Dynamo shards 7->3 (#117869)
We added all of the tests we wanted for now. These fit comfortably in 3
shards (the total test time previously was 0.5 hours on each shard).
Going to decrease the number of shards to 3 so that it's less unwieldy
to work with.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117869
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-19 21:48:35 +00:00
365c7a292f Log stack trace of mutated idx (#117720)
Log stack trace of mutated tensor that prevents cudagraphs. Will do some subsequent refactors when all of the checks are moved to this fashion.

Differential Revision: [D52896588](https://our.internmc.facebook.com/intern/diff/D52896588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117720
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117823
2024-01-19 21:38:44 +00:00
6c99bf0766 move disable_cudagraph_reason disabling after codecache is accessed (#117823)
Disabling cudagraphs has to happen after the codecache is loaded, or it won't properly be disabled on a cache hit.

Differential Revision: [D52896590](https://our.internmc.facebook.com/intern/diff/D52896590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117823
Approved by: https://github.com/bdhirsh, https://github.com/masnesral
2024-01-19 21:33:25 +00:00
c4eab49ded [MacOS] Embed libomp.dylib/omp.h into MacOS wheel (#114816)
To keep them on par with what we do on x86
And `omp.h` as it is needed for `torch.compile` on CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114816
Approved by: https://github.com/atalman
2024-01-19 21:21:33 +00:00
414a1fd29f [PyTorch] Add IValue::IValue(std::vector<T>&&) ctors (#117769)
There are two IValue constructors that take `const std::vector<T>&`. Add moving variants to allow callers to save on reference counting.

Differential Revision: [D52879065](https://our.internmc.facebook.com/intern/diff/D52879065/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117769
Approved by: https://github.com/suo, https://github.com/Skylion007
2024-01-19 21:11:11 +00:00
d45fd68012 OIDC for update_pytorch_labels (#117876)
Companion: https://github.com/pytorch-labs/pytorch-gha-infra/pull/339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117876
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-19 21:08:28 +00:00
ad3d41692e [PyTorch] return decltype(auto) from getItem (#117569)
This allows getItem to take advantage of the nicer (sometimes-const-reference) return type from `List::get() const` added in the previous diff.

Differential Revision: [D52809097](https://our.internmc.facebook.com/intern/diff/D52809097/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117569
Approved by: https://github.com/iseeyuan, https://github.com/malfet
ghstack dependencies: #117568
2024-01-19 21:04:53 +00:00
632fcc4831 [PyTorch] Make List::get() const match List::operator[]() const (#117568)
As far as I can tell, `get()` is supposed (and documented) to be the same as a const `operator[]`. We have an efficient implementation for `operator[]`. Let's use it for `get()`.

Differential Revision: [D52809098](https://our.internmc.facebook.com/intern/diff/D52809098/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117568
Approved by: https://github.com/suo, https://github.com/malfet
2024-01-19 21:04:53 +00:00
15d568d621 [Inductor] Use codegen reference for buffer to string (#117838)
Summary: The added test case ends up emitting an inductor IR as the buffer string, lets properly emit the buffer name instead.

Test Plan: added new test

Differential Revision: D52899373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117838
Approved by: https://github.com/aakhundov
2024-01-19 20:18:53 +00:00
1f5c27eb18 cleanup code comments _compute_numerical_gradient (#117484)
cleanup code comments for ` _compute_numerical_gradient`:
- reference parameters passed
- indicate that central difference approximation is used (see the sketch below)
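For context, a small numeric sketch of the central difference approximation that `_compute_numerical_gradient` relies on (plain Python, not the gradcheck internals):

```python
import torch

def central_diff(f, x, eps=1e-6):
    # f'(x) ~= (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = torch.tensor(0.5)
print(central_diff(torch.sin, x))  # close to cos(0.5)
print(torch.cos(x))
```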
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117484
Approved by: https://github.com/soulitzer
2024-01-19 18:51:52 +00:00
ab216bbaeb cleanup code comments analytical Jacobian as vjp projection (#117483)
Cleanup code comments for `_compute_analytical_jacobian_rows` to make clear Jacobian is computed by standard basis vector projections using the vector-Jacobian-product operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117483
Approved by: https://github.com/soulitzer
2024-01-19 18:50:26 +00:00
40dbd567e0 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-19 18:42:12 +00:00
2f4456a73e Remove xfail on test_make_weak_keyed_dict_from_weak_keyed_dict (#117848)
Based on the logs, this test has been consistently passing, so we remove
the xfail.

Fixes https://github.com/pytorch/pytorch/issues/116765
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117848
Approved by: https://github.com/Skylion007
ghstack dependencies: #117765
2024-01-19 18:05:30 +00:00
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e13624998f2a4de29bce73a949d7f0339ec04e.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
f316c35a34 [export] Support preserving submodule callling convention in non-strict export (#117796)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D52889236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117796
Approved by: https://github.com/angelayi
2024-01-19 17:16:45 +00:00
249a226113 [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if the user passes in a data structure which does not have a pytree flatten registration, this will error with a message along the lines of "It looks like one of the inputs with type CustomType is not supported or pytree flatten-able... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
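As an illustration, a hedged sketch of registering a custom container so export can flatten it. The helper name comes from the error message above; its exact signature (and whether top-level custom containers need extra serialization hooks) is an assumption here and may differ across versions:

```python
from dataclasses import dataclass

import torch
import torch.utils._pytree as pytree

# Hypothetical user container that export cannot flatten by default.
@dataclass
class Pair:
    a: torch.Tensor
    b: torch.Tensor

# Assumed signature: (cls, flatten_fn -> (children, context), unflatten_fn).
pytree.register_pytree_node(
    Pair,
    lambda p: ([p.a, p.b], None),
    lambda values, context: Pair(*values),
)

class M(torch.nn.Module):
    def forward(self, pair):
        return pair.a + pair.b

ep = torch.export.export(M(), (Pair(torch.ones(2), torch.ones(2)),))
print(ep)
```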

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
2024-01-19 17:13:39 +00:00
6c5c2121b1 Run some OOMing tests serially (#117759)
They were disabled due to being flaky due to OOMs but got renamed.  Seeing if running serially helps

I kind of want to keep this test disabled since the rest of the file is probably fine...

Issues in question: #113132 #113136 #113140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117759
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-19 16:45:35 +00:00
de25718300 [release] Docker Release build trigger on rc for testing (#117849)
Enable triggering the Docker Release builds on an RC. Use the test channel in this case. Hence the following logic is applied:
1. On RC trigger use test channel and upload to pytorch-test : https://github.com/orgs/pytorch/packages/container/package/pytorch-test
2. On Final RC use prod channel and upload to pytorch : https://github.com/orgs/pytorch/packages/container/package/pytorch
3. Nightly: https://github.com/orgs/pytorch/packages/container/package/pytorch-nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117849
Approved by: https://github.com/malfet
2024-01-19 15:01:46 +00:00
03b12e56c7 accelerate binary_cross_entropy_with_logits by using log_sigmoid operator (#115539)
When I was reimplementing BCEwithLogits, I found that `log_sigmoid` operator could accelerate the function.

Simple benchmark on AMD 3600 CPU Ubuntu 22.04:
|avg time (ms)|with `pos_weight`|no `pos_weight`|
|-|-|-|
|original|1986|1658|
|this PR|1295|995|

35-40% faster. This probably benefits from the `log_sigmoid` vectorization code.

A CUDA benchmark was not obtained, but I believe CUDA can also benefit from reducing kernel launches, as https://github.com/pytorch/pytorch/pull/11054#issuecomment-442233714 and https://github.com/pytorch/pytorch/pull/78267#issue-1248398454 mention.
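For reference, the identity being exploited, written as a small eager sketch (base case without `pos_weight`, not the actual ATen kernel):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8)   # logits
y = torch.rand(8)    # targets in [0, 1]

# BCE-with-logits via log_sigmoid:
#   -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
# = -[y * logsigmoid(x) + (1 - y) * logsigmoid(-x)]
via_logsigmoid = -(y * F.logsigmoid(x) + (1 - y) * F.logsigmoid(-x)).mean()

print(torch.allclose(via_logsigmoid, F.binary_cross_entropy_with_logits(x, y)))  # True
```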

The simple benchmark cpp file:
[demo.txt](https://github.com/pytorch/pytorch/files/13635355/demo.txt)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115539
Approved by: https://github.com/lezcano
2024-01-19 14:56:43 +00:00
98a044d33e [CI] Build M1 conda binaries on M1 runners (#117801)
As usual, almost no work on PyTorch side, all changes are on the builder end, namely:
- 8b67d32929 - depend on `blas * mkl` only on x86 machines
- eb78393f1e - install arm64 conda when running on Apple Silicon
- 0d3aea4ee0 - constrain llvmdev-9 to x86 machines only
- 6c6a33b271 - set correct DEVELOPER_DIR path

TODO:
 - We should auto-detect this `DEVELOPER_DIR` via `xcode-select`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117801
Approved by: https://github.com/atalman
2024-01-19 14:31:12 +00:00
17c5f69852 Run test_jit with PYTORCH_TEST_WITH_DYNAMO=1 in CI (#117765)
Gets rid of all the single test excludes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117765
Approved by: https://github.com/voznesenskym
2024-01-19 13:42:41 +00:00
f115f1cde1 [Quant] Enable QConv2d with hardswish post op (#117487)
**Summary**
Enable QConv2d implementation with post op `hardswish`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_hardswish_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117487
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-01-19 13:24:06 +00:00
cyy
5756b7a08e Remove math_compat.h (#117828)
Follows #116167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117828
Approved by: https://github.com/malfet
2024-01-19 12:56:17 +00:00
f2d6e99f8d Workaround a cusolver bug on CUDA < 12.1 in triangular_solve (#117636)
Fix https://github.com/pytorch/pytorch/issues/79191

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117636
Approved by: https://github.com/malfet
2024-01-19 12:42:37 +00:00
suo
4057d005ff Initial torchbind support in PT2 (#117697)
This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2.

It implements:
* ProxyTensor support
* Simple torch.export support (proxytensor-only path, e.g. non-strict).
* add some tests exercising the path.

Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now.

Still on the agenda:
* Dynamo support
* Actual FakeMode support
* Mutability support

Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally.

Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697
Approved by: https://github.com/SherlockNoMad
2024-01-19 06:28:20 +00:00
c51a4e64c0 Add support for compiling SDPAParams (#117207)
Allows us to `allow_in_graph` this `torch._C` struct for supporting scaled dot product attention.
helps unblock https://github.com/pytorch/pytorch/pull/116071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117207
Approved by: https://github.com/voznesenskym
2024-01-19 05:51:15 +00:00
8524fa566c [executorch hash update] update the pinned executorch hash (#117593)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117593
Approved by: https://github.com/pytorchbot
2024-01-19 04:34:12 +00:00
f302a0d380 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-19 04:28:50 +00:00
924ed91612 Move getDurationFromFirstEvent to USE_C10D_NCCL ifdef (#117738)
Fixes #117517

Move the NCCL-related function *getDurationFromFirstEvent* under the USE_C10D_NCCL ifdef (related to https://github.com/pytorch/pytorch/issues/114575).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117738
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-01-19 04:28:47 +00:00
cyy
38d9b3d937 Remove use of math_compat.h (#116167)
Because  ANDROID>=21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains solely wrapper math functions for ANDROID, so we can remove its usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167
Approved by: https://github.com/ezyang
2024-01-19 03:37:55 +00:00
cyy
5c17f66a3d [Exception] [5/N] Remove torch::IndexError (#117713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117713
Approved by: https://github.com/ezyang
2024-01-19 03:36:15 +00:00
3131e0460e Changed return type of randint64_cpu to int64_t to prevent codegen issues (#117443)

Fixes #117435.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117443
Approved by: https://github.com/ezyang
2024-01-19 03:23:20 +00:00
1adf77ce5e Don't use functional tensor inside _unstack_pytree (#117811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117811
Approved by: https://github.com/ydwu4
2024-01-19 03:15:06 +00:00
c16e6e4cf7 [ProcessGroup] Make watchdog check work queue more frequently (#117297)
Today the watchdog's sleep interval is 1s. That's a bit long compared to a modern GPU link's (or network link's) speed.

Take DDP and Ampere for example:

DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s

25 MB / 250 GB/s = 100 ms.
So we are updating the interval to 100 ms.

Update:
25 MB / 250 GB/s = 0.1 ms
But let's see how it goes before making the checking even more aggressive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj
2024-01-19 02:33:31 +00:00
aadbaf8e2d [EZ][BE] Move build_android_gradle.sh (#117795)
From `.circleci/scripts` to `scripts`, next to another `build_android.sh`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117795
Approved by: https://github.com/huydhn
2024-01-19 02:14:28 +00:00
d618e86328 [ONNX] Bump transformers in CI test (#117703)
Fixes #117660

(1) skip dynamic tests for exported program in `test_fx_to_onnx_onnxruntime.py`, as they are not expected to pass anyway.
(2) Move the dolly model to runtime, since it works in export but is blocked by non-persistent buffers as well.
(3) openai whisper has changed/regressed due to modeling modifications.
(4) Replace OpenLlama with Llama, because OpenLlama is deprecated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117703
Approved by: https://github.com/thiagocrepaldi
2024-01-19 02:10:10 +00:00
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00
c317bf2c2b [HigherOrderOp][BE] factor out merge_graph_inputs (#116912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116912
Approved by: https://github.com/zou3519
ghstack dependencies: #116721, #116823
2024-01-19 00:35:26 +00:00
c6028f8f73 [HigherOrderOp] Add while_loop support (#116823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116823
Approved by: https://github.com/zou3519
ghstack dependencies: #116721
2024-01-19 00:35:26 +00:00
113f0749f5 [HigherOrderOp] move some common utils in cond to utils.py (#116721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116721
Approved by: https://github.com/zou3519
2024-01-19 00:35:26 +00:00
77cfacab55 Revert "Reduce pytest prints (#117069)"
This reverts commit 2f89ef23007626aca1a577a4a388e315253c834f.

Reverted https://github.com/pytorch/pytorch/pull/117069 on behalf of https://github.com/clee2000 due to distributed tests are not printing items ([comment](https://github.com/pytorch/pytorch/pull/117069#issuecomment-1899433816))
2024-01-19 00:27:03 +00:00
a468b9fbdf Update xla.txt to fix missing commit (#117708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117708
Approved by: https://github.com/masnesral, https://github.com/huydhn
2024-01-18 23:51:51 +00:00
2f84a9d37c Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)"
This reverts commit 5aa92b5090e3db4a053548a3f360dd06c16df2f7.

Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))
2024-01-18 23:40:30 +00:00
2f89ef2300 Reduce pytest prints (#117069)
* custom pytest-shard so I can control the verbosity (also index by 1 since it's confusing)
* normal runs (not keep-going) always rerun each failed test 9 times (3 per process, 3 processes).  Previously it would only run the entire test file 3 times, so if a test before you segfaulted, you only got 2 tries

Example of quieter log https://github.com/pytorch/pytorch/actions/runs/7481334046/job/20363147497
"items in shard" only gets printed once at the beginning, and the reruns just say how many got skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117069
Approved by: https://github.com/huydhn
2024-01-18 23:30:59 +00:00
e432b2e607 [inductor] multi-kernel support (#103469)
For a persistent reduction, we generate 2 flavors of 'equivalent' kernels at the same time
- persistent reduction
- regular reduction

A MultiKernel wraps these 2 kernels and picks the one with better performance at runtime.

Here I talk more about implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the inductor state before we can generate the counterpart.

***There is one thing I need comments from others on***:
There is one tricky thing about kernel arguments. In general, inductor removes a buffer from the argument list if it's only used inside the kernel. But somehow a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of some CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have consistent argument lists. Another idea I have is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants. But this would make the code-gen for multi-kernel much more complex.

I'm not sure if there is some easy and clean way to resolve this.

Testing command:
```

TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
2024-01-18 23:16:31 +00:00
fee96adde7 [EZ] Update weekly.yml to use actions from test-infra (#117775)
It was deleted from `pytorch/pytorch` by https://github.com/pytorch/pytorch/pull/117506

Thanks [BowenBao](https://github.com/BowenBao) for alerting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117775
Approved by: https://github.com/huydhn
2024-01-18 22:58:32 +00:00
6d9432c44c [ONNX][dynamo_export] Decomposition skips using custom operator (#117314)
A context manager that disables the decomposition of certain ops during dynamo tracing.

The approach is to temporarily hijack the operator callable with PT2 custom operator.
The custom operator will not be decomposed and will show up as a single node to be exported to ONNX.

For the time being the decomposition of these ops is otherwise unavoidable.

https://github.com/pytorch/pytorch/issues/116684
https://github.com/pytorch/pytorch/issues/115883

This solution will no longer be required once the issue is resolved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117314
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-01-18 22:19:28 +00:00
92d718aed1 [export] Add lifted constant obj to input (#116985)
Test Plan: wip

Differential Revision: D52556070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116985
Approved by: https://github.com/suo
2024-01-18 22:10:53 +00:00
eba5d5485d [dynamo] make ConstantSource propagate through built-in ops for TensorVariable (#117704)
Fixes #117685.

This PR only makes ConstantSource preserved for built-in ops when we find all the inputs are either constant tensors or python constants.

It doesn't fundamentally solve the problem of preserving ConstantSource information through all operators that could potentially be constant folded.

For the following code in the issue:
```
class Bob(torch.nn.Module):
    def __init__(self, p, val) -> None:
        super().__init__()
        self.p = p
        self.y = torch.nn.Parameter(torch.tensor(val))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # This only looks dynamic but it's actually a constant value
        if get_y(self.y) < self.p:
            return torch.cat([x,x])
        else:
            return x
```
The graph exported looks like following:
```python
class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: "f32[s0, s1]";

        arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        l_x_ = arg0

        # File: /home/yidi/local/pytorch/test/dynamo/test_export.py:1498 in forward, code: return torch.cat([x, x])
        cat = torch.cat([l_x_, l_x_]);  l_x_ = None
        return pytree.tree_unflatten([cat], self._out_spec)
```

Test Plan:
Added a new test for the given repro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117704
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-01-18 20:18:34 +00:00
1462d72904 Speed up triu_tril_kernel (#115013)
1. Batch Processing: Enhance kernel efficiency by having each thread handle multiple elements, reducing the frequency of offset calculations.
2. Inplace Operation Optimization: For inplace functions, eliminate unnecessary copying to enhance performance.

Up to 5x speed up compared to torch 2.1.1

# Benchmark
Test on NVIDIA RTX 3080, WSL, CUDA 12.1. Peak performance is recorded.

  | function | dtype | shape | k | torch 2.1.1 | this PR | speed up
-- | -- | -- | -- | -- | -- | -- | --
various   dtype |   |   |   |   |   |  
  | triu_ | int8 | [1, 3072, 3072] | 0 | 0.107 | 0.028 | 3.76x
  | triu_ | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.029 | 3.79x
  | triu_ | float32 | [1, 3072, 3072] | 0 | 0.114 | 0.045 | 2.52x
  | triu_ | float64 | [1, 3072, 3072] | 0 | 0.172 | 0.082 | 2.11x
  | triu | int8 | [1, 3072, 3072] | 0 | 0.111 | 0.056 | 2.00x
  | triu | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.049 | 2.22x
  | triu | float32 | [1, 3072, 3072] | 0 | 0.116 | 0.091 | 1.27x
  | triu | float64 | [1, 3072, 3072] | 0 | 0.175 | 0.176 | 1.00x
various   shape |   |   |   |   |   |  
  | triu_ | float32 | [1, 8192, 8192] | 0 | 0.798 | 0.311 | 2.56x
  | triu_ | float32 | [4, 1024, 1024] | 0 | 0.054 | 0.023 | 2.37x
  | triu_ | float32 | [4, 1021, 1021] | 0 | 0.054 | 0.023 | 2.33x
  | triu_ | float32 | [256, 128, 256] | 0 | 0.111 | 0.038 | 2.92x
  | triu_ | float32 | [128, 257, 125] | 0 | 0.051 | 0.029 | 1.77x
  | triu_ | float32 | [20480, 16, 16] | 0 | 0.072 | 0.036 | 1.97x
  | triu | float32 | [1, 8192, 8192] | 0 | 0.797 | 0.611 | 1.31x
  | triu | float32 | [4, 1024, 1024] | 0 | 0.056 | 0.042 | 1.32x
  | triu | float32 | [4, 1021, 1021] | 0 | 0.058 | 0.044 | 1.32x
  | triu | float32 | [256, 128, 256] | 0 | 0.114 | 0.093 | 1.22x
  | triu | float32 | [128, 257, 125] | 0 | 0.051 | 0.036 | 1.43x
  | triu | float32 | [20480, 16, 16] | 0 | 0.075 | 0.061 | 1.23x
various dim |   |   |   |   |   |  
  | triu_ | float32 | [3072, 3072] | 0 | 0.093 | 0.037 | 2.49x
  | triu_ | float32 | [1, 3072, 3072] | 0 | 0.114 | 0.045 | 2.52x
  | triu_ | float32 | [1, 1, 3072, 3072] | 0 | 0.138 | 0.053 | 2.60x
  | triu | float32 | [3072, 3072] | 0 | 0.097 | 0.091 | 1.07x
  | triu | float32 | [1, 3072, 3072] | 0 | 0.116 | 0.091 | 1.27x
  | triu | float32 | [1, 1, 3072, 3072] | 0 | 0.140 | 0.090 | 1.55x
various k |   |   |   |   |   |   |  
  | triu_ | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.029 | 3.79x
  | triu_ | float16 | [1, 3072, 3072] | 1536 | 0.103 | 0.042 | 2.44x
  | triu_ | float16 | [1, 3072, 3072] | -1536 | 0.114 | 0.020 | 5.68x
  | triu | float16 | [1, 3072, 3072] | 0 | 0.108 | 0.049 | 2.22x
  | triu | float16 | [1, 3072, 3072] | 1536 | 0.104 | 0.039 | 2.65x
  | triu | float16 | [1, 3072, 3072] | -1536 | 0.115 | 0.058 | 2.00x

# Benchmark Code

```python3
import time
import torch

torch.manual_seed(42)

def timeit(f, run_times=1000):
    torch.cuda.synchronize()
    t1 = time.time()
    for _ in range(run_times):
        f()
    torch.cuda.synchronize()
    t2 = time.time()
    return (t2 - t1) / run_times

for dtype in [torch.int8, torch.float16, torch.float32, torch.float64]:
    for shape in [
        [1, 8192, 8192],
        [3072, 3072],
        [1, 3072, 3072],
        [1, 1, 3072, 3072],
        [4, 1024, 1024],
        [4, 1021, 1021],
        [256, 128, 256],
        [128, 257, 125],
        [20480, 16, 16],
    ]:
        for k in [0, shape[-1] // 2, -shape[-1] // 2]:
            a = torch.empty(shape, dtype=dtype, device="cuda")
            for _ in range(4):
                t_triu = timeit(lambda: a.triu(k))
                t_triu_ = timeit(lambda: a.triu_(k))
                t_clone = timeit(lambda: a.clone())
                print(dtype, shape, f"{k=}", f"triu_ {t_triu_ * 1000:.6f} ({t_triu_ / t_clone:.2f}xMemcpy)", f"triu {t_triu * 1000:.6f} ({t_triu / t_clone:.2f}xMemcpy)")

            a = torch.rand(shape, device="cuda")
            a = (a * 10).to(dtype)
            assert (a.triu(k) == a.cpu().triu(k).cuda()).all()
            assert (a.tril(k) == a.cpu().tril(k).cuda()).all()
            assert (a.clone().triu_(k) == a.triu(k)).all()
            assert (a.clone().tril_(k) == a.tril(k)).all()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115013
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-01-18 19:58:00 +00:00
16ebfbbf07 All tests run with markDynamoStrictTest now (#117763)
Last test to remove from the denylist was dynamo/test_logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117763
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747, #117754, #117761
2024-01-18 19:42:41 +00:00
5278200507 Add some better docs for dynamo_test_failures.py (#117761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117761
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747, #117754
2024-01-18 19:42:41 +00:00
07216721cf [codemod] markDynamoStrictTest batch 23 (#117754)
[codemod] markDynamoStrictTest test_custom_ops
[codemod] markDynamoStrictTest test_python_dispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117754
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117729, #117747
2024-01-18 19:37:04 +00:00
def4959662 Revert "[inductor] allow mm template to accumulate with float16 dtype (#117479)"
This reverts commit a7fbbc2a4a05fa4863f9d0e2adabcdc5e276c675.

Reverted https://github.com/pytorch/pytorch/pull/117479 on behalf of https://github.com/PaliC due to breaking tests internally ([comment](https://github.com/pytorch/pytorch/pull/117479#issuecomment-1899032973))
2024-01-18 18:53:37 +00:00
suo
23d53a4360 add test_public_bindings to internal CI (#117712)
enable this test in meta-internal CI, since it's mildly infuriating to not be able to locally test this when working inside meta

One change:
This test uses `pkgutil.walk_packages`, which ignores namespace packages. A quirk in Meta's internal python packaging system is that it adds `__init__.py` to each source directory. So this test picks up more files to check internally than in the GitHub CI.

So I changed this test from using raw `pkgutil` to a version that also looks into namespace packages, so we're checking the same thing across both CIs.

Differential Revision: [D52857631](https://our.internmc.facebook.com/intern/diff/D52857631/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117712
Approved by: https://github.com/ezyang
2024-01-18 18:20:43 +00:00
1b773df3c6 Place .lrodata later in the binary (#117575)
Summary:
By default, in LLD 16, .lrodata is placed immediately after .rodata.
However, .lrodata can be very large in our compiled models, which leads to
relocation out-of-range errors for relative relocations. So we place it
after other the sections that are referenced from .text using relative
relocations. This is the default behavior in GNU ld.
Reviewed By: muchulee8, desertfire, khabinov, chenyang78

Differential Revision: D52557846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117575
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-01-18 17:58:18 +00:00
7451dd0585 Revert "Add node meta value into UnflattenedModule (#117686)"
This reverts commit cbf24ba962f72175ec1c71a25f3379f7d9149ec1.

Reverted https://github.com/pytorch/pytorch/pull/117686 on behalf of https://github.com/PaliC due to breaks internal modeling tests ([comment](https://github.com/pytorch/pytorch/pull/117686#issuecomment-1898939899))
2024-01-18 17:46:38 +00:00
5aa895e53e Don't run inductor tests in Dynamo shard (#117747)
In theory we could, but these get really slow once we turn on strict
mode, so we're not going to for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117747
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117729
2024-01-18 17:43:30 +00:00
646229218f Revert "[export] Error on not pytree-flattened nodes (#117598)"
This reverts commit 560213de2d8f734987e25680e72d565501ab8318.

Reverted https://github.com/pytorch/pytorch/pull/117598 on behalf of https://github.com/PaliC due to breaking executorch tests internally ([comment](https://github.com/pytorch/pytorch/pull/117598#issuecomment-1898926720))
2024-01-18 17:37:59 +00:00
4720109d7f [dynamo] add common methods to DistributedVariable (#117590)
This PR refactors the distributed related variables to use
DistributedVariable for common methods, so that things like
`python_type` works for all distributed variables.

Maybe we can add `as_python_constant` to the DistributedVariable too? I
didn't add it in this PR, but if that makes sense I can update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117590
Approved by: https://github.com/voznesenskym
2024-01-18 17:32:31 +00:00
044b9012d5 Update PocketFFT (#117595)
This updates PocketFFT submodule to 9d3ab05a7f

Probably fixes https://github.com/pytorch/pytorch/issues/117589 (as it includes https://github.com/mreineck/pocketfft/issues/5 that should fix PocketFFT compilation on Windows)

Also adjust `#if __cplusplus >= 201703` replace path in Android scripts (need to submit the fix back to PocketFFT)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117595
Approved by: https://github.com/huydhn
2024-01-18 17:08:44 +00:00
db1a6eda9e [codemod] markDynamoStrictTest batch 22 (#117729)
[codemod] markDynamoStrictTest test_autograd
[codemod] markDynamoStrictTest test_ao_sparsity
[codemod] markDynamoStrictTest test_jit
[codemod] markDynamoStrictTest test_quantization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117729
Approved by: https://github.com/bdhirsh
2024-01-18 16:59:26 +00:00
fa86fa7a61 Fix MSVC 14.38 - VS 2022 Build (#117497)
Fixes #115922

This PR is prepared to separate existing https://github.com/pytorch/pytorch/pull/116926 and to apply suggestions in the review.

`scalar_t` which is defined as `c10::impl::ScalarTypeToCPPType<ScalarType::Half>::t` appears to be causing the issue with `Visual Studio 2022 17.8.4`  (coming with `MSVC 14.38.33130`)

Error message:
```
aten\src\ATen/cpu/vec/vec_base.h(150): fatal error C1001: Internal compiler error.
(compiler file 'D:\a_work\1\s\src\vctools\Compiler\CxxFE\sl\p1\c\toinil.c', line 910)
```

---

Related line was added for a similar issue before as a workaround (`scalar_t` definition) [Fix compile error for vs2022](https://github.com/pytorch/pytorch/pull/85958)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117497
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-01-18 16:53:27 +00:00
a669319450 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-18 16:20:12 +00:00
6e4e81a9ef [dynamo] Extend LazyVariableTracker to tuples (#117426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117426
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-18 15:51:28 +00:00
26956980c6 [AOTI] Add torch._export.aot_load (#117610)
Summary: Add a torch._export.aot_load API that can load an AOTInductor-compiled model.so into a python executable.
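A hedged usage sketch (the `model.so` path, device and shapes are placeholders; assumes the shared library was produced earlier, e.g. with `torch._export.aot_compile`):

```python
import torch

# Load an AOTInductor-compiled shared library into the current Python process.
runner = torch._export.aot_load("model.so", device="cuda")

x = torch.randn(8, 16, device="cuda")
out = runner(x)
```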

Test Plan: CI

Differential Revision: D52825456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610
Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78
2024-01-18 15:02:16 +00:00
2fb9d8811f Don't try to directly compare symbols, it won't work (#117674)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117674
Approved by: https://github.com/lezcano
2024-01-18 12:18:45 +00:00
8bf788c390 [SAC][Dynamo] Add support for functools.partial in CheckpointHigherOrderVariable (#117657)
# Context

In some cases, we might want to build the `context_fn` with runtime-defined policies. One way of implementing this is to make `context_fn` be a partial, which holds the information that we want to pass. One concrete example is the [automatic policy selection from `xformers`](ad986981b1/xformers/checkpoint.py (L185)).

# The problem

The previous implementation wouldn't work with partials because `FunctoolsPartialVariable` doesn't have a `fn` attribute.

This PR addresses this case, but ideally we could get this solved in a more general fashion, as callable classes and `NestedUserFunctionVariable` are not supported by this PR.
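For illustration, a minimal sketch of the pattern this PR unblocks; the policy argument and the no-op contexts are placeholders for a real policy-driven `context_fn` such as the xformers one:

```python
import functools
from contextlib import nullcontext

import torch
from torch.utils.checkpoint import checkpoint

def make_contexts(policy):
    # A real implementation would build caching/recompute contexts from `policy`.
    return nullcontext(), nullcontext()

def gn(x):
    return torch.sigmoid(torch.matmul(x, x))

def fn(x):
    # context_fn is a functools.partial, which Dynamo can now handle.
    return checkpoint(
        gn,
        x,
        use_reentrant=False,
        context_fn=functools.partial(make_contexts, policy="recompute_all"),
    )

x = torch.randn(4, 4, requires_grad=True)
torch.compile(fn)(x).sum().backward()
```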

# Tests

I've added a basic test that mimics the tests around it. The tests could probably be simplified, but I've decided to keep changes to a minimum.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117657
Approved by: https://github.com/yf225
2024-01-18 11:59:23 +00:00
b0084be114 Revert "Re-enable SGD (#117434)"
This reverts commit e7fac72be75a9fa7a31c6fc8062364fdfc4aaa3a.

Reverted https://github.com/pytorch/pytorch/pull/117434 on behalf of https://github.com/lezcano due to breaks test_profiler.py when run with dynamo ([comment](https://github.com/pytorch/pytorch/pull/117434#issuecomment-1898311961))
2024-01-18 11:37:36 +00:00
0d1e7053ac [easy] Log guard failure (#117639)
Facilitates greatly debugging guard creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117639
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #112252, #117630, #110524, #108420
2024-01-18 09:37:33 +00:00
4ba5318d3f [dynamo] Add DictView variable tracker (#108420)
This also starts a comparison pattern where we don't ask variables
what their type is, but what their capabilities are.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108420
Approved by: https://github.com/jansel
ghstack dependencies: #112252, #117630, #110524
2024-01-18 09:37:33 +00:00
f4df0f061c Implement set in terms of dict (#110524)
This allows us to heavily simplify the implementation of set, which was
"quite unique". Now we represent a set as a dict where all its values
are None.
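A plain-Python sketch of the representation (not Dynamo's VariableTracker code): a set behaves like a dict whose values are all None.

```python
items = {"a", "b", "c"}
as_dict = dict.fromkeys(items)   # {'a': None, 'b': None, 'c': None}

assert set(as_dict) == items     # same elements
assert "b" in as_dict            # membership works the same way
as_dict["d"] = None              # "add" an element
del as_dict["a"]                 # "discard" an element
```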

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110524
Approved by: https://github.com/jansel
ghstack dependencies: #112252, #117630
2024-01-18 09:36:41 +00:00
bc85eb948f Break on unsupported keys for dicts / elements for sets (#117630)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117630
Approved by: https://github.com/jansel
ghstack dependencies: #112252
2024-01-18 09:35:46 +00:00
4512a95371 [easy]Remove specialized value (#112252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112252
Approved by: https://github.com/jansel
2024-01-18 09:34:50 +00:00
2dd4a254a0 add Half support for interpolate operators on CPU (#105648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105648
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 09:07:16 +00:00
c9528a11dd Add Half support for masked_softmax on CPU (#117028)
Add Half support for `masked_softmax` on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117028
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2024-01-18 08:59:20 +00:00
e60bc502b4 [Inductor Intel GPU backend Upstream] Generalize part of Inductor test case (#117513)
Following the RFC https://github.com/pytorch/pytorch/issues/114856, before upstreaming the Intel XPU Inductor backend, we need to prepare corresponding Inductor test cases. This PR aims to generalize part of the Inductor test cases so that a new GPU backend can reuse the existing test cases with minimal code change.

This Pull Request preferentially generalizes the test cases that cover Inductor's base functionality as follows:
- test/inductor/test_codecache.py
- test/inductor/test_codegen_triton.py
- test/inductor/test_kernel_benchmark.py
- test/inductor/test_torchinductor.py
- test/inductor/test_torchinductor_codegen_dynamic_shapes.py
- test/inductor/test_torchinductor_dynamic_shapes.py
- test/inductor/test_torchinductor_opinfo.py
- test/inductor/test_triton_heuristics.py
- test/inductor/test_triton_wrapper.py

Feature request: https://github.com/pytorch/pytorch/issues/114856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117513
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-01-18 08:26:21 +00:00
cyy
b72ddbab60 [Clang-tidy header][15/N] Enable clang-tidy on headers in c10/cuda and c10/mobile (#116602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116602
Approved by: https://github.com/ezyang
2024-01-18 08:15:50 +00:00
57ca455471 [dynamo] Add hasattr support for TupleVariable (#117694)
Summary:
This change adds hasattr support for TupleVariable in dynamo.

This fix is part of: https://github.com/pytorch/pytorch/issues/117670
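A small sketch of the kind of code this lets Dynamo trace without falling back (the function itself is made up for illustration):

```python
import torch

def fn(x):
    t = (x, x + 1)
    # hasattr on a tuple is now handled by Dynamo's TupleVariable.
    if hasattr(t, "index"):
        return t[0] + t[1]
    return x

print(torch.compile(fn, fullgraph=True)(torch.ones(3)))
```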

Test Plan: Unit test and CI

Differential Revision: D52850665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117694
Approved by: https://github.com/yanboliang
2024-01-18 07:47:43 +00:00
bc9cb04822 Replaced CHECK with TORCH_CHECK in order to not abort, but throw a RuntimeError instead (#117653)

Fixes #117499.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117653
Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG, https://github.com/alanwaketan
2024-01-18 07:47:22 +00:00
e7fac72be7 Re-enable SGD (#117434)
Re-enables the SGD optimizer now that compile times are more reasonable. [Benchmark run](https://github.com/pytorch/pytorch/actions/runs/7511073761)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117434
Approved by: https://github.com/anijain2305, https://github.com/janeyx99
2024-01-18 06:47:15 +00:00
79811e765c [2/4] Intel GPU Runtime Upstreaming for Device (#116833)
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR  covers the changes under `aten`.

# Design
We will compile the code for XPU separately into a library named `libtorch_xpu.so`. Currently, it primarily offers device-related APIs, including
- `getCurrentDeviceProperties`
- `getDeviceProperties`
- `getGlobalIdxFromDevice`
- `getDeviceFromPtr`

# Additional Context
`XPUHooks` is an indispensable part of the runtime. We upstream `XPUHooks` in this PR since it contains some `Device`-related code; we also refine some logic to avoid a forward declaration in `DLPack`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116833
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
2024-01-18 05:02:42 +00:00
61ea3036bc Allow explicit shutdown of the compile-worker pools (#117664)
Summary: Allow the trainer to explicitly shut down the compile-worker pools to save CPU resources, thereby avoiding QPS degradation.

Test Plan: See the test plan in D52839313
Differential Revision: D52839313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117664
Approved by: https://github.com/yanboliang
2024-01-18 04:56:11 +00:00
1859895ffa Docs: fix docstring errors in model_averaging (#117038)
pydocstyle check

averagers.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
        D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:106 in public method `average_parameters`:
        D400: First line should end with a period (not '`')
6

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:20 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:27 in public method `average_parameters`:
        D102: Missing docstring in public method
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/averagers.py:84 in public method `__init__`:
        D107: Missing docstring in __init__
4

utils.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:17 in public function `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:45 in public function `get_params_to_average`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:68 in public function `average_parameters_or_parameter_groups`:
        D200: One-line docstring should fit on one line with quotes (found 3)
5

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/utils.py:1 at module level:
        D100: Missing docstring in public module
1

hierarchical_model_averager.py

Pre
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:16 in public class `HierarchicalModelAverager`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:98 in public method `__init__`:
        D107: Missing docstring in __init__
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D400: First line should end with a period (not ',')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:137 in private method `_find_process_group`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
        D205: 1 blank line required between summary line and description (found 0)
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:151 in public method `average_parameters`:
        D400: First line should end with a period (not '`')
8

Post
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:1 at module level:
        D100: Missing docstring in public module
/workspaces/pytorch/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:99 in public method `__init__`:
        D107: Missing docstring in __init__
2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117038
Approved by: https://github.com/H-Huang
2024-01-18 04:12:51 +00:00
4f2620ce56 [PT2][split_cat] fix a bug in merge_splits (#117707)
Summary: Recently, we found that merge_splits (D45204109) is not working for the AFOC model, so this patches a fix.

Test Plan:
The error log: P1046934021
# Flows used to local reproduce
### non-first:
f522317780
after the fix: P1047603217
### first:
f522253163
after the fix: P1047764917

Differential Revision: D52856359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117707
Approved by: https://github.com/jackiexu1992
2024-01-18 04:04:32 +00:00
suo
02c96f6949 [export] modify torch.export tests to pass a Module in (#117572)
We have a lot of tests that pass a function to torch.export.

We are planning to disallow this, so fix up the tests to pass a module in.

Differential Revision: [D52791309](https://our.internmc.facebook.com/intern/diff/D52791309/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117572
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117570, #117571
2024-01-18 03:40:40 +00:00
suo
ccc8440609 [export] introduce WrapperModule (#117571)
Simple module to wrap a callable. This is a useful utility for when we start requiring that torch.export take an nn.Module.
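
A rough sketch of the idea (not necessarily the exact class that landed):

```python
import torch

class WrapperModule(torch.nn.Module):
    """Wrap a plain callable so it can be passed to torch.export, which expects an nn.Module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, *args, **kwargs):
        return self.fn(*args, **kwargs)

ep = torch.export.export(WrapperModule(lambda x: x + 1), (torch.ones(3),))
```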

Differential Revision: [D52791310](https://our.internmc.facebook.com/intern/diff/D52791310/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117571
Approved by: https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri
ghstack dependencies: #117570
2024-01-18 03:40:34 +00:00
suo
5697986482 [export] change exportdb to require torch.nn.Module (#117570)
Part of the effort to make torch.export require nn.Module.

Differential Revision: [D52631366](https://our.internmc.facebook.com/intern/diff/D52631366/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117570
Approved by: https://github.com/tugsbayasgalan
2024-01-18 03:40:10 +00:00
41153542ae Use wait stream instead of synchronize() in cudagraph warmup (#117578)
Fix for https://github.com/pytorch/pytorch/issues/113895

There are three phases to cudagraph trees: warmup, recording, and execution. During recording and execution we execute under the current stream. In warmup we execute under a side stream that we also use for cudagraph recording, so as to reuse memory.

After we execute on the side stream we need to sync the current stream to the side stream. Previously there was a `torch.cuda.synchronize` but not a `torch.cuda.current_stream().wait_stream(stream)`. This PR removes the global sync and adds a wait_stream. I have confirmed that it fixes https://github.com/pytorch/pytorch/issues/113895.

It's not entirely clear to me why torch.cuda.synchronize would be insufficient - I would have thought the global sync would encompass the stream to stream sync. However, we do have a number of [instances](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/compile_fx.py#L748-L749) throughout the code base where we do a stream->stream sync after the global sync, so clearly I am missing something here. In any case, the stream->stream sync is better perf than a global synchronize.
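
A simplified sketch of the stream-to-stream sync pattern described above (the model and input are placeholders, not the cudagraph-trees code itself):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
inp = torch.randn(4, 8, device="cuda")

side_stream = torch.cuda.Stream()
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    out = model(inp)  # warmup runs on the side stream also used for cudagraph recording
# make the current stream wait on the side stream instead of a global synchronize()
torch.cuda.current_stream().wait_stream(side_stream)
```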

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117578
Approved by: https://github.com/zdevito
2024-01-18 03:33:44 +00:00
560213de2d [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if the user passes in a data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.
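
A minimal sketch of registering a custom container so it becomes pytree flatten-able (the `Pair` class below is made up for illustration):

```python
import torch.utils._pytree as pytree

class Pair:
    def __init__(self, a, b):
        self.a, self.b = a, b

pytree.register_pytree_node(
    Pair,
    lambda p: ((p.a, p.b), None),           # flatten: (children, context)
    lambda children, ctx: Pair(*children),  # unflatten
)

leaves, spec = pytree.tree_flatten(Pair(1, 2))
print(leaves)  # [1, 2]
```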

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri, https://github.com/BowenBao
2024-01-18 03:06:42 +00:00
634ce3c913 Document and type torch._inductor.virtualized (#117658)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117658
Approved by: https://github.com/eellison, https://github.com/peterbell10
ghstack dependencies: #117650
2024-01-18 03:03:20 +00:00
16ff6cd340 Catch some missing unbacked symbol dependencies (#117650)
Whenever an IR node has reference to an unbacked SymInt, we must
register it as a use of the unbacked SymInt.

This fix isn't complete but the rest of the fix is fairly difficult, so
putting this in to start.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117650
Approved by: https://github.com/lezcano
2024-01-18 03:03:20 +00:00
cb2b98ad6b [codemod] markDynamoStrictTest batch 21 (#117609)
[codemod] markDynamoStrictTest test_torch
[codemod] markDynamoStrictTest test_ops_gradients
[codemod] markDynamoStrictTest test_ops
[codemod] markDynamoStrictTest test_modules
[codemod] markDynamoStrictTest test_ops_jit
[codemod] markDynamoStrictTest test_ops_fwd_gradients
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117609
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117700, #117701, #117702
2024-01-18 02:49:26 +00:00
bbf65bc451 Revert "[Dynamo] Remove the workaround since it has been fixed (#117615)"
This reverts commit b3e2571e83eff4a5ce45a7ad037c2fa2df87da9d.

Reverted https://github.com/pytorch/pytorch/pull/117615 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it seems to start failing some dynamo tests in trunk b3e2571e83.  I try to disable the failed test but yet another one shows up ([comment](https://github.com/pytorch/pytorch/pull/117615#issuecomment-1897683076))
2024-01-18 02:48:34 +00:00
cbf24ba962 Add node meta value into UnflattenedModule (#117686)
Fixes #116670
Following the lead of #116720, added node.meta['val'] back to newly created subgraphs.

node.meta['val'] is essential to ONNX in terms of the shape and type information.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117686
Approved by: https://github.com/angelayi
2024-01-18 02:37:15 +00:00
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any user cares about the time difference and wants to see NCCL init errors early, they can use eager init now.
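
A hedged sketch of opting into eager init, assuming the `device_id` argument introduced by the linked PRs is the trigger:

```python
import torch
import torch.distributed as dist

local_rank = 0  # placeholder; normally taken from the launcher environment
# passing device_id eagerly creates the NCCL communicator at init time,
# surfacing init errors early instead of at the first collective
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),
)
```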

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00
21ddca4225 Enable HIP build for //sigrid/predictor:pytorch_disagg_gpu_task (#117616)
Summary: Tweak some header includes, and explicitly ignore the hipEventDestroy return value.

Test Plan: CI

Reviewed By: jiaqizhai

Differential Revision: D52722234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117616
Approved by: https://github.com/xw285cornell
2024-01-18 01:37:50 +00:00
3882714168 Fix check-labels.yml for ghstack PRs (#117680)
Otherwise check-labels doesn't run on ghstack PRs; see https://github.com/pytorch/pytorch/pull/117609 for example: no Check Labels workflow run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117680
Approved by: https://github.com/izaitsevfb
2024-01-18 01:33:55 +00:00
f7143b79bd Stricter pull_request_target in labeler.yml (#117677)
Copied from https://github.com/pytorch/pytorch/blob/main/.github/workflows/check-labels.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117677
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2024-01-18 01:33:49 +00:00
58c4bc62bb [c10d] Deprecate Work.result() (#117565)
Work.result() returns a vector of tensors. This signature is problematic as some collectives may return just one tensor (e.g. all-reduce), while others may return multiple tensors (e.g. all-gather).

It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs.

Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory.
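
A small sketch of the recommended pattern: read results from the tensor passed to the collective instead of `work.result()`:

```python
import torch
import torch.distributed as dist

t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)
work.wait()
print(t)  # the reduced values live in the tensor that was passed in
```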

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565
Approved by: https://github.com/wconstab
2024-01-18 01:22:37 +00:00
5aa92b5090 [CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663)
#113713

Going to clean up some of the checks and will remove draft status after.
Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`.
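
A minimal sketch of exercising the path (shapes and dtype are illustrative):

```python
# run with: TORCH_CUDNN_MHA_ENABLED=1 python sdpa_example.py   (SM80+ GPU)
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v)
```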

CC @drisspg @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663
Approved by: https://github.com/drisspg
2024-01-18 01:20:36 +00:00
a60b566d37 [TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066)
Summary:
Allow TorchElastic to manage more nodes than the maximum nnodes specified in a job. They will be used as spare capacity/warm nodes for schedulers that support elasticity.

RFC: https://github.com/pytorch/pytorch/issues/114097

Test Plan: Integration tests

Differential Revision: D52343874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066
Approved by: https://github.com/zdevito
2024-01-18 01:16:55 +00:00
a1afd1b195 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
It should have never been landed, but was landed again thanks to ghstack grafting/ungrafting; see discussion on https://github.com/pytorch/pytorch/pull/116910

This reverts commit e457b6fb18782425661e8a09d0222d0b29518ad1.
2024-01-17 17:06:32 -08:00
410515241d [c10d] Remove CoalescedWorkNCCL (#117696)
`CoalescedWorkNCCL` is dead code now. Nowhere is it used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117696
Approved by: https://github.com/wconstab
2024-01-18 01:00:43 +00:00
387ea260af [c10d] Enable watchdog for coalesced work (#117682)
Fixes https://github.com/pytorch/pytorch/issues/114301

Previously, coalesced work (created by `end_coalescing`) was not watched by the watchdog, which resulted in silent timeouts.

The culprit is that we reset `coalescing_state_` to 0 before checking it to see if we should enqueue a work.

Example:
```
import torch
import torch.distributed as dist
from datetime import timedelta

dist.init_process_group(backend="nccl", timeout=timedelta(seconds=10))
rank = dist.get_rank()
world_size = dist.get_world_size()
device = torch.device(f"cuda:{rank}")

# Create tensors of different sizes to create hang
s = 100 * 1024 * 1024 * (world_size - rank)
with dist._coalescing_manager(device=device):
    dist.all_reduce(torch.ones(s, device=device))
    dist.broadcast(torch.ones(s, device=device), src=0)

torch.cuda.synchronize()
print(f"{dist.get_rank()} done")

```

Watchdog fires:
```
$ torchrun --nproc-per-node 2 example.py
...
[rank1]:[E ProcessGroupNCCL.cpp:545] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10000 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:545] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=10000) ran for 10567 milliseconds before timing out.
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117682
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-01-18 00:42:36 +00:00
cyy
396a5c3091 [Exception] [4/N] Replace torch::IndexError and torch::ValueError with C10 counterparts (#117317)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117317
Approved by: https://github.com/ezyang
2024-01-18 00:35:29 +00:00
c64fd8b89c [codemod] markDynamoStrictTest batch 20 (#117702)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117702
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117700, #117701
2024-01-18 00:30:22 +00:00
3770311093 [codemod] markDynamoStrictTest batch 19 (#117701)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117701
Approved by: https://github.com/bdhirsh, https://github.com/malfet
ghstack dependencies: #117700
2024-01-18 00:30:22 +00:00
82c0083819 Fix triton wheels build (take 2) (#117706)
Sorry, I should have been more thorough in reviewing https://github.com/pytorch/pytorch/pull/117648. Triton wheels are built off the `main` branch, rather than `nightly`; see
2db53a01e5/.github/workflows/build-triton-wheel.yml (L1-L6)

Test plan: merge and hope for the best :P

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117706
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-01-18 00:26:36 +00:00
898f6a48a9 [codemod] markDynamoStrictTest batch 18 (#117700)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117700
Approved by: https://github.com/bdhirsh
2024-01-18 00:25:38 +00:00
b3e2571e83 [Dynamo] Remove the workaround since it has been fixed (#117615)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117615
Approved by: https://github.com/angelayi
2024-01-18 00:21:22 +00:00
3114813314 Replace constraints with dynamic_shapes in deeplearning/aot_inductor test (#117573)
Summary: `constraints` argument for `torch.export` has been deprecated in favor of the `dynamic_shapes` argument. This PR updates the use of the deprecated API in `deeplearning/aot_inductor/test/test_custom_ops.py`.

Test Plan: buck test mode/dev-nosan fbcode//deeplearning/aot_inductor/test:test_custom_ops -- test_export_extern_fallback_nodes_dynamic_shape

Differential Revision: D52790332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117573
Approved by: https://github.com/angelayi
2024-01-17 23:50:08 +00:00
2db53a01e5 propagate torch stack trace metadata to copy_() nodes during input mutations (#117587)
Tested by running the below script:
```
import torch
@torch.compile(backend="aot_eager", fullgraph=True)
def f(x):
    y = x.view(-1)
    y.mul_(2)
    return

x = torch.ones(4)
f(x)
```

Which gives me this ATen graph (notice that the copy_() node is bundled under the stacktrace for `mul_(2)`):
```
 ===== Forward graph 0 =====
 <eval_with_key>.2 from /data/users/hirsheybar/e/pytorch/torch/fx/experimental/proxy_tensor.py:521 in wrapped class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[4]"):
        # File: /data/users/hirsheybar/e/pytorch/tmp5.py:8, code: y = x.view(-1)
        view: "f32[4]" = torch.ops.aten.view.default(arg0_1, [-1])

        # File: /data/users/hirsheybar/e/pytorch/tmp5.py:9, code: y.mul_(2)
        mul: "f32[4]" = torch.ops.aten.mul.Tensor(view, 2);  view = None
        view_1: "f32[4]" = torch.ops.aten.view.default(mul, [4]);  mul = None
        copy_: "f32[4]" = torch.ops.aten.copy_.default(arg0_1, view_1);  arg0_1 = view_1 = None
        return ()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117587
Approved by: https://github.com/eellison
2024-01-17 23:07:45 +00:00
26a63907ba Ordering placeholder and get_attr nodes in unflattened module (#116910)
Previous to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr`. As `placeholder` is the input of a function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117409, #116667, #117591, #117500
2024-01-17 23:03:15 +00:00
e457b6fb18 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 23:03:15 +00:00
763ddb396d Revert "[codemod] markDynamoStrictTest batch 18 (#117604)"
This reverts commit 24f288114a696a27771c075b8e8df556c13eced6.

Reverted https://github.com/pytorch/pytorch/pull/117604 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117604#issuecomment-1897082562))
2024-01-17 22:16:27 +00:00
01c0c67937 Revert "[codemod] markDynamoStrictTest batch 19 (#117605)"
This reverts commit 0cda1e0b218895ce6121531991348b8bcbce9b94.

Reverted https://github.com/pytorch/pytorch/pull/117605 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117605#issuecomment-1897065994))
2024-01-17 22:12:59 +00:00
87c2427173 Revert "[codemod] markDynamoStrictTest batch 20 (#117606)"
This reverts commit 308e154af5fd6388f49eabe631e7b78ca3ac9c39.

Reverted https://github.com/pytorch/pytorch/pull/117606 on behalf of https://github.com/zou3519 due to probably a crossed merge? ([comment](https://github.com/pytorch/pytorch/pull/117606#issuecomment-1897042843))
2024-01-17 22:08:20 +00:00
84cfe6d8b2 Drop all gather stats to debug not warning (#117669)
The logger's default level results in these all-gather stats being spammed into every run, which is very annoying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117669
Approved by: https://github.com/Skylion007, https://github.com/awgu
2024-01-17 21:44:59 +00:00
8841d26046 [dynamo] LazyVariable - redirect __str__ to the realized variable __str__ (#117583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117583
Approved by: https://github.com/lezcano, https://github.com/jansel
2024-01-17 21:12:12 +00:00
a7fbbc2a4a [inductor] allow mm template to accumulate with float16 dtype (#117479)
Fixes #108621

replace #108637 and #108982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117479
Approved by: https://github.com/jansel
2024-01-17 21:01:14 +00:00
208e64a9ba Initial implementation of FakeTensor caching (#113873)
Summary: Cache the result of FakeTensor dispatch and skip re-evaluation on cache hits.
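
A conceptual sketch only (not FakeTensorMode's actual cache): results are keyed on the op plus the metadata of the fake inputs, so an identical call skips re-evaluation:

```python
def make_cached_dispatch(dispatch_fn):
    cache = {}
    def cached(op, args):
        key = (op, tuple((tuple(a.shape), a.dtype) for a in args))
        if key not in cache:
            cache[key] = dispatch_fn(op, args)  # only evaluated on a cache miss
        return cache[key]
    return cached
```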

Test Plan: New unit tests. Caching is enabled in this diff, so all existing tests exercise the cache as well.

Differential Revision: [D52841637](https://our.internmc.facebook.com/intern/diff/D52841637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113873
Approved by: https://github.com/eellison
2024-01-17 20:38:54 +00:00
c0940d2e93 [pytree] reuse flatten_fn in flatten_with_keys_fn to ensure consistency (#117656)
Reuse `flatten_fn` in `flatten_with_keys_fn` to ensure `flatten_fn` and `flatten_with_keys_fn` get the same `leaves` and `context`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117656
Approved by: https://github.com/suo
2024-01-17 20:38:49 +00:00
bffc8ecfb0 [codemod] Fix shadows in PyTorch (#117562)
Test Plan: Sandcastle

Differential Revision: D52802592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117562
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-01-17 20:33:50 +00:00
da6abaeeac Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit bb0fd1bd3ca145b77159427bc5bacf5f98ec3896.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
cb0bfcf590 Revert "Ordering placeholder and get_attr nodes in unflattened module (#116910)"
This reverts commit 12561bb5fed08283baf7a31e6678341a04e83adb.

Reverted https://github.com/pytorch/pytorch/pull/116910 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
89cf1ddb5c [AOTInductor] Allow user to explicitly specify Device to run on (#117413)
Summary:
AOTInductor currently infers the CUDA device index via `cudaGetDevice()`. This assumes the outer runtime calls `cudaSetDevice()` somewhere before invoking the AOTInductor run.

This diff adds an explicit argument for specifying the target device, e.g. compiled on "cuda:0", run on "cuda:1".

todo:
- Are the changes in interface.h BC-breaking? They change the function signatures in the .so file. We might just need to introduce a new "Create" function.

Test Plan: CI

Differential Revision:
D52747132

Privacy Context Container: 368960445142440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117413
Approved by: https://github.com/chenyang78, https://github.com/desertfire, https://github.com/khabinov
2024-01-17 19:28:04 +00:00
308e154af5 [codemod] markDynamoStrictTest batch 20 (#117606)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117606
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604, #117605
2024-01-17 19:20:11 +00:00
0cda1e0b21 [codemod] markDynamoStrictTest batch 19 (#117605)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117605
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604
2024-01-17 19:20:11 +00:00
24f288114a [codemod] markDynamoStrictTest batch 18 (#117604)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117604
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219
2024-01-17 19:20:01 +00:00
006d655956 [codemod] markDynamoStrictTest batch 17 (#117219)
[codemod] markDynamoStrictTest test_xnnpack_integration
[codemod] markDynamoStrictTest test_vulkan
[codemod] markDynamoStrictTest test_package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117219
Approved by: https://github.com/bdhirsh
2024-01-17 19:19:50 +00:00
1967165d4d [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_package
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_vmap_registrations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910, #117553
2024-01-17 19:12:41 +00:00
ca0abf8606 Add inductor-specific testing strict mode denylist (#117553)
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117409, #116667, #117591, #117500, #116910
2024-01-17 19:12:41 +00:00
12561bb5fe Ordering placeholder and get_attr nodes in unflattened module (#116910)
Previous to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr`. As `placeholder` is the input of a function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #117409, #116667, #117591, #117500
2024-01-17 19:12:33 +00:00
bb0fd1bd3c [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 19:12:24 +00:00
0c26565d5d Revert "Add pull request target to bc lint (#106065)"
This reverts commit d4136c90882337a0891f5216292e9e3d55c13262.

Reverted https://github.com/pytorch/pytorch/pull/106065 on behalf of https://github.com/izaitsevfb due to Tightening CI security ([comment](https://github.com/pytorch/pytorch/pull/106065#issuecomment-1896439167))
2024-01-17 18:51:46 +00:00
9da01affd3 Revert "[inductor] Faster C++ kernel python bindings (#117500)"
This reverts commit 3a52147cc59b240737602d3d046080bbf6f567f1.

Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
8c7e3a18ff Revert "Ordering placeholder and get_attr nodes in unflattened module (#116910)"
This reverts commit 5e0e78585d9f662ecb957c327c8d3fa31bff4f9a.

Reverted https://github.com/pytorch/pytorch/pull/116910 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
e877c2e6ff Revert "Add inductor-specific testing strict mode denylist (#117553)"
This reverts commit ab6207a34248fdf2d2766d0062f358b63380e151.

Reverted https://github.com/pytorch/pytorch/pull/117553 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
7f3cac06b9 Revert "[codemod] markDynamoStrictTest batch 16 (#117218)"
This reverts commit 46a8408fa123da571dc1c13dba9479ba6d540249.

Reverted https://github.com/pytorch/pytorch/pull/117218 on behalf of https://github.com/PaliC due to breaking internal discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
29fa6fbc4e [Dynamo] Fix a corner case of reinplace_inplaceable_ops pass for triton kernels (#117612)
Summary:
We saw the following failure when compiling custom triton kernels:
```
RuntimeError: Argument 'getitem_22' of Node 'triton_kernel_wrapper_functional_proxy_3' was used before it has been defined! Please check that Nodes in the graph are topologically ordered
```
The root cause is that, when doing the replacement, the replacement is itself replaced by another replacement. The fix keeps following the replacement chain until it reaches a node that is not replaced, as sketched below.
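
A conceptual sketch of the fix (names are illustrative, not the actual pass code):

```python
def resolve_replacement(node, replacements):
    # follow the chain until we reach a node that is not itself replaced
    while node in replacements:
        node = replacements[node]
    return node
```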

Test Plan:

Added a test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117612
Approved by: https://github.com/aakhundov
2024-01-17 18:41:42 +00:00
e94b79f627 Revert "[codemod] markDynamoStrictTest batch 17 (#117219)"
This reverts commit 5bb2298da769121421711504da47955d3129b54f.

Reverted https://github.com/pytorch/pytorch/pull/117219 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
8483f493af Revert "[codemod] markDynamoStrictTest batch 18 (#117604)"
This reverts commit 70b22be32a2e6a1a51cb70a1418d73bfba533cc0.

Reverted https://github.com/pytorch/pytorch/pull/117604 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
0bfd9653ef Revert "[codemod] markDynamoStrictTest batch 19 (#117605)"
This reverts commit 45d7859e751dff2096df8b346226b71cf6031424.

Reverted https://github.com/pytorch/pytorch/pull/117605 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
d51583b214 Revert "[codemod] markDynamoStrictTest batch 20 (#117606)"
This reverts commit ab847a2f5c903c629f4e2ab9bfea11f7edc1cf0e.

Reverted https://github.com/pytorch/pytorch/pull/117606 on behalf of https://github.com/PaliC due to sadly I need to revert these in order to revert https://github.com/pytorch/pytorch/pull/117500 ([comment](https://github.com/pytorch/pytorch/pull/117219#issuecomment-1896407436))
2024-01-17 18:35:56 +00:00
06dab05405 Revert "[export] Error on not pytree-flattened nodes (#117598)"
This reverts commit 35e847830511b2c700586d312177794be094d67e.

Reverted https://github.com/pytorch/pytorch/pull/117598 on behalf of https://github.com/huydhn due to Sorry for reverting you change but it is failing ONNX test in trunk 35e8478305, probably a landrace as the PR signal looks fine ([comment](https://github.com/pytorch/pytorch/pull/117598#issuecomment-1896389009))
2024-01-17 18:29:04 +00:00
d0fc268918 Fixed issue in upsample_nearestnd lowering with scales (#117538)
Fixed #116848

Related to the bug introduced in my previous PR here: https://github.com/pytorch/pytorch/pull/113749/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R3264

Originally, the code was
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales):
        if scale:
            scales[i] = scale
```
which is wrong, as `scales_x` is not used even though it can be provided by the user. The code worked for cases where the user-provided scale value can be recomputed from the input/output sizes, e.g. scale=2.0. However, it would fail if the input scale is a float value such as 2.3: in this case the recomputed scale is slightly different (e.g. 2.292682926829268, depending on the input and output sizes) and can lead to inconsistent output.
This problem was "fixed" as follows in my previous PR: https://github.com/pytorch/pytorch/pull/113749
```python
def upsample_nearestnd(
    x,
    output_size,
    scales_x: Tuple[Optional[float], ...],
    n: int = 2,
    exact: bool = False,
):
   # ...
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = scale
```
however, this stores a wrong scale value, since the user-provided scale should be inverted (1 / scale) to match the recomputed input/output ratio; see the sketch below.
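
A hedged sketch of the corrected handling, continuing the excerpt above (an explicit user scale overrides the recomputed ratio and is stored as its reciprocal):

```python
    scales = [i / o for i, o in zip(i_sizes, o_sizes)]
    for i, scale in enumerate(scales_x):
        if scale:
            scales[i] = 1.0 / scale  # invert the user-provided scale to match i/o
```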

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117538
Approved by: https://github.com/peterbell10
2024-01-17 18:14:35 +00:00
ab847a2f5c [codemod] markDynamoStrictTest batch 20 (#117606)
[codemod] markDynamoStrictTest test_tensorexpr_pybind
[codemod] markDynamoStrictTest test_tensorexpr
[codemod] markDynamoStrictTest test_jit_llga_fuser
[codemod] markDynamoStrictTest test_jit_fuser_te
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117606
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604, #117605
2024-01-17 17:43:27 +00:00
45d7859e75 [codemod] markDynamoStrictTest batch 19 (#117605)
[codemod] markDynamoStrictTest export/test_verifier
[codemod] markDynamoStrictTest export/test_upgrade
[codemod] markDynamoStrictTest export/test_unflatten
[codemod] markDynamoStrictTest export/test_serialize
[codemod] markDynamoStrictTest export/test_serdes
[codemod] markDynamoStrictTest export/test_retraceability
[codemod] markDynamoStrictTest export/test_passes
[codemod] markDynamoStrictTest export/test_pass_infra
[codemod] markDynamoStrictTest export/test_functionalized_assertions
[codemod] markDynamoStrictTest export/test_export_nonstrict
[codemod] markDynamoStrictTest export/test_export
[codemod] markDynamoStrictTest export/test_experimental
[codemod] markDynamoStrictTest export/test_db
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117605
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219, #117604
2024-01-17 17:43:27 +00:00
70b22be32a [codemod] markDynamoStrictTest batch 18 (#117604)
[codemod] markDynamoStrictTest functorch/test_vmap
[codemod] markDynamoStrictTest profiler/test_profiler_tree
[codemod] markDynamoStrictTest profiler/test_profiler
[codemod] markDynamoStrictTest profiler/test_memory_profiler
[codemod] markDynamoStrictTest functorch/test_ops
[codemod] markDynamoStrictTest functorch/test_aotdispatch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117604
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117219
2024-01-17 17:43:17 +00:00
6d1406d177 [oidc] Migrate Triton wheel upload to oidc (#117648)
Fix for triton upload job that is currently failing:
https://github.com/pytorch/pytorch/actions/runs/7555471235/job/20574022304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117648
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/malfet
2024-01-17 17:04:36 +00:00
35e8478305 [export] Error on not pytree-flattened nodes (#117598)
Attempts to make the input/output mismatch error better by first checking whether the inputs/outputs can be pytree-flattened into supported types (tensors, symints, ...). So if the user passes in a data structure which does not have a pytree flatten registration, this will error with the message "It looks like one of the inputs is with type CustomType is not supported or pytree flatten-able.... please register a pytree flatten/unflatten function using the pytree.register_pytree_node API".

The check inside of produce_matching should now only error if something unexpected happens (dynamo accidentally adds an input or removes an output), and should be considered an internal error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117598
Approved by: https://github.com/avikchaudhuri
2024-01-17 16:33:57 +00:00
40a6710ad3 Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching it.

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`

Differential Revision: [D52814561](https://our.internmc.facebook.com/intern/diff/D52814561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2024-01-17 15:32:18 +00:00
5bb2298da7 [codemod] markDynamoStrictTest batch 17 (#117219)
[codemod] markDynamoStrictTest test_xnnpack_integration
[codemod] markDynamoStrictTest test_vulkan
[codemod] markDynamoStrictTest test_package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117219
Approved by: https://github.com/bdhirsh
2024-01-17 14:41:07 +00:00
3bb8d2b905 Update triton ROCm version to 6.0 (#117433)
Related to PyTorch nightly wheels upgrade to ROCm6.0: https://github.com/pytorch/pytorch/pull/116983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117433
Approved by: https://github.com/malfet, https://github.com/jeffdaily
2024-01-17 12:09:45 +00:00
e2830e6328 [PyTorch] SDPA decomp: actually use attn_mask (#117579)
Summary: Need to pass this along

Test Plan:
```
cd ~/fbsource/fbcode/executorch/backends/xnnpack/test
buck test fbcode//mode/dev-nosan :test_xnnpack_ops -- test_fp32_sdpa
buck run fbcode//mode/dev-nosan :test_xnnpack_models -- executorch.backends.xnnpack.test.models.llama2_et_example.TestLlama2ETExample.test_fp32
```

Reviewed By: larryliu0820

Differential Revision: D52812369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117579
Approved by: https://github.com/larryliu0820
2024-01-17 10:26:43 +00:00
1deb75b584 [c10d] Move the timeout dump check from watchdog to monitoring thread (#117168)
To avoid a potential hang in the watchdog thread, which would prevent us from dumping timeout debugging info, we move the check for global collective timeout signals and the dumping of debugging info to the monitoring thread. We also need to ensure that we don't wait too long between checks of the timeout signal from the store; otherwise, we will miss the signal and the debugging info won't get dumped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117168
Approved by: https://github.com/wconstab
2024-01-17 08:05:40 +00:00
ed6006ee5d [Reland][ONNX] Guard xfail tests with error messages (#117592)
Reland #117425

Prior to this PR, xfail tests provided neither (1) a guarantee of the error message/reason (it could be outdated), nor (2) execution of the test (xfail_if_model_type_is_not_exportedprogram). Therefore, tests labeled xfail were less robust, as we couldn't be sure whether they were still failing for the same reason, or even still failing at all. This PR fixes the issue with a try/except and error-message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117592
Approved by: https://github.com/BowenBao
2024-01-17 08:05:35 +00:00
suo
9448065061 [pytree] add key path api (#116786)
This PR introduces a key path API to pytrees, drawing direct inspiration from JAX's [key path API](https://jax.readthedocs.io/en/latest/jax-101/05.1-pytrees.html#key-paths).

I added the 3 APIs described there, and a registry of `flatten_with_keys` fns for each node type, which is a version of `flatten` that also returns `KeyEntry`s describing how to access values from the original pytree.

Current use cases for this API:
- Folks would like to do argument traversal over input pytrees to do verification and compatibility enforcement. Keypaths are useful for this—https://fburl.com/code/06p7zrvr is a handrolled pass doing basically the same thing but probably more fragilely.
- In export non-strict mode, we need to figure out a way to track sources for pytree inputs. In strict mode, dynamo handles this for us, but we'd like a decoupled component to handle this when we're not using dynamo.

I'm sure there are places it would be useful.

Some design notes:
- I only implemented the API for  the Python pytree impl. optree has some differences in how their keypath APIs are designed (see https://github.com/pytorch/pytorch/issues/113378 for discussion). I have some issues with the proposed typed_path solution in that discussion and prefer JAX's API, but we can hash that out separately.
- The way folks register a `flatten_with_keys` fn is through a new kwarg to `register_pytree_node`. This follows how we do serialization fns, although the list of additional arguments is getting unwieldy.
- My impl handles pytrees with an undefined `flatten_with_keys` fn differently from JAX: I raise an error, while JAX creates a fallback keyentry.
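
A hedged usage sketch, assuming the three APIs mirror JAX's names (`tree_flatten_with_path`, `keystr`); treat the exact names as assumptions:

```python
import torch.utils._pytree as pytree

tree = {"weights": [1.0, 2.0], "bias": 3.0}
flat, spec = pytree.tree_flatten_with_path(tree)
for keypath, leaf in flat:
    print(pytree.keystr(keypath), leaf)  # key path string plus the leaf value
```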

Differential Revision: [D52547850](https://our.internmc.facebook.com/intern/diff/D52547850/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116786
Approved by: https://github.com/voznesenskym
2024-01-17 07:24:35 +00:00
5667a990fd Chore: improve log message about cache size limit exceeded (#116557)
Fixes #114527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116557
Approved by: https://github.com/ezyang
2024-01-17 06:07:18 +00:00
3cd2c68fbe Fix syntax highlighting in android (#117439)
Hi, I have found that code blocks are not highlighted properly.

This PR aims to fix that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117439
Approved by: https://github.com/ezyang
2024-01-17 05:17:13 +00:00
735715e6d3 [Dynamo] Make "profiler function will be ignored" warn only once (#117585)
Fix #111632

#111622 accidentally reverted #111921; we should bring it back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117585
Approved by: https://github.com/williamwen42, https://github.com/mlazos, https://github.com/msaroufim
2024-01-17 04:05:45 +00:00
2c5488d719 Match all_gather_into_tensor args names in remapping (#117224)
Fixes #114179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117224
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2024-01-17 03:50:29 +00:00
8f1bc876b2 [quant] Support custom qmin/qmax for activation and weight for xnnpack quantizer (#117305)
Summary:
As titled, this allows us to experiment with 4-bit quant in XNNPACK.

Test Plan:
python test/test_quantization.py -k test_dynamic_linear_int4_weight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117305
Approved by: https://github.com/digantdesai
2024-01-17 03:22:49 +00:00
e4c2dfb35b [Dynamo, ONNX] Run llama attention with onnxrt and dynamic shapes (#117009)
As title. This PR enables dynamic shapes for running llama with ORT. Both forward and backward are captured as a single graph with this PR.

Summary of changes:
- Test llama attention, llama decoder, llama model to ensure (1) no graph breaks (2) models exported with dynamic shapes with onnxrt dynamo backend
- Reshape SymInt to tensor with shape (1,) to align with the cast done for int in fx_onnx_interpreter.py
- Create an util function to map Python types (e.g., float) to ONNX tensor element type (e.g., onnx.TensorProto.FLOAT).
- Return `hint` for torch.Sym* in type promotion pass.
- Remove _replace_to_copy_with_to since exporter supports aten::_to_copy it now.
- Modify _get_onnx_devices to return CPU device for torch.Sym*.
- Introduce _adjust_scalar_from_fx_to_onnx (e.g., change 0 to tensor(0)) and _adjust_scalar_from_onnx_to_fx (e.g., change tensor(0) to 0) for adjusting scalars when passing values to and receiving values from ORT.
- Now, ValueInfoProto of graph inputs (i.e., input_value_infos) are stored and used as `ORT-expected type` when calling `_adjust_scalar_from_fx_to_onnx`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117009
Approved by: https://github.com/titaiwangms
2024-01-17 03:02:41 +00:00
fb06ed36d1 Change dynamo_test_failures.py to silently run skipped tests (#117401)
- We silently run skipped tests and then raise a skip message with the
  error message (if any)
- Instead of raising expectedFailure, we raise a skip message with the
  error message (if any)

We log the skip messages in CI, so this will let us read the logs and do
some basic triaging of the failure messages.

Test Plan:
- existing tests. I hope that there are no tests that cause each other
  to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117401
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391, #117400
2024-01-17 02:48:19 +00:00
9056c7d941 use getPinnedMemoryAllocator for privateuseone (#117530)
Fixes #117482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117530
Approved by: https://github.com/ezyang
2024-01-17 02:33:02 +00:00
8852bb561c More efficient multi-threading in Softmax & LogSoftmax CPU kernels (#116367)
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR ASAP seemed essential, so I agreed to roll back that change.

In some cases, more threads can be used than the current approach allows.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>
On second thought, even for other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use a `grain_size` of 0 or 1 instead of complicating the code, because `CHUNK_SIZE` for each thread is already computed per a heuristic. With a `grain_size` of `0`, the work would be distributed equitably among the OpenMP threads (which stay constant in number unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch), yielding a similar speedup to the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.

### Benchmarks

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.

Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`

#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03%  |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|

#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio  (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-17 02:26:29 +00:00
4a54ab328c Removed an internal assertion for the optional stable value and instead defaulted to the standard (=false) (#117414)

Fixes #117255.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117414
Approved by: https://github.com/ezyang
2024-01-17 02:25:21 +00:00
1872834247 [MPS] Fix torch.mm correctness for large matrices (#117549)
Currently `matrixMultiplicationWithPrimaryTensor:secondaryTensor:` returns incorrect results if one of the matrix dimensions is greater than 32K
Solve it by providing a very naive matrix multiplication Metal shader and calling it if the stride size is greater than 32768 elements. Slicing inside the MPSGraph doesn't work either, since `-sliceTensor:starts:ends:strides:` somehow affects matmul as well when tiling is done as follows:
```objc
  NSMutableArray<MPSGraphTensor*>* rows = [NSMutableArray new];
  for (int64_t i = 0; i < M; i += tile_size) {
    const auto i_end = std::min(i + tile_size, M);
    NSMutableArray<MPSGraphTensor*>* row_chunks = [NSMutableArray new];
    for (int64_t j = 0; j < K; j += tile_size) {
      const auto j_end = std::min(j + tile_size, K);
      MPSGraphTensor* tile = nil;
      for (int64_t k = 0; k < N; k += tile_size) {
        const auto k_end = std::min(k + tile_size, N);
        auto selfChunk = [graph sliceTensor:selfTensor
                                     starts:@[ @(i), @(k) ]
                                       ends:@[ @(i_end), @(k_end) ]
                                    strides:@[ @(1), @(1) ]
                                       name:nil];
        auto otherChunk = [graph sliceTensor:otherTensor
                                      starts:@[ @(k), @(j) ]
                                        ends:@[ @(k_end), @(j_end) ]
                                     strides:@[ @(1), @(1) ]
                                        name:nil];
        auto chunkMM = [graph matrixMultiplicationWithPrimaryTensor:selfChunk secondaryTensor:otherChunk name:nil];

        tile = tile ? [graph additionWithPrimaryTensor:tile secondaryTensor:chunkMM name:nil] : chunkMM;
      }
      [row_chunks addObject:tile];
    }
    auto row = row_chunks.count > 1 ? [graph concatTensors:row_chunks dimension:1 name:nil] : row_chunks.firstObject;
    [rows addObject:row];
  }
  return rows.count > 1 ? [graph concatTensors:rows dimension:0 name:nil] : rows.firstObject;
```

One can always use the Metal MM by defining the `PYTORCH_MPS_PREFER_METAL` environment variable
Fixes https://github.com/pytorch/pytorch/issues/116769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117549
Approved by: https://github.com/kulinseth
2024-01-17 01:33:08 +00:00
f518cf811d [DCP] Adds support for meta tensor loading for DCP.load_state_dict() (#113319)
Currently, DCP requires the `model.state_dict()` to be materialized before passing it to DCP to load, since DCP uses the pre-allocated storage from the initialized model state_dict. Therefore, even for fine-tuning and distributed inference, users would need to explicitly materialize the model on GPU before `DCP.load_state_dict()`.

Today's flow:
```
with torch.device("meta"):
    model2 = parallelize_module(
        MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
    )

model2.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
    state_dict=state_dict_to_load,
    storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load)
```

This PR adds support for meta tensor loading. In DCP's planner, when encountering tensors/DTensor on meta device, we initialize tensor/DTensor on the current device on the fly and replace the tensor/DTensor on meta device in the state_dict.  After the change, users no longer needs to manually call `model.to_empty()` when loading existing checkpoints for fine-tuning and distributed inference.

Updated user flow:
```
with torch.device("meta"):
    model2 = parallelize_module(
        MLPModule("meta"), tp_mesh, parallelize_plan=parallelize_plan
    )
# no longer need to call model.to_empty(device='cuda')
state_dict_to_load = model2.state_dict()
DCP.load_state_dict(
    state_dict=state_dict_to_load,
    storage_reader=DCP.FileSystemReader(CHECKPOINT_DIR),
)
model2.load_state_dict(state_dict_to_load, assign=True)
```

Note that for distributed training, it's still the users' responsibility to reset the parameters (`model.reset_parameters()`) as checkpoint might not exist.

Note that we need to loop thru the state_dict to replace meta tensor/DTensor instead of calling `model.to_empty()` since `DCP.load()` only takes in state_dict but not model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113319
Approved by: https://github.com/fegin, https://github.com/LucasLLC
2024-01-17 00:23:29 +00:00
4a44a3c76d update kineto submodule (#114297)
Rework roctracer shutdown flushing

9365c1aa09

This fixes flaky unit tests that use kineto to verify certain kernels have executed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114297
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-17 00:17:03 +00:00
cf470e7b59 Migrate update-commit-hash to test-infra (#117506)
After https://github.com/pytorch/test-infra/pull/4885, the GHA is now reusable on `test-infra`.  This tests the change and we can also land it after https://github.com/pytorch/test-infra/pull/4885 lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117506
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-17 00:15:04 +00:00
1d14adfa66 [mta] Fused SGD (#116585)
depends on #116583

rel:
- #94791

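A hypothetical usage sketch, assuming this lands a `fused=True` flag on `torch.optim.SGD` (mirroring the existing fused Adam/AdamW optimizers); requires a CUDA build:
```python
import torch

model = torch.nn.Linear(8, 8, device="cuda")
# `fused=True` is the assumed new flag: parameter updates run in fused multi-tensor kernels.
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, fused=True)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()
opt.zero_grad()
```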
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116585
Approved by: https://github.com/janeyx99
2024-01-16 23:54:38 +00:00
5aac95c713 Introduce slice_inverse() op (#117041)
Introduces a new op `slice_inverse()`. This is used in the reverse view_func for slice and several other ops (e.g. `split_with_sizes`, `chunk`). It's implemented behind the scenes by a call to `as_strided()`, but it's easier for subclasses to implement the more limited `slice_inverse()` than the full `as_strided()`. This PR:
* Introduces the op itself
* Updates all relevant functional inverses to call `slice_inverse()` instead of `as_strided()` directly
* Makes codegen changes to allow `slice_scatter()` to be the copy variant for `slice_inverse()`
    * Need to avoid view_copy codegen (assumes if view name ends in inverse, we don't need to gen one, which is possibly a bad assumption)

@albanD / @soulitzer / @bdhirsh: I'm most interested in your thoughts on the codegen changes and whether this is the right way to go.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117041
Approved by: https://github.com/bdhirsh
2024-01-16 23:44:54 +00:00
f6767244cf Added meta function for _upsample_bicubic2d_aa (#117347)
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
    return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E   torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E   aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E   from user code:
E      File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E       image = interpolate(
E
E   Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E   You can suppress this exception and fall back to eager by setting:
E       import torch._dynamo
E       torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
2024-01-16 23:33:55 +00:00
b1c3f9f1b9 Fix missing mkl-dnn include paths (#117492)
Fixes #91968 and #100960
This commit fixes missing  include paths by linking `caffe2_pybind11_state_gpu` against `caffe2::mkldnn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117492
Approved by: https://github.com/ezyang
2024-01-16 23:28:17 +00:00
46a8408fa1 [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_package
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_vmap_registrations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
ghstack dependencies: #117553
2024-01-16 23:04:31 +00:00
ab6207a342 Add inductor-specific testing strict mode denylist (#117553)
We have one for Dynamo that currently applies to all "compile"
configurations (PYTORCH_TEST_WITH_DYNAMO, PYTORCH_TEST_WITH_INDUCTOR). I
don't want to figure out the inductor situation right now, so we're
going to add another denylist for inductor and work through it later.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117553
Approved by: https://github.com/voznesenskym
2024-01-16 23:04:31 +00:00
5e0e78585d Ordering placeholder and get_attr nodes in unflattened module (#116910)
Previous to this PR, the generated unflattened module could mix up the order of `placeholder` and newly created `get_attr`. As `placeholder` is the input of a function, it should be placed ahead of `get_attr` nodes.

Before:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
get_attr       bias         bias                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
placeholder    l_x_         l_x_                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

After:
```bash
test/export/test_unflatten.py::TestUnflatten::test_placeholder_and_get_attr_ordering_after_unflattened opcode         name         target                    args                                                            kwargs
-------------  -----------  ------------------------  --------------------------------------------------------------  --------
placeholder    l_x_         l_x_                      ()                                                              {}
get_attr       weight       weight                    ()                                                              {}
get_attr       bias         bias                      ()                                                              {}
call_function  convolution  aten.convolution.default  (l_x_, weight, bias, [2, 2], [0, 0], [1, 1], False, [0, 0], 1)  {}
output         output       output                    (convolution,)                                                  {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116910
Approved by: https://github.com/tugsbayasgalan
2024-01-16 22:58:37 +00:00
4ec667cc64 Revert "[ONNX] Guard xfail tests with error messages (#117425)"
This reverts commit 1993956da33376f34125306209930ed00c486abd.

Reverted https://github.com/pytorch/pytorch/pull/117425 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing in trunk 1993956da3 ([comment](https://github.com/pytorch/pytorch/pull/117425#issuecomment-1894650769))
2024-01-16 22:56:35 +00:00
3a52147cc5 [inductor] Faster C++ kernel python bindings (#117500)
Calling C++ from Python via ctypes is notoriously slow.  This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)

async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings:        ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings:        0.000013
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-16 22:30:04 +00:00
2a3fb7dbb6 [ROCm] Fix NHWC related tests in test_inductor_freezing (#117158)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117158
Approved by: https://github.com/eellison, https://github.com/pruthvistony
2024-01-16 20:48:49 +00:00
4712c7dac8 [inductor] add C-shim for index_put (#116667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116667
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-01-16 20:29:14 +00:00
3e8c8ce37b Update Reviewers for PT-D team (#117409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117409
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/fduwjj
2024-01-16 19:40:41 +00:00
1993956da3 [ONNX] Guard xfail tests with error messages (#117425)
Previous to this PR, xfail tests didn't provide (1) guarantee of error message/reason (**could be outdated**), and (2) execution of the test (`xfail_if_model_type_is_not_exportedprogram`). Therefore, the tests are less robust with xfail labeled, as we can't be sure if it's still failing with the same reason, and if it's even still failing. This PR fixes the issue with try/except with error message matching to consolidate the xfail truth and reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117425
Approved by: https://github.com/thiagocrepaldi
2024-01-16 19:33:51 +00:00
28be47c267 [RELAND][export] Exempt autograd ops for predispatch export (#117448)
Summary: Reland of https://github.com/pytorch/pytorch/pull/116527/files

Test Plan: CI

Differential Revision: D52675324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117448
Approved by: https://github.com/ydwu4
2024-01-16 19:32:15 +00:00
99e54744f7 Fix ExecuTorch pinned commit update failure (#117518)
https://github.com/pytorch/pytorch/pull/117003 shows an interesting failure in which building the ExecuTorch runner fails because it needs the change from https://github.com/pytorch/pytorch/pull/117378.  This reveals a chicken-and-egg bug in the job setup where building the ExecuTorch runner depends on PyTorch and thus couldn't be part of the Docker image build where PyTorch is not yet available.  The failure happens because an outdated version of PyTorch is there on the Docker image.

So, like vision and audio, the step to build ExecuTorch runner needs to be done during test time.

I also fix the installation of vision and audio in ET job because they are now installed using PyTorch pinned commits as usual after https://github.com/pytorch/executorch/pull/1247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117518
Approved by: https://github.com/larryliu0820, https://github.com/malfet
2024-01-16 18:25:15 +00:00
c30346db0e Check in some torch.compile helper scripts (#117400)
- passrate.py: compute the pass rate
- update_failures.py: update `dynamo_test_failures.py`

Both of these scripts require you to download the test results from CI
locally. Maybe we can automate this more in the future. Checking these
in for now, with no tests :P.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117400
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117391
2024-01-16 17:14:43 +00:00
a7a2773567 Check invariants for dynamo_test_failures.py (#117391)
Test that:
- the xfail list and the skip list don't intersect
- the test names look sane
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117391
Approved by: https://github.com/voznesenskym
2024-01-16 17:14:43 +00:00
29516bd2a0 add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
Step1 of https://github.com/pytorch/pytorch/issues/111559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-16 15:25:08 +00:00
0fa6ee44d9 [CI] Skip lib for xpu binary unit test (#117514)
Skip .so and .a libraries under build/bin/ for test_xpu_bin in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117514
Approved by: https://github.com/malfet
2024-01-16 12:07:15 +00:00
13473df0d7 [MPS] Make addmm support empty matmul (#117223)
Refactor common part between `mm_out_mps` and `addmm_out_mps` into `do_mm` static function.
Change the input placeholder initialization logic so that `addmm` can handle matrix multiplication with an empty dimension.
Add tests for `mm`+`addmm` with empty tensors to OpInfo but skip addmm with empty matrices from onnx tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117223
Approved by: https://github.com/albanD
2024-01-16 06:46:20 +00:00
28bb31e4a5 [Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358) (#116897)
For training graphs (when inputs require grad), previously we would speculate the forward and backward graph to determine if there are any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call_function node on the graph and later rely on autograd's tracing.

This approach does not work for more generalized graphs, like graphs that include user-defined triton kernels, because autograd is not able to do the higher-order function conversion.

This PR speculates the forward and backward functions and emits them in a HOF that later gets used via templating mechanism.

While working on this PR, I have exposed some bugs in the current tracing due to trampoline functions losing the source information resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116897
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/voznesenskym
2024-01-16 03:57:13 +00:00
f20eaadfef [vision hash update] update the pinned vision hash (#117509)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117509
Approved by: https://github.com/pytorchbot
2024-01-16 03:17:24 +00:00
ae3d7091cb [BE] Replace deprecated set_default_tensor_type (#117505)
Not sure what it was doing there, but replaced it with `set_default_dtype`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117505
Approved by: https://github.com/Skylion007
2024-01-16 02:32:49 +00:00
dd2cff1591 [Dynamo] Use isinstance rather than istype when check if python module type (#117022)
This is to fix an issue from a Meta internal use case, where a third-party ```DictConfig``` has a bug in [```__eq__```](fd730509ef/omegaconf/dictconfig.py (L596)) that triggers a Dynamo error because we use an ```obj in [x, y]``` check. Then I found we can use ```isinstance``` to cover all cases and remove these special cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117022
Approved by: https://github.com/ckluk2, https://github.com/jansel
2024-01-15 23:25:30 +00:00
bac0878780 Error if compiled nondeterministic backward called in deterministic mode (#114780)
Part of #113707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114780
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-01-15 22:45:40 +00:00
c1ab2777c0 Update state_dict.py to propagate cpu offload (#117453)
Update state_dict.py to propagate cpu offload. It looks like this flag is accidentally ignored?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117453
Approved by: https://github.com/Skylion007
2024-01-15 22:13:37 +00:00
1a57c18760 Fixed cuda grads for interpolate::trilinear on non-contig grad output (#117373)
Fixes #113642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117373
Approved by: https://github.com/lezcano
2024-01-15 18:05:47 +00:00
001585f446 [fx][inductor] Add statically_known_true utility for SymBool (#117359)
This adds a function `statically_known_true` for `SymBool` that works
like inductor's `is_expr_static_and_true`. That is, it tries to simplify the
expression to a constant or returns `False` if it cannot be simplified.

This is useful in cases that can be optimized if the condition is met; otherwise it doesn't affect correctness, so we can avoid adding guards.

I also use this new function in inductor for `FakeTensorUpdater` and
`remove_noop_pass` which both generated unexpected guards previously.

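A minimal sketch of the intended use, assuming the helper is exposed from `torch.fx.experimental.symbolic_shapes`:
```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def can_skip_copy(src, dst):
    # bool(src.shape[0] == dst.shape[0]) on symbolic sizes would install a guard;
    # statically_known_true returns True only when the expression can be proven
    # without guarding, and simply returns False otherwise (it never guards).
    return statically_known_true(src.shape[0] == dst.shape[0])
```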
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359
Approved by: https://github.com/lezcano
2024-01-15 18:01:10 +00:00
661747c727 XPU, move oidc to top level workflow and use gha_workflow_s3_and_ecr_read_only policy (#117498)
1. oidc permissions need to be set on top level workflow
2. rename gha_workflow_s3_and_ecr_read_only to gha_workflow_s3_and_ecr_read_only policy which better reflects the policy usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117498
Approved by: https://github.com/chuanqi129, https://github.com/huydhn
2024-01-15 17:46:20 +00:00
7a8013fbfa [inductor] Handle more edge cases in slice and slice_scatter (#117377)
Fixes #117110

When slicing we can end up with start and end which are out of bounds, which is
handled in python slicing by clamping to the correct bounds. There is also the
case where end < start which should result in an empty slice.

In the isoneutral_mixing failure we have the second case, with `start=2, end=0`
which in `slice_scatter` became `src_size[dim] = -2`.

This PR improves slice's edge case handling and factors the start and end
normalization code out so it can be shared with slice_scatter.

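An illustrative sketch (not the inductor code itself) of the Python slicing semantics the shared normalization has to reproduce:
```python
def normalize_slice(start, end, dim_size):
    # Negative indices count from the end; out-of-range values clamp to the
    # valid range; end < start yields an empty (length-0) slice.
    if start < 0:
        start += dim_size
    if end < 0:
        end += dim_size
    start = min(max(start, 0), dim_size)
    end = min(max(end, 0), dim_size)
    return start, max(end, start)

assert normalize_slice(2, 0, 4) == (2, 2)     # the isoneutral_mixing case: empty slice
assert normalize_slice(-1, 100, 4) == (3, 4)  # both ends clamped into bounds
```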
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117377
Approved by: https://github.com/lezcano
2024-01-15 17:05:48 +00:00
5c700f60a5 Properly preserve SymInt input invariant when splitting graphs (#117406)
Fixes https://github.com/pytorch/pytorch/issues/111636
Fixes https://github.com/pytorch/pytorch/issues/108877
Fixes https://github.com/pytorch/pytorch/issues/116956

Inductor has an invariant that every dynamic shape symbol s0, s1, etc. which is referenced by an input tensor must also be passed in explicitly as an argument. It has some capability of reverse engineering symbols if it's obvious how to get them (e.g., if you pass in `arg: f32[s0, 4]` it will know that it can retrieve `s0 = arg.size(0)`) but in full generality it is not always possible to derive this (e.g., if the only mention of s0 is in `arg2: f32[s0 + s1, 4]`).  However, the graph splitter used by optimize_ddp did not respect this invariant. This PR makes it respect it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117406
Approved by: https://github.com/wconstab
2024-01-15 15:04:57 +00:00
75818adcf7 Pyi doc inclusion + fix (#117267)
Reland of https://github.com/pytorch/pytorch/pull/114705 with extra fix to smoothly handle when the modules we're trying to load are not available (and thus the pyi won't contain the docs in this case).

Tested locally that it works properly in fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117267
Approved by: https://github.com/ezyang
2024-01-15 13:06:53 +00:00
7a851fedc8 support torch.mm with conjugate transposed inputs (#117238)
Fix https://github.com/pytorch/pytorch/issues/116855.

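A minimal sketch of the now-supported pattern (illustrative shapes, not the test from the fix):
```python
import torch

a = torch.randn(3, 4, dtype=torch.complex64)
b = torch.randn(3, 5, dtype=torch.complex64)

out = torch.mm(a.conj().T, b)                      # conjugate-transposed view input
expected = torch.mm(a.conj().resolve_conj().T, b)  # materialized reference
torch.testing.assert_close(out, expected)
```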
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117238
Approved by: https://github.com/lezcano
2024-01-15 12:36:01 +00:00
41ffea2f99 Properly unwrap_storage tensors sent to DynamicScalar (#117444)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117444
Approved by: https://github.com/Skylion007
2024-01-15 12:15:04 +00:00
d9b265adaf modify the conditions as PythonModuleVariable (#116856)
## Motivation
The current code `value in [torch.backends.cudnn, torch.ops]` requires `value` to implement `__eq__`. If the value is a custom object that does not implement `__eq__`, dynamo throws an error. For example, for ConvolutionOpContext, the custom 'torch._C.ScriptClass' object registered in IPEX, dynamo throws the following error:

**torch._dynamo.exc.InternalTorchDynamoError: '__eq__' is not implemented for __torch__.torch.classes.ipex_prepack.ConvolutionOpContext**

I think this is a common issue. To avoid it, this PR replaces the current code `value in [torch.backends.cudnn, torch.ops]` with `isinstance(value, (torch.backends.cudnn.CudnnModule, torch._ops._Ops))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116856
Approved by: https://github.com/jansel
2024-01-15 11:10:57 +00:00
d089bb1b72 [xla hash update] update the pinned xla hash (#117485)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117485
Approved by: https://github.com/pytorchbot
2024-01-15 10:33:18 +00:00
2b56d80460 [inductor][cpp] apply simplify_index_in_vec_range to vector store and vector transpose (#117263)
As the title, this PR extends the `simplify_index_in_vec_range` to store and transpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117263
Approved by: https://github.com/jansel
ghstack dependencies: #117221, #117260
2024-01-15 08:41:28 +00:00
3b00dd5843 [inductor][cpp] apply simplify_index_in_vec_range in select_tiling_indices to enable more contiguous vec load (#117260)
For the one of the kernels in the UT `test_vec_contiguous_ModularIndexing`:
Before:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<float, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (256L*x1_inner) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                                }
                                return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0_vec.mean.store(out_ptr0 + static_cast<long>(x1 + (28L*x0)));
                        tmp_acc0_vec.m2.store(out_ptr1 + static_cast<long>(x1 + (28L*x0)));
                    }
                }
                #pragma omp simd simdlen(8)
                for(long x1=static_cast<long>(16L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(    welford:Welford<float>:    omp_out = welford_combine(omp_out, omp_in))     initializer(omp_priv={Welford<float>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L)))];
                            tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                        }
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.mean;
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = tmp_acc0.m2;
                    }
                }
```

After:
```c++
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(28L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(28L); x1+=static_cast<long>(1L))
                {
                    {
                        #pragma omp declare reduction(welford:Welford<float>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<float>()})
                        #pragma omp declare reduction(welford:Welford<at::vec::Vectorized<float>>:omp_out = welford_combine(omp_out, omp_in)) initializer(omp_priv={Welford<at::vec::Vectorized<float>>()})
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(512L); x2+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>((128L*(c10::div_floor_integer(x2, 256L))) + (256L*x1) + (7168L*(static_cast<long>(c10::div_floor_integer(x2, 128L)) % static_cast<long>(2L))) + (14336L*x0) + (static_cast<long>(x2) % static_cast<long>(128L))));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (28L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
```

This PR also further speeds up the model `swin_base_patch4_window7_224` from 1.25x to 1.28x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117260
Approved by: https://github.com/jansel
ghstack dependencies: #117221
2024-01-15 06:57:25 +00:00
3a0bcd2c12 [audio hash update] update the pinned audio hash (#117423)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117423
Approved by: https://github.com/pytorchbot
2024-01-15 05:50:51 +00:00
19502ff6aa Fixed typo in build_activation_images.py (#117458)
In line 24 of build_activation_images.py, I changed "programmaticly" to "programmatically" to be grammatically correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117458
Approved by: https://github.com/malfet
2024-01-15 03:27:40 +00:00
03c6f79548 [vision hash update] update the pinned vision hash (#117311)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117311
Approved by: https://github.com/pytorchbot
2024-01-15 03:15:20 +00:00
2200118f59 Enable some uint{16,32,64} tests that are working (#116809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809
Approved by: https://github.com/albanD
2024-01-15 02:25:21 +00:00
a298fba146 [MPS] Increase metal language support to 2.3 (#117472)
As Conda binaries are still built on MacOS 12, which renders MPS unusable after https://github.com/pytorch/pytorch/pull/116942

Test plan:
```
 % xcrun -sdk macosx metal --std=macos-metal2.3 -Wall -o Index Index.metal
 % xcrun -sdk macosx metal --std=macos-metal2.2 -Wall -o Index Index.metal
Index.metal:167:1: error: type 'const constant ulong3 *' is not valid for attribute 'buffer'
REGISTER_INDEX_OP_ALL_DTYPES(select);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Index.metal:159:5: note: expanded from macro 'REGISTER_INDEX_OP_ALL_DTYPES'
    REGISTER_INDEX_OP(8bit,  idx64, char,  INDEX_OP_TYPE, ulong3);    \
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
```

Fixes https://github.com/pytorch/pytorch/issues/117465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117472
Approved by: https://github.com/xuzhao9
2024-01-15 01:16:52 +00:00
61a181e83c Report function name in stack trace annotations (#117459)
When working with internal flows, it can sometimes be ambiguous what
version of the code they are working with.  In this case, having the
function name available in the stack trace can help identify what you
are looking at.

Example now looks like:

```
[DEBUG]         # File: /data/users/ezyang/a/pytorch/a.py:5 in f, code: return x + x
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117459
Approved by: https://github.com/Skylion007
2024-01-15 00:29:13 +00:00
a6d33614d6 add float8 types to dtypes table (#117375)
Summary:

As titled

Test Plan:

CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117375
Approved by: https://github.com/ezyang
2024-01-15 00:23:07 +00:00
c3e2b94827 Realize non-ReinterpretView Views in custom Triton kernel args (#117468)
Summary: If any of the `TensorBox` arguments of a custom (user-written) Triton kernel in the graph is wrapped into a `BaseView` subclass which is not `ReinterpretView`, this currently conflicts with the cloning (which preserves RVs) and downstream processing (which needs a layout to mark mutation) of the input.

This PR adds conversion of the non-RV views to `ReinterpretView`s by realizing the corresponding inputs to the Triton kernel. As realization happens anyway before the Triton kernel call, this should not affect the perf. But it covers currently missed patterns in the internal models (see the unit test for a repro).

Test Plan:

```
$ python test/dynamo/test_triton_kernels.py -k test_triton_kernel_slice_and_view_input
...
----------------------------------------------------------------------
Ran 1 test in 3.909s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117468
Approved by: https://github.com/oulgen
2024-01-14 23:31:38 +00:00
62496ffd0d [dynamo][easy]: Add support for operator.truth (#117463)
* This is an old builtin function equivalent to the bool constructor. It is easy enough to add support for (see the sketch below).
* I also realized the tests were in the wrong class (the one reserved for testing default args) so I moved them.

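A small sketch of the kind of pattern this enables (illustrative only):
```python
import operator
import torch

def f(x, flag):
    # operator.truth(flag) is just the functional spelling of bool(flag)
    return x + 1 if operator.truth(flag) else x - 1

compiled = torch.compile(f)
print(compiled(torch.zeros(2), []))   # empty list is falsy  -> x - 1
print(compiled(torch.zeros(2), [1]))  # non-empty is truthy  -> x + 1
```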
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117463
Approved by: https://github.com/jansel
2024-01-14 19:08:31 +00:00
2748f05056 Add torch.fx.interpreter to uninteresting_files (#117460)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117460
Approved by: https://github.com/Skylion007
2024-01-14 18:35:21 +00:00
a1155883d4 Clean up Docker config on ROCm runner (#117432)
This fixes the issue on trunk where logging in to ECR on the ROCm runner fails.  During my test, it's also ok for the login part to fail with that `not implemented` error https://github.com/pytorch/pytorch/actions/runs/7516446579/job/20461801473, and pulling the image from ECR still works, so I set `continue-on-error: true` on the step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117432
Approved by: https://github.com/malfet
2024-01-14 18:27:09 +00:00
a76610e6fb [BE] Delete unused is_dynamo_compiling (#117455)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117455
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
ghstack dependencies: #117451, #117452, #117454
2024-01-14 15:15:29 +00:00
347255809c Make c10::SymInt typecaster support scalar-like fake tensor (#117454)
We can use `__index__` to do this conversion because that will trigger a
guard on data dependent SymInt if the tensor is a fake tensor, but if
we fetch item directly and put it in the Scalar, we may still be able to
make it work out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117454
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451, #117452
2024-01-14 15:15:29 +00:00
796fe40a96 [BE] Delete unnecessary variable fastpath (#117452)
This fastpath is unnecessary because in the logic below we
do the same thing:

```
        auto& var = THPVariable_Unpack(obj);
        if (var.numel() != 1 ||
            !at::isIntegralType(
                var.dtype().toScalarType(), /*include_bool*/ true)) {
          throw_intlist_exception(this, i, obj, idx);
        }
        auto scalar = var.item();
        TORCH_CHECK(scalar.isIntegral(/*include bool*/ false));
        res.push_back(scalar.toSymInt())
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117452
Approved by: https://github.com/yanboliang
ghstack dependencies: #117451
2024-01-14 14:39:46 +00:00
220cf46c2a Always accept 0-d scalar tensors as int, even if __index__ fails (#117451)
Fixes https://github.com/pytorch/pytorch/issues/117288

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117451
Approved by: https://github.com/yanboliang
2024-01-14 14:39:46 +00:00
38c18f3825 [c10d] Add a timeout check interval variable for timeout dump (#117093)
The current timeout check frequency relies on the monitoring thread's timeout, which can be too long (even if we set it to 2 mins), so let's use a separate timeout variable that users can configure. And we only let the default PG check TCPStore, so even more frequent checks should be fine. (Our stress test is performed every half second.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117093
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-01-14 02:33:17 +00:00
003c900d5e Add _assert_scalar (#117378)
Peeled off from https://github.com/pytorch/pytorch/pull/114148, because that PR is going to take a while to actually land.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117378
Approved by: https://github.com/jansel
2024-01-14 00:50:36 +00:00
1a8545164a [export] Add unit test for SDPA export result (#117390)
Summary:

A follow up for #117097. In that PR I didn't add
`_scaled_dot_product_attention_for_cpu` into the core_aten_decomposition
table. This PR does that and also adds a unit test.

Test Plan: python test/export/test_export.py -k
test_scaled_dot_product_attention

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117390
Approved by: https://github.com/drisspg
2024-01-14 00:21:28 +00:00
bf27dd6df9 Add dynamo support for operator.abs (#117442)
Adds a test case for operator.abs and allows for constant folding with it. Partially addresses #116396.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117442
Approved by: https://github.com/jansel, https://github.com/malfet
2024-01-13 21:38:55 +00:00
1a790f5a61 [RELAND] Error grad mode op in export API (#117420)
Summary: Title

Test Plan: CI

Differential Revision: D52706691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117420
Approved by: https://github.com/angelayi
2024-01-13 21:36:29 +00:00
d6847c5977 [CI] Set correct permissions for auto_request_review (#117408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117408
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2024-01-13 20:02:03 +00:00
53f3361319 [BE] Use nested namespaces for sparse (#117415)
C++17 is fu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117415
Approved by: https://github.com/Skylion007
2024-01-13 19:51:28 +00:00
d8bdb50379 [reland] pass shape/stride during tensor unflatten (#117340)
Reland of https://github.com/pytorch/pytorch/pull/113547, as the previous PR was reverted because of a torch.compile symbolic shape issue. Since we now disable tensor unflatten with dynamo.disable, we should not hit this issue again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117340
Approved by: https://github.com/Skylion007
ghstack dependencies: #117336
2024-01-13 19:33:47 +00:00
eebf115686 [fsdp][2d] FSDP sync module states handle tensor subclass (#117336)
This PR lets FSDP's sync module states kwarg handle tensor subclasses: because FSDP works on the "dp" mesh dimension, as long as FSDP works on a different device mesh dimension, we can safely let FSDP just broadcast the DTensor local shards.

fixes https://github.com/pytorch/pytorch/issues/117126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117336
Approved by: https://github.com/awgu
2024-01-13 19:33:47 +00:00
fc044b5cdb [pt-vulkan] Add build time flag to control descriptor pool sizes (#117398)
Summary:
## Context

When running large models with a lot of operators, the default descriptor pool allocated by the Vulkan compute API may run out of descriptor sets. This changeset introduces the `VULKAN_DESCRIPTOR_POOL_SIZE` build variable (which will default to `1024u`) which can allow for a larger descriptor pool to be allocated if necessary.

## Notes for Reviewers

This is a simple stopgap solution until we have bandwidth to implement the more general solution, which would be to modify the `DescriptorPool` class defined in `api/Descriptor.[h,cpp]` to automatically allocate a new descriptor pool when memory runs out. However, I would consider this change to be low priority since with a delegate/graph mode of execution, the descriptor pool can often be allocated to exactly fit a model's requirements.

Test Plan:
There should be no functional changes under default build settings. Run `vulkan_api_test` to make sure everything works as before; CI should test for that as well.

```
# On devserver
LD_LIBRARY_PATH=/home/ssjia/Github/swiftshader_prebuilt/swiftshader/build/bin/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*"
```

Reviewed By: yipjustin, jorgep31415

Differential Revision: D52742140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117398
Approved by: https://github.com/yipjustin
2024-01-13 13:11:00 +00:00
2c8975387d [Optimus] fix batch layernorm numerical issue (#117404)
Summary:
Fix the numerical issue with addcmul.

Found that torch.addcmul generates different values from torch.add+torch.mul under the 32-bit check. Mini repro: N4823658

Change addcmul to torch.add+torch.mul

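For context, a rough sketch (hypothetical tensors, not the Optimus pass itself) of why the fused op and the decomposed sequence can disagree in float32:
```python
import torch

t, a, b = (torch.randn(1024) for _ in range(3))
fused = torch.addcmul(t, a, b, value=2.0)         # t + 2.0 * a * b in one kernel
decomposed = torch.add(t, torch.mul(a, b) * 2.0)  # same math, different rounding order
print((fused - decomposed).abs().max())           # small, possibly nonzero difference
```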
Test Plan:
buck test

before change
```
the diff index is:  0
the diff index is:  1
the diff index is:  6
```

after change numeric on par

Differential Revision: D52745671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117404
Approved by: https://github.com/mengluy0125
2024-01-13 10:04:12 +00:00
f008efa8e7 Reconstruct streams via global registration, temporary impl to unblock FSDP (#117386)
This is a placeholder implementation for reconstructing streams via global storage to unblock FSDP, pending proper stream support design

This PR does a few things:

1) fixes registration for devices with indices. We were only supporting "cuda"; we now support "cuda:k" interfaces, where k is the GPU index

2) Changes the stream objects in dynamo to take devices as device types, instead of strings, and updates the string based device APIs to gracefully take device types.

3) Introduces a reconstruct-by-global (using existing cleanup hook structures) to streams as a placeholder impl for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117386
Approved by: https://github.com/jansel
2024-01-13 07:03:33 +00:00
ef3217d9f7 [PyTorch] Mark USDT probes as noinline to avoid duplications in ThinLTO mode (#117381)
Differential Revision: D52710343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117381
Approved by: https://github.com/chaekit
2024-01-13 06:18:01 +00:00
302f931c25 Update Reviewers for PyTorch Distributed team (#116231)
Update merge rule approver list under 'Distributed' section based on current PyTorch distributed team composition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116231
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
2024-01-13 05:07:13 +00:00
96163eb010 Switch nightly binaries to oidc. Remove aws keys (#117416)
This should fix all wheel nightly upload failures:
https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=upload
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117416
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-13 03:24:13 +00:00
22ddf91dbb [torch][fx] more strong typed codegen for partial specialized code on boolean (#117201)
Summary:
* In some fx partially specialized codegen via `concrete_args` on boolean arguments, we extend support so the GraphModule can further be used on a strongly typed runtime like TorchScript.
* This diff fixes the type annotation for booleans only and preserves the argument mapping for leaf pytree nodes.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:fx -- --exact 'caffe2/test:fx - test_partial_trace (test_fx.TestFX)'

Differential Revision: D52667883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117201
Approved by: https://github.com/houseroad
2024-01-13 03:10:02 +00:00
2bc7da1ab7 [HigherOrderOp] change signature of map_impl (#117161)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1580

This PR changes the schema of map_impl from map_impl(f, num_mapped, *operands) to map_impl(f, mapped_args: Tuple, moperands: Tuple). This is to prepare for turning on dynamo for eager mode map, where we want to get rid of the num_mapped scalar.

Test Plan: Existing tests.

Differential Revision: D52495413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117161
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-01-13 02:50:46 +00:00
f2f47c6848 [dynamo] realize LazyVT's in DICT_MERGE (#117282)
Fixes https://github.com/pytorch/pytorch/issues/115029.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117282
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-13 01:50:39 +00:00
3e397cefc5 Add uint1 to uint7 dtypes (#117208)
Summary:
These dtypes are added since we see more demand for these sub byte dtypes, especially with
the popularity of LLMs (https://pytorch.org/blog/accelerating-generative-ai-2/#step-4-reducing-the-size-of-the-weights-even-more-with-int4-quantization-and-gptq-2021-toks)

Note these are just placeholders, the operator support for these dtypes will be implemented with tensor subclass.
e.g. torch.empty(..., dtype=torch.uint1) will return a tensor subclass of uint1 that supports different operations like bitwise ops, add, mul, etc. (will be added later)

Also Note that these are not quantized data types, we'll implement quantization logic with tensor subclass backed up by these dtypes as well.
e.g `Int4GroupedQuantization(torch.Tensor)` will be implemented with torch.uint4 Tensors (see https://github.com/pytorch-labs/ao/pull/13 as an example)

Test Plan:
CIs
python test/test_quantization.py -k test_uint1_7_dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117208
Approved by: https://github.com/ezyang
2024-01-13 01:09:23 +00:00
52575eb1bb The permission id-token write needs to be set on rocm-test callers (#117422)
All these workflows lack the necessary permission to run `_rocm-test` job after https://github.com/pytorch/pytorch/pull/117160, for example https://github.com/pytorch/pytorch/actions/runs/7508520071

### Testing

Confirm that trunk is back https://github.com/pytorch/pytorch/actions/runs/7508830196.  Other workflows would be the same, i.e. rocm https://github.com/pytorch/pytorch/actions/runs/7508830137/job/20444989127.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117422
Approved by: https://github.com/atalman
2024-01-13 00:27:46 +00:00
9746f36e50 [export] Minor fixes to serialization (#117374)
* Checks that the input to torch.export.save is an ExportedProgram (https://github.com/pytorch/pytorch/issues/116952)
* Fixes naming for serialized state dict from `serialized_state_dict.json` to `serialized_state_dict.pt` (https://github.com/pytorch/pytorch/issues/116949)
* Moves some tests to be expectFailure rather than blocklisted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117374
Approved by: https://github.com/ydwu4
2024-01-13 00:23:06 +00:00
7f1f0b1135 [C10D] Add duration_ms to flight recorder (#114817)
Measures the duration of a collective operation using nccl start/end
events and includes this duration (in ms) in the flight recorder data.

duration_ms will be an optional field, since it only works when
timing is enabled.  Currently timing is enabled when flight recorder
is enabled, but this is not a strict requirement.  Duration is also
not available for collectives not in a completed state.

Note: computing duration can lead to a hang due to calling cudaEventElapsedTime when
the cuda driver queue is full.

We don't ever want the dump() API to hang, since we might want dump to help
debug a hang. Hence, we only query durations from the watchdog thread,
and it's possible that during a dump() call, some of the most recent
collectives' durations won't have been computed yet at the time of dump. We
make this tradeoff to ensure that dump() itself will never hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114817
Approved by: https://github.com/fduwjj, https://github.com/zdevito
ghstack dependencies: #116905
2024-01-12 23:34:11 +00:00
7a7535283f Some basic support for uint{16,32,64} codegen in CPU inductor (#116810)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116810
Approved by: https://github.com/chenyang78, https://github.com/eellison, https://github.com/desertfire
2024-01-12 23:13:28 +00:00
4b25948ee6 Torchbench Dynamo Runner: Enable DDP for perf test and traces (#113332)
- Removes an outdated assert that prevents perf tests from running DDP; we now have single-node --multiprocess, and perf tests are already wrapping the model using `deepcopy_and_maybe_ddp`
- Appends the rank name to traces to avoid all ranks trying to create the same file
- Renames `deepcopy_and_maybe_ddp` to `deepcopy_and_maybe_parallelize` to include FSDP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113332
Approved by: https://github.com/H-Huang, https://github.com/wconstab
2024-01-12 22:41:09 +00:00
c329eddcb9 Migrate the rest of state_dict testing to OptimizerInfo (#117186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117186
Approved by: https://github.com/albanD
ghstack dependencies: #116509
2024-01-12 22:32:37 +00:00
bcf1f312a0 Migrate nontensor step and CUDA params state_dict tests to OptimizerInfo (#116509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116509
Approved by: https://github.com/albanD
2024-01-12 22:32:37 +00:00
7b753cc7b8 Skip some slow tests (under Dynamo) (#117389)
Otherwise these may cause timeouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117389
Approved by: https://github.com/jerryzh168, https://github.com/voznesenskym
ghstack dependencies: #117318, #117320
2024-01-12 22:18:07 +00:00
d73846689d Rename test_legacy_vmap.py TestCase names (#117320)
The problem is that the dynamo_test_failures logic recognizes tests by
their TestClass.test_name. Unfortunately we have duplicate
TestClass.test_name in test_legacy_vmap and test_vmap. This PR
unduplicates them.

Something more robust would have been to include the test file name in
the dynamo_test_failures logic, but... it's a bit too late for that. We
can fix it if it becomes more of a problem in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117320
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117318
2024-01-12 22:18:07 +00:00
06576d859d Stop running ModuleInfo tests under Dynamo (#117318)
This is a policy decision, similar to the OpInfo one. The problem is
that they just take too long to run when we reset() before and after
each.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117318
Approved by: https://github.com/voznesenskym
2024-01-12 22:17:59 +00:00
fbd9bccb75 [C10D](reland) Add GIL checker to NCCL watchdog monitor (#117312)
Whenever the monitor thread kills the watchdog thread for being stuck, we do so to save cluster time and get a faster failure signal, but we want to know more about why it got stuck.

One possible reason for watchdog stuckness is GIL contention, which could be ruled out or observed by making an attempt to acquire the GIL at exit time.

If we cannot acquire the GIL within a short time window (1s) we abort the attempt and report GIL contention, otherwise we report that GIL was acquired successfully.

Reland: uses a function pointer to avoid destructor ordering issues on dlclose. (Looks like the destructor for the std::function was being run later than the libtorchpython lib was unloaded, leading to a crash).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117312
Approved by: https://github.com/zdevito
2024-01-12 21:48:45 +00:00
7b0926cc3e Fix wrong class inheritance in pyi (#116404)
As the title stated.

f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404
Approved by: https://github.com/ezyang, https://github.com/wconstab
2024-01-12 21:25:29 +00:00
c167c34396 Skip unsupported tests on arm (#117344)
Add skips to tests that involve record_context_cpp on ARM, as it is only supported on the Linux x86_64 arch. The error is reported as below:
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 3481, in test_direct_traceback
    c = gather_traceback(True, True, True)
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117344
Approved by: https://github.com/malfet, https://github.com/drisspg
2024-01-12 21:12:11 +00:00
384c4885fa [ProcessGroup] Do not print NCCL_DEBUG before NCCL init (#117328)
In case /etc/nccl.conf is used, `NCCL_DEBUG` is not set to sys env until NCCL inits.
The deleted print point is before NCCL inits, hence may be inaccurate.
This PR removes it and relies on the other print point which is after NCCL comm creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117328
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-01-12 20:46:50 +00:00
18bd5c05bc FFT: Handle noop fftn calls gracefully (#117368)
Fixes #117252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117368
Approved by: https://github.com/malfet
2024-01-12 20:16:50 +00:00
5cf481d1ac [CI] Explicitly specify read-all permissions on the token (#117290)
Would be nice to have it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117290
Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/huydhn, https://github.com/atalman
2024-01-12 19:15:54 +00:00
013a59acbd Update BCEWithLogitsLoss documentation regarding pos_weight (#117046)
Added clarification for the example provided for the pos_weight parameter in the BCEWithLogitsLoss class, particularly in multi-label binary classification context. This enhancement addresses potential misunderstandings about the application of 'binary' classification, which typically implies two classes, to scenarios involving multiple classes.
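
A small, hedged illustration of the multi-label scenario the docs now clarify: each of the 3 labels is its own binary problem, and pos_weight carries one weight per label (the 3x figure below is an arbitrary example, not taken from the docs):

```python
import torch
import torch.nn as nn

logits = torch.randn(64, 3)                      # 3 independent labels
targets = torch.randint(0, 2, (64, 3)).float()   # multi-hot targets
pos_weight = torch.full((3,), 3.0)               # up-weight positives, one weight per label

loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)(logits, targets)
print(loss)
```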

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117046
Approved by: https://github.com/mikaylagawarecki
2024-01-12 18:26:25 +00:00
e54b40e5eb [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117138
Approved by: https://github.com/jansel, https://github.com/mlazos
2024-01-12 18:21:14 +00:00
657545dbdd Migrate rocm test to using oidc (#117160)
Similar to Intel XPU, let's use OIDC for rocm runners.

Refer to this PR: https://github.com/pytorch/pytorch/pull/116554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117160
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-01-12 17:57:26 +00:00
cb42bc705b Make auto_functionalized HOP fallback in inductor (#117084)
It looks like the inductor fallback previously worked with HOPs but no longer
does, so I fixed that:
- all HOPs are exposed under torch.ops.higher_order, so I changed how
  inductor looks them up
- the inductor fallback assumed that an operator's signature was (*args,
  **kwargs). This is true for all the OpOverloads but not HOPs. I
  rewrote the code to not rely on this.

Test Plan:
- existing tests
- new test for auto_functionalized HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117084
Approved by: https://github.com/williamwen42
2024-01-12 17:57:01 +00:00
a97d00cca5 [Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
2024-01-12 17:30:40 +00:00
21d370819b [CI] Set permissions for stale workflow (#117371)
Hopefully should fix failures one observes in HUD as default permissions for the repo were changed to read-only
<img width="232" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/4047472c-ca3c-4288-add7-97f0ce43106a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117371
Approved by: https://github.com/clee2000
2024-01-12 16:44:15 +00:00
172dd13ecf [inductor][cpp] improve vector contiguous checks for FloorDiv and ModularIndexing (#117221)
Fix https://github.com/pytorch/pytorch/issues/114488

The PR tries to enable contiguous vector loads for cases where we can reduce `FloorDiv` and `ModularIndexing` in the vectorized loop.

Take the index expression in test case `test_vec_contiguous_ModularIndexing` for example.
`14336*x0 + 256*x1 + 128*((x2//256)) + ModularIndexing(x2, 1, 128) + 7168*ModularIndexing(x2, 128, 2)` can be reduced to `14336*x0 + 256*x1 + x2 + 128*x2_div_c0 + 7168*x2_mod_c0 + x2_mod_c1` where `x2` is a vectorized loop variable and the vector length is 16. This means we can do vectorized load for this index. Check the code comment for more details:
https://github.com/pytorch/pytorch/pull/117221/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R317-R329
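
A standalone sympy spot check of this claim (an assumption-laden sketch, not inductor's actual pass); only the x2-dependent terms of the index are kept, since the x0/x1 terms are constant within the vectorized loop:

```python
import sympy as sp

x2 = sp.Symbol("x2", integer=True, nonnegative=True)
expr = 128 * (x2 // 256) + sp.Mod(x2, 128) + 7168 * sp.Mod(x2 // 128, 2)

# Within any block of 16 consecutive x2 values starting at a multiple of 16,
# the div/mod terms stay constant, so the index advances by exactly 1 per lane.
for start in (0, 16, 128, 240, 256):
    vals = [int(expr.subs(x2, start + i)) for i in range(16)]
    assert all(b - a == 1 for a, b in zip(vals, vals[1:])), (start, vals)
print("contiguous within each aligned 16-lane block")
```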

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117221
Approved by: https://github.com/jansel
2024-01-12 15:20:36 +00:00
6c624aad37 [CPU] Disable floating-point contraction when compiling (#116318)
Fixes #100775.

For CPU inductor path, disable -ffp-contract, such as fma, from optimization flags to fix functional issues.

### Validation
Validation on 3 benchmark suites.

- [x] FP32: Negligible geomean change; No outlier models.

<img width="582" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/7c14a8b8-eb6c-4794-bff9-2e1ae3a22781">

- [x] BF16: Negligible geomean change; No outlier models.

<img width="589" alt="image" src="https://github.com/pytorch/pytorch/assets/23010269/cf558737-8cb2-411f-8761-27b9f8fc43af">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116318
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-12 14:09:05 +00:00
6ebb26d572 Fail Conv Binary Inplace check when act and accum are same tensor (#117331)
**Summary**
When a tensor is used both as the activation input of the conv and as the extra input of the binary add node, we shouldn't do conv binary inplace fusion.
```
       a
     /   \
  conv    |
     \    |
      add
```
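
A hedged sketch of the pattern above as a module (not the test from this PR): `a` is both the conv's activation and the extra input of the add, so writing the fused result in place into `a`'s buffer would clobber data the fused kernel still reads:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=1)

    def forward(self, a):
        return self.conv(a) + a   # `a` feeds both the conv and the add

m = M().eval()
x = torch.randn(1, 3, 8, 8)
with torch.no_grad():
    out = torch.compile(m)(x)    # the compiler should pick the outplace fusion here
```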

**TestPlan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117331
Approved by: https://github.com/jgong5
ghstack dependencies: #117330
2024-01-12 10:34:11 +00:00
19a9fdbf3a Add more alias and mutation check for other input of Conv Binary Inplace fusion (#117330)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/117108.
Use the outplace conv binary fusion when the other input is of type `TensorBox(View(ReinterpretView()))`, since the other input is a view of some other tensor.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117330
Approved by: https://github.com/jgong5
2024-01-12 10:29:33 +00:00
f7d9047864 [inductor] Iterative percolate tags (#117306)
Fixes https://github.com/pytorch/pytorch/issues/116581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117306
Approved by: https://github.com/aorenste, https://github.com/eellison
2024-01-12 07:52:32 +00:00
47c9d12ffd Add super().setUp() to TestFFT1D (#117329)
One day I'll move the check to be somewhere else so we don't need to worry about this anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117329
Approved by: https://github.com/huydhn
2024-01-12 07:47:01 +00:00
50049cfaa0 [1/4] Intel GPU Runtime Upstreaming for Device (#116019)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.

# Design
An Intel GPU device is a wrapper of a SYCL device on which kernels can be executed. In our design, we maintain a SYCL device pool containing all the GPU devices of the current machine, with the status of the device pool managed by PyTorch. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
  - `c10::xpu::device_count`
  - `c10::xpu::set_device`
  - ...

# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-01-12 07:36:25 +00:00
7dac2f9f2d [export][ez] Fix getting meta["val"] (#117313)
Summary: For integer inputs, they do not have a meta["val"].

Test Plan: `buck run @//mode/dev-nosan  //executorch/examples/portable/scripts:export -- -m emformer_predict` passes the export step

Differential Revision: D52716419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117313
Approved by: https://github.com/kirklandsign, https://github.com/tugsbayasgalan
2024-01-12 06:17:38 +00:00
40f12cec93 Change predispatch tracing API (#117278)
Summary: Change the API used in export for aotinductor

Test Plan: buck2 run mode/opt mode/inplace caffe2/test/inductor/fb:test_group_batch_fusion_fb

Differential Revision: D52678653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117278
Approved by: https://github.com/angelayi, https://github.com/khabinov
2024-01-12 06:10:02 +00:00
ec443089c7 enable fp16 mkldnn fusion/prepack in inductor (#117206)
- Extend `linear/conv/rnn` packable with `float16`.
- Extend `Unary fusion` to support `float16`.

Test Case:
    Extend bfloat16 related test in `test_cpu_repro.py` and `test_mkldnn_pattern_matcher.py` to test both `fp16` and `bf16`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117206
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-01-12 06:08:42 +00:00
9d5954e2a9 ignore ill-formed solution of reduce_inequalities (#117310)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/117033

Sometimes the solution returned by `sympy.solvers.inequalities.reduce_inequalities` can contain sub-expressions of the form `CRootOf(...)`, denoting the complex root of some equation in `x`, where `x` is an arbitrary symbol. We will now gracefully fail when this happens, like we already do when the solver itself fails.
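
A hedged sketch of how such a solution can arise (this is a made-up inequality, not the actual shape-env input):

```python
import sympy
from sympy.solvers.inequalities import reduce_inequalities

x = sympy.Symbol("x", real=True)
sol = reduce_inequalities([x**5 - x - 1 > 0], [x])
print(sol)  # the bound is typically expressed via CRootOf(x**5 - x - 1, 0)
```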

Test Plan: added a test

Differential Revision: D52715578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117310
Approved by: https://github.com/ezyang
2024-01-12 06:01:13 +00:00
638f85fd67 Add default parameters to rrelu_with_noise() (#117141)
Summary:
rrelu_with_noise() was listed as having default parameters in the schema but the
actual code definition didn't have them.

The failing example was calling rrelu() which DOES have default parameters and
it passes those defaulted values to C++. Under the covers the C++ code was calling
the python version of rrelu_with_noise().

Although the C++ code was passing all the values to the Python version of
rrelu_with_noise(), the PyTorch C++ -> Python dispatch code looks at the schema
and strips any parameters which match the schema's listed defaults, so if the
schema shows defaults that aren't in the code it will be a problem.
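
A small eager-mode sketch of the schema defaults in question (the reported failure was in the Python-side definition reached through dispatch, so this is only meant to show the intended call shape):

```python
import torch

x = torch.randn(8)
noise = torch.empty_like(x)

# rrelu() applies the defaults (lower=1/8, upper=1/3, training=False) ...
y1 = torch.rrelu(x)
# ... and the underlying op can be called with the same trailing arguments omitted
y2 = torch.ops.aten.rrelu_with_noise(x, noise)
```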

Test Plan:
I added a unit test for this specific case. It would probably be better to write
a more general one to validate all the ops against their schemas - but I haven't
learned enough about the test harness to do that yet.

Fixes #115811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117141
Approved by: https://github.com/yanboliang, https://github.com/oulgen
2024-01-12 05:32:13 +00:00
d29bf0a37e Fix ONNXProgram.save to use torch.load(..., mmap=True) for large models (#117295)
During ONNXProgram.save, the implicit/explicit state_dict passed in must
be loaded in memory in order to read each initializer and create an
external tensor proto with them

This PR ensures torch.load uses memory-map to support large models that
cannot fit in memory
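
A minimal sketch of the idea, assuming a checkpoint previously written with torch.save (the file name is hypothetical): with mmap=True the tensors are paged in lazily instead of being loaded into RAM up front:

```python
import torch

state_dict = torch.load("model_state.pt", map_location="cpu", mmap=True)
for name, tensor in state_dict.items():
    pass  # each initializer can then be written out as an external tensor, one at a time
```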
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117295
Approved by: https://github.com/BowenBao
ghstack dependencies: #117294
2024-01-12 04:38:27 +00:00
b62ba82cdc Update initializer path for ONNXProgram.save due to onnx.checker limitation (#117294)
According to https://github.com/onnx/onnx/blob/main/docs/ExternalData.md#large-models-2gb when initializers are larger than 2GB, `onnx.checker` requires the model to be in the same directory as the initializer.

Although not strictly necessary for the export and model save to succeed, it is desirable to have the `onnx.checker` to succeed when validation the resulting large model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117294
Approved by: https://github.com/BowenBao
2024-01-12 04:22:12 +00:00
b3b585af64 Revert "[codemod] markDynamoStrictTest batch 16 (#117218)"
This reverts commit 47119785acbfe20d9ef6cf5d90887a441402f5c7.

Reverted https://github.com/pytorch/pytorch/pull/117218 on behalf of https://github.com/zou3519 due to just felt like reverting this ([comment](https://github.com/pytorch/pytorch/pull/117218#issuecomment-1888360366))
2024-01-12 03:06:20 +00:00
ac0bed01df Revert "[dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)"
This reverts commit c278a1b39c8ae33feaa4a87b35b721fff7fdf19a.

Reverted https://github.com/pytorch/pytorch/pull/117138 on behalf of https://github.com/zou3519 due to Broke jobs on main, I'm not sure why ([comment](https://github.com/pytorch/pytorch/pull/117138#issuecomment-1888290068))
2024-01-12 01:55:49 +00:00
3214ada631 [MPS][BE] Better format nested ternary (#117198)
- Replace double ternary with if + ternary
- Replace deprecated `AT_ASSERT` with `TORCH_INTERNAL_ASSERT`
- Replace regular asserts with `TORCH_CHECK` or `TORCH_INTERNAL_ASSERT` depending on context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117198
Approved by: https://github.com/Skylion007
2024-01-12 01:29:17 +00:00
04604eea8a [inductor] check nan/inf for graph inputs (#117189)
This is split out from #103469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117189
Approved by: https://github.com/jansel
2024-01-12 00:59:32 +00:00
47119785ac [codemod] markDynamoStrictTest batch 16 (#117218)
[codemod] markDynamoStrictTest test_dataloader
[codemod] markDynamoStrictTest test_public_bindings
[codemod] markDynamoStrictTest test_namedtensor
[codemod] markDynamoStrictTest test_fx
[codemod] markDynamoStrictTest test_content_store
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest lazy/test_ts_opinfo
[codemod] markDynamoStrictTest functorch/test_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117218
Approved by: https://github.com/bdhirsh
2024-01-12 00:32:36 +00:00
c278a1b39c [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117138
Approved by: https://github.com/jansel
2024-01-11 23:26:25 +00:00
5d2d21a7be [bfloat16][easy] kthvalue, median (#117279)
Fixes #109991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117279
Approved by: https://github.com/Skylion007
2024-01-11 22:44:07 +00:00
5c6e7962f4 [c10d][EZ] Add more logs in the destructor of ProcessGroupNCCL for better root cause investigation (#117291)
Add logs to the place where we inspect whether a hang happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117291
Approved by: https://github.com/XilunWu, https://github.com/shuqiangzhang
2024-01-11 22:33:30 +00:00
53cba40651 [Distributed] Fix tests when CUDA not available (#117163)
NCCL tests failed after https://github.com/pytorch/pytorch/pull/116217 when PyTorch was not built with CUDA. This PR fixes the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117163
Approved by: https://github.com/malfet, https://github.com/wanchaol
2024-01-11 22:27:43 +00:00
9f87760160 Revert "[Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)"
This reverts commit e55a778cbb518e54c5afa5b8107b352746d7f41a.

Reverted https://github.com/pytorch/pytorch/pull/116445 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but i see it fails ROCm test in trunk due to an unsupported use case e55a778cbb ([comment](https://github.com/pytorch/pytorch/pull/116445#issuecomment-1888060036))
2024-01-11 22:21:45 +00:00
0a5aa5c2d1 [pt-vulkan][ez] Remove reference to c10::MemoryFormat from api/ folder (#117183)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset removes references to `c10::MemoryFormat` in `api/Tensor.[h,cpp]`; when constructing a `vTensor`, the `api::StorageType` (i.e. whether the tensor will be backed by buffer or texture storage) and `api::GPUMemoryLayout` (i.e. which dimension will be the fastest moving dimension) must be specified directly.

Differential Revision: [D52662234](https://our.internmc.facebook.com/intern/diff/D52662234/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117183
Approved by: https://github.com/liuk22, https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180, #117181
2024-01-11 22:08:29 +00:00
8b0bfb3aaa [FSDP] remove unused flat_param_part_view (#117082)
flat_param_part_view is unused in pytorch repo: https://fburl.com/ssaomd7x

it became unused since refactoring in https://github.com/pytorch/pytorch/pull/115497

before that, the original code is below. Since flat_param is 1D, we do
not need .view for reshaping

```
self.flat_param.data = padded_unsharded_flat_param[
    : unsharded_size.numel()
].view(
    unsharded_size
)
```

unit test: pytest test/distributed/fsdp/test_fsdp_core.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117082
Approved by: https://github.com/awgu, https://github.com/wconstab, https://github.com/Skylion007
2024-01-11 21:59:51 +00:00
3c66c89057 [pt-vulkan] Replace c10::ScalarType with native equivalent (#117181)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset introduces `api::ScalarType` in `api/Types.h`, which is intended to function the same as `c10::ScalarType`; thus `api/Types.h` is the primary file of interest. The rest of the changes are straightforward replacements of `c10::ScalarType` with `api::ScalarType`.

Differential Revision: [D52662237](https://our.internmc.facebook.com/intern/diff/D52662237/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117181
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179, #117180
2024-01-11 21:43:33 +00:00
331ae7f89f [pt-vulkan][ez] Replace c10::overflows with native equivalent (#117180)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset is very straightforward, as it simply copies the required components of `c10::overflows` from [`c10/util/Half.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Half.h#L477) into `api/Utils.h`.

Differential Revision: [D52662236](https://our.internmc.facebook.com/intern/diff/D52662236/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117180
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178, #117179
2024-01-11 21:43:33 +00:00
4205892be6 [pt-vulkan][ez] Replace ArrayRef with std::vector<T>& (#117179)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset replaces all instances of `c10::ArrayRef<T>` with `std::vector<T>&` and all instances of`c10::IntArrayRef` with `std::vector<int64_t>&`. There are a lot of changes in this changeset but that is simply due to the large number of callsites. All the changes are straightforward replacements.

Differential Revision: [D52662235](https://our.internmc.facebook.com/intern/diff/D52662235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117179
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177, #117178
2024-01-11 21:43:15 +00:00
b209de6699 [pt-vulkan] Replace TORCH_CHECK and similar macros with native equivalents (#117178)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset introduces the `api::Error` class in `api/Exception.h`, which is a more barebones copy of the `c10::Error` class from [`c10/util/Exception.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/Exception.h). The macros `VK_CHECK_COND` (equivalent to `TORCH_CHECK(cond, msg)`) and `VK_THROW` (equivalent to `TORCH_CHECK(false, msg)`) are introduced as well to replace calls to `TORCH_CHECK()` and similar macros.

Although this is a large diff, the most meaningful changes are in the added files `api/Exception.[h,cpp]` and `api/StringUtil.[h,cpp]` (which is mostly adapted from [`c10/util/StringUtil.h`](https://github.com/pytorch/pytorch/blob/main/c10/util/StringUtil.h)) which implements `api::Error` and the new macros. The rest of the diff is replacing calls to `TORCH_CHECK()` and similar macros with `VK_CHECK_COND()` and `VK_THROW()`.

Differential Revision: [D52662233](https://our.internmc.facebook.com/intern/diff/D52662233/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117178
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176, #117177
2024-01-11 21:43:15 +00:00
fe298e901a [pt-vulkan][ez] Replace ska::flat_hash_map, c10::get_hash with std::unordered_map, std::hash (#117177)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

The majority of the changes in this changeset are:

* Replacing instances of `ska::flat_hash_map` with `std::unordered_map`
   * `ska::flat_hash_map` is an optimized hash map, but the optimizations shouldn't be too impactful so `std::unordered_map` should suffice. Performance regression testing will be done at the final change in this stack to verify this.
* Replacing `c10::get_hash` with `std::hash` where only one variable is getting hashed or the `utils::hash_combine()` function added to `api/Utils.h` (which was copied from `c10/util/hash.h`)

Differential Revision: [D52662231](https://our.internmc.facebook.com/intern/diff/D52662231/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117177
Approved by: https://github.com/yipjustin
ghstack dependencies: #117176
2024-01-11 21:43:15 +00:00
57b76b970b [pt-vulkan][ez] Miscellaneous small c10 deprecations (c10::irange, C10_LIKELY, c10::SmallVector, etc.) (#117176)
## Context

This change is part of a set of changes that removes all references to the `c10` library in the `api/`, `graph/`, and `impl/` folders of the PyTorch Vulkan codebase. This is to ensure that these components can be built as a standalone library such that they can be used as the foundations of a Android GPU delegate for ExecuTorch.

## Notes for Reviewers

This changeset deprecates various easy-to-replace symbols from the `c10` library, replacing them with either C++ STL equivalents or by copying those `c10` symbols as native equivalents. The symbols that were impacted are:

* `c10::irange`
  * removed and replaced with standard for loops
* `C10_LIKELY` and `C10_UNLIKELY`
  * These macros allow for some branch re-ordering compiler optimizations when building with GCC. They aren't strictly necessary and their impact is likely minimal so these have simply been removed.
* `c10::SmallVector<T, N>`
* My understanding is that `c10::SmallVector<T, N>` is essentially a wrapper around `std::vector<T>` that is optimized for array sizes up to `N`. I don't believe that this optimization is worth creating a native equivalent, so I replaced instances of this symbol with `std::vector<T>`
* `c10::multiply_integers`
  * This function is simply a convenient wrapper around `std::accumulate`, so I copied it as a native equivalent in `api/Utils.h`

This changeset consists entirely of the replacements described above.

Differential Revision: [D52662232](https://our.internmc.facebook.com/intern/diff/D52662232/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117176
Approved by: https://github.com/yipjustin
2024-01-11 21:42:24 +00:00
24c39bb5e5 Upgrade nightly wheels to rocm6.0 (#116983)
Follow-up to https://github.com/pytorch/builder/pull/1647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116983
Approved by: https://github.com/jeffdaily, https://github.com/atalman
2024-01-11 20:36:00 +00:00
e55a778cbb [Nested Tensor]Support SDPA math fallback for jagged layout nested tensor (#116445)
Support this fallback by converting the jagged layout NT to a strided layout NT, and then converting the result back to a jagged layout NT.
This fallback might not be efficient since it uses unbind, contiguous and split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116445
Approved by: https://github.com/soulitzer
2024-01-11 20:28:40 +00:00
92cc8ae172 [FSDP] Cloned unsharded tensor slice in optim state dict load (#117261)
This takes the fix from https://github.com/pytorch/pytorch/issues/116553. Cloning the slice allows the base (much larger) tensor to be freed.
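
An illustrative sketch of why cloning helps (not the FSDP code itself): a narrow slice keeps the whole base tensor alive because it shares the base's storage:

```python
import torch

base = torch.empty(1024 * 1024)   # stand-in for the large unsharded tensor
view = base[:10]                  # shares storage with `base`
del base                          # the large storage is still held alive via `view`

owned = view.clone()              # copies only the 10 elements into fresh storage
del view                          # now the large storage can actually be freed
```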

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117261
Approved by: https://github.com/wz337
2024-01-11 20:21:12 +00:00
88bf84f106 [benchmark] add --compile-autograd to dynamo benchmarks (#117196)
Adds `--compile-autograd` flag to benchmark suite to run accuracy and performance tests. Also adds autograd_captures and autograd_compiles to dynamo stats

e.g. accuracy_inductor.csv
```
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,BERT_pytorch,4,pass,2655,2,8,7,1,1
cuda,Background_Matting,4,pass_due_to_skip,0,0,0,0,0,0
cuda,DALLE2_pytorch,0,eager_fail_to_run,0,0,0,0,0,0
cuda,LearningToPaint,4,pass,639,2,8,7,1,1
...
```

e.g. speedup_inductor.csv
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles
cuda,hf_T5,8,1.214311,136.236793,88.350570,0.751322,18.754706,24.962275,3298,2,8,8,1,1
cuda,hf_T5,8,1.226645,135.431856,52.461461,1.040973,18.754706,18.016508,795,1,7,7,0,0
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117196
Approved by: https://github.com/jansel
2024-01-11 20:12:58 +00:00
83c45a9931 Faster gc_count update for CUDACachingAllocator (and avoid nullptr de… (#117064)
…reference) (#109065)

Summary:

Modify the way we update gc_count in CUDACachingAlloctor to make it faster.

Originally D48481557, but reverted due to a nullptr dereference in some cases (D49003756). This diff changes it to use the correct constructor for the search key (to avoid the nullptr dereference). Also, adds a nullptr check (returning 0 if null) in the gc_count functions.

Differential Revision: D49068760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117064
Approved by: https://github.com/zdevito
2024-01-11 19:47:05 +00:00
5bc896e5dc Dockerfile; Add cuda bin to PATH (#117105)
We need this to execute `nvidia-smi` in the officially released containers. We already have it in the Docker CI.

See
94db6578cc/.ci/docker/linter-cuda/Dockerfile (L35)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117105
Approved by: https://github.com/atalman
2024-01-11 18:10:19 +00:00
9e3580f793 Fix #117011: add the TORCH_CHECK(grad_output) of upsample_nearest::backward() (#117100)
Add the TORCH_CHECK(grad_output) to upsample_nearest::backward().

Fixes #117011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117100
Approved by: https://github.com/lezcano
2024-01-11 18:06:22 +00:00
f89725fb41 [DCP][BC] Add the backward compatibility test (#116247)
This PR adds a test to ensure all metadata is backward compatible with the older definination.

Differential Revision: [D52357733](https://our.internmc.facebook.com/intern/diff/D52357733/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116247
Approved by: https://github.com/wz337
ghstack dependencies: #116245, #116246
2024-01-11 18:01:35 +00:00
7e9cbc6834 [CI] Catch more exception types when running eager in PT2 tests (#117120)
Summary: https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1332 shows a case where model loading fails with KeyError but the error is not logged in the report csv file, which can cause an eager model failure to be silently ignored in the PT2 integration test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117120
Approved by: https://github.com/huydhn
2024-01-11 17:46:11 +00:00
5b24877663 Improve uint{16,32,64} dlpack/numpy compatibility (#116808)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116808
Approved by: https://github.com/malfet, https://github.com/albanD
2024-01-11 17:01:54 +00:00
623b7fedc4 [c10d] Add comments to the rest environment variable within NCCLPG (#117092)
Not every environment variable within NCCLPG has a comment; let's add comments to each of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117092
Approved by: https://github.com/kwen2501
ghstack dependencies: #116545
2024-01-11 16:47:25 +00:00
3d1869d0ae [DCP][BE] Improve the readability of filesystem and fsspec filesystem (#116246)
1. Better typing
2. Remove 1-liner function

Differential Revision: [D52357731](https://our.internmc.facebook.com/intern/diff/D52357731/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116246
Approved by: https://github.com/wz337
ghstack dependencies: #116245
2024-01-11 16:27:21 +00:00
4c7b602645 Add Support For Symbolic Shapes in Register_replacement, SDPA Pattern Matching (#115441)
Many of our pattern matching replacements are specified as a `search_fn` and a `replacement_fn`. The search_fn's are traced out once with static shapes, converted to a pattern, and then matched on every graph compiled with inductor.

The static shape patterns would not match with graphs that are traced out with dynamic shapes because SymInts would be added to the graph as `sym_size` fx nodes which added additional uses and prevented matching. The previous PR partially addresses this by deduping SymInts that are resolvable to graph inputs, as is the calling convention in aot autograd.

This PR adjusts our matching of the `search_fn` by adding SymInts to the arguments we trace out the search_fn with so that their symint accesses are deduped. Later, if we have a match, we will trace out the replacement graph with the correct Tensors and corresponding symbolic shapes that will get added to the graph.

Note: the replacement patterns will insert sym_size uses which could potentially be removed, but I'll leave that for follow up.

Fix for https://github.com/pytorch/pytorch/issues/111190.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115441
Approved by: https://github.com/jansel
ghstack dependencies: #116158
2024-01-11 15:58:37 +00:00
bfc336308a Revert "Error grad mode op in export API (#117187)"
This reverts commit 89ef426ba0d87091303f6a3c21c38749f9af72a3.

Reverted https://github.com/pytorch/pytorch/pull/117187 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117187#issuecomment-1887363580))
2024-01-11 15:01:36 +00:00
767e1b6349 Revert "Bring docstring to .pyi file (#114705)"
This reverts commit 0dd5deecedd136852c7ccc81630eaefbebe5be29.

Reverted https://github.com/pytorch/pytorch/pull/114705 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114705#issuecomment-1887165326))
2024-01-11 13:30:44 +00:00
7005a4bcb6 [dynamo] Added dyn shapes support for math trigo ops: sin(h), cos(h), tan(h) ... (#114866)
Description:
- Added dynamic shapes support for math trigo ops: sin(h), cos(h), tan(h) ...

```python
import math
import torch

def func(x, a, b):
    c = 0
    c = c + math.sqrt(a)
    c = c + math.cos(a)
    c = c + math.cosh(a)
    c = c + math.sin(a)
    c = c + math.sinh(a)
    c = c + math.tan(a)
    c = c + math.tanh(a)
    c = c + math.asin(b)
    c = c + math.acos(b)
    c = c + math.atan(a)
    y = x + c
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True)

device = "cpu"  # or "cuda"
x = torch.tensor([0, 1, 2, 3], dtype=torch.float32, device=device)
a = 12
b = 1

out = cfunc(x, a, b)
expected = func(x, a, b)
torch.testing.assert_close(out, expected)
```

and the graph `TORCH_LOGS=+graph_code python check_math_ops.py`:

<details>
<summary>
graph code
</summary>

```
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  ===== __compiled_fn_0 =====
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]     def forward(self, L_a_ : torch.SymInt, s1 : torch.SymInt, L_x_ : torch.Tensor):
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_a_ = L_a_
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         l_x_ = L_x_
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:57, code: c = c + math.sqrt(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sqrt = torch.sym_sqrt(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add = 0 + sym_sqrt;  sym_sqrt = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:58, code: c = c + math.cos(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_cos = torch.sym_cos(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_1 = add + sym_cos;  add = sym_cos = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:59, code: c = c + math.cosh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_cosh = torch.sym_cosh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_2 = add_1 + sym_cosh;  add_1 = sym_cosh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:60, code: c = c + math.sin(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sin = torch.sym_sin(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_3 = add_2 + sym_sin;  add_2 = sym_sin = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:61, code: c = c + math.sinh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_sinh = torch.sym_sinh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_4 = add_3 + sym_sinh;  add_3 = sym_sinh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:62, code: c = c + math.tan(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_tan = torch.sym_tan(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_5 = add_4 + sym_tan;  add_4 = sym_tan = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:63, code: c = c + math.tanh(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_tanh = torch.sym_tanh(l_a_)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_6 = add_5 + sym_tanh;  add_5 = sym_tanh = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:64, code: c = c + math.asin(b)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_7 = add_6 + 1.5707963267948966;  add_6 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:65, code: c = c + math.acos(b)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_8 = add_7 + 0.0;  add_7 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:66, code: c = c + math.atan(a)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         sym_atan = torch.sym_atan(l_a_);  l_a_ = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         add_9 = add_8 + sym_atan;  add_8 = sym_atan = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         # File: check_math_ops.py:67, code: y = x + c
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         y = l_x_ + add_9;  l_x_ = add_9 = None
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]         return (y,)
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
[2023-11-30 22:16:10,654] [0/0] torch._dynamo.output_graph.__graph_code: [DEBUG]
```
</details>

Generated code with `TORCH_LOGS=+output_code python check_math_ops.py`:
<details>
<summary>
C++ code
</summary>

```
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] cpp_fused_add_0 = async_compile.cpp('''
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] #include "/tmp/torchinductor_root/2l/c2ljzlm4sosod7u6lyrroqdba6hmfcyijrric6p4t3fhbcmw6osp.h"
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] extern "C" void kernel(const float* in_ptr0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        float* out_ptr0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        const long ks0,
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]                        const long ks1)
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]     {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         #pragma GCC ivdep
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         for(long x0=static_cast<long>(0L); x0<static_cast<long>(ks0); x0+=static_cast<long>(1L))
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         {
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp0 = in_ptr0[static_cast<long>(x0)];
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp1 = c10::convert<float>(1.57079632679490 + (std::sqrt(ks1)) + (std::atan(ks1)) + (std::cos(ks1)) + (std::cosh(ks1)) + (std::sin(ks1)) + (std::sinh(ks1)) + (std::tan(ks1)) + (std::tanh(ks1)));
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]             out_ptr0[static_cast<long>(x0)] = tmp2;
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]         }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG]     }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] }
[2023-11-30 22:19:09,709] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''')
```

</details>

<details>
<summary>
Triton code
</summary>

```
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @pointwise(
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     size_hints=[4],
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     filename=__file__,
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1), equal_to_1=(), i
ds_of_folded_args=(), divisible_by_8=())]},
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_add_0', 'mutated_arg_names': []},
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     min_elem_per_thread=0
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] )
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] @triton.jit
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xoffset = tl.program_id(0) * XBLOCK
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     xmask = xindex < xnumel
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     x0 = xindex
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp0 = tl.load(in_ptr0 + (x0), xmask)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp1 = 1.57079632679490 + (tl.math.sqrt(ks0.to(tl.float32))) + (tl.math.atan((ks0).to(tl.float32))) + (tl.math.cos((ks0).to(tl.float32))) + (tl.math.cosh((ks0).to(tl.float32))) + (tl.math.sin((ks0)
.to(tl.float32))) + (tl.math.sinh((ks0).to(tl.float32))) + (tl.math.tan((ks0).to(tl.float32))) + (tl.math.tanh((ks0).to(tl.float32)))
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp2 = tmp1.to(tl.float32)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tmp3 = tmp0 + tmp2
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG]     tl.store(out_ptr0 + (x0), tmp3, xmask)
[2023-11-30 22:20:00,383] [0/0] torch._inductor.graph.__output_code: [DEBUG] ''')
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114866
Approved by: https://github.com/peterbell10
2024-01-11 11:52:28 +00:00
cyy
2b5a201aa6 [Exception] [3/N] Replace torch::NotImplementedError and torch::LinAlgError with C10 counterparts. (#116824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116824
Approved by: https://github.com/albanD
2024-01-11 11:27:04 +00:00
89ef426ba0 Error grad mode op in export API (#117187)
Summary:
This is a reland of https://github.com/pytorch/pytorch/pull/116339
Needed some internal adjustments to make it work properly. Original credit goes to andrewlee302

Test Plan: CI

Differential Revision: D52674706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117187
Approved by: https://github.com/suo
2024-01-11 09:06:59 +00:00
0e1f43c44d [inductor] don't access cluster_dims for too old version of triton (#117192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117192
Approved by: https://github.com/masnesral
2024-01-11 08:37:30 +00:00
3b2ddb6f71 Update TorchBench pinned commit (#117073)
~~To match their recent v4.36.2 release https://github.com/huggingface/transformers/commits/v4.36.2.  This is to fix the KeyError showing on release branch https://github.com/pytorch/pytorch/actions/runs/7451512288/job/20279117324#step:16:1336.  I think this can be updated in main too because the current pinned commit is already 4-month old.~~

Check with @desertfire, trying to update TorchBench pinned commit instead.

The test is also failing in main https://github.com/pytorch/pytorch/actions/runs/7467073391/job/20320251143#step:16:1120, but for some reason, it doesn't surface as a failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117073
Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi, https://github.com/desertfire
2024-01-11 08:35:00 +00:00
1cefc58905 init tls grad_mode/local_dispatch_key set while fork new thread in (#113246)
TorchDynamo will guard grad_mode and the local dispatch key set.
3a429423fc/torch/csrc/dynamo/guards.cpp (L13-L16)

While using ThroughputBenchmark, those TLS states will not be initialized to match the main thread's status.
3a429423fc/torch/csrc/utils/throughput_benchmark-inl.h (L64-L94)

Run following scripts
```
import torch
linear = torch.nn.Linear(128, 128)
compiled = torch.compile(linear)
x = torch.rand(10, 128)
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    compiled(x)
    compiled(x)

from torch._dynamo import config
config.error_on_recompile = True
from torch.utils import ThroughputBenchmark
with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    bench = ThroughputBenchmark(compiled)
    bench.add_input(x)
    stats = bench.benchmark(
        num_calling_threads=10,
        num_warmup_iters=100,
        num_iters=100,
    )
    print(stats)
```
will lead to 2 re-compile reasons:
```
triggered by the following guard failure(s): ___check_global_state()
triggered by the following guard failure(s): tensor 'x' dispatch key set mismatch.
```

This will trigger a re-compile in TorchDynamo. But since `ThroughputBenchmark` is used for sharing weights across threads, the model should not be changed anymore while running the benchmark. So this PR initializes the TLS state to match the main thread. Then we can use `ThroughputBenchmark` to run TorchDynamo-optimized models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113246
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-01-11 08:31:46 +00:00
9f57cf502f [inductor][cpu]disable pointwise_cat on CPU (#116313)
We observed a negative performance impact of the pointwise_cat optimization on CPU, so we disabled it. We will revisit this later after enabling vectorization on index_expr.

This PR fixes the following three regression issues:
https://github.com/pytorch/pytorch/issues/115827
https://github.com/pytorch/pytorch/issues/112139
https://github.com/pytorch/pytorch/issues/114495

It may, however, cause the performance regression of pytorch_unet to return. Related issue: https://github.com/pytorch/pytorch/issues/115343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116313
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
2024-01-11 08:00:00 +00:00
e3d4f4d14b [ProxyTensor] dedupe symbolic shapes in tracing (#116158)
Dedupes symbolic shapes in proxy tensor tracing. Reusing the existing sym shape avoids inserting spurious sym_size calls, which can interfere with pattern matching and graph passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116158
Approved by: https://github.com/ezyang
2024-01-11 07:15:11 +00:00
6f9fcc79c2 [DCP][BE] Remove unused fields (#116245)
As title

Differential Revision: [D52357730](https://our.internmc.facebook.com/intern/diff/D52357730/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116245
Approved by: https://github.com/wz337
2024-01-11 06:03:09 +00:00
263cc12fab Add Dynamo Reset in PT2E Quantization testing (#117200)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/117012 by adding `torch._dynamo.reset()` in `PT2EQuantizationTestCase._quantize`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117200
Approved by: https://github.com/jerryzh168
2024-01-11 05:53:55 +00:00
5ae221a214 [ONNX] Refactor op consistency tests (#116319)
Fixes #105338

This PR changes the op consistency tests from manually adding ops to a testing list to automatically testing all ops in the registry. It also spots more complex-dtype bugs in the converter.

Overall, this PR provides:
(1) Full test coverage of the ONNX registry
(2) More complete complex-dtype support
(3) Testing only the same dtypes as torchlib
(4) Auto-xfail for unsupported nodes

Follow-up issue: https://github.com/pytorch/pytorch/issues/117118
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116319
Approved by: https://github.com/justinchuby
2024-01-11 05:17:40 +00:00
9b1fac694e [c10d] Add extra sleep in waitForDumpOrTimeout to ensure enough time for all ranks dump debug info (#116545)
We added an extra sleep and made it configurable so that users can set an extra wait to ensure all ranks have dumped their debug info.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116545
Approved by: https://github.com/wconstab
2024-01-11 04:39:57 +00:00
ca23c56efc [codemod] markDynamoStrictTest batch 15 (#117139)
[codemod] markDynamoStrictTest test_spectral_ops
[codemod] markDynamoStrictTest test_fx_experimental
[codemod] markDynamoStrictTest test_foreach
[codemod] markDynamoStrictTest test_decomp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117139
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128, #117129, #117133
2024-01-11 04:28:57 +00:00
9dbe4eae82 [codemod] markDynamoStrictTest batch 14 (#117133)
[codemod] markDynamoStrictTest test_utils
[codemod] markDynamoStrictTest test_unary_ufuncs
[codemod] markDynamoStrictTest test_sparse_semi_structured
[codemod] markDynamoStrictTest test_sparse_csr
[codemod] markDynamoStrictTest test_sparse
[codemod] markDynamoStrictTest test_reductions
[codemod] markDynamoStrictTest test_proxy_tensor
[codemod] markDynamoStrictTest test_prims
[codemod] markDynamoStrictTest test_maskedtensor
[codemod] markDynamoStrictTest test_masked
[codemod] markDynamoStrictTest test_legacy_vmap
[codemod] markDynamoStrictTest test_binary_ufuncs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117133
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128, #117129
2024-01-11 04:28:57 +00:00
a526d0a926 Skip all OpInfo-based test when running with PYTORCH_TEST_WITH_DYNAMO (#117129)
This is a policy decision. These tests:
- are flaky, and fixing the flakiness is unfeasible at the moment
- are highly redundant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117129
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127, #117128
2024-01-11 04:28:42 +00:00
dc43ad4286 add is_grad_enabled check in runtime_wrapper before running with torch.no_grad (#117089)
We observed that `with torch.no_grad()` in runtime_wrapper introduced a ~10% (0.06ms -> 0.066ms) inference performance regression on lennard_jones on CPU.
For inference tasks in the benchmark, grad is already disabled, yet the current runtime_wrapper enters no_grad again and that overhead is counted in the running time.
Therefore, we add an `is_grad_enabled` check in runtime_wrapper before entering torch.no_grad: if grad is already disabled, there is no need to set no_grad again.
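
A minimal sketch of the guarded pattern described above (illustrative only, not the actual runtime_wrapper code):

```python
import torch

def run_wrapped(fn, *args):
    # only pay the cost of toggling grad mode when grad is currently enabled;
    # if the caller already disabled grad, call straight through
    if torch.is_grad_enabled():
        with torch.no_grad():
            return fn(*args)
    return fn(*args)
```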

Before this pr:
1.043x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.043427**,**0.068366**,4.756151,0.941846,45.056819,47.838822,9,1,0,0

After this pr:
1.146x
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cpu,lennard_jones,1,**1.146190**,**0.061844**,4.468380,0.936456,44.427264,47.441920,9,1,0,0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117089
Approved by: https://github.com/jgong5, https://github.com/bdhirsh
2024-01-11 03:37:45 +00:00
203430a778 [dynamo] easy - better assert message for EQUALS_MATCH guard (#117006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117006
Approved by: https://github.com/lezcano
ghstack dependencies: #116723
2024-01-11 03:14:43 +00:00
79de14546d [export] Add TORCH_LOGS=export (#116993)
Adds TORCH_LOGS=export, which currently includes dynamo/dynamic logs. In the future, if we add any logs under the torch/export directory, they will also show up under TORCH_LOGS=export.
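
A hedged usage sketch (the env-var mechanism is the standard TORCH_LOGS handling; setting it from Python before importing torch is shown only for illustration):

```python
# equivalent to running `TORCH_LOGS=export python my_script.py`
import os

os.environ["TORCH_LOGS"] = "export"  # enables the new export logging bucket
import torch  # noqa: E402
```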

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116993
Approved by: https://github.com/avikchaudhuri
2024-01-11 03:02:23 +00:00
6f0f4f12ca [BugFix] Prevent LSTM to run with wrong input shape (#115542)
Fixes #114874
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115542
Approved by: https://github.com/mikaylagawarecki
2024-01-11 02:57:09 +00:00
10509dac85 [C10D] Rename flightrecorder key vars to avoid confusion (#116905)
Key vars are strings used as dict keys (e.g. the variable `duration_s` held the
string "duration_ms").

The `_s` suffix suggested seconds, which was confusing: `duration_s` was a key
string while `duration_ms` was a separate variable holding a time value.

Now the key variable is named `duration_key` and holds "duration_ms".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116905
Approved by: https://github.com/zdevito
2024-01-11 02:57:04 +00:00
1174e82bde Revert "Add _assert_scalar and teach Inductor to codegen it (#114148)"
This reverts commit b6028acfa46363c1d3262a1522741a06c307843f.

Reverted https://github.com/pytorch/pytorch/pull/114148 on behalf of https://github.com/osalpekar due to Going to revert this given the broken torchrec PT2 tests internally: [D52648865](https://www.internalfb.com/diff/D52648865). Logs aren't too clear but @dstaay-fb can help debug as well ([comment](https://github.com/pytorch/pytorch/pull/114148#issuecomment-1886100368))
2024-01-11 02:30:22 +00:00
0f10a706f6 add a docblock for torch._scaled_mm (#117190)
Summary:

Describes the arguments in more detail. Not in user facing docs for now, but a step towards getting there eventually.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117190
Approved by: https://github.com/drisspg
2024-01-11 02:22:44 +00:00
edec54b9de Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #113397
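
A hedged usage sketch of the new API from the title above, with semantics inferred from #109833 (the first write to an alias is assumed to trigger materialization):

```python
import torch

x = torch.ones(3)
y = torch._lazy_clone(x)  # y lazily shares x's storage (copy-on-write)
y.add_(1)                 # writing materializes y's own copy; x stays all-ones
print(x, y)
```
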
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
2024-01-11 01:32:44 +00:00
71343507cd Add super().setup in test_numeric (#117148)
Call super().setUp() so that it will check the disabled test json (and also reset seeds etc)

Test:
Check that test_all_any is skipped in dynamo shard - success
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117148
Approved by: https://github.com/huydhn
2024-01-11 01:03:46 +00:00
cyy
2f17a21b2b [Reland] [13/N] Enable clang-tidy on headers of torch/csrc (#117088)
Reland of #116560 and fixes the issues reported by #116695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117088
Approved by: https://github.com/albanD
2024-01-10 23:58:04 +00:00
8783fe9cf3 [export] Modify SDPA decomposition to decompose _scaled_dot_product_flash_attention_for_cpu (#117097)
Summary: As titled. #115913 added
`_scaled_dot_product_flash_attention_for_cpu` and the export result of
`scaled_dot_product_attention` includes this op. Adding this
decomposition so that it's being decomposed the same way as
`_scaled_dot_product_attention_math`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117097
Approved by: https://github.com/lezcano
2024-01-10 23:46:14 +00:00
f70aeb4ffd Fix backward for reshape() on jagged layout NT (#117137)
Provides symbolic C++-side `reshape_as()` / `reshape()` decomps for jagged layout NTs to make the backwards pass work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117137
Approved by: https://github.com/soulitzer
2024-01-10 23:35:07 +00:00
e10cfdd895 Update matmul requires_grad checks (#117067)
Fixes https://github.com/pytorch/pytorch/issues/116099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117067
Approved by: https://github.com/lezcano, https://github.com/albanD
ghstack dependencies: #116523, #116710
2024-01-10 23:16:42 +00:00
7e6a04e542 Allow unMarkDynamoStrictTest to work on tests (instead of just classes) (#117128)
Tested locally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117128
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114, #117127
2024-01-10 22:25:40 +00:00
1b8ebb6c42 [codemod] markDynamoStrictTest batch 13 (#117127)
[codemod] markDynamoStrictTest test_overrides
[codemod] markDynamoStrictTest test_namedtuple_return_api
[codemod] markDynamoStrictTest test_jiterator
[codemod] markDynamoStrictTest test_jit_disabled
[codemod] markDynamoStrictTest test_jit_autocast
[codemod] markDynamoStrictTest test_fx_reinplace_pass
[codemod] markDynamoStrictTest test_fx_passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117127
Approved by: https://github.com/voznesenskym
ghstack dependencies: #117114
2024-01-10 22:25:40 +00:00
79e6d2ae9d Remove incorrect usages of skipIfTorchDynamo (#117114)
Using bare `@skipIfTorchDynamo` is wrong; the correct usage is
`@skipIfTorchDynamo()` or `@skipIfTorchDynamo("msg")`. The incorrect usage
causes tests to silently stop existing, as the sketch below illustrates.
Added an assertion for this and fixed the incorrect callsites.
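
A hedged sketch of why the bare decorator breaks test collection (a simplified stand-in, not the real helper from common_utils):

```python
def skipIfTorchDynamo(msg="skipped under dynamo"):
    # decorator *factory*: it must be called to obtain the real decorator
    def decorator(fn):
        return fn  # the real version wraps fn with a runtime skip check
    return decorator

@skipIfTorchDynamo        # WRONG: the test function itself becomes `msg`,
def test_a():             # and the returned inner `decorator` replaces the test,
    ...                   # so the original test silently stops existing

@skipIfTorchDynamo()      # correct usage
def test_b():
    ...
```
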
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117114
Approved by: https://github.com/voznesenskym
2024-01-10 22:25:31 +00:00
d6540038c0 Fix 0-dim Index in Index Copy decomp (#117065)
Fix for https://github.com/pytorch/pytorch/issues/115931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117065
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-01-10 22:13:43 +00:00
b9293e74a2 [ROCm] Fixes for hipblasLt for mm use case. (#116537)
This PR fixes the accuracy issues for hipblasLt in the mm case on ROCm.
This PR is a follow-up to the integration PRs https://github.com/pytorch/pytorch/pull/114329 and https://github.com/pytorch/pytorch/pull/114890

The accuracy issue arises in the mm use case on ROCm when hipblasLt is enabled and a bias is passed even though none is required. This PR addresses that issue.
Added a unit test for this (bias=None) case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116537
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-01-10 22:13:18 +00:00
7e37f63e5e [Reference Cycle Detector] Ignore FakeTensor in cycle leak detection (#117116)
Summary: Skip FakeTensors since these tensors are not actually using GPU memory. Reference Cycle Detector does not need to generate plots for these tensors.

Test Plan: CI and internal testing.

Differential Revision: D52637209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117116
Approved by: https://github.com/zdevito, https://github.com/tianfengfrank
2024-01-10 21:33:56 +00:00
3e9bb8d4de Run docker release build on final tag (#117131)
To be successful, the docker release workflow needs to run on final tag, after the Release to conda and pypi are complete.

Please refer to: https://github.com/pytorch/pytorch/blob/main/Dockerfile#L76

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117131
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-01-10 21:00:45 +00:00
73990c37e6 [c10d] To make ProcessGroupNCCL to use globalStore for coordination (#117075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117075
Approved by: https://github.com/wconstab
ghstack dependencies: #117074
2024-01-10 20:39:53 +00:00
180425df9b [c10d] Add a recursive method to get the inner most store (#117074)
In c10d PG initialization, we wrap TCPStore with multiple layers of PrefixStore which adds layers of prefix.

One example is:
"default_pg/0//cuda//timeout_dump"
When initializing the default PG, because no store is passed, we first add the prefix "default_pg" to the TCPStore returned from rendezvous:

bdeaaad70c/torch/distributed/distributed_c10d.py (L1240)

We then add pg_name (aka 0) bdeaaad70c/torch/distributed/distributed_c10d.py (L1376) and device (aka cuda) bdeaaad70c/torch/distributed/distributed_c10d.py (L1387)

to the prefix. Then, when we call store_->set("timeout_dump"), the actual key used for writing into the TCPStore is "default_pg/0//cuda//timeout_dump".

For sub-PGs, things get even more interesting: we put the store wrapped with the default PG name into a cache:
bdeaaad70c/torch/distributed/distributed_c10d.py (L1517)

When creating each sub-PG, its PG name is appended right after the cached store's prefix. Example keys are:
'default_pg/0//10//cuda//timeout_dump', 'default_pg/0//12//cuda//timeout_dump', 'default_pg/0//38//cuda//timeout_dump', 'default_pg/0//39//cuda//timeout_dump'. (10, 12, 38 and 39 are all PG names of each subPG created)

The reason the number in the name is bumped up so high is that every sub-PG creation requires all ranks to call the API together, and the global variable used for the PG name is bumped up monotonically:

bdeaaad70c/torch/distributed/distributed_c10d.py (L3666)

Similar things happen for using hashing for PG names.

This is a potential issue: each sub-PG has its own instance of ProcessGroupNCCL, and if we want to set something global to notify all sub-PGs (and all ranks), the added prefix causes bugs. For example, if on sub-PG 1 we set a value in the TCPStore with the key ('default_pg/0//1//cuda//timeout_dump') while the default PG instances check the TCPStore using the key ('default_pg/0//cuda//timeout_dump'), the default PG instances will never see the signal. So in this PR, we add a new API to PrefixStore that retrieves the innermost non-PrefixStore for set and check. The next PR will make the corresponding changes in the NCCL watchdog.
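
A minimal sketch of the unwrapping idea, assuming the wrapped store is reachable from a PrefixStore (the attribute name below is illustrative, not necessarily the API added here):

```python
import torch.distributed as dist

def innermost_store(store):
    # peel off PrefixStore layers until we reach the underlying TCPStore/FileStore
    while isinstance(store, dist.PrefixStore):
        store = store.underlying_store  # illustrative accessor for the wrapped store
    return store
```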

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117074
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-01-10 20:22:55 +00:00
6f8fc42dba [inductor] Add support for tl.make_block_ptr (#116079)
On A100 this is a small regression:
![image](https://github.com/pytorch/pytorch/assets/533820/b30eee9d-c0fe-4123-99da-d554fc5d0171)

So I will leave it disabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116079
Approved by: https://github.com/shunting314
2024-01-10 20:02:49 +00:00
9bf9586c6d Pytest do not rewrite assertions by default (#117060)
From https://pytest.org/en/7.4.x/how-to/assert.html#advanced-assertion-introspection
pytest only rewrites test modules directly discovered by its test collection process, so asserts in supporting modules which are not themselves test modules will not be rewritten.

In CI we usually call the test file directly (`python test_ops.py`), which calls run_test, which calls pytest.main; the test module is therefore already imported as `__main__`, so pytest does not import it itself and relies on the already-imported module.  (#95844)

However, calling `pytest test_ops.py` will rely on pytest to import the module, resulting in asserts being rewritten, so I add --assert=plain by default into the opts so we don't have to worry about this anymore.  Another way to make pytest stop assertion rewriting in a file is to include `PYTEST_DONT_REWRITE` somewhere in the file.
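
A small sketch of the per-file opt-out mentioned above (as documented by pytest, rewriting is skipped for any module whose docstring contains the magic string):

```python
"""Tests for some supporting module.  PYTEST_DONT_REWRITE"""


def test_example():
    assert 1 + 1 == 2  # plain assert, left untouched by pytest's rewriter
```
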
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117060
Approved by: https://github.com/zou3519
2024-01-10 20:02:45 +00:00
fad7734fa7 [AOTI] Remove caching for compiled model.so (#117087)
Summary: Oleg found that the model.so caching does not include the model weights when computing the hash key, which can cause incorrect model.so reuse. Since caching is not really necessary in AOT mode, let's just remove it.

Test Plan: CI

Differential Revision: D52647555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117087
Approved by: https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
2024-01-10 19:53:27 +00:00
e4e80dc9b3 [FSDP] sharded grad scaler: copy found_inf after waiting on async reduce_all (#115710)
**Expected behavior**: when rank 0 has an inf grad, ranks 1...k should get `found_inf=1` after `dist.reduce_all`
**Bug addressed in this PR**: for CPU-offloaded param.grad, when rank 0 has inf, ranks 1...k would not get found_inf=1. This is because `found_inf` was copied before `future.wait` on the async `dist.reduce_all`
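
A minimal runnable sketch of the ordering fix on a single-rank gloo group (names illustrative, not the actual ShardedGradScaler code):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

found_inf = torch.zeros(1)
work = dist.all_reduce(found_inf, async_op=True)
work.wait()                      # wait on the async reduce *before* reading the result
found_inf_cpu = torch.empty(1)
found_inf_cpu.copy_(found_inf)   # copy only after the reduction has completed

dist.destroy_process_group()
```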

repro the bug using the newly added unit test: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
  File "/data/users/weif/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 320, in _test_sharded_grad_scaler_found_inf
    self.assertEqual(
  File "/data/users/weif/pytorch/torch/testing/_internal/common_utils.py", line 3576, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not close!

Expected 1.0 but got 2.0.
Absolute difference: 1.0 (up to 1e-05 allowed)
Relative difference: 1.0 (up to 1.3e-06 allowed)
rank: 0 iter: 0 expect origin scale 2.0 to be backed off by 0.5 but got 2.0
```

verify the bug is fixed: `pytest test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py -k test_sharded_grad_scaler_found_inf`

```
test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py dist init r=1, world=8
dist init r=3, world=8
dist init r=7, world=8
dist init r=4, world=8
dist init r=6, world=8
dist init r=2, world=8
dist init r=0, world=8
dist init r=5, world=8
NCCL version 2.19.3+cuda12.0
.                                                                                                                 [100%]

====================================================================== 1 passed, 19 deselected in 27.43s =========================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115710
Approved by: https://github.com/awgu
2024-01-10 19:17:25 +00:00
9eb842cbd6 Compiled autograd: Lift autograd functions' backward and provide default key for custom autograd functions (#115573)
This PR adds support for torch.autograd.Function subclasses in compiled autograd. We do this by:
- Creating a uid for all torch.autograd.Function via its metaclass. This uid is used in the compiled autograd key, which is a subset of the cache key to the compiled graph
- "Lifting" the backward/saved_tensors, having them as input arguments in the compiled graph
  - Creating proxies to track the backward's inputs and outputs. Since the backward's outputs (grads) have to match the forward's inputs, we pass the node's `input_info` (forward's input sizes) to build the proxies tracking the backward's outputs.
  - Use a `FakeContext` class as a replacement for the autograd node's context object (`BackwardCFunction`) during tracing, only support passing saved_tensors from the forward to the backward
  - Index each backward, to support multiple torch.autograd.Functions in the same graph
  - Special case for `CompiledFunctionBackward`: lifting CompiledFunction would fail 4 tests and requires some skipfiles changes that I'd rather do in a separate PR

Example graph: test_custom_fn_saved_multiple_tensors (eager fw + compiled autograd)
```python
class MyFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)
        return torch.sin(x), torch.sin(y)

    @staticmethod
    def backward(ctx, gO_x, gO_y):
        (x, y) = ctx.saved_tensors
        return gO_x * torch.cos(x), gO_y * torch.cos(y)
```
The backwards is lifted via `getitem_5` and `call_backward`
```python
# Compiled autograd graph
 ===== Compiled autograd graph =====
 <eval_with_key>.0 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]" = inputs[0]
        getitem_1: "f32[10]" = inputs[1]
        getitem_2: "f32[10]" = inputs[2]
        getitem_3: "f32[10]" = inputs[3]
        getitem_4: "f32[10]" = inputs[4];  inputs = None
        expand: "f32[10]" = torch.ops.aten.expand.default(getitem, [10]);  getitem = None
        mul: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None
        mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None
        getitem_5 = hooks[0];  hooks = None
        call_backward = torch__dynamo_external_utils_call_backward(getitem_5, (getitem_3, getitem_4), mul_1, mul);  getitem_5 = mul_1 = mul = None
        getitem_6: "f32[10]" = call_backward[0]
        getitem_7: "f32[10]" = call_backward[1];  call_backward = None
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7);  getitem_4 = getitem_7 = None
        accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6);  getitem_3 = getitem_6 = None
        return []
```

then is later inlined by dynamo
```python
# Dynamo graph
 ===== __compiled_fn_0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, L_inputs_0_ : torch.Tensor, L_inputs_1_ : torch.Tensor, L_inputs_2_ : torch.Tensor, L_inputs_3_ : torch.Tensor, L_inputs_4_ : torch.Tensor):
        getitem = L_inputs_0_
        getitem_1 = L_inputs_1_
        getitem_2 = L_inputs_2_
        x = L_inputs_3_
        y = L_inputs_4_

        # File: <eval_with_key>.0:10, code: expand = torch.ops.aten.expand.default(getitem, [10]);  getitem = None
        expand = torch.ops.aten.expand.default(getitem, [10]);  getitem = None

        # File: <eval_with_key>.0:11, code: mul = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None
        mul = torch.ops.aten.mul.Tensor(expand, getitem_2);  getitem_2 = None

        # File: <eval_with_key>.0:12, code: mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None
        mul_1 = torch.ops.aten.mul.Tensor(expand, getitem_1);  expand = getitem_1 = None

        # File: /data/users/xmfan/core/pytorch/test/inductor/test_compiled_autograd.py:412, code: return gO_x * torch.cos(x), gO_y * torch.cos(y)
        cos = torch.cos(x)
        getitem_6 = mul_1 * cos;  mul_1 = cos = None
        cos_1 = torch.cos(y)
        getitem_7 = mul * cos_1;  mul = cos_1 = None

        # File: <eval_with_key>.0:17, code: accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_4, getitem_7);  getitem_4 = getitem_7 = None
        accumulate_grad__default = torch.ops.inductor.accumulate_grad_.default(y, getitem_7);  y = getitem_7 = None

        # File: <eval_with_key>.0:18, code: accumulate_grad__1 = torch.ops.inductor.accumulate_grad_.default(getitem_3, getitem_6);  getitem_3 = getitem_6 = None
        accumulate_grad__default_1 = torch.ops.inductor.accumulate_grad_.default(x, getitem_6);  x = getitem_6 = None
        return ()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115573
Approved by: https://github.com/jansel
2024-01-10 18:01:28 +00:00
b4a35632f9 Add function to materialize COW storages (#117053)
Summary: From Kurt Mohler, see https://github.com/pytorch/pytorch/pull/113396 (manually imported due to ghimport problems)

Test Plan: sandcastle, OSS CI

Differential Revision: D52610522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117053
Approved by: https://github.com/malfet, https://github.com/kurtamohler
2024-01-10 15:34:16 +00:00
ec98df70f3 [CPU] _vec_softmax_backward, _vec_log_softmax_backward, _vec_logsoftmax: fix CHUNK_SIZE to avoid unnecessarily large allocation (#117029)
Similar to https://github.com/pytorch/pytorch/pull/116990, fixes `CHUNK_SIZE` in `_vec_softmax_backward`, `_vec_log_softmax_backward`, `_vec_logsoftmax`, where `CHUNK_SIZE` is set as
```cpp
int64_t BLOCK_SIZE = 128 * 1024;
int64_t CHUNK_SIZE = std::max<int64_t>(BLOCK_SIZE / dim_size / sizeof(scalar_t), Vec::size());
CHUNK_SIZE = CHUNK_SIZE / Vec::size() * Vec::size();
```
where `BLOCK_SIZE / dim_size / sizeof(scalar_t)` computes the maximum number of inner-dim elements that can fit into the L2 cache, assuming an L2 cache of 128KB, and `CHUNK_SIZE / Vec::size() * Vec::size()` rounds `CHUNK_SIZE` down to a multiple of `Vec::size()`.

Set `CHUNK_SIZE` to the minimum of `CHUNK_SIZE` and `inner_size` to avoid an unnecessarily large `CHUNK_SIZE` and unnecessarily large allocations for the `max` and `tmp_sum` buffers.
```cpp
auto buffer = std::make_unique<scalar_t []>(CHUNK_SIZE * 2);
scalar_t* input_max_data = buffer.get();
scalar_t* tmp_sum_data = buffer.get() + CHUNK_SIZE;
```
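
A schematic Python rendering of the sizing logic above (parameter names are illustrative and the rounding/clamp order is simplified; the actual change lives in the C++ kernels):

```python
def chunk_size(block_size, dim_size, elem_size, vec_size, inner_size):
    c = max(block_size // (dim_size * elem_size), vec_size)  # rows that fit in L2
    c = c // vec_size * vec_size                              # multiple of Vec::size()
    return min(c, inner_size)                                 # the fix: clamp to inner_size

# e.g. dim_size=1 with fp32 used to size the scratch buffers at 32768 elements
# even when inner_size was tiny; with the clamp the buffers track inner_size instead
print(chunk_size(block_size=128 * 1024, dim_size=1, elem_size=4, vec_size=16, inner_size=8))
```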

### Performance

Perf data of `_vec_logsoftmax` collected for `dim_size` in range [2^0, 2^9] and `outer_size` in range [2^0, 2^3]. To measure the benefit from avoiding unnecessarily large allocation, values of `outer_size` were chosen such that `outer_size` is less than `BLOCK_SIZE / dim_size / sizeof(scalar_t)` for all values of `outer_size`.

Tested on 28 physical cores/socket, 1 socket on Skylake.

| **dim_size** 	| **BLOCK_SIZE / dim_size / sizeof(scalar_t)** 	| **input shape: (dim_size, inner_size)** 	| **Baseline (original implementation)** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
|--------------	|----------------------------------------------	|-----------------------------------------	|----------------------------------------	|---------------	|----------------------------------------	|
| 1            	| 32768                                        	| (1, 1)                                  	| 0.012578964                            	| 0.003523827   	| **3.569689**                           	|
|              	|                                              	| (1, 2)                                  	| 0.012645721                            	| 0.003550053   	| **3.562122**                           	|
|              	|                                              	| (1, 4)                                  	| 0.01303196                             	| 0.003521442   	| **3.700745**                           	|
|              	|                                              	| (1, 8)                                  	| 0.01275301                             	| 0.003552437   	| **3.589933**                           	|
| 2            	| 16384                                        	| (2, 1)                                  	| 0.008230209                            	| 0.003688335   	| **2.231416**                           	|
|              	|                                              	| (2, 2)                                  	| 0.00821352                             	| 0.003502369   	| **2.345133**                           	|
|              	|                                              	| (2, 4)                                  	| 0.008280277                            	| 0.003442764   	| **2.405125**                           	|
|              	|                                              	| (2, 8)                                  	| 0.0086236                              	| 0.003490448   	| **2.470628**                           	|
| 4            	| 8192                                         	| (4, 1)                                  	| 0.005865097                            	| 0.003454685   	| **1.697723**                           	|
|              	|                                              	| (4, 2)                                  	| 0.005846024                            	| 0.003490448   	| **1.674863**                           	|
|              	|                                              	| (4, 4)                                  	| 0.006036758                            	| 0.0035429     	| **1.703903**                           	|
|              	|                                              	| (4, 8)                                  	| 0.005993843                            	| 0.003669262   	| **1.633528**                           	|
| 8            	| 4096                                         	| (8, 1)                                  	| 0.00469923                             	| 0.003535748   	| **1.329063**                           	|
|              	|                                              	| (8, 2)                                  	| 0.004696846                            	| 0.003600121   	| **1.304636**                           	|
|              	|                                              	| (8, 4)                                  	| 0.005483627                            	| 0.003721714   	| **1.473414**                           	|
|              	|                                              	| (8, 8)                                  	| 0.005180836                            	| 0.00389576    	| **1.329865**                           	|
| 16           	| 2048                                         	| (16, 1)                                 	| 0.00446558                             	| 0.003738403   	| **1.194515**                           	|
|              	|                                              	| (16, 2)                                 	| 0.004258156                            	| 0.00382185    	| **1.114161**                           	|
|              	|                                              	| (16, 4)                                 	| 0.004422665                            	| 0.004007816   	| **1.10351**                            	|
|              	|                                              	| (16, 8)                                 	| 0.004923344                            	| 0.004308224   	| **1.142778**                           	|
| 32           	| 1024                                         	| (32 , 1)                                	| 0.004467964                            	| 0.00402689    	| **1.109532**                           	|
|              	|                                              	| (32, 2)                                 	| 0.004336834                            	| 0.004196167   	| 1.033523                               	|
|              	|                                              	| (32, 4)                                 	| 0.004661083                            	| 0.004513264   	| 1.032752                               	|
|              	|                                              	| (32, 8)                                 	| 0.005385876                            	| 0.005121231   	| **1.051676**                           	|
| 64           	| 512                                          	| (64, 1)                                 	| 0.004725456                            	| 0.00462532    	| 1.021649                               	|
|              	|                                              	| (64, 2)                                 	| 0.005085468                            	| 0.004930496   	| 1.031431                               	|
|              	|                                              	| (64, 4)                                 	| 0.005791187                            	| 0.005600452   	| 1.034057                               	|
|              	|                                              	| (64, 8)                                 	| 0.007030964                            	| 0.006783009   	| 1.036555                               	|
| 128          	| 256                                          	| (128, 1)                                	| 0.005710125                            	| 0.005786419   	| _0.986815_                             	|
|              	|                                              	| (128, 2)                                	| 0.006377697                            	| 0.006473064   	| _0.985267_                             	|
|              	|                                              	| (128, 4)                                	| 0.00754118                             	| 0.007488728   	| 1.007004                               	|
|              	|                                              	| (128, 8)                                	| 0.009772778                            	| 0.009725094   	| 1.004903                               	|
| 256          	| 128                                          	| (256 , 1)                               	| 0.007708073                            	| 0.007715225   	| _0.999073_                             	|
|              	|                                              	| (256, 2)                                	| 0.008938313                            	| 0.009071827   	| _0.985283_                             	|
|              	|                                              	| (256, 4)                                	| 0.011227131                            	| 0.011045933   	| 1.016404                               	|
|              	|                                              	| (256, 8)                                	| 0.016131401                            	| 0.016396046   	| _0.983859_                             	|
| 512          	| 64                                           	| (512, 1)                                	| 0.011544228                            	| 0.011487007   	| 1.004981                               	|
|              	|                                              	| (512, 2)                                	| 0.014071465                            	| 0.014281273   	| _0.985309_                             	|
|              	|                                              	| (512, 4)                                	| 0.019016266                            	| 0.018930435   	| 1.004534                               	|
|              	|                                              	| (512, 8)                                	| 0.028913021                            	| 0.028159618   	| 1.026755                               	|

A bolded speedup ratio indicates a speedup greater than 5%, which we treat as significant. Especially for smaller `dim_size` (1, 2, 4, 8, 16, 32), we observe significant speedups (greater than 5% better, **bolded**): the smaller the `dim_size`, the larger `BLOCK_SIZE / dim_size / sizeof(scalar_t)` becomes, and hence the larger the unnecessary allocation.

For larger `dim_size` (64, 128, 256, 512), we also observe insignificantly better (less than 5% better, unbolded) performance.
For some shapes such as {128, 1}, we also observe insignificantly worse (less than 5% worse, _italicized_) performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117029
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-10 15:04:34 +00:00
e0da05e1ba [codemod] markDynamoStrictTest dynamo/* (#117077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117077
Approved by: https://github.com/bdhirsh
ghstack dependencies: #117076
2024-01-10 14:37:52 +00:00
04f788f925 Unflake test_auto_functionalize (#117076)
Better cleanup of the custom op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117076
Approved by: https://github.com/bdhirsh
2024-01-10 14:37:52 +00:00
5046b4981d [ROCm] Add opt-in option for inductor's layout optimisation on ROCm (#116329)
Disabling layout optimisation in inductor for ROCm (https://github.com/pytorch/pytorch/pull/111474) was a bit shortsighted.

If there are workloads that heavily use NHWC we will see a perf drop from additional transpose ops. Instead of disabling this entirely on ROCm this is now an opt-in feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116329
Approved by: https://github.com/jansel, https://github.com/eellison
2024-01-10 13:56:27 +00:00
94db6578cc [Quant] Add dynamic quantization config for x86 inductor backend (#115337)
**Description**
Add dynamic quantization config for x86 inductor backend.
To support the QKV structure in self-attention, we removed an assertion in port-metadata-pass that required a single dequantize node after a quantize node.

**Test plan**
```
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_dynamic_quant_linear
python test/test_quantization.py -k TestQuantizePT2EX86Inductor.test_qat_dynamic_quant_linear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115337
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-01-10 11:33:37 +00:00
558cc69641 Fix torch function kwarg dispatch (#117083)
Previously, kwargs were incorrectly dispatched by passing them as the true kwargs to the torch function call. To fix this, the kwargs of the original torch op need to be stored in a dictionary and passed as an argument to the torch function implementation.
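
A hedged sketch of the protocol this fix targets: in `__torch_function__`, the original op's kwargs arrive as a dict argument rather than as real keyword arguments.

```python
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}  # kwargs of the intercepted call, packed into a dict
        print(f"dispatching {getattr(func, '__name__', func)} with kwargs={kwargs}")
        return super().__torch_function__(func, types, args, kwargs)

t = torch.randn(3).as_subclass(LoggingTensor)
torch.add(t, 1, alpha=2)  # prints: dispatching add with kwargs={'alpha': 2}
```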

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117083
Approved by: https://github.com/drisspg
2024-01-10 10:55:10 +00:00
e88d0648ed Revert "[export] Error grad mode op in export API (#116339)"
This reverts commit 943179852102ac0be27aeae5a2c0272e25ccf90e.

Reverted https://github.com/pytorch/pytorch/pull/116339 on behalf of https://github.com/tugsbayasgalan due to PR below this in the stack broke torchrec/sigmoid tests ([comment](https://github.com/pytorch/pytorch/pull/116339#issuecomment-1884599027))
2024-01-10 10:42:33 +00:00
77ecb3d725 Revert "[export] Exempt autograd ops for predispatch export (#116527)"
This reverts commit af2ded23eb398e14cf380b39d46bfa786d26b3ee.

Reverted https://github.com/pytorch/pytorch/pull/116527 on behalf of https://github.com/tugsbayasgalan due to Need to revert this to revert the bottom diff ([comment](https://github.com/pytorch/pytorch/pull/116527#issuecomment-1884592658))
2024-01-10 10:38:27 +00:00
20f394f10a [LLVM/TensorExpr] Update for an API change in LLVM 18. (#117086)
`registerPassBuilderCallbacks` now takes an extra bool argument to print extra information. It is currently set to false so as not to change functional behaviour.

Relevant LLVM commit:
ffb1f20e0d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117086
Approved by: https://github.com/bertmaher
2024-01-10 09:08:42 +00:00
cyy
20f769544c [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
This PR follows #116751.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2024-01-10 08:48:14 +00:00
90df7c008a Migrate state_dict bc test to OptimizerInfo, increase coverage (#116500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116500
Approved by: https://github.com/albanD
2024-01-10 08:19:27 +00:00
19e93b85b9 Fixes last_dim stride check for singleton dimensions (#117001)
Fixes #116333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117001
Approved by: https://github.com/cpuhrsch
2024-01-10 04:46:49 +00:00
8bcdde5058 Support uint{16,32,64} deterministic empty fill and scalar Python binding handling (#116807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807
Approved by: https://github.com/albanD
ghstack dependencies: #116805, #116806
2024-01-10 02:17:23 +00:00
43a23a704a Support uint{16,32,64} copy (#116806)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806
Approved by: https://github.com/albanD
ghstack dependencies: #116805
2024-01-10 02:17:23 +00:00
2e983fcfd3 Support unsigned int for randint, item, equality, fill, iinfo, tensor (#116805)
These are some basic utilities that are often used for testing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805
Approved by: https://github.com/albanD
2024-01-10 02:17:23 +00:00
4a10e9eed4 update build guide to use mkl-static. (#116946)
# Background:
We found that the current build guide uses MKL dynamic linking, which may trigger an MKL link issue.

Details:
In the build environment, libtorch_cpu.so dynamically links to the system MKL binaries by default.
If users install another version of the MKL library, this may lead to MKL symbol conflicts.

I also checked that the released PyTorch binaries use static MKL linking. The build script shows it: https://github.com/pytorch/builder/blob/main/common/install_mkl.sh#L10

# Solution:
Update the build guide to use MKL static linking, aligning it with the build script.

Conda install command docs:
https://anaconda.org/intel/mkl-static
https://anaconda.org/intel/mkl-include

# Validation
No MKL library dependencies remain after using `conda install intel::mkl-static intel::mkl-include`.
## Windows
![image](https://github.com/pytorch/pytorch/assets/8433590/cc554ded-d827-4de5-81c6-cc3039155580)

## Linux
<img width="959" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/79766ad8-4ba2-4ff1-adc9-63affd8d419a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116946
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-01-10 01:35:02 +00:00
b4f1ab4505 Docs: fix docstring errors in ddp_comm_hooks (#116866)
Reopens #115272
Fixes ddp_comm_hooks errors in https://github.com/pytorch/pytorch/issues/112644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116866
Approved by: https://github.com/awgu
2024-01-10 01:24:06 +00:00
16d69290c6 Use view name instead of view_copy name for functional inverses (#117056)
Ex: `unsqueeze_copy_inverse()` -> `unsqueeze_inverse()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117056
Approved by: https://github.com/bdhirsh
2024-01-10 00:52:36 +00:00
fdfdba7c13 [BE] Use __builtin_overflow_sub when available (#117015)
Which is faster than the ternary version.

Following script
```python
import torch
from timeit import default_timer

global_setup = """
"""
setup = """
c10::SymInt a = c10::SymInt(123);
"""
code = """
-a;
"""

from torch.utils.benchmark import Timer

t = Timer(stmt=code, setup=setup, global_setup=global_setup, language="c++", timer=default_timer)

print(t.blocked_autorange())
```

reports a 4.17 ns median time before and 3.61 ns after on x86_64 Linux, and 2.02 ns before and 1.91 ns after on Apple M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117015
Approved by: https://github.com/albanD
2024-01-10 00:50:09 +00:00
a6325ad86c Fix cuInit test on Windows (#117055)
By changing library name from `libcuda.so.1` to `nvcuda.dll` on Windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117055
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/atalman
2024-01-10 00:45:18 +00:00
907e80239d Fix broken lint after #117052 (#117080)
https://hud.pytorch.org/pr/pytorch/pytorch/117052#20318344490 breaks lint, forward fixing with `lintrunner -a`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117080
Approved by: https://github.com/atalman, https://github.com/clee2000, https://github.com/Skylion007
2024-01-10 00:44:19 +00:00
d9fc438083 [cpu][vec512][double] unsigned left shift for mask (#117021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117021
Approved by: https://github.com/leslie-fang-intel
2024-01-10 00:32:15 +00:00
0b72ce1bd1 Add at::sparse::full_coo_indices utility function. (#116352)
As in the title.

`full_coo_indices(shape)` should be used instead of `ones(shape).nonzero().T` as `full_coo_indices` is exponentially more efficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116352
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #116206
2024-01-10 00:07:09 +00:00
152bde6e27 [MPS][BE] Move kernel_index_offset to HistogramKernel (#117037)
As it has almost nothing in common with the rest of the indexing primitives other than the name.
Also, use `mtl_dispatch1DJob` to dispatch the work and check that the tensor
size is less than 4GB, as this function would not work with larger
tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117037
Approved by: https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940, #116942
2024-01-10 00:02:14 +00:00
8918ce4087 Add TORCH_LOGS_OUT to direct TORCH_LOGS output (#117005)
Twice now, while debugging accuracy bugs, I got dynamo logs that are 100k lines long, and it is impossible to read them on the terminal. Let's add an option to write them to a file.
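
A hedged usage sketch (env-var names from the title; setting them in Python before importing torch is shown only for illustration):

```python
# equivalent to `TORCH_LOGS=dynamo TORCH_LOGS_OUT=/tmp/dynamo_logs.txt python repro.py`
import os

os.environ["TORCH_LOGS"] = "dynamo"
os.environ["TORCH_LOGS_OUT"] = "/tmp/dynamo_logs.txt"
import torch  # noqa: E402
```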

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117005
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #116894
2024-01-09 23:46:22 +00:00
b6028acfa4 Add _assert_scalar and teach Inductor to codegen it (#114148)
Inductor codegen for `_assert_async` is currently disabled because we don't really understand how to codegen `scalar_to_tensor` on a Sympy expression. I initially tried to see if I could get this to work, but I got into some weird problem involving stride sorting, so I decided to fix it properly by not going through a tensor.

So we introduce an `_assert_scalar` which takes a scalar as an argument, avoiding needing to turn a SymBool into a tensor before asserting on it. I also add `_functional_assert_scalar` for good luck, although this doesn't do anything right now because https://github.com/pytorch/pytorch/pull/104203 still hasn't been landed.

I need to customize the codegen for this operator, so I decide to directly implement it in Inductor, rather than trying to treat it as a generic ExternKernel. This leads to the new AssertScalar IR node. This is written carefully so that it doesn't get DCE'd by Inductor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114148
Approved by: https://github.com/jansel
2024-01-09 23:21:26 +00:00
d2033a0639 [quant][pt2e][xnnpack_quantizer] add support for linear_relu (#117052)
Add support for linear_relu annotation in XNNPACKQuantizer; this allows the input to linear and the output of relu to share the same quantization parameters.

Differential Revision: [D52574086](https://our.internmc.facebook.com/intern/diff/D52574086/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117052
Approved by: https://github.com/jerryzh168, https://github.com/digantdesai
2024-01-09 23:19:52 +00:00
4f3d698cac Impl. call_hasattr for BaseUserFunctionVariable (#116049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116049
Approved by: https://github.com/zou3519
2024-01-09 22:58:58 +00:00
8a6c43fbe5 add predispatch_pass to hold pass functions to be run when config.is_predispatch is true (#116788)
Summary:
config.is_predispatch is a flag that instructs inductor to enable predispatch
tracing (high-level pre-dispatch IR). Currently, there is no dedicated pass
for this flag.

In this commit, for better pass function management, we created
`predispatch_pass` to hold the pass functions to be run on the high level
pre-dispatch IR-based graphs.

Test Plan: CI

Differential Revision: D52491332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116788
Approved by: https://github.com/frank-wei
2024-01-09 22:42:24 +00:00
39ae4d8cd7 Revert "[inductor] Add support for tl.make_block_ptr (#116079)"
This reverts commit d527df707acce59bd432763c94399aa7b3fe38cf.

Reverted https://github.com/pytorch/pytorch/pull/116079 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/116079#issuecomment-1883890254))
2024-01-09 22:19:57 +00:00
848cfe8d45 [reland] unflatten_tensor on compute stream for DTensorExtension (#117020)
Reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted internally.

The underlying reason for the revert is that torch._dynamo.disable can't be used inside the
pytorch codebase, as it conflicts with torch.deploy: although the latter
only runs some inference, it somehow takes a weird dependency on FSDP.

We have seen this issue with our functional collectives: we can't
use any dynamo components, otherwise torch.deploy would complain.

Verified internally that after removing torch._dynamo.disable the test
passed again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020
Approved by: https://github.com/awgu
2024-01-09 21:25:15 +00:00
1dd4813328 [BE][dynamo]: Add operator is and is not tests to dynamo tests (#116397)
Adds tests for an operator that was not unit tested in our test suite - improves coverage. Inspired by looking into https://github.com/pytorch/pytorch/pull/116397 after @XuehaiPan brought up some issues with builtins in #116389

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116397
Approved by: https://github.com/albanD, https://github.com/jansel
2024-01-09 21:13:22 +00:00
5866284d4a Make not passing use_reentrant back to warning instead of erroring and clarify docs (#116710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116710
Approved by: https://github.com/albanD
ghstack dependencies: #116523
2024-01-09 20:58:49 +00:00
4e666ba011 Update torch.autograd.graph logging to not print out grad_output (#116523)
Instead of printing the tensor's data print the dtype and shape metadata of the tensor.
```
Executing: <VarMeanBackward0 object at 0x1352d0e20> with grad_outputs: [None,f32[]]
```
This is important in order to avoid doing a cuda sync and also useful to reduce verbosity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116523
Approved by: https://github.com/albanD
2024-01-09 20:40:02 +00:00
29ae4f22bf Enables private_use_one lazy_init by PrivateUse1HooksInterface (#115067)
Fixes https://github.com/pytorch/pytorch/issues/112369

In my last PR (https://github.com/pytorch/pytorch/pull/113343), I wanted to implement lazy_init for other devices through `REGISTER_LAZY_INIT`. But this might be too big of a change.

Recently, my team found that `torch.load` without `lazy_init` also results in the same error.
bbd5b935e4/torch/csrc/Storage.cpp (L319-L321)
bbd5b935e4/torch/csrc/Storage.cpp (L334-L335)

So, I want to use `PrivateUse1HooksInterface` to implement lazy_init for `PrivateUse1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115067
Approved by: https://github.com/ezyang
2024-01-09 20:12:08 +00:00
ab1ac43752 [pytree] extend pytree operations with is_leaf prediction function (#116419)
Add an extra `is_leaf` predicate function to pytree operations.
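
A hedged usage sketch, assuming the new keyword is exposed on the flatten entry point as described:

```python
import torch.utils._pytree as pytree

tree = {"a": [1, 2], "b": (3, 4)}
# treat tuples as leaves instead of recursing into them
leaves, spec = pytree.tree_flatten(tree, is_leaf=lambda x: isinstance(x, tuple))
print(leaves)  # [1, 2, (3, 4)]
```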

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116419
Approved by: https://github.com/zou3519
2024-01-09 19:50:08 +00:00
suo
902807a86d enable pytree tests in fbcode (#116787)
these were not runnable before

Differential Revision: [D52547846](https://our.internmc.facebook.com/intern/diff/D52547846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116787
Approved by: https://github.com/zou3519
2024-01-09 19:12:43 +00:00
b4eb97a072 Revert "[C10D] Add GIL checker to NCCL watchdog monitor (#116798)"
This reverts commit 830ace33bcc0291e5c615ad1727799b1d04067cd.

Reverted https://github.com/pytorch/pytorch/pull/116798 on behalf of https://github.com/osalpekar due to This seems to crash torchrec inference unittests: [D52583939](https://www.internalfb.com/diff/D52583939) ([comment](https://github.com/pytorch/pytorch/pull/116798#issuecomment-1883624022))
2024-01-09 19:09:02 +00:00
b8374314cc [AOTI] Update AOTI runner util (#116971)
Summary: Update the runner used in integration tests after https://github.com/pytorch/torchrec/pull/1604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116971
Approved by: https://github.com/chenyang78
2024-01-09 19:07:54 +00:00
d527df707a [inductor] Add support for tl.make_block_ptr (#116079)
On A100 this is a small regression:
![image](https://github.com/pytorch/pytorch/assets/533820/b30eee9d-c0fe-4123-99da-d554fc5d0171)

So I will leave it disabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116079
Approved by: https://github.com/shunting314
ghstack dependencies: #116078
2024-01-09 19:06:51 +00:00
94363cee41 [inductor] Indexing refactors (#116078)
Perf differences seems to be noise:
![image](https://github.com/pytorch/pytorch/assets/533820/d7a36574-0388-46e4-bd4d-b274d37cab2b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116078
Approved by: https://github.com/aakhundov
2024-01-09 19:06:51 +00:00
84b04e42a1 [ROCm] Enable aot_inductor tests (#116713)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116713
Approved by: https://github.com/jithunnair-amd, https://github.com/desertfire
2024-01-09 19:05:44 +00:00
ad22bd2fa1 [export][refactor][6/n] Remove equality_constraints (#116979)
Through the new dynamic_shapes API and using torch.export.Dim, dimensions that are equal will now be represented by the same symbol, so we no longer need to store `equality_constraints`.
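
A small sketch of the `dynamic_shapes` usage described above: reusing a single `Dim` object marks two dimensions as equal, so no separate equality constraint is needed.

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

batch = Dim("batch")  # one symbol shared by both inputs' first dimension
ep = export(M(), (torch.randn(4, 3), torch.randn(4, 3)),
            dynamic_shapes={"x": {0: batch}, "y": {0: batch}})
```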

Differential Revision: D52351705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116979
Approved by: https://github.com/avikchaudhuri
2024-01-09 19:04:47 +00:00
bdeaaad70c [CPU] _vec_log_softmax_lastdim: fix CHUNK_SIZE to avoid unnecessarily large allocation (#116990)
Given input shape of `[outer_size, dim_size]`, `_vec_log_softmax_lastdim` sets `CHUNK_SIZE` as
```cpp
int64_t CHUNK_SIZE = std::max<int64_t>(
      1,
      at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size));
```
where `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` computes the maximum number of rows that can fit into L1d cache size `(GRAIN_SIZE)`.

Set `CHUNK_SIZE` to the minimum of `CHUNK_SIZE` and `outer_size` to avoid an unnecessarily large `CHUNK_SIZE` and unnecessarily large allocations for the `max` and `tmp_sum` buffers.
```cpp
auto tmp_sum_scalar = std::make_unique<scalar_t[]>(CHUNK_SIZE);
auto max_input_arr = std::make_unique<scalar_t[]>(CHUNK_SIZE);
```

### Performance

Perf data collected for `dim_size` in range [2^0, 2^9] and `outer_size` in range [2^0, 2^3]. To measure the benefit from avoiding unnecessarily large allocation, values of `outer_size` were chosen such that `outer_size` is less than `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` for all values of `dim_size`.

Tested on 28 physical cores/socket, 1 socket on Skylake.

| **dim_size** 	| **at::internal::GRAIN_SIZE / (sizeof(scalar_t)   * dim_size)** 	| **input shape: (outer_size, dim_size)** 	| **Baseline (original implementation)** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
|--------------	|----------------------------------------------------------------	|-----------------------------------------	|----------------------------------------	|---------------	|----------------------------------------	|
| 1            	| 8192                                                           	| (1, 1)                                  	| 0.006070137                            	| 0.003378391   	| **1.796754**                           	|
|              	|                                                                	| (2, 1)                                  	| 0.006327629                            	| 0.00361681    	| **1.749506**                           	|
|              	|                                                                	| (4, 1)                                  	| 0.006246567                            	| 0.00379324    	| **1.646763**                           	|
|              	|                                                                	| (8, 1)                                  	| 0.006320477                            	| 0.003941059   	| **1.603751**                           	|
| 2            	| 4096                                                           	| (1, 2)                                  	| 0.004889965                            	| 0.003342628   	| **1.46291**                            	|
|              	|                                                                	| (2, 2)                                  	| 0.005021095                            	| 0.003380775   	| **1.48519**                            	|
|              	|                                                                	| (4, 2)                                  	| 0.004897118                            	| 0.003535748   	| **1.38503**                            	|
|              	|                                                                	| (8, 2)                                  	| 0.005195141                            	| 0.003790855   	| **1.37044**                            	|
| 4            	| 2048                                                           	| (1, 4)                                  	| 0.004477501                            	| 0.003364086   	| **1.330971**                           	|
|              	|                                                                	| (2, 4)                                  	| 0.004198551                            	| 0.003452301   	| **1.21616**                            	|
|              	|                                                                	| (4, 4)                                  	| 0.004312992                            	| 0.003650188   	| **1.181581**                           	|
|              	|                                                                	| (8, 4)                                  	| 0.004432201                            	| 0.00399828    	| **1.108527**                           	|
| 8            	| 1024                                                           	| (1, 8)                                  	| 0.004155636                            	| 0.0035429     	| **1.172948**                           	|
|              	|                                                                	| (2, 8)                                  	| 0.003905296                            	| 0.003569126   	| **1.094188**                           	|
|              	|                                                                	| (4, 8)                                  	| 0.004405975                            	| 0.003864765   	| **1.140037**                           	|
|              	|                                                                	| (8, 8)                                  	| 0.004785061                            	| 0.004456043   	| **1.073836**                           	|
| 16           	| 512                                                            	| (1, 16)                                 	| 0.003867149                            	| 0.003504753   	| **1.103401**                           	|
|              	|                                                                	| (2, 16)                                 	| 0.003743172                            	| 0.003340244   	| **1.120628**                           	|
|              	|                                                                	| (4, 16)                                 	| 0.003614426                            	| 0.003519058   	| 1.0271                                 	|
|              	|                                                                	| (8, 16)                                 	| 0.00395298                             	| 0.003488064   	| **1.133288**                           	|
| 32           	| 256                                                            	| (1, 32)                                 	| 0.003900528                            	| 0.003421307   	| **1.14007**                            	|
|              	|                                                                	| (2, 32)                                 	| 0.003569126                            	| 0.003511906   	| 1.016293                               	|
|              	|                                                                	| (4, 32)                                 	| 0.003736019                            	| 0.003590584   	| 1.040505                               	|
|              	|                                                                	| (8, 32)                                 	| 0.003845692                            	| 0.003662109   	| **1.05013**                            	|
| 64           	| 128                                                            	| (1, 64)                                 	| 0.003652573                            	| 0.003437996   	| **1.062413**                           	|
|              	|                                                                	| (2, 64)                                 	| 0.003700256                            	| 0.003516674   	| **1.052203**                           	|
|              	|                                                                	| (4, 64)                                 	| 0.003783703                            	| 0.003638268   	| 1.039974                               	|
|              	|                                                                	| (8, 64)                                 	| 0.003993511                            	| 0.003809929   	| 1.048185                               	|
| 128          	| 64                                                             	| (1, 128)                                	| 0.003848076                            	| 0.003600121   	| **1.068874**                           	|
|              	|                                                                	| (2, 128)                                	| 0.003979206                            	| 0.003826618   	| 1.039875                               	|
|              	|                                                                	| (4, 128)                                	| 0.004360676                            	| 0.004224777   	| 1.032167                               	|
|              	|                                                                	| (8, 128)                                	| 0.005149841                            	| 0.004999638   	| 1.030043                               	|
| 256          	| 32                                                             	| (1, 256)                                	| 0.003943443                            	| 0.003738403   	| **1.054847**                           	|
|              	|                                                                	| (2, 256)                                	| 0.00420332                             	| 0.00408411    	| 1.029189                               	|
|              	|                                                                	| (4, 256)                                	| 0.004820824                            	| 0.00474453    	| 1.01608                                	|
|              	|                                                                	| (8, 256)                                	| 0.006194115                            	| 0.006067753   	| 1.020825                               	|
| 512          	| 16                                                             	| (1, 512)                                	| 0.004277229                            	| 0.004253387   	| 1.005605                               	|
|              	|                                                                	| (2, 512)                                	| 0.004863739                            	| 0.004782677   	| 1.016949                               	|
|              	|                                                                	| (4, 512)                                	| 0.006172657                            	| 0.00607729    	| 1.015692                               	|
|              	|                                                                	| (8, 512)                                	| 0.011193752                            	| 0.010819435   	| 1.034597                               	|

Bolded speedup ratios indicate a speedup greater than 5%, which we treat as significant. We observe significant speedups especially for smaller `dim_size` (1, 2, 4, 8, 16): the smaller the `dim_size`, the larger `at::internal::GRAIN_SIZE / (sizeof(scalar_t) * dim_size)` becomes, and hence the larger the unnecessary allocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116990
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-09 18:43:02 +00:00
75968e2f94 Optimize operator (#117017)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117017
Approved by: https://github.com/Skylion007
2024-01-09 18:37:22 +00:00
0dd5deeced Bring docstring to .pyi file (#114705)
Fixes #37762

Since the original issue hasn't been making progress for more than 3 years, I am attempting to make this PR to at least make some progress forward.

This PR attempts to add docstrings to the `.pyi` files. The docstrings are read from [`_torch_docs`](https://github.com/pytorch/pytorch/blob/main/torch/_torch_docs.py) by mocking [`_add_docstr`](9f073ae304/torch/csrc/Module.cpp (L329)), which is the only function used to add docstrings.

Luckily, `_torch_docs` has no dependencies on other components of PyTorch, and can be imported with `_add_docstr` mocked, without compiling `torch._C`.

The generated `.pyi` file looks something like the following:

[_VariableFunctions.pyi.txt](https://github.com/pytorch/pytorch/files/13494263/_VariableFunctions.pyi.txt)

<img width="787" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/73c2e884-f06b-4529-8301-0ca0b9de173c">

And the docstring can be picked up by VSCode:

<img width="839" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/1999dc89-a591-4c7a-80ac-aa3456672af4">

<img width="908" alt="image" src="https://github.com/pytorch/pytorch/assets/6421097/ecf3fa92-9822-4a3d-9263-d224d87ac288">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114705
Approved by: https://github.com/albanD
2024-01-09 18:37:16 +00:00
cfd0728b24 Feature: cudnn convolution out (#116759)
Fixes #115611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116759
Approved by: https://github.com/albanD
2024-01-09 17:51:29 +00:00
0ef1266bc6 [BE] Fix CUDA build warnings (#117023)
After https://github.com/pytorch/pytorch/pull/116595/files compiling every .cu file results in
```
/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=float, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=float, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=double, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=double, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<float>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<float>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=int64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=int64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h

/home/nshulga/git/pytorch/pytorch/c10/util/Half.h(450): warning #173-D: floating-point value does not fit in required integral type
           -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max()));
                                                   ^
          detected during:
            instantiation of "std::enable_if_t<<expression>, __nv_bool> c10::overflows<To,From>(From, __nv_bool) [with To=c10::complex<double>, From=uint64_t]" at line 159 of /home/nshulga/git/pytorch/pytorch/c10/util/TypeCast.h
            instantiation of "To c10::checked_convert<To,From>(From, const char *) [with To=c10::complex<double>, From=uint64_t]" at line 122 of /home/nshulga/git/pytorch/pytorch/c10/core/Scalar.h
```
Fix it by using `if constexpr` to avoid calling `static_cast<uint64_t>` for any floating-point type.
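
As a hedged illustration of the pattern (not the actual c10 code), the idea is to make the integral cast a discarded branch whenever the destination type is floating point:

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Sketch: for a negative source value f, decide whether it overflows To.
// When To is floating point, any int64_t magnitude is representable, so the
// uint64_t casts (and the warning they triggered) are never instantiated.
template <typename To, typename From>
bool negative_value_overflows_sketch(From f) {
  static_assert(std::is_integral_v<From>, "sketch assumes an integral source");
  using limit = std::numeric_limits<To>;
  if constexpr (std::is_floating_point_v<To>) {
    (void)f;
    return false;
  } else {
    // Integral destination: compare magnitudes as uint64_t, as before.
    return -static_cast<uint64_t>(f) > static_cast<uint64_t>(limit::max());
  }
}
```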

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117023
Approved by: https://github.com/albanD
2024-01-09 17:40:10 +00:00
b6962208b8 [CI] Add initial ci test workflow for XPU based on IDC runners (#116554)
Add initial CI test for XPU based on IDC self-hosted runners with label `linux.idc.xpu`, which will be triggered by label `ciflow/xpu` for current stage.

Works for RFC https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116554
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-01-09 17:00:35 +00:00
6784030df4 [MPS] Add support for 64-bit index operations (#116942)
But enable it only if `iter.can_use_32bit_indexing()` is False. Add a test for index_select, but enable it only on Sonoma, as all attempts to create a 4GB+ tensor on Ventura and older fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116942
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904, #116915, #116940
2024-01-09 16:56:49 +00:00
81b7a09d27 [CI] Test that cuInit is not called during import (#117010)
By making a driver API call in subprocess and expecting it to return `CUDA_ERROR_NOT_INITIALIZED`

Test Plan: run it on nighties before https://github.com/pytorch/pytorch/pull/116201 got reverted and observe the failure

This is very important for lots of distributed launchers

Fixes https://github.com/pytorch/pytorch/issues/116276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117010
Approved by: https://github.com/albanD
2024-01-09 14:44:22 +00:00
db79ceb110 [ROCm] Enabling additional UTs on ROCm (#115738)
Unskips mostly for dynamo/inductor UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115738
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-01-09 08:36:07 +00:00
f0bbc2fcf5 [AOTInductor] Small refactor so both Meta internal and OSS can deal with misplaced args and kwargs for Extern Fallback kernels (#116779)
Summary:
In torch/_inductor/lowering.py (https://fburl.com/code/jd58vxpw), we are using
```
fallback_cumsum(x, dim=axis, dtype=dtype)
```
so this will treat `x` as a positional arg, and `dim` and `dtype` as kwargs (see https://fburl.com/code/cikchxp9).

The issue has been fixed by D52530506 for OSS but not for Meta internal. This diff addresses the Meta-internal issue with some refactoring so that both Meta internal and OSS can use the same helper function. The diff also adds some debug logging.

Test Plan:
before
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{torch.int64}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
after
```
aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, std::vector<int64_t>{0}.data(), 2, std::vector<AtenTensorHandle>{buf702, buf708}.data());
```
so `torch.int64` changed to `0`

Differential Revision: D52532031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116779
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-01-09 07:57:46 +00:00
6e2f879d7f [ROCm] hipify mapping for cudaDevAttrMaxSharedMemoryPerBlockOptin (#116984)
Summary: Map `cudaDevAttrMaxSharedMemoryPerBlockOptin` to `hipDeviceAttributeMaxSharedMemoryPerBlock` to make it work for AMD GPUs.

Test Plan: CI

Differential Revision: D52558076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116984
Approved by: https://github.com/jeffdaily
2024-01-09 07:38:20 +00:00
d78776e2e6 Stop unconditionally applying hermetic mode (#116996)
When originally authored, it was not necessary to unconditionally apply
hermetic mode, but I chose to apply it in eager mode to help catch bugs.
Well, multipy is kind of dead, and hermetic mode is causing real
implementation problems for people who want to do fancy Python stuff
from the dispatcher.  So let's yank this mode for now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116996
Approved by: https://github.com/jansel
2024-01-09 05:55:08 +00:00
6cf1fc66e3 [cuda][easy] cosmetic and small syntax changes to layer_norm_kernel.cu (#116920)
Used `auto` and `const` where needed; replaced a CUDA specific `__syncwarp` with device agnostic `WARP_SYNC`; added more comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116920
Approved by: https://github.com/malfet
2024-01-09 04:44:57 +00:00
104a23e4f5 [cpu][vec512] improve int load/store/with mask (#116964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116964
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961, #116962, #116963
2024-01-09 04:37:44 +00:00
4e54a70451 [cpu][vec512] improve double load/store with mask (#116963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116963
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961, #116962
2024-01-09 04:37:44 +00:00
428807f9bc [cpu][vec512] improve fp32 load/store with mask (#116962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116962
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #116961
2024-01-09 04:32:22 +00:00
a0bd7dfec1 [cpu][vec512] improve bf16/fp16 load/store with mask for inductor (#116961)
Improve the performance of vec512 bfloat16 (and also float16) load and store with partial vector lanes by using masked load/store instead of going through `memcpy` with an auxiliary buffer. In the inductor CPU backend, we load/store half of the vector lanes (16) for bfloat16 and float16.

Using the following micro-benchmark script for `layernorm + add`:
```python
import torch
import torch.nn as nn
from benchmark_helper import time_with_torch_timer

class AddLayernorm(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.ln(hidden_states)

hidden_states = torch.randn(1, 512, 1024).to(torch.bfloat16)

with torch.no_grad():
    compiled_add_ln = torch.compile(add_ln)
    print(time_with_torch_timer(compiled_add_ln, hidden_states, iters=10000))
```

Measured on single-core `Intel(R) Xeon(R) Platinum 8358 CPU`.
Before: 1.39 ms
After: 498.66 us
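
For reference, a rough sketch of what a masked partial-lane load/store of 16-bit elements looks like with AVX-512BW intrinsics (the intrinsic names are from Intel's intrinsics guide; the helper functions are illustrative, not the actual `at::vec::Vectorized` code):

```cpp
#include <immintrin.h>
#include <cstdint>

// Load/store only the first `count` (<= 32) 16-bit lanes of a 512-bit vector,
// never touching memory past the end of the buffer. Requires AVX512BW.
inline __m512i masked_load_16bit(const uint16_t* src, int count) {
  __mmask32 mask = (count >= 32) ? 0xFFFFFFFFu : ((1u << count) - 1);
  return _mm512_maskz_loadu_epi16(mask, src);  // unselected lanes are zeroed
}

inline void masked_store_16bit(uint16_t* dst, __m512i v, int count) {
  __mmask32 mask = (count >= 32) ? 0xFFFFFFFFu : ((1u << count) - 1);
  _mm512_mask_storeu_epi16(dst, mask, v);      // only selected lanes are written
}
```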

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116961
Approved by: https://github.com/sanchitintel, https://github.com/leslie-fang-intel
2024-01-09 04:18:33 +00:00
bac0de160c [ROCm] Add minimal inductor test to rocm-test workflow (#115425)
Adds the `inductor/test_torchinductor` to tests-to-include so we can have some PR-level test coverage for inductor tests on ROCm. This should help catch issues before merging (e.g. https://github.com/pytorch/pytorch/pull/114772)

This unit test takes ~6minutes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115425
Approved by: https://github.com/jithunnair-amd, https://github.com/huydhn, https://github.com/malfet
2024-01-09 03:54:25 +00:00
4c0d63180a Support NNModules as dict keys (#116723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116723
Approved by: https://github.com/lezcano
2024-01-09 03:32:47 +00:00
92cf7ba36b [vision hash update] update the pinned vision hash (#117002)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117002
Approved by: https://github.com/pytorchbot
2024-01-09 03:21:43 +00:00
ff0a3f35a4 [audio hash update] update the pinned audio hash (#116954)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116954
Approved by: https://github.com/pytorchbot
2024-01-09 03:16:00 +00:00
14be2ee271 Inductor qlinear int8_bf16 with bmm (#116604)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/116492: `linear` will be decomposed into `bmm` when the input dim exceeds 2 and the input is not contiguous. Fix this issue by converting the pattern back into `qlinear`. This PR focuses on the int8_bf16 case, following https://github.com/pytorch/pytorch/pull/116599.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_int8_mixed_bf16_input_dim_exceeds_2_and_not_contiguous
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116604
Approved by: https://github.com/jgong5
ghstack dependencies: #116937, #116599
2024-01-09 01:36:27 +00:00
153b3a0996 Inductor qlinear int8_fp32 with bmm (#116599)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/116492: `linear` will be decomposed into `bmm` when the input dim exceeds 2 and the input is not contiguous. Fix this issue by converting the pattern back into `qlinear`. This PR focuses on the int8_fp32 case; the int8_bf16 case will follow in the next PR.

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qlinear_input_dim_exceeds_2_and_not_contiguous
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116599
Approved by: https://github.com/jgong5
ghstack dependencies: #116937
2024-01-09 01:33:46 +00:00
6ca31ae1d3 [CI] Add inductor workflow for rocm (#110544)
This PR is to create a separate CI job for inductor UTs on ROCm. You will need to add `ciflow/inductor` tag on PRs to trigger this job. However, the job will run on its own on any commit merged in main. This job takes around 1.5 hours to run and it is run in parallel to other rocm jobs. It is run only on the MI210 CI runners to ensure maximum inductor functionality is tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110544
Approved by: https://github.com/jithunnair-amd, https://github.com/jansel, https://github.com/huydhn
2024-01-09 01:32:15 +00:00
227579d6a0 [Inductor] [Quant] Add remaining user check for qconv binary fusion (#115809)
**Summary**
Similar to https://github.com/pytorch/pytorch/pull/115153, when we do the `qconv_binary` fusion with post-op sum, we also need to ensure that all users of the extra input in this pattern are ancestor nodes of the compute node, except for the binary node connected to the compute node.

This diff also renames some variables:

- Rename `qconv2d_node_after_weight_prepack` to `compute_node`
- Rename `extra_input_node` to `extra_input_of_binary_node`

**Test Plan**
```
python -u -m pytest -s -v test_mkldnn_pattern_matcher.py -k test_qconv2d_add_3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115809
Approved by: https://github.com/jgong5
ghstack dependencies: #115153
2024-01-09 01:26:50 +00:00
33d90cfd16 Allow for [-oo, oo] ranges for bools (#114362)
This fixes a problem in Seamless M4T in fairseq2; repro
instructions are at https://docs.google.com/document/d/1PVy4KibfljirQDoijOwyHCV97B67r_iElWqFh7h1Acc/edit

I tried extracting a minimal repro but I couldn't actually manage it!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114362
Approved by: https://github.com/Skylion007
2024-01-09 01:08:34 +00:00
f26ed0a71d [dynamo] Move graph breaks in for/while->skip after logging (#116981)
We were losing critical graph break info if the graph break came from a for or while loop.

Given:

```
def foo(x, y):
    z = x * y
    for i in range(10):
        z = z * y
        print(z)
    return z

a = torch.randn([2, 2])
b = torch.randn([2, 2])

foo = torch._dynamo.optimize('eager')(foo)

foo(a, b)
```

Before:

```
$ TORCH_LOGS=+graph_breaks python x.py
tensor([[-0.1046, -0.1597],
        [-0.0006, -0.1327]])
tensor([[-4.2091e-02,  6.3045e-02],
        [-1.6759e-05,  4.0366e-02]])
tensor([[-1.6929e-02, -2.4892e-02],
        [-4.8690e-07, -1.2281e-02]])
tensor([[-6.8091e-03,  9.8278e-03],
        [-1.4146e-08,  3.7363e-03]])
tensor([[-2.7387e-03, -3.8803e-03],
        [-4.1097e-10, -1.1367e-03]])
tensor([[-1.1015e-03,  1.5320e-03],
        [-1.1940e-11,  3.4584e-04]])
tensor([[-4.4304e-04, -6.0488e-04],
        [-3.4688e-13, -1.0522e-04]])
tensor([[-1.7820e-04,  2.3882e-04],
        [-1.0078e-14,  3.2012e-05]])
tensor([[-7.1672e-05, -9.4293e-05],
        [-2.9279e-16, -9.7392e-06]])
tensor([[-2.8827e-05,  3.7229e-05],
        [-8.5063e-18,  2.9630e-06]])
```
After:

```
$ TORCH_LOGS=+graph_breaks python x.py
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG] Graph break: call_function BuiltinVariable(print) [TensorVariable()] {} from user code at:
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]   File "/data/users/voz/pytorch/x.py", line 32, in foo
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]     print(z)
[2024-01-08 11:14:49,372] [0/0] torch._dynamo.symbolic_convert.__graph_breaks: [DEBUG]
tensor([[ 0.2065,  0.0766],
        [-2.0600,  1.8425]])
tensor([[-0.0617, -0.0698],
        [-3.5799,  2.2167]])
tensor([[ 0.0184,  0.0636],
        [-6.2212,  2.6669]])
tensor([[-5.5031e-03, -5.7971e-02],
        [-1.0811e+01,  3.2085e+00]])
tensor([[ 1.6437e-03,  5.2837e-02],
        [-1.8788e+01,  3.8601e+00]])
tensor([[-4.9093e-04, -4.8157e-02],
        [-3.2650e+01,  4.6441e+00]])
tensor([[ 1.4663e-04,  4.3891e-02],
        [-5.6741e+01,  5.5872e+00]])
tensor([[-4.3796e-05, -4.0004e-02],
        [-9.8605e+01,  6.7220e+00]])
tensor([[ 1.3081e-05,  3.6461e-02],
        [-1.7136e+02,  8.0871e+00]])
tensor([[-3.9070e-06, -3.3231e-02],
        [-2.9779e+02,  9.7296e+00]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116981
Approved by: https://github.com/ezyang
2024-01-09 00:39:03 +00:00
e728ebb66d Small docstring fix (#116947)
Fix a small typo in the docstring of checkpoint function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116947
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2024-01-08 23:51:59 +00:00
28e2e12b2a [quant][be] enable xnnpack_quantizer tests to run in internal CI (#116911)
Summary: fixed an import problem for test_xnnpack_quantizer so that it can run in CI

Test Plan:
internal CI
sanity check: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_conv2d (caffe2.test.quantization.pt2e.test_xnnpack_quantizer.TestXNNPACKQuantizer)'

Differential Revision: D52576449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116911
Approved by: https://github.com/mcr229
2024-01-08 23:43:47 +00:00
534c73d478 Fix NaN bug in torch.signal.windows.kaiser (#116470)
Fixes #115595

As an aside, there are currently no tests checking the output of `torch.signal.windows.kaiser` against the output of scipy's implementation, which is what is done with `torch.kaiser_window`. The same goes for the other window functions in `torch.signal.windows`. I did some tests on my end, but I'm not sure what the best practice is, so I haven't included them for now.

@gchanan @mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116470
Approved by: https://github.com/ezyang
2024-01-08 22:24:52 +00:00
d006cae2a8 Update documentation for unsigned int types (#116804)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116804
Approved by: https://github.com/albanD
ghstack dependencies: #116595, #116803
2024-01-08 22:02:10 +00:00
fd0c071969 Add tolist support for unsigned types (#116803)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116803
Approved by: https://github.com/albanD
ghstack dependencies: #116595
2024-01-08 22:02:10 +00:00
f4e35e2c3d Proposed mechanism for handling uint64_t in Scalar (#116595)
Here's the problem: if we support unsigned integer types, and in particular if we support uint64_t, we need a way to represent these integers in Scalar. However, Scalar currently stores all integral values inside int64_t, which is not wide enough to accommodate all possible uint64_t values. So we need to do something to Scalar to support it.

The obvious thing to do is add a uint64_t field to the union, and use it in some situations. But when should we use it? The proposal is that we use it if and only if the integer in question is not representable in int64_t. The historical precedent for this is our handling for uint8_t. Because this type is representable inside int64_t, we have historically stored it inside Scalar as an int64_t. In general, the concept behind Scalar is that it doesn't know the signedness/unsignedness/bitwidth of its input; in particular, we typically construct Scalar from Python int, which doesn't have any concept of how wide the integer is! So it doesn't make any sense to allow for a small integer like 255 to be representable under both the HAS_i tag and the HAS_u tag. So we forbid the latter case.
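
A rough sketch of that policy (a simplified stand-in type; the tag names come from the description above, everything else is hypothetical):

```cpp
#include <cstdint>
#include <limits>

// Sketch: store an unsigned value under the HAS_u tag only when it cannot be
// represented as int64_t; small unsigned values keep the historical int64 path.
struct ScalarSketch {
  enum class Tag { HAS_i, HAS_u } tag;
  union {
    int64_t i;
    uint64_t u;
  } v;

  explicit ScalarSketch(uint64_t x) {
    if (x <= static_cast<uint64_t>(std::numeric_limits<int64_t>::max())) {
      tag = Tag::HAS_i;
      v.i = static_cast<int64_t>(x);
    } else {
      tag = Tag::HAS_u;  // only values that don't fit in int64_t use the new tag
      v.u = x;
    }
  }
};
```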

Although I have proposed this, the PR as currently written just chokes when you pass it a uint64_t that's too big. There's some more logic that would have to be written out for this. I'm putting this out to start to get some agreement that this is the way to do it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116595
Approved by: https://github.com/albanD
2024-01-08 22:02:03 +00:00
7073dc604e Merge merging rules of CPU inductor and x86 CPU quantization (#116937)
**Summary**
Following the discussion at https://github.com/pytorch/pytorch/pull/116599#issuecomment-1878757581, due to the limitation of the current merging rules that prevent cross-checking all files among different merge groups, it is proposed to merge the groups `x86 CPU quantization` and `CPU inductor` since they are closely related.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116937
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-01-08 15:32:03 +00:00
a2d73e21d1 follow up #115078, broken distributed tests (#116217)
ROCm distributed tests started failing after #115078.  This skips the new tests if the number of GPUs available isn't sufficient.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116217
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-08 15:26:54 +00:00
cyy
ad507789d1 [Reland] [11/N] Enable clang-tidy warnings on c10/util/*.h (#116751)
Reland of #116353 with C++ diagnostic macros restored.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116751
Approved by: https://github.com/albanD
2024-01-08 11:07:58 +00:00
e780213340 [xla hash update] update the pinned xla hash (#116958)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116958
Approved by: https://github.com/pytorchbot
2024-01-08 11:00:59 +00:00
6173386fc4 [MPS][BE] Remove unused nOffsets parameter (#116940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116940
Approved by: https://github.com/Skylion007
ghstack dependencies: #116903, #116904, #116915
2024-01-08 04:55:35 +00:00
f663935935 [MPS] Fix boundary checks in generateKernelOffsets (#116915)
`TORCH_CHECK(i < UINT32_MAX)` is always true, so the check never fires; it should be `TORCH_CHECK(iterShape[i] < UINT32_MAX)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116915
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #116903, #116904
2024-01-08 04:55:35 +00:00
aa718065b2 [MPS][BE] Refactor common code (#116904)
Into `generateKernelDataOffsets` which was repeated character by character in BinaryKernel, CrossKernel and Indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116904
Approved by: https://github.com/Skylion007
ghstack dependencies: #116903
2024-01-08 04:55:35 +00:00
57491d2046 Add bfloat16 + fp16 support to fractional_max_pool for CUDA and CPU (#116950)
Adds bfloat16 to fractional_max_pool. If an op supports fp32 and fp16, it really should support bf16 as well. Most but not all ops satisfy this, so I am adding support for the few that do not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116950
Approved by: https://github.com/lezcano
2024-01-08 03:54:29 +00:00
7d61fa23df Add float16 support to CUDA logaddexp2 (#116948)
float16 is already supported on CPU for this op and on gpu for `logaddexp` so let's expand support to the function with the base2 variant as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116948
Approved by: https://github.com/lezcano
2024-01-08 03:37:07 +00:00
2fe90e4d47 [vision hash update] update the pinned vision hash (#116908)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116908
Approved by: https://github.com/pytorchbot
2024-01-08 03:24:41 +00:00
6c32cd05a3 [executorch hash update] update the pinned executorch hash (#116936)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116936
Approved by: https://github.com/pytorchbot
2024-01-08 03:18:18 +00:00
376f036570 Add bfloat16 CUDA support to multinomial (#116951)
Add torch bfloat16 support to multinomial. Only a few methods in torch support fp32, fp16, but not bfloat16 so let's go and finish implementing them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116951
Approved by: https://github.com/lezcano
2024-01-08 01:43:16 +00:00
8257b867d8 Add bfloat16 CUDA support to binomial distribution (#116932)
Now all distributions support bfloat16 as input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116932
Approved by: https://github.com/malfet
2024-01-07 19:50:10 +00:00
4a37f57c69 Add batched sparse CSR/CSC/BSR/BSC to sparse COO conversion support (#116206)
As in the title.

Fixes https://github.com/pytorch/pytorch/issues/104868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116206
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/cpuhrsch
2024-01-07 19:42:02 +00:00
cyy
4b74bb6c34 [Exception] [2/N] Remove THPUtils_assert (#116772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116772
Approved by: https://github.com/albanD
2024-01-07 14:21:43 +00:00
3c7f358c91 Update the expected accuracy value for demucs (#116944)
Update the expected value with `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py b847290ddd9c6a5a598c70f8b660ee2b1e71dc95` as this is now failing in trunk after 95041829c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116944
Approved by: https://github.com/voznesenskym
2024-01-07 13:34:51 +00:00
de005b14ab [dynamo] fix more broken dict tests (#116943)
Forward fixing after #111196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116943
Approved by: https://github.com/huydhn
2024-01-07 08:00:16 +00:00
8ddac14a15 Add unsigned integer dtypes to PyTorch (#116594)
The dtypes are very useless right now (not even fill works), but this makes torch.uint16, uint32 and uint64 available as dtypes.

Towards https://github.com/pytorch/pytorch/issues/58734

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116594
Approved by: https://github.com/albanD
ghstack dependencies: #116698, #116693
2024-01-07 07:40:49 +00:00
8e273e23b5 Refactor promoteType to no longer use shifting strategy (#116693)
Instead of manually fixing the indices (extremely error-prone when new
dtypes are added), we just set up a lookup table to map ScalarType to the
offsets table.
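
A miniature sketch of the lookup-table idea on a made-up three-type enum (the type names and table contents are hypothetical, purely to show the shape of the mapping):

```cpp
#include <array>
#include <cstdint>

// Map each type to a dense index via an explicit table, then look the result
// up in a 2D promotion matrix, instead of deriving indices by shifting.
enum class MiniType : uint8_t { Bool = 0, Int = 1, Float = 2 };

constexpr int kNumTypes = 3;
constexpr std::array<int, kNumTypes> type_to_index = {0, 1, 2};

constexpr MiniType promotion[kNumTypes][kNumTypes] = {
    /* Bool  */ {MiniType::Bool,  MiniType::Int,   MiniType::Float},
    /* Int   */ {MiniType::Int,   MiniType::Int,   MiniType::Float},
    /* Float */ {MiniType::Float, MiniType::Float, MiniType::Float},
};

constexpr MiniType promote(MiniType a, MiniType b) {
  return promotion[type_to_index[static_cast<uint8_t>(a)]]
                  [type_to_index[static_cast<uint8_t>(b)]];
}
```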

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116693
Approved by: https://github.com/albanD
ghstack dependencies: #116698
2024-01-07 07:40:49 +00:00
c5e6485d14 Add AT_DISPATCH_V2 (#116698)
See top-level comment on Dispatch_v2.h for motivation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116698
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-07 07:40:49 +00:00
9557b63c85 [MPS][BE] Do not crash if Metal function can not be found (#116938)
As [`newFunctionWithName:`](https://developer.apple.com/documentation/metal/mtllibrary/1515524-newfunctionwithname) does not accept error argument, do not attempt to print it as it'll be guaranteed `nil` at that point, that results in a classic null pointer dereference, when `TORCH_CHECK` will attempt to construct `std::string` from it. See below backtrace for example:
```
 thread #1, queue = 'metal gpu stream', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000018a316dc4 libsystem_platform.dylib`_platform_strlen + 4
    frame #1: 0x00000001471011bc libtorch_cpu.dylib`std::__1::__constexpr_strlen[abi:v160006](__str=0x0000000000000000) at cstring:114:10
    frame #2: 0x0000000147100c24 libtorch_cpu.dylib`std::__1::char_traits<char>::length(__s=0x0000000000000000) at char_traits.h:220:12
  * frame #3: 0x0000000147100bf0 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& std::__1::operator<<[abi:v160006]<std::__1::char_traits<char>>(__os=0x000000016fdfb3a0, __str=0x0000000000000000) at ostream:901:57
    frame #4: 0x0000000147100bb4 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb5d0) at StringUtil.h:55:6
    frame #5: 0x00000001471007ac libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*, char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #6: 0x0000000147101444 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char const*, char const*>(ss=0x000000016fdfb3a0, t="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #7: 0x0000000147101404 libtorch_cpu.dylib`std::__1::basic_ostream<char, std::__1::char_traits<char>>& c10::detail::_str<char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char const*, char const*>(ss=0x000000016fdfb3a0, t=0x000000016fdfb500, args="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:68:10
    frame #8: 0x000000014710137c libtorch_cpu.dylib`c10::detail::_str_wrapper<char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, char const*, char const* const&>::call(args=0x000000016fdfb500, args="index_select_32bit_idx32", args=0x000000016fdfb4f8, args=0x000000016fdfb5d0) at StringUtil.h:75:5
    frame #9: 0x0000000147101310 libtorch_cpu.dylib`decltype(auto) c10::str<char [53], std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char [10], char const*>(args={a\xcb\xa7H\x01\0\0\0}, args="index_select_32bit_idx32", args={\x96\xcb\xa7H\x01\0\0\0}, args=0x000000016fdfb5d0) at StringUtil.h:111:10
    frame #10: 0x0000000147100210 libtorch_cpu.dylib`decltype(auto) c10::detail::torchCheckMsgImpl<char [53], std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, char [10], char const*>((null)="Expected indexFunction to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)", args={a\xcb\xa7H\x01\0\0\0}, args="index_select_32bit_idx32", args={\x96\xcb\xa7H\x01\0\0\0}, args=0x000000016fdfb5d0) at Exception.h:453:10
    frame #11: 0x00000001470fffe8 libtorch_cpu.dylib`at::mps::MPSDevice::metalIndexingPSO(this=0x0000600000381670, kernel="index_select_32bit_idx32") at MPSDevice.mm:62:3
```

This was introduced by https://github.com/pytorch/pytorch/pull/99855 that replaced `newFunctionWithName:constantValues:error:` with `newFunctionWithName:`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116938
Approved by: https://github.com/Skylion007
2024-01-07 07:08:54 +00:00
20c2ec9a15 [CPU] Add flash attention mask version (#115913)
Add a masked version of flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-07 04:58:23 +00:00
b847290ddd Back out "[2d] unflatten_tensor on compute stream for DTensorExtension (#116559)" (#116939)
Summary:
Original commit changeset: 65298112f3db

Original Phabricator Diff: D52530451

Differential Revision: D52583345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116939
Approved by: https://github.com/842974287
2024-01-07 03:53:40 +00:00
4b5b8f8a75 Add bfloat16 CUDA support to smoothl1loss (#116933)
Gradually ensuring that all CUDA ops that support float16 also support bfloat16 if possible

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116933
Approved by: https://github.com/malfet
2024-01-07 02:42:49 +00:00
a7902571be Add bfloat16 CUDA support to gamma unary functions (#116929)
Add bfloat16 support to unary gamma functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116929
Approved by: https://github.com/malfet
2024-01-07 02:07:55 +00:00
8e1119f7b2 Fix typo in CUDA Macro (#116930)
Found while grepping for remaining _AND macros in CUDA subfolder

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116930
Approved by: https://github.com/malfet
2024-01-07 01:49:32 +00:00
83e8a0721d Reland #111196 (take 4) "Support tensors as Dict keys" (#116934)
Fixes #ISSUE_NUMBER

See that PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116934
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-01-07 01:37:26 +00:00
95041829c8 Add bfloat16 CUDA support to RNN (#116927)
Fixes #116925
Fixes #116763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116927
Approved by: https://github.com/malfet
2024-01-06 22:55:34 +00:00
a5b86847ef Fix compiler warnings in cuda code (#116921)
Fixes compiler warnings about comparison between signed and unsigned data types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116921
Approved by: https://github.com/Skylion007
2024-01-06 21:25:19 +00:00
65da4e1ba2 [CI] Use jemalloc for CUDA builds (#116900)
According to @ptrblck it'll likely mitigate non-deterministic NVCC bug
See https://github.com/pytorch/pytorch/issues/116289 for more detail

Test plan: ssh into one of the cuda builds and make sure that `LD_PRELOAD` is set for the top-level make command

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116900
Approved by: https://github.com/atalman
2024-01-06 21:03:02 +00:00
c05dd2aaf0 [EZ][MPS] Use dispatch with rethrow for indexing (#116903)
Otherwise any assert within a sync block will cause an unrecoverable abort rather than a structured exception
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116903
Approved by: https://github.com/Skylion007
2024-01-06 20:36:47 +00:00
9519c8afd4 [export] Remove hacks for passing pinned version test. (#116871)
Summary: nature will heal itself.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D52566227

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116871
Approved by: https://github.com/angelayi
2024-01-06 18:09:27 +00:00
2dca3e99eb Revert "Support tensors as Dict keys Re-PR of #111196 (#116785)"
This reverts commit 1badad9ce9694ef70f6a3dc01000f2cf310c4c11.

Reverted https://github.com/pytorch/pytorch/pull/116785 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/116785#issuecomment-1879592261))
2024-01-06 08:22:33 +00:00
88197f2202 Rename experimental API (#116895)
Summary: Title

Test Plan: CI

Differential Revision: D52571286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116895
Approved by: https://github.com/zhxchen17
2024-01-06 08:01:09 +00:00
830ace33bc [C10D] Add GIL checker to NCCL watchdog monitor (#116798)
Whenever the monitor thread kills the watchdog thread for being stuck,
we do so to save cluster time and get a faster failure signal, but we
want to know more about why it got stuck.

One possible reason for watchdog stuckness is GIL contention, which
could be ruled out or observed by making an attempt to acquire the GIL
at exit time.

If we cannot acquire the GIL within a short time window (1s) we abort
the attempt and report GIL contention, otherwise we report that GIL was
acquired successfully.
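
A standalone sketch of the idea (the Python C-API calls are real; the helper's structure and names are illustrative, not the watchdog's actual code):

```cpp
#include <Python.h>
#include <atomic>
#include <chrono>
#include <memory>
#include <thread>

// Try to take the GIL from a detached helper thread and report whether it
// succeeded within `timeout`; if the GIL is contended, the flag stays false
// and we stop waiting instead of blocking forever.
bool gil_acquired_within(std::chrono::milliseconds timeout) {
  auto acquired = std::make_shared<std::atomic<bool>>(false);
  std::thread([acquired] {
    PyGILState_STATE s = PyGILState_Ensure();  // blocks until the GIL is free
    acquired->store(true);
    PyGILState_Release(s);
  }).detach();

  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (acquired->load()) {
      return true;   // report: GIL was acquired successfully
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return acquired->load();  // false => report possible GIL contention
}
```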

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116798
Approved by: https://github.com/zdevito
2024-01-06 05:13:43 +00:00
f24bba1624 [executorch hash update] update the pinned executorch hash (#116800)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116800
Approved by: https://github.com/pytorchbot
2024-01-06 04:10:52 +00:00
78c3098470 cmake: Include CheckCXXCompilerFlag where it is used (#113028)
Move the `include(CheckCXXCompilerFlag)` above the `append_cxx_flag_if_supported` function that uses it to avoid depending on the caller to have it already included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113028
Approved by: https://github.com/malfet
2024-01-06 04:05:45 +00:00
1badad9ce9 Support tensors as Dict keys Re-PR of #111196 (#116785)
This prepares the PR where we implement sets in terms of dicts.
To do so, rather than internally storing a dictionary that maps literals
to VariableTrackers, it stores (pretty much) a dictionary from VTs to VTs.
Keys are wrapped in an opaque internal class _Hashable.
The _Hashable class is opaque on purpose so that it fails hard
if it inadvertently leaks back into user code.
We also found and fixed a number of latent bugs and inconsistencies
in the way dynamo checked what can be a dict key. More generally, we
make it much clearer what needs to be modified to add
a new supported key type to Dicts.

Fixes [#107595](https://www.internalfb.com/tasks?t=107595)
Fixes [#111603](https://www.internalfb.com/tasks?t=111603)

Re-PR of https://github.com/pytorch/pytorch/pull/111196 sadly due to reverts, we could not reuse @lezcano's original PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116785
Approved by: https://github.com/mlazos
2024-01-06 03:35:35 +00:00
ff0f79d3c7 [MPS] Mark torch.[all|any] as working with complex on MacOS14 (#116907)
It was enabled by https://github.com/pytorch/pytorch/pulls/116457, but at the time that PR landed, Sonoma testing was still not enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116907
Approved by: https://github.com/osalpekar, https://github.com/kit1980
2024-01-06 01:10:11 +00:00
0b0c76bace Support squeeze.dim for jagged NT (#116891)
As title. Needed for `rev_view_func()` of `unsqueeze()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116891
Approved by: https://github.com/soulitzer
ghstack dependencies: #115894, #116512
2024-01-06 01:00:53 +00:00
8894a97707 [Dynamo] Fix source for autograd.function default value (#116894)
Before this PR, the source guard would emit
```
globals()['Gradient'].__class__.forward.__defaults__[0]
```
which is incorrect

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116894
Approved by: https://github.com/zou3519, https://github.com/yanboliang
2024-01-06 00:36:00 +00:00
5323b2daa5 [docs] add mode="reduce-overhead" into torch.compile to enable cuda graph (#116529)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116529
Approved by: https://github.com/eellison
2024-01-05 22:54:20 +00:00
2753960177 markDynamoStrictTest most of test/lazy/.* (#116893)
[codemod] markDynamoStrictTest lazy/test_step_closures
[codemod] markDynamoStrictTest lazy/test_reuse_ir
[codemod] markDynamoStrictTest lazy/test_meta_kernel
[codemod] markDynamoStrictTest lazy/test_generator
[codemod] markDynamoStrictTest lazy/test_functionalization
[codemod] markDynamoStrictTest lazy/test_extract_compiled_graph
[codemod] markDynamoStrictTest lazy/test_debug_util
[codemod] markDynamoStrictTest lazy/test_bindings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116893
Approved by: https://github.com/Skylion007
ghstack dependencies: #116879, #116880, #116881, #116892
2024-01-05 22:29:35 +00:00
af2ded23eb [export] Exempt autograd ops for predispatch export (#116527)
Summary:
We intend to preserve autograd ops for predispatch export. Therefore, we
need to exempt the autograd ops in some places, e.g. verifier and
proxy_tensor.py.

Test Plan:
python test/export/test_export.py -k test_predispatch_export_with_autograd_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116527
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #116339
2024-01-05 22:28:57 +00:00
9431798521 [export] Error grad mode op in export API (#116339)
Summary:
As current export doesn't support training, grad mode ops don't
make sense. To avoid confusion, we choose to error early if there
are any grad mode ops.

Test Plan:
python test/export/test_safeguard.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116339
Approved by: https://github.com/tugsbayasgalan
2024-01-05 22:28:57 +00:00
8fd4efacb4 markDynamoStrictTest most test/functorch/* (#116892)
[codemod] markDynamoStrictTest functorch/test_rearrange
[codemod] markDynamoStrictTest functorch/test_parsing
[codemod] markDynamoStrictTest functorch/test_minifier
[codemod] markDynamoStrictTest functorch/test_memory_efficient_fusion
[codemod] markDynamoStrictTest functorch/test_logging
[codemod] markDynamoStrictTest functorch/test_eager_transforms
[codemod] markDynamoStrictTest functorch/test_dims
[codemod] markDynamoStrictTest functorch/test_control_flow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116892
Approved by: https://github.com/Skylion007
ghstack dependencies: #116879, #116880, #116881
2024-01-05 22:26:20 +00:00
e5f2ac18da [codemod] markDynamoStrictTest batch 12 (#116881)
[codemod] markDynamoStrictTest distributions/test_distributions
[codemod] markDynamoStrictTest distributions/test_constraints
[codemod] markDynamoStrictTest benchmark_utils/test_benchmark_utils
[codemod] markDynamoStrictTest backends/xeon/test_launch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116881
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116879, #116880
2024-01-05 21:59:40 +00:00
7562a00946 Make TORCH_LOGS="dist_ddp" include DDPOptimizer logs (#116794)
Note: ddp_graphs is still 'separate' from log components since it is an
artifact.  Not sure it's possible to enable it by default when dist_ddp
is selected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116794
Approved by: https://github.com/fduwjj
2024-01-05 21:31:42 +00:00
5377b994da [aot_inductor] Retrieve original FQNs for weights (#116157)
Differential Revision: D52303882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116157
Approved by: https://github.com/frank-wei
2024-01-05 21:30:36 +00:00
521dbbfaff Remove cpp/tensorexpr benchmarks (#116868)
Summary: These refer to a deprecated backend of torchscript which is no longer built in releases, and require llvm to be built.

Test Plan:
```
python setup.py develop
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116868
Approved by: https://github.com/hl475, https://github.com/chenyang78, https://github.com/eellison, https://github.com/mikekgfb
2024-01-05 21:23:30 +00:00
99ef47098d Use smaller shapes in lstm test to fix the CI timeout (#116453)
Fixes https://github.com/pytorch/pytorch/issues/108824 by using smaller shapes while keeping the same test scope

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116453
Approved by: https://github.com/huydhn, https://github.com/jgong5
2024-01-05 21:19:56 +00:00
499ca71e49 [codemod] markDynamoStrictTest batch 11 (#116880)
[codemod] markDynamoStrictTest nn/test_pruning
[codemod] markDynamoStrictTest nn/test_pooling
[codemod] markDynamoStrictTest nn/test_parametrization
[codemod] markDynamoStrictTest nn/test_packed_sequence
[codemod] markDynamoStrictTest nn/test_multihead_attention
[codemod] markDynamoStrictTest nn/test_module_hooks
[codemod] markDynamoStrictTest nn/test_lazy_modules
[codemod] markDynamoStrictTest nn/test_init
[codemod] markDynamoStrictTest nn/test_embedding
[codemod] markDynamoStrictTest nn/test_dropout
[codemod] markDynamoStrictTest nn/test_convolution
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116880
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116879
2024-01-05 21:17:43 +00:00
ef7abdbd1a [C10] Mark Complex::imag as C10_HOST_DEVICE (#116877)
It feels weird that `real` is marked as such, but `imag` is not

Found while working on https://github.com/pytorch/pytorch/issues/116628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116877
Approved by: https://github.com/Skylion007
2024-01-05 21:17:05 +00:00
c72d9f5de3 [no ci] Add pytorch-dev-infra as owners of .ci folder (#116901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116901
Approved by: https://github.com/huydhn
2024-01-05 21:15:47 +00:00
0f0020d76f [GHF] Add support for new style stacks (#116873)
In the new style, the stack targets the default branch rather than `base`. But as
the default branch is likely to have advanced since the PR was made, search for
the merge base before determining whether `base`..`head` are in sync with the `orig` branch.
Also, rather than hardcoding the default branch name, fetch it from `GitHubPR.default_branch()`

Test Plan: https://github.com/malfet/deleteme/pull/77

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116873
Approved by: https://github.com/ezyang
2024-01-05 20:32:24 +00:00
71d8fe690f Replace recursive stable_topological_sort() with iterative. (#116761)
Summary:
A graph with a deep chain of nodes caused stable_topological_sort() to recurse deeply and
overflow the stack. Rewrite it to be iterative and avoid recursion.

Fixes #115506
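
For context, a generic iterative sketch of the idea (not the actual inductor code): an explicit stack replaces recursion so a deep graph cannot exhaust Python's recursion limit, and ties are broken by original node order to keep the sort stable.

```python
# Generic sketch only; nodes and dependencies are plain lists/dicts here.
def stable_topological_sort(nodes, deps):
    order = {n: i for i, n in enumerate(nodes)}   # original order breaks ties
    visited, result = set(), []
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(sorted(deps.get(root, ()), key=order.get)))]
        while stack:
            node, it = stack[-1]
            for dep in it:
                if dep not in visited:
                    visited.add(dep)
                    stack.append((dep, iter(sorted(deps.get(dep, ()), key=order.get))))
                    break
            else:
                stack.pop()            # all dependencies were emitted first
                result.append(node)
    return result

print(stable_topological_sort(["c", "b", "a"], {"c": ["a", "b"], "b": ["a"]}))
# -> ['a', 'b', 'c']
```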

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116761
Approved by: https://github.com/jansel, https://github.com/oulgen, https://github.com/Skylion007
2024-01-05 20:13:49 +00:00
476e9d5f77 [codemod] markDynamoStrictTest batch 10 (#116879)
[codemod] markDynamoStrictTest test_cpp_extensions_aot_no_ninja
[codemod] markDynamoStrictTest test_cpp_extensions_aot_ninja
[codemod] markDynamoStrictTest test_cpp_api_parity
[codemod] markDynamoStrictTest test_complex
[codemod] markDynamoStrictTest test_compile_benchmark_util
[codemod] markDynamoStrictTest test_comparison_utils
[codemod] markDynamoStrictTest test_bundled_inputs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116879
Approved by: https://github.com/voznesenskym
2024-01-05 19:46:55 +00:00
764a18016d VSX: Fix vectorized abs function for complex tensors (#116859)
Use a similar approach with Sleef as in #99550
to improve the precision and extremal value handling of the `abs` function for complex tensors.

This fixes
- test_reference_numerics_extremal__refs_abs_cpu_float64
- test_reference_numerics_extremal__refs_abs_cpu_float128

which failed on PPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116859
Approved by: https://github.com/lezcano
2024-01-05 19:24:42 +00:00
63ee35c4e0 BugFix: Fix F632 bug in dynamo (if statement is always false) (#116867)
This was flagged by a preview ruff check because the if statement always evaluates to false; it was likely a typo between `is` and `in`. I also micro-optimized some list construction into tuple construction, which is semantically identical but faster.
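
A tiny illustration of the class of bug F632 flags (illustrative code, not the actual dynamo snippet):

```python
# `is` compares identity, so testing a type against a tuple with `is` is
# always False; membership needs `in`. This is the is/in typo described above.
def is_number_bad(y):
    return type(y) is (int, float)   # always False

def is_number_good(y):
    return type(y) in (int, float)

print(is_number_bad(3), is_number_good(3))  # False True
```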

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116867
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/yanboliang
2024-01-05 19:15:05 +00:00
d455c33cca [ez][td] Pipe TD logs to log file (#116796)
It is a bit annoying to have them come up when searching through the logs. They're also surprisingly long.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116796
Approved by: https://github.com/huydhn
2024-01-05 19:05:12 +00:00
ebedce24ab [FSDP] enable autograd in forward prefetching (#116792)
**problem**
when prefetching for the next forward, the current forward may be annotated with
`@torch.no_grad`. `param.grad_fn` then stays None during prefetching, so
`_post_backward_hook` never gets triggered.

repro
```pytest test/distributed/fsdp/test_fsdp_freezing_weights.py```

**solution**
this PR enables autograd during prefetching (`_use_unsharded_views`), so
`param.grad_fn` is properly assigned for the next forward

a longer-term fix would be to move `_use_unsharded_views` out of
`_prefetch_handle` and put it in `_pre_forward_unshard`
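
A minimal sketch of the mechanism relied on here (plain tensors, not the actual FSDP internals): re-enabling grad around view creation gives the view a grad_fn even inside a no_grad region.

```python
import torch

base = torch.randn(4, requires_grad=True)

with torch.no_grad():
    # Without enable_grad the view would have grad_fn == None, mirroring the
    # prefetching problem above; wrapping it restores the autograd edge.
    with torch.enable_grad():
        view = base.view(2, 2)

print(view.grad_fn is not None)  # True
```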

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116792
Approved by: https://github.com/awgu
2024-01-05 18:44:27 +00:00
7f124167b5 [BE][Easy]: Update libfmt submodule to 10.2.1 (#116864)
Follow up to #116363. There was an update and 10.2.1 was released that fixes an accidental ABI change in 10.2 with libfmt on windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116864
Approved by: https://github.com/albanD
2024-01-05 18:32:23 +00:00
4b6961a629 [no ci] Fix spelling (#116872)
s/initization/initialization/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116872
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman
2024-01-05 18:04:36 +00:00
0a0209e8a1 [ROCm] Use MI210 CI runners for all trunk commits (#116797)
As a follow-up to https://github.com/pytorch/pytorch/pull/115981

To make sure we catch any regressions/breakages related to flash attention/inductor/etc. functionality that is only enabled for MI210s, we would like to switch the trunk commit CI jobs to always run on MI210 runners. This should help us accurately identify the breaking commits for ROCm CI on the HUD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116797
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2024-01-05 17:46:38 +00:00
9ac0e6971a Revert "[1/4] Intel GPU Runtime Upstreaming for Device (#116019)"
This reverts commit b4cebe2c34242ceee3a1bc285f426662942a29ac.

Reverted https://github.com/pytorch/pytorch/pull/116019 on behalf of https://github.com/malfet due to Broke internal and periodic buck builds, see https://github.com/pytorch/pytorch/actions/runs/7414664129/job/20176215868 ([comment](https://github.com/pytorch/pytorch/pull/116019#issuecomment-1879030285))
2024-01-05 17:36:39 +00:00
7956ca16e6 Enable reverse view_funcs by default for python subclasses (#116512)
Part 3 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

Changes codegen to generate `view_func()` / `rev_view_func()` by default for python subclasses. With `view_func()` existing more often now, the lazy view rebase logic [here](f10c3f4184/torch/csrc/autograd/variable.cpp (L665-L695)) causes some slight behavior changes for in-place ops on views:
* Additional view nodes are inserted into output graphs, changing their string representation, although they are functionally the same. The extra nodes are removed in AOTAutograd's DCE pass.
* When `t` is a `FunctionalTensor`, calling `t.grad_fn` will now invoke `view_func()`; we need to make sure we're operating in a `FunctionalTensorMode` so the view op calls succeed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116512
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
ghstack dependencies: #115894
2024-01-05 16:48:12 +00:00
3c21264c9b Introduce reverse view_funcs (#115894)
Part 2 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

Details:
* Codegen `rev_view_func()` alongside `view_func()`
    * Reverse view_func gives you a "base" from a "view": `rev_view_func(new_view) -> new_base` AKA it plays the original view backwards
* Utilizes the functional inverses defined in `FunctionalInverses.cpp`, passing `InverseReturnMode::AlwaysView`
* Manually implements functional inverses for `narrow()` and `chunk()`
* **NB: Multi-output views now set view_func() / rev_view_func() for each of the output views!**
    * Due to this, the `as_view()` overload that operates on a list of views is scrapped in favor of iteration via codegen

Example codegen in `ADInplaceOrViewTypeN.cpp`:
```cpp
at::Tensor narrow(c10::DispatchKeySet ks, const at::Tensor & self, int64_t dim, c10::SymInt start, c10::SymInt length) {
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::_ops::narrow::redispatch(ks & c10::after_ADInplaceOrView_keyset, self, dim, start, length);
  })();
  std::function<at::Tensor(const at::Tensor&)> func=nullptr;
  std::function<at::Tensor(const at::Tensor&)> rev_func=nullptr;
  if (false || !self.unsafeGetTensorImpl()->support_as_strided() ||
      c10::AutogradState::get_tls_state().get_view_replay_enabled()) {
    func = [=](const at::Tensor& input_base) {
      return at::_ops::narrow::call(input_base, dim, start, length);
    };
    rev_func = [=](const at::Tensor& input_view) {
      // NB: args from narrow() signature are passed along to the inverse
      return at::functionalization::FunctionalInverses::narrow_copy_inverse(self, input_view, at::functionalization::InverseReturnMode::AlwaysView, dim, start, length);
    };
  }
  auto result = as_view(/* base */ self, /* output */ _tmp, /* is_bw_differentiable */ true, /* is_fw_differentiable */ true, /* view_func */ func, /* rev_view_func */ rev_func, /* creation_meta */ InferenceMode::is_enabled() ? CreationMeta::INFERENCE_MODE : (at::GradMode::is_enabled() ? CreationMeta::DEFAULT : CreationMeta::NO_GRAD_MODE));
  return result;
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115894
Approved by: https://github.com/soulitzer
2024-01-05 16:48:12 +00:00
053b15c596 [codemod] markDynamoStrictTest batch 9 (#116836)
[codemod] markDynamoStrictTest test_datapipe
[codemod] markDynamoStrictTest test_cuda_trace
[codemod] markDynamoStrictTest test_cuda_sanitizer
[codemod] markDynamoStrictTest test_cuda_primary_ctx
[codemod] markDynamoStrictTest test_cuda_nvml_based_avail
[codemod] markDynamoStrictTest test_cuda_multigpu
[codemod] markDynamoStrictTest test_cuda_expandable_segments
[codemod] markDynamoStrictTest test_cuda
[codemod] markDynamoStrictTest test_cpp_extensions_open_device_registration
[codemod] markDynamoStrictTest test_cpp_extensions_jit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116836
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827, #116829, #116834
2024-01-05 16:40:40 +00:00
ee07260337 [codemod] markDynamoStrictTest batch 8 (#116834)
[codemod] markDynamoStrictTest test_flop_counter
[codemod] markDynamoStrictTest test_fake_tensor
[codemod] markDynamoStrictTest test_expanded_weights
[codemod] markDynamoStrictTest test_dynamic_shapes
[codemod] markDynamoStrictTest test_dlpack
[codemod] markDynamoStrictTest test_dispatch
[codemod] markDynamoStrictTest test_deploy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116834
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827, #116829
2024-01-05 16:40:24 +00:00
c0da5a4c68 [codemod] markDynamoStrictTest batch 7 (#116829)
[codemod] markDynamoStrictTest test_license
[codemod] markDynamoStrictTest test_itt
[codemod] markDynamoStrictTest test_import_stats
[codemod] markDynamoStrictTest test_hub
[codemod] markDynamoStrictTest test_futures
[codemod] markDynamoStrictTest test_functionalization_of_rng_ops
[codemod] markDynamoStrictTest test_functionalization
[codemod] markDynamoStrictTest test_functional_autograd_benchmark
[codemod] markDynamoStrictTest test_function_schema
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116829
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802, #116827
2024-01-05 16:33:20 +00:00
6747d1383f [codemod] markDynamoStrictTest batch 6 (#116827)
[codemod] markDynamoStrictTest test_model_exports_to_core_aten
[codemod] markDynamoStrictTest test_model_dump
[codemod] markDynamoStrictTest test_mobile_optimizer
[codemod] markDynamoStrictTest test_mkldnn_verbose
[codemod] markDynamoStrictTest test_mkldnn_fusion
[codemod] markDynamoStrictTest test_mkldnn
[codemod] markDynamoStrictTest test_mkl_verbose
[codemod] markDynamoStrictTest test_meta
[codemod] markDynamoStrictTest test_matmul_cuda
[codemod] markDynamoStrictTest test_logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116827
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116802
2024-01-05 16:33:20 +00:00
9543caadc8 [codemod] markDynamoStrictTest batch 5 (#116802)
[codemod] markDynamoStrictTest test_openmp
[codemod] markDynamoStrictTest test_numpy_interop
[codemod] markDynamoStrictTest test_numba_integration
[codemod] markDynamoStrictTest test_nn
[codemod] markDynamoStrictTest test_nestedtensor
[codemod] markDynamoStrictTest test_native_mha
[codemod] markDynamoStrictTest test_native_functions
[codemod] markDynamoStrictTest test_multiprocessing_spawn
[codemod] markDynamoStrictTest test_multiprocessing
[codemod] markDynamoStrictTest test_monitor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116802
Approved by: https://github.com/bdhirsh
2024-01-05 16:33:13 +00:00
0159e3abbd [dynamo] add a handler for itertools_chain_from_iterable and test (#116849)
1. add a handler for itertools_chain_from_iterable
2. a test for itertools_chain_from_iterable

Fixes #116463
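
A hypothetical repro for the new handler (shapes and values are arbitrary):

```python
import itertools
import torch

@torch.compile(fullgraph=True)
def flatten_and_sum(nested):
    # itertools.chain.from_iterable is now handled by dynamo instead of
    # causing a graph break.
    flat = list(itertools.chain.from_iterable(nested))
    return torch.stack(flat).sum()

print(flatten_and_sum([[torch.ones(2), torch.zeros(2)], [torch.full((2,), 2.0)]]))
```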

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116849
Approved by: https://github.com/ezyang
2024-01-05 15:14:18 +00:00
0249c4a785 Add config toggle suggestions for data-dependent/dynamic output shape (#114337)
Fixes https://github.com/pytorch/pytorch/issues/114220

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114337
Approved by: https://github.com/aakhundov
2024-01-05 14:01:01 +00:00
53f8d17d1e Specialize SymNodeVariable when used as module index (#114377)
Fixes #114171

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114377
Approved by: https://github.com/Skylion007
2024-01-05 13:51:52 +00:00
0e8698c3b6 Prevent unbacked symbol reallocation by forcing unification for unbacked symbol def sites (#114368)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114368
Approved by: https://github.com/aakhundov
2024-01-05 13:51:36 +00:00
f692fc9e7f fix typo (#116828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116828
Approved by: https://github.com/Skylion007
2024-01-05 12:35:33 +00:00
5f5405f809 I have seen this deprecation and I am curious if this is the fix (#116714)
Let's see what CI/CD says.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116714
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-01-05 07:02:58 +00:00
79ba39710e [AOTI] Forward fix a Windows build failure (#116790)
Summary: forward fix https://github.com/pytorch/pytorch/pull/116269
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116790
Approved by: https://github.com/khabinov, https://github.com/huydhn
2024-01-05 06:00:58 +00:00
2ccc7af028 Revert "[CPU] Add flash attention mask version (#115913)"
This reverts commit 76a3fbb7092d25638a046c1994030fc8108e5fbf.

Reverted https://github.com/pytorch/pytorch/pull/115913 on behalf of https://github.com/zou3519 due to broke transformer test on dynamo shard ([comment](https://github.com/pytorch/pytorch/pull/115913#issuecomment-1878043389))
2024-01-05 02:39:12 +00:00
bbfd81f513 [codemod] markDynamoStrictTest batch (#116791)
[codemod] markDynamoStrictTest test_sympy_utils
[codemod] markDynamoStrictTest test_serialization
[codemod] markDynamoStrictTest test_segment_reductions
[codemod] markDynamoStrictTest test_schema_check
[codemod] markDynamoStrictTest test_scatter_gather_ops
[codemod] markDynamoStrictTest test_pytree
[codemod] markDynamoStrictTest test_pruning_op
[codemod] markDynamoStrictTest test_per_overload_api
[codemod] markDynamoStrictTest test_out_dtype_op
[codemod] markDynamoStrictTest test_optim
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116791
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116735, #116736, #116739, #116740, #116742, #116743, #116744, #116745
2024-01-05 02:22:53 +00:00
6d9b837c27 Graphbreak when creating a map with unsupported keys (#116460)
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR]     Source: local
[ERROR]     Create Function: NN_MODULE
[ERROR]     Guard Types: ['ID_MATCH']
[ERROR]     Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR]     Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR]     Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR]     vt = self._wrap(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR]     return self.wrap_module(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR]     return self.tx.output.register_attr_or_module(
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR]     install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
2024-01-05 01:48:07 +00:00
7c8f38700a [dynamo] Fix np.issubdtype (#116459)
Fixes the issue described at https://github.com/pytorch/pytorch/issues/93697#issuecomment-1828346590

This doesn't fix the full issue yet, now we hit
```python
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 744, in step
  getattr(self, inst.opname)(inst)
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 1366, in BUILD_MAP
      assert (
      AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116459
Approved by: https://github.com/peterbell10
2024-01-05 01:48:07 +00:00
76a3fbb709 [CPU] Add flash attention mask version (#115913)
Add a masked-version flash attention for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115913
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-01-05 01:27:36 +00:00
6413511713 [export][refactor][4/n] Make equality_constraints optional (#116233)
Summary: needed to remove equality_constraints eventually :P

Test Plan: CI

Differential Revision: D52351709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116233
Approved by: https://github.com/tugsbayasgalan
2024-01-05 00:50:52 +00:00
db69956feb [Dynamo] Catch ImportError when tracing_rules load objects (#116783)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116783
Approved by: https://github.com/angelayi
2024-01-05 00:26:17 +00:00
b0393ebe9b [MPS] Make test_mps.py passable on Sonoma (#116764)
- Enable Sonoma testing on M2 machines
- Add 70+ ops to the list of supported ones on MacOS Sonoma
- Enable nn.functional.
- Add explicit `TORCH_CHECK` to mark scatter/gather, index_select and linalg ops as not yet supporting Complex, as attempting to call those will crash with various MPS asserts such as:
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: error: 'mps.reduction_min' op operand #0 must be tensor of MPS type values or memref of MPS type values, but got 'tensor<5x5xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/0032d1ee-80fd-11ee-8227-6aecfccc70fe/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:96:0: note: see current operation: %3 = "mps.reduction_min"(%1, %2) <{keep_dims}> : (tensor<5x5xcomplex<f32>>, tensor<2xsi32>) -> tensor<1x1xcomplex<f32>>
```
- Treat bools as int8 to fix regression re-surfaced in `index_fill` (used to be broken in Monterey, then fixed in Ventura and broken in Sonoma again)
- `nn.functional.max_pool2d` results now match CPU output for uint8 dtype in Sonoma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116764
Approved by: https://github.com/kulinseth, https://github.com/seemethere
2024-01-05 00:25:47 +00:00
d0cf2182ea Fix TransformerEncoderLayer for bias=False (#116760)
Fixes https://github.com/pytorch/pytorch/issues/116385

Don't call `torch._transformer_encoder_layer_fwd` when `bias=False`

`bias=False` was not something that `torch._transformer_encoder_layer_fwd` was meant to work with; it was my bad that this wasn't tested when I approved https://github.com/pytorch/pytorch/pull/101687.

`bias=False` was causing the `tensor_args` in [`TransformerEncoder`](a17de2d645/torch/nn/modules/transformer.py (L663-L677)) to contain `None`s and error on checks for the fastpath like `t.requires_grad for t in tensor_args`.

Alternative fix would be to
1) Pass `torch.zeros_like({*}.weight)` to the kernel when `bias=False` and filter `tensor_args` as appropriate
2) Fix `torch._transformer_encoder_layer_fwd` to take `Optional<Tensor>` for biases and fix the kernels as appropriate

Let me know if these approaches are preferable
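
A hypothetical repro of the failing configuration (dimensions are arbitrary); after this fix the layer simply takes the regular, non-fastpath code path:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=2, bias=False, batch_first=True)
layer.eval()
with torch.no_grad():
    # Previously the fastpath was entered with None biases and errored on
    # checks like `t.requires_grad for t in tensor_args`.
    out = layer(torch.randn(2, 4, 16))
print(out.shape)  # torch.Size([2, 4, 16])
```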

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116760
Approved by: https://github.com/jbschlosser
2024-01-05 00:13:10 +00:00
e3ca7346ce Re-add initial Flash Attention support on ROCM (#115981)
Note about the Updates:

This PR:
1. Skips more flash-attention-related UTs on MI200
2. Fixes additional ATen compile errors after hipification
3. Fixes the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block-level static initialization.

CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.

Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:

This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.

Known limitations:

- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power-of-two sequence lengths.
- No support for varlen APIs.
- Only supports head dimensions 16, 32, 64, and 128.
- Performance is still being optimized.

Fixes #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
2024-01-04 22:21:31 +00:00
8195a0aaa7 Move array_of helper to c10/util (#116749)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116749
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #116685
2024-01-04 21:58:32 +00:00
5ac57a06eb [export] Refactor ExportPassBase. (#116778)
Summary:
X-link: https://github.com/pytorch/executorch/pull/1532

As titled. This diff decouples the pass base library from torch export and exir, so that different layers can evolve in their own fashion and we have more headroom to divide and conquer in the future.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D52514517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116778
Approved by: https://github.com/angelayi
2024-01-04 21:32:14 +00:00
e7d741b0fd [C10D] Dump cpp stacktraces on heartbeat monitor timeout (#116717)
Summary:
If the heartbeat monitor times out and kills the process, we want to know why.

It's convenient to use an internal tool for this, but we plan to later
integrate with torchelastic to call into pyspy or something else, which will be
both better (including py stacks) and compatible with OSS.

Test Plan: tested manually, observed c++ stacktraces were dumped

Reviewed By: fduwjj

Differential Revision: D52370243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116717
Approved by: https://github.com/zdevito
2024-01-04 21:11:47 +00:00
cyy
d23972df00 Update libfmt submodule to 10.2.0 (#116363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116363
Approved by: https://github.com/ezyang
2024-01-04 19:25:40 +00:00
70f3a530d7 [AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269)
Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR.

Test Plan: CI

Reviewed By: khabinov

Differential Revision: D52339116

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269
Approved by: https://github.com/khabinov
2024-01-04 18:58:24 +00:00
56d7a47806 [BE] Use precompiled headers to speedup clang-tidy (#116780)
This brings the time down by 30% (from [30](https://github.com/pytorch/pytorch/actions/runs/7412899917/job/20170674075#step:11:64) min to [20](https://github.com/pytorch/pytorch/actions/runs/7413082213/job/20171286833?pr=116780#step:11:64) min)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116780
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-01-04 18:37:44 +00:00
39f8853313 [inductor] Use max sm clock when calculating device tflops (#116754)
See openai/triton#2801

Current SM clocks may fluctuate at runtime and change the result of
`get_device_tflops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116754
Approved by: https://github.com/lezcano
2024-01-04 17:38:21 +00:00
6793b99107 [BugFix] Fix SegFault when torch.all/any dispatched to mps or other backends (#116457)
The old implementation results in an infinite recursive loop, leading to a stack overflow and a segfault.

If TORCH_SHOW_DISPATCH_TRACE is on with a debug build of PyTorch, we can see the following endless output in the terminal:
```
[call] op=[aten::quantize_per_tensor], key=[AutogradCPU]
  [redispatch] op=[aten::quantize_per_tensor], key=[CPU]
 [call] op=[aten::any.dims], key=[AutogradCPU]
  [redispatch] op=[aten::any.dims], key=[QuantizedCPU]
   [call] op=[aten::empty.memory_format], key=[BackendSelect]
    [redispatch] op=[aten::empty.memory_format], key=[CPU]
   [call] op=[aten::any.dims_out], key=[QuantizedCPU]
    [call] op=[aten::any.dims], key=[QuantizedCPU]
     [call] op=[aten::empty.memory_format], key=[BackendSelect]
      [redispatch] op=[aten::empty.memory_format], key=[CPU]
     [call] op=[aten::any.dims_out], key=[QuantizedCPU]
      [call] op=[aten::any.dims], key=[QuantizedCPU]
       [call] op=[aten::empty.memory_format], key=[BackendSelect]
        [redispatch] op=[aten::empty.memory_format], key=[CPU]
       [call] op=[aten::any.dims_out], key=[QuantizedCPU]
        [call] op=[aten::any.dims], key=[QuantizedCPU]
         [call] op=[aten::empty.memory_format], key=[BackendSelect]
          [redispatch] op=[aten::empty.memory_format], key=[CPU]
         [call] op=[aten::any.dims_out], key=[QuantizedCPU]
          [call] op=[aten::any.dims], key=[QuantizedCPU]
           [call] op=[aten::empty.memory_format], key=[BackendSelect]
            [redispatch] op=[aten::empty.memory_format], key=[CPU]
           [call] op=[aten::any.dims_out], key=[QuantizedCPU]
            [call] op=[aten::any.dims], key=[QuantizedCPU]
             [call] op=[aten::empty.memory_format], key=[BackendSelect]
              [redispatch] op=[aten::empty.memory_format], key=[CPU]
             [call] op=[aten::any.dims_out], key=[QuantizedCPU]
              [call] op=[aten::any.dims], key=[QuantizedCPU]
               [call] op=[aten::empty.memory_format], key=[BackendSelect]
                [redispatch] op=[aten::empty.memory_format], key=[CPU]
               [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                [call] op=[aten::any.dims], key=[QuantizedCPU]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                  [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::any.dims_out], key=[QuantizedCPU]
                  [call] op=[aten::any.dims], key=[QuantizedCPU]
.....
.....
.....
```

Fixes #116452
Fixes #116451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116457
Approved by: https://github.com/malfet
2024-01-04 17:37:17 +00:00
b4cebe2c34 [1/4] Intel GPU Runtime Upstreaming for Device (#116019)
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), The first runtime component we would like to upstream is `Device` which contains the device management functions of Intel GPU's runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.

# Design
An Intel GPU device is a wrapper around a SYCL device on which kernels can be executed. In our design, we maintain a SYCL device pool containing all the GPU devices of the current machine, and PyTorch manages the status of the device pool. Thread-local safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the c10/xpu folder. And we provide the c10 device runtime APIs, like
  - `c10::xpu::device_count`
  - `c10::xpu::set_device`
  - ...

# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
2024-01-04 17:35:04 +00:00
43fb1b671c [export] Improve verifier to not specialize on dialect. (#116705)
Summary:
Currently we have a very ugly specialization on edge dialect in verifier like the following:
```
 # TODO Remove this branch.
            if ep.dialect == "EDGE":  # !!! Don't change this allowlist. !!!
                pass
            else:
                raise e
```
In this diff we do some additional work to make signature checking also work in exir. We decouple the transformation stack in torch export and exir so that different layers of the stack can evolve in their own fashion and the team can divide and conquer them separately.

Test Plan: CI

Differential Revision: D52499225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116705
Approved by: https://github.com/tugsbayasgalan
2024-01-04 17:17:23 +00:00
f1a393c029 [codemod] markDynamoStrictTest batch (#116745)
- test_show_pickle
- test_set_default_mobile_cpu_allocator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116745
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742, #116743, #116744
2024-01-04 15:04:18 +00:00
311548b79c [codemod] markDynamoStrictTest test_sort_and_select (#116744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116744
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742, #116743
2024-01-04 15:04:18 +00:00
30f0a05207 [codemod] markDynamoStrictTest test_stateless (#116743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116743
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740, #116742
2024-01-04 15:03:21 +00:00
46b44fb246 [codemod] markDynamoStrictTest test_subclass (#116742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116742
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739, #116740
2024-01-04 15:02:46 +00:00
c2174974ae [codemod] markDynamoStrictTest test_tensor_creation_ops (#116740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116740
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736, #116739
2024-01-04 15:02:03 +00:00
7c5704fc00 [codemod] markDynamoStrictTest test_tensorboard (#116739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116739
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735, #116736
2024-01-04 15:01:25 +00:00
caa33e1eb1 [codemod] markDynamoStrictTest test_testing (#116736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116736
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734, #116735
2024-01-04 15:01:07 +00:00
882d1f4ea6 [codemod] markDynamoStrictTest test_transformers (#116735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116735
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733, #116734
2024-01-04 15:00:23 +00:00
eb958d7552 Fix bug in unflatten pytree (#116750)
Summary: Title

Test Plan: CI

Differential Revision: D52529088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116750
Approved by: https://github.com/zhxchen17
2024-01-04 14:23:40 +00:00
75dae4f691 Revert "[dynamo] Fix np.issubdtype (#116459)"
This reverts commit b5c33ccdb3198a48a354e21a4fdace0ec6d04146.

Reverted https://github.com/pytorch/pytorch/pull/116459 on behalf of https://github.com/zou3519 due to Broke CI, seems to be a landrace ([comment](https://github.com/pytorch/pytorch/pull/116459#issuecomment-1877135999))
2024-01-04 14:00:11 +00:00
3a0f6897c5 Revert "Graphbreak when creating a map with unsupported keys (#116460)"
This reverts commit c2a020a2184982361a712bbb1e9766caba26dba6.

Reverted https://github.com/pytorch/pytorch/pull/116460 on behalf of https://github.com/zou3519 due to I think the bottom PR broke CI ([comment](https://github.com/pytorch/pytorch/pull/116460#issuecomment-1877132374))
2024-01-04 13:56:57 +00:00
c2a020a218 Graphbreak when creating a map with unsupported keys (#116460)
As per title. With this, https://github.com/pytorch/pytorch/issues/93697
does not choke, but spits out many of these:
```
[ERROR] Name: "L['self']"
[ERROR]     Source: local
[ERROR]     Create Function: NN_MODULE
[ERROR]     Guard Types: ['ID_MATCH']
[ERROR]     Code List: ["___check_obj_id(L['self'], 139962171127504)"]
[ERROR]     Object Weakref: <weakref at 0x7f4b72f7c9a0; to
'ActorCriticPolicy' at 0x7f4b7b7df6d0>
[ERROR]     Guarded Class Weakref: <weakref at 0x7f4afbd08b30; to
'ABCMeta' at 0x56463a727840 (ActorCriticPolicy)>
[ERROR] Created at:
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 248, in __call__
[ERROR]     vt = self._wrap(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 474, in _wrap
[ERROR]     return self.wrap_module(value)
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/variables/builder.py",
line 941, in wrap_module
[ERROR]     return self.tx.output.register_attr_or_module(
[ERROR]   File
"/home/lezcano/git/pytorch/pytorch/torch/_dynamo/output_graph.py", line
735, in register_attr_or_module
[ERROR]     install_guard(source.make_guard(GuardBuilder.NN_MODULE))
[ERROR] Error while creating guard:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116460
Approved by: https://github.com/jansel
ghstack dependencies: #116459
2024-01-04 12:36:31 +00:00
81f98f1082 Experimental non-strict mode (#114658)
This is a proof-of-concept implementation of how people can use a `mark_strict` marker to enable torchdynamo while exporting under non-strict mode. The main idea is that `mark_strict` will turn into an HOO which then utilizes dynamo to do correctness analysis in the same way torch.cond works today. There are some notable limitations:
1. This API is not meant for public use yet
2. Strict region can't work with arbitrary container inputs
3. We don't preserve `nn_module_stack` and other node metadata for the strict region.
4. strict_mode HOO will show up in the final graph. This is undesirable in the long term, but for short term experiments, it should be good enough. Will fix this in the follow up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114658
Approved by: https://github.com/ydwu4
2024-01-04 12:24:58 +00:00
cyy
91bbcf8c71 [1/N] replace THPUtils_assert with TORCH_CHECK (#116675)
This PR replaces THPUtils_assert with TORCH_CHECK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116675
Approved by: https://github.com/albanD
2024-01-04 11:15:33 +00:00
faea6f2c7a [C10D] Make heartbeat_ atomic (#116702)
Summary:
Currently, the code is working. We know this because we observe heartbeat
timeouts.

However, there is a chance that if the code were refactored, the compiler could
optimize away the load of heartbeat_ inside heartbeatMonitor, and we wouldn't
know.

Using an atomic here is not really for thread synchronization, but more to ensure
compiler optimizations (hoisting the read outside the loop) can never be
allowed to happen. Again, we know this isn't currently happening because if it
were, it would not be an intermittent failure; it would fail consistently
(at least with a fixed compiler/platform).

I previously avoided an atomic because we didn't want shared locks between the heartbeat
monitor and the watchdog thread. Why? If the watchdog held the lock and hung, the monitor
could also hang. However, this really can't happen (AFAIK) when using an
atomic.

Test Plan: existing CI tests

Differential Revision: D52378257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116702
Approved by: https://github.com/fduwjj, https://github.com/zdevito
2024-01-04 06:06:32 +00:00
2bdc2a68cb [ez][td] Fix for emit metrics can't find JOB_NAME (#116748)
After #113884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116748
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-01-04 05:31:25 +00:00
670e7992fd [Easy] Document AGGRESSIVE_RECOMPUTATION flag in min-cut partitioner (#114007)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114007
Approved by: https://github.com/wanchaol
2024-01-04 05:05:08 +00:00
a8a9695047 Move promoteTypes to cpp file (#116685)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116685
Approved by: https://github.com/albanD
2024-01-04 04:42:14 +00:00
f071687ef1 Clean up macOS x86 CI build and test jobs (#116725)
We're ready to pull the plug on MacOX x86 build and test jobs on CI.

* [ ] https://github.com/pytorch/pytorch/pull/116725
* [ ] https://github.com/pytorch/pytorch/pull/116726

More details is at https://github.com/pytorch/pytorch/issues/114602
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116725
Approved by: https://github.com/malfet, https://github.com/seemethere
2024-01-04 04:26:32 +00:00
9b88354b80 [executorch hash update] update the pinned executorch hash (#116668)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116668
Approved by: https://github.com/pytorchbot
2024-01-04 04:12:25 +00:00
b5c33ccdb3 [dynamo] Fix np.issubdtype (#116459)
Fixes the issue described at https://github.com/pytorch/pytorch/issues/93697#issuecomment-1828346590

This doesn't fix the full issue yet, now we hit
```python
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 744, in step
  getattr(self, inst.opname)(inst)
  File
  "/home/lezcano/git/pytorch/pytorch/torch/_dynamo/symbolic_convert.py",
  line 1366, in BUILD_MAP
      assert (
      AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116459
Approved by: https://github.com/peterbell10
2024-01-04 03:55:50 +00:00
e2359f72c8 [BE]: Update ruff to 0.1.11 (#116704)
Updates ruff to 0.1.11
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116704
Approved by: https://github.com/malfet
2024-01-04 03:35:45 +00:00
e70dfe07f6 [audio hash update] update the pinned audio hash (#116747)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116747
Approved by: https://github.com/pytorchbot
2024-01-04 03:27:48 +00:00
c14a0b6c84 [codemod] markDynamoStrictTest batch (#116734)
- test_type_promotion
- test_type_info
- test_type_hints
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116734
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732, #116733
2024-01-04 03:18:06 +00:00
bfb9df3684 [codemod] markDynamoStrictTest batch (#116733)
- test_weak
- test_view_ops
- test_typing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116733
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731, #116732
2024-01-04 03:18:06 +00:00
a308a25fb7 [codemod] markDynamoStrictTest batch (#116732)
- torch_np/numpy_tests/core/test_getlimits
- torch_np/numpy_tests/core/test_einsum
- torch_np/numpy_tests/core/test_dtype
- torch_np/numpy_tests/core/test_dlpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116732
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730, #116731
2024-01-04 03:17:57 +00:00
9255f55767 [codemod] markDynamoStrictTest batch (#116731)
- torch_np/numpy_tests/core/test_numerictypes
- torch_np/numpy_tests/core/test_numeric
- torch_np/numpy_tests/core/test_indexing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116731
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729, #116730
2024-01-04 03:17:47 +00:00
1f7badd856 [codemod] markDynamoStrictTest batch (#116730)
- torch_np/numpy_tests/core/test_scalarinherit
- torch_np/numpy_tests/core/test_scalar_methods
- torch_np/numpy_tests/core/test_scalar_ctors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116730
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728, #116729
2024-01-04 03:17:39 +00:00
d1d6b90a1b [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_scalarmath (#116729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116729
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116728
2024-01-04 03:17:29 +00:00
3ba35548c3 [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_shape_base (#116728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116728
Approved by: https://github.com/voznesenskym
2024-01-04 03:17:22 +00:00
3acb7972b0 [BE] Test CrossEntropyLoss for torch.half (#116681)
To test it on MPS and CUDA devices
Also, move some float64 skip-tests for MPS to xfail, same as CPU tests for torch.half
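
A minimal sketch of the kind of half-precision check being added (guarded on CUDA availability; shapes are arbitrary):

```python
import torch
import torch.nn as nn

if torch.cuda.is_available():
    loss = nn.CrossEntropyLoss()
    logits = torch.randn(4, 10, device="cuda", dtype=torch.half)
    target = torch.randint(0, 10, (4,), device="cuda")
    # The loss should run end-to-end in torch.half on CUDA (and MPS).
    print(loss(logits, target))
```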
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116681
Approved by: https://github.com/xuzhao9, https://github.com/mikaylagawarecki
2024-01-04 02:16:09 +00:00
6fece41e9a [codemod][lowrisk] Remove extra semi colon from caffe2/c10/util/Float8_e5m2.h (#115761)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D51995078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115761
Approved by: https://github.com/Skylion007
2024-01-04 02:02:26 +00:00
5395331644 Avoid GIL during exit (#116709)
Stacks recorded when tensors are being freed during exit could
try to acquire the GIL. Py_IsInitialized can be used to check if we
are post Python exit and should not attempt to acquire the GIL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116709
Approved by: https://github.com/aaronenyeshi
2024-01-04 01:56:44 +00:00
4926146537 [Inductor] Fix Conv Binary Inplace Fusion issue (#115153)
**Summary**
Take this Pattern as example
```
  #      ReLU
  #     /   \
  #  Conv1
  #   /      \
  # Conv2
  #   \      /
  #      Add
```
The current `ConvBinaryInplace` check will fail to perform Inplace fusion (using outplace fusion instead) due to `ReLU` having 2 users. However, if all users of `ReLU` are ancestor nodes of `Conv2`, we should be able to proceed with the `ConvBinaryInplace` fusion. This diff relaxes the `ConvBinaryInplace` check accordingly.
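
A hypothetical module matching the diagram above (channel counts and sizes are arbitrary); whether the in-place variant actually fires also depends on the CPU backend and freezing settings:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 3, 3, padding=1)
        self.conv2 = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x):
        y = torch.relu(x)                      # ReLU has two users: conv1 and the add
        return self.conv2(self.conv1(y)) + y   # Conv1 -> Conv2 -> Add, plus the ReLU branch

compiled = torch.compile(M().eval())
with torch.no_grad():
    print(compiled(torch.randn(1, 3, 8, 8)).shape)
```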

**TestPlan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_pass_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_conv2d_binary_inplace_fusion_failed_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115153
Approved by: https://github.com/CaoE, https://github.com/jgong5
2024-01-04 01:06:27 +00:00
ce2df3f690 [HigherOrderOp] set set_subgraph_inputs to flatten_manual for map (#115853)
We change manually_set_subgraph_inputs to three modes: manual, automatic and flatten_manual. The flatten_manual mode will first flatten the sub_args and then recursively call set_subgraph_inputs = "manual". This allows us to control the order in which placeholders show up in the graph, which is necessary for map, where we want to keep the mapped arguments before the rest of the positional arguments.

Right now, map only takes a single tensor as the mapped argument, but it becomes pretty easy to match the subgraph inputs to the original proxies if we have a "flatten_manual" option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115853
Approved by: https://github.com/zou3519
2024-01-04 00:27:07 +00:00
a2f3770b24 [BE] Remove arch -arch arm64 (#116724)
It was needed back in the day when there were no arm64 runner daemon binaries, so the trick was needed to execute native arm64 tests when invoked from an x86 runner daemon.

Followup after  https://github.com/pytorch/pytorch/pull/116680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116724
Approved by: https://github.com/huydhn
2024-01-03 23:59:53 +00:00
4e330882da [inductor] Add ABI shim function for torch.scatter_reduce (#116700)
Ran into the following exception during C++ file compilation.
```
error: use of undeclared identifier 'aoti_torch_scatter_reduce_out'
    aoti_torch_scatter_reduce_out(buf12, buf12,0,buf13,buf14, "sum",1);
    ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116700
Approved by: https://github.com/aakhundov
2024-01-03 23:43:44 +00:00
a75b587803 [codemod] markDynamoStrictTest torch_np/numpy_tests/fft/test_helper (#116654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116654
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651, #116652, #116653
2024-01-03 23:03:06 +00:00
f3e2661555 [codemod] markDynamoStrictTest torch_np/numpy_tests/fft/test_pocketfft (#116653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116653
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651, #116652
2024-01-03 23:03:06 +00:00
bf4c1a3d66 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_arraypad (#116652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116652
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650, #116651
2024-01-03 23:03:06 +00:00
f4168c0e2e [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_arraysetops (#116651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116651
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649, #116650
2024-01-03 23:03:06 +00:00
dab1599d81 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_function_base (#116650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116650
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648, #116649
2024-01-03 23:03:06 +00:00
8a76c07b98 [threaded pg] add devices to avoid seeing warnings (#116678)
This PR adds devices to register_backend of the multithreaded pg, to avoid
seeing tons of warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116678
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559, #116573
2024-01-03 23:01:19 +00:00
b10cb168a7 [tp] disable some assertion temporarily for torch.compile (#116573)
Disable some runtime assertions first, as they do not work properly with
torch.compile. I'll have a follow-up fix in dynamo and re-enable
this check again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116573
Approved by: https://github.com/awgu, https://github.com/XilunWu
ghstack dependencies: #116426, #116559
2024-01-03 23:01:19 +00:00
7309f6fdf0 Remove hardcoding arch to arm64 (#116680)
https://github.com/pytorch/pytorch/pull/116627 hardcodes arch to arm64 and it's failing on x86 GitHub runner (yup, they are still there on periodic, we haven't pulled the plug yet).

https://github.com/pytorch/pytorch/actions/runs/7392059632/job/20112760709#step:2:12 is an example failure.

There is no need to set the arch here because it has already been set earlier in the workflow https://github.com/pytorch/pytorch/blob/main/.github/workflows/_mac-test.yml#L47

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116680
Approved by: https://github.com/seemethere
2024-01-03 22:42:14 +00:00
f6be25bae6 [inductor] Add shape checks to ExpandView (#113839)
Currently `ExpandView` doesn't check that the expanded shape is valid, which may
allow bugs to slip through and cause silent correctness issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113839
Approved by: https://github.com/ezyang
2024-01-03 22:31:43 +00:00
1c69d0bdb5 Revert "[11/N] Enable clang-tidy warnings on c10/util/*.h (#116353)"
This reverts commit 37aae5932c26c3729d68b6ebdf00e618fe229b1c.

Reverted https://github.com/pytorch/pytorch/pull/116353 on behalf of https://github.com/izaitsevfb due to Reverting, breaks internal builds: error: implicit conversion from 'long long' to 'float' may lose precision [-Werror,-Wimplicit-int-float-conversion] ([comment](https://github.com/pytorch/pytorch/pull/116353#issuecomment-1876045800))
2024-01-03 22:22:11 +00:00
0aa50909f3 Revert "[12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)"
This reverts commit 5aa258eb09d5ecd62aea4d2bd02bbfa5eda0d554.

Reverted https://github.com/pytorch/pytorch/pull/116486 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on https://github.com/pytorch/pytorch/pull/116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116486#issuecomment-1876042948))
2024-01-03 22:18:54 +00:00
791db94c62 Revert "[13/N] Enable clang-tidy on headers of torch/csrc (#116560)"
This reverts commit b0629cdd67ea5dd264250262e0af75579ed26952.

Reverted https://github.com/pytorch/pytorch/pull/116560 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on #116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116560#issuecomment-1876033363))
2024-01-03 22:08:40 +00:00
71523c2289 Add 116583 to .git-blame-ignore-revs (#116676)
since #116583 is purely cosmetic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116676
Approved by: https://github.com/janeyx99
2024-01-03 19:37:31 +00:00
9693b3740b [easy] [c10d] Add documentation for the device_id parameter for init_process_group (#116222)
Follow-up to add missing docs for #114916
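
A hypothetical single-process illustration of the documented parameter (assumes a CUDA build with NCCL; the rendezvous values and device index are placeholders):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# device_id lets the process group bind to a specific device up front
# instead of inferring it later.
dist.init_process_group(
    backend="nccl",
    world_size=1,
    rank=0,
    device_id=torch.device("cuda:0"),
)
dist.destroy_process_group()
```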

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116222
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-01-03 19:32:18 +00:00
f543093e06 [ONNX] Fix output mismatch issue of repeat_interleave when dim is None (#116689)
'input' is introduced but it's mixed with 'self' in repeat_interleave, which causes the mismatch issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116689
Approved by: https://github.com/thiagocrepaldi
2024-01-03 18:38:00 +00:00
68105da229 Revert "[Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358)"
This reverts commit 97891b184c12763f335fbe1ff63fab843edafab5.

Reverted https://github.com/pytorch/pytorch/pull/116358 on behalf of https://github.com/izaitsevfb due to Breaks internal accuracy test, see D52491095, pytorch/benchmark/fb/test_gpu:run_test_gpu - test_train_ig_feed_over_inductor_accuracy  ([comment](https://github.com/pytorch/pytorch/pull/116358#issuecomment-1875779697))
2024-01-03 18:20:51 +00:00
68b77311ad Fix bug in non-strict input processor (#116674)
Summary: Title

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D52499932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116674
Approved by: https://github.com/tugsbayasgalan
2024-01-03 18:13:25 +00:00
1429c204f8 Increase hub download chunk size (#116536)
This PR increases the read size for the `hub.download_url_to_file` function from 8,192 bytes to 131,072 bytes (128 * 1,024), as reading in larger chunks should be more efficient. The size could probably be larger still, at the expense of the progress bar not getting updated as often.

It re-introduces use of the `READ_DATA_CHUNK` constant that was originally used for this purpose in 4a3baec961 and since forgotten.
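
A generic sketch of the chunked read loop described here (not the actual torch.hub code; the constant mirrors the new 128 KiB size):

```python
import urllib.request

READ_DATA_CHUNK = 128 * 1024  # 131,072 bytes

def download_to_file(url, dst):
    # Read the response in fixed-size chunks; a larger chunk means fewer
    # read calls and fewer progress-bar updates.
    with urllib.request.urlopen(url) as src, open(dst, "wb") as f:
        while True:
            buf = src.read(READ_DATA_CHUNK)
            if not buf:
                break
            f.write(buf)
```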

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116536
Approved by: https://github.com/NicolasHug
2024-01-03 17:38:45 +00:00
c919935cb7 [export] Update schema versioning format. (#116462)
Summary: Update the old versioning scheme to a major and minor version.

Test Plan: CI

Differential Revision: D52431963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116462
Approved by: https://github.com/tugsbayasgalan
2024-01-03 17:34:58 +00:00
2ae55e99fe [release] Add Launch Execution XFN meeting process to release runbook (#116701)
Make sure we have this process documented in the runbook.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116701
Approved by: https://github.com/seemethere
2024-01-03 17:16:18 +00:00
d2fc00d2cc [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_histograms (#116649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116649
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647, #116648
2024-01-03 17:00:32 +00:00
2d1011d84f [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_index_tricks (#116648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116648
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646, #116647
2024-01-03 17:00:32 +00:00
c47ab693ff [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_shape_base_ (#116647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116647
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645, #116646
2024-01-03 17:00:23 +00:00
6a300bd1c6 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_twodim_base (#116646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116646
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644, #116645
2024-01-03 17:00:13 +00:00
34a8c64c92 [codemod] markDynamoStrictTest torch_np/numpy_tests/lib/test_type_check (#116645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116645
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643, #116644
2024-01-03 17:00:07 +00:00
fe287af812 [codemod] markDynamoStrictTest torch_np/numpy_tests/linalg/test_linalg (#116644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116644
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642, #116643
2024-01-03 16:59:59 +00:00
28a8e4bdb6 [codemod] markDynamoStrictTest torch_np/test_basic (#116643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116643
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641, #116642
2024-01-03 16:59:50 +00:00
146426a0df [codemod] markDynamoStrictTest torch_np/test_binary_ufuncs (#116642)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116642
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640, #116641
2024-01-03 16:59:41 +00:00
efe3b7f457 [codemod] markDynamoStrictTest torch_np/test_dtype (#116641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116641
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639, #116640
2024-01-03 16:59:32 +00:00
d760014b9f [codemod] markDynamoStrictTest torch_np/test_function_base (#116640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116640
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673, #116639
2024-01-03 16:59:25 +00:00
efee9e689e [codemod] markDynamoStrictTest torch_np/test_ndarray_methods (#116639)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116639
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116638, #116673
2024-01-03 16:59:19 +00:00
608091e4d1 [codemod] markDynamoStrictTest torch_np/numpy_tests/core/test_multiarray (#116673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116673
Approved by: https://github.com/voznesenskym
ghstack dependencies: #116638
2024-01-03 16:59:12 +00:00
70eb53505b [export] Update range constraints to runtime_var_to_range (#115427)
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.

Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.

Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.
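
A minimal sketch of the merge described above, assuming both mappings are plain dicts from symbol to range; the variable contents are illustrative, not the actual export internals.

```python
var_to_range = {"u0": (2, 5), "s0": (2, 128)}   # 0/1-specialized ranges
runtime_var_to_range = {"u0": (0, 5)}           # user-facing runtime ranges

# runtime_var_to_range takes priority on conflicts
range_constraints = {**var_to_range, **runtime_var_to_range}
assert range_constraints["u0"] == (0, 5)    # unbacked symint shows the runtime range
assert range_constraints["s0"] == (2, 128)  # other symbols fall through unchanged
```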

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
2024-01-03 16:55:04 +00:00
f081c45a34 Add out_dtype support for sparse semi-structured CUTLASS back-end (#116519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116519
Approved by: https://github.com/cpuhrsch
2024-01-03 16:23:17 +00:00
ba06951c66 [BE] [cuDNN] Always build assuming cuDNN >= 8.1 (#95722)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 27084ed</samp>

This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.

This is a re-land of https://github.com/pytorch/pytorch/pull/91527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet, https://github.com/atalman
2024-01-03 15:41:28 +00:00
3407541b0c add cpu inductor merge rule (#116679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116679
Approved by: https://github.com/huydhn
2024-01-03 15:09:36 +00:00
b57d473091 [codemod] markDynamoStrictTest torch_np/test_nep50_examples (#116638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116638
Approved by: https://github.com/bdhirsh
2024-01-03 14:45:43 +00:00
49de03f0fd adapt to other acceleration devices (#116682)
Fixes #116504

When this API is invoked, a runtime error occurs: with an NPU acceleration device, one branch does not process the input tensor, so some input tensors end up on the CPU and some on the NPU, and an error is reported.
Here, I adapt the code to other acceleration devices and move tensors that live on the acceleration device to the CPU. This has been tested and works.

The details are in issue #116504.
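
A minimal sketch of the device-agnostic fallback described above; the helper name is hypothetical, and the point is only that any tensor living on a non-CPU accelerator is moved back to the CPU before the CPU-only code path runs.

```python
import torch

def _ensure_cpu(t: torch.Tensor) -> torch.Tensor:
    # Works for CUDA, NPU, or any other accelerator: anything not already on
    # the CPU is copied back before it is mixed with CPU tensors.
    return t if t.device.type == "cpu" else t.cpu()
```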

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116682
Approved by: https://github.com/lezcano
2024-01-03 12:41:19 +00:00
c1b88723f8 Fix buck build after recent clang-tidy updates (#116669)
Broken after either https://github.com/pytorch/pytorch/pull/116486 or https://github.com/pytorch/pytorch/pull/116353 I think.  Here is an example build failure 0bc21c6a6b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116669
Approved by: https://github.com/Skylion007
2024-01-03 09:02:58 +00:00
2a87ab4508 Refactor some tests by using TEST_CUDA & TEST_MULTIGPU instead (#116083)
As https://github.com/pytorch/pytorch/pull/116014#discussion_r1430510759 stated, refactor some related tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116083
Approved by: https://github.com/fduwjj
2024-01-03 08:53:59 +00:00
d9c0e37bab [2d] unflatten_tensor on compute stream for DTensorExtension (#116559)
Context: The existing FSDPExtension has a bug when unflattening a tensor
involves compute/communication on a CUDA stream. The current FSDPExtension
logic unflattens the tensor on the unshard stream, which makes the runtime
lose synchronization with the compute stream; if there are dependencies
between the compute stream and the unflatten logic, the missing sync point
can lead to NaNs.

This PR makes the FSDPExtension record the compute stream and lets the
DTensorExtension use the compute stream directly for unflatten_tensor.

In the long term we might want the FSDP runtime to perform only the unshard
on the unshard stream and create the unshard views on the compute stream.
We currently fix this in the extension directly, as that is the simplest
change that does not affect the FSDP runtime logic.
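
A rough sketch of the idea, with hypothetical class and method names: the extension records the compute stream once and runs the unflatten logic under it, so any compute/communication issued there stays ordered with respect to the compute stream.

```python
import torch

class DTensorExtensionSketch:
    def __init__(self) -> None:
        # Recorded by FSDP at init time; None on CPU-only runs.
        self.compute_stream = (
            torch.cuda.current_stream() if torch.cuda.is_available() else None
        )

    def unflatten_tensor(self, flat_param):
        if self.compute_stream is None:
            return self._unflatten(flat_param)
        # Run the (possibly communicating) unflatten on the compute stream
        # instead of the unshard stream, keeping the sync point intact.
        with torch.cuda.stream(self.compute_stream):
            return self._unflatten(flat_param)

    def _unflatten(self, flat_param):
        raise NotImplementedError  # placeholder for the real DTensor logic
```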

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang
ghstack dependencies: #116426
2024-01-03 07:29:08 +00:00
29674b8e1d [dtensor] fix dtensor _to_copy op for mix precision (#116426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116426
Approved by: https://github.com/fduwjj
2024-01-03 07:29:08 +00:00
b0749bce6c [export] Allow None as the meta value for tensor output. (#116664)
Summary: Sometimes we get a None value from ops whose schema declares a Tensor return type. Allow this case during serialization.

Test Plan: test__scaled_dot_product_flash_attention

Differential Revision: D52491668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116664
Approved by: https://github.com/SherlockNoMad
2024-01-03 07:07:39 +00:00
3fe437b24b [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling.
- Replace `assert(0)` with `raise AssertionError()`
- Remove extraneous parenthesis i.e.
  - `assert(a == b)` -> `assert a == b`
  - `if(x > y or y < z):`->`if x > y or y < z:`
  - And `return('...')` -> `return '...'`

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591
Approved by: https://github.com/albanD, https://github.com/malfet
2024-01-03 06:04:44 +00:00
09ee96b69d [MPS] Fix CrossEntropyLoss for float16 (#116597)
Looks like neither [`divisionNoNaNWithPrimaryTensor:`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3675593-divisionnonanwithprimarytensor) nor `oneHotWithIndicesTensor:` works for `MPSDataTypeFloat16`, so provide an explicit cast for one-hot tensor and alternative implementation using the formula from the official doc, i.e.
> `resultTensor = select(secondaryTensor, primaryTensor / secondaryTensor, 0)`
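
In eager PyTorch terms, that formula is just a NaN-safe division; here is a minimal sketch with `torch.where` for illustration (the actual change is in the MPSGraph backend, not in Python):

```python
import torch

def division_no_nan(primary: torch.Tensor, secondary: torch.Tensor) -> torch.Tensor:
    # select(secondary, primary / secondary, 0): where the divisor is zero,
    # return 0 instead of inf/nan.
    return torch.where(secondary != 0, primary / secondary, torch.zeros_like(primary))
```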

Alas, at the moment  it can not be tested via `test_modules.py` as it runs only `torch.float32` and `torch.float64` tests (and `torch.half` implementation is not available for CPU)

Fixes https://github.com/pytorch/pytorch/issues/116095

TODO: Enable testing via TestModules, but will do in separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116597
Approved by: https://github.com/kulinseth
2024-01-03 05:58:26 +00:00
75359934bd [C10D] Improve Heartbeat Monitor exit logs (#116268) (#116661)
Summary:

- add workMetaList_.size() so we know how many outstanding works there
  were when killing
- Print our first log before debuginfo dump instead of after, since it
  is clearer when reading the logs that we time out and then dump
- Organize the log strings- put them near where they are used

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l yf225

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: fduwjj

Differential Revision: D52369167

Pulled By: wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116661
Approved by: https://github.com/fduwjj
2024-01-03 05:35:06 +00:00
1ae39a372e Inductor cpp wrapper: fix cumsum codegen (#116171)
Fixes https://github.com/pytorch/pytorch/issues/115829

For `cumsum(Tensor self, int dim, *, ScalarType? dtype=None) -> Tensor`, `dim` is not a `kwarg_only` argument, but it could be provided as a kwarg when calling this OP.
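
A tiny repro-style sketch of the call pattern involved, assuming cpp_wrapper mode is enabled through the inductor config (the flag name is an assumption here, not quoted from this commit):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True  # assumed flag for cpp_wrapper mode

@torch.compile
def f(x):
    # dim is not kwarg-only in the schema, but passing it as a kwarg is legal
    # and must codegen correctly in cpp_wrapper mode.
    return torch.cumsum(x, dim=1)

out = f(torch.randn(4, 8))
```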

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116171
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2024-01-03 05:33:17 +00:00
ef98987017 Fix user input mutations for run_decompositions (#116382)
Fixes #115106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116382
Approved by: https://github.com/angelayi
2024-01-03 05:04:22 +00:00
c5bd88b56a [export] Improve serialization of union types. (#116511)
Summary:
Making union types harder to use wrong:
1. Still initialize unset fields with None, but do not assert that exactly one field is not None, since it is possible to set a real field to None.
2. Raise an error on reading unset fields of a union, reducing the error surface and enforcing type safety.
3. Serialize a union type with only its tag and omit all unset fields; this makes the serialized model more readable and debuggable.
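
A minimal sketch of these rules with a hypothetical two-member union; only the tag and the single set field are emitted, and reading an unset field raises instead of silently returning None.

```python
class Argument:
    _members = ("as_tensor", "as_int")  # hypothetical union members

    def __init__(self, **kwargs):
        self._fields = {k: kwargs.get(k) for k in self._members}

    def __getattr__(self, key):
        fields = object.__getattribute__(self, "_fields")
        if key not in fields:
            raise AttributeError(key)
        if fields[key] is None:
            raise AttributeError(f"union field '{key}' is not set")  # rule 2
        return fields[key]

    def serialize(self):
        tag = next(k for k, v in self._fields.items() if v is not None)
        return {"$type": tag, tag: self._fields[tag]}  # rule 3: tag + set field only

arg = Argument(as_int=3)
print(arg.serialize())  # {'$type': 'as_int', 'as_int': 3}
# arg.as_tensor         # would raise: union field 'as_tensor' is not set
```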

Test Plan:
buck test mode/opt caffe2/test:test_export
buck test mode/opt executorch/exir/...
buck test mode/opt mode/inplace aps_models/ads/icvr/tests:export_test

Differential Revision: D52446586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116511
Approved by: https://github.com/angelayi
2024-01-03 04:58:59 +00:00
ca4df16fdd [c10d] Make DebugInfoWriter Singleton across all PG objects (#116489)
Previously the writer was registered on each NCCL PG (backend), so every PG had its own writer instance. If a customized writer is used while multiple sub-PGs exist, the user has to register the writer for every backend, which is bad UX. Furthermore, the debug info is global, so it does not make sense to have a writer per instance. We even keep a static mutex in `dumpDebuggingInfo` to serialize the writes, which makes it more obvious that the writer can be a singleton with one instance shared by all PGs.

Although the rationale is clear, the implementation may vary a lot, so this PR is an RFC for now to see whether this implementation makes sense.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
2024-01-03 03:42:54 +00:00
41f265b06a [quant][pt2e] Preserve numeric_debug_handle in quantization flows (#116477)
Summary:
We introduced `node.meta["numeric_debug_handle"]` in https://github.com/pytorch/pytorch/pull/114315 to
indicate the numeric debug handle for values in the graph. In this PR we support preserving this field
in prepare and convert so that we can use it for numerical debugging.

Next: we also want to preserve these in deepcopy of GraphModule as well

Test Plan:
python test/test_quantization.py -k test_quantize_pt2e_preserve_handle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116477
Approved by: https://github.com/tugsbayasgalan
2024-01-03 03:39:00 +00:00
f73b1b9388 [EZ] Update lxml dependency to 5.0.0 (#116657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116657
Approved by: https://github.com/atalman
2024-01-03 02:57:31 +00:00
6e9ca2f220 Enable eye on CPU for bfloat16 dtype (#116616)
Fixes https://github.com/pytorch/pytorch/issues/116609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116616
Approved by: https://github.com/Skylion007
2024-01-03 02:53:27 +00:00
5005f36c12 Clean up files under fb/vulkan/... (#116665)
Remove files accidentally imported in https://github.com/pytorch/pytorch/pull/114712
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116665
Approved by: https://github.com/izaitsevfb, https://github.com/seemethere
2024-01-03 01:55:32 +00:00
3ac0aaf478 [codemod] markDynamoStrictTest torch_np/test_random (#116637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116637
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634, #116635, #116636
2024-01-03 00:51:36 +00:00
884e449753 [codemod] markDynamoStrictTest torch_np/test_reductions (#116636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116636
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634, #116635
2024-01-03 00:51:36 +00:00
8ec606d4c5 [codemod] markDynamoStrictTest torch_np/test_scalars_0D_arrays (#116635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116635
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632, #116634
2024-01-03 00:51:36 +00:00
9b27fcf65a [codemod] markDynamoStrictTest torch_np/test_ufuncs_basic (#116634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116634
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116632
2024-01-03 00:51:36 +00:00
0ce32ce409 [codemod] markDynamoStrictTest torch_np/test_unary_ufuncs (#116632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116632
Approved by: https://github.com/bdhirsh
2024-01-03 00:51:36 +00:00
a1191ce4bf optimize (u)int8 vectorized operator* (#116235)
Summary: optimize (u)int8 vectorized operator*

Test Plan: sandcastle github

Differential Revision: D52318192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116235
Approved by: https://github.com/hl475, https://github.com/malfet
2024-01-03 00:50:23 +00:00
0f6f582c0d Add config to disable TransformerEncoder/MHA fastpath (#112212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112212
Approved by: https://github.com/jbschlosser
2024-01-02 23:59:30 +00:00
9dc68d1aa9 clangformat: fused adam (#116583)
Apply clangformat to fused adam/adamw files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116583
Approved by: https://github.com/janeyx99
2024-01-02 22:30:23 +00:00
3ff4572fe7 delete sharded tensor from fsdp/tp tests (#116244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116244
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/fduwjj
ghstack dependencies: #116122
2024-01-02 22:11:36 +00:00
dfccaac31b [2d] Ensure gradient clear out pending AsyncCollectiveTensor in FSDP Extension (#116122)
As titled, this PR adds a gradient hook to the FSDP DTensor extension to check whether any gradients are AsyncCollectiveTensors; if so, we eagerly wait on them there.

This is needed because a parameter's gradient might still be a pending AsyncCollectiveTensor; if we feed it to FSDP directly, FSDP would use the ACT's storage for reduce_scatter, which is wrong.
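
A rough sketch of the hook, assuming the `AsyncCollectiveTensor` type (and its `wait()` method) from `torch.distributed._functional_collectives`; the helper names are hypothetical, and the real logic lives inside the FSDP extension rather than plain parameter hooks.

```python
import torch
from torch.distributed._functional_collectives import AsyncCollectiveTensor

def _wait_if_async(grad: torch.Tensor) -> torch.Tensor:
    # Eagerly complete any pending collective so FSDP's reduce_scatter sees
    # real storage rather than a not-yet-materialized placeholder.
    if isinstance(grad, AsyncCollectiveTensor):
        return grad.wait()
    return grad

def _register_grad_wait_hooks(module: torch.nn.Module) -> None:
    for p in module.parameters():
        if p.requires_grad:
            p.register_hook(_wait_if_async)
```
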
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116122
Approved by: https://github.com/awgu, https://github.com/fduwjj
2024-01-02 22:11:36 +00:00
a2061ceefe ci: Output runner OS / HW for macOS (#116627)
It's difficult to debug these failures since we have no record of the OS / HW
that CI is running on, so output it to give us a better understanding here.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116627
Approved by: https://github.com/janeyx99
2024-01-02 22:05:53 +00:00
640d46f823 [inductor] Control the cpp_wrapper mode with an env variable (#116615)
Summary: also add one model test for the cpp_wrapper mode on CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116615
Approved by: https://github.com/angelayi
2024-01-02 21:50:25 +00:00
295bdaafb7 [codemod] markDynamoStrictTest test_module_init (#116625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116625
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621, #116622, #116624
2024-01-02 20:55:48 +00:00
074dfc2648 [codemod] markDynamoStrictTest test_linalg (#116624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116624
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621, #116622
2024-01-02 20:55:48 +00:00
5d8e066f6b [codemod] markDynamoStrictTest test_indexing (#116622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116622
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619, #116621
2024-01-02 20:55:39 +00:00
fc7546e9db [codemod] markDynamoStrictTest test_functional_optim (#116621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116621
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618, #116619
2024-01-02 20:55:31 +00:00
88d1638139 [codemod] markDynamoStrictTest test_autograd_fallback (#116619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116619
Approved by: https://github.com/bdhirsh
ghstack dependencies: #116618
2024-01-02 20:55:21 +00:00
39339df8d7 [codemod] markDynamoStrictTest test_autocast (#116618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116618
Approved by: https://github.com/bdhirsh
2024-01-02 20:54:24 +00:00
0bc21c6a6b [C10d] Fix Log Prefix in NCCLPG so that each instance gets its own prefix (#116520)
Somehow the log prefix only ever shows ProcessGroup 0 and the rank [global rank]. This does not give the expected result, since per the comment it should be "a prefix that is unique to this process group and rank". This PR fixes it so the prefix differs across sub-PGs.

The reason is that the prefix is declared static and is therefore shared across all NCCLPG instances, so whichever instance calls this function first bakes its `rank_` and `uid_` into the prefix. We always initialize PG 0 first, which is why we always see PG[0] + global ranks for all sub-PGs.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
2024-01-02 20:23:58 +00:00
6d8d3c1334 add a DTensor test for weight tying (#116475)
Weight tying is useful when we'd like to share weights (and their gradients) between two modules, e.g. the word/token embedding module and the output linear module in language models. This test demonstrates that with DTensor it can be achieved just as with normal tensor, e.g. using `model.fc.weight = model.embedding.weight`.

To test: `python test/distributed/tensor/parallel/test_tp_examples.py -k test_weight_tying`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116475
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2024-01-02 20:19:36 +00:00
fb5a9f2f5c Fix implicit conversion to double (#116614)
Summary:
Forward fix for https://github.com/pytorch/pytorch/pull/116185 / D52390113

Error:
```
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:602:23: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             std::ceil(num_elements / static_cast<double>(_max_load_factor))));
[CONTEXT]                       ^~~~~~~~~~~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
[CONTEXT]              ~~~~~~~~~~~~~~~~~~~~^~~  ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
[CONTEXT]              ~~~~~~~~~~~~~~~~~~~~^~~  ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:923:22: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]         num_elements + 1 >
[CONTEXT]         ~~~~~~~~~~~~~^~~ ~
xplat/caffe2/c10/util/order_preserving_flat_hash_map.h:924:34: error: implicit conversion from 'uint64_t' (aka 'unsigned long long') to 'double' may lose precision [-Werror,-Wimplicit-int-float-conversion]
[CONTEXT]             (num_slots_minus_one + 1) * static_cast<double>(_max_load_factor)) {
```

Fixed by casting int parts to double explicitly.

Test Plan: SC

Differential Revision: D52482968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116614
Approved by: https://github.com/jeanschmidt, https://github.com/seemethere
2024-01-02 20:08:51 +00:00
77d979f748 Autograd attaches logging hooks only in debug level (#116522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116522
Approved by: https://github.com/albanD
2024-01-02 20:06:18 +00:00
b18d8d4595 Add a wrapper to transform a NumPy function into a PyTorch function (#114610)
A less general version of this wrapper was used in the keynote on
`torch.compile(numpy)`. We expose a generic version of the wrapper
that works seamlessly with `torch.compile`.
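
A minimal usage sketch, assuming the wrapper is exposed as `torch.compiler.wrap_numpy` (the commit text does not name the API, so treat the name as an assumption); the decorated body is ordinary NumPy code, while tensors go in and come out.

```python
import numpy as np
import torch

@torch.compile
@torch.compiler.wrap_numpy
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    return np.sum(x * y, axis=1)

x = torch.randn(8, 4)
out = numpy_fn(x, x)  # returns a torch.Tensor; the NumPy body is traced into the graph
```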

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114610
Approved by: https://github.com/albanD
2024-01-02 18:35:29 +00:00
be455921f5 Fix missing words in README.md (#116606)
minor fix to wording

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116606
Approved by: https://github.com/Skylion007
2024-01-02 18:24:58 +00:00
95a86ed9ca [Quant] Add int8 linear op gelu for quantization PT2E with Inductor. input is an int8 CPU tensor; weight is an int8 MkldnnCPU tensor (#114852)
**Summary**
Enable Int8 Linear Gelu post operator fusions for Stock PyTorch Inductor. The input is an int8 CPU tensor and weight is an int8 MkldnnCPU tensor.

**Test plan**
python test/test_quantization.py -k test_qlinear_gelu_pt2e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114852
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-01-02 08:11:26 +00:00
a81edf9f23 [inductor] Fix cpp_wrapper codegen for ir.ComplexView (#116481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116481
Approved by: https://github.com/htyu
2024-01-02 05:38:58 +00:00
cyy
b0629cdd67 [13/N] Enable clang-tidy on headers of torch/csrc (#116560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116560
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-01-02 05:33:04 +00:00
1ed8efa9b3 [MPS] Speedup addmm (#116548)
- Do not copy bias to output
- Skip respective multiplication op if either alpha or beta are equal to 1.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116548
Approved by: https://github.com/albanD
ghstack dependencies: #116547
2024-01-02 00:43:37 +00:00
abd80cbb15 [Inductor] Decompose bmm if batch2's last dim size is 1 and coordinate_descent_tuning is enabled (#116582)
We found this perf optimization opportunity at https://github.com/pytorch-labs/gpt-fast/pull/71. This would bring 5%+ perf gain for Mixtral 8x7B on gpt-fast.
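
For reference, a small sketch of the algebraic identity behind such a decomposition: when `batch2` has shape `(B, N, 1)`, the bmm reduces to a broadcasted multiply plus a sum over the last dimension (this illustrates the identity only, not the actual inductor decomposition).

```python
import torch

B, M, N = 8, 16, 32
a = torch.randn(B, M, N)
b = torch.randn(B, N, 1)  # last dim of batch2 is 1

ref = torch.bmm(a, b)                                       # (B, M, 1)
dec = (a * b.transpose(-2, -1)).sum(dim=-1, keepdim=True)   # same result, no matmul

torch.testing.assert_close(ref, dec)
```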

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116582
Approved by: https://github.com/lezcano
2024-01-01 21:24:02 +00:00
4ffe1fb7f4 [BE]: Improve typing to respect ruff PYI058 (#116588)
Tried out rule PYI058 and it flagged one typing recommendation in our codebase that would be better to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116588
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-01-01 20:49:55 +00:00
cf618452d3 [BE]: Fix F821 error in torch/fx/experimental (#116587)
Fix F821 error in torch/fx/experimental. Fixes a bug I did not fix in #116579
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116587
Approved by: https://github.com/kit1980
2024-01-01 19:45:49 +00:00
035e55822a vulkan: fix gcc build errors (#115976)
Fixes #96617

There was already an attempt to fix this build issue - see #96618. One commit is reused from this attempt (@zboszor) with adjustments to commit message. Another one differs and takes into account provided review feedback (@ezyang).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115976
Approved by: https://github.com/ezyang
2024-01-01 11:10:42 +00:00
4451ca068c [xla hash update] update the pinned xla hash (#116388)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116388
Approved by: https://github.com/pytorchbot
2024-01-01 10:30:59 +00:00
bd10fea79a [BE]: Enable F821 and fix bugs (#116579)
Fixes #112371

I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579
Approved by: https://github.com/ezyang
2024-01-01 08:40:46 +00:00
6c02520466 Remove unneeded comment and link for BuildExtension (#115496)
`BuildExtension` is no longer derived from object, but from `build_ext`. Py2 is also deprecated, so this comment wouldn't be required anyways

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115496
Approved by: https://github.com/Skylion007
2024-01-01 08:29:48 +00:00
db752f2f1a Pin the version of expecttest to 0.1.6 in requirements.txt (#116238)
Version 0.2.0 of expecttest removed the `ACCEPT` variable in this [PR](https://github.com/ezyang/expecttest/pull/11), so when someone installs Python dependencies using `pip install -r PyTorch_Root/requirements.txt`, the latest version of expecttest is installed, which causes failures in some PyTorch tests. So pinning the version of expecttest to 0.1.6, like [this](db35ccf463/.ci/docker/requirements-ci.txt (L28)), is needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116238
Approved by: https://github.com/ezyang
2024-01-01 05:25:39 +00:00
60844ccc4f [MPS][BE] Refactor common code (#116566)
Introduce `mtl_setBuffer` and `mps_dispatch1DJob` and use them to bind a
Tensor to a Metal kernel as well as dispatch the Metal job.

This avoids potential typos/bugs when one tries to bind a tensor to a
Metal kernel but forgets about the storage offset.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116566
Approved by: https://github.com/Skylion007
2024-01-01 04:58:18 +00:00
aec4377257 Optimize batch_norm_cpu_collect_stats_channels_last_impl when N <= num_threads (#113619)
Currently `batch_norm_cpu_collect_stats_channels_last_impl` uses a two-path reduction to vertically reduce from shape `{NHW, C}` to `{C}`. The first path reduces `{NHW, C}` to an intermediate buffer `{num_threads, C}`; the second path reduces `{num_threads, C}` to `{C}`.

Optimization is as follows:
- Add if/else path.

1. if `NHW > num_threads`, do the two-path reduction.
2. else `NHW <= num_threads`, do single-path reduction -- `NHW` is small enough that there is no need to first reduce to intermediate buffer.

- Moreover, when `NHW <= num_threads`, one of two methods is used, Method 1 or Method 2.
[Method 1](https://github.com/pytorch/pytorch/pull/113619/files#diff-e39a21a7125ac201b766a585b57ebf8429a7ac28cd723b09930aceb198fd25b0R372-R397): parallel on C, vertical reduce `{NHW, C} => {C}`
[Method 2](https://github.com/pytorch/pytorch/pull/113619/files#diff-e39a21a7125ac201b766a585b57ebf8429a7ac28cd723b09930aceb198fd25b0R325-R370): parallel on tiles of C, vectorized vertical reduce on each tile `{NHW, TILE_SIZE} => {TILE_SIZE}`

1. if `(num_threads == 1 || (C <= TILE_SIZE || C > THRESHOLD))`, use Method 2.
2. else, use Method 1.

- When `num_threads == 1`, there is no thread synchronization overhead, so it is better to use Method 2 than Method 1.
- When `C > THRESHOLD`, C is large enough that the benefit from tiling and vectorization outweigh the synchronization overhead.
- When `C <= TILE_SIZE`, the problem size is small enough (`C <= TILE_SIZE && NHW <= num_threads`) that it's better to launch single thread with vectorization than C threads without vectorization.
- `TILE_SIZE` is set to `16`.
- `THRESHOLD` is set to `2048`, it is an empirically found threshold to tile on C or not.

See comments for details.
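
A compact Python illustration of the dispatch heuristic above (the real kernel is C++ with explicit threading and vectorization, which is not modeled here):

```python
import torch

TILE_SIZE = 16
THRESHOLD = 2048

def collect_stats_channels_last(x: torch.Tensor, num_threads: int) -> torch.Tensor:
    # x is the channels-last input viewed as {NHW, C}; only the per-channel
    # sum is shown, to illustrate which reduction strategy is chosen.
    NHW, C = x.shape
    if NHW > num_threads:
        # Two-path reduction: {NHW, C} -> {num_threads, C} -> {C}.
        partial = torch.stack([c.sum(dim=0) for c in x.chunk(num_threads, dim=0)])
        return partial.sum(dim=0)
    if num_threads == 1 or C <= TILE_SIZE or C > THRESHOLD:
        # Method 2: tile C and vertically reduce each {NHW, TILE_SIZE} tile.
        return torch.cat([t.sum(dim=0) for t in x.split(TILE_SIZE, dim=1)])
    # Method 1: single vertical reduction, parallelized over C in the real kernel.
    return x.sum(dim=0)
```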

### Performance

Perf data collected for C in the range [2^1, 2^20], with (N,H,W) = (1,2,14) for all values of C. The values (N,H,W) = (1,2,14) were arbitrarily chosen to satisfy the condition NHW <= num_threads = 28.
Tested on 28 physical cores/socket, 1 socket on Skylake.

| **(N, H, W) = (1, 2, 14)** 	|                                                            	|               	|                                        	|
|----------------------------	|------------------------------------------------------------	|---------------	|----------------------------------------	|
|                            	| **Avg Latency (ms)**                                       	|               	|                                        	|
| **n_channel**              	| **Baseline (original implementation): two-path reduction** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
| 1048576                    	| 13.67034435                                                	| 3.059654236   	| 4.467937649                            	|
| 524288                     	| 5.230793953                                                	| 0.840408802   	| 6.224106578                            	|
| 262144                     	| 2.131233215                                                	| 0.353398323   	| 6.030682876                            	|
| 131072                     	| 0.990390778                                                	| 0.213630199   	| 4.636005491                            	|
| 65536                      	| 0.422859192                                                	| 0.107388496   	| 3.937658186                            	|
| 32768                      	| 0.224406719                                                	| 0.075747967   	| 2.962544459                            	|
| 16384                      	| 0.143647194                                                	| 0.049884319   	| 2.879606175                            	|
| 8192                       	| 0.10917902                                                 	| 0.031619072   	| 3.452948273                            	|
| 4096                       	| 0.08869648                                                 	| 0.024063587   	| 3.685920935                            	|
| 2048                       	| 0.075721741                                                	| 0.022127628   	| 3.422045038                            	|
| 1024                       	| 0.06685257                                                 	| 0.018239021   	| 3.665359477                            	|
| 512                        	| 0.051283836                                                	| 0.017580986   	| 2.917005696                            	|
| 256                        	| 0.043172836                                                	| 0.020868778   	| 2.06877642                             	|
| 128                        	| 0.042669773                                                	| 0.018148422   	| 2.351156069                            	|
| 64                         	| 0.038774014                                                	| 0.015704632   	| 2.468954                               	|
| 32                         	| 0.038630962                                                	| 0.013871193   	| 2.784977656                            	|
| 16                         	| 0.027766228                                                	| 0.008444786   	| 3.287972897                            	|
| 8                          	| 0.019891262                                                	| 0.007579327   	| 2.624410192                            	|
| 4                          	| 0.018217564                                                	| 0.008151531   	| 2.234863995                            	|
| 2                          	| 0.017716885                                                	| 0.008127689   	| 2.179818128                            	|

### Single Thread Performance
Perf data collected for C in range [2^1, 2^20], and (N,H,W) = (1,1,1) for all values of C. Values of (N,H,W)=(1,1,1) were chosen to satisfy the condition NHW <= num_threads = 1 for single thread performance.
Tested on 1 physical core/socket, 1 socket on Skylake.

| **(N, H, W) = (1, 1, 1)** 	|                                                            	|               	|                                        	|
|---------------------------	|------------------------------------------------------------	|---------------	|----------------------------------------	|
|                           	| **Avg Latency (ms)**                                       	|               	|                                        	|
| **n_channel**             	| **Baseline (original implementation): two-path reduction** 	| **Optimized** 	| **Speedup Ratio (Baseline/Optimized)** 	|
| 1048576                   	| 10.97419                                                   	| 8.390961      	| 1.307859                               	|
| 524288                    	| 4.860618                                                   	| 4.128075      	| 1.177454                               	|
| 262144                    	| 2.782302                                                   	| 1.981447      	| 1.404177                               	|
| 131072                    	| 2.105565                                                   	| 1.073592      	| 1.961234                               	|
| 65536                     	| 0.857651                                                   	| 0.523462      	| 1.63842                                	|
| 32768                     	| 0.309389                                                   	| 0.247979      	| 1.24764                                	|
| 16384                     	| 0.13869                                                    	| 0.098376      	| 1.409796                               	|
| 8192                      	| 0.072258                                                   	| 0.050876      	| 1.420263                               	|
| 4096                      	| 0.038414                                                   	| 0.027308      	| 1.40667                                	|
| 2048                      	| 0.021684                                                   	| 0.015688      	| 1.382219                               	|
| 1024                      	| 0.013294                                                   	| 0.009842      	| 1.350775                               	|
| 512                       	| 0.008659                                                   	| 0.006645      	| 1.303193                               	|
| 256                       	| 0.006964                                                   	| 0.005393      	| 1.291335                               	|
| 128                       	| 0.005918                                                   	| 0.00464       	| 1.275437                               	|
| 64                        	| 0.005324                                                   	| 0.004292      	| 1.240556                               	|
| 32                        	| 0.004981                                                   	| 0.004163      	| 1.196449                               	|
| 16                        	| 0.004833                                                   	| 0.003943      	| 1.225514                               	|
| 8                         	| 0.004768                                                   	| 0.003896      	| 1.22399                                	|
| 4                         	| 0.004828                                                   	| 0.003955      	| 1.220615                               	|
| 2                         	| 0.004776                                                   	| 0.003934      	| 1.213939                               	|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113619
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-01-01 04:43:42 +00:00
fc5fda14bc Try creating a bf16 tensor as a last resort of is_bf16_supported(). (#115924)
Fix: #115900 https://github.com/pytorch/xla/issues/6085

This PR adds a last resort for testing for BF16 support on CUDA. This is necessary on GPUs
such as RTX 2060, where `torch.cuda.is_bf16_supported()` returns False, but we can
successfully create a BF16 tensor on CUDA.

Before this PR:

```python
>>> torch.cuda.is_bf16_supported()
False
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```

After this PR:

```python
>>> torch.cuda.is_bf16_supported()
True
>>> torch.tensor([1.], dtype=torch.bfloat16, device="cuda")
tensor([...], device='cuda:0', dtype=torch.bfloat16)
```
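
A minimal sketch of what such a last-resort check can look like (not the exact code in the PR): if the existing heuristics say no, try allocating a small bf16 tensor and let success override the answer.

```python
import torch

def bf16_supported_fallback() -> bool:
    if not torch.cuda.is_available():
        return False
    try:
        torch.tensor([1.0], dtype=torch.bfloat16, device="cuda")
        return True
    except RuntimeError:
        return False
```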

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115924
Approved by: https://github.com/jansel
2024-01-01 01:15:30 +00:00
127812efee [BE]: Further improve pathlib checks in torch serialization (#116577)
Follow up #116564. `os.path` functions can accept an os.PathLike object too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116577
Approved by: https://github.com/malfet
2023-12-31 20:24:40 +00:00
4bfaa6bc25 [MPS] Fix addmm (#116547)
Remove the weird logic that designates matrices as transposed if their sizes match (which is always true when square matrices are multiplied with each other), which resulted in `torch.addmm` returning a transposed matrix compared to `torch.mm`, see below:
```
% python -c "import torch;torch.set_default_device('mps');a=torch.eye(2);b=torch.arange(4.0).reshape(2, 2);print(a@b);print(torch.addmm(torch.zeros(2, 2), a,b))"
tensor([[0., 1.],
        [2., 3.]], device='mps:0')
tensor([[0., 2.],
        [1., 3.]], device='mps:0')
```

The fixes introduced to `torch.mm` in https://github.com/pytorch/pytorch/pull/77462 suggest that this is not needed.

Modify `sample_inputs_addmm` to test `torch.addmm` with square matrices, but skip this config for `test_autograd_dense_output_addmm`, see https://github.com/pytorch/pytorch/issues/116565

TODO: probably tweak tolerances, as `test_output_match_addmm_cpu_float16` fails with 2x2 matrices, but passes using 3x3 ones with errors slightly exceeding the tolerance

Fixes https://github.com/pytorch/pytorch/issues/116331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116547
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-12-31 02:28:59 +00:00
aef06c316b [BE]: Add better handling of pathlib.Path with os calls (#116564)
Builds on #116562 to the rest of the instances of pathlib in the PyTorch.
* Uses more generic `os.PathLike` and `os.fspath` calls where appropriate
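
As a small illustration of the pattern (not the exact torch.serialization code), a function can accept any path-like object and normalize it with `os.fspath`:

```python
import os
from typing import Union

FileLike = Union[str, os.PathLike]

def open_for_write(f: FileLike):
    # os.fspath handles str, bytes, pathlib.Path, or any object with __fspath__.
    return open(os.fspath(f), "wb")
```
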
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116564
Approved by: https://github.com/malfet
2023-12-31 01:46:03 +00:00
86cd6655a1 [BE]: Use exist_ok arg for os.makedirs calls (#116561)
Optimize os.makedirs calls to use exist_ok parameter when possible to avoid unnecessary checks.
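
The change amounts to replacing the check-then-create pattern with the built-in flag, roughly:

```python
import os

cache_dir = "/tmp/example_cache"  # hypothetical path

# Before: racy check-then-create.
# if not os.path.exists(cache_dir):
#     os.makedirs(cache_dir)

# After: a single call that tolerates an existing directory.
os.makedirs(cache_dir, exist_ok=True)
```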

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116561
Approved by: https://github.com/malfet
2023-12-30 21:12:53 +00:00
4f9858a902 [BE]: Use os.fspath and os.PathLike in torch serialization (#116562)
Use the proper `os.fspath` call to convert an `os.PathLike` object to a path.
Replace `pathlib.Path` with `os.PathLike`, which is more generic and more correct for typing; `pathlib.Path` is an instance of `os.PathLike`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116562
Approved by: https://github.com/malfet
2023-12-30 20:53:10 +00:00
cyy
5aa258eb09 [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2023-12-30 18:38:53 +00:00
cyy
37aae5932c [11/N] Enable clang-tidy warnings on c10/util/*.h (#116353)
This PR enables clang-tidy coverage on c10/util/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116353
Approved by: https://github.com/albanD
2023-12-30 14:38:39 +00:00
97891b184c [Dynamo] Trace autograd.function in dynamo when inputs require grad (#116358)
For training graphs (when inputs require grad), previously we would speculate the forward and backward graphs to determine if there are any graph breaks, side effects, etc., but would not actually use these speculated graphs. We would just insert a call_function node into the graph and later rely on autograd's tracing.

This approach does not work for more generalized graphs, such as graphs that include user-defined triton kernels, because autograd is not able to do the higher-order function conversion.

This PR speculates the forward and backward functions and emits them in a HOF that later gets used via templating mechanism.

While working on this PR, I have exposed some bugs in the current tracing due to trampoline functions losing the source information resulting in incorrect graphs being produced. I have fixed these source information bugs and killed the trampolines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116358
Approved by: https://github.com/jansel
2023-12-30 01:51:30 +00:00
c5d9173d04 [BE]: Enable readability-redundant-function-ptr-dereference check (#116538)
Enable an additional clang-tidy check to remove redundant function ptr dereferences to help make the code more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116538
Approved by: https://github.com/malfet
2023-12-30 01:15:35 +00:00
5e58be678c Make collect env BC compatible (#116532)
To avoid errors like the one in https://github.com/pytorch/pytorch/issues/116531 when the user tries to run collect_env
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116532
Approved by: https://github.com/malfet
2023-12-30 01:13:37 +00:00
bd7d26bb96 [CI] Fix docker builds (#116549)
By pinning lxml to 4.9.4 as 5.0.0 is missing Python-3.9 binaries, see https://pypi.org/project/lxml/5.0.0/#files
<img width="568" alt="image" src="https://github.com/pytorch/pytorch/assets/2453524/fbd64512-b788-4bf6-9c1f-084dcedfd082">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116549
Approved by: https://github.com/houseroad, https://github.com/aakhundov
2023-12-30 00:38:14 +00:00
961fbbe967 [CI] Add initial ci build test for XPU (#116100)
Add initial CI build test for XPU, which will be triggered by label `ciflow/xpu` for current stage.

Works for RFC #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116100
Approved by: https://github.com/EikanWang, https://github.com/huydhn, https://github.com/atalman
2023-12-29 23:44:46 +00:00
de4d48df34 [c10d] Fix timeout dump path write path overlap when there are multiple PGs (#116218)
Basically we observed that if there are multiple PGs and the timeout happens on one of the sub-PGs, we somehow use the local rank in the dump file name. We realize that:
1. For setting the timeout signal in the store, any watchdog thread from any PG can do it.
2. For checking and dumping, only the watchdog thread of the default PG is needed, since we always create that PG and it contains all ranks (so there are no file name conflicts), and both the store signal and the debug-info dump are global.
3. Since the dump is global, we want to avoid sub-PG ranks polluting logs from global ranks (local rank 0 vs global rank 0), so we use global ranks here to initialize the debug info writer. (Down the road, we are thinking about making it a singleton so that users only register it once in the multi-PG case.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116218
Approved by: https://github.com/wconstab
2023-12-29 21:58:25 +00:00
db2b4078b9 Add missing cstdint includes (#116458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116458
Approved by: https://github.com/Skylion007
2023-12-29 18:30:26 +00:00
wgb
71ec3edbf7 Enhance Opinfo to support privateuse1 (#116417)
Fix the issue where OpInfo does not support third-party devices when the current test-framework instantiation method is privateuse1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116417
Approved by: https://github.com/albanD
2023-12-29 13:43:29 +00:00
e01e00fba8 fix code spell (#116530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116530
Approved by: https://github.com/albanD
2023-12-29 12:58:38 +00:00
afadfa0175 [c10d] Add stream info during nccl comm abort call (#116076)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116076
Approved by: https://github.com/XilunWu
2023-12-29 06:58:26 +00:00
e8a9d088c6 [DevX] Add tool and doc on partial debug builds (#116521)
Turned command sequence mentioned in https://dev-discuss.pytorch.org/t/how-to-get-a-fast-debug-build/1597 and in various discussions into a tool that I use almost daily to debug crashes or correctness issues in the codebase

Essentially it allows one to turn this:
```
Process 87729 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001023d55a8 libtorch_python.dylib`at::indexing::impl::applySelect(at::Tensor const&, long long, c10::SymInt, long long, c10::Device const&, std::__1::optional<c10::ArrayRef<c10::SymInt>> const&)
libtorch_python.dylib`at::indexing::impl::applySelect:
->  0x1023d55a8 <+0>:  sub    sp, sp, #0xd0
    0x1023d55ac <+4>:  stp    x24, x23, [sp, #0x90]
    0x1023d55b0 <+8>:  stp    x22, x21, [sp, #0xa0]
    0x1023d55b4 <+12>: stp    x20, x19, [sp, #0xb0]
```
into this
```
Process 87741 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001024e2628 libtorch_python.dylib`at::indexing::impl::applySelect(self=0x00000001004ee8a8, dim=0, index=(data_ = 3), real_dim=0, (null)=0x000000016fdfe535, self_sizes= Has Value=true ) at TensorIndexing.h:239:7
   236 	    const at::Device& /*self_device*/,
   237 	    const c10::optional<SymIntArrayRef>& self_sizes) {
   238 	  // See NOTE [nested tensor size for indexing]
-> 239 	  if (self_sizes.has_value()) {
   240 	    auto maybe_index = index.maybe_as_int();
   241 	    if (maybe_index.has_value()) {
   242 	      TORCH_CHECK_INDEX(
```
while retaining good performance for the rest of the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116521
Approved by: https://github.com/atalman
2023-12-29 05:15:35 +00:00
df85a920cf [Inductor][Observability] Add logging for split cat pass (#116442)
Summary: Add logs for both in the pre and post grad passes

Test Plan:
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split_batch
```
[2023-12-26 17:14:24,203] [0/0] torch._inductor.fx_passes.post_grad: [INFO] counters of inductor dict after apply the split cat in the post grad pass: Counter({'pattern_matcher_nodes': 4076, 'pattern_matcher_count': 2917, 'remove_split_with_size_one': 1322, 'split_cat_norm': 461, 'consecutive_split_merged': 371, 'scmerge_cat_removed': 41, 'scmerge_cat_added': 32, 'scmerge_split_removed': 28, 'getitem_cat_merged': 11, 'batch_fusion': 7, 'scmerge_split_sections_removed': 3, 'scmerge_split_added': 2, 'split_squeeze_replaced': 2})

[2023-12-26 17:16:28,437] torch._inductor.fx_passes.post_grad: [INFO] counters of inductor dict after apply the split cat in the post grad pass: Counter({'pattern_matcher_nodes': 4122, 'pattern_matcher_count': 2935, 'remove_split_with_size_one': 1322, 'split_cat_norm': 461, 'consecutive_split_merged': 371, 'scmerge_cat_removed': 41, 'batch_fusion': 39, 'scmerge_cat_added': 32, 'scmerge_split_removed': 28, 'getitem_cat_merged': 11, 'scmerge_split_sections_removed': 3, 'scmerge_split_added': 2, 'split_squeeze_replaced': 2})

Differential Revision: D52425400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116442
Approved by: https://github.com/yanboliang
2023-12-29 05:10:45 +00:00
8deaa13417 [EZ][Distributed] Add 'c10d' to distributed TORCH_LOG comment (#116526)
Address the comment in https://github.com/pytorch/pytorch/pull/116434, which I confused in the first beginning. Let's add c10d to the comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116526
Approved by: https://github.com/XilunWu
2023-12-29 04:40:37 +00:00
ef94499ad7 [executorch hash update] update the pinned executorch hash (#116474)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116474
Approved by: https://github.com/pytorchbot
2023-12-29 03:13:51 +00:00
240121587a [vision hash update] update the pinned vision hash (#116524)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116524
Approved by: https://github.com/pytorchbot
2023-12-29 03:08:39 +00:00
cab79ceb51 [Inductor Intel GPU backend Upstream] Step 2: Register and add Intel GPU Inductor backend (#116330)
Right after the first PR https://github.com/pytorch/pytorch/pull/116020, this PR focuses on generalizing device-biased runtime code used in the basic workflow, including triton kernel generation, codecache, and autotuning.

 Feature request: #114856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116330
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire
2023-12-29 02:49:37 +00:00
8173d98c57 [quant][be] Skip conv-bn folding when there are no batchnorm ops (#116440)
Summary:
`_fold_conv_bn_qat` is currently taking a long time, so skip it when it is not necessary.
Follow-up fixes can actually reduce the patterns or cache them where possible.
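
A minimal sketch of the early-exit idea, using a loose string match on the node target (the real pattern set and matching are more precise):

```python
import torch

def _has_batch_norm(gm: torch.fx.GraphModule) -> bool:
    return any(
        n.op == "call_function" and "batch_norm" in str(n.target)
        for n in gm.graph.nodes
    )

def _fold_conv_bn_qat_sketch(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    if not _has_batch_norm(gm):
        return gm  # nothing to fold; skip the expensive pattern matching
    ...  # run the (slow) conv-bn folding patterns
    return gm
```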

Test Plan:
uncomment the print in `test_speed`, run

python test/test_quantization.py -k test_speed

and make sure the convert time is low, e.g. 0.1s instead of 8-9 seconds

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116440
Approved by: https://github.com/andrewor14
2023-12-28 23:33:21 +00:00
33917150d3 Cleanup scope ref properly (#116169)
Fixes https://github.com/pytorch/pytorch/issues/116143

See test in PR for a case where this happens. Discovered while debugging optimizers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116169
Approved by: https://github.com/janeyx99, https://github.com/williamwen42, https://github.com/jansel
2023-12-28 23:29:37 +00:00
4371939751 Removing HTA documentation (#116513)
Removing HTA documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116513
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet, https://github.com/atalman
2023-12-28 23:04:23 +00:00
8220d5c66d Support pathlib.Path as input to torch.load when mmap=True (#116104)
Fixes #116103

This now works:

```py
import torch
from pathlib import Path

file = Path("example.pt")
torch.save(torch.rand(5, 3), file)
torch.load(file, mmap=True)   # works!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116104
Approved by: https://github.com/mikaylagawarecki
2023-12-28 22:54:11 +00:00
02e2158e75 Fix for out of bounds read in mobile interpreter INTERFACE_CALL opcode handler (#110301)
Summary:
The INTERFACE_CALL opcode for the mobile TorchScript interpreter contained an out of bounds read issue leading to memory corruption.

This change adds an explicit check that the number of inputs passed to the format method called when handling the INTERFACE_CALL opcode is a valid and within bounds of the stack.

Test Plan: contbuild + OSS signals

Differential Revision: D49739450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110301
Approved by: https://github.com/dbort
2023-12-28 22:09:03 +00:00
7e12e722af [Dynamo][12/N] Remove allowed_functions.py (#116401)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116401
Approved by: https://github.com/angelayi
2023-12-28 21:26:06 +00:00
439f2a6c1f [RelEng] Missing signal for release branches (#116516)
Run slow/periodic and inductor workflows on push to release branches

Right now there is no signal from those jobs on release branches at all.
This will run periodic jobs on every commit to the release branch, which is fine, as they are short-lived and have much lower traffic than regular jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116516
Approved by: https://github.com/clee2000
2023-12-28 20:19:55 +00:00
4af1c27fa8 Migrate repr, deterministic state_dict test to OptimizerInfo (#116496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116496
Approved by: https://github.com/albanD
ghstack dependencies: #116471
2023-12-28 19:49:04 +00:00
f3c4395358 [BE] Add helper in common_optimizers to get all optim inputs (#116471)
This will be a common utility in test_optim.py. Printing out the optimizer inputs when using this helper looks reasonable:

For local test plan, click below.
<details>

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d186986c)]$ python test/test_optim.py -vv -k test_step_is_noop_when_params_have_no_grad
test_step_is_noop_when_params_have_no_grad_ASGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 26 tests in 19.089s

OK
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116471
Approved by: https://github.com/albanD
2023-12-28 19:49:04 +00:00
577529daec [Dynamo] Implement a simple mutation tracker for user defined triton kernels (#116466)
This PR adds a very simple mutation tracking mechanism to dynamo which can later be improved to be more thorough. Currently it allows tensors to be in tl.load but if it sees a tensor used anywhere else (including a tl.load), it bails out.

One question about the method: is `ast.NodeVisitor` the best thing to use here? Having to detect mutations with it is not exactly pretty, since you need to keep setting state at each transition.
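
For illustration only, a minimal sketch of this kind of conservative `ast.NodeVisitor` (not the actual dynamo code; the names are made up):

```python
import ast

class SimpleMutationDetector(ast.NodeVisitor):
    """Conservatively flag a kernel if a tensor argument is used outside tl.load."""

    def __init__(self, tensor_args):
        self.tensor_args = set(tensor_args)
        self.might_mutate = False

    def visit_Call(self, node):
        f = node.func
        is_tl_load = (
            isinstance(f, ast.Attribute)
            and isinstance(f.value, ast.Name)
            and f.value.id == "tl"
            and f.attr == "load"
        )
        if is_tl_load:
            return  # tensors are allowed inside tl.load; don't descend
        self.generic_visit(node)

    def visit_Name(self, node):
        if node.id in self.tensor_args:
            self.might_mutate = True  # any other use: assume mutation and bail out


src = "def kern(x_ptr):\n    y = tl.load(x_ptr)\n    tl.store(x_ptr, y)"
detector = SimpleMutationDetector(["x_ptr"])
detector.visit(ast.parse(src))
assert detector.might_mutate  # the tl.store use of x_ptr trips the detector
```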

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116466
Approved by: https://github.com/aakhundov
2023-12-28 18:59:44 +00:00
f10c3f4184 Fix module pre bw hooks when input doesn't req grad but gradients are changed by the user (#116454)
As per title.

FYI @vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116454
Approved by: https://github.com/mikaylagawarecki
2023-12-28 18:32:50 +00:00
fb91acd33b [release] Add specific section about building and testing final rc (#116476)
Formalize the process of building and testing the final RC, to avoid having missing PRs in the release similar to this one: https://github.com/pytorch/pytorch/pull/114197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116476
Approved by: https://github.com/huydhn
2023-12-28 15:25:08 +00:00
b5e83b8c50 Fix edge case for size 1 channels dim in AdaptiveMaxPool (#116482)
Fixes https://github.com/pytorch/pytorch/issues/107842

Unlike `AdaptiveAvgPool`, `AdaptiveMaxPool` does not have a CUDA kernel for ChannelsLast. We work around this by calling `contiguous()` on the input. However, there is an edge case when the channels dimension has size 1.

```python
>>> t = torch.randn(2, 1, 3, 3)
>>> t.stride()
(9, 9, 3, 1)
>>> t_c =  t.to(memory_format=torch.channels_last)
>>> t_c.stride()
(9, 1, 3, 1)  # (CHW, 1, CW, C)
>>> t_c.is_contiguous()
True  # contiguity check doesn't check strides for singleton dimensions
```

Since the CUDA kernel treats the batch, `B`, and channels, `C`, dimensions as implicitly flattened and increments the data pointer for `input` to the start of the next plane using

669b182d33/aten/src/ATen/native/cuda/AdaptiveMaxPooling2d.cu (L67)

If our input falls into the aforementioned edge case, the `data_ptr` will not be incremented correctly. The simple fix for this is to calculate the stride for the channels dimension using $\prod_{i > 1}size(i)$

Analogous fix for the 3D case.
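
A small Python illustration of that stride computation (the actual fix is applied in the CUDA kernel's host code; this is only a sketch of the arithmetic):

```python
import math

def channels_plane_stride(sizes):
    # Number of elements in one (H, W) plane: prod(size(i) for i > 1).
    # Using this instead of the reported stride gives the correct pointer
    # increment even when the channels dim has size 1.
    return math.prod(sizes[2:])

# For the edge case above, shape (2, 1, 3, 3):
assert channels_plane_stride((2, 1, 3, 3)) == 9  # advance by 9 elements per plane
```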

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116482
Approved by: https://github.com/albanD
2023-12-28 15:02:29 +00:00
dfc898ede4 Don't decompose functional ops in predispatch functionalization (#116383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116383
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115188, #115210
2023-12-28 11:54:04 +00:00
80c07df659 Update doc for the constraints of FractionalMaxPool2d (#116261)
Fixes [#115531](https://github.com/pytorch/pytorch/issues/115531)
Update doc for the constraints of FractionalMaxPool2d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116261
Approved by: https://github.com/mikaylagawarecki
2023-12-28 06:55:36 +00:00
d791074c81 Clean up PyTorch op BC check list (#116468)
Summary: Remove the expired items.

Test Plan: CI

Differential Revision: D52435764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116468
Approved by: https://github.com/feikou
2023-12-28 06:05:59 +00:00
6243dbb5c0 [DTensor][BE] unify PlacementStrategy print function (#116428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116428
Approved by: https://github.com/wanchaol
ghstack dependencies: #115683, #115689
2023-12-28 01:10:20 +00:00
87fea086aa [DTensor] remove experimental DTensor op backward layer norm (#115689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115689
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #115683
2023-12-28 01:10:20 +00:00
575f17ebd4 [DTensor] add layer norm backward support (#115683)
**Summary**
This PR adds a DTensor implementation for the ATen op `native_layer_norm_backward`.

**Test Plan**
pytest test/distributed/_tensor/test_math_ops.py -s -k layer_norm
pytest test/distributed/_tensor/test_dtensor_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115683
Approved by: https://github.com/wanchaol
2023-12-28 01:10:10 +00:00
b3f7fdbf0a Add decomp for pad_sequence (#116285)
Summary: currently pad_sequence causes symbolic shape specialization in export, which is unintended. Adding a decomp seems to avoid the C++ kernel that caused the specialization.
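
A rough sketch of what such a decomposition could look like (illustrative only; the decomp actually registered by this PR may differ):

```python
import torch

def pad_sequence_decomp(sequences, batch_first=False, padding_value=0.0):
    # Pad every sequence to the max length along dim 0, then stack.
    # Staying in plain tensor ops avoids the C++ kernel that specialized shapes.
    max_len = max(seq.size(0) for seq in sequences)
    padded = []
    for seq in sequences:
        pad_rows = max_len - seq.size(0)
        pad = seq.new_full((pad_rows, *seq.shape[1:]), padding_value)
        padded.append(torch.cat([seq, pad], dim=0))
    out = torch.stack(padded, dim=0)             # (batch, max_len, ...)
    return out if batch_first else out.transpose(0, 1)
```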

Test Plan: buck test mode/opt caffe2/test:test_export -- -r pad_sequence

Reviewed By: SherlockNoMad

Differential Revision: D52345667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116285
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-27 23:56:51 +00:00
d59350cc1c [Dynamo] Consolidate common constant types (#116366)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116366
Approved by: https://github.com/Skylion007
2023-12-27 23:54:35 +00:00
6375eb15ef [Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116365
Approved by: https://github.com/jansel
2023-12-27 23:50:35 +00:00
53e32d12c4 [c10] Use nested namespace in c10/cuda (#116464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116464
Approved by: https://github.com/Skylion007
2023-12-27 23:14:00 +00:00
93b86bf531 [GHF] Implement stacked revert (#116447)
By adding `get_ghstack_dependent_prs`, which uses `git branch --contains` to find all PRs containing the stacked branch, selects the longest one (in terms of distance between origin and the default branch), and skips all open PRs.

Please note that reverts should be applied in the reverse of the order in which the PRs were originally landed.

Use a bit of defensive programming, i.e. revert a single PR if the attempt to fetch its dependencies fails for some reason.
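
As a loose sketch of the selection logic described above (all names and helpers here are hypothetical, not the `trymerge` API):

```python
def plan_stacked_revert(pr, find_dependent_chains, is_open):
    try:
        # Branches that contain this PR's commit, each as a list of PRs
        # ordered from origin to the default branch.
        chains = find_dependent_chains(pr)
        chain = max(chains, key=len, default=[pr])   # pick the longest chain
    except Exception:
        return [pr]                                  # defensive fallback: revert just this PR
    closed = [p for p in chain if not is_open(p)]    # skip PRs that are still open
    return list(reversed(closed))                    # revert in reverse landing order
```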

Test plan:
 - Lint
 -  ```
    >>> from trymerge import GitRepo, GitHubPR, get_ghstack_prs, get_ghstack_dependent_prs
    >>> pr=GitHubPR("pytorch", "pytorch", 115188)
    >>> pr1=GitHubPR("pytorch", "pytorch", 115210)
    >>> repo=GitRepo("/Users/nshulga/git/pytorch/pytorch")
    >>> get_ghstack_dependent_prs(repo, pr1)
    [('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x100f32f40>)]
    >>> get_ghstack_dependent_prs(repo, pr)
    [('22742d93a5357c9b5b45a74f91a6dc5599c9c266', <trymerge.GitHubPR object at 0x10102eaf0>), ('76b1d44d576c20be79295810904c589241ca1bd2', <trymerge.GitHubPR object at 0x10102eb50>)]
    >>> rc=get_ghstack_dependent_prs(repo, pr)
    >>> rc[0][1].pr_num
    115210
    >>> rc[1][1].pr_num
    115188
    ```
 - see: https://github.com/malfet/deleteme/pull/59#issuecomment-1869904714 and https://github.com/malfet/deleteme/pull/74#issuecomment-1870542702

Fixes https://github.com/pytorch/test-infra/issues/4845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116447
Approved by: https://github.com/huydhn
ghstack dependencies: #116446
2023-12-27 23:01:16 +00:00
5fcc2519f5 [GHF] Refactors (#116446)
Prep change for allowing stacked reverts

This is a no-op that factors out some helper functions that would be
useful later:
 - `get_pr_commit_sha` finds a committed sha for a given PR
 - `_revlist_to_prs` converts a revlist to GitHubPRs conditionally
   filtering some out
 - `do_revert_prs` reverts multiple PRs in a batch, but so far is
   invoked with only one PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116446
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-12-27 23:01:16 +00:00
85628c0e57 Revert "[export] Update range constraints to runtime_var_to_range (#115427)"
This reverts commit f8ad664cf267bcbdd8f8f85e27ad3a6e7d9fa86f.

Reverted https://github.com/pytorch/pytorch/pull/115427 on behalf of https://github.com/angelayi due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/115427#issuecomment-1870671728))
2023-12-27 22:44:45 +00:00
a17069684c Improve nn.modules.activation and batchnorm docs (#113531)
Fixes #112602

For some reason, I could not get the same output when running the pycodestyle command as indicated in the issue. I manually ran ruff checks, fixing the following issues: `D202`, `D204`, `D205`, `D207`, `D400` and `D401`.

### Requested output

nn.modules.activation:
before: 135
after: 79

nn.modules.batchnorm
before: 21
after: 3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113531
Approved by: https://github.com/mikaylagawarecki
2023-12-27 21:06:47 +00:00
3149e4a667 [dynamo] fix sum() function with start argument (#116389)
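A hypothetical repro of the pattern being fixed (illustrative only; not the test added in the PR):

```python
import torch

def f(xs):
    # builtin sum() with an explicit start value, traced by dynamo
    return sum(xs, torch.zeros(2))

out = torch.compile(f)([torch.ones(2), torch.ones(2)])
print(out)  # tensor([2., 2.])
```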
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116389
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-12-27 20:42:27 +00:00
83502feabe [BE]: Enable readability-simplify-subscript-expr clang-tidy check (#116356)
[BE]: enable clang-tidy check for readability-simplify-subscript-expr which looks for unnecessarily complex subscripting of the underlying data array of STL types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116356
Approved by: https://github.com/lezcano
2023-12-27 20:22:20 +00:00
8d84b5041c [pt-vulkan] Address CLANGTIDY warnings in api, graph, and impl folders (#116431)
## Context

**Currently, `*.h` and `*.cpp` produce many lint warnings/errors from `clang-tidy` in the Meta internal Phabricator mirror**. These changes address all the lint warnings in the `api`, `graph`, and `impl` folders in preparation for upcoming planned work.

## Review Guide

* Most changes are the result of automatically applied patches from `clang-tidy`
  * However, some warnings had to be manually addressed
  * There should be no functional changes
* Many of the `clang-tidy` warnings arose from the `facebook-hte-BadMemberName` rule which checks for compliance with variable naming rules from Meta's internal C++ style guide
  * However, the rest of the ATen codebase does not conform to this rule, and PyTorch Vulkan was written to be consistent with ATen's naming conventions; thus, to stay consistent with the rest of ATen, this rule is disabled wherever relevant using `// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName`
* Lint was disabled entirely for `vulkan_api_test.cpp` since there are too many warnings to address at the moment. Addressing all of them will be a small project of its own; thus, in the interim lint will be disabled to reduce distracting signals for developers.

Internal:

## Notes for Internal Reviewers

This diff was largely created with

```
cd ~/fbsource/xplat/caffe2/aten/src/ATen/native/vulkan
arc lint -e extra -a --take CLANGTIDY * 2>&1 | tee ~/scratch/lint.txt
```

The above command automatically applied patches suggested by `clang-tidy`, and the rest of the warnings were addressed manually.

To disable `facebook-hte-BadMemberName`, I found that disabling it via a `.clang-tidy` file didn't work with `arc lint`, and the only way that worked was adding a comment

```
// @lint-ignore-every CLANGTIDY facebook-hte-BadMemberName
```

Differential Revision: [D50336057](https://our.internmc.facebook.com/intern/diff/D50336057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116431
Approved by: https://github.com/GregoryComer, https://github.com/kirklandsign
2023-12-27 19:29:18 +00:00
bbe3261dd3 [BE]: Use iterable.chain.from_iterable where possible (#116376)
This is more readable and more efficient when dealing with lots of sequences to chain together.
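
For reference, `chain.from_iterable` lives in the standard-library `itertools` module; a minimal comparison:

```python
import itertools

groups = [[1, 2], [3], [4, 5, 6]]

eager = list(itertools.chain(*groups))              # unpacks every group up front
lazy = list(itertools.chain.from_iterable(groups))  # consumes the outer iterable lazily

assert eager == lazy == [1, 2, 3, 4, 5, 6]
```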

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376
Approved by: https://github.com/albanD
2023-12-27 19:20:07 +00:00
e0e90bc0d4 Revert "[dynamo] fix sum() function with start argument (#116389)"
This reverts commit 3c9076f070fab5b27eae3b7846755c98b7c97a1a.

Reverted https://github.com/pytorch/pytorch/pull/116389 on behalf of https://github.com/kit1980 due to Breaks Meta-internal tests, but the issue could have been caught on GitHub ([comment](https://github.com/pytorch/pytorch/pull/116389#issuecomment-1870556927))
2023-12-27 19:05:55 +00:00
5c9464fb51 add CALL_FINALLY opcode (#116159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116159
Approved by: https://github.com/yanboliang
2023-12-27 19:01:08 +00:00
f657b2b1f8 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom user functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-27 18:47:05 +00:00
87da0e1d23 [GHF] Fix gh_get_labels for small repos (#116444)
Not sure if this is a recent API change or what, but `gh_get_labels('malfet', 'deleteme')` used to raise an exception (see https://github.com/malfet/deleteme/actions/runs/7334535266/job/19971328673#step:6:37 )
```
  File "/home/runner/work/deleteme/deleteme/.github/scripts/label_utils.py", line 50, in get_last_page_num_from_header
    link_info[link_info.rindex(prefix) + len(prefix) : link_info.rindex(suffix)]
AttributeError: 'NoneType' object has no attribute 'rindex'
```

And with this fix it returns the expected list
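
A hypothetical sketch of the kind of guard that avoids calling `rindex` on `None` (names mirror the traceback; the actual fix in label_utils.py may differ):

```python
def get_last_page_num_from_header(headers: dict) -> int:
    link_info = headers.get("link")
    # Small repos fit all labels on one page, so GitHub returns no "link"
    # header at all; fall back to a single page instead of crashing.
    if link_info is None:
        return 1
    prefix, suffix = "page=", ">"
    return int(link_info[link_info.rindex(prefix) + len(prefix): link_info.rindex(suffix)])
```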

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116444
Approved by: https://github.com/huydhn
2023-12-27 15:50:42 +00:00
e14026bc2a [CUDNN] RNNv6 API deprecation support (#115719)
The cuDNN RNNv6 API has been deprecated and support will be dropped in an upcoming release; this PR migrates to the newer API to support newer cuDNN versions that would otherwise break the build.

Note that it may not be tested yet in upstream CI if the upstream CI cuDNN version is less than 8.9.7.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115719
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-27 09:31:08 +00:00
0aa5b751bb [executorch hash update] update the pinned executorch hash (#116438)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116438
Approved by: https://github.com/pytorchbot
2023-12-27 09:13:54 +00:00
924f1b841a [optim] Allow torch.float64 scalars for forloop + foreach implementations (#115841)
Should allow for use cases mentioned in #110940

This would allow scalars to also be float64s in the foreach implementation. The fused implementation would still create a float32 step on Adam and AdamW. This PR also does NOT worry about performance and is mainly for enablement.

Next steps:
- Relax the constraint on fused adam(w) and allow torch.float64 scalars there
- Allow _performant_ mixed dtypes in foreach (a bigger project in itself).

This PR will conflict with my other PRs, I will figure out a landing order

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115841
Approved by: https://github.com/albanD
2023-12-27 09:13:49 +00:00
1d13086492 [BE] force DTensorTestBase.build_device_mesh to use world_size rather than NUM_DEVICES constant (#116439)
**Test**:
`python test/distributed/fsdp/test_shard_utils.py -k test_create_chunk_dtensor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116439
Approved by: https://github.com/wanchaol
2023-12-27 07:37:07 +00:00
6b91e6907e Add setUserEnabledNNPACK config (#116152)
When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack-optimized convolution kernel path for certain shapes ([code pointer](cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552))). This means that we will automatically create a guard on that specific shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function.

Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110.

To test the flag, the following script runs successfully:
```
import os

import torch
from torchvision.models import ResNet18_Weights, resnet18

torch.set_float32_matmul_precision("high")

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    # device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.backends.mkldnn.set_flags(False)
    torch.backends.nnpack.set_flags(False)   # <--- Added config
    device = "cpu"
    model = model.to(device=device)
    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)
    batch_dim = torch.export.Dim("batch", min=2, max=32)
    so_path = torch._export.aot_compile(
        model,
        example_inputs,
        # Specify the first dimension of the input x as dynamic
        dynamic_shapes={"x": {0: batch_dim}},
        # Specify the generated shared library path
        options={
            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
            "max_autotune": True,
        },
    )

```

I'm not sure who to add as reviewer, so please feel free to add whoever is relevant!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152
Approved by: https://github.com/malfet
2023-12-27 06:00:16 +00:00
9c3ae37fc4 [Distributed] Add finer granularity tag for distributed submodule (#116434)
This PR is the start of integrating PyTorch distributed logs into Torch LOGs. We already have one tag, "distributed", for all distributed components, but distributed is a very large component and we want some hierarchy, giving users the option to turn on logs only for certain submodules. So we also added tags starting with "dist_*" for each submodule. (This PR only adds some of them; we are going to add more down the road.)

Related discussions can be found here: https://github.com/pytorch/pytorch/issues/113544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116434
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-12-27 04:09:34 +00:00
2c89e5a5e5 [inductor] Sort unbacked symbols before iterating on them (#116421)
get_unbacked_symbol_defs and get_unbacked_symbol_uses inconsistently return dicts vs. sets. Most use cases of these methods rely on set membership, which is deterministic, but set iteration is non-deterministic. Therefore, in the one place where we iterate through unbacked symbols, we sort by the symbol name before iterating to preserve determinism.

Another approach would be to have these functions consistently return dictionaries, where the key of the dictionary is the name of the symbol. I'm happy to do that approach if we think it's likely future code will forget to sort before iteration.
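
A minimal sketch of the determinism issue and the sorting fix described above (hypothetical symbol names; `sympy` used only for illustration):

```python
import sympy

# Unbacked symbols live in a set: membership tests are deterministic,
# but iteration order can vary between processes (hash randomization).
unbacked = {sympy.Symbol("u2"), sympy.Symbol("u0"), sympy.Symbol("u1")}

# Sorting by the symbol's name before iterating restores a stable order.
for sym in sorted(unbacked, key=lambda s: s.name):
    print(sym)  # u0, u1, u2
```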

Fixes #113130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116421
Approved by: https://github.com/oulgen, https://github.com/aakhundov
2023-12-27 03:35:58 +00:00
362bc6d7cb Fixed a segfault issue when passing an empty kernel to quantized_max_… (#116342)
…pool1d.

Fixes #116323.

Reused the same check as for `max_pool1d`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116342
Approved by: https://github.com/jerryzh168
2023-12-27 01:22:49 +00:00
d0395239c1 [DTensor] allow OpStrategy to represent ops whose return type is a tuple (#115682)
**Summary**:
Ops like `native_layer_norm_backward` return a tuple of optional torch.Tensor.
This PR allows using OpStrategy to represent `native_layer_norm_backward`'s
return value sharding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115682
Approved by: https://github.com/wanchaol
2023-12-27 00:44:11 +00:00
44b98c09ca [BE] migrate all assertRaises tests to OptimizerInfo test_errors (#116315)
Removes a part of the sparse adam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s

OK
```

Increases coverage by running the duplicate-param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where a bare tensor param was accidentally unbound via list() instead of being put into a list. This bug was caught by migrating the weird warning handling to one simple warning context manager, which checks that nothing else gets raised.
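
A small sketch of the list()-vs-[...] mix-up behind the SparseAdam bug (hypothetical parameter; not the optimizer's internal code):

```python
import torch

param = torch.zeros(3, requires_grad=True)

# list(tensor) iterates the tensor, i.e. effectively torch.unbind(param, 0),
# yielding three 0-dim views instead of one 3-element parameter.
unbound = list(param)
assert len(unbound) == 3 and unbound[0].dim() == 0

# The intended construction: wrap the parameter itself in a list.
opt = torch.optim.SparseAdam([param], lr=0.1)
```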

The new test_errors does not run slower than before, overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s

OK
```

Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s

OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
2023-12-27 00:08:31 +00:00
8abeacda6f Refactor user defined triton kernel tests (#116425)
I will be adding more triton tests of different types, so I'm moving them to a brand new file. While doing this, I also cleaned up some flake8 linting opt-outs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116425
Approved by: https://github.com/aakhundov
2023-12-26 23:54:26 +00:00
3b709d7c1e Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)"
This reverts commit 015bd0e0a189f929e469c6bc75fe1541c18a014d.

Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))
2023-12-26 23:47:15 +00:00
13505898c9 Revert "[Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)"
This reverts commit 951da38800f66e2d2bb2bb8e87e12218d1e28b8c.

Reverted https://github.com/pytorch/pytorch/pull/116365 on behalf of https://github.com/kit1980 due to Need to revert this because of https://github.com/pytorch/pytorch/pull/116312 ([comment](https://github.com/pytorch/pytorch/pull/116365#issuecomment-1869824468))
2023-12-26 23:43:45 +00:00
0aa185f394 [BE] Make torch.cuda.has_magma a build time check (#116299)
Perhaps originally one needed to query about GPU capability, but right now it's a simple check for a build time flag: 52f0457d7d/aten/src/ATen/cuda/detail/CUDAHooks.cpp (L165-L171)

Alternative, to avoid `at::hasMAGMA()` call  one can implement it as follows:
```cpp
  const auto use_magma = caffe2::GetBuildOptions().at("USE_MAGMA");
  return PyBool_FromLong(use_magma == "1");
```

Make this check very similar to `_has_mkldnn`
0978482afa/torch/csrc/Module.cpp (L1793-L1794)

Test plan:
 Run `lldb -- python3 -c "import torch;print(torch.cuda.has_magma)"` and make sure it returns True and that `cuInit` is not called

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116299
Approved by: https://github.com/seemethere, https://github.com/albanD
2023-12-26 23:37:23 +00:00
0edc348788 Revert "[Dynamo] Consolidate common constant types (#116366)"
This reverts commit 36dccc2aba61a2637aa5d42f38b6fd1fe10dcbdc.

Reverted https://github.com/pytorch/pytorch/pull/116366 on behalf of https://github.com/kit1980 due to Need to revert this because of https://github.com/pytorch/pytorch/pull/116312 ([comment](https://github.com/pytorch/pytorch/pull/116366#issuecomment-1869821625))
2023-12-26 23:36:52 +00:00
e86636266f [Quantized] Fixed equal_quantized_cpu for QUInt4 (#116307)
- Return false if scalar_type is different (because QInt8 and QUInt8 have identical item_size but shouldn't be compared by comparing data)
- Compute data_size correctly for QUInt4x2 and QUInt2x4 dtypes
- Add regression test

Fixes https://github.com/pytorch/pytorch/issues/116087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116307
Approved by: https://github.com/jerryzh168
2023-12-26 21:52:28 +00:00
e5bcfe205e [inductor] fix cpp_wrapper inputs mismatch (#116197)
Summary: fixes https://github.com/pytorch/pytorch/issues/115035, where in the cpp_wrapper JIT inductor, the input args should contain the lifted parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116197
Approved by: https://github.com/jansel
2023-12-26 21:41:47 +00:00
7571511af9 [inductor] More tweaks to fusion logs (#115084)
I think it's more useful to print out actual fusions rather than
possible fusions.

I also updated `speedup_by_fusion`'s logs to include the node names in
the log output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115084
Approved by: https://github.com/jansel, https://github.com/aakhundov
2023-12-26 20:25:57 +00:00
6051f9f404 multiply int8/uint8 for AVX512 (#116346)
Summary: multiply int8/uint8 for AVX512

Test Plan: sandcastle, github

Differential Revision: D52393918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116346
Approved by: https://github.com/jgong5
2023-12-26 19:44:05 +00:00
51eef859eb min, max, clamp* for AVX2 hosts (#116236)
Summary: min, max, clamp* for AVX2 hosts

Test Plan: sandcastle, github

Differential Revision: D52353148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116236
Approved by: https://github.com/alexsamardzic, https://github.com/malfet
2023-12-26 19:43:43 +00:00
427ecc61c0 [Easy][BE]: Fix none type comparison (#116399)
Simplifies the comparison: checking types is unneeded since None is a singleton, so any variable set to None refers to the same None object and an identity check suffices.
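
A minimal before/after sketch of the simplification (hypothetical variable):

```python
x = None

# Before: comparing types does extra work for no benefit.
if type(x) == type(None):
    print("x is unset")

# After: None is a singleton, so an identity check is enough and idiomatic.
if x is None:
    print("x is unset")
```
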
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116399
Approved by: https://github.com/XuehaiPan, https://github.com/lezcano, https://github.com/malfet
2023-12-26 19:36:34 +00:00
0978482afa Revert "Implement aten::upsample_linear1d on mps (#115031)"
This reverts commit c6969cb8a93a7dfd3f1bf17716470174bb973076.

Reverted https://github.com/pytorch/pytorch/pull/115031 on behalf of https://github.com/malfet due to Broke lint, will fwd fix and re-land ([comment](https://github.com/pytorch/pytorch/pull/115031#issuecomment-1869693081))
2023-12-26 18:01:49 +00:00
f4230ec9fd [inductor] Remove the float16 restriction for cpu cpp_wrapper (#116205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116205
Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel
2023-12-26 16:01:20 +00:00
Kai
c6969cb8a9 Implement aten::upsample_linear1d on mps (#115031)
Related to #77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115031
Approved by: https://github.com/malfet
2023-12-26 15:44:21 +00:00
4c6e842496 [inductor][cpp] load as scalar for the index invariant in the vector range (#116387)
For the test `test_expr_vec_non_contiguous`, the index_expr `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` is invariant under the vector range of `x2`.
Before change
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<int, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
                                }
                                return at::vec::Vectorized<int>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = at::vec::Vectorized<int>(tmp1);
                            auto tmp3 = to_float_mask(tmp0 < tmp2);
                            auto tmp4 = [&]
                            {
                                auto tmp5 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (vector_lane_mask_check(tmp3, x1_inner))
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp5;
                            }
                            ;
                            auto tmp6 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp3)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```
After change
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<int>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (tmp2 != 0)
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp4;
                            }
                            ;
                            auto tmp5 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp2)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp3())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp3(), to_float_mask(tmp2));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp5);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116387
Approved by: https://github.com/EikanWang, https://github.com/lezcano
ghstack dependencies: #114545
2023-12-26 08:45:04 +00:00
3c9076f070 [dynamo] fix sum() function with start argument (#116389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116389
Approved by: https://github.com/Skylion007
2023-12-26 06:37:55 +00:00
cyy
bb2a1e9941 Enable readability-redundant-smartptr-get in clang-tidy (#116381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116381
Approved by: https://github.com/Skylion007
2023-12-26 06:05:15 +00:00
ffe6f9ac91 [inductor cpp] support vectorization for index_expr that depends on tiling itervar or with indirect indexing (#114545)
As the title says, this PR enables vectorization for the situation where the index_expr depends on the vectorized itervar. There are two cases here:
1. The vectorized itervar has constant stride in the index_expr. We vectorize the index_expr with `Vectorized<int32>::arange` for this case.
2. Otherwise, we load the index_expr vector in a non-contiguous way with a loop.

Below is the generated code for the first case from the test `test_concat_inner_vec`. Here `x1` is the index_expr and depends on the vectorized itervar `x1`. It has constant stride 1. We vectorized it with arange. We use `all_zero` to implement a short-cut for masks to avoid unnecessary execution of nested masked regions which are invalid.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(155L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = c10::convert<long>(x1);
                    auto tmp1 = static_cast<long>(0);
                    auto tmp2 = tmp0 >= tmp1;
                    auto tmp3 = static_cast<long>(35);
                    auto tmp4 = tmp0 < tmp3;
                    auto tmp5 = [&]
                    {
                        auto tmp6 = in_ptr0[static_cast<long>(x1 + (35L*x0))];
                        return tmp6;
                    }
                    ;
                    auto tmp7 = tmp4 ? tmp5() : static_cast<decltype(tmp5())>(0.0);
                    auto tmp8 = tmp0 >= tmp3;
                    auto tmp9 = static_cast<long>(155);
                    auto tmp10 = tmp0 < tmp9;
                    auto tmp11 = [&]
                    {
                        auto tmp12 = in_ptr1[static_cast<long>((-35L) + x1 + (120L*x0))];
                        return tmp12;
                    }
                    ;
...
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(144L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = c10::convert<int>(x1);
                    auto tmp1 = at::vec::Vectorized<int32_t>::arange(tmp0, 1);
                    auto tmp2 = static_cast<int>(0);
                    auto tmp3 = at::vec::Vectorized<int>(tmp2);
                    auto tmp4 = to_float_mask(tmp1 >= tmp3);
                    auto tmp5 = static_cast<int>(35);
                    auto tmp6 = at::vec::Vectorized<int>(tmp5);
                    auto tmp7 = to_float_mask(tmp1 < tmp6);
                    auto tmp8 = [&]
                    {
                        auto tmp9 = masked_load(in_ptr0 + static_cast<long>(x1 + (35L*x0)), to_float_mask(tmp7));
                        return tmp9;
                    }
                    ;
                    auto tmp10 =
                    [&]
                    {
                        if (all_zero(to_float_mask(tmp7)))
                        {
                            return at::vec::Vectorized<float>(static_cast<float>(0.0));
                        }
                        else
                        {
                            return decltype(tmp8())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp8(), to_float_mask(tmp7));
                        }
                    }
                    ()
                    ;
...
```

Below is the generated code for the second case from the test case `test_expr_vec_non_contiguous`. Here, the index_expr is `31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L))` which depends on the vectorized itervar `x2` and doesn't have constant stride. So, we load the index_expr vector with a loop. (In fact, this can be further optimized since the index_expr is invariant with the data points in the range [x2, x2+16). So it can be regarded as a scalar. This will be optimized in the follow-up PR.) The code uses `vector_lane_mask_check` to implement the masked version of non-contiguous load.
Before:
```c++
            #pragma omp for  collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(1L))
                {
                    {
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 = c10::convert<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (c10::div_floor_integer(x2, 32L)));
                            auto tmp1 = static_cast<long>(2048);
                            auto tmp2 = tmp0 < tmp1;
                            auto tmp3 = [&]
                            {
                                auto tmp4 = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer(x1, 32L))) + (2048L*(static_cast<long>(x1) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                return tmp4;
                            }
                            ;
                            auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp5);
                        }
                        out_ptr0[static_cast<long>(x1 + (1024L*x0))] = tmp_acc0;
                    }
                }
            }
```
After:
```c++
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(4L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
                {
                    {
                        #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity())})
                        float tmp_acc0 = -std::numeric_limits<float>::infinity();
                        at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(1024L); x2+=static_cast<long>(1L))
                        {
                            auto tmp0 =
                            [&]
                            {
                                __at_align__ std::array<int, 16> tmpbuf;
                                #pragma GCC unroll 16
                                for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                {
                                    tmpbuf[x1_inner] = static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (c10::div_floor_integer(x2, 32L)));
                                }
                                return at::vec::Vectorized<int>::loadu(tmpbuf.data());
                            }
                            ()
                            ;
                            auto tmp1 = static_cast<int>(2048);
                            auto tmp2 = at::vec::Vectorized<int>(tmp1);
                            auto tmp3 = to_float_mask(tmp0 < tmp2);
                            auto tmp4 = [&]
                            {
                                auto tmp5 =
                                [&]
                                {
                                    __at_align__ std::array<float, 16> tmpbuf;
                                    #pragma GCC unroll 16
                                    for (long x1_inner = 0; x1_inner < 16; x1_inner++)
                                    {
                                        if (vector_lane_mask_check(tmp3, x1_inner))
                                        {
                                            tmpbuf[x1_inner] = in_ptr0[static_cast<long>(31L + (63L*(c10::div_floor_integer((x1 + x1_inner), 32L))) + (2048L*(static_cast<long>((x1 + x1_inner)) % static_cast<long>(32L))) + (65536L*x0) + (c10::div_floor_integer(x2, 32L)))];
                                        }
                                    }
                                    return at::vec::Vectorized<float>::loadu(tmpbuf.data());
                                }
                                ()
                                ;
                                return tmp5;
                            }
                            ;
                            auto tmp6 =
                            [&]
                            {
                                if (all_zero(to_float_mask(tmp3)))
                                {
                                    return at::vec::Vectorized<float>(static_cast<float>(0.0));
                                }
                                else
                                {
                                    return decltype(tmp4())::blendv(at::vec::Vectorized<float>(static_cast<float>(0.0)), tmp4(), to_float_mask(tmp3));
                                }
                            }
                            ()
                            ;
                            tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp6);
                        }
                        tmp_acc0_vec.store(out_ptr0 + static_cast<long>(x1 + (1024L*x0)));
                    }
                }
            }
        }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114545
Approved by: https://github.com/lezcano
2023-12-26 05:36:39 +00:00
a254fbfd61 Initialize variable for all codepaths in dynamo benchmarks (#116260)
Sometimes the first statement that sets this variable in the try block fails due to out-of-memory issues; the finally block then tries to delete the variable even though it was never assigned in the first place.
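
A minimal sketch of the failure mode and the fix (hypothetical names; not the benchmark harness itself):

```python
def run_once(make_inputs):
    inputs = None  # initialize up front so every code path defines the name
    try:
        inputs = make_inputs()  # may raise, e.g. CUDA out of memory
        # ... run and time the model ...
    finally:
        # Without the initialization above, `del inputs` would raise
        # UnboundLocalError whenever make_inputs() failed.
        del inputs
```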

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116260
Approved by: https://github.com/lezcano
2023-12-26 05:15:39 +00:00
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG=DETAIL` to turn on a debug feature which calculates a hash of the input tensors and output results of c10d collectives in NCCL. This is a debugging feature so that we can rule bugs in or out at the c10d level.
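
A rough sketch of the idea (the real feature hashes on the C++ side inside the NCCL process group; this helper and its names are hypothetical):

```python
import hashlib
import torch

def debug_hash(t: torch.Tensor) -> str:
    # Hash the raw bytes of a contiguous CPU copy so identical data
    # produces an identical digest on every rank, before and after a collective.
    return hashlib.sha256(t.detach().contiguous().cpu().numpy().tobytes()).hexdigest()[:16]

x = torch.arange(8, dtype=torch.float32)
print(debug_hash(x))
```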

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
039fbeb016 [dynamo] fix functools.reduce() function with None as initial (#116398)
The `initial` argument in `functools.reduce` can be `None`.

```python
initial_missing = object()

def reduce(function, iterable, initial=initial_missing, /):
    it = iter(iterable)
    if initial is initial_missing:
        value = next(it)
    else:
        value = initial
    for element in it:
        value = function(value, element)
    return value
```

Reference:

- python/cpython#102759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116398
Approved by: https://github.com/Skylion007
2023-12-25 21:23:28 +00:00
c7e9c15102 Ignore SIGINT in codecache workers (#116380)
Fixes #116379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116380
Approved by: https://github.com/Skylion007
2023-12-25 08:59:54 +00:00
951da38800 [Dynamo][11/N] allow_in_graph/disallow_in_graph decorator refactor (#116365)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116365
Approved by: https://github.com/jansel
2023-12-25 07:15:09 +00:00
22742d93a5 Expose functional IR to capture_pre_autograd (#115210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115210
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115188
2023-12-25 04:51:21 +00:00
76b1d44d57 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-25 04:51:21 +00:00
36dccc2aba [Dynamo] Consolidate common constant types (#116366)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116366
Approved by: https://github.com/Skylion007
2023-12-24 22:58:01 +00:00
199e07f108 [pytree][BE] update treespec num_children access (#116370)
Change `len(treespec.children_specs)` -> `treespec.num_children`.
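
A tiny sketch of the access pattern being updated (using the private pytree API referenced in the message):

```python
import torch.utils._pytree as pytree

_, spec = pytree.tree_flatten({"a": 1, "b": (2, 3)})

# Before: len(spec.children_specs)
# After:  the dedicated property
print(spec.num_children)  # 2 top-level children: "a" and "b"
```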

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116370
Approved by: https://github.com/Skylion007
2023-12-24 20:54:32 +00:00
81cebca3d2 [Inductor] [Quant] Fix QConv Binary Inplace Layout Issue (#115613)
This pull request primarily addresses two issues to resolve the `QConvPointWiseBinaryPT2E` layout problem:

- Following the changes made in 611a7457ca, for `QConvPointWiseBinaryPT2E` with post-op `sum`, we should also utilize `NoneLayout` and return `accum` instead of `QConvPointWiseBinaryPT2E`.

- Additionally, this pull request fixes an issue in the `_quantized_convolution_onednn` implementation. Given that we expect `accum` to be inplace changed, we should avoid copying `accum` by changing the memory format or data type inside the kernel implementation. Instead, we have moved the necessary changes of memory format or data type to the lowering of `QConvPointWiseBinaryPT2E`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115613
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #116172
2023-12-24 08:04:29 +00:00
dfb6815170 [Reland] [PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#116172)
**Summary**
Re-land https://github.com/pytorch/pytorch/pull/115329. Open a new PR since the origin branch has been deleted.
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.

**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116172
Approved by: https://github.com/kit1980
2023-12-24 08:00:21 +00:00
7cdbdc789d [executorch hash update] update the pinned executorch hash (#116362)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116362
Approved by: https://github.com/pytorchbot
2023-12-24 05:02:05 +00:00
f1cdb39da3 [dynamo] Fix handling of one_hot (#116338)
Fixes #115817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116338
Approved by: https://github.com/yanboliang
2023-12-24 04:55:35 +00:00
dbbe8485b4 Fake Tensor refactors part 2 (#116345)
This should help trace time a bit.
This refactors `op_implementations` (which requires O(n) checks per op) to mostly use a dict with O(1) cost per op.
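
A minimal sketch of the O(n)-to-O(1) dispatch change (hypothetical op names and handlers, not the FakeTensor code):

```python
# Before: scan a list of (predicate, handler) pairs -- O(n) per op.
op_impls_list = [
    (lambda op: op == "aten::add", lambda: "add impl"),
    (lambda op: op == "aten::mul", lambda: "mul impl"),
]

def dispatch_slow(op):
    for matches, handler in op_impls_list:
        if matches(op):
            return handler()
    return None

# After: key handlers directly by op -- O(1) per op for the common case.
op_impls_dict = {
    "aten::add": lambda: "add impl",
    "aten::mul": lambda: "mul impl",
}

def dispatch_fast(op):
    handler = op_impls_dict.get(op)
    return handler() if handler is not None else None

assert dispatch_slow("aten::mul") == dispatch_fast("aten::mul") == "mul impl"
```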

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116345
Approved by: https://github.com/yanboliang
2023-12-24 04:54:50 +00:00
6c419a0efd Fixed a segfault when calling topk on a quantized scalar tensor. (#116337)
Fixes #116324.

Added an extra check for empty sizes (=scalars) when running `topk` on quantized tensors. Added a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116337
Approved by: https://github.com/Skylion007
2023-12-23 23:21:12 +00:00
3a4fe835cc Fixed segfault when trying to permute empty tensor (#116335)
Fixes #116325.

Fixed unchecked access to first element of `dims` when permuting an empty tensor. Added test to prevent regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116335
Approved by: https://github.com/Skylion007
2023-12-23 23:14:28 +00:00
015bd0e0a1 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on custom user functions rather than torch functions. I'll defer this to a separate decorator refactor PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
4912922297 Fake Tensor refactors part 1 (#116344)
These are mostly small performance optimizations to move constant list construction into global scope and replace O(n) `x in list` checks with O(1) `x in dict` checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116344
Approved by: https://github.com/yanboliang
2023-12-23 08:38:26 +00:00
08b404e3a2 [Dynamo] Remove ExecutionRecorder.MOD_EXCLUDES during replay & record (#116347)
Remove ```ExecutionRecorder.MOD_EXCLUDES``` since now torch python modules are wrapped as ```PythonModuleVariable``` after #115724.
This was reported from Meta-internal use cases, where it triggered a failure when replay & record was enabled. The enablement came from ```TORCH_COMPILE_DEBUG=1``` rather than from an actual need for the feature; according to a conversation with the team members, they are not really using it. Since we don't maintain replay & record well, we can probably remove it from our codebase to avoid such issues in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116347
Approved by: https://github.com/jansel
2023-12-23 08:13:14 +00:00
cyy
7663ffb673 [10/N] Fixes clang-tidy warnings in c10/util/*.h (#116326)
Still a continued work for clean up c10/util/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116326
Approved by: https://github.com/Skylion007
2023-12-23 04:59:55 +00:00
84b2a32359 [executorch hash update] update the pinned executorch hash (#115599)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115599
Approved by: https://github.com/huydhn
2023-12-23 04:07:23 +00:00
60f4114769 Support nn_module_stack in non_strict mode (#116309)
Summary: Title

Test Plan: CI

Differential Revision: D52382672

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116309
Approved by: https://github.com/zhxchen17
2023-12-23 03:34:58 +00:00
0931170a13 [vision hash update] update the pinned vision hash (#116343)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116343
Approved by: https://github.com/pytorchbot
2023-12-23 03:16:06 +00:00
4f4b931aba [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16 precision, which may lead to overflow since the intermediate result is unnormalized.
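
A small numeric illustration of the overflow (hypothetical sizes and values):

```python
import torch

x = torch.full((100_000,), 60.0, dtype=torch.float16)

# The unnormalized intermediate (sum of squares) is ~3.6e8, far above
# float16's max of ~65504, so the float16 result is inf.
print((x * x).sum(dtype=torch.float16))  # inf

# Accumulating in the opmath type (float32) keeps the intermediate finite.
print((x.float() * x.float()).sum())     # 3.6e8
print(x.float().var())                   # 0.0 after normalization
```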

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-23 01:06:43 +00:00
65c5eed01d [sigmoid] Remove workaround for constant output. (#116288)
Summary: no more workaround_export_bug_constant_buffer_output

Test Plan:
buck2 run mode/dev-nosan //scripts/ads_pt2_inference:pt2_cli -- --src_model manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/473164617/6/gpu_lowering/input.predictor.disagg.gpu.merge

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:generate_merge_net_file -- --action=generate --lower_backend=aot_inductor_ep --input_file=/data/users/zhxchen17/fbsource/fbcode/input.predictor.disagg.gpu.merge --output_file=/tmp/409501788_66.predictor.disagg.gpu.merge

buck2 run mode/opt -c fbcode.nvcc_arch=a100 caffe2/torch/fb/model_transform/fx2trt/packaging:load_merge_net_predictor -- --loadMode=Normal --inputMergeNetFile=/tmp/409501788_66.predictor.disagg.gpu.merge --pytorch_predictor_sigmoid_enabled=true

Reviewed By: khabinov, SherlockNoMad

Differential Revision: D52210429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116288
Approved by: https://github.com/tugsbayasgalan
2023-12-22 20:33:09 +00:00
3f9e9ecfe4 Fix torch.detach doc-string (#115850)
Fixes https://github.com/pytorch/pytorch/issues/98976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115850
Approved by: https://github.com/albanD
2023-12-22 20:04:33 +00:00
b940fa2fce Delete unused global variable (#116228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116228
Approved by: https://github.com/angelayi
ghstack dependencies: #116225, #116226
2023-12-22 19:07:59 +00:00
f08c4da86d Add a decomposition for take() (#114813)
Presumably this can close https://github.com/pytorch/pytorch/pull/109784

Also related to https://github.com/pytorch/pytorch/issues/93757 (though `take` is not listed there).

There's no bounds checking here (out of bounds indices cause a segfault or undefined behavior). Should that be added somehow?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114813
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-22 18:14:57 +00:00
341c4227a8 Update F32 sparse semi-structured support for CUTLASS back-end (#116017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116017
Approved by: https://github.com/jcaip
2023-12-22 16:53:04 +00:00
0b9146bf5d [BE][Easy]: Update ruff to 0.1.9 (#116290)
Updates the ruff linter with lots of bugfixes, speed improvements, and fix improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116290
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-12-22 15:26:02 +00:00
0e39f4db92 Disables denormal floating numbers on ARM CPU (#115184)
**Motivation:**
Denormal numbers are used to store extremely small numbers that are close to 0, and operating on them can incur extra computational cost. To address the low performance caused by denormal numbers, PyTorch supports flushing them to zero via torch.set_flush_denormal().

Currently set_flush_denormal() is only supported on x86 architectures supporting SSE3 (https://pytorch.org/docs/stable/generated/torch.set_flush_denormal.html), and we now want to extend this functionality to the ARM architecture.

**This PR:**
- Supports set_flush_denormal() on ARM.
- Datatypes supported and tested: FP64, FP32, BFloat16
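
A short usage sketch of the API this PR extends to ARM (mirroring the documented behavior):

```python
import torch

# Returns True if the platform supports flushing denormals
# (x86 with SSE3, and ARM after this change).
if torch.set_flush_denormal(True):
    print(torch.tensor([1e-323], dtype=torch.float64))  # tensor([0.]) -- flushed
    torch.set_flush_denormal(False)
    print(torch.tensor([1e-323], dtype=torch.float64))  # tensor([9.8813e-324])
```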

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115184
Approved by: https://github.com/jgong5
2023-12-22 13:56:46 +00:00
cyy
9a0c217a0a [9/N] Fixes clang-tidy warnings in c10/util/*.h (#116185)
Continued work to clean headers in c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116185
Approved by: https://github.com/Skylion007
2023-12-22 09:35:44 +00:00
c7514ccc8c Delete unused API again (#116226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116226
Approved by: https://github.com/angelayi
ghstack dependencies: #116225
2023-12-22 09:30:00 +00:00
7a6cb9fdfb [Inductor Intel GPU backend Upstream] Step 1/3: Generalize device-bias code in code generation. (#116020)
As the [RFC](https://github.com/pytorch/pytorch/issues/114856) mentions, this is step 1 of adding the Intel GPU backend as an alternative Inductor backend.

### Design
Typically, in order to integrate Intel GPU backend into Inductor, we need to inherit from `WrapperCodegen` and `TritonScheduling` and implement the corresponding subclasses respectively. However, since `WrapperCodegen` and `TritonScheduling` have some device-bias code generation **scattered** in their methods, overriding them in subclasses would introduce a lot of duplicated parent class code.
For example:
2a44034895/torch/_inductor/codegen/wrapper.py (L487)

2a44034895/torch/_inductor/codegen/triton.py (L1996)

So we abstract the device-bias code scattered in WrapperCodegen and TritonScheduling and provide a unified interface, "DeviceOpOverrides". This way, when integrating a new backend, we can maximize the reuse of `WrapperCodegen` and `TritonScheduling` code by inheriting from and implementing this interface for device flexibility.

Currently `DeviceOpOverrides` only covers Python wrapper code generation. We can further extend it to cover C++ wrapper code generation on demand.
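
A rough sketch of the shape of such an interface (method names here are illustrative, not necessarily the exact ones added by the PR):

```python
class DeviceOpOverrides:
    """Collects the device-specific snippets the Python wrapper codegen emits."""

    def import_get_raw_stream_as(self, name: str) -> str:
        raise NotImplementedError

    def device_guard(self, device_idx: int) -> str:
        raise NotImplementedError

class CUDADeviceOpOverrides(DeviceOpOverrides):
    def import_get_raw_stream_as(self, name: str) -> str:
        return f"from torch._C import _cuda_getCurrentRawStream as {name}"

    def device_guard(self, device_idx: int) -> str:
        return f"torch.cuda._DeviceGuard({device_idx})"

# A hypothetical Intel GPU backend would only implement this interface,
# reusing WrapperCodegen/TritonScheduling unchanged.
```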

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116020
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-12-22 08:42:51 +00:00
7d0ad6e870 Make native c10d_functional ops work with AOTInductor (#113735)
Summary:
- Revised `c10d_functional` ops to conform to https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native#func
- Modified `get_cpp_op_schema()` to handle mutable args and aliasing returns

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113735
Approved by: https://github.com/desertfire
ghstack dependencies: #113438
2023-12-22 08:12:13 +00:00
718b576e2c Port all_to_all_single to native c10d_functional (#113438)
Summary:
- Ported `all_to_all_single` to native c10d_functional
- Added Inductor support for the native `all_to_all_single` via the new collective IR's `create_out_of_place()`
- Since the new collective IR derives from `FallbackKernel` which implements a generic `free_unbacked_symbols`, no additional unbacked symbol handling for all_to_all_single is required

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113438
Approved by: https://github.com/yf225, https://github.com/ezyang
2023-12-22 08:12:13 +00:00
cb489e769c Delete unused API (#116225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116225
Approved by: https://github.com/angelayi
2023-12-22 06:38:47 +00:00
b6473065c6 [AMD] Fix build for intra_node_comm (#116291)
Summary: amd build is broken

Test Plan:
```
buck-out/v2/gen/fbcode/75c2b50d9f8b18d8/caffe2/__fb_libtorch_hipify_gen_eqsb_torch/csrc/distributed/c10d/intra_node_comm.hip__/out/torch/csrc/distributed/c10d/intra_node_comm.hip:37:1: error: non-void function does not return a value [-Werror,-Wreturn-type]
}
^
1 error generated when compiling for gfx90a.
```

Now it's gone

Differential Revision: D52373348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116291
Approved by: https://github.com/yifuwang
2023-12-22 05:51:50 +00:00
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
ad3c0b2c00 [torch.export] fixes for unlifting lifted tensor constants (#116266)
Summary: lifted tensor constants were not being treated the same way as named buffers when unlifting, i.e. they were not getting the name correction that converts "." in FQNs to "_" to form valid attribute names. Additionally, future torchbind object support will allow objects to be registered, so only call register_buffer for lifted constants if the value is a tensor.
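
A minimal sketch of the two behaviors described above (hypothetical names, not the export/unlift code):

```python
import torch

mod = torch.nn.Module()
fqn, value = "sub.block.scale", torch.ones(3)   # a lifted tensor constant

# Attribute names cannot contain ".", so the FQN is corrected first.
attr_name = fqn.replace(".", "_")

# Only register tensors as buffers; future torchbind objects would be
# attached through a different mechanism.
if isinstance(value, torch.Tensor):
    mod.register_buffer(attr_name, value)

print(hasattr(mod, "sub_block_scale"))  # True
```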

Differential Revision: D52367846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116266
Approved by: https://github.com/angelayi
2023-12-22 04:46:25 +00:00
cyy
764b4cd44e Remove outdated string function wrapper for Android and Caffe2 (#116186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116186
Approved by: https://github.com/janeyx99
2023-12-22 04:31:56 +00:00
b47aa69685 [c10d] Fix the hang issue in store.check(TIMEOUT_DUMP) (#116297)
Summary:
We have found that the root cause of the hang is NOT the destruction of stores. The hang in check() only happens when the store is of type FileStore.

The file held by each FileStore was a temp file created by Python's tempfile module; by default it was deleted when the file was closed.

Note that the file was opened and closed by every check() in the watchdog and in the constructor of FileStore.

Then, when check() tried to open the deleted file again, open() would fail after the timeout value (5 minutes by default), hence the hang.

The fix is simple: just avoid the default deletion after the file is closed.
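
A minimal sketch of the tempfile behavior behind the hang (standard-library only; not the c10d code):

```python
import os
import tempfile

# delete=True (the default) removes the file as soon as it is closed,
# so a later re-open -- like FileStore's check() -- finds nothing.
f = tempfile.NamedTemporaryFile()
f.close()
print(os.path.exists(f.name))  # False

# delete=False keeps the backing file around for later re-opens.
g = tempfile.NamedTemporaryFile(delete=False)
g.close()
print(os.path.exists(g.name))  # True
os.unlink(g.name)              # cleanup is now the caller's responsibility
```
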
Test Plan:

1. We first reproduce the hang in check() in the existing unit test:
   test_init_process_group_for_all_backends by enabling the
   DumpOnTimeOut and making the main thread sleep for 2s, to give enough time for tempfile
   to be deleted
2. Adding log to check ref count of fileStore and also the sequence of
   file opening and closing
3. With the repro, an exception will be thrown as "no such file or
   directory' and unit test would fail
4. Verify the tests now passes with the above knob change
5. add an unit test in test_c10d_nccl to cover the fileStore check() code path
python test/distributed/test_c10d_common.py ProcessGroupWithDispatchedCollectivesTests
python test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_file_store_check
Reviewers:

Subscribers:

Tasks:
T173200093
Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116297
Approved by: https://github.com/fduwjj
ghstack dependencies: #116296
2023-12-22 04:04:30 +00:00
94f3781145 Fixed bug with unpickling integers > 64-bits (#115264)
Fixes #115234

Currently, the unpickling code does not support integers larger than 64 bits in size. However, these are part of what the Python unpickling code supports.

See `pickle.py` in CPython:
```
def decode_long(data):
    r"""Decode a long from a two's complement little-endian binary string.

    >>> decode_long(b'')
    0
    >>> decode_long(b"\xff\x00")
    255
    >>> decode_long(b"\xff\x7f")
    32767
    >>> decode_long(b"\x00\xff")
    -256
    >>> decode_long(b"\x00\x80")
    -32768
    >>> decode_long(b"\x80")
    -128
    >>> decode_long(b"\x7f")
    127
    """
    return int.from_bytes(data, byteorder='little', signed=True)
```

E.g.:
```
>>> int.from_bytes(bytearray(b'\xff\xff\xff\xff\xff\xff\xff\xff\x00'), byteorder='little', signed=True)
18446744073709551615
```

This PR makes it so that integers of arbitrary size are supported with JS BigNums.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115264
Approved by: https://github.com/zdevito
2023-12-22 03:17:34 +00:00
9736deae76 [vision hash update] update the pinned vision hash (#109957)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109957
Approved by: https://github.com/pytorchbot
2023-12-22 03:12:23 +00:00
db25462ffd [quant][pt2e] Relax constraints on dtype and qscheme to allow for customizations (#116287)
Summary:
att

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116287
Approved by: https://github.com/kimishpatel
2023-12-22 03:12:04 +00:00
fdf8718225 Update reviewes for PyTorch Distributed (#116296)
Summary:
Add shuqiangzhang as a reviewer
Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116296
Approved by: https://github.com/fduwjj
2023-12-22 02:49:51 +00:00
4b97ed2ed8 [SparseCompressed] support csc layout for add sparse/dense. (#115433)
`add`, when passed one sparse and one dense argument, will error if the sparse argument does not have csr layout. This PR modifies the underlying algorithm to be generic over the compressed dimension, handling both csr and csc. The functions are renamed to use the `sparse_compressed` qualifier rather than `sparse_csr`.

Fixes: #114807
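
A minimal usage sketch of the behavior this enables (illustrative only, not taken from the PR's tests):

```python
import torch

dense = torch.ones(3, 3)
sparse_csc = torch.eye(3).to_sparse_csc()

# Previously only a CSR-layout sparse operand was accepted here.
out = torch.add(dense, sparse_csc)
# The sparse-on-the-LHS pattern is handled by permuting arguments (see #115432).
out_swapped = torch.add(sparse_csc, dense)
```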

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115433
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
ghstack dependencies: #115432
2023-12-22 01:47:55 +00:00
910baa3a03 [SparseCompressed] Support add(sparse_compressed, dense) (#115432)
Addition involving sparse compressed and dense arguments is implemented
with the requirement that the dense tensor be on the LHS. This change adds support
for the other pattern, `sparse + dense`, by permuting arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115432
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-12-22 01:47:55 +00:00
suo
d2d129de65 [sigmoid] replace unflatten with upstream version (#115468)
as title

Differential Revision: [D52000213](https://our.internmc.facebook.com/intern/diff/D52000213/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115468
Approved by: https://github.com/zhxchen17
2023-12-22 00:56:19 +00:00
127cae7ec8 [C10D] Increase TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC (#116267)
Change default from 2 min to 10 min.

Why? Many cases of heartbeat timeout were reported, but increasing
timeout led to the same job hanging in a different place, suggesting
heartbeat kill was working well and not a false positive.  However, some
others reported jobs running fine with increased timeouts.  One such
case was investigated below, and suggests that indeed a 2 min timeout is
too aggressive.  While we have not fully root caused the issue, it
is better to avoid killing jobs that would otherwise complete.

Current theory is that watchdog is not totally deadlocked, but is slowed
down in its processing of work objs due to some intermittent resource
contention.  Hence, allowing more time is more of a workaround than a
fix.

Debug/Analysis:
https://docs.google.com/document/d/1NMNWoTB86ZpP9bqYLZ_EVA9byOlEfxw0wynMVEMlXwM

Differential Revision: [D52368791](https://our.internmc.facebook.com/intern/diff/D52368791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116267
Approved by: https://github.com/fduwjj
2023-12-22 00:47:45 +00:00
d6de2df6b6 Improve the error message when a PR lacks the necessary approvals (#116161)
The error message from https://github.com/pytorch/pytorch/pull/115329#issuecomment-1857135047 is pretty confusing because it lists some random `pytorch/metamates` folks from `superuser` merge rule.  My attempt here is to make the error message clearer by pointing out:

* All the matching merge rules and
* Their list of approvers

The message will now become:

```
Approvers from one of the follow rules are needed:
- Core Reviewers (1, 2, 3, 4, 5, ...)
- Core Maintainers (1, 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116161
Approved by: https://github.com/malfet, https://github.com/PaliC, https://github.com/atalman, https://github.com/ZainRizvi
2023-12-22 00:22:43 +00:00
99f7e721fe [inductor] make inductor work with new triton compile interface (#115878)
Recent 2 triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile, this PR added the necessary change on inductor side to work with both old and new compile API.

Also there is some simplification between compilation call in subprocess and the one in main process
- previously we passed warm_cache_only=True if the compilation happened in a subprocess. But triton never uses that argument in the currently pinned version, so I removed it.
- previously we only passed compute_capability if compilation happened in a subprocess. The PR changes that to always pass compute_capability to triton.compile, no matter whether the compilation happens in the main or a sub process.

Updated:
There are more interface change from triton side. E.g.
- tl.math.{min, max} now require a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply not passing it when benchmarking the matmul triton kernel works for both the old and new versions of triton.
- triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail handling the triton Autotuner object, since dynamo's TritonKernelVariable makes assumptions about attribute names. It's exercised in some test cases where a model calls the triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-22 00:09:29 +00:00
247f9c3de4 Preserve strides of custom Triton kernel args (#116219)
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).

In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretVew` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it in the `triton_kernel_wrap` only for now) as opposed to extending the existing `clone` is keeping the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change the semantics which is not necessarily desirable in general. Open to suggestions, though.
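
A hypothetical illustration (not from this PR) of why the strides matter: a custom Triton kernel that receives a tensor pointer plus its strides as plain integers, so a contiguity-changing clone of the tensor would silently desynchronize the two (requires a CUDA device and triton).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_strided(x_ptr, out_ptr, stride_x0, stride_x1, N: tl.constexpr):
    # Each program copies one row of a strided 2D input into a contiguous output.
    row = tl.program_id(0)
    cols = tl.arange(0, N)
    vals = tl.load(x_ptr + row * stride_x0 + cols * stride_x1)
    tl.store(out_ptr + row * N + cols, vals)

x = torch.randn(4, 8, device="cuda").t().contiguous().t()  # non-contiguous, strides (1, 4)
out = torch.empty(4, 8, device="cuda")
copy_strided[(4,)](x, out, x.stride(0), x.stride(1), N=8)
```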

Test Plan:

```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
2023-12-21 22:46:32 +00:00
a27ed4d364 [dynamo / DDP] Add optimize_ddp_lazy_compile config to control lazy compile for DDPOptimizer (False by default) (#116292)
We want to enable `optimize_ddp_lazy_compile` by default as soon as possible, because it will fix stride mismatch errors (see motivation: https://github.com/pytorch/pytorch/pull/114154).

However, lazy compile currently causes shape mismatch in other cases (`test_graph_split_inductor_transpose`) and we need to fix them before we can enable it by default.
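
A hedged usage sketch (the config module path is assumed from similar dynamo knobs; only the flag name comes from this commit):

```python
import torch._dynamo.config as dynamo_config

# Opt in to lazy compilation in DDPOptimizer; defaults to False per this PR.
dynamo_config.optimize_ddp_lazy_compile = True
```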

Differential Revision: D52373445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116292
Approved by: https://github.com/williamwen42, https://github.com/wconstab
2023-12-21 22:34:24 +00:00
1e834e0e50 Fix bug in mem_eff kernel with attention mask and MQA (#116234)
# Summary

Found using the repros mentioned in this issue: #112577

After many go rounds with compute-sanitizer and eventual printf debugging I feel pretty confident that this was the underlying issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116234
Approved by: https://github.com/malfet, https://github.com/danthe3rd, https://github.com/atalman
2023-12-21 21:52:21 +00:00
52f0457d7d Support view returns for functional inverses on narrowing views (#115893)
Part 1 of implementation for general [subclass view fake-ification](https://docs.google.com/document/d/1C5taWiplmX7nKiURXDOAZG2W5VNJ2iV0fQFq92H0Cxw).

The following functional inverses are currently implemented scatter-style and thus never return views:
* `as_strided_copy_inverse()`
* `diagonal_copy_inverse()`
* `expand_copy_inverse()`
* `select_copy_int_inverse()`
* `slice_copy_Tensor_inverse()`
* `split_copy_Tensor_inverse()`
* `split_with_sizes_copy_inverse()`
* `unbind_copy_int_inverse()`
* `unfold_copy_inverse()`

We need to get actual views for the introduction of reverse view funcs coming next.

Details:
* Use `as_strided()` to implement actual view inverses for the above
    * Assumes we're given a mutated_view that is actually part of a bigger storage; this isn't really the case for functionalization
* Introduce `InverseReturnMode` enum for customization of functional inverses
    * `AlwaysView` - always return an actual view; needed for reverse view_funcs()
    * `NeverView` - always do a copy; useful for certain functionalization use cases (e.g. XLA, executorch)
    * `ViewOrScatterInverse` - return an actual view in most cases, but prefer scatter inverses when they exist. this avoids the need to implement `as_strided()` for subclasses, which can be difficult or impossible
* Make sure functionalization works as before
    * Use `ViewOrScatterInverse` when reapply_views TLS is True or `NeverView` otherwise
    * Adds tests to ensure old behavior for above inverses **in functionalization**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115893
Approved by: https://github.com/bdhirsh
2023-12-21 21:39:22 +00:00
suo
b5c866db13 [export] Add FlatArgsAdapter to unflatten (#115467)
This is the final divergence between our internal/external unflatteners.

Differential Revision: [D52001135](https://our.internmc.facebook.com/intern/diff/D52001135/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115467
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115466, #115795
2023-12-21 20:52:36 +00:00
suo
01ec3d1113 [export] upstream some final fixes to OSS unflatten (#115795)
as title

Differential Revision: [D52141387](https://our.internmc.facebook.com/intern/diff/D52141387/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115795
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115466
2023-12-21 20:52:36 +00:00
suo
bc3ef1684e [export] refactor unflatten.py to be a top-level API (#115466)
This is in preparation for the merging of the internal and external versions of
the unflattener. Unflatten needs to be its own API because we are adding more
options to it in forthcoming diffs.

Differential Revision: [D52001133](https://our.internmc.facebook.com/intern/diff/D52001133/)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115466
Approved by: https://github.com/zhxchen17
2023-12-21 20:52:29 +00:00
497777e302 Revert "Mark set_ as an inplace view op (#115769)"
This reverts commit cd449e260c830c9ce0f06ed4833b46aa638f1529.

Reverted https://github.com/pytorch/pytorch/pull/115769 on behalf of https://github.com/jeanschmidt due to breaking landing signals internally, more details on the diff, author is tagged ([comment](https://github.com/pytorch/pytorch/pull/115769#issuecomment-1866846607))
2023-12-21 19:53:32 +00:00
0e63837ec7 [dynamo] Skip some tests using scipy.kstest (#116263)
These tests are failing in CI with this error
```
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/torch/_dynamo/variables/builder.py", line 1126, in wrap_numpy_ndarray
    value.flags.writeable = True
    ^^^^^^^^^^^^^^^^^^^^^
torch._dynamo.exc.InternalTorchDynamoError: cannot set WRITEABLE flag to True of this array
```

And it may be related to a `SIGKILL` exception being raised shortly after the
failure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116263
Approved by: https://github.com/lezcano
2023-12-21 18:08:29 +00:00
199b04fdbd Back out "Implement pass-through state_dict and load_state_dict for dynamo OptimizedModule (#113423)" (#116243)
Summary:
Original commit changeset: 2a9588cfd51b

Original Phabricator Diff: D52062368

Test Plan: In investigating S386328 and S382826, we found checkpoint loading succeed after backout D52062368: S386328_backout_1220_193648

Differential Revision: D52356011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116243
Approved by: https://github.com/voznesenskym
2023-12-21 17:57:05 +00:00
ed03834693 Revert "Expose functional IR to capture_pre_autograd (#115210)"
This reverts commit 4b59b4dffba633f638f3d7ccffff2abc2e53f25e.

Reverted https://github.com/pytorch/pytorch/pull/115210 on behalf of https://github.com/malfet due to This should fix test_export_constraints_error_non_strict failures, see https://github.com/pytorch/pytorch/issues/116273 ([comment](https://github.com/pytorch/pytorch/pull/115210#issuecomment-1866706302))
2023-12-21 17:49:43 +00:00
a357a0f315 Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" (#116201)
Summary:
This diff needs to be backed out because TorchBench llama_v2_7b_16h has a cublas init error.
https://github.com/pytorch/benchmark/actions/runs/7266269668/job/19797677485?pr=2095

Test Plan: CI

Differential Revision: D52339142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116201
Approved by: https://github.com/xuzhao9
2023-12-21 16:32:19 +00:00
ff4aac109a [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
Enable clang-tidy check readability which checks for a bizarre C++ construct that is usually indicative of an error: https://clang.llvm.org/extra/clang-tidy/checks/readability/misplaced-array-index.html (indexing a number by a pointer, which surprisingly inverts the operands).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116210
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 15:09:10 +00:00
cc2c2c6ca9 [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
Adds a clang-tidy check to flag duplicate include files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116193
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 14:58:12 +00:00
2dce364634 [AOTI][refactor] Remove model_container_runner_cuda.cpp (#116113)
Differential Revision: [D52301272](https://our.internmc.facebook.com/intern/diff/D52301272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116113
Approved by: https://github.com/khabinov
ghstack dependencies: #116047
2023-12-21 14:56:25 +00:00
f71d302c63 Revert "[Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)"
This reverts commit 71cb13869b4eced76589f47e26bd64cdc2d54aa2.

Reverted https://github.com/pytorch/pytorch/pull/116193 on behalf of https://github.com/jeanschmidt due to Breaking internal test (bolt_nn_espresso_operator_test_eureka-scheduler) and job (build-rdk-diff-windows-debug-cuda11) @malfet and @albanD, please help the author get this PR merged by providing more information ([comment](https://github.com/pytorch/pytorch/pull/116193#issuecomment-1866391726))
2023-12-21 14:43:07 +00:00
348cb2f8f9 Revert "[BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)"
This reverts commit 5d5ef016a622c8259b328e8b6f8fa7ffcf3c80dc.

Reverted https://github.com/pytorch/pytorch/pull/116210 on behalf of https://github.com/jeanschmidt due to unfortunately, It is required to revert this PR in order to properly revert https://github.com/pytorch/pytorch/pull/116193 ([comment](https://github.com/pytorch/pytorch/pull/116210#issuecomment-1866380974))
2023-12-21 14:37:41 +00:00
ec6c4fed3f Revert "Support nn_module_stack in torch.export(strict=False) (#115454)"
This reverts commit 6730b5bcb41e0519572759d9ad9852a113d0a7e4.

Reverted https://github.com/pytorch/pytorch/pull/115454 on behalf of https://github.com/jeanschmidt due to Breaking internal tests recycle_bin_citadel and executorch, check internal diff to see more details ([comment](https://github.com/pytorch/pytorch/pull/115454#issuecomment-1866315233))
2023-12-21 14:05:43 +00:00
0567f71ac6 Revert " pre_dispatch aot_export (#115188)"
This reverts commit a267d6735051a4714fa2ac1c163315b650118744.

Reverted https://github.com/pytorch/pytorch/pull/115188 on behalf of https://github.com/jeanschmidt due to sadly, it is required to revert this commit in order to revert https://github.com/pytorch/pytorch/pull/115454 ([comment](https://github.com/pytorch/pytorch/pull/115188#issuecomment-1866310014))
2023-12-21 14:03:18 +00:00
f170d6665c [DCP] Add a profiler function for benchmarking save and load (#116007)
Many operations when calling DCP's save and load are executed on CPU. Thus we can easily profile these operations with cProfile. This PR adds the ability to profile the save() and load()

One follow-up for this PR is to integrate the feature with the distributed logging flags.

Differential Revision: [D52245434](https://our.internmc.facebook.com/intern/diff/D52245434/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116007
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #116006
2023-12-21 08:03:07 +00:00
a548ff40de [DCP][BE] Remove unused function (#116006)
As title

Differential Revision: [D52245433](https://our.internmc.facebook.com/intern/diff/D52245433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116006
Approved by: https://github.com/wz337
2023-12-21 07:20:08 +00:00
4b59b4dffb Expose functional IR to capture_pre_autograd (#115210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115210
Approved by: https://github.com/zhxchen17
ghstack dependencies: #115188
2023-12-21 07:16:07 +00:00
8fd1963ae2 [dynamo][collective_op] Use the value of the wrappered attribute async_op in dynamo when checking supported or not (#115921)
I found that whether the attribute `async_op` in collective ops is explicitly set to `True` or `False` by users, it always leads to a graph break, because the argument `async_op` is wrapped as `ConstantVariable(bool)` in dynamo. So here we need to use its `value` for the check.
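
A hedged, single-process sketch (gloo backend, one rank) of the kind of user code affected; the endpoint and port are placeholders:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

@torch.compile
def allreduce(t):
    # async_op is a plain Python bool; dynamo wraps it as ConstantVariable(bool),
    # so the support check has to inspect its .value.
    dist.all_reduce(t, async_op=False)
    return t

print(allreduce(torch.ones(4)))
dist.destroy_process_group()
```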

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115921
Approved by: https://github.com/jansel, https://github.com/wconstab
2023-12-21 03:27:57 +00:00
74e8cfc9a0 Forward fix torch package bug - dont depend on dynam in fsdp directly (#116229)
Differential Revision: [D52350752](https://our.internmc.facebook.com/intern/diff/D52350752)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116229
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-12-21 03:10:22 +00:00
db35ccf463 Revert "[innductor] make inductor work with new triton compile interface (#115878)"
This reverts commit bbded928b3556cf5678edf8fa41109d418312bcc.

Reverted https://github.com/pytorch/pytorch/pull/115878 on behalf of https://github.com/kit1980 due to Broke ROCm https://github.com/pytorch/pytorch/actions/runs/7282149837/job/19844618618 ([comment](https://github.com/pytorch/pytorch/pull/115878#issuecomment-1865369349))
2023-12-21 02:00:17 +00:00
65d3dde665 Fix allowed dtypes for mem_eff attention (#116026)
# Summary

Fix issue bug in detecting mem eff capability for cuda devices less than sm80:
https://github.com/pytorch-labs/gpt-fast/issues/49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116026
Approved by: https://github.com/janeyx99
2023-12-21 01:56:38 +00:00
c1d960aadd [Quant] [Inductor] add input shape check for quantized conv binary lowering (#115247)
Add an input shape check for quantized conv binary lowering, since qconv2d_pointwise.binary does not yet support the case of broadcast-shaped inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115247
Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison
2023-12-21 01:36:49 +00:00
be9de33240 [Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)
Make ```SkipFilesVariable``` handle only the function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes need a proper place to be handled. We can put them in either ```SkipFilesVariable``` or ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places doing one thing.
   - Going forward, let's make ```SkipFilesVariable``` only handle functions, and I'll probably rename it to ```SkippedFunctionVariable``` in the following PRs.
   - Let's dispatch by the value's type; all torch classes stuff would go to ```UserDefinedClassVariable``` in the next PR.
* We'd like to merge the in_graph/skip/inline trace decision into the same API ```trace_rule.lookup```, so we probably have to limit the input to functions only to better organize the ```VariableBuilder._wrap``` logic.
   - As a next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup```, and do the skipfile check before wrapping values into the correct variable tracker.
   - Though ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
2023-12-21 01:35:07 +00:00
a734085a63 [ONNX][Dort] Fix bug preventing running with OrtValueVector (#116124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116124
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
ghstack dependencies: #115945
2023-12-21 01:20:46 +00:00
259b0af367 [ONNX] Add copy before export for perf bench to avoid mutating base model (#115945)
Otherwise the base model might be mutated, which affects the measured performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115945
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2023-12-21 01:20:46 +00:00
feafbcf437 [AOTI][refactor] Refactor model runner API (#116047)
Summary: 1) make the proxy executor a private member; 2) use std::string instead of char*

Differential Revision: [D52301106](https://our.internmc.facebook.com/intern/diff/D52301106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116047
Approved by: https://github.com/khabinov
2023-12-21 01:05:37 +00:00
9502fa8d84 add a transformer suite in TP/SP tests (#115530)
This is to address issue #115309.

Test plan
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_False`
`python test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training_is_seq_parallel_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115530
Approved by: https://github.com/wanchaol
2023-12-21 01:04:36 +00:00
7ca6e0d38f [EZ] Add CUSPARSELT to build variables (#116213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116213
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/atalman
ghstack dependencies: #116212
2023-12-21 01:02:11 +00:00
74119a3482 [EZ] Fix typo in USE_GLOO var (#116212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116212
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-12-21 01:02:11 +00:00
f206e31e2f Swap slots if slots match in swap_tensor (#116128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
8aae46f843 [ROCm] fix nightly 5.6 build (#116029)
The ROCm 5.6 nightly wheel build was broken by #114329. This fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116029
Approved by: https://github.com/huydhn, https://github.com/jithunnair-amd, https://github.com/atalman
2023-12-21 00:22:42 +00:00
be90b757d9 Enable compiled Adam in the benchmarks (#116093)
Commit b697bcc583 of mlazos/compiled-adam2 at https://hud.pytorch.org/benchmark/compilers
is an initial benchmark run

Increases compile time by 20s for torchbench and HF, and 30s for TIMM

I expect the compile time to come down significantly with fake tensor prop caching

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116093
Approved by: https://github.com/janeyx99
2023-12-21 00:17:36 +00:00
bbded928b3 [innductor] make inductor work with new triton compile interface (#115878)
Recent 2 triton PRs (https://github.com/openai/triton/pull/2701, https://github.com/openai/triton/pull/2756) change the interface for triton.compile, this PR added the necessary change on inductor side to work with both old and new compile API.

Also there is some simplification between compilation call in subprocess and the one in main process
- previously we passed warm_cache_only=True if the compilation happened in a subprocess. But triton never uses that argument in the currently pinned version, so I removed it.
- previously we only passed compute_capability if compilation happened in a subprocess. The PR changes that to always pass compute_capability to triton.compile, no matter whether the compilation happens in the main or a sub process.

Updated:
There are more interface change from triton side. E.g.
- tl.math.{min, max} now require a propagate_nan argument
- JITFunction.run now requires a warmup argument. This affects the benchmarking phase of matmul max-autotune; on the other hand, JITFunction.run now forbids the stream argument. Simply not passing it when benchmarking the matmul triton kernel works for both the old and new versions of triton.
- triton Autotuner renamed its attributes from 'warmup' to 'num_warmup' and from 'rep' to 'num_rep'. This caused dynamo to fail handling the triton Autotuner object, since dynamo's TritonKernelVariable makes assumptions about attribute names. It's exercised in some test cases where a model calls the triton Autotuner directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115878
Approved by: https://github.com/jansel
2023-12-21 00:03:38 +00:00
5d5ef016a6 [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
Enable clang-tidy check readability which checks for a bizarre C++ construct that is usually indicative of an error: https://clang.llvm.org/extra/clang-tidy/checks/readability/misplaced-array-index.html (indexing a number by a pointer, which surprisingly inverts the operands).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116210
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-21 00:00:20 +00:00
897600eb35 [inductor] Some tests have both CPU and CUDA variants running with CPU tensors (#116131)
I don't think that's intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116131
Approved by: https://github.com/jansel
2023-12-21 00:00:15 +00:00
7c7208a9e7 Forward fix to remove xfails for vmap NT tests in Dynamo (#116216)
Resolves land race between #116111 and #114523.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116216
Approved by: https://github.com/kit1980
2023-12-20 22:55:08 +00:00
edf1ea622d Move step is noop tests (#115299)
As stated. I do notice there is perhaps opportunity to abstract, but the tests as written are also super understandable and more abstraction might not be desirable.

This PR _increases coverage_. The original tests each tested 12 default configs (leaving out Rprop). Now the tests cover ~80 configs, and then foreach + fused on top of that! Test time basically increases over 10-fold, but this test is tiny so we are not worried:

Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s

OK
```

New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
2023-12-20 22:49:44 +00:00
8f3a0594e9 Move tests depending on listed configs to OptimizerInfo (#115025)
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 4 tests in 22.731s

OK
```

For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py  -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 50 tests in 50.785s

OK (skipped=25)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
2023-12-20 22:49:44 +00:00
05d60931b3 Migrate test_peak_mem_multi_tensor_optimizers to OptimizerInfo (#115023)
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s

OK
```

with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
2023-12-20 22:49:44 +00:00
4fb92b591d [BE] remove redundant _test_derived_optimizers by migrating more to OptimizerInfo (#114802)
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 2 tests in 34.591s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py
-v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 32.915s

OK (skipped=11)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
2023-12-20 22:49:44 +00:00
0fae3dfef7 Add convenient things for Dynamo testing (#116173)
- added a way to easily add a skip
- added a way to easily turn markDynamoStrictTest on by default for a
  particular test file
- added an envvar to turn markDynamoStrictTest on by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116173
Approved by: https://github.com/voznesenskym
2023-12-20 22:49:26 +00:00
19207b9183 Allow more backend worker threads with each using a separate cuda stream (#116190)
Added a `--num_workers` option to `server.py` that allows more than 1 worker in the `ThreadPoolWorker` used for model predictions. Each worker uses its own `cuda.Stream()` that is created when the worker thread is initialized.

Ran benchmark for 2-4 workers with `compile=False` (since compile is not thread-safe)
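
A minimal sketch of the per-worker stream setup (assumed structure, not the benchmark's actual code; requires a CUDA device):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import torch

_tls = threading.local()

def _init_worker():
    # Each worker thread gets its own CUDA stream when the pool spins it up.
    _tls.stream = torch.cuda.Stream()

def predict(model, batch):
    # Run this worker's predictions on its private stream.
    with torch.cuda.stream(_tls.stream):
        with torch.no_grad():
            out = model(batch)
        _tls.stream.synchronize()
    return out

executor = ThreadPoolExecutor(max_workers=4, initializer=_init_worker)
```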

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116190
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188, #116189
2023-12-20 22:08:29 +00:00
0dd64174bd Do H2D/D2H of input/result on separate threads/cuda.Streams (#116189)
Added two `ThreadPoolExecutor`s with 1 worker each for D2H and H2D copies. Each uses its own `cuda.Stream`. The purpose is to try to overlap D2H and H2D with compute and allow the worker handling prediction to launch compute kernels without being blocked by D2H/H2D.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116189
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187, #116188
2023-12-20 22:08:29 +00:00
3793ad6a7e Fix bugs in metrics calculation in inference benchmark and rerun baseline (#116188)
Before this PR, each `request_time` was separated by the time for a `torch.randn(...)` to create the fake `data` tensor on CPU. This meant that the gap between `request_times` **scaled with the batch_size**. So the latency comparisons across batch sizes were inaccurate. In this PR we generate all the fake data outside the loop to avoid this.

Other bug fixes:
- Only start polling GPU utilization after warmup event is complete
- Correct calculation of throughput: previously `(num_batches * batch_size) / sum(response_times)`; it should have been `(num_batches * batch_size) / (last_response_time - first_request_time)` (see the sketch after this list)
- Make sure that response sent back to frontend is on CPU
- Use a lock to ensure writing to `metrics_dict` in `metrics_thread` and `gpu_utilization_thread` in a thread-safe manner
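
A sketch of the corrected throughput computation described in the list above (variable names are assumed):

```python
def throughput(num_batches, batch_size, request_times, response_times):
    # Wall time from the first request to the last response, not a sum of latencies.
    elapsed = response_times[-1] - request_times[0]
    return (num_batches * batch_size) / elapsed
```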

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116188
Approved by: https://github.com/albanD
ghstack dependencies: #115286, #116187
2023-12-20 22:08:22 +00:00
75a4b10d56 [easy] Add option for profiling backend in inference benchmark (#116187)
Some misc fixes, also added option for experiment name to add to result table

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116187
Approved by: https://github.com/albanD
ghstack dependencies: #115286
2023-12-20 22:08:11 +00:00
31f21e033e Run inference in an Executor (#115286)
Experiment: run model predictions in the backend in a ThreadPoolExecutor so that each model prediction does not block reading requests from the queue

The baseline is reset in the above PR, which fixes a lot of the metrics calculations, but I kept the metrics here anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115286
Approved by: https://github.com/albanD
2023-12-20 22:08:02 +00:00
b72127cd4b [inductor] Support sym exprs in lowering constant promotion (#116196)
Follow-up to https://github.com/pytorch/pytorch/pull/115920

This PR fixes the error with symbolic expression in aten.div:
```python
import torch
aten = torch.ops.aten

def func(x, a):
    return aten.div(x * 0.5, a, rounding_mode=None)

cfunc = torch.compile(func, dynamic=True, fullgraph=True)
device = "cpu"
x = 124
a = 33
out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message:
```
  File "/pytorch/torch/_inductor/graph.py", line 700, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4823, in div_mode
    return div(a, b)
  File "/pytorch/torch/_inductor/lowering.py", line 293, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/pytorch/torch/_inductor/lowering.py", line 4857, in div
    a, b = promote_constants(
  File "/pytorch/torch/_inductor/lowering.py", line 368, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: StopIteration:
  target: aten.div.Tensor_mode
  args[0]: 1.0*s0
  args[1]: s1
  kwargs: {'rounding_mode': None}

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116196
Approved by: https://github.com/peterbell10
2023-12-20 21:59:51 +00:00
a267d67350 pre_dispatch aot_export (#115188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115188
Approved by: https://github.com/bdhirsh
2023-12-20 21:36:25 +00:00
4afe2687d5 Reland "Serve multistream graph captures from correct pool (#114647)" (#116199)
Fixes a variable shadowing problem that broke internal builds.

This reverts commit fe156456194ed64bdf8b086d469b3643515a2baf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199
Approved by: https://github.com/eellison
2023-12-20 21:22:34 +00:00
199bacaf77 [Dynamo] Fix broken trunk and re-enable test_torch_name_rule_map_updated (#116146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116146
Approved by: https://github.com/williamwen42
2023-12-20 21:22:29 +00:00
6e2c9be501 [Easy][BE]: Enable RUF008 and RUF016 checks (#116195)
Enables a few more static linting checks for mutable defaults in dataclasses and for detecting a common type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116195
Approved by: https://github.com/malfet
2023-12-20 21:16:49 +00:00
bc0d8649a4 Fix missing dependency in torch.utils.tensorboard (#115598)
Fixes #114591

The version package was removed in pull request #114108 but is still used in `torch.utils.tensorboard`, causing import errors. The fix removes the import and uses a simpler check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115598
Approved by: https://github.com/malfet
2023-12-20 21:11:52 +00:00
1d5a9a1c1a [Easy][BE]: remove itertools.accumulate Python 2 shim and apply UFMT (#116192)
Removes an unnecessarily duplicated utility function and just relies on itertools. Since the file is low traffic, I also added the modified files to the UFMT'd files and formatted them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116192
Approved by: https://github.com/malfet
2023-12-20 20:36:59 +00:00
602abf6b55 [ROCm] more 6.0 changes (#115946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115946
Approved by: https://github.com/pruthvistony, https://github.com/huydhn, https://github.com/malfet
2023-12-20 20:19:29 +00:00
ea3a5f8ddc Add chunk for jagged layout NT (#115842)
Nice to have for the [SDPA tutorial](https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html)
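
A minimal usage sketch (shapes are illustrative; mirrors the kind of last-dim split used when separating fused q/k/v projections):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 6), torch.randn(3, 6)], layout=torch.jagged
)
a, b = torch.chunk(nt, 2, dim=-1)  # split the last, non-ragged dim into two pieces
```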
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115842
Approved by: https://github.com/soulitzer
ghstack dependencies: #115192, #116111
2023-12-20 20:13:20 +00:00
29b198dcf8 Add markDynamoStrictTest to NT tests (#116111)
Decorates all NT tests with `@markDynamoStrictTest` to ensure we get the correct signal. Adds xfails where needed to get things passing.

Includes a fix in meta_utils.py for a bug that was breaking several python 3.11 tests. In particular, a dense tensor graph input that is a view of a strided NT would slip past Dynamo's check and break in meta-ification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116111
Approved by: https://github.com/soulitzer, https://github.com/zou3519
ghstack dependencies: #115192
2023-12-20 20:13:20 +00:00
f2c1fb3ee4 Fix crash in SymInt unary minus (#116160)
Before this change `-SymInt(std::numeric_limits<int64_t>::min()) == 0` would reliably crash with null pointer dereference, as `data_` of the SymInt returned by `operator-` would be `0x8000000000000000`, because of the carry/overflow flags set by `negq`.

Before the change x86_64 assembly generated for
4f02cc0670/c10/core/SymInt.cpp (L137)
looked as follows:
```
    0x7ffff7f2f490 <+115>: movq   %rax, %rdx
    0x7ffff7f2f493 <+118>: negq   %rdx
    0x7ffff7f2f496 <+121>: movq   %rdx, (%rbp)
    0x7ffff7f2f49a <+125>: movabsq $0x4000000000000000, %rdx ; imm = 0x4000000000000000
    0x7ffff7f2f4a4 <+135>: cmpq   %rdx, %rax
    0x7ffff7f2f4a7 <+138>: jle    0x7ffff7f2f520            ; <+259> at SymInt.cpp:141:1
```
`negq %rdx` corresponds to the unary minus, and `cmpq %rdx, %rax` against the `0x4000000000000000` constant is the inverted `check_range`
b6d0d0819a/c10/core/SymInt.h (L247-L249)
Flags raised by `negq` will affect the result of `cmpq`, and as a result the value would not be allocated on the heap, but rather left as `nullptr`.

Not sure if it's worth benchmarking, but perhaps using `__builtin_sub_overflow` would be faster, as it does not require an extra comparison and just guarantees that the overflow flag is cleared after the op.
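
A small Python illustration of the wraparound being guarded against (the real code is C++; ctypes is only used here to emulate 64-bit two's-complement arithmetic):

```python
import ctypes

INT64_MIN = -2**63
# ctypes does no overflow checking, so negating INT64_MIN wraps back to INT64_MIN,
# i.e. the 0x8000000000000000 bit pattern mentioned above.
negated = ctypes.c_int64(-INT64_MIN).value
assert negated == INT64_MIN
```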
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116160
Approved by: https://github.com/Skylion007, https://github.com/colesbury
2023-12-20 20:12:57 +00:00
f8ad664cf2 [export] Update range constraints to runtime_var_to_range (#115427)
Updated range_constraints to be the union of shape_env.var_to_range and shape_env.runtime_var_to_range, with shape_env.runtime_var_to_range taking priority.

Due to 0/1 specialization, if we bound an unbacked symint to be less than 5, the range of possible values for this symint is actually recorded as [2, 5] in shape_env.var_to_range. To fix this so that users will be able to see a more understandable range of [0, 5], shape_env.runtime_var_to_range was created to store the range of [0, 5]. Since range_constraints is a user-facing attribute to query the ranges of certain symints, we want to use shape_env.runtime_var_to_range to get the unbacked symints ranges, rather than shape_env.var_to_range.

Additionally, run_decompositions() has an issue where it will always add assertions to the graph, even if a previous run has already added the assertions. So, I added a part to the AddRuntimeAssertionsForInlineConstraints which will store which assertions have already been added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115427
Approved by: https://github.com/zhxchen17
2023-12-20 20:00:41 +00:00
1be6a070bc Add support for torch.cond in vmap (#114523)
Fixes: https://github.com/pytorch/pytorch/issues/114136

The patch enables conversion of a BatchedTensor into a FakeTensor and implements torch.cond vmap support using torch.where.
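
A hedged sketch of the behavior this enables (the import path is an assumption; in some builds cond is exposed elsewhere, e.g. as torch.cond):

```python
import torch
from functorch.experimental.control_flow import cond

def f(pred, x):
    return cond(pred, lambda t: t.sin(), lambda t: t.cos(), (x,))

preds = torch.tensor([True, False])  # batched predicate
xs = torch.randn(2, 3)
out = torch.vmap(f)(preds, xs)       # lowered via torch.where per the description
```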

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114523
Approved by: https://github.com/zou3519
2023-12-20 19:54:38 +00:00
06ae9b79ed [mtia] add module exporter to net minimizer (#115687)
Summary: add module exporter to net minimizer

Reviewed By: amylittleyang

Differential Revision: D52086699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115687
Approved by: https://github.com/jfix71
2023-12-20 19:36:23 +00:00
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
2d2016fdf8 WIP Add compatibility with channels_last_3d for conv3d (#114790)
Part of a multi-PR work to fix #59168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114790
Approved by: https://github.com/albanD
2023-12-20 19:28:25 +00:00
8bff59e41d [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-20 19:09:25 +00:00
0b0b9b3275 [c10d][libuv] add partial read test for libuv backend and fix an error which only happens when partially reading a buffer (#116141)
**Test Plan**
1. build pytorch
2. execute `TORCH_CPP_LOG_LEVEL=INFO build/bin/TCPStoreTest --gtest_filter=TCPStoreTest.testLibUVPartialRead` from the pytorch root directory.

without the change:
<img width="761" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/1942e3c2-a9c1-4fe4-87e8-7e21f4d8f9aa">

with the change:
<img width="747" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/f3e96a5b-0ed1-49bd-9184-bb8a5ebebc33">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116141
Approved by: https://github.com/wconstab
2023-12-20 18:37:55 +00:00
ee5d981249 [BE]: Enable RUFF PERF402 and apply fixes (#115505)
* Enable PERF402. Makes code more efficient and succinct by removing useless list copies that could be accomplished either via a list constructor or extend call. All test cases have noqa added since performance is not as sensitive in that folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115505
Approved by: https://github.com/malfet
2023-12-20 18:01:24 +00:00
8837df1d71 [c10d] Expose check method to Python for store via pybind (#116144)
Differential Revision: [D52310987](https://our.internmc.facebook.com/intern/diff/D52310987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116144
Approved by: https://github.com/wconstab
2023-12-20 17:57:13 +00:00
71cb13869b [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
Adds a clang-tidy check to flag duplicate include files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116193
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-20 17:56:21 +00:00
fe15645619 Revert "Serve multistream graph captures from correct pool (#114647)"
This reverts commit 8a445f7bd5bef43b30b61b20483d606c6e42e606.

Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))
2023-12-20 17:11:42 +00:00
ea7f2de6f3 [docker] Fix typo in docker-release workflow (#116191)
Fix copy-paste typo in docker-release workflow.  After https://github.com/pytorch/pytorch/pull/116097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116191
Approved by: https://github.com/malfet
2023-12-20 16:44:36 +00:00
16e539e0e6 Fix index range check (#116062)
Fixes an incorrect range check when index is `std::numeric_limits<int64_t>::min()`, as the result of the unary minus operation for such a value is undefined, but in practice equals the value itself; see https://godbolt.org/z/Wxhh44ocr

The lower bound check was `size >= -index`, which was incorrect if `index` is `INT64_MIN`; it is replaced with a check based on `-1 - index`, which for all int64_t values returns a result that also fits into the int64_t range. `- (index + 1)` is more readable and results in identical optimized assembly, see https://godbolt.org/z/3vcnMYf9a , but its intermediate result for `INT64_MAX` is outside of the `int64_t` range, which leads to similar problems as with `INT64_MIN` in the original example.
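
A Python rendering of the corrected bound check (illustrative only; the real code is C++, where negating INT64_MIN is undefined behavior):

```python
def index_in_range(index: int, size: int) -> bool:
    # Valid indices are [-size, size - 1].
    if index < 0:
        # Equivalent to size >= -index, but never negates INT64_MIN.
        return size - 1 >= -1 - index
    return index < size
```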

Added regression test.

Fixes https://github.com/pytorch/pytorch/issues/115415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116062
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-12-20 15:40:57 +00:00
fabf9433e7 [AOTI][refactor] Organize model runner files (#116022)
Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file

Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022
Approved by: https://github.com/khabinov
2023-12-20 15:35:34 +00:00
4d6a1ad400 Activation checkpoint and checkpoint_sequential errors if use_reentrant not passed explicitly (#115868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115868
Approved by: https://github.com/albanD
ghstack dependencies: #115438
2023-12-20 15:23:44 +00:00
cfb3cd11c1 Add basic autograd TORCH_LOGS support (#115438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115438
Approved by: https://github.com/albanD
2023-12-20 15:23:44 +00:00
cfbf647adb Add aten/src/ATen/native/quantized/cpu/ path to CPU quantization merge rule (#116145)
Observing following PR: https://github.com/pytorch/pytorch/pull/115329
Comment from author: https://github.com/pytorch/pytorch/pull/115329#issuecomment-1851339555

pytorchbot merge failed.
The reason is this logic: we expect all files in a PR to match one merge rule:
110339a310/.github/scripts/trymerge.py (L1310-L1324)

This should mitigate the issue; a follow-up PR will refactor this code to allow cross-rule matching of approvers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116145
Approved by: https://github.com/huydhn, https://github.com/kit1980, https://github.com/malfet
2023-12-20 14:43:15 +00:00
8eb7f6276b Ensure wrapping subclasses with as_subclass is supported (#116091)
As title
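
A minimal sketch of the pattern being supported (not the PR's actual test):

```python
import torch

class MyTensor(torch.Tensor):
    pass

@torch.compile
def wrap_and_add(x):
    # Wrapping into a Tensor subclass via as_subclass inside compiled code.
    return x.as_subclass(MyTensor) + 1

wrap_and_add(torch.randn(4))
```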

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116091
Approved by: https://github.com/pmeier, https://github.com/zou3519
2023-12-20 14:37:08 +00:00
c215e59bf2 Revert "[inductor] Avoid bool being upcast to int (#109913)"
This reverts commit 92998693a9455af6259cae468265f01cfff8810e.

Reverted https://github.com/pytorch/pytorch/pull/109913 on behalf of https://github.com/jeanschmidt due to causing performance regression in relevant metrics, @malfet I believe you are the correct person to help identify and fix the issues. For more details, check the internal OPS count for the ads metrics in the related internal diff ([comment](https://github.com/pytorch/pytorch/pull/109913#issuecomment-1864397407))
2023-12-20 12:33:50 +00:00
cyy
968b94bef2 [8/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#116082)
This patch enables clang-tidy coverage on c10/**/*.h and contains other fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116082
Approved by: https://github.com/Skylion007
2023-12-20 12:22:21 +00:00
d72d99e591 Fix sparse compressed tensor invariants checks when nnz==0 (#115826)
Fixes https://github.com/pytorch/pytorch/issues/115755

This PR is a step toward deprecating `torch.empty(..., layout=<sparse compressed tensor layout>)`, whose usage should be minimized as it will produce invalid tensors; see also https://github.com/pytorch/pytorch/issues/90695 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115826
Approved by: https://github.com/cpuhrsch, https://github.com/amjames
2023-12-20 12:16:07 +00:00
bdfabe5e7d Revert "[Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)"
This reverts commit bb5a27052fa989f2365793c7ffe2d5a453aca31a.

Reverted https://github.com/pytorch/pytorch/pull/115963 on behalf of https://github.com/jeanschmidt due to causing significant performance regression, identified by number of ops in ads, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/115963#issuecomment-1864361697))
2023-12-20 12:06:55 +00:00
af8a50e656 Revert "Fix allowed dtypes for mem_eff attention (#116026)"
This reverts commit fc58909babcd07ea9652a1c1b3c2c7803f407a37.

Reverted https://github.com/pytorch/pytorch/pull/116026 on behalf of https://github.com/jeanschmidt due to breaking internal windows buck builds, check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/116026#issuecomment-1864354665))
2023-12-20 12:01:34 +00:00
6e1ba79b7f [re-land] Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001) (#116125)
This is an attempt to re-land https://github.com/pytorch/pytorch/pull/114001. The previous attempt used `std::array` in cuda kernels which wasn't compatible with Meta's internal build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116125
Approved by: https://github.com/yf225
2023-12-20 07:13:50 +00:00
9df4ee8d38 Fix ColwiseParallel typo (#116151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116151
Approved by: https://github.com/wanchaol
2023-12-20 06:40:32 +00:00
545d2126f6 [pt-vulkan] Enable Python code blocks in shader templates and upgrade shader template generation (#115948)
Summary:
This change makes two major improvements to PyTorch Vulkan's shader authoring workflow.

## Review Guide

There are a lot of changed files because every GLSL shader had to be touched. Most of the changes consist of changing

```
#define PRECISION $precision
#define FORMAT $format
```

to

```
#define PRECISION ${PRECISION}
#define FORMAT ${FORMAT}
```

due to changes in how shader templates are processed.

For reviewers, the primary functional changes to review are:

* `gen_vulkan_spv.py`
  * Majority of functional changes are in this file, which controls how shader templates are processed.
* `shader_params.yaml`
  * controls how shader variants are generated

## Python Codeblocks in Shader Templates

From now on, every compute shader (i.e. `.glsl`) is treated as a shader template. To this effect, the `templates/` folder has been removed and there is now a global `shader_params.yaml` file to describe the shader variants that should be generated for all shader templates.

**Taking inspiration from XNNPACK's [`xngen` tool](https://github.com/google/XNNPACK/blob/master/tools/xngen.py), shader templates can now use Python codeblocks**.  One example is:

```
$if not INPLACE:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict writeonly image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uInput;
  layout(set = 0, binding = 2) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 3) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 input_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
$else:
  layout(set = 0, binding = 0, FORMAT) uniform PRECISION restrict image3D uOutput;
  layout(set = 0, binding = 1) uniform PRECISION sampler3D uOther;
  layout(set = 0, binding = 2) uniform PRECISION restrict Block {
    ivec4 output_sizes;
    ivec4 other_sizes;
    float alpha;
  }
  uArgs;
```

Another is:

```
  // PYTHON CODEBLOCK
  $if not IS_DIV:
    const int c_index = (pos.z % ((uArgs.output_sizes.z + 3) / 4)) * 4;
    if (uArgs.other_sizes.z != 1 && c_index + 3 >= uArgs.output_sizes.z) {
      ivec4 c_ind = ivec4(c_index) + ivec4(0, 1, 2, 3);
      vec4 mask = vec4(lessThan(c_ind, ivec4(uArgs.output_sizes.z)));
      other_texel = other_texel * mask + vec4(1, 1, 1, 1) - mask;
    }

  // PYTHON CODEBLOCK
  $if not INPLACE:
    ivec3 input_pos =
        map_output_pos_to_input_pos(pos, uArgs.output_sizes, uArgs.input_sizes);
    const vec4 in_texel =
        load_texel(input_pos, uArgs.output_sizes, uArgs.input_sizes, uInput);

    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
  $else:
    const vec4 in_texel = imageLoad(uOutput, pos);
    imageStore(uOutput, pos, OP(in_texel, other_texel, uArgs.alpha));
```

In addition to making shader templates easier and clearer to write, this enables shaders that previously could not be consolidated, such as the non-inplace and inplace variants of the same shader, to now be represented using a single template.

## `generate_variant_forall` in shader variant YAML configuration

YAML files that describe how shader variants should be generated can now use a `generate_variant_forall` field to iterate over various settings for a specific parameter for each variant defined. Example:

```
unary_op:
  parameter_names_with_default_values:
    OPERATOR: exp(X)
    INPLACE: 0
  generate_variant_forall:
    INPLACE:
      - VALUE: 0
        SUFFIX: ""
      - VALUE: 1
        SUFFIX: "inplace"
  shader_variants:
    - NAME: exp
      OPERATOR: exp(X)
    - NAME: sqrt
      OPERATOR: sqrt(X)
    - NAME: log
      OPERATOR: log(X)
```

Previously, the `inplace` variants would need to have separate `shader_variants` entries. If there are multiple variables that need to be iterated across, then all possible combinations will be generated. Would be good to take a look to see how the new YAML configuration works.

Test Plan:
There is no functional change to this diff; we only need to make sure that the generated shaders are still correct. Therefore, we only need to run `vulkan_api_test`.

```
# On Mac Laptop
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*"
```

Reviewed By: digantdesai

Differential Revision: D52087084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115948
Approved by: https://github.com/manuelcandales
2023-12-20 05:47:33 +00:00
9766781512 Skip some flaky Dynamo tests (#116165)
The goal right now is to get the Dynamo CI back to green.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116165
Approved by: https://github.com/drisspg, https://github.com/aakhundov, https://github.com/huydhn, https://github.com/khabinov
2023-12-20 05:05:02 +00:00
3747aca49a [C10D] Make all PGNCCL LOG usages use logPrefix() (#116060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116060
Approved by: https://github.com/fduwjj
ghstack dependencies: #116059
2023-12-20 04:19:45 +00:00
6ffe1da375 Add support for multi device foreach ops (#116064)
Fix for https://github.com/pytorch/pytorch/issues/102023
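A hedged sketch of the op pattern this targets (assumes at least two visible CUDA devices; not code from the PR):

```python
import torch

xs = [torch.ones(4, device="cuda:0"), torch.ones(4, device="cuda:1")]
ys = [torch.ones(4, device="cuda:0"), torch.ones(4, device="cuda:1")]

@torch.compile
def step(a, b):
    # a single foreach op over tensors living on different devices
    return torch._foreach_add(a, b)

out = step(xs, ys)  # each result stays on its input's device
```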

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116064
Approved by: https://github.com/mlazos
2023-12-20 04:19:40 +00:00
c72bc61bcd [ROCm] Fix caffe2 build with hipblasv2 api (#116073)
Summary: we need this change along with D52244365 to make caffe2 build happy

Test Plan: OSS CI

Differential Revision: D52275058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116073
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-12-20 04:02:29 +00:00
a597a00c87 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115972)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declaring them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

This is a reland of https://github.com/pytorch/pytorch/pull/115831

Differential Revision: [D52290900](https://our.internmc.facebook.com/intern/diff/D52290900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115972
Approved by: https://github.com/chenyang78
2023-12-20 03:22:03 +00:00
4f02cc0670 [C10D] Add logPrefix to abortCommsFromMap (#116059)
Prints additional info such as PG ID/Rank.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116059
Approved by: https://github.com/fduwjj
2023-12-20 02:17:04 +00:00
c3bc65d9d8 [dynamo] Restore constant tensor original FQNs (#116086)
Differential Revision: D52192693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116086
Approved by: https://github.com/angelayi, https://github.com/muchulee8
2023-12-20 02:10:02 +00:00
6730b5bcb4 Support nn_module_stack in torch.export(strict=False) (#115454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115454
Approved by: https://github.com/suo, https://github.com/bdhirsh
2023-12-20 01:43:39 +00:00
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8 ,8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8 ,8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |
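For reference, a minimal sketch of the newly supported path (shapes mirror one of the benchmark rows above; the normalized shape is an illustrative choice, not from the PR):

```python
import torch
import torch.nn.functional as F

x = torch.randn(64, 128, 56, 56, dtype=torch.half)   # fp16 input on CPU
w = torch.randn(128, 56, 56, dtype=torch.half)
b = torch.randn(128, 56, 56, dtype=torch.half)

# fp16 layer_norm on CPU, previously unsupported
out = F.layer_norm(x, normalized_shape=(128, 56, 56), weight=w, bias=b)
```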

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
45cfe9cdf7 [export] Fix test to run internally (#116118)
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test:test_export`

Reviewed By: suo

Differential Revision: D52297701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116118
Approved by: https://github.com/suo
2023-12-20 01:02:16 +00:00
c55210b4f0 [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for grid calculation. Let's deduplicate these entries.

Previously, we would see wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-20 00:25:32 +00:00
9a2a44457a SDPA extend backward realized tensor alignment checking to forward realized tensors (#116069)
The logic to check alignment for realized tensors in the backward can be extended for realized tensors in the forward. This fixes an interaction with freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116069
Approved by: https://github.com/drisspg
2023-12-20 00:14:20 +00:00
110339a310 Fix c10::div_floor_floating compile error (#115647)
Introduced by #113276. I've added a test to catch future regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115647
Approved by: https://github.com/desertfire, https://github.com/vfdev-5
2023-12-20 00:09:01 +00:00
68c7aac809 [export][reland] non-strict export with dynamic shapes (#116048)
Reland of https://github.com/pytorch/pytorch/pull/115862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116048
Approved by: https://github.com/ydwu4
2023-12-19 23:57:22 +00:00
cd449e260c Mark set_ as an inplace view op (#115769)
Summary: To be used in https://github.com/pytorch/pytorch/pull/113873. Since set_ is effectively an inplace view op, we'll need to skip caching it.
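A small illustration of why `set_` behaves like an inplace view op (hedged sketch, not from the PR):

```python
import torch

a = torch.zeros(4)
b = torch.empty(0)
b.set_(a)          # b now shares a's storage, sizes, and strides, like a view created in place
b[0] = 1.0
assert a[0].item() == 1.0   # mutation through b is visible through a
```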

Test Plan: Built pytorch; specifically this step: `/home/slarsen/local/miniconda3/envs/pytorch-3.10/bin/python -m torchgen.gen --source-path /home/slarsen/local/pytorch/cmake/../aten/src/ATen --install_dir /home/slarsen/local/pytorch/build/aten/src/ATen --per-operator-headers --generate sources --output-dependencies /home/slarsen/local/pytorch/build/aten/src/ATen/generated_sources.cmake`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115769
Approved by: https://github.com/bdhirsh
2023-12-19 23:08:05 +00:00
0759240001 [sparse] update cslt to 0.5.2.1 (#115988)
Summary:

- update install_cusparselt to download 0.5.2.1 for 12.1
- add ifdef for new compute_type changes


Pull Request resolved: https://github.com/pytorch/pytorch/pull/115988
Approved by: https://github.com/malfet
ghstack dependencies: #115369
2023-12-19 23:02:54 +00:00
eqy
d55365dc05 [CUDA] Workaround shmem limit for certain input sizes in AdaptiveAvgPool1D (#115231)
Reference issue #68248

CC @ptrblck @malfet @xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115231
Approved by: https://github.com/mikaylagawarecki
2023-12-19 22:40:10 +00:00
7d92449171 Add call to run_tests for more tests? (#115781)
To make sure they get run in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115781
Approved by: https://github.com/kshitij12345, https://github.com/mlazos, https://github.com/voznesenskym
2023-12-19 22:20:10 +00:00
7f7a7b0b48 Reset stepcurrent cache if file succeeds (#115775)
Attempt to surface the segfault that happens on exit by resetting the "pytest last run" cache if pytest succeeds.  CI does not rerun on success so we won't hit an infinite loop anywhere, and I don't expect people to rerun on success (unless they're looking for flakes? Either way I highly doubt anyone is using the --sc/--scs flag locally).

This ensures that if pytest succeeds but the process gets a non zero exit code, the rerun will start at beginning instead of skipping all the "succeeding" tests.

This only applies if the --sc/--scs flags are used; these are custom to PyTorch and probably not used anywhere other than CI. They are not to be confused with --stepwise, which pytest provides by default.

Here's a list of segfaulting inductor/test_aot_inductor tests, which I added skips for:
```
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_duplicated_params_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fqn_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_no_args_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_output_misaligned_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_seq_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_simple_split_abi_compatible_cpu_with_stack_allocation
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_addmm_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_aliased_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_convolution_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_duplicated_params_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_empty_graph_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_fqn_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_large_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_missing_output_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_no_args_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_misaligned_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_path_1_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_repeat_interleave_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_return_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_reuse_kernel_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_seq_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_simple_split_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_small_constant_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_no_triton_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_offset_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_with_profiler_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_zero_size_weight_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115775
Approved by: https://github.com/desertfire
2023-12-19 22:19:57 +00:00
f88c9af98e [TEST] Skip scaled_dot_product_attention test on sm < 80 (#115760)
According to the [functionality](https://github.com/NVIDIA/cutlass/blob/main/media/docs/functionality.md) page, CUTLASS supports `bfloat16` aka `bf16` only on compute capability 80+ devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115760
Approved by: https://github.com/drisspg
2023-12-19 22:00:33 +00:00
ae6f1f4a47 [BE]: enable readability-delete-null-pointer clang-tidy check (#116107)
* Enables an additional clang-tidy check that removes unnecessary nullptr checks around delete statements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116107
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-19 21:08:37 +00:00
d85314c95c Support Predispatch functionalization (#113728)
In this PR, we are implementing functionalization on the pre-dispatch graph. Today, every dispatch key except for DispatchKey.Python has a dedicated mode stack in Python. PreDispatch tracing relies on this behaviour by pushing ProxyTorchDispatchMode to the DispatchKey.PreDispatch mode stack and handling the dispatching logic in Python. To make pre-dispatch functionalization work, we now need to push FunctionalTensorMode onto the DispatchKey.PreDispatch mode stack and make sure it runs before ProxyTorchDispatchMode (this is very similar to how post-dispatch tracing works). Here are some design decisions we made for this flow to work:

1. FunctionalTensorMode internally calls the C++ Functionalize key. Since C++ functionalization goes after PreDispatch, if we are not careful we will keep re-entering the PreDispatch key. We solve this by directly dispatching to the C++ Functionalize key.

2. We delete the mode_stack_per_key logic because the only realistic time it is exercised is for PreDispatch, and it is in general not safe to use a plain list: the ordering of FunctionalTensorMode and ProxyTorchDispatchMode matters and is hard to enforce on a plain list. Instead, we now have a private class that tracks the PreDispatch mode stack.

3.  We will still run CompositeImplicitAutograd decomps in this PR, and disable this logic later as a followup.

Some missing bits after this PR:
1. Preserving autograd ops in a functional form. Right now they still show up in the graph but in a "non-functional" way.
2. Turn off CompositeImplicitAutograd decomps
3. Functionalizing HOO

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113728
Approved by: https://github.com/bdhirsh
2023-12-19 20:28:35 +00:00
1474eb5f29 Fix jagged composite impl of flatten() (#115192)
Need to handle this in `NestedTensor.__torch_function__()` since it's CompositeImplicit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115192
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-19 19:15:21 +00:00
cbc70e9b9c [caffe2] Add option for build_cpukernel_avx2 (#116008)
Summary: We would like a more flexible way to customize the build option for the AVX2 instruction set, in order to address other issues.

Test Plan: CI

Differential Revision: D52247916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116008
Approved by: https://github.com/mattjgalloway
2023-12-19 18:49:52 +00:00
77d5f60740 [fsdp][torch.compile] FSDP changes (#115497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115497
Approved by: https://github.com/albanD
2023-12-19 18:44:36 +00:00
e52983939c fix(conv_v8): optimize lru cache in conv v8 (#114110)
Fixes #108474

The main issue is due to GCC's dual ABI.

https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html
> requires lists to keep track of their size.

It seems that in GCC's old ABI, `std::list::size` is linear.

The other optimization is:
* use `splice` instead of erase-then-push, which saves some memory and time.

More perf benchmarks are coming...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114110
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/malfet
2023-12-19 18:43:37 +00:00
d749b4a152 Implements permute_tensor in functional collectives (#115078)
Implementation of `permute_tensor` as per @yifuwang's suggestion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115078
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
2023-12-19 18:33:28 +00:00
71bedc3a69 [Inductor UT] fix unreachable code (#116094)
The test case test_uint4x2_mixed_mm has an indentation error. This PR makes the test code reachable.

test result:
```
pytest test_torchinductor.py -k test_uint4x2_mixed_mm -v
=========================================================================================== test session starts ===========================================================================================
platform linux -- Python 3.10.12, pytest-7.4.2, pluggy-1.3.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/workspace/pytorch/test/inductor/.hypothesis/examples')
rootdir: /workspace/pytorch
configfile: pytest.ini
plugins: shard-0.1.2, xdoctest-1.0.2, flakefinder-1.1.0, xdist-3.3.1, rerunfailures-12.0, hypothesis-5.35.1
collected 964 items / 962 deselected / 2 selected
Running 2 items in this shard: test/inductor/test_torchinductor.py::CpuTests::test_uint4x2_mixed_mm_cpu, test/inductor/test_torchinductor.py::CudaTests::test_uint4x2_mixed_mm_cuda

test_torchinductor.py::CpuTests::test_uint4x2_mixed_mm_cpu PASSED [2.2136s]                                                                                                                         [ 50%]
test_torchinductor.py::CudaTests::test_uint4x2_mixed_mm_cuda PASSED [1.9466s]                                                                                                                       [100%]

=================================================================================== 2 passed, 962 deselected in 15.70s ====================================================================================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116094
Approved by: https://github.com/peterbell10
2023-12-19 17:14:25 +00:00
5ba87a31bc Unflake test_reference_numerics_large__refs_special_multigammaln_mvlgamma_p_1_cpu_bfloat16 (#116058)
Run the test under markDynamoStrict mode and record an expected failure
under the Dynamo CI shard.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116058
Approved by: https://github.com/atalman
2023-12-19 16:42:29 +00:00
7b7f11f230 [dynamo] test number of guards when inputs are views (#115793)
After # 113734 landed (adding dynamic storage offsets), we found that compilation times increased significantly. The reason: tensors_definitely_do_not_overlap was doing comparisons on storage offsets which were adding guards

626b7dc847/torch/_functorch/_aot_autograd/input_output_analysis.py (L268-L276)

This guard is added on all pairs of tensors which are views of the same source tensor - i.e. the number of guards can be quadratic in the number of input tensors. This PR adds a test to prevent similar regressions.
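A hedged sketch of the kind of scenario the test guards against (names and sizes are illustrative):

```python
import torch

base = torch.randn(1024)
# several inputs sharing one storage, each with a different storage offset
views = [base[i : i + 8] for i in range(0, 64, 8)]

@torch.compile(dynamic=True)
def f(*tensors):
    return sum(t.sum() for t in tensors)

f(*views)  # the number of guards should stay roughly linear in len(views), not quadratic
```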

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115793
Approved by: https://github.com/yanboliang
2023-12-19 16:09:29 +00:00
91e184fd74 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit 4edc921857f39ba9510b6ab1c454149cfb2de157.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/jeanschmidt due to Breaking multiple internal tests, might be flakiness but multiple retries did not elicit an improvement, please check internal diff ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1863036417))
2023-12-19 16:01:19 +00:00
b6d0d0819a Revert "[PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#115329)"
This reverts commit 9ae0e6292944139ea598e7347c95ebd7df09e819.

Reverted https://github.com/pytorch/pytorch/pull/115329 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, please check internal diff to get the list and logs, @jerryzh168 please support the author in order to get these changes merged and landed ([comment](https://github.com/pytorch/pytorch/pull/115329#issuecomment-1863021726))
2023-12-19 15:52:57 +00:00
c539f7df10 Revert "[Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)"
This reverts commit 21b8127f1c9f31c02145d906aae2db1ada703067.

Reverted https://github.com/pytorch/pytorch/pull/115849 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, please check internal diff for more details ([comment](https://github.com/pytorch/pytorch/pull/115849#issuecomment-1863012933))
2023-12-19 15:47:55 +00:00
505a9e4854 add support for dynamic shapes in round (#115259)
Fixes #114310 and supersedes #114748.

There are two reasons why we have quite a few special cases for `round`:

1. `round` is actually two ops. With `ndigits=None` (default), `round` always returns an integer. When `ndigits` is an integer, the returned type is a float.
2. Although `round` takes two arguments, it is a unary function with a parameter rather than a binary one.
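A plain-Python reminder of the first point (this is standard `round` behavior, not code from the PR):

```python
assert isinstance(round(3.7), int)       # ndigits=None -> integer result
assert isinstance(round(3.7, 2), float)  # ndigits given -> float result
```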

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115259
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2023-12-19 15:45:50 +00:00
a7bfa04da6 Revert "More markDynamoStrictTest (#115870)"
This reverts commit 7f686c8fe127cc7db07134297fa09be20ab87918.

Reverted https://github.com/pytorch/pytorch/pull/115870 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff ([comment](https://github.com/pytorch/pytorch/pull/115870#issuecomment-1862997125))
2023-12-19 15:40:57 +00:00
24af118e55 Revert "markDynamoStrictTest more tests (#115871)"
This reverts commit 478f0e96dc2593db401903ac2ae053f8cd1e29ea.

Reverted https://github.com/pytorch/pytorch/pull/115871 on behalf of https://github.com/jeanschmidt due to Breaking internal tests and builds, please check diff, this is required to revert #115870 ([comment](https://github.com/pytorch/pytorch/pull/115871#issuecomment-1862992931))
2023-12-19 15:36:27 +00:00
5b6b680517 Revert "Adamw refactor (#115983)"
This reverts commit eafeba71c1ed35f8cf2d39016bf66c0b088e4a9f.

Reverted https://github.com/pytorch/pytorch/pull/115983 on behalf of https://github.com/jeanschmidt due to Breaking internal tests, @janeyx99 please help @tfsingh to have this PR landed ([comment](https://github.com/pytorch/pytorch/pull/115983#issuecomment-1862976954))
2023-12-19 15:26:44 +00:00
92998693a9 [inductor] Avoid bool being upcast to int (#109913)
Currently the inductor code for `x.any(-1)` does this strange dance:
```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask)
tmp1 = tmp0.to(tl.int64)
tmp2 = (tmp1 != 0)
```

This happens because `register_lowering` is doing type promotion with the
dimension argument, and so promotes to `int64` which we then cast back to bool.
A better fix would be to fix `register_lowering` but for now I just remove
the unnecessary type promotion from `aten.any`.

In the current code we also see:
```python
     tmp5 = tl.where(rmask & xmask, tmp3, 0)
```
which promotes the boolean value to int since `0` is an int32 in triton.
This fixes it to generate a boolean constant instead.

Finally there is also a triton bug where the `tl.load` itself upcasts to
`tl.int8`. I fix this by adding an explicit cast to `tl.int1`. The final
kernel code looks like:

```python
tmp0 = tl.load(in_ptr0 + (r1 + (128*x0)), rmask & xmask).to(tl.int1)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.full([1, 1], 0, tl.int1)
tmp4 = tl.where(rmask & xmask, tmp1, tmp3)
tmp5 = triton_helpers.any(tmp4, 1)[:, None]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109913
Approved by: https://github.com/lezcano
2023-12-19 14:16:10 +00:00
992c4e7b24 Actually run Dynamo tests in all Dynamo shards (#115962)
We weren't doing this before. Also adds some more skips so that CI
passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115962
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115925
2023-12-19 14:12:53 +00:00
0bd5a3fed7 [releng] Docker release Refactor Push nightly tags step. Move cuda and cudnn version to docker tag rather then name (#116097)
Follow up after : https://github.com/pytorch/pytorch/pull/116070

This PR does 2 things.

1. Refactor the Push nightly tags step; we no longer need to extract CUDA_VERSION. The new tag should be in this format: ``${PYTORCH_VERSION}-cuda$(CUDA_VERSION_SHORT)-cudnn$(CUDNN_VERSION)-runtime``
2. Move cuda$(CUDA_VERSION_SHORT)-cudnn$(CUDNN_VERSION) from the docker name to the tag

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116097
Approved by: https://github.com/jeanschmidt
2023-12-19 13:53:08 +00:00
a31effa15f Update device_mesh.py docs imports (#116074)
These are not importable from `torch.distributed`, at least today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116074
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-19 09:44:55 +00:00
eqy
2a44034895 [CUDA] Include <thrust/swap.h> in LinearAlgebra.cu (#116072)
Fixes build against the latest `NVIDIA/cccl`.

CC @malfet @xwang233 @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116072
Approved by: https://github.com/malfet, https://github.com/xwang233
2023-12-19 05:56:52 +00:00
327bdcdb14 Some tiny modification about torch.set/get_default_device (#116014)
1. Fix a bug in torch.set_default_device under multi-threading
2. Add a new interface named torch.get_default_device
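A minimal usage sketch of the new accessor (assumes a CUDA-capable build; the printed output is illustrative):

```python
import torch

torch.set_default_device("cuda")    # existing API, now safe across threads per item 1
print(torch.get_default_device())   # new accessor from item 2, e.g. device(type='cuda', index=0)
```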

Fixes #115333
Fixes #115917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116014
Approved by: https://github.com/malfet, https://github.com/jansel
2023-12-19 05:08:06 +00:00
b48abbc020 [DeviceMesh] Fix DeviceMesh docstring (#116053)
1. remove outdated comments
2. fix examples in docstring

Doc after fix:
<img width="706" alt="image" src="https://github.com/pytorch/pytorch/assets/31293777/19f4f03c-0fd7-4e88-bca1-1a6ce693fbb7">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116053
Approved by: https://github.com/wanchaol
2023-12-19 04:05:49 +00:00
8b0122ad33 Add lowerings for reflection_pad{1, 3}d_backward (#115645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115645
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-12-19 04:05:10 +00:00
9dda4b20a0 [MPS] Enable select/[broad]cast ops for complex dtypes (#115727)
By representing `torch.cfloat`/`torch.chalf` as the `float2`/`half2` Metal types and modifying `SCATTER_OPS_TEMPLATE`/`GATHER_OPS_TEMPLATE` to accept a third argument: a fully specialized `cast` function, which is a no-op for regular types but special-cased for float->complex and complex->float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115727
Approved by: https://github.com/kulinseth
2023-12-19 02:25:28 +00:00
cyy
1544c37520 [7/N] Fixes clang-tidy warnings in c10/{core,util}/*.h (#115495)
This PR continues to fix clang-tidy warnings for headers in c10/core and c10/util.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115495
Approved by: https://github.com/malfet
2023-12-19 02:14:30 +00:00
9b8f934068 Remove memory_format check for native_group_norm_backward (#115721)
To fix https://github.com/pytorch/pytorch/issues/115940.
Remove memory_format check for native_group_norm_backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115721
Approved by: https://github.com/mikaylagawarecki
2023-12-19 02:12:26 +00:00
01b979fc9a [Inductor] Fix constant folding and extern kernel mutation tracking bugs (#115908)
This PR fixes two bugs
1) Constant folding a triton kernel results in the kernel's inputs being returned without any modification. Disable constant folding for triton kernels; this needs more investigation.
2) NoneLayout buffers should not be deleted, as they do not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115908
Approved by: https://github.com/aakhundov, https://github.com/jansel
2023-12-19 02:06:50 +00:00
bb5a27052f [Dynamo][9/N] Make SkipFilesVariable wrap functions only (#115963)
Make ```SkipFilesVariable``` only handle function type, and route skipped classes to ```UserDefinedClassVariable```. The reasons behind this are:
* We'd like to remove ```is_allowed```, so the allowed/disallowed torch classes need a proper place to be handled. We can put them in either ```SkipFilesVariable``` or ```UserDefinedClassVariable``` under the current architecture, but it's confusing to have two places do one thing.
   - Going forward, let's make ```SkipFilesVariable``` only handle functions, and probably I'll rename it to ```SkippedFunctionVariable``` in the following PRs.
   - Let's do dispatch by value's type, all torch classes stuff would go to ```UserDefinedClassVariable``` in the next PR.
* We'd like to merge the in_graph/skip/inline trace decision into the same API ```trace_rule.lookup```, so we probably have to limit the input to functions only, for better organization of the ```VariableBuilder._wrap``` logic.
   - Next step, I'll merge ```skipfiles.check``` into ```trace_rules.lookup```, and do the skipfile check before wrapping them into correct variable tracker.
   - Though the ```TorchCtxManagerClassVariable``` is decided by ```trace_rules.lookup```, I'll refactor it out in the following PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115963
Approved by: https://github.com/jansel
2023-12-19 02:01:47 +00:00
47908a608f Revert "[ROCm] add hipblaslt support (#114329)"
This reverts commit b062ea38039234c80404a8f5f4d5a93c4cb9832d.

Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/jeanschmidt due to Reverting due to inconsistencies on internal diff ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1861933267))
2023-12-19 01:04:58 +00:00
ed0c0c49ef Revert "[ROCm] fix nightly 5.6 build (#116029)"
This reverts commit 63e242b1e41759f9b24a0fbb997f157a06a9dd13.

Reverted https://github.com/pytorch/pytorch/pull/116029 on behalf of https://github.com/jeanschmidt due to Need to revert, in order to be able to revert #114329 ([comment](https://github.com/pytorch/pytorch/pull/116029#issuecomment-1861931736))
2023-12-19 01:01:42 +00:00
368a0c06d4 [releng] Docker Official release make sure cuda version is part of image name (#116070)
Follow up on https://github.com/pytorch/pytorch/pull/115949

Change docker build image name:
``pytorch:2.1.2-devel``-> ``2.1.2-cuda12.1-cudnn8-devel and 2.1.2-cuda11.8-cudnn8-devel``

Ref: https://github.com/orgs/pytorch/packages/container/package/pytorch-nightly

Naming will be same as in https://hub.docker.com/r/pytorch/pytorch/tags
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116070
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-12-19 00:58:15 +00:00
5894af83be Use dequantized weight and bias in conv2d quantized ops (#115615)
Summary:
Dequantize weight and bias for conv2d ops to improve performance. The weight and bias are usually small in size, hence they do not increase the memory footprint by much when dequantized.

With optimization cunet-enc ops:
vulkan.quantized_conv2d  {96, 72, 2}                      3753204
vulkan.quantized_conv2d  {96, 72, 2}                      6977048
vulkan.quantized_conv2d_dw{96, 72, 2}                      2499640
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                       842088
vulkan.quantized_conv2d  {48, 36, 4}                      2388152
vulkan.quantized_conv2d  {48, 36, 4}                      4775940
vulkan.quantized_conv2d_dw{48, 36, 4}                       709800
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                       483236
vulkan.quantized_conv2d  {24, 18, 8}                      2562144
vulkan.quantized_conv2d  {24, 18, 8}                      5447624
vulkan.quantized_conv2d_dw{24, 18, 8}                       392756
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                       509080

Without optimization:
vulkan.quantized_conv2d  {96, 72, 2}                      4291768
vulkan.quantized_conv2d  {96, 72, 2}                      7871344
vulkan.quantized_conv2d_dw{96, 72, 2}                      2658500
vulkan.quantized_conv2d_pw_2x2{96, 72, 2}                       891020
vulkan.quantized_conv2d  {48, 36, 4}                      2966860
vulkan.quantized_conv2d  {48, 36, 4}                      5661812
vulkan.quantized_conv2d_dw{48, 36, 4}                       816556
vulkan.quantized_conv2d_pw_2x2{48, 36, 4}                       528632
vulkan.quantized_conv2d  {24, 18, 8}                      3139604
vulkan.quantized_conv2d  {24, 18, 8}                      6202820
vulkan.quantized_conv2d_dw{24, 18, 8}                       452660
vulkan.quantized_conv2d_pw_2x2{24, 18, 8}                       557388

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest

...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest

...
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin

Differential Revision: D50997532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115615
Approved by: https://github.com/manuelcandales, https://github.com/yipjustin
2023-12-19 00:23:52 +00:00
270ed13e87 [DTensor] Make DTensor from_local backward partial() to replicate() pass through (#115967)
Summary:
This change makes the `DTensor.from_local()` backward-pass placement conversion from `Partial()` to `Replicate()` a pass-through, for the following reasons:
1. When we run the backward pass of DTensor.from_local, if the target placement is Partial() (i.e., manually overwritten in user code instead of coming from torch_dispatch), we keep the grad as Replicate(). This is because converting the gradients back to `Partial()` is meaningless.
2. The current div logic would lead to wrong numerical values in the above case.

Test Plan:
**CI**:
CI Tests

**Unit test**:
`buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:redistribute`
- Passed

**With model training**:
```
# We tested the case where the input tensor is manually overwritten as Partial() and
# the output tensor is manually overwritten to Shard() and then converted to local.

# Before the change: numerical value not correct
Forward pass:
    collective: ReduceScatter
backward pass:
    collective: AllGather + div by process group size

# After the change: div is removed as expected.
Forward pass:
    collective: ReduceScatter
Backward pass:
    collective: AllGather
```

Differential Revision: D52175709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115967
Approved by: https://github.com/wanchaol
2023-12-19 00:16:10 +00:00
3472a9200d expand subclass type tests in dynamo (#116024)
Following up on my own comments in https://github.com/pytorch/pytorch/pull/115323#pullrequestreview-1769491483.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116024
Approved by: https://github.com/mlazos
2023-12-19 00:08:55 +00:00
054f9548b4 [dynamo] Store CompilationEvents in a buffer in torch._dynamo.utils (#115788)
Motivation: it would be nice to be able to test using the metrics in log_compilation_event; currently it dumps logs (or logs to a database in fbcode), and these are hard to use in unit tests.

This change:
* always record the information in torch._dynamo.utils.record_compilation_metrics; here, log into a limited-size deque to prevent the list of metrics from getting too long
* if config.log_compilation_metrics, then call back into the original log_compilation_event function
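A minimal sketch of the buffering idea described above (names are illustrative and do not mirror the exact torch._dynamo.utils implementation):

```python
import collections

LOG_COMPILATION_METRICS = True                        # stands in for config.log_compilation_metrics
_compilation_metrics = collections.deque(maxlen=64)   # bounded buffer so the history cannot grow unbounded

def log_compilation_event(metrics):
    print("logged:", metrics)                         # stands in for the original logging/DB path

def record_compilation_metrics(metrics):
    _compilation_metrics.append(metrics)              # always buffered, so unit tests can inspect it
    if LOG_COMPILATION_METRICS:
        log_compilation_event(metrics)                # optionally forward to the original sink
```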

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115788
Approved by: https://github.com/yanboliang
2023-12-18 23:26:13 +00:00
fc58909bab Fix allowed dtypes for mem_eff attention (#116026)
# Summary

Fix a bug in detecting memory-efficient attention capability for CUDA devices with compute capability below sm80:
https://github.com/pytorch-labs/gpt-fast/issues/49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116026
Approved by: https://github.com/janeyx99
2023-12-18 23:20:52 +00:00
6b120c6cf9 Update the sdpa benchmark to measure forward backward time in isolation (#115986)
# Summary

The benchmarks were getting a little stale and I think it makes more sense to measure in isolation now rather than E2E in a mha component.

This is a pre-req for getting the data for https://github.com/pytorch/pytorch/pull/115357

Output from run:
``` Shell
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 23.86634959839284  | 66.21150835417211  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 23.452017060481012 | 66.90612225793302  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 24.478124547749758 |  76.4232068322599  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 |  24.6928428998217  | 75.76151192188263  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 28.69622849393636  | 114.73898496478796 |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 34.399422979913645 | 112.96746158041059 |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 |  65.4690912924707  | 216.26344555988908 |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 88.57532404363155  | 212.07790216431025 |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.582905380055308 | 70.09557797573505  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.068384909071026 | 70.01491216942668  |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 31.671419646590945 | 203.54910241439939 |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 |  33.0585768679157  | 209.45609430782497 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 87.43969700299202  | 469.8729298543185  |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 123.9265550393611  | 580.1084265112877  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 561.1918237991632  | 1181.655174586922  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 884.2707145959139  | 1662.4679416418073 |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115986
Approved by: https://github.com/mikaylagawarecki
2023-12-18 22:40:47 +00:00
bf62511e07 Reshape decomposition for jagged layout NT (#115191)
No more segfault from using `reshape()` on jagged NT :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115191
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-18 22:34:41 +00:00
63e242b1e4 [ROCm] fix nightly 5.6 build (#116029)
The ROCm 5.6 nightly wheel build was broken by #114329.  This fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116029
Approved by: https://github.com/huydhn, https://github.com/jithunnair-amd, https://github.com/atalman
2023-12-18 22:12:30 +00:00
8452f41305 Adds allreduce to inductor remap (#115950)
Fixes #115728

Implements a rewrite path for allreduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950
Approved by: https://github.com/wconstab
2023-12-18 22:00:22 +00:00
2a5659a797 add length assertion to PrepareModuleInput and PrepareModuleOutput (#115957)
## summary

`zip(inputs, self.input_layouts, self.desired_input_layouts)` is used in `_prepare_input_fn`; similarly for `_prepare_output_fn`. Without an assertion, unmatched dimensions in inputs/outputs will be silently lost, potentially causing unexpected behaviors.
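A plain-Python illustration of the failure mode (not code from the PR):

```python
inputs = ("a", "b", "c")
input_layouts = ("row", "col")           # one layout short
print(list(zip(inputs, input_layouts)))  # [('a', 'row'), ('b', 'col')] -- 'c' is silently dropped
# An explicit length assertion (or zip(..., strict=True) on Python 3.10+) surfaces the mismatch instead.
```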

## test plan
`python test/distributed/tensor/parallel/test_tp_style.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115957
Approved by: https://github.com/wanchaol
2023-12-18 21:50:18 +00:00
a699b10339 [buck2][win] fix caffe2 protobuf_rule (#115954)
Summary:
c2_protobuf_rule ([here](https://fburl.com/code/iyiulpmv)) is broken on buck2, ultimately due to the following error:

> .\./caffe2.proto: File does not reside within any path specified using --proto_path (or -I).  You must specify a --proto_path which encompasses this file.  Note that the proto_path must be an exact prefix of the .proto file names -- protoc is too dumb to figure out when two paths (e.g. absolute and relative) are equivalent (it's harder than you think).

The root cause is differences in how buck1 and buck2 handle `%SRCDIR%` (absolute versus relative paths). This diff fixes the build.

Test Plan:
# Before

```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

```
More details at https://www.internalfb.com/intern/buck/build/c6550454-ae6d-479e-9d08-016e544ef050
BUILD SUCCEEDED
```

```
Action failed: fbsource//xplat/caffe2:caffe2.pb.h (genrule)
Remote command returned non-zero exit code <no exit code>
Reproduce locally: frecli cas download-action 5df17cf64b7e2fc5ab090c91e1129f2f3cad36dc72c7c182ab052af23d3f32aa:145
stdout:
stderr:
OUTMISS: Missing outputs: buck-out/v2/gen/fbsource/dd87aacb8683145b/xplat/caffe2/caffe2.pb.h/out/caffe2.pb.h
```

# After

Buck1 still works

```
buck1 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

Buck2 works

```
buck2 build arvr/mode/win/opt //xplat/caffe2:caffe2.pb.h
```

```
Buck UI: https://www.internalfb.com/buck2/e5dae607-325a-4eab-b0c9-66fe4e9a6254
BUILD SUCCEEDED
```

Differential Revision: D52218365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115954
Approved by: https://github.com/mcr229
2023-12-18 21:41:10 +00:00
2f7bb18def [Doc] Add padding size constraint in nn.ReflectionPad2d (#115995)
Fixes #115532
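For context, a hedged sketch of the constraint being documented (reflection padding must be smaller than the corresponding input dimension; the printed message is illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)
out = nn.ReflectionPad2d(3)(x)   # ok: padding 3 < input size 4
try:
    nn.ReflectionPad2d(4)(x)     # padding 4 >= input size 4
except RuntimeError as e:
    print(e)                     # padding size should be less than the corresponding input dimension
```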

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115995
Approved by: https://github.com/mikaylagawarecki
2023-12-18 21:29:14 +00:00
1e272fb6d6 [export] Undo "module: export" labeling (#116042)
Delete the auto-labeling of "module: export" as this is not really used, and we want to delete the "module: export" label.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116042
Approved by: https://github.com/clee2000
2023-12-18 21:23:17 +00:00
c4748b425e Add main in dynamo/test_compile.py (#115941)
Need to verify that it uses dynamo's custom TestCase and run_tests instead of the general common_utils TestCase and run_tests.
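A minimal sketch of the pattern being verified (assuming the dynamo test harness layout at the time):

```python
# at the bottom of dynamo/test_compile.py
from torch._dynamo.test_case import run_tests

if __name__ == "__main__":
    run_tests()
```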

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115941
Approved by: https://github.com/msaroufim
2023-12-18 20:53:28 +00:00
a1a0b290d2 [tp] further fix the docs (#115974)
A typo resulted in the note section not being rendered properly; this couldn't be seen
from the last PR directly, as the last PR only showed the first commit's
documentation :(

Also make the parallelize_module doc example more concrete

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115974
Approved by: https://github.com/wz337
2023-12-18 20:41:53 +00:00
8868c1cfae [sparse][ci] Add cuSPARSELt to CI (#115369)
Summary:

This PR adds cuSPARSELt v0.4.07 to CI (CUDA 12.1 and 11.8) to run our cuSPARSELt-specific tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115369
Approved by: https://github.com/malfet
2023-12-18 20:33:30 +00:00
2b2ed52799 [xla hash update] update the pinned xla hash (#116003)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116003
Approved by: https://github.com/clee2000
2023-12-18 20:31:49 +00:00
7b6210e8a4 Use matrix generate script for docker release workflows (#115949)
Enable both supported CUDA version builds for docker release. Rather then building only 1 version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115949
Approved by: https://github.com/huydhn
2023-12-18 20:20:59 +00:00
e30d436b01 [fx][split][testing] Add testing for #107981 (#108731)
- Follow-up to #107981, adding testing for metadata copying in placeholder nodes within the `split_by_tags` utility
- Validation included in the test from #107248, since both tests are relevant to the same aspect of the utility
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108731
Approved by: https://github.com/angelayi
2023-12-18 20:19:18 +00:00
bf20b56e9d Fix PyTorch build error on ppc64le (#115729)
The PyTorch build breaks when building from tip on ppc64le with the following error: pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:863:46: error: no matching function for call to 'at::vec::DEFAULT::Vectorized<c10::qint8>::dequantize(at::vec::DEFAULT::Vectorized&, at::vec::DEFAULT::Vectorized&)

Issue reported #115165

This patch fixes the build issue.

Fixes #115165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115729
Approved by: https://github.com/albanD
2023-12-18 19:00:56 +00:00
77366ba637 Increased hardcoded limit for number of GPUs. (#115368)
Fixes #115331.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115368
Approved by: https://github.com/albanD
2023-12-18 18:39:19 +00:00
80b1ecc308 Run eager adam optimizer in benchmarks where possible (#115445)
Runs eager Adam (instead of SGD) on all models that don't fail accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115445
Approved by: https://github.com/desertfire
2023-12-18 18:28:23 +00:00
8a445f7bd5 Serve multistream graph captures from correct pool (#114647)
This fixes #114320 by placing the logic for determining whether to allocate
to a pool inside a callback that is controlled by CUDAGraph.cpp or by the
python bound api to allocate a stream directly to a pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-12-18 18:24:15 +00:00
3b70bd3970 Take 2 of "Add an option to log the source of the Triton kernels generated by torch._inductor (#115979)
Summary: This is useful for comparing the Triton kernels generated by two different invocations of torch.compile on the same model (e.g., checking whether serial compile and parallel compile generate identical Triton kernels).

Test Plan:
Unit test:
buck2 test mode/opt //caffe2/torch/fb/module_factory/sync_sgd/tests:test_torchdynamo_wrapper -- --print-passing-details >& ~/tmp/log.test
PyPer Mast job:
https://www.internalfb.com/mast/job/sw-951074659-OfflineTraining_87587a4e
See the *.py files generated in:
pyper_traces/tree/torchinductor_traces/sw-951074659-OfflineTraining_87587a4e/4623

Differential Revision: D52221500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115979
Approved by: https://github.com/yanboliang
2023-12-18 18:16:44 +00:00
386776c49a [torch] Reduce memory usage by adding flags to clear intermediate graphs used for optimization during inference. (#115657)
Summary: At inference time the intermediate graphs used for optimization are not needed, so the Executor's graph is the only graph we need to keep around when these two flags are set.

Test Plan:
the FLAGS are all off by default

baseline
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true
I1212 10:24:20.407408 401092 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 182863 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true
I1212 10:31:37.663487 464000 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 186127 Kb
```
```
buck run mode/opt-clang  sigrid/predictor/client/localnet:run_model -- --model_id_to_load=951679039 --model_snapshot_to_load=244 --torch_jit_do_not_store_optimized_graph=true --torch_jit_release_profiling_graph_after_optimization=true --torch_jit_execution_plan_avoid_extra_graph_copy=true
I1212 10:29:42.848093 447218 SigridPredictorLocalModelFactory.cpp:32] Memory usage for 951679039_244 is 129451 Kb
```

Differential Revision: D52081631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115657
Approved by: https://github.com/houseroad
2023-12-18 17:56:39 +00:00
dd367b7c8f check tensor subclass when using torch.compile + SAC (#115960)
As titled: when using SAC + torch.compile, the check currently only looks for
functional tensors, not tensor subclasses, so SAC under torch.compile would
ignore tensor types like tensor subclasses. Fixed in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115960
Approved by: https://github.com/bdhirsh
2023-12-18 17:49:06 +00:00
e43d33f4f7 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-18 17:48:47 +00:00
647f14e70b [BE]: Enable clang-tidy check for readability-string-compare (#115994)
Adds a clang-tidy check to flag uses of string compare() that are less efficient and less readable than the equality operator when an equality overload exists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115994
Approved by: https://github.com/albanD
2023-12-18 16:13:00 +00:00
d7caef7996 [CI] Update clang-format (#116002)
To 17.0.6, built using https://github.com/pytorch/test-infra/blob/main/.github/workflows/clang-tidy-linux.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116002
Approved by: https://github.com/suo
2023-12-18 14:58:46 +00:00
c285ca7916 [AOTInductor] Add updating constant buffer to active buffer. (#116001)
Summary:
Refactor update inactive constant buffer to allow updating with active
buffer.

Test Plan:
Existing test to test inactive buffer updates.
UpdateConstantsCuda in cpp test for active buffer updates.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116001
Approved by: https://github.com/chenyang78
2023-12-18 11:49:03 +00:00
34fe850d00 SymInt'ify sparse_compressed_tensor (#107903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107903
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115586
2023-12-17 17:36:20 +00:00
419f2ca3e3 Fix a crash in sparse compressed tensor invariants check when nnz == 0 (#115825)
Fixes the Python crash example from https://github.com/pytorch/pytorch/issues/115755
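
For illustration, a minimal sketch of the nnz == 0 construction this check now handles (shapes are illustrative, not the exact repro from the issue; the `check_invariants` argument is assumed to be available on the factory function):

```python
import torch

# A 2x3 CSR tensor with zero stored elements (nnz == 0).
t = torch.sparse_csr_tensor(
    torch.tensor([0, 0, 0]),              # crow_indices: no stored elements in either row
    torch.tensor([], dtype=torch.int64),  # col_indices
    torch.tensor([]),                     # values
    size=(2, 3),
    check_invariants=True,                # invariant checking is the code path that used to crash
)
print(t)
```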

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115825
Approved by: https://github.com/cpuhrsch
2023-12-17 17:36:15 +00:00
eafeba71c1 Adamw refactor (#115983)
Fixes #104899; refactors AdamW by abstracting out common code shared with Adam.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115983
Approved by: https://github.com/janeyx99
2023-12-17 06:58:39 +00:00
87ea6fb844 Make input contiguous for DTensor reduce scatter to fix the incorrect numerical values (#115847)
Summary:
This change is to make the input tensor contiguous for DTensor reduce scatter in the case no padding is needed.

There's no exception thrown during training, but we ran into numerical value correctness issue without the change.

Test Plan:
**CI**
CI test

**WHEN model test**:
- Verified loss for each iteration within the expected range.
- Verified NE on-par with this change with 4B training data.

Differential Revision: D52170822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115847
Approved by: https://github.com/wanchaol
2023-12-17 01:35:09 +00:00
bc4115ffcf [Inductor][Observability] Change to log.debug to avoid excessively long logs (#115474)
Summary: As titled.

Test Plan: CI

Differential Revision: D52003825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115474
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2023-12-17 00:25:54 +00:00
4123cca859 [AARCH64] Fall back to GEMM if mkldnn_matmul fails (#115936)
- Add call to `at::globalContext().userEnabledMkldnn()` to `apply_mkldnn_matmul_heur`
- Surround calls to `mkldnn_matmul` with `try {} catch {}`
- Print warning and fall back to BLAS (by calling  `at::globalContext().setUserEnabledMkldnn()`) if `mkldnn_matmul()` fails

Test plan: On Linux arm run:
```shell
$ sudo chmod 400 /sys; python -c "import torch;m=torch.nn.Linear(1, 32);print(torch.__version__);print(m(torch.rand(32, 1)))"
Error in cpuinfo: failed to parse the list of possible processors in /sys/devices/system/cpu/possible
Error in cpuinfo: failed to parse the list of present processors in /sys/devices/system/cpu/present
Error in cpuinfo: failed to parse both lists of possible and present processors
2.3.0.dev20231215
bad err=11 in Xbyak::Error
bad err=11 in Xbyak::Error
/home/ubuntu/miniconda3/envs/py311/lib/python3.11/site-packages/torch/nn/modules/linear.py:116: UserWarning: mkldnn_matmul failed, switching to BLAS gemm:internal error (Triggered internally at /pytorch/aten/src/ATen/native/LinearAlgebra.cpp:1509.)
  return F.linear(input, self.weight, self.bias)
tensor([[-0.5183,  0.2279, -0.4035,  ..., -0.3446,  0.0938, -0.2113],
        [-0.5111,  0.2362, -0.3821,  ..., -0.3536,  0.1011, -0.2159],
        [-0.6387,  0.0894, -0.7619,  ..., -0.1939, -0.0282, -0.1344],
        ...,
        [-0.6352,  0.0934, -0.7516,  ..., -0.1983, -0.0247, -0.1366],
        [-0.4790,  0.2733, -0.2862,  ..., -0.3939,  0.1338, -0.2365],
        [-0.5702,  0.1682, -0.5580,  ..., -0.2796,  0.0412, -0.1782]],
       grad_fn=<AddmmBackward0>)
```
Fixes https://github.com/pytorch/pytorch/issues/114750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115936
Approved by: https://github.com/lezcano
2023-12-16 21:37:56 +00:00
b06b02559e Support non grapharg and intermediary grad access (#115898)
Support for something we need for both FSDP and optimizers. For sourced args that are not inputs (params, etc.), we use the dynamic_getattr flow on tensors. This soundly handles the storage, registration, and guarding downstream of tensor_wrap for the grad values. For non-sourced args (true intermediates), we only support None (the idea being that if we have a true intermediate in the graph with grad, we are already doing something weird).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115898
Approved by: https://github.com/bdhirsh
ghstack dependencies: #115315, #112184
2023-12-16 18:43:37 +00:00
c5dcb50c00 [easy] aten ops: support passing all args as kwargs, including self (#114920)
Summary:
This is important for writing aten-IR-based graph transformations.

```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [8]: torch.ops.aten.reshape.default(torch.rand(1,2), shape=[2])
Out[8]: tensor([0.7584, 0.4834])

# === CANNOT CALL `self` BY KWARGS ===

In [7]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])

TypeError: OpOverload.__call__() got multiple values for argument 'self'

```

# Where's the problem?

1. the aten ops first arg is usually named `self` (aten/src/ATen/native/native_functions.yaml)
2. Unfortunately, in `torch._ops.{OpOverload, OpOverloadPacket}.__call__()`, the first arg is (by python convention) named `self` too.

So when `self` is passed as a kwarg, `OpOverloadPacket.__call__` receives:

```
OpOverloadPacket.__call__(self, {"self": ...})
```

Python does not allow the same argument name to be supplied twice, and hence:

> TypeError: OpOverload.__call__() got multiple values for argument 'self'

# How to fix?

**Note that**, in the above, `self` is an instance of `OpOverloadPacket`, and the "self" kwarg is the input tensor to the aten op. To fix, we only need to differentiate the two `self`s.

In Python, the first arg of a method does not need to be named `self`. So we change the `__call__` definition to:

```
def __call__(_self, ...):
```

Now the call becomes:

```
OpOverloadPacket.__call__(_self, {"self": ...})
```

where:
* `_self` is the instance to the `OpOverloadPacket`
* `"self"` is the input tensor to the aten op.

Test Plan:
```
In [4]: [x.name for x in torch.ops.aten.reshape.default._schema.arguments]
Out[4]: ['self', 'shape']

In [3]: torch.ops.aten.reshape.default(self=torch.rand(1,2), shape=[2])
Out[3]: tensor([0.5127, 0.3051])
```

Differential Revision: D51731996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114920
Approved by: https://github.com/houseroad
2023-12-16 18:32:58 +00:00
88207b10ca Enable thp(transparent huge pages) for buffer sizes >=2MB (#107697)
The 2MB THP pages provide better allocation latencies compared to the standard 4KB pages. This change has shown substantial improvement for batch-mode use cases where the tensor sizes are larger than 100MB.

Only enabled if THP_MEM_ALLOC_ENABLE environment variable is set.
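
A minimal opt-in sketch (assuming the environment variable only needs to be visible to the process before the large allocation happens):

```python
import os

# Opt in to 2MB transparent huge pages for large CPU allocations.
os.environ["THP_MEM_ALLOC_ENABLE"] = "1"

import torch

# A buffer well above the 2MB threshold (~256MB of fp32).
buf = torch.empty(64 * 1024 * 1024)
```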

Relanding https://github.com/pytorch/pytorch/pull/93888 with functionality disabled for Android

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107697
Approved by: https://github.com/malfet
2023-12-16 18:16:19 +00:00
622947afa8 [BE] Use nested namespace in ATen/native (#115938)
It's a C++17 feature that usually makes code a bit more compact, and should have no side-effects otherwise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115938
Approved by: https://github.com/Skylion007
2023-12-16 06:07:40 +00:00
e3aefe2970 Revert "Initial Flash Attention support on ROCM (#114309)" (#115975)
This reverts commit 5bddbed399a89bf2875a38bb84cb869f382f1809.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115975
Approved by: https://github.com/atalman, https://github.com/malfet
2023-12-16 03:40:14 +00:00
8283491eff [TEST] Increase numerical tolerances in test_torchinductor_opinfo:test_comprehensive (#115768)
There are numerical mismatches that cause some tests of `test_comprehensive` to fail. I propose to just increase tolerances a bit to make them pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115768
Approved by: https://github.com/jansel
2023-12-16 03:00:22 +00:00
49af19cd8e Skip some flaky Dynamo tests in test_linalg.py (#115925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115925
Approved by: https://github.com/lezcano
2023-12-16 02:38:56 +00:00
2a2f2e454a [inductor] Fixed issue with true div on integer input with dyn shapes (#115920)
Related to https://github.com/pytorch/pytorch/issues/115742, `Cpu/CudaTests.test_div8`

Description:
- Fixed issue with true div on integer input with dyn shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115920
Approved by: https://github.com/peterbell10
2023-12-16 02:06:39 +00:00
d08905db7e Trigger a mergeability check on ghstack PRs (#115944)
Works to solve https://github.com/pytorch/test-infra/issues/4816

In conjunction with https://github.com/pytorch/test-infra/pull/4823, this PR should make it such that all ghstack PRs kick off a mergeability-check job.

Test plan, once https://github.com/pytorch/test-infra/pull/4823 is merged, I'll resubmit this diff to make sure the workflow job triggers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115944
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2023-12-16 01:53:10 +00:00
14a6b24c8b [Dynamo][8/N] Wrap itertools.* as ItertoolsVariable (#115802)
This is part of a series of changes before removing ```is_allowed```.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115802
Approved by: https://github.com/voznesenskym
2023-12-16 01:42:02 +00:00
056a882cb9 add markDynamoStrictTest to TestOptimRenewed, removing flakiness (#115947)
fixes #115406 fixes #115394 fixes #115393 fixes #115392 fixes #115391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115947
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-12-16 01:33:32 +00:00
0597eb56c2 Generate exhaustive compiled optimizer tests (#115906)
Generates tests for all permutations of arguments using the existing optimizer infos.
Covers capturable, cpu/gpu, single/multitensor and optimizer specific constants like rho/etas, etc.

[new test list](https://gist.github.com/mlazos/d3404383e7c3d490cbb51b7d6c750629)
[old test list](https://gist.github.com/mlazos/e0043aee1b6a0962d2f3ac8193aa62f8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115906
Approved by: https://github.com/janeyx99
2023-12-16 00:42:43 +00:00
034e871710 [Dynamo] Look up variables from the old frame, rather than copying variables to the new frame; skip some copies to save time. (#115062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115062
Approved by: https://github.com/williamwen42
2023-12-16 00:02:59 +00:00
94d28161fa Fix broken PyYAML 6.0 on MacOS x86 (#115956)
Maybe we should just get rid of x86 jobs, but that's for another day.  This one should fix the broken build in trunk, i.e. https://github.com/pytorch/pytorch/actions/runs/7227220153/job/19694420117.

I guess that the failure looks flaky depending on the version of the default python3 on the GitHub x86 runner.

The issue from PyYAML https://github.com/yaml/pyyaml/issues/601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115956
Approved by: https://github.com/malfet
2023-12-15 23:17:05 +00:00
74dfdc567b [MPS] aten::erfinv bug fix: add storage offset buffers to handle slicing (#105801)
A bug fix of a recently merged PR per comment: https://github.com/pytorch/pytorch/pull/101507#discussion_r1271393706

The follow test would fail without this bug fix:

```
import torch
def test_erfinv():
    for device in ['cpu', 'mps']:
        x = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5], device=device)
        y = x[2:].erfinv()

        x2 = torch.tensor([0.3, 0.4, 0.5], device=device)
        y2 = x2.erfinv()

        print(y)
        print(y2)

        torch.testing.assert_close(y, y2)
        print(f"{device} passes.")

test_erfinv()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105801
Approved by: https://github.com/malfet
2023-12-15 23:14:03 +00:00
d92d4133e7 [8/n] Update XNNPACK Submodule Version Part 8 Everything Remaining to get it to work (#115714)
> **__Note:__** The XNNPACK upgrade is very large (on the order of **40k** files and **10m** lines of code), so we break the update of the library into multiple parts. All parts [1 - n] must be landed together for it to work. ***This also means that if there is a revert, please revert the entire stack.***

This change contains everything remaining that is required for the XNNPACK version update to work.

@allow-large-files

Differential Revision: [D52099769](https://our.internmc.facebook.com/intern/diff/D52099769/)

---
submodule
(unblock merge to make ShipIt happy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115714
Approved by: https://github.com/digantdesai
2023-12-15 23:08:08 +00:00
2e517b20d9 [MPS] Add Conv3D support for MPS (#114183)
Fixes #77818

I saw that PR #99246 was approved, but no one fixed the rebase conflicts, so I am bringing this up again to be merged.
I am leveraging @mattiaspaul's work. Quoting the description here:

> * this pull request enables 3D convolutions (forward/backward) for MPS (Apple Silicon) within the same Convolution.mm file as conv2d.
> * does not support channel_last (since pytorch doesn't implement channel_last for 3D tensors)
> * does not support conv3d_transpose and treats depth-separable convolutions not as normal case (there are no MPS kernels available for either of those so far)
> * requires MacOS >=13.2 (Ventura)

Please, let me know if there are any other changes needed and I'll be happy to implement them.
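
For reference, a minimal usage sketch (assuming an Apple Silicon machine on macOS >= 13.2 with the MPS backend available):

```python
import torch

if torch.backends.mps.is_available():
    conv = torch.nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3).to("mps")
    x = torch.randn(1, 3, 16, 32, 32, device="mps")  # contiguous input; channels_last is not supported for 3D tensors
    y = conv(x)
    print(y.shape)  # torch.Size([1, 8, 14, 30, 30])
```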

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114183
Approved by: https://github.com/malfet
2023-12-15 23:05:01 +00:00
9fcf6fb6fe [C10D] Add waitForDumpOrTimeout to log on dump abandonment (#115876)
Helps call attention to any cases where the dump actually times out.

The timeout is likely to hit if we run into slow stacktrace processing.

Log any exceptions encountered in the background thread, but don't raise
them- we're already willing to abandon the debug dump, and want to
proceed with our normal execution (in the case of dumppipe) or shutdown
process (when dumping happens on timeout and shutdown is already
initiated).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115876
Approved by: https://github.com/zdevito
ghstack dependencies: #115807
2023-12-15 22:13:06 +00:00
82e0d00da9 [c10d] Polish NCCL PG monitor thread log message (#115888)
We turned on monitor thread by default in https://github.com/pytorch/pytorch/pull/112518, and we want the error message that is displayed when the monitor kills the process to be more informative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115888
Approved by: https://github.com/wconstab
2023-12-15 22:00:29 +00:00
1f3bdf40ad [export] Update schema version (#115712)
Since the PyTorch 2.1 release we've made some BC-breaking changes to the serialized schema. We should update it in time for the 2.2 release. Some of the changes include:

* https://github.com/pytorch/pytorch/pull/114371 - custom class objects / pybinded objects are no longer saved directly to the `ExportedProgram` structure. Instead, the name is serialized inside the program, and the actual bytes are stored in a separate location from the exported program, allowing them to be saved to a different location.
* https://github.com/pytorch/pytorch/pull/111204 - `GraphSignature` structure changed and `call_spec` is removed from the `GraphModule` schema
* https://github.com/pytorch/pytorch/pull/111407 - `loss_outout` -> `loss_output`
* https://github.com/pytorch/pytorch/pull/113075 - `example_inputs` removed from the `ExportedProgram` structure (this originally did not store anything), `dialect` added to the `ExportedProgram` structure.
* https://github.com/pytorch/pytorch/pull/113689 - tensor constants are now lifted as inputs to the graph, and their locations are stored in the `GraphSignature`
* https://github.com/pytorch/pytorch/pull/114172 - removed `equality_constraints` and added a `SymExprHint` for all symbolic expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115712
Approved by: https://github.com/gmagogsfm
2023-12-15 21:43:03 +00:00
715d663794 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 21:21:10 +00:00
50c9665f92 Revert "[export] Support torch.sym* ops (#115854)"
This reverts commit 347cb91946318eaedc350c2c3cda659d1cbde931.

Reverted https://github.com/pytorch/pytorch/pull/115854 on behalf of https://github.com/atalman due to OSSCI oncall, broke multple jobs ([comment](https://github.com/pytorch/pytorch/pull/115854#issuecomment-1858486796))
2023-12-15 21:07:52 +00:00
80a9625d9f Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit 1bb0d0fc1f1da750206fad45f32e9564f0edd1f4.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858482486))
2023-12-15 21:04:12 +00:00
1bb0d0fc1f non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 20:11:30 +00:00
347cb91946 [export] Support torch.sym* ops (#115854)
Fixes https://github.com/pytorch/pytorch/issues/108830 and https://github.com/pytorch/executorch/issues/1379#issuecomment-1853322866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115854
Approved by: https://github.com/zhxchen17
2023-12-15 20:08:04 +00:00
6c2103bdf7 Fixed some failing inductor tests with exact_dtype=True (#115828)
Addresses point 1 from #115742: fixing  CPUReproTest.test_embedding_vec_bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115828
Approved by: https://github.com/peterbell10
2023-12-15 20:02:19 +00:00
91b848bf81 Revert "markDynamoStrictTest on more tests (#115879)"
This reverts commit 8b650cdd3cdd1174b399f312ec2f7955551a2f5d.

Reverted https://github.com/pytorch/pytorch/pull/115879 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115879#issuecomment-1858418921))
2023-12-15 20:00:09 +00:00
c006c8b50e Revert "markDynamoStrictTest some more (#115885)"
This reverts commit 55ce4693ff2c0b6e50b8af323f36ecc7ff929638.

Reverted https://github.com/pytorch/pytorch/pull/115885 on behalf of https://github.com/atalman due to OSSCI oncall, broke inductor ([comment](https://github.com/pytorch/pytorch/pull/115885#issuecomment-1858409669))
2023-12-15 19:51:24 +00:00
61abacf829 [tp] improve documentation (#115880)
Improve the TP documentation in terms of format and descriptions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115880
Approved by: https://github.com/XilunWu
2023-12-15 18:44:22 +00:00
d5115bfb06 Revert "[AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)"
This reverts commit 287a86567731ff4d87f71dcd285d0ab4253cfceb.

Reverted https://github.com/pytorch/pytorch/pull/115831 on behalf of https://github.com/desertfire due to rocm CI failure ([comment](https://github.com/pytorch/pytorch/pull/115831#issuecomment-1858322270))
2023-12-15 18:34:55 +00:00
72eab5aa43 Configures distributed_checkpoint label (#115833)
Configures the existing `module: distributed_checkpoint` label
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115833
Approved by: https://github.com/wconstab, https://github.com/wz337
2023-12-15 18:17:25 +00:00
1b506e7469 Revert "non-strict export with dynamic shapes (#115862)"
This reverts commit f54bb1ed566f27affff9fdbd5c1ceee854ef2de5.

Reverted https://github.com/pytorch/pytorch/pull/115862 on behalf of https://github.com/atalman due to OSSCI oncall, failing trunk / macos-12-py3-arm64 / test ([comment](https://github.com/pytorch/pytorch/pull/115862#issuecomment-1858197497))
2023-12-15 17:03:42 +00:00
7ed2bc7c67 [GHF] Do not block reverts with internal changes (#115903)
As the check is more often than not unreliable, it is better to just post a
warning and let the revert proceed.

Fixes https://github.com/pytorch/test-infra/issues/4797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115903
Approved by: https://github.com/clee2000, https://github.com/atalman
2023-12-15 17:00:07 +00:00
f54bb1ed56 non-strict export with dynamic shapes (#115862)
Differential Revision: [D52175048](https://our.internmc.facebook.com/intern/diff/D52175048/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115862
Approved by: https://github.com/zhxchen17
2023-12-15 16:38:45 +00:00
b062ea3803 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-15 15:36:46 +00:00
287a865677 [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)
Summary: Both ExternKernelAlloc and ExternKernelOut need the two fields, so declare them in the base class. Also add cpp codegen for IndexPutFallback and InplaceBernoulliFallback in this PR.

Differential Revision: [D52189999](https://our.internmc.facebook.com/intern/diff/D52189999)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115831
Approved by: https://github.com/chenyang78
2023-12-15 14:40:44 +00:00
66994bca5f Revert "[inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)"
This reverts commit 653acd8fe1d0a7b4a084a47ee022f163015fee64.

Reverted https://github.com/pytorch/pytorch/pull/115479 on behalf of https://github.com/desertfire due to will cause land race in fbcode because https://github.com/pytorch/pytorch/pull/115831 is already landed internally ([comment](https://github.com/pytorch/pytorch/pull/115479#issuecomment-1857979948))
2023-12-15 14:35:40 +00:00
55ce4693ff markDynamoStrictTest some more (#115885)
Featuring
test_native_mha.py
test_nn.py
test_prims.py
test_schema_check.py
test_serialization.py
test_show_pickle.py
test_sort_and_select.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115885
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871, #115879
2023-12-15 13:19:52 +00:00
8b650cdd3c markDynamoStrictTest on more tests (#115879)
Featuring:
test_mobile_optimizer.py
test_module_init.py
test_modules.py
test_multiprocessing.py
test_multiprocessing_spawn.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115879
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870, #115871
2023-12-15 13:19:52 +00:00
2d43e31aa9 Fix wrong behavior of is_alias_of and c10d::reducer on MTIA (#115553)
Reviewed By: kirteshpatil

Differential Revision: D51860023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115553
Approved by: https://github.com/fduwjj
2023-12-15 11:14:41 +00:00
4ea7430ffb [BE] Don't copy CuDNN libs twice (#115872)
- It was installed twice: once in the `/usr/local/cuda/lib64` folder and a second time in `/usr/lib64`
- And don't install CuDNN headers thrice, only in `/usr/local/cuda/include`
- Error on unknown CUDA version
- Modify bazel builds to look for cudnn in the `/usr/local/cuda` folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115872
Approved by: https://github.com/huydhn
2023-12-15 09:47:14 +00:00
b4d6443bcf [Dynamo] Log innermost user frame filename & lineno for better error aggregation (#115899)
CompilationMetrics example:
```
frame_key='1',
co_name='fn',
co_filename='/data/users/ybliang/debug/debug1.py',
co_firstlineno=58,
cache_size=0,
accumulated_cache_size=0,
guard_count=None,
graph_op_count=None,
graph_node_count=None,
graph_input_count=None,
entire_frame_compile_time_s=None,
backend_compile_time_s=None,
fail_type="<class 'torch._dynamo.exc.Unsupported'>",
fail_reason='custome dict init with args/kwargs unimplemented',
fail_user_frame_filename='/data/users/ybliang/debug/debug1.py',
fail_user_frame_lineno=61
```
where:
* ```fail_type``` and ```fail_reason``` are exceptions inside of Dynamo.
* ```fail_user_frame_filename``` and ```fail_user_frame_lineno``` are where the original user code triggered the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115899
Approved by: https://github.com/davidberard98, https://github.com/ydwu4
2023-12-15 08:24:55 +00:00
4edc921857 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initializes a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh (a sketch of the resulting selection logic follows the list below):
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  -  `msg > 10MB`: instructs the caller to fallback to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fallback to NCCL.
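
As a rough illustration of the thresholds above, here is a hypothetical Python helper (the actual selection lives in `c10d::IntraNodeComm`, not in Python):

```python
KB, MB = 1024, 1024 * 1024

def select_allreduce_algo(topology: str, msg_bytes: int) -> str:
    # Mirrors the message-size thresholds listed above.
    if topology == "fully_connected":
        if msg_bytes <= 256 * KB:
            return "one-shot"
        if msg_bytes <= 10 * MB:
            return "two-shot"
        return "fallback-to-NCCL"
    if topology == "hybrid_cube_mesh":
        return "one-shot" if msg_bytes <= 256 * KB else "fallback-to-NCCL"
    return "fallback-to-NCCL"

print(select_allreduce_algo("fully_connected", 1 * MB))  # two-shot
```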

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-15 08:17:35 +00:00
cd47e335d1 [TEST] Skip test_schema_correctness for float8 dtype (#115757)
According to https://github.com/pytorch/pytorch/issues/107256#issuecomment-1705341870, the ops tested in `test_schema_correctness` are not yet supported with `torch.float8_e4m3fn`. Until they are supported, it is best to skip the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115757
Approved by: https://github.com/drisspg
2023-12-15 06:26:46 +00:00
c1c9b739e2 Back out "[aotinductor] replace lld with the default ld linker (#115478)" (#115875)
Summary:
Back out the diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115875
Approved by: https://github.com/chenyang78
2023-12-15 05:56:06 +00:00
478f0e96dc markDynamoStrictTest more tests (#115871)
For:
test_dispatch.py
test_fake_tensor.py
test_indexing.py
test_linalg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115871
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858, #115870
2023-12-15 05:26:54 +00:00
7f686c8fe1 More markDynamoStrictTest (#115870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115870
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857, #115858
2023-12-15 05:26:54 +00:00
9ae0e62929 [PT2] [Quant] Change the QConv2d Binary post op name from add to sum (#115329)
**Summary**
Change the QConv2d Binary fusion post op name from `add` to `sum`, since we are actually using OneDNN `post op sum` instead of `Binary_Add` for now.

**TestPlan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_sum_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_pt2e
python -m pytest test_quantized_op.py -k test_qconv2d_sum_relu_float_output_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115329
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-12-15 05:10:47 +00:00
653acd8fe1 [inductor] split test_cpp_wrapper.py into cpu and cuda test files (#115479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115479
Approved by: https://github.com/atalman
ghstack dependencies: #115167
2023-12-15 04:04:08 +00:00
eqy
9056903b09 [CUDA] 64-bit indexing for avg_pool_backward (#114193)
Fixes #113833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114193
Approved by: https://github.com/malfet
2023-12-15 03:58:46 +00:00
8e2d63cbc3 [export][reland] Remove runtime assertion pass (#115597)
Summary:
Reland of https://github.com/pytorch/pytorch/pull/115196
D52054112 to fix internal failures.

Test Plan: CI

Differential Revision: D52054110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115597
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2023-12-15 03:22:03 +00:00
7d4ccd7b9e [AOTI][refactor][2/n] Rename kernel to python_kernel_name (#115766)
Differential Revision: [D52164940](https://our.internmc.facebook.com/intern/diff/D52164940)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115766
Approved by: https://github.com/chenyang78
ghstack dependencies: #115783
2023-12-15 03:08:13 +00:00
8e1cff96e3 [C10D] Log PG size in init log (#115807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115807
Approved by: https://github.com/XilunWu
2023-12-15 02:38:54 +00:00
5989e1222d [BE] Set torch.cuda.has_half to True (#115884)
This check was introduced by https://github.com/pytorch/pytorch/pull/5417 and then turned into a tautology by https://github.com/pytorch/pytorch/pull/10147

So I guess it's time to let go of all that dynamic initialization (and maybe just delete it in 2.3?)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115884
Approved by: https://github.com/kit1980
2023-12-15 02:30:55 +00:00
a8e354a9a0 [sparse][semi-structured] enable fp32 support, separate sparse and dense constraints (#115550)
Summary:

Both cuSPARSELt and CUTLASS support 1:2 semi-structured sparsity for
fp32, which this PR enables (thanks @alexsamardzic).

Furthermore, this PR also updates the sparse_config to take into account
the different shape constraints for sparse and dense matrices.

Technically, cuSPARSELt supports smaller sparse matrix constraints as it
seems to pad to the CUTLASS constraints under the hood. However, in
practice small sparse matrices are not commonly used and we care more
about the dense constraints for LLM inference.

For now, we keep the CUTLASS constraints in place for both cuSPARSELt
and CUTLASS tensors

This PR also reconnects the _FUSE_TRANSPOSE flag for cuSPARSELt tensors.

Test Plan:
```
python test/test_sparse_semi_structured.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115550
Approved by: https://github.com/cpuhrsch
2023-12-15 02:28:17 +00:00
6d5fe07659 Fix numpy warning when importing torch without numpy installed (#115867)
Fixes #115638

I verified locally that with no numpy installed, the warning no longer occurs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115867
Approved by: https://github.com/soulitzer
2023-12-15 02:22:12 +00:00
9e84d0fa60 [MPS] Fix opposite error message in empty_mps (#115746)
Fixes #115625
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115746
Approved by: https://github.com/mikaylagawarecki
2023-12-15 01:31:40 +00:00
85262b0a9e markDynamoStrictTest some test_cpp_extensions.* (#115858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115858
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856, #115857
2023-12-15 01:22:38 +00:00
8ddca5aeae markDynamoStrictTest some more tests (#115857)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115857
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855, #115856
2023-12-15 01:22:38 +00:00
3477a2ee03 unMarkDynamoStrictTest on OpInfo-based tests (#115856)
These take too long to run under strict mode. We'll worry about them
later. Note that these decorators don't do anything yet (unless we flip
the default from non-strict to strict).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115856
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855
2023-12-15 01:22:31 +00:00
0722ce35f5 Increase number of Dynamo shards from 2->7 (#115855)
In preparation for ~3x increased test time coming in the upcoming PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115855
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845
2023-12-15 01:22:24 +00:00
4ccd8eb613 Add Dynamo test expected failure mechanism (#115845)
Tests that are added to a list in dynamo_test_failures.py will
automatically be marked as expectedFailure when run with
PYTORCH_TEST_WITH_DYNAMO=1. I'm splitting this PR off on its own so that
I can test various things on top of it.

Also added an unMarkDynamoStrictTest that is not useful until we turn
on strict mode by default.
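
A hedged sketch of the intended decorator usage (the import path is assumed here, not taken from this PR):

```python
from torch.testing._internal.common_utils import TestCase, markDynamoStrictTest, run_tests

@markDynamoStrictTest
class TestExample(TestCase):
    def test_add(self):
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    run_tests()
```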

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115845
Approved by: https://github.com/voznesenskym
2023-12-15 01:22:17 +00:00
5477120ebf [executorch] Update iOS toolchain with a modern cmake syntax. (#115799)
Summary: Replace exec_program with execute_process

Test Plan: CI

Differential Revision: D52147108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115799
Approved by: https://github.com/huydhn
2023-12-15 00:51:30 +00:00
f90a5f891b [AOTI][refactor][1/n] Rename cpp_kernel to cpp_kernel_name (#115783)
Differential Revision: [D52142184](https://our.internmc.facebook.com/intern/diff/D52142184)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115783
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-12-15 00:50:17 +00:00
1b8599283f Optimize quantized max pool 2d (#115690)
Summary:
We do not need to dequantize and quantize again for this op.

With this optimization cunet-enc ops:

vulkan.quantized_max_pool2d_quint8{48, 36, 2}                       207532
vulkan.quantized_max_pool2d_quint8{24, 18, 4}                        78832
vulkan.quantized_max_pool2d_quint8{12, 9, 8}                         49296

Without optimization:
vulkan.quantized_max_pool2d_quint8{48, 36, 2}                       234416
vulkan.quantized_max_pool2d_quint8{24, 18, 4}                        94380
vulkan.quantized_max_pool2d_quint8{12, 9, 8}                         58760

Test Plan:
Ensure all vulkan quantize tests pass:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 78 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 78 tests from VulkanAPITest

...
[==========] 78 tests from 1 test suite ran. (1519 ms total)
[  PASSED  ] 78 tests.

buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output"

Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
[==========] Running 395 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 395 tests from VulkanAPITest
...
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 395 tests from VulkanAPITest (6515 ms total)

[----------] Global test environment tear-down
[==========] 395 tests from 1 test suite ran. (6515 ms total)
[  PASSED  ] 394 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS

Reviewed By: yipjustin, copyrightly

Differential Revision: D50998619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115690
Approved by: https://github.com/SS-JIA
2023-12-15 00:45:37 +00:00
6fee208064 Handle -1 in jagged layout NT view ops (#115843)
Allows for inheriting the ragged and batch dims via -1:
```python
nt.view(-1, -1, D)
nt.expand(B, -1, D)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115843
Approved by: https://github.com/soulitzer
ghstack dependencies: #115636
2023-12-15 00:42:47 +00:00
c947ed1135 [BE][ROCm] Use modern C++ (#115844)
This removes global (but ROCM_ONLY) `over_arch` and `gcn_arch_override_flag` variables in favor of block-level static initialization introduced in C++11.

To quote from [ISO/IEC 14882-2014](https://www.open-std.org/jtc1/sc22/wg21/docs/standards)
>The zero-initialization (8.5) of all block-scope variables with static storage duration (3.7.1) or thread storage
> duration (3.7.2) is performed before any other initialization takes place. Constant initialization (3.6.2) of a
> block-scope entity with static storage duration, if applicable, is performed before its block is first entered.
> An implementation is permitted to perform early initialization of other block-scope variables with static or
> thread storage duration under the same conditions that an implementation is permitted to statically initialize
> a variable with static or thread storage duration in namespace scope (3.6.2). Otherwise such a variable is
> initialized the first time control passes through its declaration; such a variable is considered initialized upon
> the completion of its initialization. If the initialization exits by throwing an exception, the initialization
> is not complete, so it will be tried again the next time control enters the declaration. If control enters
> the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for
> completion of the initialization.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115844
Approved by: https://github.com/huydhn
2023-12-15 00:38:43 +00:00
7e6ec8d3db [ONNX] Add proper iobinding synchronize for ONNX cuda bench (#115773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115773
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #115670, #115673
2023-12-15 00:37:32 +00:00
823523acc0 [ONNX] Dump sarif diagnostics for failed onnx exports in benchmark (#115673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115673
Approved by: https://github.com/thiagocrepaldi
ghstack dependencies: #115670
2023-12-15 00:37:32 +00:00
0959e67de3 [ONNX] Set correct cuda.current_device for multi-device onnx performance bench (#115670)
Otherwise `torch.cuda.synchronize()` works on a different device from the one that
runs the PyTorch model, which leads to incorrect performance numbers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115670
Approved by: https://github.com/thiagocrepaldi
2023-12-15 00:37:32 +00:00
59f7355f86 Revert "[ROCm] add hipblaslt support (#114329)"
This reverts commit bb2bb8cca1c00e3f6e7025a62688d0cfcbfee144.

Reverted https://github.com/pytorch/pytorch/pull/114329 on behalf of https://github.com/atalman due to OSSCI oncall, trunk  tests are failing ([comment](https://github.com/pytorch/pytorch/pull/114329#issuecomment-1857003155))
2023-12-14 23:53:30 +00:00
66b04e3cb7 [nccl flight recorder] nullptr profiling name (#115851)
Sometimes the profiling name can be a nullptr, which
throws on conversion to std::string. This adds a check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115851
Approved by: https://github.com/wconstab
2023-12-14 23:40:54 +00:00
21b8127f1c [Inductor] Deduplicate grid wrapper statements for user defined triton kernels (#115849)
Noticed that on many MRS kernels the grid wrapper for autotuning is huge, with a bunch of duplicates, because num_warps and num_stages are not needed for grid calculation. Let's deduplicate these entries.

Previously, we would see a wrapper like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
    def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
        if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
        if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
2023-12-14 23:26:04 +00:00
194d57dae7 Add values backward support for sparse CSR, CSC, BSR, and BSC tensors (#115586)
Fixes https://github.com/pytorch/pytorch/issues/107286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115586
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-12-14 23:09:13 +00:00
49d826bcd3 [dtensor] update op db tests (#115722)
This PR updates the op db test xfails; we should see whether we can
enable this again in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115722
Approved by: https://github.com/XilunWu
2023-12-14 22:49:13 +00:00
ef6a0faf89 [export] Fix canonicalization. (#115830)
Summary: Add the missing layout argument branch.

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//sigmoid/inference/test_gpu:export_package_sparse_toy_test

Differential Revision: D52166501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115830
Approved by: https://github.com/angelayi
2023-12-14 22:48:26 +00:00
bb2bb8cca1 [ROCm] add hipblaslt support (#114329)
Disabled by default. Enable with env var DISABLE_ADDMM_HIP_LT=0. Tested on both ROCm 5.7 and 6.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114329
Approved by: https://github.com/malfet
2023-12-14 21:41:22 +00:00
04ef21f5dd [C10D] Make dumpDebuggingInfo share a mutex across PGs (#115803)
The mutex was originally added to avoid racing to dump debuginfo,
where a race in this case would result in a corrupted dump file.

The reason a mutex helps is that it forces all dump requests to be
serialized, so that an observer would either see an in-progress file, a
complete file, or no file.  Without a mutex, a fourth state is possible
(a file that has been written to by multiple threads and is invalid).

Because the mutex was a ProcessGroupNCCL class member, and each PG
instance has its own watchdog thread that can launch a dump, it was not
doing its job.  Making the mutex static shares it between instances of
the class and ensures serialization of dumps triggered by any PG.

(Note: dumps triggered by different PGs have the same, global contents
anyway; there is only one global flight recorder, so it doesn't matter
who triggers it.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115803
Approved by: https://github.com/kwen2501
ghstack dependencies: #115771, #115798, #115800, #115801
2023-12-14 21:17:44 +00:00
7ecddaef23 Revert "Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)"
This reverts commit adfbd2b219f4995d3f13870927022b67550f8b0e.

Reverted https://github.com/pytorch/pytorch/pull/114001 on behalf of https://github.com/atalman due to OSSCI oncall, breaks periodic jobs ([comment](https://github.com/pytorch/pytorch/pull/114001#issuecomment-1856539040))
2023-12-14 20:33:10 +00:00
67232199b1 [dynamo] Log shape_env_guard_count separately from guard_count (#115776)
guard_count counts all the shape_env guards as a single guard; log the shape_env_guard_count separately so those metrics can be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115776
Approved by: https://github.com/yanboliang
2023-12-14 20:12:49 +00:00
eqy
353f2dbd9c [CUDA] Fix V100 expected failures in test_mm_decomp and test_linalg (#115666)
BFloat16 isn't supported on sm70 and we get an unexpected cuBLAS success in 12.3+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115666
Approved by: https://github.com/malfet
2023-12-14 19:17:53 +00:00
28e37d4f3b Update Triton pin (#115743)
To include a cherry-pick of https://github.com/openai/triton/pull/2771 that should fix  cuda-11.8 runtime issues

Also, tweak the build wheel script to update both the ROCm and vanilla Triton build versions to 2.2 (even though on trunk it should probably be 3.3 already)

TODO: Remove `ROCM_TRITION_VERSION` once both trunk and ROCM version are in sync again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115743
Approved by: https://github.com/davidberard98
2023-12-14 18:54:24 +00:00
87547a26b8 [aotinductor] add no weight change version of fuse_parallel_linear (#115791)
Summary: We need a new version of fuse_parallel_linear w/o creating new weights for real-time update.

Reviewed By: khabinov

Differential Revision: D52128296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115791
Approved by: https://github.com/khabinov
2023-12-14 18:36:17 +00:00
ca4caf4eac Revert "[inductor] Do variance calculation in opmath type (#115181)"
This reverts commit 42390a097b987cd3384511c3df3747699f2281f4.

Reverted https://github.com/pytorch/pytorch/pull/115181 on behalf of https://github.com/atalman due to OSSCI oncall, broke periodic tests ([comment](https://github.com/pytorch/pytorch/pull/115181#issuecomment-1856360644))
2023-12-14 18:21:49 +00:00
0fe014bd8a [C10D] Change PGNCCL logs to prefix [PG {} Rank {}] (#115801)
Adds a PG {process group uid} prefix component to logs.

This is helpful in situations where there are multiple process groups,
and rank information by itself is confusing.  (For example, rank0 on PG1
may correspond to rank3 on PG0.  People may assume 'rank0' references
the global (PG0) world, but it may reference a sub-pg.  Prefacing the PG
helps clarify this.)

Does NOT change logs from inside WorkNCCL functions, since WorkNCCL
doesn't know what PG ID it corresponds to. Will address these logs
separately.

Example:

```
[I ProcessGroupNCCL.cpp:787] [PG 0 Rank 0] ProcessGroupNCCL initialization ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115801
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798, #115800
2023-12-14 18:17:16 +00:00
e94267587b [C10D] Refactor NCCL logs to use common prefix helper (#115800)
Put the repeated code that string formats [Rank {rank}] in one place.

Sets up for the next PR that also adds more info to this prefix.

(Does not change exception messages, which could be done as well.
Exception messages are not formatted quite the same way. This PR tries
instead to keep from changing log behavior and only
refactors code.)

Did limited testing (some logs were observed OK).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115800
Approved by: https://github.com/fduwjj
ghstack dependencies: #115771, #115798
2023-12-14 18:13:24 +00:00
eb6e70cf66 [C10D] Only open NCCL dump pipe file once per process (#115798)
The NCCL flight recorder is per-process (it is shared by all
processgroups), but individual process groups used to construct their
own pipe for being signaled to dump the flight recorder.

This ensures that only one pipe per process is created, by only creating
the pipe on the first ProcessGroup (uid_ == 0) which should be the world
group.

Filenames are still keyed off of rank, but this should now be global
rank instead of sub-pg rank, making the filenames unique across the
whole trainer process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115798
Approved by: https://github.com/zdevito
ghstack dependencies: #115771
2023-12-14 17:48:26 +00:00
74d2b9dd15 [C10D] Make DumpPipe disabled when FlightRecorder disabled (#115771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115771
Approved by: https://github.com/fduwjj
2023-12-14 17:42:46 +00:00
b618869208 [inductor] label cpp test files with oncall: cpu inductor (#115167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115167
Approved by: https://github.com/atalman
2023-12-14 17:39:27 +00:00
c80e2d5bb2 [fbcode] consolidate usage of fp8 linears for inference models (#115808)
Summary:
ATT, this will use the implementation of D51812709 for fp8 linears.

Meanwhile, it also adds a use case of delay quantization.

Test Plan:
```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

```
CUDA_VISIBLE_DEVICES=7 buck run mode/opt  -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 -c fbcode.use_link_groups=false caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --local-model /home/xiaoruichao/test_models/463113248.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR --fp8-linear-quantization-type delay_quantization --disable-acc-tracer-aot-inductor
```

Reviewed By: tter1

Differential Revision: D51840344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115808
Approved by: https://github.com/ipiszy
2023-12-14 16:59:48 +00:00
5bddbed399 Initial Flash Attention support on ROCM (#114309)
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.

Known limitations (see the usage sketch after this list):

- [ ] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`.
- [ ] Only supports power of two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only support head dimension 16,32,64,128.
- [ ] Performance is still being optimized.
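
A hedged usage sketch that stays within the limits above (power-of-two sequence length, head dim in {16, 32, 64, 128}); it assumes an MI200-class GPU exposed through the "cuda" device and uses only existing public APIs:

```python
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the flash-attention backend.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```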

Fixes https://github.com/pytorch/pytorch/issues/112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309

Approved by: https://github.com/jeffdaily, https://github.com/malfet

---------

Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
2023-12-14 08:52:57 -08:00
ac60a70e06 Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-14 16:21:05 +00:00
f727bed2e6 [inductor] Updated upsample_bilinear2d decomposition (#104182)
Description:
- Updated upsample_bilinear2d decomposition
  - added uint8 dtype support
  - code improvements
- Added uint8 dtype tests

Perf considerations:
- There is a minor perf regression (speed-up ~0.7x) for uint8, align_corners=True cases when the output is smaller than or equal to (256, 256)
- For cases where the output is larger than (256, 256) and the input dtype is uint8, the nightly output is wrong, so IMO the large perf regression (speed-up around ~0.2x) should not be taken into account.
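
A hedged sketch (not from the PR) of the kind of call benchmarked below: a compiled uint8 bilinear downsample on CPU; shapes and options are illustrative.

```python
import torch
import torch.nn.functional as F

# Compile a bilinear resize; the updated decomposition handles uint8 inputs directly
resize = torch.compile(
    lambda t: F.interpolate(t, size=(256, 256), mode="bilinear", align_corners=False)
)

x = torch.randint(0, 256, (1, 3, 500, 400), dtype=torch.uint8)
out = resize(x)  # uint8 in, uint8 out
```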

## Perfs benchmarks

```
[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cpu --------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                    |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        565.212 (+-3.548)        |        1384.210 (+-10.798)         |           1230.996 (+-32.930)           |     0.889 (+-0.000)      |          566.253 (+-1.526)
      Input (1, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |        565.404 (+-1.614)        |         1491.649 (+-7.763)         |            2974.959 (+-6.006)           |     1.994 (+-0.000)      |          566.476 (+-1.742)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        270.761 (+-0.861)        |         1557.777 (+-4.699)         |            1080.919 (+-4.243)           |     0.694 (+-0.000)      |          269.829 (+-0.986)
      Input (1, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        270.960 (+-0.995)        |        1723.913 (+-12.433)         |            3191.938 (+-6.194)           |     1.852 (+-0.000)      |          269.962 (+-1.657)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |        1555.884 (+-5.169)       |         1178.753 (+-4.957)         |            1910.445 (+-5.988)           |     1.621 (+-0.000)      |          1560.804 (+-6.793)
      Input (1, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |        1651.193 (+-6.952)       |         1323.466 (+-6.059)         |            3374.842 (+-8.168)           |     2.550 (+-0.000)      |          1653.497 (+-8.018)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |        978.482 (+-10.183)       |         1383.768 (+-4.341)         |            2147.841 (+-6.581)           |     1.552 (+-0.000)      |          979.983 (+-1.499)
      Input (1, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        1074.472 (+-5.031)       |         1414.912 (+-5.754)         |           3590.968 (+-10.042)           |     2.538 (+-0.000)      |          1074.589 (+-3.948)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)       |        2168.703 (+-8.964)       |        5400.528 (+-26.628)         |           4777.299 (+-11.891)           |     0.885 (+-0.000)      |          2168.133 (+-7.667)
      Input (4, 3, 500, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)      |       2169.132 (+-12.618)       |        6583.866 (+-28.959)         |           11986.894 (+-45.838)          |     1.821 (+-0.000)      |         2174.488 (+-10.317)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)           |        992.808 (+-6.086)        |         5985.028 (+-9.532)         |            4334.158 (+-9.423)           |     0.724 (+-0.000)      |          989.604 (+-5.499)
      Input (4, 3, 500, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)          |        987.618 (+-6.350)        |        6963.044 (+-28.885)         |           15441.096 (+-55.324)          |     2.218 (+-0.000)      |          985.573 (+-5.159)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)     |       6695.557 (+-35.067)       |        4657.603 (+-14.220)         |           8058.708 (+-41.684)           |     1.730 (+-0.000)      |         6714.996 (+-38.626)
      Input (4, 3, 500, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)    |       7040.481 (+-39.486)       |        5445.704 (+-16.659)         |           13906.618 (+-53.298)          |     2.554 (+-0.000)      |         7034.453 (+-44.626)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (256, 256)         |       3926.186 (+-10.660)       |        5741.433 (+-12.748)         |           9356.036 (+-40.848)           |     1.630 (+-0.000)      |         3930.598 (+-17.086)
      Input (4, 3, 500, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (256, 256)        |        4308.536 (+-9.607)       |        6122.755 (+-47.278)         |           15637.567 (+-54.392)          |     2.554 (+-0.000)      |         4307.463 (+-11.268)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       2512.740 (+-10.860)       |         1573.590 (+-5.061)         |            451.355 (+-1.210)            |     0.287 (+-0.000)      |         2511.727 (+-10.930)
      Input (1, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       2489.926 (+-11.915)       |         1537.233 (+-4.212)         |            2501.470 (+-7.446)           |     1.627 (+-0.000)      |         2500.000 (+-12.155)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |        632.032 (+-2.108)        |         1496.994 (+-4.194)         |            404.759 (+-1.064)            |     0.270 (+-0.000)      |          630.122 (+-4.086)
      Input (1, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |        629.174 (+-4.386)        |         1708.935 (+-8.817)         |            2643.296 (+-9.723)           |     1.547 (+-0.000)      |          628.388 (+-1.326)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |        4409.941 (+-8.016)       |         1160.133 (+-4.698)         |            1897.089 (+-9.392)           |     1.635 (+-0.000)      |         4450.959 (+-10.438)
      Input (1, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       4493.427 (+-11.703)       |         1329.226 (+-4.740)         |           2835.872 (+-12.241)           |     2.133 (+-0.000)      |          4506.973 (+-9.914)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        901.712 (+-4.071)        |         1320.739 (+-5.197)         |            2207.605 (+-8.219)           |     1.671 (+-0.000)      |          904.757 (+-4.558)
      Input (1, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        990.080 (+-3.922)        |         1702.563 (+-7.909)         |           3074.196 (+-10.478)           |     1.806 (+-0.000)      |          990.482 (+-4.444)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)     |       9785.550 (+-58.445)       |        6135.680 (+-33.569)         |           1628.572 (+-19.770)           |     0.265 (+-0.000)      |         9893.606 (+-62.377)
      Input (4, 3, 1200, 1300), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)    |       9710.191 (+-57.597)       |        6066.824 (+-36.364)         |           10469.110 (+-42.775)          |     1.726 (+-0.000)      |         9919.022 (+-72.190)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)         |       2790.356 (+-12.188)       |        6134.101 (+-28.694)         |            1576.832 (+-6.030)           |     0.257 (+-0.000)      |         2761.122 (+-11.503)
      Input (4, 3, 1200, 1300), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)        |       2778.711 (+-13.603)       |        6608.528 (+-37.776)         |           10841.549 (+-49.429)          |     1.641 (+-0.000)      |         2753.037 (+-10.995)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)   |      45533.868 (+-102.618)      |         4962.994 (+-8.215)         |           9003.968 (+-38.179)           |     1.814 (+-0.000)      |        43531.261 (+-102.951)
      Input (4, 3, 1200, 1300), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)  |       45932.699 (+-81.207)      |        5595.682 (+-11.482)         |           12302.907 (+-50.254)          |     2.199 (+-0.000)      |         43916.455 (+-80.468)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (200, 300)       |        3827.804 (+-8.057)       |        6311.580 (+-25.021)         |           11760.614 (+-51.531)          |     1.863 (+-0.000)      |         3849.959 (+-10.848)
      Input (4, 3, 1200, 1300), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (200, 300)      |        4169.007 (+-8.452)       |        6820.716 (+-35.310)         |           15264.633 (+-49.982)          |     2.238 (+-0.000)      |         4183.875 (+-19.104)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |        1306.914 (+-7.470)       |        10598.101 (+-38.410)        |           2678.031 (+-11.051)           |     0.253 (+-0.000)      |          1307.470 (+-8.519)
      Input (1, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |        1307.268 (+-8.197)       |        10161.123 (+-45.643)        |           17148.842 (+-55.402)          |     1.688 (+-0.000)      |          1308.077 (+-8.553)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        548.574 (+-2.157)        |        10072.806 (+-41.368)        |            2408.971 (+-6.997)           |     0.239 (+-0.000)      |          547.726 (+-1.721)
      Input (1, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |        546.664 (+-1.484)        |        11123.694 (+-43.636)        |           18058.070 (+-48.552)          |     1.623 (+-0.000)      |          547.151 (+-1.627)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       7935.051 (+-71.022)       |        7654.533 (+-29.512)         |           12414.194 (+-87.450)          |     1.622 (+-0.000)      |         7900.056 (+-53.997)
      Input (1, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |       8546.732 (+-53.118)       |        8583.572 (+-35.656)         |          19111.824 (+-166.978)          |     2.227 (+-0.000)      |         8515.433 (+-63.300)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       6202.642 (+-34.355)       |        8915.622 (+-62.293)         |           14327.295 (+-52.188)          |     1.607 (+-0.000)      |         6213.329 (+-39.740)
      Input (1, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       6811.128 (+-33.747)       |        9647.316 (+-50.837)         |           20830.594 (+-62.979)          |     2.159 (+-0.000)      |         6822.512 (+-37.092)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)       |       5079.586 (+-19.067)       |        42238.442 (+-87.643)        |           11282.141 (+-42.477)          |     0.267 (+-0.000)      |         5104.234 (+-17.706)
      Input (4, 3, 300, 400), torch.uint8, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)      |       5079.575 (+-16.306)       |        41512.995 (+-83.710)        |          68789.816 (+-440.001)          |     1.657 (+-0.000)      |         5097.446 (+-21.724)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)           |        2039.974 (+-8.614)       |       42322.773 (+-111.866)        |           10399.237 (+-43.140)          |     0.246 (+-0.000)      |         2043.808 (+-10.707)
      Input (4, 3, 300, 400), torch.uint8, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)          |       2036.214 (+-10.083)       |        44353.281 (+-71.548)        |          73340.412 (+-324.780)          |     1.654 (+-0.000)      |          2039.000 (+-9.554)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)     |       33821.523 (+-96.639)      |        30552.094 (+-65.023)        |          49494.486 (+-872.916)          |     1.620 (+-0.000)      |         33844.404 (+-92.466)
      Input (4, 3, 300, 400), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)    |      36196.104 (+-128.169)      |        34038.432 (+-79.697)        |          75761.226 (+-905.194)          |     2.226 (+-0.000)      |         36260.473 (+-94.642)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (600, 700)         |       24827.821 (+-77.335)      |        37006.218 (+-86.318)        |          61297.625 (+-898.192)          |     1.656 (+-0.000)      |         24823.275 (+-80.945)
      Input (4, 3, 300, 400), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (600, 700)        |       27266.138 (+-70.262)      |        40109.475 (+-94.248)        |          92086.075 (+-404.922)          |     2.296 (+-0.000)      |         27287.992 (+-89.507)

Times are in microseconds (us).

[--------------------------------------------------------------------------------------------------------------------------------------------------------- Interpolate, cuda ---------------------------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                                                                      |  Eager (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitafcfdb1) PR  |  Compiled (2.3.0a0+gitde89a53) Nightly  |  speed-up PR vs Nightly  |  Eager (2.3.0a0+gitde89a53) Nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |         98.259 (+-0.014)        |          97.156 (+-0.008)          |             97.443 (+-0.031)            |     1.003 (+-0.000)      |           98.248 (+-0.021)
      Input (1, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |         97.048 (+-0.016)        |          97.480 (+-0.018)          |             96.819 (+-0.126)            |     0.993 (+-0.000)      |           97.045 (+-0.015)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |         97.944 (+-0.028)        |          91.686 (+-0.411)          |             93.894 (+-1.011)            |     1.024 (+-0.000)      |           97.933 (+-0.008)
      Input (1, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |         98.008 (+-0.011)        |          91.205 (+-0.346)          |             96.854 (+-0.058)            |     1.062 (+-0.000)      |           97.203 (+-0.010)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)   |        384.318 (+-0.011)        |         382.793 (+-0.007)          |            382.472 (+-0.011)            |     0.999 (+-0.000)      |          384.701 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)  |        384.266 (+-0.009)        |         385.333 (+-0.024)          |            382.554 (+-0.022)            |     0.993 (+-0.000)      |          384.386 (+-0.016)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (1234, 1345)       |        383.924 (+-0.011)        |         570.071 (+-0.030)          |            545.615 (+-0.051)            |     0.957 (+-0.000)      |          384.044 (+-0.012)
      Input (4, 3, 2345, 2456), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (1234, 1345)      |        384.184 (+-0.016)        |         560.857 (+-0.026)          |            552.447 (+-0.040)            |     0.985 (+-0.000)      |          384.063 (+-0.016)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        122.188 (+-0.053)        |         116.744 (+-1.006)          |            163.762 (+-0.015)            |     1.403 (+-0.000)      |          121.874 (+-0.015)
      Input (1, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        122.156 (+-0.012)        |         182.692 (+-0.013)          |            161.653 (+-0.018)            |     0.885 (+-0.000)      |          121.926 (+-0.014)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        105.852 (+-0.324)        |         119.545 (+-0.294)          |            190.527 (+-0.023)            |     1.594 (+-0.000)      |          105.999 (+-0.446)
      Input (1, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        106.507 (+-0.282)        |         120.060 (+-0.257)          |            162.330 (+-0.012)            |     1.352 (+-0.000)      |          106.567 (+-0.385)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)   |        447.907 (+-0.015)        |         463.863 (+-1.779)          |            650.492 (+-0.331)            |     1.402 (+-0.000)      |          446.596 (+-0.017)
      Input (4, 3, 1234, 1345), torch.float32, torch.contiguous_format | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)  |        447.750 (+-0.017)        |         723.832 (+-0.170)          |            641.539 (+-0.075)            |     0.886 (+-0.000)      |          446.467 (+-0.019)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: True, antialias: False, osize: (2345, 2456)       |        439.549 (+-0.031)        |         507.772 (+-2.879)          |            758.795 (+-0.482)            |     1.494 (+-0.000)      |          440.372 (+-0.025)
      Input (4, 3, 1234, 1345), torch.float32, torch.channels_last | mode: bilinear, align_corners: False, antialias: False, osize: (2345, 2456)      |        439.538 (+-0.029)        |         509.260 (+-2.704)          |            654.195 (+-2.621)            |     1.285 (+-0.000)      |          440.362 (+-0.026)

Times are in microseconds (us).
```

[Source](f4751a3196/perf_interp_mode.py), [Output](899f34c024/output/20231213-214209-upsample-bilinear-pr_vs_nightly-speedup.md)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104182
Approved by: https://github.com/lezcano
2023-12-14 14:50:06 +00:00
28e4004286 Add doc for torch.distributed.breakpoint (#115656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115656
Approved by: https://github.com/wanchaol, https://github.com/fegin
ghstack dependencies: #115705
2023-12-14 14:45:36 +00:00
cyy
fcb95bf31b [2/N] Use std::in_place (#115480)
Remove c10/util/in_place.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115480
Approved by: https://github.com/soulitzer
2023-12-14 12:54:22 +00:00
6500ccebd7 enable fp16 autocast for dynamo benchmark (#114088)
`--amp` enables the amp path for `CUDA` (default amp_dtype will be float16) and `CPU` (default amp_dtype will be bfloat16).

If users set `--amp_dtype`, the user-specified amp_dtype takes the highest priority.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-12-14 12:38:44 +00:00
afe6d272c6 Fix buck OSS build after #115570 (#115804)
From #115570, `supports_shlib_interfaces` is only available in buck2 (https://buck2.build/docs/api/rules/), not in buck (https://buck.build/rule/cxx_library.html). The best long-term fix is probably to migrate OSS CI to buck2, so this is a temporary workaround; the fix from #115570 is only needed internally anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115804
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-12-14 08:33:07 +00:00
adfbd2b219 Introduce 3 low-latency, intra-node allreduce algorithms for small messages to PyTorch (#114001)
## Summary
This PR added 3 intra-node GPU allreduce algorithms to PyTorch:
- One-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate data from other ranks.
- Two-shot allreduce (inspired by FasterTransformer): all ranks simultaneously read and accumulate `1 / world_size` data from other ranks. Then all ranks read accumulated data from other ranks. (effectively one-shot reduce-scatter + one-shot all-gather).
- Hybrid cube mesh allreduce (original): a one-shot allreduce variant that avoids transmission over PCIe on HCM topology.

## Micro Benchmarks
![image](https://github.com/pytorch/pytorch/assets/4156752/7bd25ffc-cd5b-4acb-bd65-b01bc136726e)

![image](https://github.com/pytorch/pytorch/assets/4156752/3ced31b4-6c31-4f34-a2d8-c072df29ae0e)

![image](https://github.com/pytorch/pytorch/assets/4156752/5b942c05-4fcc-4ec9-ae29-12c64080bb1c)

## Details
The intra-node algos are organized behind `c10d::IntraNodeComm`, which is responsible for:
- Managing handshaking and cuda IPC handle exchange among ranks.
- Querying NVLink connection and detecting topology.
- Performing algo selection based on available info.
- Launching the selected allreduce kernel.

`c10d::IntraNodeComm` is integrated into `c10d::ProcessGroupNCCL` as follows:
- When the `ENABLE_INTRA_NODE_COMM` environment variable is set, `c10d::ProcessGroupNCCL` initialize a `c10d::IntraNodeComm` for its ranks.
  - If the setup is not suitable for intra-node comm (e.g. not all ranks are from the same node), the rendezvous logic guarantees all participants fall back consistently.
- `c10d::ProcessGroupNCCL::allreduce` consults `c10d::IntraNodeComm` whether to use intra-node allreduce and carries out the communication accordingly.

We currently detect two types of topologies from the NVLink connection mesh:
- Fully connected: all GPU pairs have a direct NVLink connection (e.g. NVSwitch or a fully connected subset of a hybrid cube mesh)
  - `msg <= 256KB`: one-shot allreduce.
  - `256KB < msg <= 10MB`: two-shot allreduce.
  - `msg > 10MB`: instructs the caller to fall back to NCCL.
- Hybrid cube mesh
  - `msg <= 256KB`: one-shot allreduce.
  - `msg > 256KB`: instructs the caller to fall back to NCCL.
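
A hedged launch sketch of the opt-in described above; the process-group setup and message size are illustrative.

```python
import os
import torch
import torch.distributed as dist

# Must be set before the NCCL process group is created
os.environ["ENABLE_INTRA_NODE_COMM"] = "1"

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

t = torch.ones(1024, device="cuda")  # small message, eligible for the one-shot path
dist.all_reduce(t)                   # ProcessGroupNCCL consults IntraNodeComm and may take it
```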

## Next Steps
- Fine tune algo selection based on GPU model, topology, link speed.
- Potentially optimize the two-shot allreduce impl. According to FasterTransformer, two-shot allreduce is preferred up to 50MB. There might be room for improvement, but PyTorch does impose more constraints:
  - FasterTransformer uses a single process to drive multiple devices. It can use `cudaDeviceEnablePeerAccess` to enable device-level peer access.
  - PyTorch uses multiple processes to drive multiple devices. With cuda IPC, a device can only share a specific region with other devices. This means extra copies may be unavoidable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114001
Approved by: https://github.com/yf225
2023-12-14 08:13:08 +00:00
36c6c0c7dc [pytree] expand tree_map to accept multi-inputs (#115642)
Fixes #115419
Fixes #91323
Closes #115549

- #115419
- #91323
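
A hedged sketch of the new multi-input form, assuming the signature follows `tree_map(fn, tree, *rests)` as in the linked issues:

```python
from torch.utils._pytree import tree_map

a = {"w": 1, "b": [2, 3]}
b = {"w": 10, "b": [20, 30]}

# Leaves of the extra trees are zipped with the first tree's leaves
print(tree_map(lambda x, y: x + y, a, b))  # {'w': 11, 'b': [22, 33]}
```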

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115642
Approved by: https://github.com/vmoens, https://github.com/zou3519
2023-12-14 06:16:42 +00:00
eqy
7e1542b938 [CUDA][FP8] Skip test_dtypes on FP8 _scaled_mm (#115661)
This test isn't actually parametrized by `dtype`, so it seems to surface bogus failures where "unsupported" types "work", but in reality fp8 is used every time.

CC @drisspg I'm guessing this doesn't surface in upstream CI because there are no SM9.0 runners yet?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115661
Approved by: https://github.com/drisspg
2023-12-14 05:12:33 +00:00
f5458f8f00 [C10D] Make DumpPipe pipe file configurable (#115770)
Add the TORCH_NCCL_DEBUG_INFO_PIPE_FILE env var, allowing the pipe file
location to be separate from the dump file location.

Defaults PIPE_FILE to empty, meaning disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115770
Approved by: https://github.com/zdevito
2023-12-14 03:54:43 +00:00
ef01e78fd9 disable test_ddp_profiling_autograd_profiler in distributed_test.py (#115704)
test was previously disabled in upstream: https://github.com/pytorch/pytorch/issues/77342, currently failing in NVIDIA internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115704
Approved by: https://github.com/soulitzer
2023-12-14 01:41:37 +00:00
722752fc28 Revert "Increased hardcoded limit for number of GPUs. (#115368)"
This reverts commit c039f01bd932d4f67d5b1d63ade8f1db11bfb72e.

Reverted https://github.com/pytorch/pytorch/pull/115368 on behalf of https://github.com/osalpekar due to This was reverted internally due to a release breakage ([comment](https://github.com/pytorch/pytorch/pull/115368#issuecomment-1854956224))
2023-12-14 01:28:01 +00:00
5e615f5f3a [BE] Use version.txt to determine version of nightly builds (#115794)
Fixes TODO from https://github.com/pytorch/pytorch/pull/33326
Test plan: check version generated by CI:
 - https://github.com/pytorch/pytorch/actions/runs/7202798334/job/19621620744?pr=115794#step:9:64
 - https://github.com/pytorch/pytorch/actions/runs/7202798329/job/19621639791?pr=115794#step:11:104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115794
Approved by: https://github.com/atalman
2023-12-14 01:09:51 +00:00
661c1cf2aa numerical mismatch fix for test_mem_efficient_attention_attn_mask_vs_math_ref_grads in test_transformers.py (#115707)
Adjust dropout_fudge_factor since the previous fudge factor was too small and led to a numerical mismatch in NVIDIA internal CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115707
Approved by: https://github.com/drisspg
2023-12-14 01:04:39 +00:00
ffc826bf10 [nccl-pg] Store PG global rank information in tracing logs (#115730)
Storing the list of global ranks associated with each PG allows us to correlate traces across different ranks.

Test Plan:

OSS CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115730
Approved by: https://github.com/fduwjj
2023-12-14 00:59:17 +00:00
b38e14c12a [Reland][HigherOrderOp] remove unused get_item in MapHigherOrder (#115758)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/115207

Test Plan: Modified existing tests.

Reviewed By: yanboliang

Differential Revision: D52045157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115758
Approved by: https://github.com/angelayi
2023-12-14 00:41:46 +00:00
626b7dc847 Revert "Migrated loss functions to ModuleInfos (#115584)"
This reverts commit f138b08d2e9c8d676f2a404e97d773f42132b0c7.

Reverted https://github.com/pytorch/pytorch/pull/115584 on behalf of https://github.com/atalman due to OSS CI oncall, breaks slow test ([comment](https://github.com/pytorch/pytorch/pull/115584#issuecomment-1854855080))
2023-12-13 23:34:30 +00:00
3fa3ed4923 Workaround to avoid MSVC std ambiguous symbol error (#115748)
I don't know what the correct fix is, but it appears that this is the known workaround: https://github.com/pytorch/pytorch/issues/18607

Failing windows build: https://hud.pytorch.org/pytorch/pytorch/pull/114897?sha=574a6f7cfe979f1bac62c6b0b51380ff67a31a09

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115748
Approved by: https://github.com/jbschlosser
ghstack dependencies: #114895, #115739
2023-12-13 23:22:52 +00:00
67ce57ff66 Add pragma once to headers (#115739)
This reverts commit 9b93c23b5e2d695c2fbd9c886cc0c8010edab717.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115739
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
ghstack dependencies: #114895
2023-12-13 23:22:52 +00:00
c7ae2c170f [inductor] Added non-integer expr support for floordiv in triton codegen (#115751)
Description:
- Added non-integer expr support for floordiv in triton codegen
- Added a test
  - cpp test is skipped as failing and https://github.com/pytorch/pytorch/pull/115647 may fix it

This PR is fixing compilation error with the following code:
```python
import torch

def func(x, a):
    n = (a * 1.234) // 8.234
    y = x + n
    return y

cfunc = torch.compile(func, dynamic=True, fullgraph=True)

device = "cuda"
x = torch.tensor(0, dtype=torch.float32, device=device)
a = 33

out = cfunc(x, a)
expected = func(x, a)
torch.testing.assert_close(out, expected)
```
Error message on Nightly:
```
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
CompilationError: at 7:38:def triton_(in_ptr0, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = ((1.23400000000000*ks0) // 8.23400000000000)
                                      ^
AssertionError()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115751
Approved by: https://github.com/peterbell10
2023-12-13 23:17:42 +00:00
3643548447 [Export] Support ser/des test on existing cases (#115413)
Summary:
Similar as #115399

Test Plan:
```
$ python test/export/test_serdes.py
...
Ran 72 tests in 29.097s

OK (expected failures=13)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115413
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115402
2023-12-13 23:17:12 +00:00
a34d56a64a [Export] Support retraceability test on existing cases (#115402)
Summary:
Similar as #115399

Test Plan:
python test/export/test_retraceability.py

    Ran 71 tests in 31.929s

    OK (expected failures=14)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115402
Approved by: https://github.com/tugsbayasgalan
2023-12-13 23:17:12 +00:00
43efe39cb1 [codemod][lowrisk] Remove extra semi colon from caffe2/caffe2/opt/optimizer.cc (#115018)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D51777924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115018
Approved by: https://github.com/Skylion007
2023-12-13 23:11:33 +00:00
ad76a4e1e7 [inductor] Allow sympy expressions to participate in type promotion (#115676)
In the test example we have `add(i64[10], sympy.Expr)` where
`sympy.Expr` is not considered a promoting arg, so it isn't factored into
the type promotion. However, in eager it would promote to float32.
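
A hedged repro in the same spirit (values and shapes are illustrative, not the PR's test):

```python
import torch

def f(x, a):
    # Under dynamic shapes, `a * 0.5` becomes a symbolic float expression
    return x + a * 0.5

cf = torch.compile(f, dynamic=True, fullgraph=True)
x = torch.arange(10, dtype=torch.int64)

# Eager promotes int64 + float to torch.float32; the compiled result should now match
print(f(x, 3).dtype, cf(x, 3).dtype)
```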

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115676
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699, #115700
2023-12-13 22:22:37 +00:00
869e52e3dd Support torch function user objects (#111765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111765
Approved by: https://github.com/jansel
2023-12-13 22:11:52 +00:00
81321baf5c [PyTorch] Remove ArrayRefTensor::dtype (#113578)
Knocks off a few nanoseconds from CPU inference due to not having to set this field; paths that would've needed it are expensive anyway.

Differential Revision: [D51182794](https://our.internmc.facebook.com/intern/diff/D51182794/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113578
Approved by: https://github.com/khabinov, https://github.com/Neilblaze
ghstack dependencies: #112800, #113577
2023-12-13 21:32:14 +00:00
8c57fde21f Let all_reduce_coalesced accept one tensor as well (#115650)
This diff introduces a change to the `all_reduce_coalesced` function in `distributed_c10d.py`. The function now accepts a single tensor as well as a list of tensors. This allows for more flexibility in the use of the function.

This is just syntax sugar so the compiler can use `all_reduce_coalesced` without worrying about converting the input to a list.
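
A hedged usage sketch; the process-group setup is illustrative.

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
t = torch.ones(8, device=f"cuda:{dist.get_rank()}")

dist.all_reduce_coalesced([t])  # the existing list form
dist.all_reduce_coalesced(t)    # now also accepted: a bare tensor, as syntax sugar
```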

Differential Revision: [D51433236](https://our.internmc.facebook.com/intern/diff/D51433236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115650
Approved by: https://github.com/wconstab
ghstack dependencies: #115523, #115302, #115648, #115649
2023-12-13 21:32:01 +00:00
b9af126908 [PyTorch] Add input numel assert for minimal arrayref interface (#113577)
We currently have no shape checking on CPU IIUC. Now we at least do numel checking for the minimal arrayref interface.

Differential Revision: [D51165703](https://our.internmc.facebook.com/intern/diff/D51165703/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113577
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #112800
2023-12-13 21:31:55 +00:00
db851b1bc9 [Dynamo][7/N] Wrap python modules under torch as regular PythonModuleVariable (#115724)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115724
Approved by: https://github.com/jansel
2023-12-13 21:23:14 +00:00
54d552e991 [funcol] Directly import DeviceMesh to avoid circular dependency (#115649)
This diff aims to directly import DeviceMesh from torch.distributed.device_mesh instead of importing it from dist._tensor. This is done to avoid a circular dependency issue. The code changes in each file of the diff are as follows:

- torch/distributed/_functional_collectives.py: import DeviceMesh from torch.distributed instead of dist._tensor.

Overall, this diff aims to improve the code by avoiding circular dependencies and improving the import statements.

==
The above summary is generated by LLM with minor manual fixes. The following summary is by me.

The original import causes some issues when compiling DDP with compiled_autograd. The root cause of compilation failure is not identified but it is good to fix the lazy initialization, which indirectly fixes the compilation issues for DDP.

Differential Revision: [D51857246](https://our.internmc.facebook.com/intern/diff/D51857246/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115649
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302, #115648
2023-12-13 20:44:58 +00:00
7388d40165 Make pytorch_qnnpack a shared library (#115570)
Summary:
This library contains global state, e.g. pytorch_qnnp_params. If we make
it a static library, different shared libraries linking that static
library can end up with their own copies of the global state, leading to
bugs. Make it a shared library instead, to avoid this issue.

Test Plan: buck2 test fbsource//fbandroid/javatests/com/facebook/playground/apps/fb4aplayground/scenarios/pytorchscenario:pytorchscenario -- --run-disabled --regex runBundledInputWithLocalAsset

Differential Revision: D51926024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115570
Approved by: https://github.com/malfet
2023-12-13 20:44:37 +00:00
c90fdb9ac0 Fix torch.distributed.breakpoint (#115705)
Switches from calling breakpoint() internally to using a subclass of
Pdb.
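
A hedged sketch of the intended use, assuming the `rank` argument shown in its docs:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")

# Rank 0 drops into the Pdb subclass; the other ranks wait on a barrier until it continues
torch.distributed.breakpoint(rank=0)
```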

Fixes #115685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115705
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-13 20:33:56 +00:00
8a8d0adc0b Fix torch.gradient check for spacing arg list length (#115686)
Fixes #114207
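
For context (not from the PR), a minimal legal call: `spacing` must supply one entry per differentiated dimension, and a mismatched length should now be rejected.

```python
import torch

y = torch.tensor([[1., 2., 4.], [3., 5., 9.]])

# Two differentiated dims -> two spacing values
dy, dx = torch.gradient(y, spacing=[2.0, 1.0], dim=(0, 1))

# torch.gradient(y, spacing=[2.0], dim=(0, 1))  # mismatched length is now an error
```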

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00
23bff71de4 [llvm][oncall] Fix build for llvm-18+ (#115652)
Summary:
https://reviews.llvm.org/D137838 moved Host.h and some other files under TargetParser.
https://github.com/llvm/llvm-project/pull/74261 Removed it from Support folder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115652
Approved by: https://github.com/davidberard98
2023-12-13 20:11:31 +00:00
4d8ad4fb82 Move SingletonSymNodeImpl from c10 to aten (#114895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114895
Approved by: https://github.com/jbschlosser
2023-12-13 20:01:18 +00:00
2a514f48d7 Add huggingface gpt2 fake tensor unit test for torch.onnx.dynamo_export (#115380)
open llama, dolly v2 and falcon are still broken regardless of `ExportedProgram`, so they were not moved from `test_fx_to_onnx.py` to `fx_to_onnx_onnxruntime.py`.

Dolly and falcon already have tracking issues, but a tracking issue was created for open llama: https://github.com/pytorch/pytorch/issues/115552

A tracking issue was created for `xfail_if_model_type_is_exportedprogram` and `xfail_if_model_type_is_not_exportedprogram` issues with unexpected success runs: https://github.com/pytorch/pytorch/issues/115747
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115380
Approved by: https://github.com/titaiwangms
2023-12-13 19:49:06 +00:00
suo
926236305f [sigmoid] fix for FX tracing unflattened modules (#115708)
Differential Revision: [D52095387](https://our.internmc.facebook.com/intern/diff/D52095387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115708
Approved by: https://github.com/zhxchen17
2023-12-13 19:43:46 +00:00
75d3bbaaa2 Fix cudagraph check message (#115664)
This error message is printed when CUDAGraph trees are used with multiple device indices.

However, the message seems to say the opposite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115664
Approved by: https://github.com/soulitzer
2023-12-13 18:44:43 +00:00
42390a097b [inductor] Do variance calculation in opmath type (#115181)
Fixes #114903

Previously, large split variance reductions stored the intermediates in float16
precision, which can lead to overflow since the intermediate result is
unnormalized.

In #114903 we see two different `num_split` decisions made based on the
hardware capabilities, one of which has large enough intermediates to cause
overflows.
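
A hedged repro in the spirit of #114903; the size is illustrative, chosen so the unnormalized fp16 intermediates would overflow unless accumulated in float32 (opmath).

```python
import torch

x = torch.full((1, 65536), 100.0, dtype=torch.float16, device="cuda")
cvar = torch.compile(lambda t: t.var(dim=-1))

# Both should be finite and close now that intermediates use the opmath type
print(cvar(x), x.float().var(dim=-1))
```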

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115181
Approved by: https://github.com/shunting314
2023-12-13 18:40:44 +00:00
95de4f5764 add sm80orlater check to test_sdpa (#115702)
test_sdpa and test_sdpa2 in test_aot_inductor.py use bfloat16, which is not supported on sm < 80, so skip the tests if sm < 80.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115702
Approved by: https://github.com/soulitzer
2023-12-13 18:21:32 +00:00
caddcf9de5 Fix lint error in aten/src/ATen/native/cuda/CUDALoops.cuh (#115616)
Fix lint error in `aten/src/ATen/native/cuda/CUDALoops.cuh`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115616
Approved by: https://github.com/soulitzer
2023-12-13 18:13:00 +00:00
afa62d6237 [nccl-pg] Pass group global rank information to NCCL PG (#114736)
We were only passing a subset of the group creation information to the
NCCL PG. Specifically, we were missing the information on which global
ranks belong to a particular PG.

This allows the NCCL PG to use this additional information for things
like better trace logging.

Test Plan:

OSS CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114736
Approved by: https://github.com/kwen2501
2023-12-13 18:02:51 +00:00
193f87857e [BC breaking] Remove check_sparse_nnz argument of gradcheck (#115658)
As in the title, per the deprecation plan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115658
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-12-13 17:34:30 +00:00
310f6ab11a [fsdp] Replace acc_grad hooking with register_post_accumulate_grad_hook on flat_param (#112184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112184
Approved by: https://github.com/albanD
ghstack dependencies: #115315
2023-12-13 16:24:44 +00:00
97888725c5 [Export] Test non-strict mode on existing test cases (#115399)
Summary:
The Dynamo test methodology provides a good example of how to apply various
treatments to the same set of test cases. A pitfall is the global config,
which could easily be modified somewhere. Here we change the behavior of
the export API by hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly
passed into the export API whether it is called with or without the strict arg.
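
A hedged sketch of the hijacking idea (helper names are illustrative, not the PR's):

```python
import torch.export

_real_export = torch.export.export

def _nonstrict_export(*args, **kwargs):
    kwargs["strict"] = False  # force non-strict regardless of what the test passed
    return _real_export(*args, **kwargs)

torch.export.export = _nonstrict_export  # patched for the duration of the non-strict suite
```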

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-13 16:01:06 +00:00
66a76516bf [ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660)
Related to #103973  #110532 #108404 #94891

**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198)
Kernel asserts are enabled by default for CUDA and disabled for ROCm.
However, it is somewhat broken, and kernel assert was still enabled for ROCm.

Disabling kernel asserts is also needed for users who do not have PCIe atomics support. These community users have verified that disabling kernel asserts on the PyTorch/ROCm platform fixed their PyTorch workflows, such as torch.sum scripts and stable-diffusion (see the related issues).

**Changes:**

This pull request serves the following purposes:
* Refactor and clean up the logic, making it simpler for ROCm to enable and disable kernel asserts
* Fix the bug that Kernel Asserts for ROCm was not disabled by default.

Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name aligns better with the `#define CUDA_KERNEL_ASSERT` function.
(3) With USE_ in front of the name, we can easily control it with an environment variable to turn this feature on and off during build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel assert for a ROCm build).
- Get rid of `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry over the CMake variable to C++

**Tests:**
(1) Build in default mode and verify that USE_ROCM_KERNEL_ASSERT is OFF (0) and that kernel assert is disabled:

```
python setup.py develop
```
Verify CMakeCache.txt has correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in both ROCm and CUDA builds, expecting different return codes.

```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from the unit test below to get around the limitation that this unit test is currently skipped for ROCm. (We will look into enabling this unit test in the future.)

```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```

Ran the following script, expecting r == 0 since CUDA_KERNEL_ASSERT is defined as nothing:
```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```

(2) Enable the kernel assert by building with USE_ROCM_KERNEL_ASSERT=1, or USE_ROCM_KERNEL_ASSERT=ON
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```

Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```

Run the assert test, and expect a return code not equal to 0.

```
>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp            :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

>>> r
-6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
2023-12-13 15:44:53 +00:00
fb80f05ee2 [inductor] Fix angle decomposition return type (#115700)
The current decomposition always returns float32 when the input isn't complex.
Instead, we should do proper type promotion.
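
For reference, the eager behavior the decomposition should now match (a hedged illustration):

```python
import torch

# The decomposition used to return float32 here; eager preserves the input's float dtype
print(torch.angle(torch.tensor([1.0], dtype=torch.float64)).dtype)  # torch.float64
```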

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115700
Approved by: https://github.com/lezcano
ghstack dependencies: #115677, #115699
2023-12-13 14:16:31 +00:00
9cdc80d581 [inductor] Fix torch.bernoulli decomposition return type (#115699)
Strangely enough, `torch.bernoulli` doesn't return a boolean and instead
it matches the output type of the inplace bernoulli.
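
For reference, the eager behavior being matched (a hedged illustration):

```python
import torch

p = torch.rand(4, dtype=torch.float64)
print(torch.bernoulli(p).dtype)  # torch.float64, not torch.bool
```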

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115699
Approved by: https://github.com/lezcano
ghstack dependencies: #115677
2023-12-13 14:16:31 +00:00
0e0dd8f985 [dynamo][BE] Move torchvision import inside of test_multi_import (#115677)
Currently this skip imports torchvision when the file is collected, so if your
torchvision install is broken the entire file fails at collection time. After
this change, only the test itself will fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115677
Approved by: https://github.com/lezcano
2023-12-13 14:16:31 +00:00
3807fc690f [OSSCI oncall] fix lint (#115737)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115737
Approved by: https://github.com/DanilBaibak
2023-12-13 14:15:26 +00:00
0870afb85c Revert "[Export] Test non-strict mode on existing test cases (#115399)"
This reverts commit 2411a92e9d9f90e2db3cde9190e1301bd02cb221.

Reverted https://github.com/pytorch/pytorch/pull/115399 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115399#issuecomment-1853869965))
2023-12-13 12:59:09 +00:00
bda6f02343 Revert "[Export] Support retraceability test on existing cases (#115402)"
This reverts commit b0c7dd47cdb8d17bbfd0ab2963b1afb908dab716.

Reverted https://github.com/pytorch/pytorch/pull/115402 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115402#issuecomment-1853864075))
2023-12-13 12:55:07 +00:00
3b87681ddc Revert "[Export] Support ser/des test on existing cases (#115413)"
This reverts commit 47443591631ebb80a84487bbdab3233e0077941d.

Reverted https://github.com/pytorch/pytorch/pull/115413 on behalf of https://github.com/atalman due to OSSCI oncall, broke CI tests ([comment](https://github.com/pytorch/pytorch/pull/115413#issuecomment-1853859443))
2023-12-13 12:51:34 +00:00
f9cf6ae889 [PyTorch] AOTI: add minimal arrayref interface (#112800)
This implements an optional alternate interface to the AOTI
generated DSO, intended to increase efficiency for models running on
CPU and requiring minimal overhead. See comment in config.py for more
explanation.

This took a while to get right (e.g., I initially required 1-D
MiniArrayRef<T> for the inputs, but found that multi-dimensional
ArrayRefTensor<T> ended up simplifying the implementation and allowed
test_aot_inductor.py to run) and is somewhat intricate, so I am
anticipating that review will require some back-and-forth.

Differential Revision: [D50699890](https://our.internmc.facebook.com/intern/diff/D50699890/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D50699890/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112800
Approved by: https://github.com/chenyang78
2023-12-13 12:06:35 +00:00
331128b444 [c10] signal_handler: atomically exchange the signal count to fix data race in ExecuteStepRecursive() (#115510)
Summary:
`CheckForSignals()` can be called from multiple threads concurrently, e.g. from within `ExecuteStepRecursive()`. This means that `my_sigint_count_` and `my_sighup_count_` can be written concurrently, causing data races.

To fix, use atomic exchange which writes the new value and returns the old value in one atomic operation.

Test Plan: Running TSAN tests that failed before and now pass

Differential Revision: D52018963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115510
Approved by: https://github.com/malfet
2023-12-13 12:06:06 +00:00
50db2aa70a [funcol][BE] Apply ufmt to _functional_collectives.py and turn on lintrunner for functional_collective (#115648)
No logic change, just formatting.

Differential Revision: [D51857236](https://our.internmc.facebook.com/intern/diff/D51857236/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115648
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #115523, #115302
2023-12-13 11:19:29 +00:00
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP, which can sometimes cause a circular import. Move it out of DCP to avoid the circular import.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00
1500379b6d [MPS] Enable torch.rand[n] for complex types (#115514)
Test plan:
```
% python -c "import torch;print(torch.rand(3, 3, dtype=torch.chalf, device='mps'))"
tensor([[0.4639+0.8350j, 0.0479+0.1650j, 0.2510+0.9551j],
        [0.4746+0.3984j, 0.1484+0.8242j, 0.0098+0.7129j],
        [0.7979+0.6162j, 0.7188+0.9580j, 0.5186+0.2559j]], device='mps:0',
       dtype=torch.complex32)
% python3 -c "import torch; x=torch.randn(1000000, dtype=torch.cfloat, device='mps'); print((x-x.mean()).abs().pow(2).div(x.numel()-1).sum().sqrt())"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115514
Approved by: https://github.com/lezcano
ghstack dependencies: #115512, #115513, #115554
2023-12-13 07:30:56 +00:00
4744359163 [Export] Support ser/des test on existing cases (#115413)
Summary:
Similar as #115399

Test Plan:
```
$ python test/export/test_serdes.py
...
Ran 72 tests in 29.097s

OK (expected failures=13)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115413
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115399, #115402
2023-12-13 06:01:17 +00:00
b0c7dd47cd [Export] Support retraceability test on existing cases (#115402)
Summary:
Similar as #115399

Test Plan:
python test/export/test_retraceability.py

FAILED (failures=6, errors=8, expected failures=7)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115402
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #115399
2023-12-13 06:01:17 +00:00
2411a92e9d [Export] Test non-strict mode on existing test cases (#115399)
Summary:
The Dynamo test methodology provides a good example of how to apply various
treatments to the same set of test cases. A pitfall is the global config,
which could easily be modified somewhere. Here we change the behavior of
the export API by hijacking it with self-defined code.

To support the non-strict test suite, `strict=False` is explicitly
passed into the export API whether it is called with or without the strict arg.

Test Plan:
python test/export/test_export_nonstrict.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115399
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-12-13 06:01:17 +00:00
dd42201cb8 [export] Preserve FQN in export_to_torch_ir (#115462)
AOTInductor currently relies on export_to_torch_ir to generate a graph, and passes it to Inductor to generate the .so. They would like the FQNs to be consistent so that they can easily find/update the weights in the .so.

Note that since export flattens all modules into a single computational graph, we change the FQNs in the original module by replacing all periods with underscores. For example, `foo.child1param`, which points to the parameter `child1param` of a submodule named `foo`, will be renamed to `foo_child1param` since we no longer have the submodule `foo`. This is done simply with `name.replace(".", "_")`.
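A small sketch of that renaming applied to a state dict (illustrative; `flatten_fqns` is a hypothetical helper, not part of the PR):

```python
def flatten_fqns(state_dict):
    # Replace the '.' separators in fully qualified names with '_' so the
    # flattened export graph can refer to each weight by a single identifier.
    return {name.replace(".", "_"): value for name, value in state_dict.items()}


# e.g. {"foo.child1param": tensor} becomes {"foo_child1param": tensor}
```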

Outputted AOTInductor c++ code: https://www.internalfb.com/phabricator/paste/view/P900120950?lines=377-355%2C354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115462
Approved by: https://github.com/tugsbayasgalan
2023-12-13 04:58:47 +00:00
0dad85b402 [Dynamo] Fix torch.tensor call with tuple (#115713)
Land #114383 on behalf of @ezyang since he is on recharge and this is a high-priority issue.
Fixes #114231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115713
Approved by: https://github.com/angelayi, https://github.com/voznesenskym
2023-12-13 04:08:12 +00:00
38101e349e [usdt][torch] Sample dispatch operator integration (#115593)
Summary:
By default the instruction at the USDT is a nop; when the tracepoint is attached (e.g. through bpftrace), the code inside the semaphore check is executed. There should therefore be no performance impact as long as the USDT is not attached from the tracepoint execution code itself; however, the semaphore check itself (`TORCH_SDT_IS_ENABLED`) incurs the cost of a `read_global_volatile` operation.

See https://github.com/dtrace4linux/linux/blob/master/doc/usdt.html for more info.

Test Plan:
```
buck2  build  mode/opt caffe2/torch/fb/observers:strobelight_observer_runner --show-full-output
```

```
/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/0bc8cf217a8cf352/caffe2/torch/fb/observers/__strobelight_observer_runner__/strobelight_observer_runner
```

```
sudo bpftrace -e 'usdt:/data/users/rihams/fbsource/buck-out/v2/gen/fbcode/6081734815403318/caffe2/torch/fb/observers/__strobelight_observer_runner__/strobelight_observer_runner:pytorch:operator_* { printf("%s --> %s\n", probe, str(arg0)); }' -v

usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_like
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::fill_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::fill_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::ones_like
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::mul
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::mul
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::add
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::add
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::detach
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::detach
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::randn
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::normal_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::normal_
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::randn
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::to
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::_to_copy
usdt:<path>strobelight_observer_runner:pytorch:operator_start --> aten::empty_strided
usdt:<path>strobelight_observer_runner:pytorch:operator_end --> aten::empty_strided

```

Differential Revision: D44636587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115593
Approved by: https://github.com/malfet
2023-12-13 02:41:48 +00:00
17c104ac18 [export] Do not copy state_dict in run_decomp (#115269)
Fixes https://github.com/pytorch/pytorch/issues/114628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115269
Approved by: https://github.com/thiagocrepaldi, https://github.com/ydwu4
2023-12-13 01:21:21 +00:00
99554112d3 [pytorch] add namespace for optTypeMetaToScalarType in codegen to avoid not declared when compile (#115623)
Fixes a compilation failure in some environments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115623
Approved by: https://github.com/albanD
2023-12-13 00:59:01 +00:00
1392843e7b [inductor] make sure bitcast input and target type have the same bitwidth (#115619)
This PR fixes #104791.

bitcast requires that the source and target have the same bitwidth.
Because the input tensor's dtype could be promoted, e.g. from float16 to
float, we have to cast the tensor back to its original source dtype before
invoking bitcast in such cases. After that, we also need to convert
the bit-casted tensor back to float to make sure we keep using higher-precision
values for the rest of the computation.
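As a rough Python-level analogy of that cast/bitcast/cast sequence (the actual fix lives in the Inductor C++ codegen; `Tensor.view(dtype)` stands in for the bitcast here):

```python
import torch

x = torch.randn(4, dtype=torch.float16)
promoted = x.to(torch.float32)  # dtype promotion inside the generated kernel

# An element-wise bitcast needs source and target dtypes of equal bitwidth,
# so cast back to the original float16 first, reinterpret the bits, and then
# return to float32 so the rest of the computation keeps the higher precision.
as_bits = promoted.to(torch.float16).view(torch.int16)
restored = as_bits.view(torch.float16).to(torch.float32)
```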

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115619
Approved by: https://github.com/jansel, https://github.com/eellison
2023-12-13 00:53:04 +00:00
469d6d45fe [BE] Bye bye, CircleCI (#115701)
In PyTorch, a change we now see,
CircleCI's gone, set it free.
With commits and a push,
No more waiting in hush,
For a simpler CI spree!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115701
Approved by: https://github.com/PaliC, https://github.com/suo, https://github.com/seemethere
2023-12-13 00:26:49 +00:00
76ced0df03 Consider storage_changed for assigning alias_of_input in aot_autograd when computing differentiable outputs that alias each other (#115315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115315
Approved by: https://github.com/bdhirsh
2023-12-12 23:21:58 +00:00
946de1cf4c [export][fix] Add back export strict argument (#115668)
Summary:
#115556 omitted the strict argument, which is necessary for non-strict mode
development.

Test Plan:
python test/export/test_export.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115668
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2023-12-12 22:59:10 +00:00
48ed165380 [FSDP][state_dict] Create a FSDP/EP unittest (#115567)
As title

Differential Revision: [D52043394](https://our.internmc.facebook.com/intern/diff/D52043394/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115567
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2023-12-12 22:48:11 +00:00
639060cb0b Use get_mkldnn_enabled for decompositions (#115448)
`torch._C.has_mkldnn` does not respect cases where users try to disable mkldnn using `torch._C._set_mkldnn_enabled()`. This is relevant to edge use cases, where users do not want decompositions to go to the ATen opset and do not want the mkldnn operator to appear in the graph.
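A hedged sketch of the distinction (`mkldnn_decomp_allowed` is an illustrative helper, not the function changed by this PR):

```python
import torch

print(torch._C.has_mkldnn)             # build-time capability; unaffected by the toggle

torch._C._set_mkldnn_enabled(False)    # the runtime switch users may flip
print(torch._C._get_mkldnn_enabled())  # False


def mkldnn_decomp_allowed() -> bool:
    # Respect the runtime switch, not just the build-time flag.
    return torch._C.has_mkldnn and torch._C._get_mkldnn_enabled()
```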
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115448
Approved by: https://github.com/jgong5, https://github.com/ydwu4
2023-12-12 22:42:51 +00:00
f78f23d753 [export] Turn off output value from sources for export. (#115442)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115442
Approved by: https://github.com/tugsbayasgalan
2023-12-12 22:41:23 +00:00
af09fe256a [Inductor] Implement a deduplist data structure for name to user tracking (#115609)
Summary:
An internal MRS model was taking over a day to compile due to many duplicates in dependency tracking. This PR replaces the list with a custom dedup list.
Normally one could use a set/dict for this purpose; however, the list in question gets elements appended while it is being iterated over, which means we need to keep list semantics.
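A minimal sketch of such a dedup list (an assumed shape; the class name and methods are illustrative, not the actual Inductor data structure):

```python
class DedupList:
    """Append-only list that skips duplicates in O(1) while keeping list
    semantics, so callers may append while iterating over it by index."""

    def __init__(self):
        self._items = []
        self._seen = set()

    def append(self, item):
        if item not in self._seen:
            self._seen.add(item)
            self._items.append(item)

    def __getitem__(self, idx):
        return self._items[idx]

    def __len__(self):
        return len(self._items)


users = DedupList()
users.append("buf0")
users.append("buf0")   # duplicate is dropped without scanning the list
assert len(users) == 1
```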

Test Plan: ad hoc testing

Differential Revision: D52060659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115609
Approved by: https://github.com/jansel
2023-12-12 22:28:30 +00:00
ffb2a28a67 Fixes expected behavior when no_dist=True in state_dict_loader.load (#115660)
Fixes expected behavior when `no_dist=True` in `state_dict_loader.load`

Fixes #115591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115660
Approved by: https://github.com/wz337, https://github.com/fegin
2023-12-12 22:21:16 +00:00
f138b08d2e Migrated loss functions to ModuleInfos (#115584)
Migrates most tests in `common_nn.py:criterion_tests` to ModuleInfos.

**I can split this up if it is too large to review**

What this PR does not include:
- [`no_batch_dim` tests](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L3995-L4112)
- [tests that use the functional variant of the loss function and `wrap_functional`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_nn.py#L1079-L1128)

#### On test times
This PR increases test time by ~58s locally
Before this PR:
```
>>> python test/test_nn.py -k Loss
Ran 1003 tests in 28.977s
```
After this PR
```
>>> python test/test_nn.py -k Loss
Ran 368 tests in 23.073s
```

```
>>> python test/test_modules.py -k Loss
Ran 836 tests in 63.900s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115584
Approved by: https://github.com/janeyx99
ghstack dependencies: #115617
2023-12-12 22:20:20 +00:00
1becd2c314 Align checks in _use_cudnn_ctc_loss with those in _cudnn_ctc_loss (#115617)
This PR is intended to fix the following problem:

When using `CTCLoss`, there is a cuDNN path gated by a call to `_use_cudnn_ctc_loss` (e918461377: aten/src/ATen/native/cudnn/LossCTC.cpp, L73-L101), which checks some conditions (e918461377: aten/src/ATen/native/LossCTC.cpp, L486-L496).

However, there are more checks in `_cudnn_ctc_loss` (e918461377: aten/src/ATen/native/cudnn/LossCTC.cpp, L122-L130), some of which are not present in `_use_cudnn_ctc_loss` (e.g. the check that `targets` is on CPU, which will cause a RuntimeError after dispatching to `_cudnn_ctc_loss`). Instead, these checks should be in `_use_cudnn_ctc_loss` so that the normal `_ctc_loss` path is used when the checks are not met.

e.g. Before this PR

```python
>>> import torch
>>> ctcloss = torch.nn.CTCLoss()
>>> log_probs = torch.randn((50, 3, 15), device='cuda').log_softmax(2)
>>> target = torch.randint(1, 15, (30 + 25 + 20,), dtype = torch.int)
>>> input_lengths = torch.tensor((50, 50, 50), device='cuda')
>>> target_lengths = torch.tensor((30, 25, 20), device='cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
tensor(4.1172, device='cuda:0')
>>> target = target.to('cuda')
>>> ctcloss(log_probs, target, input_lengths, target_lengths)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/mg1998/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/mg1998/pytorch/torch/nn/modules/loss.py", line 1779, in forward
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths, self.blank, self.reduction,
  File "/data/users/mg1998/pytorch/torch/nn/functional.py", line 2660, in ctc_loss
    return torch.ctc_loss(
RuntimeError: Expected tensor to have CPU Backend, but got tensor with CUDA Backend (while checking arguments for cudnn_ctc_loss)
```

After this PR the above snippet runs without error.
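A hypothetical Python rendering of the gating idea (the real check lives in the C++ dispatch code; the helper and the exact set of conditions below are illustrative only):

```python
import torch


def can_use_cudnn_ctc_loss(log_probs, targets, blank=0):
    # Mirror every condition the cuDNN kernel will later assert, so that
    # unsupported inputs fall back to the generic _ctc_loss path instead of
    # raising after dispatch.
    return (
        log_probs.is_cuda
        and targets.device.type == "cpu"   # cuDNN CTC expects targets on CPU
        and targets.dtype == torch.int32   # ... and as int32
        and blank == 0                     # cuDNN only supports blank index 0
    )
```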

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115617
Approved by: https://github.com/janeyx99
2023-12-12 22:20:20 +00:00
c3ed9f65a0 Revert "[8/n] Update XNNPACK Version Part 8 Everything Remaining to get it to work (#115587)"
This reverts commit a8dc9d8e353ddcf7db0247349a3acd0dd37fcc6f.

Reverted https://github.com/pytorch/pytorch/pull/115587 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115587#issuecomment-1852835898))
2023-12-12 21:28:09 +00:00
ac4f6beb00 [Dynamo] Make resume function name more explicit by adding lineno (#115608)
Add the line number to the resume function name for easy aggregation in the Scuba table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115608
Approved by: https://github.com/jansel, https://github.com/williamwen42
2023-12-12 21:08:41 +00:00
40ce9a4cfb [c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2023-12-12 20:52:43 +00:00
8a58af2a9f [Reland][HigherOrderOp] make MapHigherOrder create map_impl (#115561)
This is a reland of #115205, which gets reverted due to internal test failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115561
Approved by: https://github.com/angelayi
2023-12-12 20:45:01 +00:00
8739d1e3f9 Fix a fast mode gradcheck bug where specified eps argument is ignored when switching to slow mode (#115634)
As in the title.

The reproducer for the bug is as follows:
```python
>>> import torch
>>> dtype = torch.bfloat16
>>> D1 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> D2 = torch.tensor([[1, 2], [3, 4]], dtype=dtype, requires_grad=True)
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True)
```

<details>

```
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(0., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)

```
</details>

```
The max per-element difference (slow mode) is: 4.0.
```

```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```

<details>

```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)
```

</details>

```
The max per-element difference (slow mode) is: 4.0.
```

Notice that changing the `eps` value has no effect on the max per-element difference.

With this PR, increasing the `eps` value leads to sensible results in the numerical jacobian:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1)
```

<details>

```
<snip>
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(5., dtype=torch.bfloat16)
analytical:tensor(4.9062, dtype=torch.bfloat16)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[0.9375, 1.8750, 0.0000, 0.0000],
        [2.9688, 3.7500, 0.0000, 0.0000],
        [0.0000, 0.0000, 1.2500, 2.5000],
        [0.0000, 0.0000, 2.5000, 3.7500]], dtype=torch.bfloat16)
Analytical:
tensor([[1., 2., 0., 0.],
        [3., 4., 0., 0.],
        [0., 0., 1., 2.],
        [0., 0., 3., 4.]], dtype=torch.bfloat16)
```

</details>

```
The max per-element difference (slow mode) is: 0.5.
```

Finally:
```python
>>> torch.autograd.gradcheck(torch.mm, (D1, D2), fast_mode=True, eps=1e-1, atol=1)
True
```
This call would fail on the current main branch.
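For intuition, a small sketch (mine, not part of the PR) of why the default `eps` yields an all-zero numerical Jacobian in bfloat16: the finite-difference step underflows below bfloat16 resolution.

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]], dtype=torch.bfloat16)

default_eps = 1e-6  # gradcheck's default step size
usable_eps = 1e-1   # large enough to survive bfloat16 rounding

# bfloat16 carries ~3 significant decimal digits, so x + 1e-6 rounds back to x
print(torch.equal(x + default_eps, x))  # True  -> finite differences are all zero
print(torch.equal(x + usable_eps, x))   # False -> differences become meaningful
```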

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115634
Approved by: https://github.com/lezcano, https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #115536
2023-12-12 20:00:56 +00:00
75ab294eb5 Enable builtin tests for ONNX Export with ExportedProgram models (#114762)
Fixed by https://github.com/pytorch/pytorch/pull/113982
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114762
Approved by: https://github.com/BowenBao
2023-12-12 19:50:06 +00:00
d954ef208f [DCP][state_dict] DCP state_dict cannot correctly find FQN when the leaf module is wrapped by FSDP (#115592)
Summary: The original logic incorrectly assumes that there is at least one object name left when traversing the module tree. This does not hold when the leaf module is wrapped by FSDP.

Test Plan: CI

Differential Revision: D52049293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115592
Approved by: https://github.com/wz337
2023-12-12 19:22:23 +00:00
9252 changed files with 202122 additions and 89462 deletions

View File

@ -19,6 +19,7 @@ See `build.sh` for valid build environments (it's the giant switch).
* `ubuntu` -- Dockerfile for Ubuntu image for CPU build and test jobs
* `ubuntu-cuda` -- Dockerfile for Ubuntu image with CUDA support for nvidia-docker
* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support
* `ubuntu-xpu` -- Dockerfile for Ubuntu image with XPU support
## Usage

View File

@ -71,6 +71,8 @@ if [[ "$image" == *cuda* && "$UBUNTU_VERSION" != "22.04" ]]; then
DOCKERFILE="${OS}-cuda/Dockerfile"
elif [[ "$image" == *rocm* ]]; then
DOCKERFILE="${OS}-rocm/Dockerfile"
elif [[ "$image" == *xpu* ]]; then
DOCKERFILE="${OS}-xpu/Dockerfile"
elif [[ "$image" == *cuda*linter* ]]; then
# Use a separate Dockerfile for linter to keep a small image size
DOCKERFILE="linter-cuda/Dockerfile"
@ -202,7 +204,7 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.6
ROCM_VERSION=5.7
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
@ -213,11 +215,21 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.7
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
BASEKIT_VERSION=2024.0.0-49522
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
@ -265,6 +277,7 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
UNINSTALL_DILL=yes
;;
pytorch-linux-jammy-py3-clang12-executorch)
ANACONDA_PYTHON_VERSION=3.10
@ -284,6 +297,15 @@ case "$image" in
CUDA_VERSION=11.8
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
@ -337,7 +359,7 @@ if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
fi
# Build image
docker build \
DOCKER_BUILDKIT=1 docker build \
--no-cache \
--progress=plain \
--build-arg "BUILD_ENVIRONMENT=${image}" \
@ -374,6 +396,8 @@ docker build \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "BASEKIT_VERSION=${BASEKIT_VERSION}" \
--build-arg "ACL=${ACL:-}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -1 +1 @@
ca6322dcfc51b209a06b76d160bd95d81d58f15c
7f96f5a852ba452670255d28d59f1e6398141fbb

View File

@ -1 +1 @@
6c26faa159b79a42d7fa46cb66e2d21523351987
243e186efbf7fb93328dd6b34927a4e8c8f24395

View File

@ -1 +1 @@
dafe1459823b9549417ed95e9720f1b594fab329
0a22a91d04c2b4a029a69a198eac390089c3e891

View File

@ -1 +1 @@
bcad9dabe15021c53b6a88296e9d7a210044f108
989adb9a29496c22a36ef82ca69cad5dad536b9c

View File

@ -0,0 +1,16 @@
set -euo pipefail
readonly version=v23.08
readonly src_host=https://review.mlplatform.org/ml
readonly src_repo=ComputeLibrary
# Clone ACL
[[ ! -d ${src_repo} ]] && git clone ${src_host}/${src_repo}.git
cd ${src_repo}
git checkout $version
# Build with scons
scons -j8 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 \
os=linux arch=armv8a build=native multi_isa=1 \
fixed_format_kernels=1 openmp=1 cppthreads=0

View File

@ -61,6 +61,7 @@ install_ubuntu() {
${maybe_libiomp_dev} \
libyaml-dev \
libz-dev \
libjemalloc2 \
libjpeg-dev \
libasound2-dev \
libsndfile-dev \
@ -74,6 +75,7 @@ install_ubuntu() {
libtool \
vim \
unzip \
gpg-agent \
gdb
# Should resolve issues related to various apt package repository cert issues
@ -151,7 +153,7 @@ wget https://ossci-linux.s3.amazonaws.com/valgrind-${VALGRIND_VERSION}.tar.bz2
tar -xjf valgrind-${VALGRIND_VERSION}.tar.bz2
cd valgrind-${VALGRIND_VERSION}
./configure --prefix=/usr/local
make -j6
make -j$[$(nproc) - 2]
sudo make install
cd ../../
rm -rf valgrind_build

View File

@ -9,10 +9,19 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
if [[ $(uname -m) == "aarch64" ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
case "$MAJOR_PYTHON_VERSION" in
2)
CONDA_FILE="Miniconda2-latest-Linux-x86_64.sh"
3)
CONDA_FILE="Miniforge3-Linux-aarch64.sh"
;;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
else
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
;;
@ -21,6 +30,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
exit 1
;;
esac
fi
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -47,15 +57,39 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Uncomment the below when resolved to track the latest conda update
# as_jenkins conda update -y -n base conda
if [[ $(uname -m) == "aarch64" ]]; then
export SYSROOT_DEP="sysroot_linux-aarch64=2.17"
else
export SYSROOT_DEP="sysroot_linux-64=2.17"
fi
# Install correct Python version
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION"
# Also ensure sysroot is using a modern GLIBC to match system compilers
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y\
python="$ANACONDA_PYTHON_VERSION" \
${SYSROOT_DEP}
# libstdcxx from conda default channels are too old, we need GLIBCXX_3.4.30
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 -c conda-forge
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ]; then
conda_install numpy=1.23.5 ${CONDA_COMMON_DEPS}
if [[ $(uname -m) == "aarch64" ]]; then
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
conda_install numpy=1.24.4 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.26.2 ${CONDA_COMMON_DEPS}
fi
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
fi
fi
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
@ -89,14 +123,5 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
pip_install -r /opt/conda/requirements-docs.txt
fi
# HACK HACK HACK
# gcc-9 for ubuntu-18.04 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu
# Pulls llibstdc++6 13.1.0-8ubuntu1~18.04 which is too new for conda
# So remove libstdc++6.so.3.29 installed by https://anaconda.org/anaconda/libstdcxx-ng/files?version=11.2.0
# Same is true for gcc-12 from Ubuntu-22.04
if grep -e [12][82].04.[623] /etc/issue >/dev/null; then
rm /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/libstdc++.so.6
fi
popd
fi

View File

@ -2,8 +2,8 @@
if [[ ${CUDNN_VERSION} == 8 ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive"
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
@ -11,17 +11,14 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz
else
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
print "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/include/
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/include/* /usr/include/x86_64-linux-gnu/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/
cp -a ${CUDNN_NAME}/lib/* /usr/lib/x86_64-linux-gnu/
cd ..
popd
rm -rf tmp_cudnn
ldconfig
fi

View File

@ -0,0 +1,21 @@
#!/bin/bash
set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
fi
tar xf ${CUSPARSELT_NAME}.tar.xz
cp -a ${CUSPARSELT_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUSPARSELT_NAME}/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cusparselt
ldconfig

View File

@ -48,7 +48,6 @@ setup_executorch() {
install_flatc_from_source
pip_install .
build_executorch_runner "cmake"
# Make sure that all the newly generate files are owned by Jenkins
chown -R jenkins .

View File

@ -26,18 +26,19 @@ pip_install \
pytest-cov==4.0.0 \
pytest-subtests==0.10.0 \
tabulate==0.9.0 \
transformers==4.32.1
transformers==4.36.2
pip_install coloredlogs packaging
retry pip_install -i https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ --no-cache-dir --no-input ort-nightly==1.17.0.dev20231005006
pip_install -i https://test.pypi.org/simple/ onnx==1.15.0rc2
pip_install onnxscript==0.1.0.dev20231128 --no-deps
pip_install onnxruntime==1.17.0
pip_install onnx==1.15.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240315 --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2");' > "${IMPORT_SCRIPT_FILENAME}"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

View File

@ -9,7 +9,8 @@ tar xf "${OPENSSL}.tar.gz"
cd "${OPENSSL}"
./config --prefix=/opt/openssl -d '-Wl,--enable-new-dtags,-rpath,$(LIBRPATH)'
# NOTE: openssl install errors out when built with the -j option
make -j6; make install_sw
NPROC=$[$(nproc) - 2]
make -j${NPROC}; make install_sw
# Link the ssl libraries to the /usr/lib folder.
sudo ln -s /opt/openssl/lib/lib* /usr/lib
cd ..

View File

@ -2,55 +2,18 @@
set -ex
# This function installs protobuf 3.17
install_protobuf_317() {
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
pb_dir="/usr/temp_pb_install_dir"
mkdir -p $pb_dir
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
# On the nvidia/cuda:9-cudnn7-devel-centos7 image we need this symlink or
# else it will fail with
# g++: error: ./../lib64/crti.o: No such file or directory
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
# -j6 to balance memory usage and speed.
# naked `-j` seems to use too much memory.
pushd "$pb_dir" && ./configure && make -j6 && make -j6 check && sudo make -j6 install && sudo ldconfig
popd
rm -rf $pb_dir
}
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
install_ubuntu() {
# Ubuntu 14.04 has cmake 2.8.12 as the default option, so we will
# install cmake3 here and use cmake3.
apt-get update
if [[ "$UBUNTU_VERSION" == 14.04 ]]; then
apt-get install -y --no-install-recommends cmake3
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
install_protobuf_317
}
install_centos() {
install_protobuf_317
}
# Install base packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NRPOC} install && sudo ldconfig
popd
rm -rf $pb_dir

View File

@ -80,6 +80,14 @@ install_ubuntu() {
fi
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
@ -151,6 +159,14 @@ install_centos() {
fi
fi
# ROCm 6.0 had a regression where journal_mode was enabled on the kdb files resulting in permission errors at runtime
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.0) ]]; then
for kdb in /opt/rocm/share/miopen/db/*.kdb
do
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
fi
# Cleanup
yum clean all
rm -rf /var/cache/yum

View File

@ -7,7 +7,7 @@ git clone https://bitbucket.org/icl/magma.git
pushd magma
# Version 2.7.2 + ROCm related updates
git checkout 823531632140d0edcb7e77c3edc0e837421471c5
git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc

View File

@ -64,5 +64,6 @@ if [ -n "${CONDA_CMAKE}" ]; then
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
conda_reinstall cmake="${CMAKE_VERSION}"
conda_reinstall numpy="${NUMPY_VERSION}"
# Note that we install numpy with pip as conda might not have the version we want
pip_install --force-reinstall numpy=="${NUMPY_VERSION}"
fi

View File

@ -36,7 +36,12 @@ function install_ucc() {
git submodule update --init --recursive
./autogen.sh
./configure --prefix=$UCC_HOME --with-ucx=$UCX_HOME --with-cuda=$with_cuda
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
./configure --prefix=$UCC_HOME \
--with-ucx=$UCX_HOME \
--with-cuda=$with_cuda \
--with-nvcc-gencode="${NVCC_GENCODE}"
time make -j
sudo make install

View File

@ -0,0 +1,115 @@
#!/bin/bash
set -xe
# Intel® software for general purpose GPU capabilities.
# Refer to https://dgpu-docs.intel.com/releases/stable_647_21_20230714.html
# Intel® oneAPI Base Toolkit (version 2024.0.0) has been updated to include functional and security updates.
# Refer to https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
# Users should update to the latest version as it becomes available
function install_ubuntu() {
apt-get update -y
apt-get install -y gpg-agent wget
# Set up the repository. To do this, download the key to the system keyring
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/production/2328 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
apt-get update
# The xpu-smi packages
apt-get install -y flex bison xpu-smi
# Compute and Media Runtimes
apt-get install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel® oneAPI Base Toolkit
if [ -n "$BASEKIT_VERSION" ]; then
apt-get install intel-basekit=$BASEKIT_VERSION -y
else
apt-get install intel-basekit -y
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
function install_centos() {
dnf install -y 'dnf-command(config-manager)'
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/8.6/production/2328/unified/intel-gpu-8.6.repo
# To add the EPEL repository needed for DKMS
dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
# Create the YUM repository file in the /temp directory as a normal user
tee > /tmp/oneAPI.repo << EOF
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Move the newly created oneAPI.repo file to the YUM configuration directory /etc/yum.repos.d
mv /tmp/oneAPI.repo /etc/yum.repos.d
# The xpu-smi packages
dnf install -y flex bison xpu-smi
# Compute and Media Runtimes
dnf install -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc hwinfo clinfo
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel® oneAPI Base Toolkit
dnf install intel-basekit -y
# Cleanup
dnf clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac

View File

@ -15,7 +15,7 @@ click
#Pinned versions:
#test that import:
coremltools==5.0b5
coremltools==5.0b5 ; python_version < "3.12"
#Description: Apple framework for ML integration
#Pinned versions: 5.0b5
#test that import:
@ -25,6 +25,11 @@ coremltools==5.0b5
#Pinned versions:
#test that import:
dill==0.3.7
#Description: dill extends pickle with serializing and de-serializing for most built-ins
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
@ -47,6 +52,11 @@ junitparser==2.1.1
#Pinned versions: 2.1.1
#test that import:
lark==0.12.0
#Description: parser
#Pinned versions: 0.12.0
#test that import:
librosa>=0.6.2 ; python_version < "3.11"
#Description: A python package for music and audio analysis
#Pinned versions: >=0.6.2
@ -66,7 +76,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Description: A testing library that allows you to replace parts of your
#system under test with mock objects
#Pinned versions:
#test that import: test_module_init.py, test_modules.py, test_nn.py,
#test that import: test_modules.py, test_nn.py,
#test_testing.py
#MonkeyType # breaks pytorch-xla-linux-bionic-py3.7-clang8
@ -75,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.7.0
mypy==1.8.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.7.0
#Pinned versions: 1.8.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -137,9 +147,9 @@ optree==0.9.1
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.0.1
pillow==10.2.0
#Description: Python Imaging Library fork
#Pinned versions: 10.0.1
#Pinned versions: 10.2.0
#test that import:
protobuf==3.20.2
@ -162,11 +172,6 @@ pytest-xdist==3.3.1
#Pinned versions:
#test that import:
pytest-shard==0.1.2
#Description: plugin spliting up tests in pytest
#Pinned versions:
#test that import:
pytest-flakefinder==1.1.0
#Description: plugin for rerunning tests a fixed number of times in pytest
#Pinned versions: 1.1.0
@ -243,7 +248,8 @@ tb-nightly==2.13.0a20230426
#Pinned versions:
#test that import:
#typing-extensions
# needed by torchgen utils
typing-extensions
#Description: type hints for python
#Pinned versions:
#test that import:
@ -258,7 +264,8 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#Pinned versions:
#test that import:
lintrunner==0.10.7
#wheel not found on aarch64, and source build requires rust
lintrunner==0.10.7 ; platform_machine == "x86_64"
#Description: all about linters!
#Pinned versions: 0.10.7
#test that import:
@ -268,14 +275,14 @@ rockset==1.0.3
#Pinned versions: 1.0.3
#test that import:
ghstack==0.7.1
ghstack==0.8.0
#Description: ghstack tool
#Pinned versions: 0.7.1
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.2
jinja2==3.1.3
#Description: jinja2 template engine
#Pinned versions: 3.1.2
#Pinned versions: 3.1.3
#test that import:
pytest-cpp==2.3.0
@ -293,8 +300,14 @@ tensorboard==2.13.0
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1
pywavelets==1.4.1 ; python_version < "3.12"
pywavelets==1.5.0 ; python_version >= "3.12"
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries

View File

@ -1 +1 @@
2.1.0
3.0.0

View File

@ -142,6 +142,12 @@ COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
ARG CUDA_VERSION
COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

View File

@ -0,0 +1,118 @@
ARG UBUNTU_VERSION
FROM ubuntu:${UBUNTU_VERSION}
ARG UBUNTU_VERSION
ENV DEBIAN_FRONTEND noninteractive
ARG CLANG_VERSION
# Install common dependencies (so that this step can be cached separately)
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install clang
ARG LLVMDEV
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ENV DOCS=$DOCS
COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
# Install gcc
ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install lcov for C++ code coverage
COPY ./common/install_lcov.sh install_lcov.sh
RUN bash ./install_lcov.sh && rm install_lcov.sh
COPY ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
ENV OPENSSL_DIR /opt/openssl
RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
# TODO: will add triton xpu commit
COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
# (optional) Install database packages like LMDB and LevelDB
ARG DB
COPY ./common/install_db.sh install_db.sh
RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
RUN rm install_db.sh
ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# Install XPU Dependencies
ARG BASEKIT_VERSION
COPY ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]

View File

@ -37,6 +37,7 @@ COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
RUN if [ -n "${UNINSTALL_DILL}" ]; then pip uninstall -y dill; fi
# Install gcc
ARG GCC_VERSION
@ -160,6 +161,13 @@ COPY ./common/install_onnx.sh ./common/common_utils.sh ./
RUN if [ -n "${ONNX}" ]; then bash ./install_onnx.sh; fi
RUN rm install_onnx.sh common_utils.sh
# (optional) Build ACL
ARG ACL
COPY ./common/install_acl.sh install_acl.sh
RUN if [ -n "${ACL}" ]; then bash ./install_acl.sh; fi
RUN rm install_acl.sh
ENV INSTALLED_ACL ${ACL}
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -28,6 +28,8 @@ echo "Environment variables:"
env
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
# Use jemalloc during compilation to mitigate https://github.com/pytorch/pytorch/issues/116289
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
echo "NVCC version:"
nvcc --version
fi
@ -80,6 +82,19 @@ if ! which conda; then
fi
else
export CMAKE_PREFIX_PATH=/opt/conda
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
export CMAKE_LIBRARY_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/"
export CMAKE_INCLUDE_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/include/"
fi
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export USE_MKLDNN=1
export USE_MKLDNN_ACL=1
export ACL_ROOT_DIR=/ComputeLibrary
fi
if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
@ -151,6 +166,12 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
python tools/amd_build/build_amd.py
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
export USE_XPU=1
fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then
@ -202,6 +223,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
sudo chown -R jenkins /var/lib/jenkins/workspace
git config --global --add safe.directory /var/lib/jenkins/workspace
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
@ -227,13 +252,17 @@ else
( ! get_exit_code python setup.py clean bad_argument )
if [[ "$BUILD_ENVIRONMENT" != *libtorch* ]]; then
# rocm builds fail when WERROR=1
# XLA test build fails when WERROR=1
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0 release candidate for builds
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0b1
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py bdist_wheel
@ -334,3 +363,5 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
fi
print_sccache_stats
sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace

View File

@ -158,6 +158,11 @@ function install_torchvision() {
fi
}
function install_tlparse() {
pip_install --user "tlparse==0.3.7"
PATH="$(python -m site --user-base)/bin:$PATH"
}
function install_torchrec_and_fbgemm() {
local torchrec_commit
torchrec_commit=$(get_pinned_commit torchrec)

View File

@ -9,7 +9,7 @@ sysctl -a | grep machdep.cpu
# These are required for both the build job and the test job.
# In the latter to test cpp extensions.
export MACOSX_DEPLOYMENT_TARGET=11.0
export MACOSX_DEPLOYMENT_TARGET=11.1
export CXX=clang++
export CC=clang

View File

@ -149,6 +149,8 @@ test_jit_hooks() {
assert_git_not_dirty
}
install_tlparse
if [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then

View File

@ -34,7 +34,6 @@ time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# DTensor tests
time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile
@ -46,9 +45,11 @@ time python test/run_test.py --verbose -i distributed/test_device_mesh
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state.py
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k optimizers_with_varying_tensors
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu
time python test/run_test.py --verbose -i test_optim -- -k test_mixed_device_dtype
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
assert_git_not_dirty

View File

@ -18,6 +18,10 @@ BUILD_DIR="build"
BUILD_RENAMED_DIR="build_renamed"
BUILD_BIN_DIR="$BUILD_DIR"/bin
#Set Default values for these variables in case they are not set
SHARD_NUMBER="${SHARD_NUMBER:=1}"
NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"
export VALGRIND=ON
# export TORCH_INDUCTOR_INSTALL_GXX=ON
if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
@ -124,6 +128,10 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* || "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# mainly used so that we're not spending extra cycles testing cpu
# devices on expensive gpu machines
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"
# setting PYTHON_TEST_EXTRA_OPTION
export PYTHON_TEST_EXTRA_OPTION="--xpu"
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then
@ -131,11 +139,22 @@ if [[ "$TEST_CONFIG" == *crossref* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# regression in ROCm 6.0 on MI50 CI runners due to hipblaslt; remove in 6.1
export VALGRIND=OFF
# Print GPU info
rocminfo
rocminfo | grep -E 'Name:.*\sgfx|Marketing'
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# Source Intel oneAPI environment script to enable xpu runtime related libraries
# refer to https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-0/use-the-setvars-and-oneapi-vars-scripts-with-linux.html
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
# JIT C++ extensions require ninja.
pip_install --user "ninja==1.10.2"
@ -144,6 +163,8 @@ if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
export PATH="$HOME/.local/bin:$PATH"
fi
install_tlparse
# DANGER WILL ROBINSON. The LD_PRELOAD here could cause you problems
# if you're not careful. Check this if you made some changes and the
# ASAN test is not working
@ -235,14 +256,14 @@ test_python_shard() {
# Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
test_python() {
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
@ -253,33 +274,13 @@ test_dynamo_shard() {
exit 1
fi
python tools/dynamo/verify_dynamo.py
# Temporarily disable test_fx for dynamo pending the investigation on TTS
# regression in https://github.com/pytorch/torchdynamo/issues/784
# PLEASE DO NOT ADD ADDITIONAL EXCLUDES HERE.
# Instead, use @skipIfTorchDynamo on your tests.
time python test/run_test.py --dynamo \
--exclude-inductor-tests \
--exclude-jit-executor \
--exclude-distributed-tests \
--exclude \
test_autograd \
test_jit \
test_proxy_tensor \
test_quantization \
test_public_bindings \
test_dataloader \
test_reductions \
test_namedtensor \
test_namedtuple_return_api \
profiler/test_profiler \
profiler/test_profiler_tree \
test_overrides \
test_python_dispatch \
test_fx \
test_package \
test_legacy_vmap \
test_custom_ops \
test_content_store \
export/test_db \
functorch/test_dims \
functorch/test_aotdispatch \
--exclude-torch-export-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
assert_git_not_dirty
@ -291,8 +292,18 @@ test_inductor_distributed() {
pytest test/inductor/test_torchinductor.py -k test_multi_gpu
pytest test/inductor/test_aot_inductor.py -k test_non_default_cuda_device
pytest test/inductor/test_aot_inductor.py -k test_replicate_on_devices
pytest test/distributed/test_c10d_functional_native.py
pytest test/distributed/_tensor/test_dtensor_compile.py
pytest test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
pytest test/distributed/_composable/fsdp/test_fully_shard_comm.py
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume
pytest test/distributed/_composable/fsdp/test_fully_shard_frozen.py
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
# with if required # gpus aren't available
@ -308,8 +319,18 @@ test_inductor() {
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aot_inductor
fi
}
test_inductor_cpp_wrapper_abi_compatible() {
export TORCHINDUCTOR_ABI_COMPATIBLE=1
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
TORCHINDUCTOR_STACK_ALLOCATION=0 python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -389,8 +410,8 @@ test_perf_for_dashboard() {
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs --cpp-wrapper "$@" \
TORCHINDUCTOR_CPP_WRAPPER=1 python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
@ -404,7 +425,7 @@ test_perf_for_dashboard() {
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_ABI_COMPATIBLE=1 python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_cuda_${target}.csv"
fi
@ -448,6 +469,11 @@ test_single_dynamo_benchmark() {
test_perf_for_dashboard "$suite" \
"${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"
else
if [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
# Test AOTInductor with the ABI-compatible mode on CI
# This can be removed once the ABI-compatible mode becomes default.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
@ -491,13 +517,20 @@ test_inductor_torchbench_smoketest_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
# smoke test the cpp_wrapper mode
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy --bfloat16 \
--inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
@ -518,6 +551,50 @@ test_inductor_torchbench_smoketest_perf() {
done
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
#set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
end_core=$(( CORES-1 ))
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
do
local model_name=${model_cfg[0]}
local data_type=${model_cfg[1]}
local speedup_target=${model_cfg[4]}
if [[ ${model_cfg[3]} == "cpp" ]]; then
export TORCHINDUCTOR_CPP_WRAPPER=1
else
unset TORCHINDUCTOR_CPP_WRAPPER
fi
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[2]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --backend=inductor --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --backend=inductor --output "$output_name"
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
done
}
test_python_gloo_with_tls() {
source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"
assert_git_not_dirty
@ -664,6 +741,19 @@ test_libtorch_api() {
fi
}
test_xpu_bin(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
for xpu_case in "${BUILD_BIN_DIR}"/*{xpu,sycl}*; do
if [[ "$xpu_case" != *"*"* && "$xpu_case" != *.so && "$xpu_case" != *.a ]]; then
case_name=$(basename "$xpu_case")
echo "Testing ${case_name} ..."
"$xpu_case" --gtest_output=xml:"$TEST_REPORTS_DIR"/"$case_name".xml
fi
done
}
test_aot_compilation() {
echo "Testing Ahead of Time compilation"
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
@ -904,7 +994,8 @@ test_bazel() {
tools/bazel test --config=cpu-only --test_timeout=480 --test_output=all --test_tag_filters=-gpu-required --test_filter=-*CUDA :all_tests
else
tools/bazel test --test_output=errors \
# Increase the test timeout to 480 like CPU tests because modules_test frequently timeout
tools/bazel test --test_timeout=480 --test_output=errors \
//:any_test \
//:autograd_test \
//:dataloader_test \
@ -999,14 +1090,17 @@ test_docs_test() {
}
test_executorch() {
echo "Install torchvision and torchaudio"
install_torchvision
install_torchaudio
pushd /executorch
echo "Install torchvision and torchaudio"
# TODO(huydhn): Switch this to the pinned commits on ExecuTorch once they are
# there. These libraries need to be built here, and not part of the Docker
# image because they require the target version of torch to be installed first
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git"
# NB: We need to build ExecuTorch runner here and not inside the Docker image
# because it depends on PyTorch
# shellcheck disable=SC1091
source .ci/scripts/utils.sh
build_executorch_runner "cmake"
echo "Run ExecuTorch regression tests for some models"
# NB: This is a sample model, more can be added here
@ -1075,6 +1169,11 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
else
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
@ -1084,24 +1183,29 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then
install_torchvision
test_inductor_cpp_wrapper_abi_compatible
elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
install_torchvision
test_inductor
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
test_dynamo_shard 1
test_aten
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_dynamo_shard 2
test_dynamo_shard "${SHARD_NUMBER}"
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
test_python_shard 1
test_aten
test_libtorch 1
if [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_xpu_bin
fi
elif [[ "${SHARD_NUMBER}" == 2 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_python_shard 2
@ -1126,6 +1230,11 @@ elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python
test_aten
elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
install_torchvision
test_python
test_aten
test_xpu_bin
else
install_torchvision
install_monkeytype


@ -16,11 +16,6 @@ set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocol
set INSTALLER_DIR=%SCRIPT_HELPERS_DIR%\installation-helpers
call %INSTALLER_DIR%\install_mkl.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
call %INSTALLER_DIR%\install_magma.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
@ -35,6 +30,10 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
call pip install mkl-include==2021.4.0 mkl-devel==2021.4.0
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
:: Override VS env here
pushd .
if "%VC_VERSION%" == "" (
@ -89,8 +88,8 @@ set SCCACHE_IGNORE_SERVER_IO_ERROR=1
sccache --stop-server
sccache --start-server
sccache --zero-stats
set CC=sccache-cl
set CXX=sccache-cl
set CMAKE_C_COMPILER_LAUNCHER=sccache
set CMAKE_CXX_COMPILER_LAUNCHER=sccache
set CMAKE_GENERATOR=Ninja


@ -1,14 +0,0 @@
if "%REBUILD%"=="" (
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/mkl_2020.2.254.7z --output %TMP_DIR_WIN%\mkl.7z
) else (
aws s3 cp s3://ossci-windows/mkl_2020.2.254.7z %TMP_DIR_WIN%\mkl.7z --quiet
)
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
7z x -aoa %TMP_DIR_WIN%\mkl.7z -o%TMP_DIR_WIN%\mkl
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
)
set CMAKE_INCLUDE_PATH=%TMP_DIR_WIN%\mkl\include
set LIB=%TMP_DIR_WIN%\mkl\lib;%LIB%


@ -1,18 +1,13 @@
mkdir %TMP_DIR_WIN%\bin
if "%REBUILD%"=="" (
:check_sccache
%TMP_DIR_WIN%\bin\sccache.exe --show-stats || (
IF EXIST %TMP_DIR_WIN%\bin\sccache.exe (
taskkill /im sccache.exe /f /t || ver > nul
del %TMP_DIR_WIN%\bin\sccache.exe || ver > nul
del %TMP_DIR_WIN%\bin\sccache-cl.exe || ver > nul
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache.exe --output %TMP_DIR_WIN%\bin\sccache.exe
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-cl.exe --output %TMP_DIR_WIN%\bin\sccache-cl.exe
) else (
aws s3 cp s3://ossci-windows/sccache.exe %TMP_DIR_WIN%\bin\sccache.exe
aws s3 cp s3://ossci-windows/sccache-cl.exe %TMP_DIR_WIN%\bin\sccache-cl.exe
)
goto :check_sccache
)
)
if "%BUILD_ENVIRONMENT%"=="" (
curl --retry 3 --retry-all-errors -k https://s3.amazonaws.com/ossci-windows/sccache-v0.7.4.exe --output %TMP_DIR_WIN%\bin\sccache.exe
) else (
aws s3 cp s3://ossci-windows/sccache-v0.7.4.exe %TMP_DIR_WIN%\bin\sccache.exe
)
)


@ -1,468 +1,4 @@
Warning
=======
Contents may be out of date. Our CircleCI workflows are gradually being migrated to GitHub Actions.
Structure of CI
===============
setup job:
1. Does a git checkout
2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why?
We don't always do a Git checkout on all subjobs, but we usually
still want to be able to call scripts one way or another in a subjob.
Persisting files this way lets us have access to them without doing a
checkout. This workspace is conventionally mounted on `~/workspace`
(this is distinguished from `~/project`, which is the conventional
working directory that CircleCI will default to starting your jobs
in.)
3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so
we can determine in subjobs if we should actually run the jobs or
not, even if there isn't a Git checkout.
CircleCI configuration generator
================================
One may no longer make changes to the `.circleci/config.yml` file directly.
Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory.
Usage
----------
1. Make changes to these scripts.
2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`.
You'll see a build failure on GitHub if the scripts don't agree with the checked-in version.
Motivation
----------
These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix.
The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content.
Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate
multiple parts of the file.
* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets
Also see https://github.com/pytorch/pytorch/issues/17038
Future direction
----------------
### Declaring sparse config subsets
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that taste as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
----------------
# How do the binaries / nightlies / releases work?
### What is a binary?
A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A **binary configuration** is a collection of
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists)
* macos: 3.7, 3.8
* windows: 3.7, 3.8
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
* operating system
* Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu
* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
* This only matters on Linux because only Linux uses gcc. tl;dr is that gcc made a backwards-incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string
### Where are the binaries?
The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to PyTorch releases, usually every few months.
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f \<a s3 url\>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies (the only supported option for Windows)
* static with dependencies
* shared without dependencies
* static without dependencies
All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release)
# CircleCI structure of the binaries
Some quick vocab:
* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments; environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
1. binary_builds
1. every day midnight EST
2. linux: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
3. macos: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
4. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. binary_linux_conda_3.7_cpu_build
1. Builds the package. On linux jobs this uses the 'docker executor'.
2. Persists the package to the workspace
2. binary_linux_conda_3.7_cpu_test
1. Loads the package to the workspace
2. Spins up a docker image (on Linux), mapping the package and code repos into the docker
3. Runs some smoke tests in the docker
4. (Actually, for macos this is a step rather than a separate job)
3. binary_linux_conda_3.7_cpu_upload
1. Logs in to aws/conda
2. Uploads the package
2. update_s3_htmls
1. every day 5am EST
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
3. See below for what these are for and why they're needed
4. Three jobs that each examine the current contents of aws and the conda repo and update some html files in s3
3. binarysmoketests
1. every day
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. smoke_linux_conda_3.7_cpu
1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
2. Runs the smoke tests
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* binary_linux_build.sh
* binary_linux_test.sh
* binary_linux_upload.sh
* MacOS jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
* binary_macos_build.sh
* binary_macos_test.sh
* binary_macos_upload.sh
* Update html jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/main/cron/update_s3_htmls.sh
* https://github.com/pytorch/builder/blob/main/cron/upload_binary_sizes.sh
* Smoke jobs (both linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/main/run_tests.sh
* https://github.com/pytorch/builder/blob/main/smoke_test.sh
* https://github.com/pytorch/builder/blob/main/check_binary.sh
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps. A rough sketch of this parsing is shown after this list.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
* binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image
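A hedged sketch of the kind of parsing binary_populate_env.sh is described as doing; the example value and output line below are made up, and the real script also derives dates, version strings, and s3 locations:
```sh
# Hypothetical sketch only, not the real binary_populate_env.sh.
BUILD_ENVIRONMENT="manywheel 3.7m cu113"   # made-up example value
read -r PACKAGE_TYPE DESIRED_PYTHON DESIRED_CUDA <<< "$BUILD_ENVIRONMENT"
echo "package=$PACKAGE_TYPE python=$DESIRED_PYTHON cuda=$DESIRED_CUDA"
```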
### **Why do the steps all refer to scripts?**
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
### **What is binary_run_in_docker for?**
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor in order for them to properly interface with GPUs since docker executors cannot execute with attached GPUs
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs
### **Why does binary_checkout also checkout pytorch? Why shouldn't it?**
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where.
# Code structure of the binaries (circleci agnostic)
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder), which is a repo that defines how all the binaries are built. The relevant code is
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/ # Folder that holds all circleci related stuff
- config.yml # GENERATED file that actually controls all circleci behavior
- verbatim-sources # Used to generate job/workflow sections in ^
- scripts/ # Code needed to prepare circleci environments for binary build scripts
- setup.py # Builds pytorch. This is wrapped in pytorch/builder
- cmake files # used in normal building of pytorch
# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder
# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh
# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh
# Folders that govern how packages are built. See paragraphs below
- conda/
- build_pytorch.sh # Entrypoint. Delegates to proper conda build folder
- switch_cuda_version.sh # Switches the active CUDA installation in Docker
- pytorch-nightly/ # Build-folder
- manywheel/
- build_cpu.sh # Entrypoint for cpu builds
- build.sh # Entrypoint for CUDA builds
- build_common.sh # Actual build script that ^^ call into
- wheel/
- build_wheel.sh # Entrypoint for wheel builds
- windows/
- build_pytorch.bat # Entrypoint for wheel builds on Windows
```
Every type of package has an entrypoint build script that handles all the important logic.
## Conda
Linux, MacOS and Windows use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in and what dependencies the resulting package should have, and the build script gets called in that env to build the thing.
tl;dr on conda-build is
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is.
2. Calls build.sh in the environment
3. Copies the finished package to a new conda env, also specified by the meta.yaml
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
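For orientation, a minimal sketch of the underlying conda-build invocation this flow corresponds to; the python version is an illustrative assumption, and the real entrypoint is `builder/conda/build_conda.sh` (described next):
```sh
# Minimal sketch: conda-build reads pytorch-nightly/meta.yaml, creates the
# build env, runs build.sh inside it, and saves the finished package tarball.
cd builder/conda             # assumes the pytorch/builder layout shown above
conda install -y conda-build
conda build pytorch-nightly/ --python 3.8
```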
The entrypoint file `builder/conda/build_conda.sh` is complicated because
* It works for Linux, MacOS and Windows
* The mac builds used to create their own environments, since they all used to be on the same machine. There's now a lot of extra logic to handle conda envs. This extra machinery could be removed
* It used to handle testing too, which adds more logic for messing with python environments. This extra machinery could be removed.
## Manywheels (linux pip and libtorch packages)
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
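A hedged sketch of that delegation pattern (variable and file names are illustrative, not the scripts' exact contents):
```sh
# Sketch only: a build_cpu.sh-style wrapper pins the variant via env vars,
# then hands everything else to the shared script.
export DESIRED_CUDA=cpu                    # a CUDA wrapper would export a cuXYZ value instead
source "$(dirname "$0")/build_common.sh"   # shared build logic lives here
```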
The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because
* This used to handle building for several different python versions at the same time. The loops have been removed, but there are still unnecessary folders and movements here and there.
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there's testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed.
## Wheels (MacOS pip and libtorch packages)
The entrypoint file `builder/wheel/build_wheel.sh` is complicated because
* The mac builds used to all run on one machine (we didn't have autoscaling mac machines until CircleCI). So this script handled siloing itself by setting up and tearing down its build env and siloing itself into its own build directory.
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* Ditto the comment above. This should definitely be separated out.
Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## Windows Wheels (Windows pip and libtorch packages)
The entrypoint file `builder/windows/build_pytorch.bat` is complicated because
* This used to handle building for several different python versions at the same time. This is why there are loops everywhere
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there's testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
Note that the Windows Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## General notes
### Note on run_tests.sh, smoke_test.sh, and check_binary.sh
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on main and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn't mess anything up. A minimal sketch of such a check is shown after this list.
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
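A minimal sketch of the kind of check these scripts boil down to (not their actual contents), assuming a pytorch package is already installed:
```sh
# Hypothetical smoke check: verify the installed package imports, reports
# sane metadata, and can run a trivial op.
python -c "import torch; print(torch.__version__)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; x = torch.rand(2, 3); print((x @ x.t()).shape)"
```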
### Note on libtorch
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this
* It's confusing. Most of those scripts deal with python specifics.
* The extra conditionals everywhere severely complicate the wheel build scripts
* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)
### Note on docker images / Dockerfiles
All linux builds occur in docker images. The docker images are
* pytorch/conda-cuda
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds (a sketch of this switch is shown after this list)
* Also used for cpu builds
* pytorch/manylinux-cuda90
* pytorch/manylinux-cuda100
* Also used for cpu builds
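A sketch of the kind of symlink switch switch_cuda_version.sh performs; the version below is illustrative and the real script does more:
```sh
# Hypothetical sketch: point /usr/local/cuda at the desired toolkit so builds
# that reference /usr/local/cuda pick up that CUDA version.
CUDA_VERSION=10.0
ln -sfn "/usr/local/cuda-${CUDA_VERSION}" /usr/local/cuda
ls -l /usr/local/cuda   # now resolves to /usr/local/cuda-10.0
```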
The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
### General Python
* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2
# How to manually rebuild the binaries
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force CircleCI to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
```sh
# Make your changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
# Regenerate the yaml, has to be in python 3.7
.circleci/regenerate.sh
# Make a commit
git add .circleci *
git commit -m "My real changes"
git push origin my_branch
# Now hardcode the jobs that you want in the .circleci/config.yml workflows section
# Also eliminate ensure-consistency and should_run_job checks
# e.g. https://github.com/pytorch/pytorch/commit/2b3344bfed8772fe86e5210cc4ee915dee42b32d
# Make a commit you won't keep
git add .circleci
git commit -m "[DO NOT LAND] testing binaries for above changes"
git push origin my_branch
# Now you need to make some changes to the first commit.
git rebase -i HEAD~2 # mark the first commit as 'edit'
# Make the changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
.circleci/regenerate.sh
# Amend the commit and continue the rebase
git add .circleci
git commit --amend
git rebase --continue
# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
## How to build a binary locally
### Linux
You can build Linux binaries locally easily using docker.
```sh
# Run the docker
# Use the correct docker image, pytorch/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
# machine that you're running the command on) accessible to the docker
# container at path/to/bar. So if you then run `touch path/to/bar/baz`
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you know how, add ccache as a volume too and speed up everything
docker run \
-v your/pytorch/repo:/pytorch \
-v your/builder/repo:/builder \
-v where/you/want/packages/to/appear:/final_pkgs \
-it pytorch/conda-cuda /bin/bash
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
**Building CUDA binaries on docker**
You can build CUDA binaries on CPU-only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it's going to take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
### MacOS
There's no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you're trying to repro an error on a Mac build in .circleci and you can't seem to repro locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I'd recommend
```sh
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do
# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
new_conda="$HOME/my_new_conda"
conda_sh="$HOME/install_miniconda.sh"
curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x "$conda_sh"
"$conda_sh" -b -p "$new_conda"
rm -f "$conda_sh"
export PATH="$new_conda/bin:$PATH"
# Create a clean python env
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=2.7
conda activate binary
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu
# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but tldr; is that
1. You make the conda command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
2. But if you forgot to install `foo` in `new_env` but happened to previously install it in your root conda env (called base), then unix/linux will still find `path/to/conda_root/bin/foo`. This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!
Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.
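A small illustration of that PATH layering; the paths are made up and `foo` stands for any python-installed executable:
```sh
# Hypothetical illustration of the resolution order described above.
export PATH="path/to/conda_root/bin:$PATH"                # conda itself on PATH
export PATH="path/to/conda_root/envs/new_env/bin:$PATH"   # activating new_env prepends its bin
command -v foo   # -> .../envs/new_env/bin/foo if foo is installed in new_env,
                 #    otherwise falls back silently to .../conda_root/bin/foo (base)
```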
### Windows
TODO: fill in
The PyTorch migration from CircleCI to GitHub Actions has been completed. All continuous integration & deployment workflows are defined in the `.github/workflows` folder.


@ -1,198 +0,0 @@
"""
This module models the tree of configuration variants
for "smoketest" builds.
Each subclass of ConfigNode represents a layer of the configuration hierarchy.
These tree nodes encapsulate the logic for whether a branch of the hierarchy
should be "pruned".
"""
from collections import OrderedDict
import cimodel.data.dimensions as dimensions
from cimodel.lib.conf_tree import ConfigNode
LINKING_DIMENSIONS = [
"shared",
"static",
]
DEPS_INCLUSION_DIMENSIONS = [
"with-deps",
"without-deps",
]
def get_processor_arch_name(gpu_version):
return (
"cpu"
if not gpu_version
else (
"cu" + gpu_version.strip("cuda")
if gpu_version.startswith("cuda")
else gpu_version
)
)
CONFIG_TREE_DATA = OrderedDict()
# GCC config variants:
#
# All the nightlies (except libtorch with new gcc ABI) are built with devtoolset7,
# which can only build with old gcc ABI. It is better than devtoolset3
# because it understands avx512, which is needed for good fbgemm performance.
#
# Libtorch with new gcc ABI is built with gcc 5.4 on Ubuntu 16.04.
LINUX_GCC_CONFIG_VARIANTS = OrderedDict(
manywheel=["devtoolset7"],
conda=["devtoolset7"],
libtorch=[
"devtoolset7",
"gcc5.4_cxx11-abi",
],
)
WINDOWS_LIBTORCH_CONFIG_VARIANTS = [
"debug",
"release",
]
class TopLevelNode(ConfigNode):
def __init__(self, node_name, config_tree_data, smoke):
super().__init__(None, node_name)
self.config_tree_data = config_tree_data
self.props["smoke"] = smoke
def get_children(self):
return [
OSConfigNode(self, x, c, p) for (x, (c, p)) in self.config_tree_data.items()
]
class OSConfigNode(ConfigNode):
def __init__(self, parent, os_name, gpu_versions, py_tree):
super().__init__(parent, os_name)
self.py_tree = py_tree
self.props["os_name"] = os_name
self.props["gpu_versions"] = gpu_versions
def get_children(self):
return [PackageFormatConfigNode(self, k, v) for k, v in self.py_tree.items()]
class PackageFormatConfigNode(ConfigNode):
def __init__(self, parent, package_format, python_versions):
super().__init__(parent, package_format)
self.props["python_versions"] = python_versions
self.props["package_format"] = package_format
def get_children(self):
if self.find_prop("os_name") == "linux":
return [
LinuxGccConfigNode(self, v)
for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]
]
elif (
self.find_prop("os_name") == "windows"
and self.find_prop("package_format") == "libtorch"
):
return [
WindowsLibtorchConfigNode(self, v)
for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS
]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class LinuxGccConfigNode(ConfigNode):
def __init__(self, parent, gcc_config_variant):
super().__init__(parent, "GCC_CONFIG_VARIANT=" + str(gcc_config_variant))
self.props["gcc_config_variant"] = gcc_config_variant
def get_children(self):
gpu_versions = self.find_prop("gpu_versions")
# XXX devtoolset7 on CUDA 9.0 is temporarily disabled
# see https://github.com/pytorch/pytorch/issues/20066
if self.find_prop("gcc_config_variant") == "devtoolset7":
gpu_versions = filter(lambda x: x != "cuda_90", gpu_versions)
# XXX disabling conda rocm build since docker images are not there
if self.find_prop("package_format") == "conda":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
# XXX libtorch rocm build is temporarily disabled
if self.find_prop("package_format") == "libtorch":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
return [ArchConfigNode(self, v) for v in gpu_versions]
class WindowsLibtorchConfigNode(ConfigNode):
def __init__(self, parent, libtorch_config_variant):
super().__init__(
parent, "LIBTORCH_CONFIG_VARIANT=" + str(libtorch_config_variant)
)
self.props["libtorch_config_variant"] = libtorch_config_variant
def get_children(self):
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
class ArchConfigNode(ConfigNode):
def __init__(self, parent, gpu):
super().__init__(parent, get_processor_arch_name(gpu))
self.props["gpu"] = gpu
def get_children(self):
return [PyVersionConfigNode(self, v) for v in self.find_prop("python_versions")]
class PyVersionConfigNode(ConfigNode):
def __init__(self, parent, pyver):
super().__init__(parent, pyver)
self.props["pyver"] = pyver
def get_children(self):
package_format = self.find_prop("package_format")
os_name = self.find_prop("os_name")
has_libtorch_variants = package_format == "libtorch" and os_name == "linux"
linking_variants = LINKING_DIMENSIONS if has_libtorch_variants else []
return [LinkingVariantConfigNode(self, v) for v in linking_variants]
class LinkingVariantConfigNode(ConfigNode):
def __init__(self, parent, linking_variant):
super().__init__(parent, linking_variant)
def get_children(self):
return [
DependencyInclusionConfigNode(self, v) for v in DEPS_INCLUSION_DIMENSIONS
]
class DependencyInclusionConfigNode(ConfigNode):
def __init__(self, parent, deps_variant):
super().__init__(parent, deps_variant)
self.props["libtorch_variant"] = "-".join(
[self.parent.get_label(), self.get_label()]
)


@ -1,275 +0,0 @@
from collections import OrderedDict
import cimodel.data.binary_build_data as binary_build_data
import cimodel.data.simple.util.branch_filters as branch_filters
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
class Conf:
def __init__(
self,
os,
gpu_version,
pydistro,
parms,
smoke,
libtorch_variant,
gcc_config_variant,
libtorch_config_variant,
):
self.os = os
self.gpu_version = gpu_version
self.pydistro = pydistro
self.parms = parms
self.smoke = smoke
self.libtorch_variant = libtorch_variant
self.gcc_config_variant = gcc_config_variant
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = (
[self.pydistro]
+ self.parms
+ [binary_build_data.get_processor_arch_name(self.gpu_version)]
)
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
elems.append(str(self.libtorch_config_variant))
return elems
def gen_docker_image(self):
if self.gcc_config_variant == "gcc5.4_cxx11-abi":
if self.gpu_version is None:
return miniutils.quote("pytorch/libtorch-cxx11-builder:cpu")
else:
return miniutils.quote(
f"pytorch/libtorch-cxx11-builder:{self.gpu_version}"
)
if self.pydistro == "conda":
if self.gpu_version is None:
return miniutils.quote("pytorch/conda-builder:cpu")
else:
return miniutils.quote(f"pytorch/conda-builder:{self.gpu_version}")
docker_word_substitution = {
"manywheel": "manylinux",
"libtorch": "manylinux",
}
docker_distro_prefix = miniutils.override(
self.pydistro, docker_word_substitution
)
# The cpu nightlies are built on the pytorch/manylinux-cuda102 docker image
# TODO cuda images should consolidate into tag-base images similar to rocm
alt_docker_suffix = (
"cuda102"
if not self.gpu_version
else (
"rocm:" + self.gpu_version.strip("rocm")
if self.gpu_version.startswith("rocm")
else self.gpu_version
)
)
docker_distro_suffix = (
alt_docker_suffix
if self.pydistro != "conda"
else ("cuda" if alt_docker_suffix.startswith("cuda") else "rocm")
)
return miniutils.quote(
"pytorch/" + docker_distro_prefix + "-" + docker_distro_suffix
)
def get_name_prefix(self):
return "smoke" if self.smoke else "binary"
def gen_build_name(self, build_or_test, nightly):
parts = [self.get_name_prefix(), self.os] + self.gen_build_env_parms()
if nightly:
parts.append("nightly")
if self.libtorch_variant:
parts.append(self.libtorch_variant)
if not self.smoke:
parts.append(build_or_test)
joined = "_".join(parts)
return joined.replace(".", "_")
def gen_workflow_job(self, phase, upload_phase_dependency=None, nightly=False):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase, nightly)
job_def["build_environment"] = miniutils.quote(
" ".join(self.gen_build_env_parms())
)
if self.smoke:
job_def["requires"] = [
"update_s3_htmls",
]
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=["postnightly"],
)
else:
filter_branch = r"/.*/"
job_def["filters"] = branch_filters.gen_filter_dict(
branches_list=[filter_branch],
tags_list=[branch_filters.RC_PATTERN],
)
if self.libtorch_variant:
job_def["libtorch_variant"] = miniutils.quote(self.libtorch_variant)
if phase == "test":
if not self.smoke:
job_def["requires"] = [self.gen_build_name("build", nightly)]
if not (self.smoke and self.os == "macos") and self.os != "windows":
job_def["docker_image"] = self.gen_docker_image()
# fix this. only works on cuda not rocm
if self.os != "windows" and self.gpu_version:
job_def["use_cuda_docker_runtime"] = miniutils.quote("1")
else:
if self.os == "linux" and phase != "upload":
job_def["docker_image"] = self.gen_docker_image()
if phase == "test":
if self.gpu_version:
if self.os == "windows":
job_def["executor"] = "windows-with-nvidia-gpu"
else:
job_def["resource_class"] = "gpu.medium"
os_name = miniutils.override(self.os, {"macos": "mac"})
job_name = "_".join([self.get_name_prefix(), os_name, phase])
return {job_name: job_def}
def gen_upload_job(self, phase, requires_dependency):
"""Generate binary_upload job for configuration
Output looks similar to:
- binary_upload:
name: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_upload
context: org-member
requires: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_test
filters:
branches:
only:
- nightly
tags:
only: /v[0-9]+(\\.[0-9]+)*-rc[0-9]+/
package_type: manywheel
upload_subfolder: cu113
"""
return {
"binary_upload": OrderedDict(
{
"name": self.gen_build_name(phase, nightly=True),
"context": "org-member",
"requires": [
self.gen_build_name(requires_dependency, nightly=True)
],
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"package_type": self.pydistro,
"upload_subfolder": binary_build_data.get_processor_arch_name(
self.gpu_version,
),
}
)
}
def get_root(smoke, name):
return binary_build_data.TopLevelNode(
name,
binary_build_data.CONFIG_TREE_DATA,
smoke,
)
def gen_build_env_list(smoke):
root = get_root(smoke, "N/A")
config_list = conf_tree.dfs(root)
newlist = []
for c in config_list:
conf = Conf(
c.find_prop("os_name"),
c.find_prop("gpu"),
c.find_prop("package_format"),
[c.find_prop("pyver")],
c.find_prop("smoke")
and not (c.find_prop("os_name") == "macos_arm64"), # don't test arm64
c.find_prop("libtorch_variant"),
c.find_prop("gcc_config_variant"),
c.find_prop("libtorch_config_variant"),
)
newlist.append(conf)
return newlist
def predicate_exclude_macos(config):
return config.os == "linux" or config.os == "windows"
def get_nightly_uploads():
configs = gen_build_env_list(False)
mylist = []
for conf in configs:
phase_dependency = "test" if predicate_exclude_macos(conf) else "build"
mylist.append(conf.gen_upload_job("upload", phase_dependency))
return mylist
def get_post_upload_jobs():
return [
{
"update_s3_htmls": {
"name": "update_s3_htmls",
"context": "org-member",
"filters": branch_filters.gen_filter_dict(
branches_list=["postnightly"],
),
},
},
]
def get_nightly_tests():
configs = gen_build_env_list(False)
filtered_configs = filter(predicate_exclude_macos, configs)
tests = []
for conf_options in filtered_configs:
yaml_item = conf_options.gen_workflow_job("test", nightly=True)
tests.append(yaml_item)
return tests
def get_jobs(toplevel_key, smoke):
jobs_list = []
configs = gen_build_env_list(smoke)
phase = "build" if toplevel_key == "binarybuilds" else "test"
for build_config in configs:
# don't test for macos_arm64 as it's cross compiled
if phase != "test" or build_config.os != "macos_arm64":
jobs_list.append(build_config.gen_workflow_job(phase, nightly=True))
return jobs_list
def get_binary_build_jobs():
return get_jobs("binarybuilds", False)
def get_binary_smoke_test_jobs():
return get_jobs("binarysmoketests", True)


@ -1,19 +0,0 @@
PHASES = ["build", "test"]
CUDA_VERSIONS = [
"102",
"113",
"116",
"117",
]
ROCM_VERSIONS = [
"4.3.1",
"4.5.2",
]
ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS
STANDARD_PYTHON_VERSIONS = ["3.7", "3.8", "3.9", "3.10"]


@ -1,296 +0,0 @@
from cimodel.lib.conf_tree import ConfigNode
CONFIG_TREE_DATA = []
def get_major_pyver(dotted_version):
parts = dotted_version.split(".")
return "py" + parts[0]
class TreeConfigNode(ConfigNode):
def __init__(self, parent, node_name, subtree):
super().__init__(parent, self.modify_label(node_name))
self.subtree = subtree
self.init2(node_name)
def modify_label(self, label):
return label
def init2(self, node_name):
pass
def get_children(self):
return [self.child_constructor()(self, k, v) for (k, v) in self.subtree]
class TopLevelNode(TreeConfigNode):
def __init__(self, node_name, subtree):
super().__init__(None, node_name, subtree)
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return DistroConfigNode
class DistroConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["distro_name"] = node_name
def child_constructor(self):
distro = self.find_prop("distro_name")
next_nodes = {
"xenial": XenialCompilerConfigNode,
"bionic": BionicCompilerConfigNode,
}
return next_nodes[distro]
class PyVerConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["pyver"] = node_name
self.props["abbreviated_pyver"] = get_major_pyver(node_name)
if node_name == "3.9":
self.props["abbreviated_pyver"] = "py3.9"
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ExperimentalFeatureConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["experimental_feature"] = node_name
def child_constructor(self):
experimental_feature = self.find_prop("experimental_feature")
next_nodes = {
"asan": AsanConfigNode,
"xla": XlaConfigNode,
"mps": MPSConfigNode,
"vulkan": VulkanConfigNode,
"parallel_tbb": ParallelTBBConfigNode,
"crossref": CrossRefConfigNode,
"dynamo": DynamoConfigNode,
"parallel_native": ParallelNativeConfigNode,
"onnx": ONNXConfigNode,
"libtorch": LibTorchConfigNode,
"important": ImportantConfigNode,
"build_only": BuildOnlyConfigNode,
"shard_test": ShardTestConfigNode,
"cuda_gcc_override": CudaGccOverrideConfigNode,
"pure_torch": PureTorchConfigNode,
"slow_gradcheck": SlowGradcheckConfigNode,
}
return next_nodes[experimental_feature]
class SlowGradcheckConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_slow_gradcheck"] = True
def child_constructor(self):
return ExperimentalFeatureConfigNode
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)
def init2(self, node_name):
self.props["is_pure_torch"] = node_name
def child_constructor(self):
return ImportantConfigNode
class XlaConfigNode(TreeConfigNode):
def modify_label(self, label):
return "XLA=" + str(label)
def init2(self, node_name):
self.props["is_xla"] = node_name
def child_constructor(self):
return ImportantConfigNode
class MPSConfigNode(TreeConfigNode):
def modify_label(self, label):
return "MPS=" + str(label)
def init2(self, node_name):
self.props["is_mps"] = node_name
def child_constructor(self):
return ImportantConfigNode
class AsanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Asan=" + str(label)
def init2(self, node_name):
self.props["is_asan"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ONNXConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Onnx=" + str(label)
def init2(self, node_name):
self.props["is_onnx"] = node_name
def child_constructor(self):
return ImportantConfigNode
class VulkanConfigNode(TreeConfigNode):
def modify_label(self, label):
return "Vulkan=" + str(label)
def init2(self, node_name):
self.props["is_vulkan"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelTBBConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELTBB=" + str(label)
def init2(self, node_name):
self.props["parallel_backend"] = "paralleltbb"
def child_constructor(self):
return ImportantConfigNode
class CrossRefConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_crossref"] = node_name
def child_constructor(self):
return ImportantConfigNode
class DynamoConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_dynamo"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ParallelNativeConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PARALLELNATIVE=" + str(label)
def init2(self, node_name):
self.props["parallel_backend"] = "parallelnative"
def child_constructor(self):
return ImportantConfigNode
class LibTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "BUILD_TEST_LIBTORCH=" + str(label)
def init2(self, node_name):
self.props["is_libtorch"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class CudaGccOverrideConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["cuda_gcc_override"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class BuildOnlyConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["build_only"] = node_name
def child_constructor(self):
return ExperimentalFeatureConfigNode
class ShardTestConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["shard_test"] = node_name
def child_constructor(self):
return ImportantConfigNode
class ImportantConfigNode(TreeConfigNode):
def modify_label(self, label):
return "IMPORTANT=" + str(label)
def init2(self, node_name):
self.props["is_important"] = node_name
def get_children(self):
return []
class XenialCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
def init2(self, node_name):
self.props["compiler_name"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return (
XenialCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class BionicCompilerConfigNode(TreeConfigNode):
def modify_label(self, label):
return label or "<unspecified>"
def init2(self, node_name):
self.props["compiler_name"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return (
BionicCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class XenialCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode
class BionicCompilerVersionConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["compiler_version"] = node_name
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return PyVerConfigNode

@@ -1,382 +0,0 @@
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Optional
import cimodel.data.dimensions as dimensions
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
from cimodel.data.pytorch_build_data import CONFIG_TREE_DATA, TopLevelNode
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.data.simple.util.docker_constants import gen_docker_image
@dataclass
class Conf:
distro: str
parms: List[str]
parms_list_ignored_for_docker_image: Optional[List[str]] = None
pyver: Optional[str] = None
cuda_version: Optional[str] = None
rocm_version: Optional[str] = None
# TODO expand this to cover all the USE_* that we want to test for
# tensorrt, leveldb, lmdb, redis, opencv, mkldnn, ideep, etc.
# (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259453608)
is_xla: bool = False
is_vulkan: bool = False
is_pure_torch: bool = False
restrict_phases: Optional[List[str]] = None
gpu_resource: Optional[str] = None
dependent_tests: List = field(default_factory=list)
parent_build: Optional["Conf"] = None
is_libtorch: bool = False
is_important: bool = False
parallel_backend: Optional[str] = None
build_only: bool = False
@staticmethod
def is_test_phase(phase):
return "test" in phase
# TODO: Eliminate the special casing for docker paths
# In the short term, we *will* need to support special casing as docker images are merged for caffe2 and pytorch
def get_parms(self, for_docker):
leading = []
# We just don't run non-important jobs on pull requests;
# previously we also named them in a way to make it obvious
# if self.is_important and not for_docker:
# leading.append("AAA")
leading.append("pytorch")
if self.is_xla and not for_docker:
leading.append("xla")
if self.is_vulkan and not for_docker:
leading.append("vulkan")
if self.is_libtorch and not for_docker:
leading.append("libtorch")
if self.is_pure_torch and not for_docker:
leading.append("pure_torch")
if self.parallel_backend is not None and not for_docker:
leading.append(self.parallel_backend)
cuda_parms = []
if self.cuda_version:
cudnn = "cudnn8" if self.cuda_version.startswith("11.") else "cudnn7"
cuda_parms.extend(["cuda" + self.cuda_version, cudnn])
if self.rocm_version:
cuda_parms.extend([f"rocm{self.rocm_version}"])
result = leading + ["linux", self.distro] + cuda_parms + self.parms
if not for_docker and self.parms_list_ignored_for_docker_image is not None:
result = result + self.parms_list_ignored_for_docker_image
return result
def gen_docker_image_path(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
image_name, _ = gen_docker_image(base_build_env_name)
return miniutils.quote(image_name)
def gen_docker_image_requires(self):
parms_source = self.parent_build or self
base_build_env_name = "-".join(parms_source.get_parms(True))
_, requires = gen_docker_image(base_build_env_name)
return miniutils.quote(requires)
def get_build_job_name_pieces(self, build_or_test):
return self.get_parms(False) + [build_or_test]
def gen_build_name(self, build_or_test):
return (
("_".join(map(str, self.get_build_job_name_pieces(build_or_test))))
.replace(".", "_")
.replace("-", "_")
)
def get_dependents(self):
return self.dependent_tests or []
def gen_workflow_params(self, phase):
parameters = OrderedDict()
build_job_name_pieces = self.get_build_job_name_pieces(phase)
build_env_name = "-".join(map(str, build_job_name_pieces))
parameters["build_environment"] = miniutils.quote(build_env_name)
parameters["docker_image"] = self.gen_docker_image_path()
if Conf.is_test_phase(phase) and self.gpu_resource:
parameters["use_cuda_docker_runtime"] = miniutils.quote("1")
if Conf.is_test_phase(phase):
resource_class = "large"
if self.gpu_resource:
resource_class = "gpu." + self.gpu_resource
if self.rocm_version is not None:
resource_class = "pytorch/amd-gpu"
parameters["resource_class"] = resource_class
if phase == "build" and self.rocm_version is not None:
parameters["resource_class"] = "xlarge"
if hasattr(self, "filters"):
parameters["filters"] = self.filters
if self.build_only:
parameters["build_only"] = miniutils.quote(str(int(True)))
return parameters
def gen_workflow_job(self, phase):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase)
if Conf.is_test_phase(phase):
# TODO When merging the caffe2 and pytorch jobs, it might be convenient for a while to make a
# caffe2 test job dependent on a pytorch build job. This way we could quickly dedup the repeated
# build of pytorch in the caffe2 build job, and just run the caffe2 tests off of a completed
# pytorch build job (from https://github.com/pytorch/pytorch/pull/17323#discussion_r259452641)
dependency_build = self.parent_build or self
job_def["requires"] = [dependency_build.gen_build_name("build")]
job_name = "pytorch_linux_test"
else:
job_name = "pytorch_linux_build"
job_def["requires"] = [self.gen_docker_image_requires()]
if not self.is_important:
job_def["filters"] = gen_filter_dict()
job_def.update(self.gen_workflow_params(phase))
return {job_name: job_def}
# TODO This is a hack to special case some configs just for the workflow list
class HiddenConf:
def __init__(self, name, parent_build=None, filters=None):
self.name = name
self.parent_build = parent_build
self.filters = filters
def gen_workflow_job(self, phase):
return {
self.gen_build_name(phase): {
"requires": [self.parent_build.gen_build_name("build")],
"filters": self.filters,
}
}
def gen_build_name(self, _):
return self.name
class DocPushConf:
def __init__(self, name, parent_build=None, branch="master"):
self.name = name
self.parent_build = parent_build
self.branch = branch
def gen_workflow_job(self, phase):
return {
"pytorch_doc_push": {
"name": self.name,
"branch": self.branch,
"requires": [self.parent_build],
"context": "org-member",
"filters": gen_filter_dict(
branches_list=["nightly"], tags_list=RC_PATTERN
),
}
}
def gen_docs_configs(xenial_parent_config):
configs = []
configs.append(
HiddenConf(
"pytorch_python_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
DocPushConf(
"pytorch_python_doc_push",
parent_build="pytorch_python_doc_build",
branch="site",
)
)
configs.append(
HiddenConf(
"pytorch_cpp_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
DocPushConf(
"pytorch_cpp_doc_push",
parent_build="pytorch_cpp_doc_build",
branch="master",
)
)
return configs
def get_root():
return TopLevelNode("PyTorch Builds", CONFIG_TREE_DATA)
def gen_tree():
root = get_root()
configs_list = conf_tree.dfs(root)
return configs_list
def instantiate_configs(only_slow_gradcheck):
config_list = []
root = get_root()
found_configs = conf_tree.dfs(root)
for fc in found_configs:
restrict_phases = None
distro_name = fc.find_prop("distro_name")
compiler_name = fc.find_prop("compiler_name")
compiler_version = fc.find_prop("compiler_version")
is_xla = fc.find_prop("is_xla") or False
is_asan = fc.find_prop("is_asan") or False
is_crossref = fc.find_prop("is_crossref") or False
is_dynamo = fc.find_prop("is_dynamo") or False
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
is_vulkan = fc.find_prop("is_vulkan") or False
is_slow_gradcheck = fc.find_prop("is_slow_gradcheck") or False
parms_list_ignored_for_docker_image = []
if only_slow_gradcheck ^ is_slow_gradcheck:
continue
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
parms_list = [fc.find_prop("abbreviated_pyver")]
else:
parms_list = ["py" + fc.find_prop("pyver")]
cuda_version = None
rocm_version = None
if compiler_name == "cuda":
cuda_version = fc.find_prop("compiler_version")
elif compiler_name == "rocm":
rocm_version = fc.find_prop("compiler_version")
restrict_phases = ["build", "test1", "test2", "caffe2_test"]
elif compiler_name == "android":
android_ndk_version = fc.find_prop("compiler_version")
# TODO: do we need clang to compile host binaries like protoc?
parms_list.append("clang5")
parms_list.append("android-ndk-" + android_ndk_version)
android_abi = fc.find_prop("android_abi")
parms_list_ignored_for_docker_image.append(android_abi)
restrict_phases = ["build"]
elif compiler_name:
gcc_version = compiler_name + (fc.find_prop("compiler_version") or "")
parms_list.append(gcc_version)
if is_asan:
parms_list.append("asan")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
if is_crossref:
parms_list_ignored_for_docker_image.append("crossref")
if is_dynamo:
parms_list_ignored_for_docker_image.append("dynamo")
if is_onnx:
parms_list.append("onnx")
python_version = fc.find_prop("pyver")
parms_list[0] = fc.find_prop("abbreviated_pyver")
restrict_phases = ["build", "ort_test1", "ort_test2"]
if cuda_version:
cuda_gcc_version = fc.find_prop("cuda_gcc_override") or "gcc7"
parms_list.append(cuda_gcc_version)
is_libtorch = fc.find_prop("is_libtorch") or False
is_important = fc.find_prop("is_important") or False
parallel_backend = fc.find_prop("parallel_backend") or None
build_only = fc.find_prop("build_only") or False
shard_test = fc.find_prop("shard_test") or False
# TODO: fix pure_torch python test packaging issue.
if shard_test:
restrict_phases = ["build"] if restrict_phases is None else restrict_phases
restrict_phases.extend(["test1", "test2"])
if build_only or is_pure_torch:
restrict_phases = ["build"]
if is_slow_gradcheck:
parms_list_ignored_for_docker_image.append("old")
parms_list_ignored_for_docker_image.append("gradcheck")
gpu_resource = None
if cuda_version and cuda_version != "10":
gpu_resource = "medium"
c = Conf(
distro_name,
parms_list,
parms_list_ignored_for_docker_image,
python_version,
cuda_version,
rocm_version,
is_xla,
is_vulkan,
is_pure_torch,
restrict_phases,
gpu_resource,
is_libtorch=is_libtorch,
is_important=is_important,
parallel_backend=parallel_backend,
build_only=build_only,
)
# run docs builds on "pytorch-linux-xenial-py3.7-gcc5.4". Docs builds
# should run on a CPU-only build that runs on all PRs.
# XXX should this be updated to a more modern build?
if (
distro_name == "xenial"
and fc.find_prop("pyver") == "3.7"
and cuda_version is None
and parallel_backend is None
and not is_vulkan
and not is_pure_torch
and compiler_name == "gcc"
and fc.find_prop("compiler_version") == "5.4"
):
c.filters = gen_filter_dict(branches_list=r"/.*/", tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
config_list.append(c)
return config_list
def get_workflow_jobs(only_slow_gradcheck=False):
config_list = instantiate_configs(only_slow_gradcheck)
x = []
for conf_options in config_list:
phases = conf_options.restrict_phases or dimensions.PHASES
for phase in phases:
# TODO why does this not have a test?
if Conf.is_test_phase(phase) and conf_options.cuda_version == "10":
continue
x.append(conf_options.gen_workflow_job(phase))
# TODO convert to recursion
for conf in conf_options.get_dependents():
x.append(conf.gen_workflow_job("test"))
return x
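
To make the naming scheme above concrete, here is a small illustrative sketch (not part of the original file), assuming the Conf dataclass above is importable; the distro and parms values are made up:

```
c = Conf(distro="xenial", parms=["py3.7", "gcc5.4"])
c.get_parms(for_docker=False)  # ['pytorch', 'linux', 'xenial', 'py3.7', 'gcc5.4']
c.gen_build_name("build")      # 'pytorch_linux_xenial_py3_7_gcc5_4_build'
c.gen_build_name("test")       # 'pytorch_linux_xenial_py3_7_gcc5_4_test'
```

Dots and dashes are folded into underscores so the result is a valid CircleCI job name.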

@@ -1,39 +0,0 @@
from collections import OrderedDict
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.lib.miniutils import quote
# NOTE: All hardcoded docker image builds have been migrated to GHA
IMAGE_NAMES = []
# This entry should be an element from the list above
# This should contain the image matching the "slow_gradcheck" entry in
# pytorch_build_data.py
SLOW_GRADCHECK_IMAGE_NAME = "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
def get_workflow_jobs(images=IMAGE_NAMES, only_slow_gradcheck=False):
"""Generates a list of docker image build definitions"""
ret = []
for image_name in images:
if image_name.startswith("docker-"):
# lstrip() strips a character set, not a prefix; slice off the literal "docker-" prefix instead
image_name = image_name[len("docker-"):]
if only_slow_gradcheck and image_name != SLOW_GRADCHECK_IMAGE_NAME:
continue
parameters = OrderedDict(
{
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),
}
)
if image_name == "pytorch-linux-xenial-py3.7-gcc5.4":
# pushing documentation on tags requires CircleCI to also
# build all the dependencies on tags, including this docker image
parameters["filters"] = gen_filter_dict(
branches_list=r"/.*/", tags_list=RC_PATTERN
)
ret.append(OrderedDict({"docker_build_job": parameters}))
return ret
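
Since IMAGE_NAMES is now empty (all hardcoded image builds moved to GHA), the generator only emits jobs for images passed in explicitly. A hedged sketch with a hypothetical image name:

```
get_workflow_jobs(images=["pytorch-linux-xenial-py3-clang5-asan"])
# [OrderedDict([('docker_build_job',
#                OrderedDict([('name', '"docker-pytorch-linux-xenial-py3-clang5-asan"'),
#                             ('image_name', '"pytorch-linux-xenial-py3-clang5-asan"')]))])]
```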

@@ -1,100 +0,0 @@
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict_exclude
from cimodel.data.simple.util.versions import MultiPartVersion
XCODE_VERSION = MultiPartVersion([12, 5, 1])
class ArchVariant:
def __init__(self, name, custom_build_name=""):
self.name = name
self.custom_build_name = custom_build_name
def render(self):
extra_parts = (
[self.custom_build_name] if len(self.custom_build_name) > 0 else []
)
return "-".join([self.name] + extra_parts).replace("_", "-")
def get_platform(arch_variant_name):
return "SIMULATOR" if arch_variant_name == "x86_64" else "OS"
class IOSJob:
def __init__(
self, xcode_version, arch_variant, is_org_member_context=True, extra_props=None
):
self.xcode_version = xcode_version
self.arch_variant = arch_variant
self.is_org_member_context = is_org_member_context
self.extra_props = extra_props
def gen_name_parts(self):
version_parts = self.xcode_version.render_dots_or_parts("-")
build_variant_suffix = self.arch_variant.render()
return (
[
"ios",
]
+ version_parts
+ [
build_variant_suffix,
]
)
def gen_job_name(self):
return "-".join(self.gen_name_parts())
def gen_tree(self):
platform_name = get_platform(self.arch_variant.name)
props_dict = {
"name": self.gen_job_name(),
"build_environment": self.gen_job_name(),
"ios_arch": self.arch_variant.name,
"ios_platform": platform_name,
}
if self.is_org_member_context:
props_dict["context"] = "org-member"
if self.extra_props:
props_dict.update(self.extra_props)
props_dict["filters"] = gen_filter_dict_exclude()
return [{"pytorch_ios_build": props_dict}]
WORKFLOW_DATA = [
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64"),
is_org_member_context=False,
extra_props={"lite_interpreter": miniutils.quote(str(int(True)))},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={
# "lite_interpreter": miniutils.quote(str(int(True)))}),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={
# "use_metal": miniutils.quote(str(int(True))),
# "lite_interpreter": miniutils.quote(str(int(True)))}),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom-ops"), extra_props={
# "op_list": "mobilenetv2.yaml",
# "lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64", "coreml"),
is_org_member_context=False,
extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True))),
},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={
# "use_coreml": miniutils.quote(str(int(True))),
# "lite_interpreter": miniutils.quote(str(int(True)))}),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
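
For orientation, a few illustrative calls (not from the original file), assuming the definitions above:

```
ArchVariant("x86_64", "coreml").render()                     # 'x86-64-coreml'
get_platform("x86_64")                                       # 'SIMULATOR'
get_platform("arm64")                                        # 'OS'
IOSJob(XCODE_VERSION, ArchVariant("x86_64")).gen_job_name()  # 'ios-12-5-1-x86-64'
```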

@@ -1,54 +0,0 @@
class MacOsJob:
def __init__(self, os_version, is_build=False, is_test=False, extra_props=tuple()):
# extra_props is a tuple because mutable data structures are not recommended
# as argument defaults.
self.os_version = os_version
self.is_build = is_build
self.is_test = is_test
self.extra_props = dict(extra_props)
def gen_tree(self):
non_phase_parts = ["pytorch", "macos", self.os_version, "py3"]
extra_name_list = [name for name, exist in self.extra_props.items() if exist]
full_job_name_list = (
non_phase_parts
+ extra_name_list
+ [
"build" if self.is_build else None,
"test" if self.is_test else None,
]
)
full_job_name = "_".join(list(filter(None, full_job_name_list)))
test_build_dependency = "_".join(non_phase_parts + ["build"])
extra_dependencies = [test_build_dependency] if self.is_test else []
job_dependencies = extra_dependencies
# Yes we name the job after itself, it needs a non-empty value in here
# for the YAML output to work.
props_dict = {"requires": job_dependencies, "name": full_job_name}
return [{full_job_name: props_dict}]
WORKFLOW_DATA = [
MacOsJob("10_15", is_build=True),
MacOsJob("10_13", is_build=True),
MacOsJob(
"10_13",
is_build=False,
is_test=True,
),
MacOsJob(
"10_13",
is_build=True,
is_test=True,
extra_props=tuple({"lite_interpreter": True}.items()),
),
]
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
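
As a rough illustration of how a test job picks up its build dependency (values derived from the code above, not from the original file):

```
MacOsJob("10_13", is_test=True).gen_tree()
# [{'pytorch_macos_10_13_py3_test': {'requires': ['pytorch_macos_10_13_py3_build'],
#                                    'name': 'pytorch_macos_10_13_py3_test'}}]
```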

@@ -1,51 +0,0 @@
"""
PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
"""
import cimodel.data.simple.util.branch_filters
import cimodel.lib.miniutils as miniutils
class MobileJob:
def __init__(
self, docker_image, docker_requires, variant_parts, is_master_only=False
):
self.docker_image = docker_image
self.docker_requires = docker_requires
self.variant_parts = variant_parts
self.is_master_only = is_master_only
def gen_tree(self):
non_phase_parts = [
"pytorch",
"linux",
"xenial",
"py3",
"clang5",
"mobile",
] + self.variant_parts
full_job_name = "_".join(non_phase_parts)
build_env_name = "-".join(non_phase_parts)
props_dict = {
"build_environment": build_env_name,
"build_only": miniutils.quote(str(int(True))),
"docker_image": self.docker_image,
"requires": self.docker_requires,
"name": full_job_name,
}
if self.is_master_only:
props_dict[
"filters"
] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
return [{"pytorch_linux_build": props_dict}]
WORKFLOW_DATA = []
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]

@@ -1,96 +0,0 @@
import cimodel.data.simple.ios_definitions as ios_definitions
import cimodel.lib.miniutils as miniutils
class IOSNightlyJob:
def __init__(self, variant, is_full_jit=False, is_upload=False):
self.variant = variant
self.is_full_jit = is_full_jit
self.is_upload = is_upload
def get_phase_name(self):
return "upload" if self.is_upload else "build"
def get_common_name_pieces(self, sep):
extra_name_suffix = [self.get_phase_name()] if self.is_upload else []
extra_name = ["full_jit"] if self.is_full_jit else []
common_name_pieces = (
[
"ios",
]
+ extra_name
+ []
+ ios_definitions.XCODE_VERSION.render_dots_or_parts(sep)
+ [
"nightly",
self.variant,
"build",
]
+ extra_name_suffix
)
return common_name_pieces
def gen_job_name(self):
return "_".join(["pytorch"] + self.get_common_name_pieces(None))
def gen_tree(self):
build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS
extra_requires = (
[x.gen_job_name() for x in build_configs] if self.is_upload else []
)
props_dict = {
"build_environment": "-".join(
["libtorch"] + self.get_common_name_pieces(".")
),
"requires": extra_requires,
"context": "org-member",
"filters": {"branches": {"only": "nightly"}},
}
if not self.is_upload:
props_dict["ios_arch"] = self.variant
props_dict["ios_platform"] = ios_definitions.get_platform(self.variant)
props_dict["name"] = self.gen_job_name()
props_dict["use_metal"] = miniutils.quote(str(int(True)))
props_dict["use_coreml"] = miniutils.quote(str(int(True)))
if self.is_full_jit:
props_dict["lite_interpreter"] = miniutils.quote(str(int(False)))
template_name = "_".join(
[
"binary",
"ios",
self.get_phase_name(),
]
)
return [{template_name: props_dict}]
BUILD_CONFIGS = [
IOSNightlyJob("x86_64"),
IOSNightlyJob("arm64"),
]
BUILD_CONFIGS_FULL_JIT = [
IOSNightlyJob("x86_64", is_full_jit=True),
IOSNightlyJob("arm64", is_full_jit=True),
]
WORKFLOW_DATA = (
BUILD_CONFIGS
+ BUILD_CONFIGS_FULL_JIT
+ [
IOSNightlyJob("binary", is_full_jit=False, is_upload=True),
IOSNightlyJob("binary", is_full_jit=True, is_upload=True),
]
)
def get_workflow_jobs():
return [item.gen_tree() for item in WORKFLOW_DATA]
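
A quick sketch of the names these jobs generate, assuming the classes above and XCODE_VERSION = 12.5.1 from ios_definitions:

```
IOSNightlyJob("x86_64").gen_job_name()
# 'pytorch_ios_12_5_1_nightly_x86_64_build'

"-".join(["libtorch"] + IOSNightlyJob("x86_64").get_common_name_pieces("."))
# 'libtorch-ios-12.5.1-nightly-x86_64-build'
```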

@@ -1,36 +0,0 @@
NON_PR_BRANCH_LIST = [
"main",
"master",
r"/ci-all\/.*/",
r"/release\/.*/",
]
PR_BRANCH_LIST = [
r"/gh\/.*\/head/",
r"/pull\/.*/",
]
RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"
MAC_IOS_EXCLUSION_LIST = ["nightly", "postnightly"]
def gen_filter_dict(branches_list=NON_PR_BRANCH_LIST, tags_list=None):
"""Generates a filter dictionary for use with CircleCI's job filter"""
filter_dict = {
"branches": {
"only": branches_list,
},
}
if tags_list is not None:
filter_dict["tags"] = {"only": tags_list}
return filter_dict
def gen_filter_dict_exclude(branches_list=MAC_IOS_EXCLUSION_LIST):
return {
"branches": {
"ignore": branches_list,
},
}
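
For illustration, the dictionaries these helpers produce (derived from the defaults above, not part of the original file):

```
gen_filter_dict()
# {'branches': {'only': ['main', 'master', '/ci-all\\/.*/', '/release\\/.*/']}}

gen_filter_dict(branches_list=["nightly"], tags_list=RC_PATTERN)
# {'branches': {'only': ['nightly']}, 'tags': {'only': '/v[0-9]+(\\.[0-9]+)*-rc[0-9]+/'}}

gen_filter_dict_exclude()
# {'branches': {'ignore': ['nightly', 'postnightly']}}
```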

@@ -1,35 +0,0 @@
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
def gen_docker_image(container_type):
return (
"/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
f"docker-{container_type}",
)
def gen_docker_image_requires(image_name):
return [f"docker-{image_name}"]
DOCKER_IMAGE_BASIC, DOCKER_REQUIREMENT_BASE = gen_docker_image(
"pytorch-linux-xenial-py3.7-gcc5.4"
)
DOCKER_IMAGE_CUDA_10_2, DOCKER_REQUIREMENT_CUDA_10_2 = gen_docker_image(
"pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
)
DOCKER_IMAGE_GCC7, DOCKER_REQUIREMENT_GCC7 = gen_docker_image(
"pytorch-linux-xenial-py3.7-gcc7"
)
def gen_mobile_docker(specifier):
container_type = "pytorch-linux-xenial-py3-clang5-" + specifier
return gen_docker_image(container_type)
DOCKER_IMAGE_ASAN, DOCKER_REQUIREMENT_ASAN = gen_mobile_docker("asan")
DOCKER_IMAGE_NDK, DOCKER_REQUIREMENT_NDK = gen_mobile_docker("android-ndk-r21e")
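
A couple of illustrative calls (outputs reconstructed from the constants above, not part of the original file):

```
gen_docker_image("pytorch-linux-xenial-py3.7-gcc5.4")
# ('308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.7-gcc5.4',
#  'docker-pytorch-linux-xenial-py3.7-gcc5.4')

gen_mobile_docker("asan")
# ('308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-asan',
#  'docker-pytorch-linux-xenial-py3-clang5-asan')
```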

@@ -1,36 +0,0 @@
from typing import Optional
class MultiPartVersion:
def __init__(self, parts, prefix=""):
self.parts = parts
self.prefix = prefix
def prefixed_parts(self):
"""
Prepends the first element of the version list
with the prefix string.
"""
if self.parts:
return [self.prefix + str(self.parts[0])] + [
str(part) for part in self.parts[1:]
]
else:
return [self.prefix]
def render_dots_or_parts(self, sep: Optional[str] = None):
if sep is None:
return self.prefixed_parts()
else:
return [sep.join(self.prefixed_parts())]
class CudaVersion(MultiPartVersion):
def __init__(self, major, minor):
self.major = major
self.minor = minor
super().__init__([self.major, self.minor], "cuda")
def __str__(self):
return f"{self.major}.{self.minor}"

@@ -1,111 +0,0 @@
from dataclasses import dataclass, field
from typing import Dict, Optional
def X(val):
"""
Compact way to write a leaf node
"""
return val, []
def XImportant(name):
"""Compact way to write an important (run on PRs) leaf node"""
return (name, [("important", [X(True)])])
@dataclass
class Ver:
"""
Represents a product with a version number
"""
name: str
version: str = ""
def __str__(self):
return self.name + self.version
@dataclass
class ConfigNode:
parent: Optional["ConfigNode"]
node_name: str
props: Dict[str, str] = field(default_factory=dict)
def get_label(self):
return self.node_name
# noinspection PyMethodMayBeStatic
def get_children(self):
return []
def get_parents(self):
return (
(self.parent.get_parents() + [self.parent.get_label()])
if self.parent
else []
)
def get_depth(self):
return len(self.get_parents())
def get_node_key(self):
return "%".join(self.get_parents() + [self.get_label()])
def find_prop(self, propname, searched=None):
"""
Checks whether its own dictionary has
the property; otherwise asks the parent node.
"""
if searched is None:
searched = []
searched.append(self.node_name)
if propname in self.props:
return self.props[propname]
elif self.parent:
return self.parent.find_prop(propname, searched)
else:
# raise Exception('Property "%s" does not exist anywhere in the tree! Searched: %s' % (propname, searched))
return None
def dfs_recurse(
node,
leaf_callback=lambda x: None,
discovery_callback=lambda x, y, z: None,
child_callback=lambda x, y: None,
sibling_index=0,
sibling_count=1,
):
discovery_callback(node, sibling_index, sibling_count)
node_children = node.get_children()
if node_children:
for i, child in enumerate(node_children):
child_callback(node, child)
dfs_recurse(
child,
leaf_callback,
discovery_callback,
child_callback,
i,
len(node_children),
)
else:
leaf_callback(node)
def dfs(toplevel_config_node):
config_list = []
def leaf_callback(node):
config_list.append(node)
dfs_recurse(toplevel_config_node, leaf_callback)
return config_list
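
A minimal, hypothetical two-node tree (not part of the original file) showing how find_prop walks up to the parent and how dfs collects the leaf configurations; it assumes the ConfigNode and dfs definitions above:

```
class DistroNode(ConfigNode):
    def __init__(self):
        super().__init__(None, "xenial")
        self.props["distro_name"] = "xenial"

    def get_children(self):
        return [PyVerNode(self)]


class PyVerNode(ConfigNode):
    def __init__(self, parent):
        super().__init__(parent, "3.7")
        self.props["pyver"] = "3.7"


leaves = dfs(DistroNode())
assert leaves[0].find_prop("distro_name") == "xenial"  # inherited from the parent node
assert leaves[0].get_node_key() == "xenial%3.7"
```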

@@ -1,10 +0,0 @@
def quote(s):
return sandwich('"', s)
def sandwich(bread, jam):
return bread + jam + bread
def override(word, substitutions):
return substitutions.get(word, word)
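
For reference, a few illustrative calls (not from the original file):

```
quote("docker-foo")                 # '"docker-foo"'
sandwich("*", "emphasis")           # '*emphasis*'
override("gcc", {"gcc": "gcc7"})    # 'gcc7'
override("clang", {"gcc": "gcc7"})  # 'clang' (no substitution, returned unchanged)
```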

@@ -1,51 +0,0 @@
from collections import OrderedDict
import cimodel.lib.miniutils as miniutils
LIST_MARKER = "- "
INDENTATION_WIDTH = 2
def is_dict(data):
return type(data) in [dict, OrderedDict]
def is_collection(data):
return is_dict(data) or type(data) is list
def render(fh, data, depth, is_list_member=False):
"""
PyYAML does not allow precise control over the quoting
behavior, especially for merge references.
Therefore, we use this custom YAML renderer.
"""
indentation = " " * INDENTATION_WIDTH * depth
if is_dict(data):
tuples = list(data.items())
if type(data) is not OrderedDict:
tuples.sort()
for i, (k, v) in enumerate(tuples):
if not v:
continue
# If this dict is itself a list member, the first key gets prefixed with a list marker
list_marker_prefix = LIST_MARKER if is_list_member and not i else ""
trailing_whitespace = "\n" if is_collection(v) else " "
fh.write(indentation + list_marker_prefix + k + ":" + trailing_whitespace)
render(fh, v, depth + 1 + int(is_list_member))
elif type(data) is list:
for v in data:
render(fh, v, depth, True)
else:
# use empty quotes to denote an empty string value instead of blank space
modified_data = miniutils.quote(data) if data == "" else data
list_member_prefix = indentation + LIST_MARKER if is_list_member else ""
fh.write(list_member_prefix + str(modified_data) + "\n")
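
A small sketch of the output format, assuming render() above is importable; the input data is hypothetical:

```
import sys
from collections import OrderedDict

data = OrderedDict(
    [("workflows", OrderedDict([("build", OrderedDict([("jobs", ["pytorch_linux_build"])]))]))]
)
render(sys.stdout, data, 0)
# workflows:
#   build:
#     jobs:
#       - pytorch_linux_build
```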

.circleci/config.yml (generated, 1386 changed lines): diff suppressed because it is too large.

@@ -1,41 +0,0 @@
#!/usr/bin/env python3
import os
import subprocess
import sys
import tempfile
import generate_config_yml
CHECKED_IN_FILE = "config.yml"
REGENERATION_SCRIPT = "regenerate.sh"
PARENT_DIR = os.path.basename(os.path.dirname(os.path.abspath(__file__)))
README_PATH = os.path.join(PARENT_DIR, "README.md")
ERROR_MESSAGE_TEMPLATE = """
The checked-in CircleCI "%s" file does not match what was generated by the scripts.
Please re-run the "%s" script in the "%s" directory and commit the result. See "%s" for more information.
"""
def check_consistency():
_, temp_filename = tempfile.mkstemp("-generated-config.yml")
with open(temp_filename, "w") as fh:
generate_config_yml.stitch_sources(fh)
try:
subprocess.check_call(["cmp", temp_filename, CHECKED_IN_FILE])
except subprocess.CalledProcessError:
sys.exit(
ERROR_MESSAGE_TEMPLATE
% (CHECKED_IN_FILE, REGENERATION_SCRIPT, PARENT_DIR, README_PATH)
)
finally:
os.remove(temp_filename)
if __name__ == "__main__":
check_consistency()

@@ -1,196 +0,0 @@
#!/usr/bin/env python3
"""
This script is the source of truth for config.yml.
Please see README.md in this directory for details.
"""
import os
import shutil
import sys
from collections import namedtuple
import cimodel.data.simple.docker_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_ios
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
class File:
"""
Verbatim copy the contents of a file into config.yml
"""
def __init__(self, filename):
self.filename = filename
def write(self, output_filehandle):
with open(os.path.join("verbatim-sources", self.filename)) as fh:
shutil.copyfileobj(fh, output_filehandle)
class FunctionGen(namedtuple("FunctionGen", "function depth")):
__slots__ = ()
class Treegen(FunctionGen):
"""
Insert the content of a YAML tree into config.yml
"""
def write(self, output_filehandle):
miniyaml.render(output_filehandle, self.function(), self.depth)
class Listgen(FunctionGen):
"""
Insert the content of a YAML list into config.yml
"""
def write(self, output_filehandle):
miniyaml.render(output_filehandle, self.function(), self.depth)
def horizontal_rule():
return "".join("#" * 78)
class Header:
def __init__(self, title, summary=None):
self.title = title
self.summary_lines = summary or []
def write(self, output_filehandle):
text_lines = [self.title] + self.summary_lines
comment_lines = ["# " + x for x in text_lines]
lines = miniutils.sandwich([horizontal_rule()], comment_lines)
for line in filter(None, lines):
output_filehandle.write(line + "\n")
def _for_all_items(items, functor) -> None:
if isinstance(items, list):
for item in items:
_for_all_items(item, functor)
if isinstance(items, dict) and len(items) == 1:
item_type, item = next(iter(items.items()))
functor(item_type, item)
def filter_master_only_jobs(items):
def _is_main_or_master_item(item):
filters = item.get("filters", None)
branches = filters.get("branches", None) if filters is not None else None
branches_only = branches.get("only", None) if branches is not None else None
return (
("main" in branches_only or "master" in branches_only)
if branches_only is not None
else False
)
master_deps = set()
def _save_requires_if_master(item_type, item):
requires = item.get("requires", None)
item_name = item.get("name", None)
if not isinstance(requires, list):
return
if _is_main_or_master_item(item) or item_name in master_deps:
master_deps.update([n.strip('"') for n in requires])
def _do_filtering(items):
if isinstance(items, list):
rc = [_do_filtering(item) for item in items]
return [item for item in rc if len(item if item is not None else []) > 0]
assert isinstance(items, dict) and len(items) == 1
item_type, item = next(iter(items.items()))
item_name = item.get("name", None)
item_name = item_name.strip('"') if item_name is not None else None
if not _is_main_or_master_item(item) and item_name not in master_deps:
return None
if "filters" in item:
item = item.copy()
item.pop("filters")
return {item_type: item}
# Scan the dependencies twice to pick up nested required jobs,
# i.e. jobs depending on jobs that main-only jobs depend on
_for_all_items(items, _save_requires_if_master)
_for_all_items(items, _save_requires_if_master)
return _do_filtering(items)
def generate_required_docker_images(items):
required_docker_images = set()
def _requires_docker_image(item_type, item):
requires = item.get("requires", None)
if not isinstance(requires, list):
return
for requirement in requires:
requirement = requirement.replace('"', "")
if requirement.startswith("docker-"):
required_docker_images.add(requirement)
_for_all_items(items, _requires_docker_image)
return required_docker_images
def gen_build_workflows_tree():
build_workflows_functions = [
cimodel.data.simple.mobile_definitions.get_workflow_jobs,
cimodel.data.simple.nightly_ios.get_workflow_jobs,
]
build_jobs = [f() for f in build_workflows_functions]
build_jobs.extend(
cimodel.data.simple.docker_definitions.get_workflow_jobs(
# sort for consistency
sorted(generate_required_docker_images(build_jobs))
)
)
master_build_jobs = filter_master_only_jobs(build_jobs)
rc = {
"workflows": {
"build": {
"when": r"<< pipeline.parameters.run_build >>",
"jobs": build_jobs,
},
}
}
if len(master_build_jobs) > 0:
rc["workflows"]["master_build"] = {
"when": r"<< pipeline.parameters.run_master_build >>",
"jobs": master_build_jobs,
}
return rc
# Order of this list matters to the generated config.yml.
YAML_SOURCES = [
File("header-section.yml"),
File("commands.yml"),
File("nightly-binary-build-defaults.yml"),
Header("Build parameters"),
File("build-parameters/pytorch-build-params.yml"),
File("build-parameters/binary-build-params.yml"),
Header("Job specs"),
File("job-specs/binary-job-specs.yml"),
File("job-specs/job-specs-custom.yml"),
File("job-specs/binary_update_htmls.yml"),
File("job-specs/binary-build-tests.yml"),
File("job-specs/docker_jobs.yml"),
Header("Workflows"),
Treegen(gen_build_workflows_tree, 0),
]
def stitch_sources(output_filehandle):
for f in YAML_SOURCES:
f.write(output_filehandle)
if __name__ == "__main__":
stitch_sources(sys.stdout)
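
To illustrate filter_master_only_jobs above, a tiny hypothetical job list; only the job restricted to master survives, and its filters key is stripped:

```
jobs = [
    {"pytorch_linux_build": {"name": "master_only_build",
                             "filters": {"branches": {"only": ["master"]}}}},
    {"docker_build_job": {"name": "docker-image", "requires": []}},
]
filter_master_only_jobs(jobs)
# [{'pytorch_linux_build': {'name': 'master_only_build'}}]
```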

@@ -1,5 +0,0 @@
cd $PSScriptRoot;
$NewFile = New-TemporaryFile;
python generate_config_yml.py > $NewFile.name
(Get-Content $NewFile.name -Raw).TrimEnd().Replace("`r`n","`n") | Set-Content config.yml -Force
Remove-Item $NewFile.name

@@ -1,17 +0,0 @@
#!/bin/bash -e
# Allows this script to be invoked from any directory:
cd "$(dirname "$0")"
UNCOMMIT_CHANGE=$(git status -s | grep " config.yml" | wc -l | xargs)
if [[ $UNCOMMIT_CHANGE != 0 ]]; then
OLD_FILE=$(mktemp)
cp config.yml "$OLD_FILE"
echo "Uncommitted change detected in .circleci/config.yml"
echo "It has been backed up to $OLD_FILE"
fi
NEW_FILE=$(mktemp)
./generate_config_yml.py > "$NEW_FILE"
cp "$NEW_FILE" config.yml
echo "New config generated in .circleci/config.yml"

@@ -1,69 +0,0 @@
#!/bin/bash
set -eux -o pipefail
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# This step runs on multiple executors with different envfile locations
if [[ "$(uname)" == Darwin ]]; then
# macos executor (builds and tests)
workdir="/Users/distiller/project"
elif [[ "$OSTYPE" == "msys" ]]; then
# windows executor (builds and tests)
rm -rf /c/w
ln -s "/c/Users/circleci/project" /c/w
workdir="/c/w"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
workdir="/home/circleci/project"
else
# docker executor (binary builds)
workdir="/"
fi
# It is very important that this stays in sync with binary_populate_env.sh
if [[ "$OSTYPE" == "msys" ]]; then
# We need to make the paths as short as possible on Windows
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
# Try to extract PR number from branch if not already set
if [[ -z "${CIRCLE_PR_NUMBER:-}" ]]; then
CIRCLE_PR_NUMBER="$(echo ${CIRCLE_BRANCH} | sed -E -n 's/pull\/([0-9]*).*/\1/p')"
fi
# Clone the Pytorch branch
retry git clone https://github.com/pytorch/pytorch.git "$PYTORCH_ROOT"
pushd "$PYTORCH_ROOT"
if [[ -n "${CIRCLE_PR_NUMBER:-}" ]]; then
# "smoke" binary build on PRs
git fetch --force origin "pull/${CIRCLE_PR_NUMBER}/head:remotes/origin/pull/${CIRCLE_PR_NUMBER}"
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B "$CIRCLE_BRANCH"
git reset --hard "$CIRCLE_SHA1"
elif [[ -n "${CIRCLE_SHA1:-}" ]]; then
# Scheduled workflows & "smoke" binary build on trunk on PR merges
DEFAULT_BRANCH="$(git remote show $CIRCLE_REPOSITORY_URL | awk '/HEAD branch/ {print $NF}')"
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B $DEFAULT_BRANCH
else
echo "Can't tell what to checkout"
exit 1
fi
retry git submodule update --init --recursive
echo "Using Pytorch from "
git --no-pager log --max-count 1
popd
# Clone the Builder main repo
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
pushd "$BUILDER_ROOT"
echo "Using builder from "
git --no-pager log --max-count 1
popd

@@ -1,44 +0,0 @@
#!/bin/bash
set -eux -o pipefail
# This step runs on multiple executors with different envfile locations
if [[ "$(uname)" == Darwin ]]; then
envfile="/Users/distiller/project/env"
elif [[ -d "/home/circleci/project" ]]; then
# machine executor (binary tests)
envfile="/home/circleci/project/env"
else
# docker executor (binary builds)
envfile="/env"
fi
# TODO this is super hacky and ugly. Basically, the binary_update_html job does
# not have an env file, since it does not call binary_populate_env.sh, since it
# does not have a BUILD_ENVIRONMENT. So for this one case, which we detect by a
# lack of an env file, we manually export the environment variables that we
# need to install miniconda
if [[ ! -f "$envfile" ]]; then
MINICONDA_ROOT="/home/circleci/project/miniconda"
workdir="/home/circleci/project"
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
export -f retry
else
source "$envfile"
fi
conda_sh="$workdir/install_miniconda.sh"
if [[ "$(uname)" == Darwin ]]; then
curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh
else
curl --retry 3 --retry-all-errors -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
fi
chmod +x "$conda_sh"
"$conda_sh" -b -p "$MINICONDA_ROOT"
rm -f "$conda_sh"
# We can't actually add miniconda to the PATH in the envfile, because that
# breaks 'unbuffer' in Mac jobs. This is probably because conda comes with
# a tclsh, which then gets inserted before the tclsh needed in /usr/bin

@@ -4,10 +4,6 @@ set -eux -o pipefail
source "${BINARY_ENV_FILE:-/Users/distiller/project/env}"
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
if [[ -z "${GITHUB_ACTIONS:-}" ]]; then
export PATH="${workdir:-${HOME}}/miniconda/bin:${PATH}"
fi
# Build
export USE_PYTORCH_METAL_EXPORT=1
export USE_COREML_DELEGATE=1

@@ -3,17 +3,9 @@ set -eux -o pipefail
export TZ=UTC
tagged_version() {
# Grabs version from either the env variable CIRCLE_TAG
# or the pytorch git described version
if [[ "$OSTYPE" == "msys" && -z "${GITHUB_ACTIONS:-}" ]]; then
GIT_DIR="${workdir}/p/.git"
else
GIT_DIR="${workdir}/pytorch/.git"
fi
GIT_DIR="${workdir}/pytorch/.git"
GIT_DESCRIBE="git --git-dir ${GIT_DIR} describe --tags --match v[0-9]*.[0-9]*.[0-9]*"
if [[ -n "${CIRCLE_TAG:-}" ]]; then
echo "${CIRCLE_TAG}"
elif [[ ! -d "${GIT_DIR}" ]]; then
if [[ ! -d "${GIT_DIR}" ]]; then
echo "Abort, abort! Git dir ${GIT_DIR} does not exists!"
kill $$
elif ${GIT_DESCRIBE} --exact >/dev/null; then
@@ -58,8 +50,8 @@ fi
PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="2.2.0.dev$DATE"
BASE_BUILD_VERSION="$(cat ${PYTORCH_ROOT}/version.txt|cut -da -f1).dev${DATE}"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@@ -79,6 +71,35 @@ fi
export PYTORCH_BUILD_NUMBER=1
# Set triton version as part of PYTORCH_EXTRA_INSTALL_REQUIREMENTS
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Only linux Python < 3.12 are supported wheels for triton
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.12'"
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
TRITON_REQUIREMENT="pytorch-triton==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton rocm package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" && "$DESIRED_PYTHON" != "3.12" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
else
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
fi
JAVA_HOME=
BUILD_JNI=OFF
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
@@ -124,12 +145,13 @@ if [[ "${OSTYPE}" == "msys" ]]; then
else
export DESIRED_DEVTOOLSET="${DESIRED_DEVTOOLSET:-}"
fi
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}"
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.14.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}"
# TODO: We don't need this anymore IIUC
export TORCH_PACKAGE_NAME='torch'
@@ -162,28 +184,6 @@ if [[ "$(uname)" != Darwin ]]; then
EOL
fi
if [[ -z "${GITHUB_ACTIONS:-}" ]]; then
cat >>"$envfile" <<EOL
export workdir="$workdir"
export MAC_PACKAGE_WORK_DIR="$workdir"
if [[ "$OSTYPE" == "msys" ]]; then
export PYTORCH_ROOT="$workdir/p"
export BUILDER_ROOT="$workdir/b"
else
export PYTORCH_ROOT="$workdir/pytorch"
export BUILDER_ROOT="$workdir/builder"
fi
export MINICONDA_ROOT="$workdir/miniconda"
export PYTORCH_FINAL_PACKAGE_DIR="$workdir/final_pkgs"
export CIRCLE_TAG="${CIRCLE_TAG:-}"
export CIRCLE_SHA1="$CIRCLE_SHA1"
export CIRCLE_PR_NUMBER="${CIRCLE_PR_NUMBER:-}"
export CIRCLE_BRANCH="$CIRCLE_BRANCH"
export CIRCLE_WORKFLOW_ID="$CIRCLE_WORKFLOW_ID"
EOL
fi
echo 'retry () {' >> "$envfile"
echo ' $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)' >> "$envfile"
echo '}' >> "$envfile"

@@ -1,29 +0,0 @@
#!/bin/bash
# This section is used in the binary_test and smoke_test jobs. It expects
# 'binary_populate_env' to have populated /home/circleci/project/env and it
# expects another section to populate /home/circleci/project/ci_test_script.sh
# with the code to run in the docker
# Expect all needed environment variables to be written to this file
source /home/circleci/project/env
echo "Running the following code in Docker"
cat /home/circleci/project/ci_test_script.sh
echo
echo
set -eux -o pipefail
# Expect actual code to be written to this file
chmod +x /home/circleci/project/ci_test_script.sh
VOLUME_MOUNTS="-v /home/circleci/project/:/circleci_stuff -v /home/circleci/project/final_pkgs:/final_pkgs -v ${PYTORCH_ROOT}:/pytorch -v ${BUILDER_ROOT}:/builder"
# Run the docker
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
else
export id=$(docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined ${VOLUME_MOUNTS} -t -d "${DOCKER_IMAGE}")
fi
# Execute the test script that was populated by an earlier section
export COMMAND='((echo "source /circleci_stuff/env && /circleci_stuff/ci_test_script.sh") | docker exec -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts

@@ -1,111 +0,0 @@
#!/usr/bin/env bash
set -ex -o pipefail
# Remove unnecessary sources
sudo rm -f /etc/apt/sources.list.d/google-chrome.list
sudo rm -f /etc/apt/heroku.list
sudo rm -f /etc/apt/openjdk-r-ubuntu-ppa-xenial.list
sudo rm -f /etc/apt/partner.list
# To increase the network reliability, let apt decide which mirror is best to use
sudo sed -i -e 's/http:\/\/.*archive/mirror:\/\/mirrors/' -e 's/\/ubuntu\//\/mirrors.txt/' /etc/apt/sources.list
retry () {
$* || $* || $* || $* || $*
}
# Method adapted from here: https://askubuntu.com/questions/875213/apt-get-to-retry-downloading
# (with use of tee to avoid permissions problems)
# This is better than retrying the whole apt-get command
echo "APT::Acquire::Retries \"3\";" | sudo tee /etc/apt/apt.conf.d/80-retries
retry sudo apt-get update -qq
retry sudo apt-get -y install \
moreutils \
expect-dev
echo "== DOCKER VERSION =="
docker version
if ! command -v aws >/dev/null; then
retry sudo pip3 -q install awscli==1.19.64
fi
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
DRIVER_FN="NVIDIA-Linux-x86_64-515.76.run"
wget "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"
sudo /bin/bash "$DRIVER_FN" -s --no-drm || (sudo cat /var/log/nvidia-installer.log && false)
nvidia-smi
# Taken directly from https://github.com/NVIDIA/nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo "$ID$VERSION_ID")
curl -s -L --retry 3 --retry-all-errors https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L --retry 3 --retry-all-errors "https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list" | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
retry sudo apt-get update -qq
# Necessary to get the `--gpus` flag to function within docker
retry sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
else
# Explicitly remove nvidia docker apt repositories if not building for cuda
sudo rm -rf /etc/apt/sources.list.d/nvidia-docker.list
fi
add_to_env_file() {
local name=$1
local value=$2
case "$value" in
*\ *)
# BASH_ENV should be set by CircleCI
echo "${name}='${value}'" >> "${BASH_ENV:-/tmp/env}"
;;
*)
echo "${name}=${value}" >> "${BASH_ENV:-/tmp/env}"
;;
esac
}
add_to_env_file CI_MASTER "${CI_MASTER:-}"
add_to_env_file COMMIT_SOURCE "${CIRCLE_BRANCH:-}"
add_to_env_file BUILD_ENVIRONMENT "${BUILD_ENVIRONMENT}"
add_to_env_file CIRCLE_PULL_REQUEST "${CIRCLE_PULL_REQUEST}"
if [[ "${BUILD_ENVIRONMENT}" == *-build ]]; then
add_to_env_file SCCACHE_BUCKET ossci-compiler-cache-circleci-v2
SCCACHE_MAX_JOBS=$(( $(nproc) - 1 ))
MEMORY_LIMIT_MAX_JOBS=8 # the "large" resource class on CircleCI has 32 CPU cores; if we use all of them we'll OOM
MAX_JOBS=$(( ${SCCACHE_MAX_JOBS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${SCCACHE_MAX_JOBS} ))
add_to_env_file MAX_JOBS "${MAX_JOBS}"
if [ -n "${USE_CUDA_DOCKER_RUNTIME:-}" ]; then
add_to_env_file TORCH_CUDA_ARCH_LIST 5.2
fi
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
# This IAM user allows write access to S3 bucket for sccache & bazels3cache
set +x
add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_AND_XLA_BAZEL_S3_BUCKET_V2:-}"
set -x
else
# This IAM user allows write access to S3 bucket for sccache
set +x
add_to_env_file XLA_CLANG_CACHE_S3_BUCKET_NAME "${XLA_CLANG_CACHE_S3_BUCKET_NAME:-}"
add_to_env_file AWS_ACCESS_KEY_ID "${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
add_to_env_file AWS_SECRET_ACCESS_KEY "${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}"
set -x
fi
fi
# This IAM user only allows read-write access to ECR
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_ECR_READ_WRITE_V4:-}
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
export AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION|docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x

@@ -1,50 +0,0 @@
#!/usr/bin/env bash
set -eux -o pipefail
# Set up CircleCI GPG keys for apt, if needed
curl --retry 3 --retry-all-errors -s -L https://packagecloud.io/circleci/trusty/gpgkey | sudo apt-key add -
# Stop background apt updates. Hypothetically, the kill should not
# be necessary, because stop is supposed to send a kill signal to
# the process, but we've added it for good luck. Also
# hypothetically, it's supposed to be unnecessary to wait for
# the process to block. We also have that line for good luck.
# If you like, try deleting them and seeing if it works.
sudo systemctl stop apt-daily.service || true
sudo systemctl kill --kill-who=all apt-daily.service || true
sudo systemctl stop unattended-upgrades.service || true
sudo systemctl kill --kill-who=all unattended-upgrades.service || true
# wait until `apt-get update` has been killed
while systemctl is-active --quiet apt-daily.service
do
sleep 1;
done
while systemctl is-active --quiet unattended-upgrades.service
do
sleep 1;
done
# See if we actually were successful
systemctl list-units --all | cat
# For good luck, try even harder to kill apt-get
sudo pkill apt-get || true
# For even better luck, purge unattended-upgrades
sudo apt-get purge -y unattended-upgrades || true
cat /etc/apt/sources.list
# For the bestest luck, kill again now
sudo pkill apt || true
sudo pkill dpkg || true
# Try to detect if apt/dpkg is stuck
if ps auxfww | grep '[a]pt'; then
echo "WARNING: There are leftover apt processes; subsequent apt update will likely fail"
fi
if ps auxfww | grep '[d]pkg'; then
echo "WARNING: There are leftover dpkg processes; subsequent apt update will likely fail"
fi

@@ -1,65 +0,0 @@
binary_linux_build_params: &binary_linux_build_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
libtorch_variant:
type: string
default: ""
resource_class:
type: string
default: "2xlarge+"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
LIBTORCH_VARIANT: << parameters.libtorch_variant >>
ANACONDA_USER: pytorch
resource_class: << parameters.resource_class >>
docker:
- image: << parameters.docker_image >>
binary_linux_test_upload_params: &binary_linux_test_upload_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
libtorch_variant:
type: string
default: ""
resource_class:
type: string
default: "medium"
use_cuda_docker_runtime:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
LIBTORCH_VARIANT: << parameters.libtorch_variant >>
resource_class: << parameters.resource_class >>
binary_mac_params: &binary_mac_params
parameters:
build_environment:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
binary_windows_params: &binary_windows_params
parameters:
build_environment:
type: string
default: ""
executor:
type: string
default: "windows-xlarge-cpu-with-nvidia-cuda"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
JOB_EXECUTOR: <<parameters.executor>>

@@ -1,105 +0,0 @@
pytorch_params: &pytorch_params
parameters:
build_environment:
type: string
default: ""
docker_image:
type: string
default: ""
resource_class:
type: string
default: "large"
use_cuda_docker_runtime:
type: string
default: ""
build_only:
type: string
default: ""
ci_master:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
DOCKER_IMAGE: << parameters.docker_image >>
USE_CUDA_DOCKER_RUNTIME: << parameters.use_cuda_docker_runtime >>
BUILD_ONLY: << parameters.build_only >>
CI_MASTER: << pipeline.parameters.run_master_build >>
resource_class: << parameters.resource_class >>
pytorch_ios_params: &pytorch_ios_params
parameters:
build_environment:
type: string
default: ""
ios_arch:
type: string
default: ""
ios_platform:
type: string
default: ""
op_list:
type: string
default: ""
use_metal:
type: string
default: "0"
lite_interpreter:
type: string
default: "1"
use_coreml:
type: string
default: "0"
environment:
BUILD_ENVIRONMENT: << parameters.build_environment >>
IOS_ARCH: << parameters.ios_arch >>
IOS_PLATFORM: << parameters.ios_platform >>
SELECTED_OP_LIST: << parameters.op_list >>
USE_PYTORCH_METAL: << parameters.use_metal >>
BUILD_LITE_INTERPRETER: << parameters.lite_interpreter >>
USE_COREML_DELEGATE: << parameters.use_coreml >>
pytorch_windows_params: &pytorch_windows_params
parameters:
executor:
type: string
default: "windows-xlarge-cpu-with-nvidia-cuda"
build_environment:
type: string
default: ""
test_name:
type: string
default: ""
cuda_version:
type: string
default: "10.1"
python_version:
type: string
default: "3.8"
vs_version:
type: string
default: "16.8.6"
vc_version:
type: string
default: "14.16"
vc_year:
type: string
default: "2019"
vc_product:
type: string
default: "BuildTools"
use_cuda:
type: string
default: ""
environment:
BUILD_ENVIRONMENT: <<parameters.build_environment>>
SCCACHE_BUCKET: "ossci-compiler-cache"
CUDA_VERSION: <<parameters.cuda_version>>
PYTHON_VERSION: <<parameters.python_version>>
VS_VERSION: <<parameters.vs_version>>
VC_VERSION: <<parameters.vc_version>>
VC_YEAR: <<parameters.vc_year>>
VC_PRODUCT: <<parameters.vc_product>>
USE_CUDA: <<parameters.use_cuda>>
TORCH_CUDA_ARCH_LIST: "5.2 7.5"
JOB_BASE_NAME: <<parameters.test_name>>
JOB_EXECUTOR: <<parameters.executor>>

@@ -1,134 +0,0 @@
commands:
calculate_docker_image_tag:
description: "Calculates the docker image tag"
steps:
- run:
name: "Calculate docker image hash"
command: |
DOCKER_TAG=$(git rev-parse HEAD:.ci/docker)
echo "DOCKER_TAG=${DOCKER_TAG}" >> "${BASH_ENV}"
designate_upload_channel:
description: "inserts the correct upload channel into ${BASH_ENV}"
steps:
- run:
name: adding UPLOAD_CHANNEL to BASH_ENV
command: |
our_upload_channel=nightly
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_channel=test
fi
echo "export UPLOAD_CHANNEL=${our_upload_channel}" >> ${BASH_ENV}
# This system setup script is meant to run before the CI-related scripts, e.g.,
# installing Git client, checking out code, setting up CI env, and
# building/testing.
setup_linux_system_environment:
steps:
- run:
name: Set Up System Environment
no_output_timeout: "1h"
command: .circleci/scripts/setup_linux_system_environment.sh
setup_ci_environment:
steps:
- run:
name: Set Up CI Environment After attach_workspace
no_output_timeout: "1h"
command: .circleci/scripts/setup_ci_environment.sh
brew_update:
description: "Update Homebrew and install base formulae"
steps:
- run:
name: Update Homebrew
no_output_timeout: "10m"
command: |
set -ex
# Update repositories manually.
# Running `brew update` produces a comparison between the
# current checkout and the updated checkout, which takes a
# very long time because the existing checkout is 2y old.
for path in $(find /usr/local/Homebrew -type d -name .git)
do
cd $path/..
git fetch --depth=1 origin
git reset --hard origin/master
done
export HOMEBREW_NO_AUTO_UPDATE=1
# Install expect and moreutils so that we can call `unbuffer` and `ts`.
# moreutils installs a `parallel` executable by default, which conflicts
# with the executable from the GNU `parallel`, so we must unlink GNU
# `parallel` first, and relink it afterwards.
brew unlink parallel
brew install moreutils
brew link parallel --overwrite
brew install expect
brew_install:
description: "Install Homebrew formulae"
parameters:
formulae:
type: string
default: ""
steps:
- run:
name: Install << parameters.formulae >>
no_output_timeout: "10m"
command: |
set -ex
export HOMEBREW_NO_AUTO_UPDATE=1
brew install << parameters.formulae >>
run_brew_for_macos_build:
steps:
- brew_update
- brew_install:
formulae: libomp
run_brew_for_ios_build:
steps:
- brew_update
- brew_install:
formulae: libtool
optional_merge_target_branch:
steps:
- run:
name: (Optional) Merge target branch
no_output_timeout: "10m"
command: |
if [[ -n "$CIRCLE_PULL_REQUEST" && "$CIRCLE_BRANCH" != "nightly" ]]; then
PR_NUM=$(basename $CIRCLE_PULL_REQUEST)
CIRCLE_PR_BASE_BRANCH=$(curl -s https://api.github.com/repos/$CIRCLE_PROJECT_USERNAME/$CIRCLE_PROJECT_REPONAME/pulls/$PR_NUM | jq -r '.base.ref')
if [[ "${BUILD_ENVIRONMENT}" == *"xla"* || "${BUILD_ENVIRONMENT}" == *"gcc5"* ]] ; then
set -x
git config --global user.email "circleci.ossci@gmail.com"
git config --global user.name "CircleCI"
git config remote.origin.url https://github.com/pytorch/pytorch.git
git config --add remote.origin.fetch +refs/heads/master:refs/remotes/origin/master
git fetch --tags --progress https://github.com/pytorch/pytorch.git +refs/heads/master:refs/remotes/origin/master --depth=100 --quiet
# PRs generated from ghstack has format CIRCLE_PR_BASE_BRANCH=gh/xxx/1234/base
if [[ "${CIRCLE_PR_BASE_BRANCH}" == "gh/"* ]]; then
CIRCLE_PR_BASE_BRANCH=master
fi
export GIT_MERGE_TARGET=`git log -n 1 --pretty=format:"%H" origin/$CIRCLE_PR_BASE_BRANCH`
echo "GIT_MERGE_TARGET: " ${GIT_MERGE_TARGET}
export GIT_COMMIT=${CIRCLE_SHA1}
echo "GIT_COMMIT: " ${GIT_COMMIT}
git checkout -f ${GIT_COMMIT}
git reset --hard ${GIT_COMMIT}
git merge --allow-unrelated-histories --no-edit --no-ff ${GIT_MERGE_TARGET}
echo "Merged $CIRCLE_PR_BASE_BRANCH branch before building in environment $BUILD_ENVIRONMENT"
set +x
else
echo "No need to merge with $CIRCLE_PR_BASE_BRANCH, skipping..."
fi
else
echo "This is not a pull request, skipping..."
fi


@ -1,41 +0,0 @@
# WARNING: DO NOT EDIT THIS FILE DIRECTLY!!!
# See the README.md in this directory.
# IMPORTANT: To update Docker image version, please follow
# the instructions at
# https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI
version: 2.1
parameters:
run_binary_tests:
type: boolean
default: false
run_build:
type: boolean
default: true
run_master_build:
type: boolean
default: false
run_slow_gradcheck_build:
type: boolean
default: false
executors:
windows-with-nvidia-gpu:
machine:
resource_class: windows.gpu.nvidia.medium
image: windows-server-2019-nvidia:previous
shell: bash.exe
windows-xlarge-cpu-with-nvidia-cuda:
machine:
resource_class: windows.xlarge
image: windows-server-2019-vs2019:stable
shell: bash.exe
windows-medium-cpu-with-nvidia-cuda:
machine:
resource_class: windows.medium
image: windows-server-2019-vs2019:stable
shell: bash.exe


@ -1,14 +0,0 @@
# There is currently no testing for libtorch TODO
# binary_linux_libtorch_3.6m_cpu_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cpu"
# resource_class: gpu.nvidia.small
# <<: *binary_linux_test
#
# binary_linux_libtorch_3.6m_cu90_test:
# environment:
# BUILD_ENVIRONMENT: "libtorch 3.6m cu90"
# resource_class: gpu.nvidia.small
# <<: *binary_linux_test
#


@ -1,44 +0,0 @@
jobs:
binary_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace
- checkout
- run_brew_for_ios_build
- run:
name: Build
no_output_timeout: "1h"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_build.sh"
cat "$script"
source "$script"
- run:
name: Test
no_output_timeout: "30m"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_test.sh"
cat "$script"
source "$script"
- persist_to_workspace:
root: /Users/distiller/workspace/
paths: ios
binary_ios_upload:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- attach_workspace:
at: ~/workspace
- checkout
- run_brew_for_ios_build
- run:
name: Upload
no_output_timeout: "1h"
command: |
script="/Users/distiller/project/.circleci/scripts/binary_ios_upload.sh"
cat "$script"
source "$script"


@ -1,53 +0,0 @@
# update_s3_htmls job
# These jobs create html files for every cpu/cu## folder in s3. The html
# files just store the names of all the files in that folder (which are
# binary files (.whl files)). This is to allow pip installs of the latest
# version in a folder without having to know the latest date. Pip has a -f flag
# to which you can pass an html file listing a bunch of packages, and pip will
# then install the one with the most recent version.
update_s3_htmls: &update_s3_htmls
machine:
image: ubuntu-2004:202104-01
resource_class: medium
steps:
- checkout
- setup_linux_system_environment
- run:
<<: *binary_checkout
# N.B. we do not run binary_populate_env. The only variable we need is
# PIP_UPLOAD_FOLDER (which is 'nightly/' for the nightlies and '' for
# releases, and sometimes other things for special cases). Instead we
# expect PIP_UPLOAD_FOLDER to be passed directly in the env. This is
# because, unlike all the other binary jobs, these jobs only get run once,
# in a separate workflow. They are not a step in other binary jobs like
# build, test, upload.
#
# You could attach this to every job, or include it in the upload step if
# you wanted. You would need to add binary_populate_env in this case to
# make sure it has the same upload folder as the job it's attached to. This
# function is idempotent, so it won't hurt anything; it's just a little
# unnecessary
- run:
name: define PIP_UPLOAD_FOLDER
command: |
our_upload_folder=nightly/
# On tags upload to test instead
if [[ -n "${CIRCLE_TAG}" ]]; then
our_upload_folder=test/
fi
echo "export PIP_UPLOAD_FOLDER=${our_upload_folder}" >> ${BASH_ENV}
- run:
name: Update s3 htmls
no_output_timeout: "1h"
command: |
set +x
echo "declare -x \"AWS_ACCESS_KEY_ID=${PYTORCH_BINARY_AWS_ACCESS_KEY_ID}\"" >> /home/circleci/project/env
echo "declare -x \"AWS_SECRET_ACCESS_KEY=${PYTORCH_BINARY_AWS_SECRET_ACCESS_KEY}\"" >> /home/circleci/project/env
source /home/circleci/project/env
set -eux -o pipefail
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry pip install awscli==1.6
"/home/circleci/project/builder/cron/update_s3_htmls.sh"


@ -1,56 +0,0 @@
docker_build_job:
parameters:
image_name:
type: string
default: ""
machine:
image: ubuntu-2004:202104-01
resource_class: large
environment:
IMAGE_NAME: << parameters.image_name >>
# Enable 'docker manifest'
DOCKER_CLI_EXPERIMENTAL: "enabled"
DOCKER_BUILDKIT: 1
steps:
- checkout
- calculate_docker_image_tag
- run:
name: Check if image should be built
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
export AWS_REGION=us-east-1
aws ecr get-login-password --region $AWS_REGION|docker login --username AWS \
--password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
set -x
# Check if image already exists, if it does then skip building it
if docker manifest inspect "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${IMAGE_NAME}:${DOCKER_TAG}"; then
circleci-agent step halt
# circleci-agent step halt doesn't actually halt the step so we need to
# explicitly exit the step here ourselves before it causes too much trouble
exit 0
fi
# Covers the case where a previous tag doesn't exist for the tree
# this is only really applicable on trees that don't have `.ci/docker` at its merge base, i.e. nightly
if ! git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.ci/docker"; then
echo "Directory '.ci/docker' not found in tree << pipeline.git.base_revision >>, you should probably rebase onto a more recent commit"
exit 1
fi
PREVIOUS_DOCKER_TAG=$(git rev-parse "$(git merge-base HEAD << pipeline.git.base_revision >>):.ci/docker")
# If no image exists but the hash is the same as the previous hash then we should error out here
if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
echo " contact the PyTorch team to restore the original images"
exit 1
fi
- run:
name: build_docker_image_<< parameters.image_name >>
no_output_timeout: "1h"
command: |
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_DOCKER_BUILDER_V1}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_DOCKER_BUILDER_V1}
set -x
cd .ci/docker && ./build_docker.sh


@ -1,745 +0,0 @@
pytorch_doc_push:
resource_class: medium
machine:
image: ubuntu-2004:202104-01
parameters:
branch:
type: string
default: "main"
steps:
- attach_workspace:
at: /tmp/workspace
- run:
name: Generate netrc
command: |
# set credentials for https pushing
cat > ~/.netrc \<<DONE
machine github.com
login pytorchbot
password ${GITHUB_PYTORCHBOT_TOKEN}
DONE
- run:
name: Docs push
command: |
pushd /tmp/workspace
git push -u origin "<< parameters.branch >>"
pytorch_macos_10_15_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.15-py3-arm64-build
macos:
xcode: "12.3.0"
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export CROSS_COMPILE_ARM64=1
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
chmod a+x .ci/pytorch/macos-build.sh
unbuffer .ci/pytorch/macos-build.sh 2>&1 | ts
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
- store_artifacts:
path: /Users/distiller/project/dist
pytorch_macos_10_13_py3_build:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-build
macos:
xcode: "12.0"
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
export JOB_BASE_NAME=$CIRCLE_JOB
# Install sccache
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2
# This IAM user allows write access to S3 bucket for sccache
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}
set -x
chmod a+x .ci/pytorch/macos-build.sh
unbuffer .ci/pytorch/macos-build.sh 2>&1 | ts
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
mac_build:
parameters:
build-environment:
type: string
description: Top-level label for what's being built/tested.
xcode-version:
type: string
default: "13.3.1"
description: What xcode version to build with.
build-generates-artifacts:
type: boolean
default: true
description: if the build generates build artifacts
python-version:
type: string
default: "3.8"
macos:
xcode: << parameters.xcode-version >>
resource_class: medium
environment:
BUILD_ENVIRONMENT: << parameters.build-environment >>
AWS_REGION: us-east-1
steps:
- checkout
- run_brew_for_macos_build
- run:
name: Install sccache
command: |
sudo curl --retry 3 https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
echo "export SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${BASH_ENV}"
echo "export SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${BASH_ENV}"
set +x
echo "export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}"
echo "export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4}" >> "${BASH_ENV}"
set -x
- run:
name: Get workflow job id
command: |
echo "export OUR_GITHUB_JOB_ID=${CIRCLE_WORKFLOW_JOB_ID}" >> "${BASH_ENV}"
- run:
name: Build
command: |
set -x
git submodule sync
git submodule update --init --recursive --depth 1 --jobs 0
export PATH="/usr/local/bin:$PATH"
export WORKSPACE_DIR="${HOME}/workspace"
mkdir -p "${WORKSPACE_DIR}"
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py38_4.12.0-MacOSX-x86_64.sh"
if [ << parameters.python-version >> == 3.9.12 ]; then
MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh"
fi
# If a local installation of conda doesn't exist, we download and install conda
if [ ! -d "${WORKSPACE_DIR}/miniconda3" ]; then
mkdir -p "${WORKSPACE_DIR}"
curl --retry 3 ${MINICONDA_URL} -o "${WORKSPACE_DIR}"/miniconda3.sh
bash "${WORKSPACE_DIR}"/miniconda3.sh -b -p "${WORKSPACE_DIR}"/miniconda3
fi
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
# shellcheck disable=SC1091
source "${WORKSPACE_DIR}"/miniconda3/bin/activate
brew link --force libomp
echo "export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${BASH_ENV}"
.ci/pytorch/macos-build.sh
- when:
condition: << parameters.build-generates-artifacts >>
steps:
- run:
name: Archive artifacts into zip
command: |
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .additional_ci_files
cp artifacts.zip /Users/distiller/workspace
- persist_to_workspace:
root: /Users/distiller/workspace/
paths:
- miniconda3
- artifacts.zip
- store_artifacts:
path: /Users/distiller/project/artifacts.zip
mac_test:
parameters:
build-environment:
type: string
shard-number:
type: string
num-test-shards:
type: string
xcode-version:
type: string
test-config:
type: string
default: 'default'
macos:
xcode: << parameters.xcode-version >>
environment:
GIT_DEFAULT_BRANCH: 'master'
BUILD_ENVIRONMENT: << parameters.build-environment >>
TEST_CONFIG: << parameters.test-config >>
SHARD_NUMBER: << parameters.shard-number >>
NUM_TEST_SHARDS: << parameters.num-test-shards >>
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "2h"
command: |
set -x
git submodule sync --recursive
git submodule update --init --recursive
mv ~/workspace/artifacts.zip .
unzip artifacts.zip
export IN_CI=1
COMMIT_MESSAGES=$(git cherry -v "origin/${GIT_DEFAULT_BRANCH:-master}")
export PATH="/usr/local/bin:$PATH"
export WORKSPACE_DIR="${HOME}/workspace"
mkdir -p "${WORKSPACE_DIR}"
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
source "${WORKSPACE_DIR}"/miniconda3/bin/activate
# sanitize the input commit message and PR body here:
# trim all new lines from commit messages to avoid issues with batch environment
# variable copying. see https://github.com/pytorch/pytorch/pull/80043#issuecomment-1167796028
COMMIT_MESSAGES="${COMMIT_MESSAGES//[$'\n\r']}"
# then trim all special characters like single and double quotes to avoid unescaped inputs to
# wreak havoc internally
export COMMIT_MESSAGES="${COMMIT_MESSAGES//[\'\"]}"
python3 -mpip install dist/*.whl
.ci/pytorch/macos-test.sh
- run:
name: Copy files for uploading test stats
command: |
# copy into a parent folder test-reports because we can't use CIRCLE_BUILD_NUM in path when persisting to workspace
mkdir -p test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports
cp -r test/test-reports test-reports/test-reports_${CIRCLE_BUILD_NUM}/test/test-reports
- store_test_results:
path: test/test-reports
- persist_to_workspace:
root: /Users/distiller/project/
paths:
- test-reports
upload_test_stats:
machine: # executor type
image: ubuntu-2004:202010-01 # # recommended linux image - includes Ubuntu 20.04, docker 19.03.13, docker-compose 1.27.4
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run:
name: upload
command: |
set -ex
if [ -z ${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD} ]; then
echo "No credentials found, cannot upload test stats (are you on a fork?)"
exit 0
fi
cp -r ~/workspace/test-reports/* ~/project
pip3 install requests==2.26 rockset==1.0.3 boto3==1.19.12
export AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_FOR_OSSCI_ARTIFACT_UPLOAD}
export AWS_SECRET_ACCESS_KEY=${AWS_SECRET_KEY_FOR_OSSCI_ARTIFACT_UPLOAD}
# I don't know how to get the run attempt number for reruns, so default to 1
python3 -m tools.stats.upload_test_stats --workflow-run-id "${CIRCLE_WORKFLOW_JOB_ID}" --workflow-run-attempt 1 --head-branch << pipeline.git.branch >> --circleci
pytorch_macos_10_13_py3_test:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "12.0"
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x .ci/pytorch/macos-test.sh
unbuffer .ci/pytorch/macos-test.sh 2>&1 | ts
- store_test_results:
path: test/test-reports
pytorch_macos_10_13_py3_lite_interpreter_build_test:
environment:
BUILD_ENVIRONMENT: pytorch-macos-10.13-py3-test
macos:
xcode: "12.0"
steps:
- checkout
- attach_workspace:
at: ~/workspace
- run_brew_for_macos_build
- run:
name: Test
no_output_timeout: "1h"
command: |
set -e
export BUILD_LITE_INTERPRETER=1
export JOB_BASE_NAME=$CIRCLE_JOB
chmod a+x ${HOME}/project/.ci/pytorch/macos-lite-interpreter-build-test.sh
unbuffer ${HOME}/project/.ci/pytorch/macos-lite-interpreter-build-test.sh 2>&1 | ts
- store_test_results:
path: test/test-reports
pytorch_android_gradle_build:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: pytorch android gradle build
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32=${docker_image_commit}-android-x86_32
docker_image_libtorch_android_x86_64=${docker_image_commit}-android-x86_64
docker_image_libtorch_android_arm_v7a=${docker_image_commit}-android-arm-v7a
docker_image_libtorch_android_arm_v8a=${docker_image_commit}-android-arm-v8a
echo "docker_image_commit: "${docker_image_commit}
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
echo "docker_image_libtorch_android_x86_64: "${docker_image_libtorch_android_x86_64}
echo "docker_image_libtorch_android_arm_v7a: "${docker_image_libtorch_android_arm_v7a}
echo "docker_image_libtorch_android_arm_v8a: "${docker_image_libtorch_android_arm_v8a}
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id_x86_32=$(docker run --env-file "${BASH_ENV}" -e GRADLE_OFFLINE=1 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# arm-v7a
time docker pull ${docker_image_libtorch_android_arm_v7a} >/dev/null
export id_arm_v7a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v7a})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v7a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v7a
docker cp $id_arm_v7a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v7a
# x86_64
time docker pull ${docker_image_libtorch_android_x86_64} >/dev/null
export id_x86_64=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_64})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_64" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_x86_64
docker cp $id_x86_64:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_x86_64
# arm-v8a
time docker pull ${docker_image_libtorch_android_arm_v8a} >/dev/null
export id_arm_v8a=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_arm_v8a})
export COMMAND='((echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_arm_v8a" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_install_arm_v8a
docker cp $id_arm_v8a:/var/lib/jenkins/workspace/build_android/install ~/workspace/build_android_install_arm_v8a
docker cp ~/workspace/build_android_install_arm_v7a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v7a
docker cp ~/workspace/build_android_install_x86_64 $id_x86_32:/var/lib/jenkins/workspace/build_android_install_x86_64
docker cp ~/workspace/build_android_install_arm_v8a $id_x86_32:/var/lib/jenkins/workspace/build_android_install_arm_v8a
# run gradle buildRelease
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_artifacts
docker cp $id_x86_32:/var/lib/jenkins/workspace/android/artifacts.tgz ~/workspace/build_android_artifacts/
output_image=$docker_image_libtorch_android_x86_32-gradle
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
- store_artifacts:
path: ~/workspace/build_android_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_android_publish_snapshot:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-publish-snapshot
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: pytorch android gradle build
no_output_timeout: "1h"
command: |
set -eux
docker_image_commit=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}
docker_image_libtorch_android_x86_32_gradle=${docker_image_commit}-android-x86_32-gradle
echo "docker_image_commit: "${docker_image_commit}
echo "docker_image_libtorch_android_x86_32_gradle: "${docker_image_libtorch_android_x86_32_gradle}
# x86_32
time docker pull ${docker_image_libtorch_android_x86_32_gradle} >/dev/null
export id_x86_32=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32_gradle})
export COMMAND='((echo "sudo chown -R jenkins workspace" && echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export SONATYPE_NEXUS_USERNAME=${SONATYPE_NEXUS_USERNAME}" && echo "export SONATYPE_NEXUS_PASSWORD=${SONATYPE_NEXUS_PASSWORD}" && echo "export ANDROID_SIGN_KEY=${ANDROID_SIGN_KEY}" && echo "export ANDROID_SIGN_PASS=${ANDROID_SIGN_PASS}" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/publish_android_snapshot.sh") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
output_image=${docker_image_libtorch_android_x86_32_gradle}-publish-snapshot
docker commit "$id_x86_32" ${output_image}
time docker push ${output_image}
pytorch_android_gradle_build-x86_32:
environment:
BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build-only-x86_32
DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c"
PYTHON_VERSION: "3.7"
resource_class: large
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- checkout
- setup_ci_environment
- run:
name: pytorch android gradle build only x86_32 (for PR)
no_output_timeout: "1h"
command: |
set -e
docker_image_libtorch_android_x86_32=${DOCKER_IMAGE}:build-${DOCKER_TAG}-${CIRCLE_SHA1}-android-x86_32
echo "docker_image_libtorch_android_x86_32: "${docker_image_libtorch_android_x86_32}
# x86
time docker pull ${docker_image_libtorch_android_x86_32} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${docker_image_libtorch_android_x86_32})
export COMMAND='((echo "export BUILD_ENVIRONMENT=${BUILD_ENVIRONMENT}" && echo "export GRADLE_OFFLINE=1" && echo "sudo chown -R jenkins workspace && cd workspace && ./.circleci/scripts/build_android_gradle.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
mkdir -p ~/workspace/build_android_x86_32_artifacts
docker cp $id:/var/lib/jenkins/workspace/android/artifacts.tgz ~/workspace/build_android_x86_32_artifacts/
output_image=${docker_image_libtorch_android_x86_32}-gradle
docker commit "$id" ${output_image}
time docker push ${output_image}
- store_artifacts:
path: ~/workspace/build_android_x86_32_artifacts/artifacts.tgz
destination: artifacts.tgz
pytorch_ios_build:
<<: *pytorch_ios_params
macos:
xcode: "12.5.1"
steps:
- run:
name: checkout with retry
command: |
checkout() {
set -ex
# Workaround old docker images with incorrect $HOME
# check https://github.com/docker/docker/issues/2968 for details
if [ "${HOME}" = "/" ]
then
export HOME=$(getent passwd $(id -un) | cut -d: -f6)
fi
mkdir -p ~/.ssh
echo 'github.com ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAq2A7hRGmdnm9tUDbO9IDSwBK6TbQa+PXYPCPy6rbTrTtw7PHkccKrpp0yVhp5HdEIcKr6pLlVDBfOLX9QUsyCOV0wzfjIJNlGEYsdlLJizHhbn2mUjvSAHQqZETYP81eFzLQNnPHt4EVVUh7VfDESU84KezmD5QlWpXLmvU31/yMf+Se8xhHTvKSCZIFImWwoG6mbUoWf9nzpIoaSjB+weqqUUmpaaasXVal72J+UX2B+2RPW3RcT0eOzQgqlJL3RKrTJvdsjE3JEAvGq3lGHSZXy28G3skua2SmVi/w4yCE6gbODqnTWlg7+wC604ydGXA8VJiS5ap43JXiUFFAaQ==
' >> ~/.ssh/known_hosts
# use git+ssh instead of https
git config --global url."ssh://git@github.com".insteadOf "https://github.com" || true
git config --global gc.auto 0 || true
echo 'Cloning git repository'
mkdir -p '/Users/distiller/project'
cd '/Users/distiller/project'
git clone "$CIRCLE_REPOSITORY_URL" .
echo 'Checking out branch'
git checkout --force -B "$CIRCLE_BRANCH" "$CIRCLE_SHA1"
git --no-pager log --no-color -n 1 --format='HEAD is now at %h %s'
}
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry checkout
- run_brew_for_ios_build
- run:
name: Setup Fastlane
no_output_timeout: "1h"
command: |
set -e
PROJ_ROOT=/Users/distiller/project
cd ${PROJ_ROOT}/ios/TestApp
# install fastlane
sudo gem install bundler && bundle install
- run:
name: Build
no_output_timeout: "1h"
command: |
set -e
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
export TCLLIBPATH="/usr/local/lib"
# Install conda
curl --retry 3 -o ~/conda.sh https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-MacOSX-x86_64.sh
chmod +x ~/conda.sh
/bin/bash ~/conda.sh -b -p ~/anaconda
export PATH="~/anaconda/bin:${PATH}"
source ~/anaconda/bin/activate
# Install dependencies
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
retry conda install numpy ninja pyyaml mkl mkl-include setuptools cmake requests typing-extensions --yes
# sync submodules
cd ${PROJ_ROOT}
git submodule sync
git submodule update --init --recursive --depth 1 --jobs 0
# export
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
# run build script
chmod a+x ${PROJ_ROOT}/scripts/build_ios.sh
echo "IOS_ARCH: ${IOS_ARCH}"
echo "IOS_PLATFORM: ${IOS_PLATFORM}"
echo "USE_PYTORCH_METAL": "${USE_METAL}"
echo "BUILD_LITE_INTERPRETER": "${BUILD_LITE_INTERPRETER}"
echo "USE_COREML_DELEGATE": "${USE_COREML_DELEGATE}"
#check the custom build flag
echo "SELECTED_OP_LIST: ${SELECTED_OP_LIST}"
if [ -n "${SELECTED_OP_LIST}" ]; then
export SELECTED_OP_LIST="${PROJ_ROOT}/ios/TestApp/custom_build/${SELECTED_OP_LIST}"
fi
export IOS_ARCH=${IOS_ARCH}
export IOS_PLATFORM=${IOS_PLATFORM}
export USE_COREML_DELEGATE=${USE_COREML_DELEGATE}
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
export USE_PYTORCH_METAL=${USE_METAL}
fi
unbuffer ${PROJ_ROOT}/scripts/build_ios.sh 2>&1 | ts
- run:
name: Run Build Test
no_output_timeout: "30m"
command: |
set -e
PROJ_ROOT=/Users/distiller/project
# run the ruby build script
if ! [ -x "$(command -v xcodebuild)" ]; then
echo 'Error: xcodebuild is not installed.'
exit 1
fi
ruby ${PROJ_ROOT}/scripts/xcode_build.rb -i ${PROJ_ROOT}/build_ios/install -x ${PROJ_ROOT}/ios/TestApp/TestApp.xcodeproj -p ${IOS_PLATFORM}
if ! [ "$?" -eq "0" ]; then
echo 'xcodebuild failed!'
exit 1
fi
- run:
name: Run Simulator Tests
no_output_timeout: "2h"
command: |
set -e
if [ ${IOS_PLATFORM} != "SIMULATOR" ]; then
echo "not SIMULATOR build, skip it."
exit 0
fi
WORKSPACE=/Users/distiller/workspace
PROJ_ROOT=/Users/distiller/project
source ~/anaconda/bin/activate
# use the pytorch nightly build to generate models
pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
# generate models for different backends
cd ${PROJ_ROOT}/ios/TestApp/benchmark
mkdir -p ../models
if [ ${USE_COREML_DELEGATE} == 1 ]; then
pip install coremltools==5.0b5 protobuf==3.20.1
python coreml_backend.py
else
cd "${PROJ_ROOT}"
python test/mobile/model_test/gen_test_model.py ios-test
fi
cd "${PROJ_ROOT}/ios/TestApp/benchmark"
if [ ${BUILD_LITE_INTERPRETER} == 1 ]; then
echo "Setting up the TestApp for LiteInterpreter"
ruby setup.rb --lite 1
else
echo "Setting up the TestApp for Full JIT"
ruby setup.rb
fi
cd "${PROJ_ROOT}/ios/TestApp"
# instruments -s -devices
if [ "${BUILD_LITE_INTERPRETER}" == 1 ]; then
if [ "${USE_COREML_DELEGATE}" == 1 ]; then
fastlane scan --only_testing TestAppTests/TestAppTests/testCoreML
else
fastlane scan --only_testing TestAppTests/TestAppTests/testLiteInterpreter
fi
else
fastlane scan --only_testing TestAppTests/TestAppTests/testFullJIT
fi
pytorch_linux_bazel_build:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Bazel Build
no_output_timeout: "1h"
command: |
set -e
# Pull Docker image and run build
echo "DOCKER_IMAGE: "${DOCKER_IMAGE}:${DOCKER_TAG}
time docker pull ${DOCKER_IMAGE}:${DOCKER_TAG} >/dev/null
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${DOCKER_IMAGE}:${DOCKER_TAG})
echo "Do NOT merge main branch into $CIRCLE_BRANCH in environment $BUILD_ENVIRONMENT"
git submodule sync && git submodule update -q --init --recursive --depth 1 --jobs 0
docker cp /home/circleci/project/. $id:/var/lib/jenkins/workspace
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/build.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
# Push intermediate Docker image for next phase to use
if [ -z "${BUILD_ONLY}" ]; then
# Augment our output image name with bazel to avoid collisions
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
pytorch_linux_bazel_test:
<<: *pytorch_params
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- calculate_docker_image_tag
- setup_linux_system_environment
- setup_ci_environment
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
output_image=${DOCKER_IMAGE}:build-${DOCKER_TAG}-bazel-${CIRCLE_SHA1}
export COMMIT_DOCKER_IMAGE=$output_image
echo "DOCKER_IMAGE: "${COMMIT_DOCKER_IMAGE}
time docker pull ${COMMIT_DOCKER_IMAGE} >/dev/null
if [ -n "${USE_CUDA_DOCKER_RUNTIME}" ]; then
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus all -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
else
export id=$(docker run --env-file "${BASH_ENV}" --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins ${COMMIT_DOCKER_IMAGE})
fi
retrieve_test_reports() {
echo "retrieving test reports"
docker cp -L $id:/var/lib/jenkins/workspace/bazel-testlogs ./ || echo 'No test reports found!'
}
trap "retrieve_test_reports" ERR
if [[ ${BUILD_ENVIRONMENT} == *"multigpu"* ]]; then
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/multigpu-test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
else
export COMMAND='((echo "sudo chown -R jenkins workspace && cd workspace && .ci/pytorch/test.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
fi
echo ${COMMAND} > ./command.sh && unbuffer bash ./command.sh | ts
retrieve_test_reports
docker stats --all --no-stream
- store_test_results:
path: bazel-testlogs
pytorch_windows_test_multigpu:
machine:
image: ubuntu-2004:202104-01
steps:
- checkout
- run:
name: Test
no_output_timeout: "90m"
command: |
set -e
python3 -m pip install requests
python3 ./.circleci/scripts/trigger_azure_pipeline.py


@ -1,18 +0,0 @@
promote_s3:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/wheel_to_s3.sh
promote_conda:
<<: *promote_common
steps:
- checkout
- run:
name: Running promote script
command: |
scripts/release/promote/conda_to_conda.sh


@ -1,29 +0,0 @@
setup:
docker:
- image: circleci/python:3.7.3
steps:
- checkout
- run:
name: Save commit message
command: git log --format='%B' -n 1 HEAD > .circleci/scripts/COMMIT_MSG
# Note [Workspace for CircleCI scripts]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# In the beginning, you wrote your CI scripts in a
# .circleci/config.yml file, and life was good. Your CI
# configurations flourished and multiplied.
#
# Then one day, CircleCI cometh down high and say, "Your YAML file
# is too biggeth, it stresses our servers so." And thus they
# asketh us to smite the scripts in the yml file.
#
# But you can't just put the scripts in the .circleci folder,
# because in some jobs, you don't ever actually checkout the
# source repository. Where you gonna get the scripts from?
#
# Here's how you do it: you persist .circleci/scripts into a
# workspace, attach the workspace in your subjobs, and run all
# your scripts from there.
- persist_to_workspace:
root: .
paths: .circleci/scripts
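A minimal sketch of the other half of the pattern the note describes, assuming a hypothetical downstream job name; the attach_workspace step mirrors the ones used by other jobs in this config, and binary_checkout.sh stands in for any script persisted by the setup job.
example_subjob:  # hypothetical job, for illustration only
  steps:
    - attach_workspace:
        at: ~/workspace
    - run:
        name: Run a script persisted by the setup job
        command: ~/workspace/.circleci/scripts/binary_checkout.sh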


@ -1,51 +0,0 @@
##############################################################################
# Binary build (nightlies nightly build) defaults
# The binary builds use the docker executor b/c at time of writing the machine
# executor is limited to only two cores and is painfully slow (4.5+ hours per
# GPU build). But the docker executor cannot be run with --runtime=nvidia, and
# so the binary test/upload jobs must run on a machine executor. The package
# built in the build job is persisted to the workspace, which the test jobs
# expect. The test jobs just run a few quick smoke tests (very similar to the
# second-round-user-facing smoke tests above) and then upload the binaries to
# their final locations. The upload part requires credentials that should only
# be available to org-members.
#
# binary_checkout MUST be run before other commands here. This is because the
# other commands are written in .circleci/scripts/*.sh , so the pytorch source
# code must be downloaded on the machine before they can be run. We cannot
# inline all the code into this file, since that would cause the yaml size to
# explode past 4 MB (all the code in the command section is just copy-pasted
# everywhere in the .circleci/config.yml file where it appears).
##############################################################################
# Checks out the Pytorch and Builder repos (always both of them), and places
# them in the right place depending on what executor we're running on. We curl
# our .sh file from the interweb to avoid yaml size bloat. Note that many jobs
# do not need both the pytorch and builder repos, so this is a little wasteful
# (smoke tests and upload jobs do not need the pytorch repo).
binary_checkout: &binary_checkout
name: Checkout pytorch/builder repo
no_output_timeout: "30m"
command: .circleci/scripts/binary_checkout.sh
# Parses circleci arguments in a consistent way, essentially routing to the
# correct pythonXgccXcudaXos build we want
binary_populate_env: &binary_populate_env
name: Set up binary env variables
command: .circleci/scripts/binary_populate_env.sh
binary_install_miniconda: &binary_install_miniconda
name: Install miniconda
no_output_timeout: "1h"
command: .circleci/scripts/binary_install_miniconda.sh
# This section is used in the binary_test and smoke_test jobs. It expects
# 'binary_populate_env' to have populated /home/circleci/project/env and it
# expects another section to populate /home/circleci/project/ci_test_script.sh
# with the code to run in the docker
binary_run_in_docker: &binary_run_in_docker
name: Run in docker
# This step only runs on circleci linux machine executors that themselves
# need to start docker images
command: .circleci/scripts/binary_run_in_docker.sh
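As a hedged illustration of how these anchors compose into a job (the job name is hypothetical; the `<<: *binary_checkout` merge pattern matches the update_s3_htmls job shown earlier in this section):
binary_test_example:  # hypothetical job, for illustration only
  steps:
    - run:
        <<: *binary_checkout
    - run:
        <<: *binary_populate_env
    # a real binary test job also populates /home/circleci/project/ci_test_script.sh
    # before this step, as the comment above notes
    - run:
        <<: *binary_run_in_docker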


@ -1,8 +0,0 @@
#- binary_linux_libtorch_3.6m_cpu_test:
# requires:
# - binary_linux_libtorch_3.6m_cpu_build
#- binary_linux_libtorch_3.6m_cu90_test:
# requires:
# - binary_linux_libtorch_3.6m_cu90_build
# Nightly uploads


@ -42,7 +42,6 @@ misc-*,
-misc-non-private-member-variables-in-classes,
-misc-confusable-identifiers,
modernize-*,
-modernize-concat-nested-namespaces,
-modernize-macro-to-enum,
-modernize-return-braced-init-list,
-modernize-use-auto,
@ -52,6 +51,13 @@ modernize-*,
-modernize-use-nodiscard,
performance-*,
readability-container-size-empty,
readability-delete-null-pointer,
readability-duplicate-include,
readability-misplaced-array-index,
readability-redundant-function-ptr-dereference,
readability-redundant-smartptr-get,
readability-simplify-subscript-expr,
readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
AnalyzeTemporaryDtors: false


@ -30,5 +30,5 @@ RUN if [ -n "$CLANG_VERSION" ]; then \
# Install cuda if version is specified
ARG CUDA_VERSION
RUN if [ -n "$CUDA_VERSION" ]; then \
conda install cuda -c "nvidia/label/cuda-${CUDA_VERSION}"; \
conda install -y cuda -c "nvidia/label/cuda-${CUDA_VERSION}"; \
fi


@ -46,7 +46,7 @@ If you are using [Visual Studio Code Remote - SSH](https://code.visualstudio.com
## Step 6: Open in DevContainer
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Remote-Containers: Open Folder in Container..." command.
1. In VSCode, use the Command Palette (`Ctrl+Shift+P` or `Cmd+Shift+P` on macOS) to run the "Dev Containers: Open Folder in Container..." command.
2. You will be prompted with two options: CPU dev container or CUDA dev container. Choose the one you want to run.
## Step 7: Wait for Building the Environment

.flake8

@ -2,14 +2,12 @@
# NOTE: **Mirror any changes** to this file to the [tool.ruff] config in pyproject.toml
# before we can fully move to use ruff
enable-extensions = G
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2,TOR9
max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
E203,E305,E402,E501,E721,E741,F405,F821,F841,F999,W503,W504,C408,E302,W291,E303,
# fix these lints in the future
E275,
E203,E305,E402,E501,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
@ -29,11 +27,33 @@ ignore =
# TODO(kit1980): fix all TOR102 issues
# `torch.load` without `weights_only` parameter is unsafe
TOR102,
# TODO(kit1980): resolve all TOR003 issues
# pass `use_reentrant` explicitly to `checkpoint`.
TOR003
per-file-ignores =
__init__.py: F401
test/**: F821
test/**/__init__.py: F401,F821
torch/utils/cpp_extension.py: B950
torchgen/api/types/__init__.py: F401,F403
torchgen/executorch/api/types/__init__.py: F401,F403
test/dynamo/test_higher_order_ops.py: B950
torch/testing/_internal/dynamo_test_failures.py: B950
# TOR901 is only for test, we want to ignore it for everything else.
# It's not easy to configure this without affecting other per-file-ignores,
# so we explicitly list every file where it's violated outside of test.
torch/__init__.py: F401,TOR901
torch/_custom_op/impl.py: TOR901
torch/_export/serde/upgrade.py: TOR901
torch/_functorch/vmap.py: TOR901
torch/_inductor/test_operators.py: TOR901
torch/_library/abstract_impl.py: TOR901
torch/_meta_registrations.py: TOR901
torch/_prims/__init__.py: F401,TOR901
torch/_prims/rng_prims.py: TOR901
torch/ao/quantization/fx/_decomposed.py: TOR901
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
optional-ascii-coding = True
exclude =
./.git,


@ -38,3 +38,5 @@ f70844bec783bfce43c950ccf180dc494e86f2bf
e6ec0efaf87703c5f889cfc20b29be455885d58d
# 2023-07-31 [optim][BE] split test file into logical parts: SWA, LR, optim
a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
# 2024-01-02 clangformat: fused adam #116583
9dc68d1aa9e554d09344a10fff69f7b50b2d23a0


@ -8,7 +8,7 @@ body:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the
existing and past issues](https://github.com/pytorch/pytorch/issues)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/master/dynamo/index.html)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/main/dynamo/index.html)
- type: textarea
attributes:
label: 🐛 Describe the bug
@ -33,7 +33,7 @@ body:
label: Minified repro
description: |
Please run the minifier on your example and paste the minified code below
Learn more here https://pytorch.org/docs/master/compile/troubleshooting.html
Learn more here https://pytorch.org/docs/main/torch.compiler_troubleshooting.html
placeholder: |
env TORCHDYNAMO_REPRO_AFTER="aot" python your_model.py
or


@ -19,7 +19,7 @@ self-hosted-runner:
- windows.g5.4xlarge.nvidia.gpu
- bm-runner
- linux.rocm.gpu
- macos-m1-12
- macos-m1-stable
- macos-m1-13
- macos-12-xl
- macos-12


@ -9,6 +9,10 @@ inputs:
use-gha:
description: If set to any value, use GHA to download the artifact. Otherwise use s3.
required: false
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -18,6 +22,7 @@ runs:
uses: seemethere/download-artifact-s3@v4
with:
name: ${{ inputs.name }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Download PyTorch Build Artifacts from GHA
if: inputs.use-gha
@ -29,6 +34,10 @@ runs:
shell: bash
run: unzip -o artifacts.zip
- name: Remove artifacts.zip
shell: bash
run: rm artifacts.zip
- name: Output disk space left
shell: bash
run: df -H


@ -0,0 +1,29 @@
name: Download TD Artifacts
description: Download artifacts from target_determination.yml
inputs:
use-gha:
description: If set to any value, use GHA to download the artifact. Otherwise use s3.
required: false
runs:
using: composite
steps:
- name: Download TD Artifacts from S3
if: ${{ !inputs.use-gha }}
uses: seemethere/download-artifact-s3@v4
with:
name: td_results
- name: Download TD Artifacts from GHA
if: inputs.use-gha
uses: actions/download-artifact@v3
with:
name: td_results.json
- name: Move artifacts to .additional_ci_files folder
shell: bash
run: |
mkdir -p .additional_ci_files
mv td_results.json .additional_ci_files/td_results.json


@ -26,11 +26,20 @@ outputs:
description: True if the filtered test configs matrix is empty. False otherwise.
value: ${{ steps.filter.outputs.is-test-matrix-empty }}
keep-going:
description: True if keep-going label was on PR.
description: True if keep-going label was on PR or [keep-going] in PR body.
value: ${{ steps.filter.outputs.keep-going }}
reenabled-issues:
description: Comma separated list of issue numbers that should correspond to disable test issues that the PR fixes
value: ${{ steps.filter.outputs.reenabled-issues }}
ci-verbose-test-logs:
description: True if ci-verbose-test-logs label was on PR or [ci-verbose-test-logs] in PR body.
value: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-no-test-timeout:
description: True if ci-no-test-timeout label was on PR or [ci-no-test-timeout] in PR body.
value: ${{ steps.filter.outputs.ci-no-test-timeout }}
ci-no-td:
description: True if ci-no-td label was on PR or [ci-no-td] in PR body.
value: ${{ steps.filter.outputs.ci-no-td }}
runs:
using: composite
@ -46,7 +55,8 @@ runs:
retry_wait_seconds: 30
command: |
set -eux
python3 -m pip install requests==2.26.0 pyyaml==6.0
# PyYAML 6.0 doesn't work with MacOS x86 anymore
python3 -m pip install requests==2.26.0 pyyaml==6.0.1
- name: Parse ref
id: parse-ref

.github/actions/linux-build/action.yml

@ -0,0 +1,207 @@
name: linux-build
inputs:
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
default: "true"
description: If set, upload generated build artifacts.
build-with-debug:
required: false
default: "false"
description: If set, build in debug mode.
sync-tag:
required: false
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
default: "5.2"
description: List of CUDA architectures CI build should target
runner:
required: false
default: "linux.2xlarge"
description: Runner label to select worker type
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
HUGGING_FACE_HUB_TOKEN:
description: Hugging Face Hub token
required: false
default: ""
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ steps.filter.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
role-duration-seconds: 10800
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
shell: bash
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()


@ -9,6 +9,10 @@ inputs:
job_identifier:
description: Text that uniquely identifies a given job type within a workflow. All shards of a job should share the same job identifier.
required: true
s3_bucket:
description: S3 bucket to upload/download PyTest cache
required: false
default: ""
runs:
using: composite
@ -30,6 +34,7 @@ runs:
CACHE_DIR: ${{ inputs.cache_dir }}
JOB_IDENTIFIER: ${{ inputs.job_identifier }}
REPO: ${{ github.repository }}
BUCKET: ${{ inputs.s3_bucket }}
run: |
python3 .github/scripts/pytest_cache.py \
--download \
@ -38,3 +43,4 @@ runs:
--job_identifier $JOB_IDENTIFIER \
--temp_dir $RUNNER_TEMP \
--repo $REPO \
--bucket $BUCKET \


@ -26,8 +26,14 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Check if in a ARC runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT
- name: Start docker if docker daemon is not running
shell: bash
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";


@ -9,6 +9,16 @@ runs:
shell: bash
run: echo "DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" >> "${GITHUB_ENV}"
- name: Remove leftover Docker config file
shell: bash
continue-on-error: true
run: |
set -ex
cat ~/.docker/config.json || true
# https://stackoverflow.com/questions/64455468/error-when-logging-into-ecr-with-docker-login-error-saving-credentials-not
rm -f ~/.docker/config.json
- name: Stop all running docker containers
if: always()
shell: bash

.github/actions/setup-xpu/action.yml

@ -0,0 +1,67 @@
name: Setup XPU host
description: Set up XPU host for CI
runs:
using: composite
steps:
- name: Clean all stopped docker containers
if: always()
shell: bash
run: |
# Prune all stopped containers.
# If another runner is already pruning on this node, this will skip.
nprune=$(ps -ef | grep -c "docker container prune")
if [[ $nprune -eq 1 ]]; then
docker container prune -f
fi
- name: Runner health check system info
if: always()
shell: bash
run: |
cat /etc/os-release || true
cat /etc/apt/sources.list.d/oneAPI.list || true
cat /etc/apt/sources.list.d/intel-gpu-jammy.list || true
whoami
- name: Runner health check xpu-smi
if: always()
shell: bash
run: |
xpu-smi discovery
- name: Runner health check GPU count
if: always()
shell: bash
run: |
ngpu=$(xpu-smi discovery | grep -c -E 'Device Name')
msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
if [[ $ngpu -eq 0 ]]; then
echo "Error: Failed to detect any GPUs on the runner"
echo "$msg"
exit 1
fi
- name: Runner diskspace health check
uses: ./.github/actions/diskspace-cleanup
if: always()
- name: Runner health check disconnect on failure
if: ${{ failure() }}
shell: bash
run: |
killall runsvc.sh
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: XPU set GPU_FLAG
shell: bash
run: |
# Add render group for container creation.
render_gid=`cat /etc/group | grep render | cut -d: -f3`
echo "GPU_FLAG=--device=/dev/mem --device=/dev/dri --group-add video --group-add $render_gid" >> "${GITHUB_ENV}"

.github/actions/teardown-xpu/action.yml

@ -0,0 +1,20 @@
name: Teardown XPU host
description: Tear down XPU host for CI
runs:
using: composite
steps:
- name: Teardown XPU
if: always()
shell: bash
run: |
# Prune all stopped containers.
# If another runner is already pruning on this node, this will skip.
nprune=$(ps -ef | grep -c "docker container prune")
if [[ $nprune -eq 1 ]]; then
docker container prune -f
fi
- name: Runner diskspace health check
uses: ./.github/actions/diskspace-cleanup
if: always()


@ -1,59 +0,0 @@
name: Update commit hash
inputs:
repo-owner:
required: false
type: string
description: Name of repository's owner.
default: pytorch
repo-name:
required: true
type: string
description: Name of the repository we're updating commit hash for.
branch:
required: true
type: string
description: Branch to fetch commit of
pin-folder:
type: string
description: Path to folder with commit pin
required: false
default: .github/ci_commit_pins
updatebot-token:
required: true
type: string
description: update bot token
pytorchbot-token:
required: true
type: string
description: pytorchbot token
description: update commit hash
runs:
using: composite
steps:
- name: Checkout repo
uses: actions/checkout@v3
with:
fetch-depth: 1
submodules: false
token: ${{ inputs.updatebot-token }}
- name: Checkout
shell: bash
run: |
git clone https://github.com/${{ inputs.repo-owner }}/${{ inputs.repo-name }}.git --quiet
- name: Check if there already exists a PR
shell: bash
env:
REPO_NAME: ${{ inputs.repo-name }}
BRANCH: ${{ inputs.branch }}
PIN_FOLDER: ${{ inputs.pin-folder }}
UPDATEBOT_TOKEN: ${{ inputs.updatebot-token }}
PYTORCHBOT_TOKEN: ${{ inputs.pytorchbot-token }}
NEW_BRANCH_NAME: update-${{ inputs.repo-name }}-commit-hash/${{ github.run_id }}-${{ github.run_number }}-${{ github.run_attempt }}
run: |
# put this here instead of the script to prevent accidentally changing the config when running the script locally
git config --global user.name "PyTorch UpdateBot"
git config --global user.email "pytorchupdatebot@users.noreply.github.com"
python .github/scripts/update_commit_hashes.py --repo-name "${REPO_NAME}" --branch "${BRANCH}" --pin-folder "${PIN_FOLDER}"


@ -11,6 +11,10 @@ inputs:
Suffix to add to the filename of the artifacts. This should include the
workflow job id, see [Job id in artifacts].
required: true
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -87,6 +91,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -97,6 +102,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -108,6 +114,7 @@ runs:
if: ${{ !inputs.use-gha }}
continue-on-error: true
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14


@ -6,7 +6,6 @@ reviewers:
- albanD
- miladm
- bdhirsh
- voznesenskym
per_author:
symbolic-shapes:


@ -1 +1 @@
6518fa9b2c74e84d7eb1fc6e3eb51e43213f0c05
17a70815259222570feb071034acd7bae2adc019


@ -1 +1 @@
99944a2fb8624947f9c0e2edc898ff42a16124da
d6015d42d9a1834bc7595c4bd6852562fb80b30b


@ -1 +1 @@
e12d200c97d7aab668b976e92b46513c9ca7a0d8
a0c79b399b75368208464b2c638708165cca7ef1


@ -1 +1 @@
a80c1e7f958e7d8e8f92319db70876940e67ad9b
707a632930bfde19ffb361cdf5c31a7682af4e67

.github/labeler.yml

@ -8,10 +8,6 @@
- torch/_inductor/**
- test/inductor/**
"module: export":
- torch/_export/**
- test/export/**
"ciflow/inductor":
- torch/_decomp/**
- torch/_dynamo/**
@ -30,6 +26,11 @@
- .github/ci_commit_pins/**
- c10/core/Sym*
- torch/fx/experimental/symbolic_shapes.py
- torch/fx/experimental/recording.py
- torch/fx/experimental/sym_node.py
- torch/fx/experimental/validator.py
- torch/fx/experimental/_sym_dispatch_mode.py
- torch/fx/experimental/proxy_tensor.py
- test/distributed/_tensor/test_dtensor_compile.py
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
- torch/distributed/_tensor/**
@ -43,6 +44,7 @@
- aten/src/ATen/native/mkldnn/**
- torch/cpu/**
- torch/utils/mkldnn.py
- torch/utils/_sympy/**
- test/test_mkldnn.py
"module: mkldnn":
@ -79,3 +81,7 @@
- torch/nn/parallel/**
- test/distributed/**
- torch/testing/_internal/distributed/**
"module: distributed_checkpoint":
- torch/distributed/checkpoint/**
- test/distributed/checkpoint/**

Some files were not shown because too many files have changed in this diff.