Compare commits


533 Commits

83ad8e01b1 Fix cpu_fallback crash for aten::triu_indices on custom devices (#121306)
Fixes #121289

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121306
Approved by: https://github.com/ezyang
2024-03-26 01:29:45 +00:00
5e66bf5f42 Avoid COW materialize in nn.functional forward ops (3) (#122443)
Affected ops:
* repeat
* unfold
* logsigmoid
* pixel_shuffle/unshuffle
* remaining norm ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122443
Approved by: https://github.com/ezyang
2024-03-26 00:56:57 +00:00
b6982bf2b2 [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Since the module mutates `self.x`, you would expect `compiled_module.x`
to be updated, but in fact `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object, while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
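
A minimal standalone sketch of the aliasing behavior this sets up (the `Wrapper` class below is illustrative; the real change lives on `OptimizedModule`):
```python
import torch

class Wrapper:
    """Illustrative stand-in for OptimizedModule, not the real class."""

    def __init__(self, mod):
        # Bypass our own __setattr__ for the single wrapper-internal field.
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):
        # Called only when normal lookup fails, so everything except
        # _orig_mod falls through to the wrapped module.
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        # Forward writes too, so wrapper.x and module.x stay aliased.
        setattr(self._orig_mod, name, value)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.x = 0

    def forward(self):
        self.x += 1  # the module mutates its own attribute

m = Wrapper(M())
m.x = 10     # lands on the wrapped module, not the wrapper
m.forward()  # resolved through __getattr__, mutates module.x
assert m.x == 11
```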

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-26 00:52:12 +00:00
eda279c997 [CpuInductor] Implement masked_load for integral types (#122608)
Use `if constexpr` to separate the float vs. integral masked-load paths for AVX512.
Discovered while looking at `test_comprehensive_fft_ihfft2_cpu_int64` on
non-AVX512-capable CPUs, where the (5, 6, 7) shape was big enough to start a vectorized loop.

Added `test_pad_cast` regression test

Fixes https://github.com/pytorch/pytorch/issues/122606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122608
Approved by: https://github.com/jansel
ghstack dependencies: #122607
2024-03-25 22:44:54 +00:00
57a3d00b06 [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)
Summary:
During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have no control over them.

To accommodate this, we should allow users to omit these internally generated constants while still being able to update the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Differential Revision: D55286634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-03-25 22:05:20 +00:00
ebde6c72cb Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as:

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
2024-03-25 21:33:36 +00:00
9b095c3fe6 [dynamo] Config to not emit runtime asserts (#122603)
Re-land of https://github.com/pytorch/pytorch/pull/122406, which was squashed & merged by mistake.

Differential Revision: [D55312394](https://our.internmc.facebook.com/intern/diff/D55312394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122603
Approved by: https://github.com/ezyang
2024-03-25 21:17:44 +00:00
1f67da5105 [executorch hash update] update the pinned executorch hash (#122152)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122152
Approved by: https://github.com/pytorchbot
2024-03-25 20:56:34 +00:00
46a76cfef5 [ROCm] Fix test_trace_rule_update.py (#121524)
- Add missing torch APIs to the trace rules and ignore APIs with manual trace rules.

This PR fixes test/dynamo/test_trace_rule_update.

Possibly related to https://github.com/pytorch/pytorch/pull/121142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121524
Approved by: https://github.com/jansel, https://github.com/pruthvistony
2024-03-25 20:53:24 +00:00
bc7f3859b3 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as a symbolic integer
- Update the gen_variable_type.py script to call the SymInt version of the zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
2024-03-25 20:50:12 +00:00
1c1268b6e9 Fix "basic_string::_M_construct null not valid" seg-fault in getNcclErrorDetailStr (#121905)
When testing all-reduce with an alternative RCCL replacement backend, my test script crashed. After debugging, I found that `ncclGetLastError(NULL)` returned null, and constructing a std::string from that return value seg-faulted with the exception `basic_string::_M_construct null not valid`.

This pull request fixes this edge condition so that the program exits gracefully with useful information.

**Test:**
Before the fix, my test script exits like below:
```
File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2051, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: basic_string::_M_construct null not valid
```

After this fix, my test script exited with a useful message like:
```
[rank0]:   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:272, internal error - please report this issue to the NCCL developers, NCCL version 0.4.2
[rank0]: ncclInternalError: Internal check failed.
[rank0]:  Last error: Unknown NCCL Error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121905
Approved by: https://github.com/wconstab
2024-03-25 20:49:34 +00:00
05bbcae5bb Refactor functorch meta conversion (#122202)
At a high level, the goal of this refactor was to make it so that `MetaConverter.__call__` has a straightforward code structure in three steps: (1) check if we support doing meta conversion, (2) describe the tensor into MetaTensorDesc, (3) call `meta_tensor` on MetaTensorDesc. However, this is not so easy to do, because there is a big pile of special cases for functional tensor inside `__call__`.

The primary complication is handling the ambient functionalization state: specifically, the functorch dynamic layer stack and the Python functionalization dispatch. The old code demands that meta tensor conversion happen with this state disabled. But I discovered that when I reconstruct functorch tensors, it demands that the functorch layers be active; in fact, a batch tensor will have a pointer to the internal functorch layer.

I had some discussion with Richard Zou about what code structure here makes sense. In particular, one of the goals of the refactor here is that I can inflate MetaTensorDesc from an entirely different process, which may not have all of the functorch layers activated at the time we do reconstruction. So it seems to me that we should make it explicit in MetaTensorDesc that there was some functorch layer active at the time the functorch tensor was serialized, so that we could potentially know we need to reconstruct these layers on the other side. This is NOT implemented yet, but there's some notes about how potentially it could proceed. But the important thing here is we SHOULD disable everything when we run `meta_tensor`, and internally be responsible for restoring the stack. Actually, the necessary infra bits in functorch don't exist to do this, so I added some simple implementations in pyfunctorch.py.

The rest is splitting up the manipulations on the tensor (we do things like sync the real tensor before describing it; Describer is responsible for this now), and I also tried to simplify the not-supported condition, based on my best understanding of what the old thicket of conditions was doing. You may notice that the internal meta_tensor handling of functional tensors is inconsistent with the surrounding code: this is because I *exactly* replicated the old reconstruction behavior; a further refactor would be to rationalize this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122202
Approved by: https://github.com/zou3519
2024-03-25 20:47:21 +00:00
9223b2cb31 Pop codegened parent graph from wrapper in GraphLowering (#122469)
Summary: Previously, we kept a reference to `V.graph` in the `codegened_graph_stack` of the wrapper. Memory regression analysis of https://github.com/pytorch/pytorch/issues/121887 shows that this has led to a slightly higher memory utilization during lowering of the `llama_v2_7b_16h` model. Here we refactor the code to pop the parent subgraph from the `codegened_graph_stack` when codegen-ing is done.

Fixes https://github.com/pytorch/pytorch/issues/121887.

Test Plan: CI, also see https://github.com/pytorch/pytorch/issues/121887#issuecomment-2014209104.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122469
Approved by: https://github.com/eellison
2024-03-25 20:27:59 +00:00
b2c496ba24 Revert "[TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)"
This reverts commit c1fe09dc37358d8121f119d66e9e8c8d57035158.

Reverted https://github.com/pytorch/pytorch/pull/121415 on behalf of https://github.com/ezyang due to I think this needs to be reverted to after https://github.com/pytorch/pytorch/pull/120076 revert ([comment](https://github.com/pytorch/pytorch/pull/121415#issuecomment-2018828813))
2024-03-25 20:14:40 +00:00
f84e3bf36d [ez] Fix XLA auto hash updates (#122630)
The xla pin is located in .github/ci_commit_pins not .ci/docker/ci_commit_pins
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122630
Approved by: https://github.com/huydhn
2024-03-25 19:45:56 +00:00
9d1de31634 [BE][CPUInductor] Use C++17 helper templates (#122607)
Such as `std::is_same_v` and `std::is_integral_v`, plus the C++14 `std::enable_if_t`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122607
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-25 19:01:44 +00:00
2d4197c9b7 add case for creating storage on ort (#122446)
Fixes #122445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122446
Approved by: https://github.com/mikaylagawarecki
2024-03-25 18:59:20 +00:00
2db7d874a9 [inductor] Improve error message for shape errors in slice_scatter (#122543)
Fixes #122291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122543
Approved by: https://github.com/shunting314
2024-03-25 18:57:16 +00:00
db506762d1 Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit a52b4e22571507abc35c2d47de138497190d2e0a.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2018680656))
2024-03-25 18:52:05 +00:00
c7bf5871ce CUDAEvent::elapsed_time could accidentally initialize a non-used GPU (#122538)
This sets the device before calling cudaEventElapsedTime to avoid the case
where the current (cudaGetCurrentDevice) device would be initialized even
though neither event is on that device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122538
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-03-25 17:49:50 +00:00
198927170d Avoid COW materialize in nn.functional forward ops (2) (#121992)
Affected ops:
* dropout
* embedding
* embedding_bag
* multi_head_attention_forward
* grid_sample
* ctc_loss
* nll_loss
* pdist

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121992
Approved by: https://github.com/ezyang
ghstack dependencies: #122437, #121991
2024-03-25 17:31:19 +00:00
55becf02bc Avoid COW materialize in nn.functional forward ops (1) (#121991)
Affected ops:
* Remaining norm ops
* pad
* margin_loss ops
* fractional_max_pool
* linear
* prelu
* rrelu
* scaled_dot_product_attention
* logsigmoid
* threshold
* binary_cross_entropy
* gelu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121991
Approved by: https://github.com/ezyang
ghstack dependencies: #122437
2024-03-25 17:31:19 +00:00
4c70ab26ef [MPS] Enable index_select for complex types (#122590)
Surprisingly, as of macOS 14.4, MPS `gatherWithUpdatesTensor:indicesTensor:axis:batchDimensions:name:` still does not support complex types, so emulate them using the `at::view_as_real` trick

Fixes https://github.com/pytorch/pytorch/issues/122427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122590
Approved by: https://github.com/Skylion007
2024-03-25 16:57:35 +00:00
e6a37eeb06 run some cuda testcases on other devices if available. (#122182)
If users want to run some CUDA test cases on other devices, e.g. to test performance on custom devices, they can do so through an environment variable, as this PR enables.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122182
Approved by: https://github.com/ezyang
2024-03-25 16:40:03 +00:00
70ac13b876 [ez][TD] Hide errors in llm retrieval job (#122615)
The new ghstack does not have a base on main anymore, so finding the base for ghstacked PRs is harder.  Something similar to https://github.com/pytorch/pytorch/pull/122214 might be needed, but then I'm worried about tokens

Either way, this is a quick workaround to hide these errors for ghstack users
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122615
Approved by: https://github.com/huydhn
2024-03-25 16:35:00 +00:00
47a9725de9 Implement prefer_deferred_runtime_asserts_over_guards (#122090)
Fixes https://github.com/pytorch/pytorch/issues/121749

As promised, it is pretty easy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122090
Approved by: https://github.com/lezcano
2024-03-25 16:31:16 +00:00
e49a38973f Update DimOrDims typing in torch.sparse (#122471)
I noticed that the typing of `torch.sparse.sum`'s `dim` parameter wasn't allowing an int tuple as input and tracked the issue to this type.
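
For reference, a call the widened annotation now accepts (the runtime already handled it; only the type hint was too narrow):
```python
import torch

s = torch.randn(3, 4).to_sparse()
partial = torch.sparse.sum(s, dim=(0,))   # sparse sum over dim 0
total = torch.sparse.sum(s, dim=(0, 1))   # sum over both dims
```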

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122471
Approved by: https://github.com/soulitzer
2024-03-25 16:25:56 +00:00
06f22537ca [dynamo] Suppress warning about torch.autograd.Function() (#122566)
PR #120577 got reverted due to issues in fbcode.  This hides the warning
that PR was trying to fix until we can debug the fbcode issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122566
Approved by: https://github.com/yanboliang
2024-03-25 16:18:43 +00:00
0465a90b00 [export][reland] Fix unflattened submodule ordering. (#122341) (#122507)
Summary:

Make sure the order of submodules is the same as the original eager module.

bypass-github-export-checks

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_unflatten_submodule_ordering

Differential Revision: D55251277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122507
Approved by: https://github.com/tugsbayasgalan
2024-03-25 15:22:01 +00:00
11dfa72153 [BE] Remove unnecessary state dict update. (#122528)
From what I can see, the following is a redundant/unnecessary setting of a dict element.

Differential Revision: [D55191396](https://our.internmc.facebook.com/intern/diff/D55191396/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122528
Approved by: https://github.com/Skylion007
2024-03-25 15:21:44 +00:00
5152945441 GPT2 SDPA inference pattern-matching for Inductor-CPU (#121866)
### Summary
With this PR, the SDPA pattern of GPT2 is mapped to `torch.nn.functional.scaled_dot_product_attention`.
While GPT2 supports both a causal mask & an attention mask, this PR considers the case where the attention mask is absent.
The TorchBench inference workload for GPT2 also doesn't use an attention mask.
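
For illustration, the target op of the replacement, called the way this pattern assumes (causal, no explicit attention mask; the shapes below are made up, not GPT2's actual config):
```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 12, 128, 64)  # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # no attn_mask
```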

This pattern's replacement is being disabled for CUDA because [CUDA AOT Inductor](https://github.com/pytorch/pytorch/actions/runs/8319111885/job/22762567770) CI job's `GPT2ForSequenceClassification` accuracy test failed, although all other trunk CUDA Inductor CI checks had passed.
Created #122429 to track that particular issue.

### CPU performance data with TorchBench
|MODEL |BATCH SIZE | DTYPE | BEFORE: Speedup over eager-mode with the default Inductor implementation | AFTER: Speedup over eager mode with SDPA op mapped| Perf boost = (AFTER - BEFORE)/BEFORE * 100|
|--------------------------|-------------|---------|-----------------------------|--------------------------|------------|
|hf_GPT2| 1 | FP32 | 1.522x | 1.791x| 17.67%|
|hf_GPT2| 1 | BF16 (AMP) | 1.795x | 2.387x| 32.98%|
|hf_GPT2| 2 | FP32 |  1.313x |1.629x | 19.3%|
|hf_GPT2|2| BF16 (AMP) | 1.556x | 1.924x | 23.65%|
|hf_GPT2_large| 1 | FP32 | 1.380x |1.585x | 12.93%|
|hf_GPT2_large| 1 | BF16 (AMP) | 1.208x | 1.567x | 22.91%|
|hf_GPT2_large| 2 | FP32 | 1.188x | 1.490x | 25.42%|
|hf_GPT2_large|2| BF16 (AMP) | 0.991x | 1.575x | 58.93%|

Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen Sapphire Rapids)
48 physical cores were used. Intel OpenMP & libtcmalloc were preloaded.

Example command -
```
 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 -C 0-47 python benchmarks/dynamo/torchbench.py --performance --inference --inductor --float32 -dcpu --only hf_GPT2_large --freezing --batch-size 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121866
Approved by: https://github.com/Valentine233, https://github.com/jgong5, https://github.com/desertfire
2024-03-25 15:04:03 +00:00
4dc09d6aa4 Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)"
This reverts commit e9dcda5cba92884be6432cf65a777b8ed708e3d6.

Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))
2024-03-25 13:49:04 +00:00
cyy
b9d6f8cc18 Fix clang-tidy warnings in aten/src/ATen/core/*.cpp (#122572)
This PR fixes clang-tidy warnings in aten/src/ATen/core/*.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122572
Approved by: https://github.com/ezyang
2024-03-25 13:46:24 +00:00
1e404c9b12 Remove redundant query to tensor_to_context (#122278)
from_real_tensor will query it again, so this query is strictly
dominated.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122278
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270, #122271
2024-03-25 13:16:21 +00:00
49b81af45f Delete dead memoized_only kwarg in FakeTensor (#122271)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122271
Approved by: https://github.com/eellison
ghstack dependencies: #122044, #122270
2024-03-25 13:16:21 +00:00
f32ce4e28e Delete FakeTensorConverter.__call__ in favor of from_real_tensor (#122270)
It's annoying grepping for `__call__` call-sites, so they're all explicit now. I'd do this to MetaConverter too, but that one is way more public, with a lot more sites.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122270
Approved by: https://github.com/eellison
ghstack dependencies: #122044
2024-03-25 13:16:13 +00:00
069270db60 [dynamo] Fix list comparison ops (#122559)
Fixes #122376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122559
Approved by: https://github.com/Skylion007
2024-03-25 07:03:23 +00:00
5891c5b3a6 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR is pretty involved, so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented it changes the weak-reference behavior (see the sketch after this list).
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, checking that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.
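
Here is the promised sketch of the split memo tables (all names are assumed for illustration, not the real MetaConverter fields):
```python
import weakref
import torch
from torch.utils.weak import WeakTensorKeyDictionary

class SplitMemo:
    """Real Tensor -> stable int id -> meta/fake tensor, as two tables
    whose entries can die independently of each other."""

    def __init__(self):
        self._next_id = 0
        self._tensor_to_id = WeakTensorKeyDictionary()    # real -> stable id
        self._id_to_meta = weakref.WeakValueDictionary()  # stable id -> meta

    def stable_id(self, t):
        if t not in self._tensor_to_id:
            self._tensor_to_id[t] = self._next_id
            self._next_id += 1
        return self._tensor_to_id[t]

    def set_meta(self, t, meta):
        self._id_to_meta[self.stable_id(t)] = meta

    def get_meta(self, t):
        return self._id_to_meta.get(self.stable_id(t))

memo = SplitMemo()
x = torch.randn(2)
m = x.to("meta")
memo.set_meta(x, m)
assert memo.get_meta(x) is m  # each table's entry expires on its own
```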

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
2024-03-25 06:21:17 +00:00
cf06189a2d [CPPInductor] Fix another out-of-bounds access (#122580)
Not sure what the idea was behind the `{self.tiling_factor}*sizeof(float)/sizeof({DTYPE_TO_CPP[dtype]})` size calculation (perhaps a copy-n-paste error during the refactor made by https://github.com/pytorch/pytorch/pull/97626), but `Vectorized::store(ptr, tiling_factor)` needs at least `tiling_factor` elements, not `tiling_factor/2` (which the original calculation would yield for a 64-bit data type such as int64).
Discovered while trying to enable aarch64 vectorized Inductor.
Minimal reproducer (reproducible on ARMv8 or any  x86_64 machine that does not support AVX512):
```python
import torch
def do_ds(x, y):
    return torch.diagonal_scatter(x, y)

x=torch.ones(10, 10, dtype=torch.int64)
y=torch.tensor([ 1,  2, -8,  8,  5,  5, -7, -8,  7,  0])
dsc = torch.compile(do_ds)
assert torch.allclose(torch.diagonal_scatter(x, y), dsc(x, y))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122580
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-03-25 04:49:20 +00:00
deeeaded1f Add metas for randint/rand factory functions out overload (#122375)
Fixes https://github.com/pytorch/pytorch/issues/121897
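
A sketch of the kind of call this covers, assuming post-fix behavior (the meta kernel for the out= overload now exists):
```python
import torch

out = torch.empty(3, dtype=torch.int64, device="meta")
torch.randint(0, 10, (3,), out=out)  # previously lacked a meta implementation
assert out.shape == (3,)
```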

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122375
Approved by: https://github.com/lezcano
2024-03-25 04:01:38 +00:00
cyy
a01d35c7f6 [TorchGen] Remove unused variables (#122576)
This PR removes some unused Python variables from TorchGen scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122576
Approved by: https://github.com/Skylion007
2024-03-25 03:31:41 +00:00
e75ecd5618 [BE][veclib] Use is_same_v/enable_if_t (#122533)
The `enable_if_t` helper is part of C++14; the `is_same_v` helper is part of C++17.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122533
Approved by: https://github.com/Skylion007
2024-03-24 20:57:41 +00:00
14e348b7ad Handle JIT test failure when the GPU is newer than the CUDA compiler or vice versa (#122400)
The test may fail because it either uses target flags newer than the GPU, resulting in failures loading the compiled binary, or targets a GPU for which CUDA has no support yet/anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122400
Approved by: https://github.com/ezyang
2024-03-24 13:58:06 +00:00
36188360dd [dynamo] support torch.distributed.{group.WORLD, GroupMember.WORLD, distributed_c10d._get_default_group} (#120560)
Fixes https://github.com/pytorch/pytorch/issues/120431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120560
Approved by: https://github.com/wconstab
2024-03-24 11:13:05 +00:00
3e4a4bea12 [dynamo] Graph break on SymNode control flow (#122546)
Fixes #111918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122546
Approved by: https://github.com/ezyang
2024-03-24 07:22:02 +00:00
adeedc060f [Inductor] Fix unbacked symbol in stride when using item() (#122298)
Fixes #122296

Test: python test/inductor/test_torchinductor_dynamic_shapes.py -k test_item_unbacked_stride_nobreak_cuda

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122298
Approved by: https://github.com/ezyang
2024-03-24 06:27:15 +00:00
cyy
c1fe09dc37 [TorchGen] Add mutable parameter to valuetype_type function in api/cpp.py (#121415)
This PR is a follow-up of #120076; it moves the std::optional<Generator> detection logic into `valuetype_type` of api/cpp.py by adding the mutable parameter, which facilitates future value-type changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121415
Approved by: https://github.com/ezyang
2024-03-24 06:11:08 +00:00
ca9606f809 Update COW OpInfo test to include kwargs and expected materialization (#122437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122437
Approved by: https://github.com/ezyang
2024-03-24 06:07:30 +00:00
9d4218c23e Handle JIT test failure when the GPU is newer than the CUDA compiler (#122402)
The test uses the CUDA compute capabilities of the current device to
compile an extension. If nvcc is older than the device, it will fail
with a message like "Unsupported gpu architecture 'compute_80'",
resulting in a `RuntimeError: Error building extension 'cudaext_archflags'`
and ultimately failing the test.

This PR checks for this case and allows execution to continue.

Fixes https://github.com/pytorch/pytorch/issues/51950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122402
Approved by: https://github.com/ezyang
2024-03-24 05:36:24 +00:00
cyy
808a035658 [Dynamo][4/N] Enable clang-tidy coverage on torch/csrc/dynamo/* (#122534)
This PR enables clang-tidy coverage on torch/csrc/dynamo/* and also contains other small improvements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122534
Approved by: https://github.com/Skylion007
2024-03-24 05:26:32 +00:00
f0d461beac [vision hash update] update the pinned vision hash (#122536)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122536
Approved by: https://github.com/pytorchbot
2024-03-24 03:42:21 +00:00
5f7e71c411 [dynamo] Add HASATTR guard for UserDefinedObject attrs (#122555)
Fixes #111522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122555
Approved by: https://github.com/Skylion007
2024-03-24 03:41:58 +00:00
07d037674f [inductor] Fix issue with randint + symbolic shapes (#122428)
Fixes #122405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122428
Approved by: https://github.com/ezyang
2024-03-24 03:41:13 +00:00
476585b190 Preserve unbacked SymInt on SymNode (#120816)
Previously, when we applied a replacement, a SymInt that was
previously an unbacked SymInt would then transmute into whatever
we replaced it into (e.g., a constant).

This has a major downside: we often look at SymInts associated with
FX nodes (e.g., the meta of x.item() return) to find out where the
unbacked SymInt was allocated.  If we replace it, we no longer can
find out where, e.g., u1 was allocated!  But we need to know this
so we can generate deferred runtime asserts like u1 == s0.

To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites:

* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.

There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.

Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.
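
For context, a minimal illustration of the objects discussed above (an assumed repro, not from the PR): `x.item()` allocates an unbacked SymInt such as u0, and `torch._check_is_size` refines its range as size-like so the `zeros` call needs no data-dependent guard.
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # allow tracing .item()

@torch.compile(fullgraph=True)
def f(x):
    u0 = x.item()              # allocates an unbacked SymInt (e.g. u0)
    torch._check_is_size(u0)   # size-like: refines u0's value range
    return torch.zeros(u0)     # no guard on u0's concrete value

print(f(torch.tensor(5)).shape)  # torch.Size([5])
```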

Fixes https://github.com/pytorch/pytorch/issues/119689

Fixes https://github.com/pytorch/pytorch/issues/118385

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
2024-03-24 02:56:16 +00:00
cyy
a52b4e2257 Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes using const std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-24 02:12:08 +00:00
788638fcdc Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122473
Approved by: https://github.com/lezcano
2024-03-24 01:02:20 +00:00
cdc7f0fd3b Fixed failing pyhpc_equation_of_state due to cpp nodes fusion with compatible ranges (#122420)
Fixes #122283

Description:

PR https://github.com/pytorch/pytorch/pull/120077 introduced cpp node fusion with compatible ranges, with the assumption that all scheduler nodes inside the fused nodes are the same; however, it turned out that snodes can have different indexing expressions. This PR fixes the incorrect assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122420
Approved by: https://github.com/lezcano
2024-03-24 00:40:31 +00:00
4758837930 [BE] Do not use importlib.load_module (#122542)
To get rid of the annoying
```
<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead
```
using the recipe from https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122542
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-03-23 17:22:26 +00:00
bf40e3f880 [EZ][BE] Add missing acosh op to vec256_float_neon.h (#122513)
As the base class has it:
ed15370aab/aten/src/ATen/cpu/vec/vec_base.h (L367-L369)

Discovered while attempting to enabling Inductor vectorization on ARM platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122513
Approved by: https://github.com/Skylion007
2024-03-23 14:18:02 +00:00
a39e638707 Update bsr_dense_addmm kernel parameters for sizes 3 x 2 ^ N (#122506)
As in the title. The speed-ups for a particular set of input sizes range from about 7 to 85 % depending on the used BSR tensor block sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122506
Approved by: https://github.com/cpuhrsch
2024-03-23 11:54:33 +00:00
8a209344c9 Fix access to uninitialized memory in VSX vector functions for quantized values (#122399)
Similar to https://github.com/pytorch/pytorch/pull/89833, those functions may access uninitialized memory, leading to undefined behavior/results.
Initialize with zeros, as done before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122399
Approved by: https://github.com/ezyang
2024-03-23 06:11:30 +00:00
c677221798 remove torchao dependency (#122524)
Test Plan:
CI

```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp32 --pt2e_quantize "xnnpack_dynamic" -2
```

```
buck run //executorch/backends/xnnpack/test:test_xnnpack_ops -- executorch.backends.xnnpack.test.ops.linear.TestLinear.test_qd8_fp32_per_token_weight_per_channel_group_int4
```

Differential Revision: D55263008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122524
Approved by: https://github.com/jerryzh168
2024-03-23 03:18:43 +00:00
19d27a13ea [CPUInductor] Fix out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511)
Discovered while debugging regressions in enabling vectorization on ARM platform

Without this change `test_div2_cpu` will fail with invalid values on non-x86 CPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122511
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-03-23 01:45:07 +00:00
4d8a3f8bb3 changed aliasing checks to properly recurse for computing last usage (#122444)
Fixes https://github.com/pytorch/pytorch/issues/122457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122444
Approved by: https://github.com/yifuwang, https://github.com/jansel
ghstack dependencies: #121624, #122474
2024-03-23 01:43:21 +00:00
50036ec781 [Inductor] Add a test for creating a cpu inductor-> triton backend (#122396)
Summary: Currently there is a test for adding a backend in test/inductor/test_extension_backend.py for a cpp backend with a new device. However, there is no such test for the Triton backend; it should be possible for a user to create and register their own ExtensionWrapperCodegen and ExtensionScheduling for another non-CUDA device and be able to generate Triton code. For simplicity I have chosen to use a CPU device, as I think it's plausible someone might want to create a CPU Triton backend.

Unfortunately the generation and running of the code is quite tightly coupled so I've had to use a mocked function to extract the code before running. Suggestions are welcome for better ways to do this.

This is a stepping off point for some additional PRs to make the Triton code path less CUDA specific, as currently there would be no way to test this avenue.

Test plan:
```
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('intermediate_hooks', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 0.394s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122396
Approved by: https://github.com/jansel
2024-03-23 01:14:57 +00:00
41d69ff324 Add a shape inference tool (#120097)
Summary:
Add a shape inference tool that helps to infer each node shape of a given graph module.
1. Given an fx graph and an example input (it doesn't need to be an accurate input that can be run through forward, but it should have valid dims and data structures), `infer shape` creates an input of symbolic shape.
2. Shape-propagating this symbolic input can catch runtime or value exceptions.
3. These errors are constraints on symbol values, and the constraint solver `infer symbolic values` helps us figure out specific values for each symbol.
4. Finally, we run the shape propagation based on the input tensor to get tensor shapes for all nodes in the FX-traced module.

Test Plan:
### 1. Test `infer symbol values`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```

### 2. Test `infer shape`
Command:
```
buck2 test mode/opt //caffe2/test:fx_experimental -- test_infer_symbol_values
```
Inferred shape result like: P897560514

Differential Revision: D53593702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120097
Approved by: https://github.com/yf225
2024-03-23 00:23:24 +00:00
29bca8547b Fix failing test_cpu_repro without vectorization support (#117262)
At least the following tests fail when there is no supported vector ISA:
test_lowp_fp_neg_abs
test_non_contiguous_index_with_constant_stride
test_scalar_mul_bfloat16
test_transpose_non_contiguous
test_transpose_sum2d_cpu_only
test_transpose_sum_outer
test_transpose_vertical_sum_cpu_only
test_vertical_sum_cpu_only

Those tests assert `metrics.generated_cpp_vec_kernel_count` is nonzero,
which is never the case without a supported vector ISA, e.g. on PPC and
maybe on AArch64.

Skip those tests with a new decorator and use the simpler one where an equivalent is already used

Some usages of `metrics.generated_cpp_vec_kernel_count` were guarded by a check instead of skipping the test. I tried to apply that instead of a skip where the test looked similar enough to where that was previously done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117262
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-23 00:03:55 +00:00
a84f1d3def [effects] Fix backwards handling (#122346)
I didn't previously test the `.backward()` call, and when testing on #122348 I realized we were missing some token handling in some places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122346
Approved by: https://github.com/zou3519
2024-03-22 23:31:52 +00:00
e7fa3f7812 AOTDispatch: allow subclasses to correct when we guess metadata of tangents incorrectly (#118670)
This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600.

More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like:

"We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents"

Here, the problem is similar:

"We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass".

This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial).

One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by:

(1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error

(2) In the error message, provide the name of an optional method that the subclass must implement to handle this case:

`def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement.

`__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is **unable** to convert a DTensor with some placement type into another DTensor with a `_Partial` placement.

`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time.
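
A schematic of the proposed contract (the method names come from the text above; the class and its `placement` field are illustrative stand-ins, not real DTensor code):
```python
class MyTensor:
    def __init__(self, data, placement):
        self.data = data
        self.placement = placement  # e.g. "Replicate", "Shard(1)", "_Partial"

    def __force_same_metadata__(self, metadata_tensor):
        # Convert self so its metadata matches metadata_tensor's, e.g. a
        # Shard(1) runtime tangent converted to the Replicate() we guessed.
        return MyTensor(self.data, metadata_tensor.placement)

    def __force_standard_metadata__(self):
        # Return a variant whose metadata any runtime tangent can be
        # converted to, avoiding "one-way" states like _Partial at trace time.
        return MyTensor(self.data, "Replicate")
```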

I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses are definitely a contentious change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670
Approved by: https://github.com/ezyang
2024-03-22 23:16:08 +00:00
f7b8d8e249 Support for sapling scm (#122072)
We can use Sapling (hg) with the pytorch repo, but there are a couple of minor issues to fix to teach our scripting to be happier with having either a git or hg repo.

This change fixes some issues in:
- setup.py
- lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122072
Approved by: https://github.com/ezyang
2024-03-22 22:59:16 +00:00
cyy
482f6c4693 [Dynamo][3/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122392)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122392
Approved by: https://github.com/ezyang
2024-03-22 22:57:41 +00:00
3f99306452 [export] Remove from_export flag (#122500)
Summary: The flag from_export was incorrectly included in a previous diff (https://www.internalfb.com/diff/D54314379) - it was intended for helping with ExportedProgram verification, but was no longer needed in the final implementation.

Test Plan: Changes no functionality, test/export already covers everything

Differential Revision: D55205857

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122500
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-22 22:55:14 +00:00
03184a82dd [TD] TD on ASAN PR jobs (#122332)
Low impact CPU jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122332
Approved by: https://github.com/huydhn
2024-03-22 22:32:51 +00:00
271cc687de Audit retracibility errors and fix some ez ones (#122461)
Summary: Title

Test Plan: CI

Differential Revision: D55227094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122461
Approved by: https://github.com/zhxchen17
2024-03-22 21:31:51 +00:00
29132c2e47 Prevent dup initializers when ONNXProgram.save is called many times (#122435)
Fixes https://github.com/pytorch/pytorch/issues/122351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122435
Approved by: https://github.com/titaiwangms
ghstack dependencies: #122196, #122230
2024-03-22 21:03:15 +00:00
4eaa000acc Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo
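
A sketch of the kind of program this makes traceable (the exact supported surface is assumed from the change list above):
```python
import torch

def f(x):
    return torch.sin(x)

@torch.compile(fullgraph=True)
def g(x, t):
    # jvp returns (f(x), directional derivative of f at x along t)
    return torch.func.jvp(f, (x,), (t,))

x, t = torch.randn(3), torch.randn(3)
primal_out, tangent_out = g(x, t)
assert torch.allclose(tangent_out, torch.cos(x) * t)
```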

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-22 20:25:47 +00:00
3795ebe925 Revert "[Inductor] Make codecache CUDA compilation more robust & flexible (#121490)"
This reverts commit 6bbd697306851b785b51b4d0545c1ef9365ddaa6.

Reverted https://github.com/pytorch/pytorch/pull/121490 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
97d3bf71b9 Revert "[Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)"
This reverts commit 700c92e1b9cb6fae2610d08e5a960273c4dd1697.

Reverted https://github.com/pytorch/pytorch/pull/121491 on behalf of https://github.com/huydhn due to Sorry for reverting you change but I think it is failing on ROCm, i.e. 700c92e1b9 ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))
2024-03-22 20:11:47 +00:00
8013c4409f [inductor] config to control whether we assume inputs are aligned (#122158)
**Motivation**: https://github.com/pytorch/pytorch/issues/112771

**Summary**: Inductor generates triton that assumes inputs are going to be 16-byte aligned. If the inputs aren't aligned, Inductor clones the inputs. This PR introduces a config option to not do this: when assume_aligned_inputs=False, Inductor will _not_ pass inputs as being divisible_by_16, and Inductor will not make clones. This can generate code that might be a bit slower, but the tradeoff can be worth it in some scenarios where you might otherwise make a lot of clones.
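
For illustration, how the new knob would be flipped (the option name is from this PR; treating True as the default is an assumption):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.assume_aligned_inputs = False  # don't clone unaligned inputs

@torch.compile
def f(x):
    return x * 2

x = torch.randn(17)[1:]  # storage_offset=1: not 16-byte aligned for fp32
f(x)  # compiled code may be a bit slower, but x is not cloned
```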

Ideally, we could do this on a per-tensor basis. But this would be a lot of work, and attempts to add guards on storage offsets to do this automatically have run into issues: recompilations and excessive time to generate/check guards.

**Tests** https://github.com/pytorch/pytorch/pull/122159 flips this to False. It didn't run through all errors, but the ones we see are all expected failures: divisible_by_16 changes; triton kernel caching fails if we call the same triton kernel multiple times (this makes sense because the first call will have unaligned inputs, but subsequent calls have aligned inputs); and some xfailed tests start passing.

**Alternatives/RFC**:
* Is this the right thing to do with cudagraphs?
* Elias and Jason mentioned that we probably still want to make clones if we're dealing with unaligned inputs to matmuls. Is this something we should add in this config option? (In the use case I'm targeting, it seems like we don't need this optimization right now)

Differential Revision: [D55079094](https://our.internmc.facebook.com/intern/diff/D55079094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122158
Approved by: https://github.com/ezyang
2024-03-22 20:03:38 +00:00
5790096059 [dynamo] Remove uses of raise unimplemented (#122136)
`unimplemented` is a function that raises an error, so
`raise unimplemented(...)` never reaches the `raise`.
Another related issue is that `raise unimplemented(...) from e`
doesn't attach the exception cause correctly. I fix this by adding
a `from_exc` argument to `unimplemented`.
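
A standalone sketch of the control flow being fixed (the real `unimplemented` lives in torch._dynamo.exc; this version only illustrates the pattern):
```python
class Unsupported(RuntimeError):
    pass

def unimplemented(msg, *, from_exc=None):
    if from_exc is not None:
        raise Unsupported(msg) from from_exc  # cause attached correctly
    raise Unsupported(msg)

# Before: `raise unimplemented(msg) from e` -- unimplemented() raises first,
# so the outer `raise ... from e` is dead code and the cause is lost.
# After:  `unimplemented(msg, from_exc=e)`
```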

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122136
Approved by: https://github.com/lezcano
2024-03-22 19:29:58 +00:00
ed15370aab [aoti] Add handling of ir.Constants in promote_constants (#122419)
This issue popped up when enabling predispatch IR on the benchmarks (https://github.com/pytorch/pytorch/pull/122225)

On the following model:
```
class M(torch.nn.Module):
    def __init__(self, device):
        super().__init__()
        self.device = device

    def forward(self, x):
        t = torch.tensor(x.size(-1), device=self.device, dtype=torch.float)
        t = torch.sqrt(t * 3)
        return x * t
```

We get the following error:
```
======================================================================
ERROR: test_constant_abi_compatible_cuda (__main__.AOTInductorTestABICompatibleCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/angelayi/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "/data/users/angelayi/pytorch/test/inductor/test_torchinductor.py", line 9232, in new_test
    return value(self)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 922, in test_constant
    self.check_model(M(self.device), (torch.randn(5, 5, device=self.device),))
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor.py", line 91, in check_model
    actual = AOTIRunnerUtil.run(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 102, in run
    so_path = AOTIRunnerUtil.compile(
  File "/data/users/angelayi/pytorch/test/inductor/test_aot_inductor_utils.py", line 40, in compile
    so_path = torch._inductor.aot_compile_ep(
  File "/data/users/angelayi/pytorch/torch/_inductor/__init__.py", line 150, in aot_compile_ep
    return compile_fx_aot(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1005, in compile_fx_aot
    compiled_lib_path = compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1111, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1145, in compile_fx
    return compile_fx(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1336, in compile_fx
    return inference_compiler(unlifted_gm, example_inputs_)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 1266, in fw_compiler_base
    return inner_compile(
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/data/users/angelayi/pytorch/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/angelayi/.conda/envs/pytorch10/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 447, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/data/users/angelayi/pytorch/torch/_inductor/compile_fx.py", line 707, in fx_codegen_and_compile
    graph.run(*example_inputs)
  File "/data/users/angelayi/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 612, in run
    return super().run(*args)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 957, in run_node
    result = super().run_node(n)
  File "/data/users/angelayi/pytorch/torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 819, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/data/users/angelayi/pytorch/torch/_inductor/graph.py", line 816, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 298, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 5340, in mul
    return make_pointwise(fn)(a, b)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 409, in inner
    inputs = promote_constants(inputs, override_return_dtype)
  File "/data/users/angelayi/pytorch/torch/_inductor/lowering.py", line 373, in promote_constants
    ex = next(x for x in inputs if isinstance(x, (TensorBox, ExpandView)))
torch._inductor.exc.LoweringException: StopIteration:
  target: aten.mul.Tensor
  args[0]: Constant(value=5.0, dtype=torch.float32, device=device(type='cuda', index=0))
  args[1]: 3
```

So I added an additional case in `promote_constants` to handle ir.Constants, and now it works! Please let me know if this is the wrong approach. Here's a paste of the full run with the inductor logs: P1198927007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122419
Approved by: https://github.com/eellison, https://github.com/desertfire, https://github.com/chenyang78
2024-03-22 18:39:36 +00:00
cyy
52e9049ffa Remove unused variables (#122496)
This PR removes several unused variables in the code base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122496
Approved by: https://github.com/ezyang
2024-03-22 18:04:09 +00:00
bbe846f430 Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828)
Start to fix https://github.com/pytorch/pytorch/issues/114801

Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828
Approved by: https://github.com/thiagocrepaldi
2024-03-22 18:01:33 +00:00
34d33df056 [DCP] Check if pg exists in async before checking for cpu PG (#122316)
Check if pg exists in async before checking for cpu PG in async save path.

This PR enables using async_save even if PG is not initialized.

Differential Revision: [D54868689](https://our.internmc.facebook.com/intern/diff/D54868689/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54868689/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122316
Approved by: https://github.com/shuqiangzhang, https://github.com/XilunWu
2024-03-22 18:01:11 +00:00
400cc518fc pt2 dper passes: run shape prop before each pass (#122451)
Summary: Most passes rely on shape info, so we need to run shape prop before each pass

Reviewed By: frank-wei

Differential Revision: D55221119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122451
Approved by: https://github.com/frank-wei
2024-03-22 17:57:25 +00:00
152fa9ecc2 skip moondream for training (#122483)
The model shows up as a failed model on the dashboard for training, but the model is not implemented for training (at least for now):
2196021e9b/torchbenchmark/models/moondream/__init__.py (L6)

Skip it in the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122483
Approved by: https://github.com/eellison
2024-03-22 17:35:52 +00:00
a3d4eaf253 [inductor] device guard for max autotune benchmark (#122479)
Internal users reported that they get failures with max-autotune if tensors are not on device 0. It turns out that we may use tensors on, say, device 6 but run kernels on them on device 0.

This PR enforces that we do benchmarking for max-autotune on the correct device.
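A minimal sketch of the idea (hypothetical helper, not the PR's actual code): run each candidate on the device that owns the inputs rather than implicitly on device 0.

```python
import torch

def benchmark_choice(kernel_fn, *example_inputs):
    # Guard on the inputs' own device so benchmarking happens where
    # the tensors actually live (e.g. device 6, not device 0).
    device = example_inputs[0].device
    with torch.cuda.device(device):
        return kernel_fn(*example_inputs)
```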

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122479
Approved by: https://github.com/xintwfb, https://github.com/Chillee
2024-03-22 17:27:53 +00:00
3db64c1955 [NCCL PG] Enable ncclCommDevIdxMap unconditionally (#122049)
Differential Revision: D54993977

### Summary
The initial purpose of ncclCommDevIdxMap is to support NCCL zero copy algorithms. Therefore, it is only enabled (with its values filled) if useTensorRegisterAllocatorHook_ is set to true. However, now we rely on it to support dumping NCCL information in a single PG. So we need it to be always available, regardless of whether we enabled useTensorRegisterAllocatorHook_.
Move the code that fills ncclCommDevIdxMap out of the `if (useTensorRegisterAllocatorHook_)` statement.

### Test Plan
See diff

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122049
Approved by: https://github.com/shuqiangzhang
2024-03-22 17:10:33 +00:00
f305c96cac [DCP] Add bytesIO object to test_e2e_save_and_load (#122112)
Added a TestTrainstate that includes BytesIO checkpoint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122112
Approved by: https://github.com/LucasLLC
2024-03-22 16:57:13 +00:00
86082f1fdc [aot_inductor] added runtime checks for input/output tensors in debug compile mode (#122047)
This PR added runtime checks to guard the dtypes and shapes of input/output tensors.
Currently, we enable these only for debug compilation
(i.e. aot_inductor.debug_compile is True) in abi_compatible mode.
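For reference, a usage sketch of the knob named above (assuming the config path quoted in the message):

```python
import torch._inductor.config as inductor_config

# Emit runtime dtype/shape checks for input/output tensors during
# AOTInductor debug compilation (abi_compatible mode).
inductor_config.aot_inductor.debug_compile = True
```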

Differential Revision: [D54993148](https://our.internmc.facebook.com/intern/diff/D54993148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122047
Approved by: https://github.com/desertfire
2024-03-22 16:40:33 +00:00
90a13c3c5b Added a check in register_lowering to avoid decomposed ops (#117632)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117632
Approved by: https://github.com/lezcano
2024-03-22 16:38:31 +00:00
9347a79f1c [Watchdog Timer] Clear timer for already terminated process (#122324)
Summary:
Handle cases where a worker process is terminated without releasing its timer request; this scenario causes the process to be reaped at expiry.

Remove the non-existent process during clear timer.

Test Plan: unit tests

Differential Revision: D55099773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122324
Approved by: https://github.com/d4l3k
2024-03-22 15:48:03 +00:00
018f5e2c32 Fix unused variable warning in int4mm.cu (#122286)
Fix the following warning while compilation:
```
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_weight_int4pack_mm_cuda(const at::Tensor&, const at::Tensor&, int64_t, const at::Tensor&)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:871:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
  871 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu: In function ‘at::Tensor at::native::_convert_weight_to_int4pack_cuda(const at::Tensor&, int64_t)’:
/home/pytorch/aten/src/ATen/native/cuda/int4mm.cu:1044:6: warning: variable ‘stream’ set but not used [-Wunused-but-set-variable]
 1044 |   auto stream = at::cuda::getCurrentCUDAStream();
      |      ^~~~~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122286
Approved by: https://github.com/soulitzer
2024-03-22 15:46:18 +00:00
7fd14ebb52 [export] Use randomized inputs to examples. (#122424)
Summary: as title. Replace all torch.ones inputs with torch.randn.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D55206441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122424
Approved by: https://github.com/tugsbayasgalan
2024-03-22 15:32:28 +00:00
60bc29aa0b Revert "[Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)"
This reverts commit 2c6eeb26d3f61fba352ad51fd8653120937a20f3.

Reverted https://github.com/pytorch/pytorch/pull/122267 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
b30b396d05 Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)"
This reverts commit 99f0fec7d0873d627e8c7f2dec65818d725424b0.

Reverted https://github.com/pytorch/pytorch/pull/122268 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
777ac511cc Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)"
This reverts commit 783fd89ff1cf401e484c20d14b16823abf20d87d.

Reverted https://github.com/pytorch/pytorch/pull/122373 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
dbedc6bb7c Revert "[Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)"
This reverts commit 23a6d74f9352e0afb37750fee300d077c4ba9393.

Reverted https://github.com/pytorch/pytorch/pull/122374 on behalf of https://github.com/jeanschmidt due to Not sure if this PR caused breakages in main rocm jobs, I'll remerge if reverting does not fix it ([comment](https://github.com/pytorch/pytorch/pull/122267#issuecomment-2015294491))
2024-03-22 15:04:30 +00:00
02fee6caec Revert "Change ATEN generator argument type to const std::optional<Generator>& (#120076)"
This reverts commit ecbe82b9cec75324b7efb58e1d9cae6b35b71bdc.

Reverted https://github.com/pytorch/pytorch/pull/120076 on behalf of https://github.com/jeanschmidt due to Reverting in order to check if this will fix XLA trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/120076#issuecomment-2015272644))
2024-03-22 14:53:45 +00:00
e6986e4317 Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.
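A minimal usage sketch (shapes chosen for illustration):

```python
import torch

values = torch.randn(10, 5)             # packed jagged values
offsets = torch.tensor([0, 2, 5, 10])   # three constituents of lengths 2, 3, 5
nt = torch.nested.nested_tensor_from_jagged(values, offsets=offsets)

assert nt.size(0) == 3                                 # batch dimension
assert nt.values().data_ptr() == values.data_ptr()     # a true view of `values`
```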

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113279, #113280
2024-03-22 14:48:22 +00:00
65c37fe05a AOTAutograd: ensure traced tangent subclass metadata takes non-contiguous outputs into account (#118669)
Fixes https://github.com/pytorch/pytorch/issues/118596.

The issue was as follows:

(1) Whenever AOTAutograd sees a non-contiguous output that it needs a tangent for, it forces the tangent that it generates to be contiguous during tracing

(2) However: if this tangent is a subclass, we need to generate code to flatten/unflatten the subclass at runtime.

(3) To do so, we use the metadata stashed here: https://github.com/pytorch/pytorch/blob/main/torch/_functorch/_aot_autograd/schemas.py#L231

(4) However, this metadata was **wrong** - it was generated by inspecting the tangent, **before** we made the tangent contiguous.

The fix in this PR basically moves the logic that makes `traced_tangents` contiguous earlier, to the time when we first generate `ViewAndMutationMetadata`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118669
Approved by: https://github.com/zou3519
ghstack dependencies: #118803, #119947
2024-03-22 14:42:27 +00:00
09be5800c8 dynamo: support placement kwargs for DTensor.to_local() (#119947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119947
Approved by: https://github.com/wanchaol, https://github.com/yoyoyocmu
ghstack dependencies: #118803
2024-03-22 14:42:27 +00:00
2e44b12dd4 dynamo: handle DTensor.device_mesh.device_type (#118803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118803
Approved by: https://github.com/wanchaol, https://github.com/yanboliang
2024-03-22 14:42:22 +00:00
ea8e0c75c7 [quant][pt2] Fix create FQ with FixedQParamsQSpec (#122104)
Summary: Before, we just returned a _PartialWrapper object when
using FixedQParamsQuantizationSpec in QAT. This is wrong; we
should return an FQ (fake quantize) object instead.

Differential Revision: [D55021106](https://our.internmc.facebook.com/intern/diff/D55021106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122104
Approved by: https://github.com/jerryzh168
2024-03-22 14:23:05 +00:00
6e6891e843 [jit] Fix _batch_norm_with_update shape function (#122430)
Summary: We used `native_batch_norm`'s shape function before,
but the schemas are actually different. We need to create new
shape functions for `_batch_norm_with_update` specifically.

Test Plan:
buck2 test '@fbcode//mode/opt-tsan' fbcode//caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit - TestShapeGraphLinting.Basic'

Reviewers: bdhirsh, davidberard98, eellison

Differential Revision: [D55211182](https://our.internmc.facebook.com/intern/diff/D55211182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122430
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2024-03-22 14:21:57 +00:00
23a6d74f93 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardTanh with int8-mix-bf16 (#122374)
**Summary**
Enable the fusion pattern of `QConv2d -> hardtanh` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardtanh_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122374
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268, #122373
2024-03-22 13:13:14 +00:00
f65373e278 Revert "Factor meta conversion through serializable MetaTensorDesc (#122044)"
This reverts commit e2d89e970480d7e5b10a77928442d8caf94e0e84.

Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))
2024-03-22 12:46:21 +00:00
700c92e1b9 [Inductor Cutlass backend] GEMM size threshold for Cutlass backend usage (#121491)
* Adds a configurable GEMM size threshold for the usage of Cutlass GEMM Kernels **_inductor.config.cutlass_backend_min_gemm_size**

 * During GEMM algorithm choice generation: **if no viable choices can be generated using the configured backends, the ATen backend will be used as a fallback backend**, even if it is not enabled in **_inductor.config.max_autotune_gemm_backends**

Test plan:
CI
Additional unit test in test_cutlass_backend.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121491
Approved by: https://github.com/jansel
ghstack dependencies: #121490
2024-03-22 10:58:43 +00:00
d34514f8db Renamed mutationlayout/aliasedlayout (#122474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122474
Approved by: https://github.com/jansel
ghstack dependencies: #121624
2024-03-22 08:32:14 +00:00
eca30df846 Added load_args to repro (#121624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121624
Approved by: https://github.com/ezyang
2024-03-22 08:32:14 +00:00
783fd89ff1 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op HardSwish with int8-mix-bf16 (#122373)
**Summary**
Enable the fusion pattern of `QConv2d -> hardswish` lowering for int8-mixed-bf16 case.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_hardswish_int8_mixed_bf16_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122373
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267, #122268
2024-03-22 08:17:57 +00:00
99f0fec7d0 [Quant] [Inductor] Enable the Inductor Lowering of QConv2d post op SiLU (#122268)
**Summary**
Enable the fusion pattern of `QConv2d -> silu` lowering to `swish` as `QConv2d` post operator.
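For reference, a toy module exhibiting the conv -> SiLU pattern this lowering targets (the fusion itself happens in the PT2E quantization + Inductor lowering flow, not in eager mode):

```python
import torch

class ConvSiLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, kernel_size=3)

    def forward(self, x):
        # After PT2E quantization and Inductor lowering, this pair becomes
        # a single QConv2d with a swish post-op.
        return torch.nn.functional.silu(self.conv(x))
```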

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_silu_int8_mixed_bf16_cpu
python -m pytest test_mkldnn_pattern_matcher.py -k test_qat_qconv2d_silu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122268
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266, #122267
2024-03-22 08:15:28 +00:00
bb75313f0a [dynamo] Optimize handling of BINARY_OP (#122465)
This saves ~0.1s on https://dev-discuss.pytorch.org/t/a-torchdynamo-trace-time-ablation-study/1961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122465
Approved by: https://github.com/oulgen
2024-03-22 08:14:58 +00:00
2c6eeb26d3 [Quant] [PT2] Add SiLU into X86InductorQuantizer Conv2d Unary Annotation (#122267)
**Summary**
Add `SiLU` into X86InductorQuantizer Conv2d Unary Annotation

**TestPlan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_unary
python -m pytest test_x86inductor_quantizer.py -k test_qat_conv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122267
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #122266
2024-03-22 08:12:23 +00:00
6bbd697306 [Inductor] Make codecache CUDA compilation more robust & flexible (#121490)
Minor changes which make the CUDA compilation within _inductor/codecache.py
more robust and flexible.

Test plan:
CI
Additional test in test_codecache.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490
Approved by: https://github.com/jansel
2024-03-22 08:12:11 +00:00
a337ee0a3a [Quant] Enable QConv2d with silu post op (#122266)
**Summary**
Enable QConv2d implementation with post op `silu`

**Test Plan**
```
python -m pytest test_quantized_op.py -k test_qconv2d_silu_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122266
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-03-22 07:58:45 +00:00
b78e8c0d37 remove duplicate method run_subtests (#122421)
Fixes #121654

I have removed the duplicate test `run_subtests` from `common_dtensor.py` and `common_fsdp.py` and moved it to `common_distributed.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122421
Approved by: https://github.com/soulitzer
2024-03-22 07:00:49 +00:00
6ba85cfc2a Fixed memory leak in Python dispatcher w.r.t. THPDevice. (#122439)
Fixes the memory leak reported in #122417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122439
Approved by: https://github.com/soulitzer
2024-03-22 06:44:12 +00:00
3600778ede Do not create a new node if no normalization is needed (#122330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122330
Approved by: https://github.com/jansel
2024-03-22 05:51:28 +00:00
e2d89e9704 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR is pretty involved, so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented it changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it. (A toy sketch of this two-level
  memoization follows this list.)

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.
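A toy sketch of the two-level memoization described in the first bullet (plain Python objects stand in for tensors; this is not the actual MetaConverter code):

```python
import itertools
import weakref

class TwoLevelMemo:
    def __init__(self):
        self._ids = itertools.count()
        # real object -> stable int id; the entry dies with the real object
        self._real_to_id = weakref.WeakKeyDictionary()
        # stable int id -> meta/fake stand-in; the entry dies with the stand-in
        self._id_to_meta = weakref.WeakValueDictionary()

    def stable_id(self, real):
        if real not in self._real_to_id:
            self._real_to_id[real] = next(self._ids)
        return self._real_to_id[real]

    def lookup(self, real):
        return self._id_to_meta.get(self.stable_id(real))

    def store(self, real, meta):
        self._id_to_meta[self.stable_id(real)] = meta
```

Note how dropping the real object only clears the first table, while a stable id held out of band can still address the second.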

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
2024-03-22 03:56:34 +00:00
ecbe82b9ce Change ATEN generator argument type to const std::optional<Generator>& (#120076)
This PR proposes to use `const std::optional<Generator>&` for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
2024-03-22 03:49:31 +00:00
ef0d470eb3 [vision hash update] update the pinned vision hash (#122453)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122453
Approved by: https://github.com/pytorchbot
2024-03-22 03:37:11 +00:00
fb57d1699b [export] Fix handling output in remove_effect_tokens_pass (#122357)
Added handling for updating the output_spec in the graph signature if the result of a with_effects call is an output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122357
Approved by: https://github.com/zhxchen17
2024-03-22 03:35:59 +00:00
09eb07bee8 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issues for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-22 03:31:04 +00:00
e419011471 [inductor] Add torch.while_loop support to JIT Inductor (#122069)
Summary: `torch.while_loop` HOP support is added to JIT Inductor. The test coverage is limited due to the functionality constraints of the upstream `torch.while_loop` op in Dynamo / Export. When those are lifted, we'll add more tests (see TODO-s in the test file).

AOT Inductor support will be added in a follow-up PR.

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 38 tests in 159.387s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122069
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-22 02:45:27 +00:00
5e0440edb4 Revert "Optimize multi_tensor_apply (take 2) (#119764)"
This reverts commit 0b68a28c87df2c6eb2cf530be4659b5a2f8a95b0.

Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk 0b68a28c87.  Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))
2024-03-22 02:18:28 +00:00
470b44c048 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
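A minimal usage sketch (assuming the jagged layout path; since the result is a real view, gradients propagate back into `t`):

```python
import torch

t = torch.randn(3, 4, 5, requires_grad=True)   # batch of 3 same-sized constituents
nt = torch.nested.as_nested_tensor(t, layout=torch.jagged)
nt.values().sum().backward()                   # grads flow through the view into t
assert t.grad is not None
```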
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #113279
2024-03-22 02:12:37 +00:00
cd6bfc7965 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This ops is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-22 02:12:36 +00:00
bde22835c6 [PT2] - Guard oblivious on meta registrations (#122216)
Summary:
```
[trainer0|0]:Potential framework code culprit (scroll up for full backtrace):
[trainer0|0]:  File "/mnt/xarfuse/uid-539346/56d4bb3d-seed-nspid4026531836_cgpid183208940-ns-4026531840/torch/_meta_registrations.py", line 5043, in scatter_gather_dtype_check
[trainer0|0]:    if index.numel() != 0:
```

Test Plan: General CI.

Reviewed By: ezyang

Differential Revision: D54689183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122216
Approved by: https://github.com/ezyang
2024-03-22 01:36:03 +00:00
4f93b3d958 [Dort] Reduce excessive warning to info (#122442)
No need to warn when an op can be exported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122442
Approved by: https://github.com/thiagocrepaldi
2024-03-22 01:09:33 +00:00
a001b4b048 Inductor: Don't clamp views when the views come from split_with_sizes (#122149)
Summary:
Fixes #122126

`split_with_sizes` don't need clamping.

Test Plan: Added test + CI

Differential Revision: D55043320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122149
Approved by: https://github.com/ezyang
2024-03-22 00:55:36 +00:00
b1fa0ce4aa [export] build the infra to rollout predispatch export. (#122326)
Test Plan:
fbcode:caffe2/test/quantization:test_quantization
fbcode:bolt/nn/executorch/backends/tests:qnn_test
fbcode:on_device_ai/helios/compiler_tests/...
fbcode:pyspeech/tests:pyspeech_utils_test_oss
fbcode:caffe2/test:quantization_pt2e_qat
fbcode:on_device_ai/Assistant/Jarvis/tests:test_custom_ops
fbcode:modai/test:test_modai
fbcode:executorch/exir/backend/test:test_partitioner

Differential Revision: D55133846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122326
Approved by: https://github.com/tugsbayasgalan
2024-03-22 00:55:10 +00:00
4b535906aa Better handle test-config labels on PR (#122155)
I have some minor fixes in the scripts to

1. Fix the bug where the empty test matrix was confusingly printed as unstable https://github.com/pytorch/pytorch/pull/121381#issuecomment-2004558588
1. Replace `print` with `logging.info`
1. Remove the hardcoded `VALID_TEST_CONFIG_LABELS` list.  It's out of date and not many people use this feature besides `test-config/default`, so why bother.  The behavior here is simpler now:
    1. If the PR has some `test-config/*` labels, they will be applied
    1. If the PR has none of them, all test configs are applied
1. Add log for the previous 2 cases to avoid confusion

### Testing

```
python filter_test_configs.py --workflow "Mac MPS" --job-name "macos-12-py3-arm64 / build" --event-name "push" --schedule "" --branch "" --tag "ciflow/mps/121381" \
  --pr-number 121065 \
  --test-matrix "{ include: [
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-stable" },
    { config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-14" },
  ]}
 ```

Also running on this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122155
Approved by: https://github.com/clee2000
2024-03-21 23:20:52 +00:00
bce640709c Revert "Precompile triton templates (#121998)"
This reverts commit b8df2f0ca530ebe01fa079c891c170a1f4b22823.

Reverted https://github.com/pytorch/pytorch/pull/121998 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is causing all ROCm trunk job to fail b8df2f0ca5 ([comment](https://github.com/pytorch/pytorch/pull/121998#issuecomment-2014003037))
2024-03-21 23:05:59 +00:00
c4486d3e88 Allow fake models to run with ONNXProgram.__call__ (#122230)
In order for a fake model to run using the ONNXProgram.__call__
interface, we need to save the model to disk along with external data
before executing the model. This is what this PR implements.

An alternative would be for ONNXProgram.__call__ to detect that the model
was exported with fake mode and explicitly raise an exception when
ONNXProgram.__call__ is executed. The exception message would instruct
the user to call ONNXProgram.save and manually execute the model using
the ONNX runtime of choice.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122230
Approved by: https://github.com/BowenBao
ghstack dependencies: #122196
2024-03-21 22:28:05 +00:00
4ba51bb2c4 Add keys used for templated attention impls (#122423)
# Summary

Mypy will complain that these attributes don't exist for this PR: https://github.com/pytorch/pytorch/pull/121845/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122423
Approved by: https://github.com/bdhirsh
2024-03-21 22:16:53 +00:00
224beecee6 Revert "Proper view support for jagged layout NestedTensor (#113279)"
This reverts commit 5855c490f09a028bfdfefea8b93c9833eb55dc5c.

Reverted https://github.com/pytorch/pytorch/pull/113279 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113279#issuecomment-2013899762))
2024-03-21 22:03:01 +00:00
12e7602cf9 Revert "Support for torch.nested.as_nested_tensor(t) (#113280)"
This reverts commit 17c9c7026521be1c194cae278b76ac8e8f7d145b.

Reverted https://github.com/pytorch/pytorch/pull/113280 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/113280#issuecomment-2013893099))
2024-03-21 22:00:44 +00:00
816db3bd29 Revert "Public API for NJT construction from jagged components (#121518)"
This reverts commit d4dff9cf5e7b734a8621b571e8f5a761dc43e1e0.

Reverted https://github.com/pytorch/pytorch/pull/121518 on behalf of https://github.com/jbschlosser due to Need to fix BC thing ([comment](https://github.com/pytorch/pytorch/pull/121518#issuecomment-2013879641))
2024-03-21 21:56:29 +00:00
48afb5c325 [inductor] Use python constants in IndexPropagation (#122031)
In the next PR I have the IR `ops.neg(ops.constant(0.0, torch.float32))`
which should be folded to `ops.constant(-0.0, torch.float32)` but it seems that
`sympy.Float(-0.0)` doesn't respect the sign of the zero and so we instead
get a positive zero constant.

Here, I work around this by doing the constant folding with python arithmetic
which does respect signed zeros.
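A small runnable illustration of the signed-zero behavior described above (the sympy output is per the observation in this message):

```python
import math
import sympy

neg_zero = -(0.0)
print(math.copysign(1.0, neg_zero))   # -1.0: Python keeps the sign of zero

# Per the observation above, the sign does not survive the sympy round-trip:
print(math.copysign(1.0, float(sympy.Float(neg_zero))))   # expected 1.0 here
```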

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122031
Approved by: https://github.com/lezcano
2024-03-21 21:53:22 +00:00
99055ae165 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
However, there also seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78, https://github.com/bdhirsh
2024-03-21 21:51:32 +00:00
332456c44d triton_kernel_wrap shouldn't access FakeTensor.data_ptr (#122418)
The comment suggests that we need to replace all FakeTensors with real
tensors. `torch.empty` doesn't actually return a real Tensor because
FakeTensorMode is active!

We disable torch dispatch so that torch.empty actually returns a real Tensor.
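Roughly, the pattern is (a sketch using internal helpers whose exact import paths are assumptions and may move between versions):

```python
import torch
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
from torch.utils._mode_utils import no_dispatch

with FakeTensorMode():
    t = torch.empty(4)                   # dispatch intercepted: a FakeTensor
    assert isinstance(t, FakeTensor)
    with no_dispatch():
        r = torch.empty(4)               # dispatch disabled: a real tensor
    assert not isinstance(r, FakeTensor)
```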

The motivation for this PR is that we're trying to ban
FakeTensor.data_ptr (or at least warn on it) in torch.compile. See the
next PR up in the stack

Test Plan:
- Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122418
Approved by: https://github.com/oulgen
2024-03-21 21:48:07 +00:00
621fdc9db8 infer_schema can add alias annotations when passed a list of mutated args (#122343)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122343
Approved by: https://github.com/ezyang
ghstack dependencies: #122319, #122320
2024-03-21 21:39:07 +00:00
639d6201b4 Expand the types infer_schema can infer (#122320)
This PR allows it to infer:
- None return as ()
- List[Tensor] as Tensor[]

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122320
Approved by: https://github.com/ezyang, https://github.com/soulitzer
ghstack dependencies: #122319
2024-03-21 21:39:07 +00:00
0dd78f1828 Add standalone tests for infer_schema (#122319)
We're gonna reuse this helper in the new python custom ops API. Given a
function with type annotations, `infer_schema(fun)` returns an inferred
schema.
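For illustration, annotated functions like the following map to the schema strings in the comments (expected results per this stack, including the `None` -> `()` and `List[Tensor]` -> `Tensor[]` rules from the previous PR; the exact import location of `infer_schema` is internal):

```python
from typing import List

import torch

def mysin(x: torch.Tensor) -> torch.Tensor:
    return x.sin()

def mynoop(x: torch.Tensor) -> None:
    x.add_(1)

def mysplit(x: torch.Tensor) -> List[torch.Tensor]:
    return list(x.unbind())

# infer_schema(mysin)   -> "(Tensor x) -> Tensor"
# infer_schema(mynoop)  -> "(Tensor x) -> ()"
# infer_schema(mysplit) -> "(Tensor x) -> Tensor[]"
```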

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122319
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2024-03-21 21:39:04 +00:00
23524710e6 [dynamo] use proxies to nn.Module in dynamo generated GraphModules (#120756)
Fixes remaining refleaks found when debugging https://github.com/pytorch/pytorch/issues/119607, tests added in https://github.com/pytorch/pytorch/pull/120657.

Also fixes some tests that xfail: https://github.com/pytorch/pytorch/issues/120631 (not entirely sure why), but introduced tests now fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120756
Approved by: https://github.com/jansel
2024-03-21 21:23:12 +00:00
2cd0a5d516 [Inductor] Fix for WrapperCodeGen.statically_known_int_or_none (#121808)
There's obviously a small typo in WrapperCodeGen.statically_known_int_or_none,
where the return value of a call to V.graph._shape_env._maybe_evaluate_static
is being discarded.

This fix changes that to work how it was likely intended to.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121808
Approved by: https://github.com/lezcano, https://github.com/jansel, https://github.com/aakhundov
2024-03-21 21:15:32 +00:00
968c4c4154 Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 74deacbf31d032a2659dc1633dc3e5248921d466.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:17 +00:00
13afbcfc85 Revert "Support gpu trace on XPU (#121795)"
This reverts commit 91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2.

Reverted https://github.com/pytorch/pytorch/pull/121795 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk 74deacbf31, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))
2024-03-21 20:33:16 +00:00
182bb0f2ca Revert "Introduce XPU implementation for PyTorch ATen operators (#120891)"
This reverts commit 148a8de6397be6e4b4ca1508b03b82d117bfb03c.

Reverted https://github.com/pytorch/pytorch/pull/120891 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert it to resolve a conflict in trunk https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013434523.  Please help reland the change after ([comment](https://github.com/pytorch/pytorch/pull/120891#issuecomment-2013668563))
2024-03-21 20:30:20 +00:00
628dcde136 [AOTI] Disable stack allocation when there is a fallback op (#122367)
Summary: Stack allocation is disabled when there is an aten fallback op, see c84f81b395/torch/_inductor/codegen/cpp_wrapper_cpu.py (L974). But we need to do the same when there is a custom op fallback.

Test Plan: CI

Reviewed By: mikekgfb

Differential Revision: D55149369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122367
Approved by: https://github.com/mikekgfb
2024-03-21 20:02:33 +00:00
af9b71c82f fix typo in while_loop_test (#122416)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122416
Approved by: https://github.com/angelayi
2024-03-21 19:42:08 +00:00
d131cbc44f Fuse the input -> p2p buffer copy into one-shot all-reduce kernel when the input is small (#121213)
This improves the gpt-fast llama2 70B 8xH100 (non-standard) TP benchmark from 86 tok/s to 88 tok/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121213
Approved by: https://github.com/Chillee
2024-03-21 18:25:57 +00:00
765c3fc138 fix breaking changes for ONNX Runtime Training (#122000)
Fixes breaking changes for ONNX Runtime Training.

PR https://github.com/pytorch/pytorch/pull/121102 introduced an incompatibility with ORT training because of a change in parameter type. This PR adds back the previous parameter types, verified to work with ORT training.

Error with current scenario:

```
site-packages/onnxruntime/training/ortmodule/torch_cpp_extensions/cpu/aten_op_executor/aten_op_executor.cc:60:40: error: invalid conversion from ‘const DLManagedTensor*’ to ‘DLManagedTensor*’ [-fpermissive]
at::Tensor tensor = at::fromDLPack(dlpack);

site-packages/torch/include/ATen/DLConvertor.h:15:46: note:   initializing argument 1 of ‘at::Tensor at::fromDLPack(DLManagedTensor*)’
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
```
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122000
Approved by: https://github.com/malfet
2024-03-21 18:10:22 +00:00
c2651a7f0e Make check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372)
Partially fixes https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122372
Approved by: https://github.com/lezcano
ghstack dependencies: #122370
2024-03-21 17:14:42 +00:00
780f70b728 Make expected stride test in torch._prims_common size oblivious (#122370)
Partially addresses https://github.com/pytorch/pytorch/issues/113002

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122370
Approved by: https://github.com/lezcano
2024-03-21 17:14:42 +00:00
25bf5f7e61 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit aa74a8b9e5b34eaa700a64064818adc7a12942ca.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/huydhn due to Sorry for revert your change one more time but the hard part is that it breaks lot of internal builds ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2013043364))
2024-03-21 17:07:17 +00:00
b8df2f0ca5 Precompile triton templates (#121998)
Before this PR we were not precompiling triton templates in parallel. Compilation would occur during benchmarking.

Triton benchmarking templates were emitted as :

```
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

In order to precompile we need to give the full kernel specification, as we do when we emit the template in the final output code generation.

```
@triton_heuristics.template(
    num_stages=3,
    num_warps=8,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())]},
    inductor_meta={'kernel_name': 'Placeholder.DESCRIPTIVE_NAME', 'backend_hash': 'cdeecfeccd31ad7810f96b5752194b1c2406d0a81e39a6ca09c8ee150baae183'},
)
@triton.jit
def triton_mm(arg_A, arg_B, out_ptr0):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121998
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275, #121997
2024-03-21 17:04:53 +00:00
17175cdbc7 [Docs] Add extended debugging options for troubleshooting (#122028)
Fixes #120889

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-21 17:00:45 +00:00
c20bc18d59 [export] allow static constraints in dynamic_shapes (#121860)
This PR allows users to specify int values for dimensions in dynamic_shapes as well as None, for example:

```
class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        ...

foo = Foo()
inputs = (torch.randn(4, 6), torch.randn(5, 4), torch.randn(3, 3))

for dynamic_shapes in [
    None,
    ((4, 6), (5, 4), (3, 3)),
    ((None, 6), None, {0: 3, 1: 3}),
]:
    _ = export(foo, inputs, dynamic_shapes=dynamic_shapes)
```

All of the above should produce the same ExportedProgram.

This is done by temporarily creating a static dim constraint during analysis, where vr.lower == vr.upper. These constraints are then deleted during _process_constraints(), and do not show up in the final ExportedProgram's range_constraints.

Additionally, export() will also fail if the shapes are mis-specified, for example:
```
_ = export(foo, inputs, dynamic_shapes=((5, None), None, None))
```
leads to `torch._dynamo.exc.UserError: Static shape constraint of 5 does not match input size of 4, for L['x'].size()[0]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121860
Approved by: https://github.com/avikchaudhuri
2024-03-21 16:59:59 +00:00
16935de961 Support alias for NestedTensorCPU/CUDA (#117711)
Fixes #ISSUE_NUMBER

Co-authored-by: Vincent Moens <vmoens@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117711
Approved by: https://github.com/albanD
2024-03-21 16:05:52 +00:00
148a8de639 Introduce XPU implementation for PyTorch ATen operators (#120891)
As a follow-up to #114835 and #119682, we add a limited set of ATen operator implementations for XPU. With this PR, the blocking issues for oneDNN operations and the Inductor XPU backend will be resolved, as the two components depend on these operations to support their basic features.

The added ATen operators include:

- `copy_`, `_to_copy`, `_copy_from_and_resize`, `clone`
- `view`, `view_as_real`, `view_as_complex`,
- `as_strided`, `_reshape_alias`, `resize_`, `resize_as_`,
- `add`/`add_`, `sub`/`sub_`, `mul`/`mul_`, `div`/`div_`, `abs`,
- `empty`, `empty_strided`,
- `fill_`, `zeros_`.

Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120891
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/atalman
2024-03-21 15:42:20 +00:00
204fd69ca6 Make ONNXProgram.model_proto and disk file the same (#122196)
Currently, the in-memory onnx program model proto does
not contain initializers saved into the disk version.

This PR changes this behavior so that both versions are
identical. This is important for running models with fake
tensors from ONNXProgram.model_proto directly, without a file.
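A usage sketch (toy model for illustration):

```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)
onnx_program = torch.onnx.dynamo_export(model, x)

proto = onnx_program.model_proto   # now carries the same initializers...
onnx_program.save("model.onnx")    # ...as the file written here
```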
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122196
Approved by: https://github.com/BowenBao
2024-03-21 15:29:31 +00:00
f9996ed764 [BE] Enable torch inductor tests running on MacOS (#122360)
The original idea was to limit the testing to just x86 Macs, but right now it will be skipped on all Apple Silicon ones, as all of them have a Metal-capable GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122360
Approved by: https://github.com/jansel
2024-03-21 14:47:05 +00:00
456b112dca [inductor] Support non-Tensor predicate in torch.cond (#122378)
Summary: Previously, we only supported a torch.Tensor boolean scalar predicate in `torch.cond` in Inductor. This PR adds support for SymBool and Python bool predicates, to match the `torch.cond` [semantics](https://pytorch.org/docs/stable/generated/torch.cond.html) in Dynamo / Export.
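A minimal sketch of the newly supported predicate kinds (toy functions; actually lowering through Inductor requires a compile-capable setup):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x, flag: bool):
    # The predicate may now be a plain Python bool or a SymBool,
    # not only a scalar boolean tensor.
    return torch.cond(flag, true_fn, false_fn, (x,))
```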

Test Plan:

```
$ python test/inductor/test_control_flow.py
...
----------------------------------------------------------------------
Ran 34 tests in 56.980s

OK

$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 54 tests in 460.093s

OK (skipped=4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122378
Approved by: https://github.com/jansel, https://github.com/chenyang78
2024-03-21 14:35:01 +00:00
0b68a28c87 Optimize multi_tensor_apply (take 2) (#119764)
### Take 2

The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153:
- Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication.
- Ensure the optimization is compatible with cuda graph.

### Summary

Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops.

Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach:
- When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments.
- Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel.

This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`.
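A toy Python sketch of the two-path dispatch policy described above (illustrative only; the real logic lives in the CUDA C++ implementation):

```python
KERNEL_ARG_LIMIT_BYTES = 4096  # static kernel-argument space

def launch_multi_tensor_apply(arg_bytes, launch_with_static_args, launch_with_memcpy):
    if arg_bytes <= KERNEL_ARG_LIMIT_BYTES:
        # Arguments fit: pass them through static kernel-argument memory.
        launch_with_static_args()
    else:
        # Arguments don't fit: transfer them via a page-locked cudaMemcpyAsync
        # and perform the entire workload in a single kernel launch.
        launch_with_memcpy()
```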

### Benchmark (WIP)

The only benchmark I've conducted so far is `_foreach_copy_` on a set of sizes that resembles an internal workload. I need to run benchmarks on more problem sizes. The speedup should vary among problem sizes. **However, I believe this PR should not be slower than the previous impl on any problem sizes.**

The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa).

**Baseline**

A single iteration in trace:
<img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json
device ms: 1.111, cpu ms: 7.151
memory bandwidth: 1169.825 GB/s
```

**This PR**

A single iteration in trace:
<img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b">

```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json
device ms: 0.892, cpu ms: 0.810
memory bandwidth: 1456.744 GB/s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764
Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar
2024-03-21 11:53:31 +00:00
0d8e960f74 Revert "[Sparsity] add support for H100 compute capability 9.x (#121768)"
This reverts commit 91fdaa1b416ab8ac8be30f3c3428751e236657cd.

Reverted https://github.com/pytorch/pytorch/pull/121768 on behalf of https://github.com/jeanschmidt due to Agreed on reverting and fixing rocm tests ([comment](https://github.com/pytorch/pytorch/pull/121768#issuecomment-2011893826))
2024-03-21 10:42:08 +00:00
7f8bb1de83 [Dynamo][2/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122362)
This PR continues to clean clang-tidy warnings in torch/csrc/dynamo/*, following #122259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122362
Approved by: https://github.com/ezyang
2024-03-21 09:41:41 +00:00
ea1cd31b50 [c10d] Log the target of FR dump (#122345)
Summary: It would be useful to log the destination of the trace dump, whether in Manifold or a local file, so that users can quickly locate the dump.

Test Plan: Modified unit tests

Differential Revision: D54972069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
2024-03-21 08:03:05 +00:00
365e89a591 Add tensor step to adadelta (#122252)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Adadelta step update while compiling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252
Approved by: https://github.com/janeyx99
2024-03-21 07:28:47 +00:00
7fa1be506b Add an option to sdpa benchmark to specify backend (#122368)
# Summary
Adds the ability to specify sdpa backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122368
Approved by: https://github.com/cpuhrsch
2024-03-21 07:00:40 +00:00
18c164ef7c [Inductor] Match insignficiant strides on outputs (#122239)
Fix for https://github.com/pytorch/pytorch/issues/116433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122239
Approved by: https://github.com/Chillee
2024-03-21 05:35:59 +00:00
b915877deb Support numpy array in Tensor.__eq__ (#122249)
When the `other` arg of `Tensor.__eq__` is a numpy array, it is converted to a PyTorch tensor view of the numpy array, which is then given as the `other` arg to a `Tensor.eq` call.
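For example:

```python
import numpy as np
import torch

t = torch.tensor([1.0, 2.0, 3.0])
a = np.array([1.0, 0.0, 3.0])
print(t == a)   # tensor([ True, False,  True])
```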

Fixes #119965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122249
Approved by: https://github.com/ezyang
2024-03-21 04:55:01 +00:00
bf18e967b4 [c10d] disable compute_duration by default (#122138)
Summary:
Computing duration would incur additional CUDA overhead and possibly
increased GPU memory usage and even hangs, so we want to disable it by default and enable it only
when needed, or at least only when timing is enabled.

Test Plan:
Test with existing unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122138
Approved by: https://github.com/wconstab
2024-03-21 04:45:37 +00:00
ea6f67853e [inductor fbcode] Add python include paths for Python.h (#122363)
Summary:
We're getting errors that Python.h is not found because we didn't have
the proper include path set up for it.

bypass-github-export-checks

Test Plan: I can only get this to show up in Bento: N5106134

Reviewed By: hl475, chenyang78

Differential Revision: D55133110

Co-authored-by: Bert Maher <bertrand@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363
Approved by: https://github.com/bertmaher
2024-03-21 04:32:17 +00:00
d4dff9cf5e Public API for NJT construction from jagged components (#121518)
This PR introduces `torch.nested.nested_tensor_from_jagged(values, offsets=None, lengths=None, jagged_dim=1)` (bikeshedding welcome). This is intended to be the main entrypoint for getting an NJT from the `(values, offsets, lengths)` components. The returned NJT is a view of the `values` component.

Note that `torch.nested.nested_tensor()` / `torch.nested.as_nested_tensor()` already exist for constructing an NJT from a list of tensors.

TODO:
* Some doc formatting; suggestions welcome there
* Tests / examples using `jagged_dim != 1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121518
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #113280
2024-03-21 04:14:17 +00:00
17c9c70265 Support for torch.nested.as_nested_tensor(t) (#113280)
This PR adds support for tensor inputs to `as_nested_tensor()`. The tensor is treated as a batch of consistently-sized constituents. It utilizes `_nested_view_from_values_offsets()` to return a real view that allows for propagating gradients into inputs.
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113280
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-03-21 04:13:55 +00:00
77bed8f7f2 [ONNX] model_type flag is only supported under SKIP_XFAIL_SUBTESTS (#122336)
Fixes #120918

To address the confusion that developers usually have about which list to put xfail and skip in, this PR provides guidance that `model_type`- and `matcher`-specified xfail/skip should go in `SKIP_XFAIL_SUBTESTS`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122336
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2024-03-21 04:10:32 +00:00
cc0cadaf4c [vision hash update] update the pinned vision hash (#122154)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122154
Approved by: https://github.com/pytorchbot
2024-03-21 03:59:12 +00:00
61f69c7fc4 [audio hash update] update the pinned audio hash (#122153)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122153
Approved by: https://github.com/pytorchbot
2024-03-21 03:53:24 +00:00
885fb9742d Handle special kwargs in user-written Triton kernel calls (#122280)
Summary: Special kwargs like `num_warps`, `num_stages`, and `num_ctas` can be passed to the Triton kernel call as kwargs. These kwargs are handled in a special way, not being passed to the underlying kernel function directly. In this PR, we move those special kwargs from `kwargs` of the `TritonKernelVariable` in dynamo to `Autotuner`'s `Config` instances (either already existing or newly created for this purpose). As a result, the special kwargs can be codegened correctly as a part of `Config`, not as direct arguments to the kernel `.run`.
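A sketch of the user-facing pattern this enables (hypothetical kernel; requires CUDA and Triton):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_one(ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(ptr + offs, tl.load(ptr + offs, mask=mask) + 1, mask=mask)

@torch.compile
def f(x):
    n = x.numel()
    # num_warps/num_stages are launch options, not kernel arguments; they
    # are now codegened as part of a Config rather than passed to .run.
    add_one[(triton.cdiv(n, 128),)](x, n, BLOCK=128, num_warps=4, num_stages=2)
    return x
```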

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_kwargs
...
----------------------------------------------------------------------
Ran 6 tests in 6.783s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122280
Approved by: https://github.com/oulgen
2024-03-21 03:34:07 +00:00
3e6fdea390 [ONNX] Fix list dtype finding bug in dispatcher (#122327)
Fixes #122166

Before this PR, the dispatcher assumed the first input provides the dtype, but `aten::index` reveals cases with `None` at the front of the inputs. This PR addresses that by selecting the first non-`None` input to take the dtype from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122327
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2024-03-21 02:54:58 +00:00
ae913175c3 Fix GraphModuleDeserializer (#122342)
Summary: self.constants is used in self.deserialize_signature()

Test Plan: CI

Differential Revision: D55152971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122342
Approved by: https://github.com/zhxchen17
2024-03-21 02:27:39 +00:00
e9dcda5cba Graph-Safe RNG State Exchange for Tensor Parallelism (#114068)
See #113541

The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.

cc  @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
2024-03-21 01:57:08 +00:00
91ead3eae4 Support gpu trace on XPU (#121795)
# Motivation
Support GPU trace on the XPU backend by adding GPU trace to the XPU runtime. This is also a prerequisite for generalizing the device caching allocator in the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121795
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #121794
2024-03-21 01:56:42 +00:00
74deacbf31 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor GPU trace to be device-agnostic. GPU trace is typically used in runtime components, including Device, Stream, Event, Guard, and Allocator; it should be device-agnostic and shareable across device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-21 01:52:58 +00:00
57734202c6 [HSTU][TGIF] Provide a API to check whether running in torch_dispatch mode (#122339)
Summary: We provide an `is_in_torch_dispatch_mode` API returning `bool` to determine whether the program is running under a torch dispatch mode.
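
A minimal usage sketch (the import path is an assumption, based on the `_python_dispatch` file touched in the related formatting diff further down this log):
```
import torch
from torch.utils._python_dispatch import TorchDispatchMode, is_in_torch_dispatch_mode

class NoopMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

assert not is_in_torch_dispatch_mode()
with NoopMode():
    assert is_in_torch_dispatch_mode()  # a dispatch mode is now active
    torch.ones(2) + 1                   # dispatch routes through NoopMode
```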

Test Plan:
- OSS CI
- Tested publishing of hstu models with this diff and the following diffs D54964288, D54964702, D54969677, D55025489; runtime errors are no longer raised in publish

Differential Revision: D55091453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122339
Approved by: https://github.com/jiayisuse
2024-03-21 01:37:23 +00:00
e38d60bc07 Remove some stale xla dynamo backend (#122128)
`torchxla_trace_once` and `aot_torchxla_trivial` should be removed.

In our internal torchbench daily runs (hopefully the dashboard can be open-sourced soon), the `openxla` backend has a much higher passing rate and similar performance to `openxla_eval` (the non-AOTAutograd backend). We still use `openxla_eval` in the llama2 example, but I think we should move users to the `openxla` backend going forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122128
Approved by: https://github.com/alanwaketan, https://github.com/jansel
2024-03-21 01:13:50 +00:00
c20cf97366 Move some cudagraphs checks into C++ (#122251)
Based off of https://github.com/pytorch/pytorch/pull/111094
Combined with C++ guards, this improves TIMM geomean optimizer performance by about 20%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122251
Approved by: https://github.com/eellison
2024-03-21 01:02:23 +00:00
be5863de39 Remove usage of deprecated volatile (#122231)
Summary:
When building our iOS app, we get a compile error about the deprecated `volatile` keyword.

This diff attempts to fix it by replacing the usage of the deprecated `volatile` keyword with `atomic` as suggested by malfet

Test Plan: Successfully built the iOS app that previously had a compile error

Differential Revision: D55090518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122231
Approved by: https://github.com/malfet
2024-03-21 00:55:16 +00:00
1686e2d1e4 [symbolic shapes][compile-time] Minor compile time optimization in has_free_symbols (#122144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122144
Approved by: https://github.com/lezcano
ghstack dependencies: #120726
2024-03-21 00:48:57 +00:00
cyy
c2eedb7f8a [Dynamo][1/N] Fix clang-tidy warnings in torch/csrc/dynamo/* (#122259)
This PR begins a series of works to ensure dynamo C++ code is clang-tidy clean.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122259
Approved by: https://github.com/ezyang
2024-03-21 00:43:25 +00:00
c80601f35a Revert "Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)"
This reverts commit a2a88f39ee991f471f2a2c54571886d70f5cd2e6.

Reverted https://github.com/pytorch/pytorch/pull/121537 on behalf of https://github.com/kurtamohler due to flaky CI failures ([comment](https://github.com/pytorch/pytorch/pull/121537#issuecomment-2010937226))
2024-03-21 00:03:30 +00:00
eqy
d5b5012dc4 [CUDA] Raise softmax_forward_64bit_indexing GPU memory requirement (#116075)
Printing `torch.cuda.memory_summary()` shows ~41 GiB reserved at the end of this test; it's unclear how it was passing previously on CUDA.

CC @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116075
Approved by: https://github.com/ptrblck, https://github.com/malfet
2024-03-21 00:03:17 +00:00
5855c490f0 Proper view support for jagged layout NestedTensor (#113279)
This PR:
* Introduces an ATen op for creating true jagged views from a dense values buffer
    * `_nested_view_from_jagged(values, offsets, lengths, ragged_idx, dummy)`
    * This op is implemented on the Python side using torch.library so we can return a subclass instance
    * `jagged_from_list()` now uses this instead of the old autograd.Function `NestedViewFromBuffer`
    * The latter op is used for non-contiguous JTs returned via `torch.nested.narrow()`
    * `dummy` is an awful hack to ensure that `NestedTensor.__torch_dispatch__()` is invoked for our view
* Introduces an ATen op for accessing the `values` component of an NT via a view
    * `_nested_get_values(nt)`
* **Removes** the autograd.Functions `ViewNestedFromBuffer` and `ViewBufferFromNested` in favor of `nested_from_values_offsets()` / `nested_from_values_offsets_lengths()` and `nt.values()`, respectively.
* Changes test code to prefer `as_nested_tensor()` over `jagged_from_list()` directly
    * Similarly, avoid `buffer_from_jagged()`, preferring `values()`
* Depends on general subclass view fake-ification on the PT2 side (handled solely in previous PRs in the stack)

With these changes, the semantics of jagged layout NTs are such that they are considered a true view of the underlying `values` buffer. This means views of jagged NTs are views of the underlying buffer as well, simplifying some handling.
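
A minimal sketch of the resulting view semantics, using the public `nested_tensor_from_jagged` constructor from the PR further up this log (shapes are illustrative):
```
import torch

values = torch.randn(6, 3)
offsets = torch.tensor([0, 2, 6])
nt = torch.nested.nested_tensor_from_jagged(values, offsets=offsets)
values.mul_(0)                   # mutate the underlying buffer...
assert (nt.values() == 0).all()  # ...and the NT observes it through the view
```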

Differential Revision: [D54269922](https://our.internmc.facebook.com/intern/diff/D54269922)
Co-authored-by: voznesenskym <voznesenskym@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113279
Approved by: https://github.com/ezyang
2024-03-20 23:45:34 +00:00
057892f4be [CPU] optimize Lp norm for 1-dimensional vector (#122143)
Fixes https://github.com/pytorch/pytorch/issues/120229

- Optimize vector norm by simplifying the vector norm formula for 1-dimensional vectors.
- The vector norm formula for a 1-dimensional vector simplifies to `abs(x)`. See below for proof.
- Next step: we can similarly optimize matrix norm (`torch.linalg.matrix_norm`) for 1 x 1 matrices.
- Additionally, this avoids overflow in the power computation, `abs(x) ** p`, for large `p` or `x` on 1-dimensional vectors.
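
Concretely, restating the linked proof for a single-element vector $x = (x_1)$ with $p \neq 0$:

$$\|x\|_p = \big(|x_1|^p\big)^{1/p} = |x_1|, \qquad \|x\|_{+\infty} = \max_i |x_i| = |x_1|, \qquad \|x\|_{-\infty} = \min_i |x_i| = |x_1|.$$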

### Performance
Avg Latency (ms) of `torch.norm` and `torch.linalg.vector_norm` for
`torch.norm(torch.randn(2**18, 1), ord, -1)`
`torch.linalg.vector_norm(torch.randn(2**18, 1), ord, -1)`
Tested on 28 physical cores/socket, 1 socket on Skylake.

("Optimized" is commit 7102f1ef372b248414d36cbd0c51a546b6b6a41a.)

| op | input shape | dim | ord | baseline (master) | optimized | speedup (baseline/optimized) |
|--------------------------|------------|-----|------|------------|-----------|----------|
| torch.norm               | (2**18, 1) | -1  | fro  | 34.3755531 | 0.0125408 | 2741.094 |
|                          |            |     | inf  | 34.0952635 | 0.0122237 | 2789.271 |
|                          |            |     | -inf | 34.3674493 | 0.0120759 | 2845.953 |
|                          |            |     | 0    | 34.1004515 | 0.0175261 | 1945.69  |
|                          |            |     | 1    | 34.1688442 | 0.0121593 | 2810.089 |
|                          |            |     | -1   | 33.949492  | 0.0120282 | 2822.487 |
|                          |            |     | 2    | 34.3669581 | 0.0120401 | 2854.366 |
|                          |            |     | -2   | 33.9252067 | 0.0121069 | 2802.139 |
| torch.linalg.vector_norm | (2**18, 1) | -1  | inf  | 34.090879  | 0.0095105 | 3584.545 |
|                          |            |     | -inf | 34.3708754 | 0.0099111 | 3467.931 |
|                          |            |     | 0    | 34.0880775 | 0.0141716 | 2405.38  |
|                          |            |     | 1    | 34.1392851 | 0.0093174 | 3664.036 |
|                          |            |     | -1   | 33.925395  | 0.0092483 | 3668.302 |
|                          |            |     | 2    | 34.3854165 | 0.0092459 | 3719.002 |
|                          |            |     | -2   | 33.932972  | 0.0093007 | 3648.429 |

### Proof
<details>
<summary>For those interested :)</summary>

<img width="382" alt="1_dim_vector_norm_proof1" src="https://github.com/pytorch/pytorch/assets/93151422/59b1e00b-8fcd-47cb-877d-d31403b5195b">
<img width="432" alt="1_dim_vector_norm_proof2" src="https://github.com/pytorch/pytorch/assets/93151422/236bea15-2dd5-480b-9871-58b2e3b24322">

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122143
Approved by: https://github.com/lezcano
2024-03-20 23:20:25 +00:00
aa74a8b9e5 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as `[]` and `/`.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloading on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.
7. Fix the test app not linking to sleef on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-20 22:41:13 +00:00
666d6291af Cast checkpoint weights to match model parameter's dtype (#122100)
Fixes #121986
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122100
Approved by: https://github.com/BowenBao
2024-03-20 22:01:40 +00:00
2289fa5f5a [while_loop] fix mode not on stack error (#122323)
Fixes https://github.com/pytorch/pytorch/issues/121453.

This was caused by a missing `with mode:` when running under FakeTensor mode.

Test Plan:
add new tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122323
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #122244
2024-03-20 21:17:33 +00:00
512251c8f3 Use tree_map to get device ids and device types for activation checkpointing (#121462)
`get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs produces silently incorrect results: `get_device_states` returns an empty result, so no RNG state is saved here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188 since `fwd_device_states` is empty.

Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests.
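
A minimal repro sketch of the failure mode being fixed (assumes a CUDA device, since per-device RNG state is what was being dropped):
```
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def fn(batch):  # inputs nested inside a dict
    return F.dropout(batch["x"], p=0.5, training=True).sum()

batch = {"x": torch.randn(4, 4, device="cuda", requires_grad=True)}
# Before this fix, get_device_states() found no tensors in the nested dict,
# so no per-device RNG state was saved for the recomputation.
out = checkpoint(fn, batch, use_reentrant=False)
out.backward()
```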
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462
Approved by: https://github.com/soulitzer
2024-03-20 21:09:21 +00:00
cyy
1dd1899fd6 Add missing throw of std::runtime_error in dynamo/guards.cpp (#122306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122306
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 20:50:01 +00:00
d2a8d3864c [PT2][Inductor] Change the log for the group batch fusion (#122245)
Summary: Instead of logging under the generic "batch_fusion" and "group_fusion" names, we log under the specific pass name, which better summarizes the hits of each pattern and aids debugging.

Test Plan:
```
buck2 test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
```

Differential Revision: D55103303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122245
Approved by: https://github.com/jackiexu1992
2024-03-20 20:45:37 +00:00
61ff41f0ca [while_loop] disable closure capturing and manually set the inputs. (#122244)
For the while_loop operator, it's important to keep the output ordering consistent with the input ordering. Previously, we used set_graph_inputs="automatic", which doesn't respect that ordering. This PR changes it to "manual" and respects the original user inputs' ordering. We disable closures for the body and cond functions as they require additional design work; this PR just stops the bleeding.

Repro:
```python
import torch
from torch._higher_order_ops.while_loop import while_loop
from torch._functorch.aot_autograd import aot_export_module

class Nested(torch.nn.Module):
    def forward(self, ci, cj, a, b):
        def cond_fn(i1, j1, x1, y1):
            return i1 > 0
        def body_fn(i1, j1, x1, y1):
            def cond_fn_nested(i2, j2, x2, y2):
                return j2 > 0
            def body_fn_nested(i2, j2, x2, y2):
                return i2.clone(), j2 - 1, x2 + 3.14, y2 - 2.71
            i1, j1, x1, y1 = while_loop(
                cond_fn_nested, body_fn_nested, [i1, j1, x1, y1]
            )
            return i1 - 1, j1.clone(), x1 * 2, y1 / 2
        return while_loop(cond_fn, body_fn, (ci, cj, a, b))

nested = Nested()
torch.compile(nested, backend="eager", fullgraph=True)(torch.tensor(2), torch.tensor(2), torch.randn(2, 2), torch.randn(2, 2))
```

Test plan:
add new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122244
Approved by: https://github.com/aakhundov
2024-03-20 20:14:35 +00:00
2f6e8e84c5 Fix _chunk_cat.out issue (#122076)
# PR
Vectors allocated inside `get_chunk_cat_metadata()` went out of scope before being used in `_chunk_cat_out_cuda_contiguous()`. This PR fixes the issue by returning the vectors from `get_chunk_cat_metadata()`.
This PR also adds a few unit tests to cover more edge cases.

# Tests
This PR is tested with the following command and no error shows. So the flaky test error should be resolved.

- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32`
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1 python test/test_ops.py -v -k test_out__chunk_cat_cuda_float32 --repeat 1500`

Fixes #122026
Fixes #121950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122076
Approved by: https://github.com/yifuwang
2024-03-20 20:01:38 +00:00
c84f81b395 [export] add pass to remove auto functionalized hop (#122246)
Summary: Adds a pass that blindly removes the auto_functionalized HOP without checking whether doing so is safe. Useful for ExecuTorch today and other use cases that have additional logic to reason about when this pass is safe to use.

Test Plan: added unit test

Differential Revision: D55103867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122246
Approved by: https://github.com/angelayi
2024-03-20 19:31:52 +00:00
d813474363 [Pytorch] auto format _python_dispatch file (#122226)
Summary: Auto format the _python_dispatch file, to make D55091453 easier to review

Test Plan: `arc lint`

Differential Revision: D55091454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122226
Approved by: https://github.com/aakhundov
2024-03-20 19:28:39 +00:00
821ad56ea6 [CI] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes related to enable ARC to run on build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

The old-style `runs-on` definition accepts a string; the new style requires an object in the following format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group, the format of the YAML needs to change. Unfortunately, there is no way I am aware of to accomplish this with any trick in the book, because GitHub Actions YAML is not templatable and supports only minimal functions/replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-20 19:06:10 +00:00
91fdaa1b41 [Sparsity] add support for H100 compute capability 9.x (#121768)
Summary: as title

Test Plan: buck test mode/opt //caffe2/test/...

Differential Revision: D54792168

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121768
Approved by: https://github.com/SherlockNoMad
2024-03-20 19:00:54 +00:00
d1e8b97387 [export] Log module hierarchy. (#121970)
Summary:
We can also log the module hierarchy in the following format:
```
:ToplevelModule
sparse:SparshArch
dense:DenseArch
```
This records more information about the model's identity.

Test Plan: CI

Differential Revision: D54921097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121970
Approved by: https://github.com/angelayi
2024-03-20 18:59:42 +00:00
0696db8202 Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit 17489784b635187316c6c856c5fe6b6a28d8a15a.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))
2024-03-20 18:34:43 +00:00
1d13c82559 Precompile in background (#121997)
Precompile benchmarking choices in parallel, and then wait on those choices prior to benchmarking. In the case of deferred templates, we wait only on the choices needed in the scheduler, allowing multiple separate lowerings to compile in parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121997
Approved by: https://github.com/jansel
ghstack dependencies: #121996, #120275
2024-03-20 18:34:12 +00:00
65eb22158e Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit afc4c9382ff8b55da848ef40b4a17a92fb3d2ad6.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/huydhn due to Broke dynamo tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2010276712))
2024-03-20 18:04:53 +00:00
072935917b Update cuda_to_hip_mappings.py (#122110)
Added one datatype mapping (cuda_bf16.h), and a number of cub/hipcub mappings. Note: the missing mappings were discovered when hipifying the Mamba model's (https://github.com/state-spaces/mamba) forward kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122110
Approved by: https://github.com/jithunnair-amd, https://github.com/Skylion007
2024-03-20 17:17:53 +00:00
334f7e43f9 [TD] Remove credentials requirement for retrieval (#122279)
Made the bucket readable by public
https://s3.console.aws.amazon.com/s3/buckets/target-determinator-assets?region=us-east-1&bucketType=general&tab=permissions

The only jobs that matter here are the retrieval and td jobs, which were both successful

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122279
Approved by: https://github.com/huydhn
2024-03-20 15:55:46 +00:00
2e02e1efad Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.

Fixes https://github.com/pytorch/pytorch/issues/122127
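
A minimal repro sketch of the scenario (the dynamo config flag enables tracing of ops with dynamic output shapes):
```
import torch
import torch._dynamo.config as dynamo_config

dynamo_config.capture_dynamic_output_shape_ops = True  # let dynamo trace nonzero

@torch.compile()
def f(x):
    return torch.nonzero(x)

with torch.inference_mode():
    f(torch.tensor([0.0, 1.0, 0.0, 2.0]))  # previously hit the `_version`-based memo path
```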

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-20 14:44:55 +00:00
15a8185cd3 Revert "Enable x86 CPU vectorization on windows [submodule sleef] (#118980)"
This reverts commit 2b060983809e5fe8706acd085fff67b6a27bfb5f.

Reverted https://github.com/pytorch/pytorch/pull/118980 on behalf of https://github.com/zou3519 due to This caused build failures for 2+ pytorch devs, so we're reverting it to be safe ([comment](https://github.com/pytorch/pytorch/pull/118980#issuecomment-2009661069))
2024-03-20 14:10:12 +00:00
06db0a9f78 Revert "Upgrade submodule sleef to fix build warning (#122168)"
This reverts commit eec8b252b70b2489aee7281d336eb9c32dd85483.

Reverted https://github.com/pytorch/pytorch/pull/122168 on behalf of https://github.com/zou3519 due to trying to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/122168#issuecomment-2009653474))
2024-03-20 14:05:58 +00:00
8a94005d46 [dynamo][runtime_asserts] Ignore failures on sorting sympy relations (#122205)
Differential Revision: [D55075500](https://our.internmc.facebook.com/intern/diff/D55075500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122205
Approved by: https://github.com/ezyang
2024-03-20 13:25:37 +00:00
afc4c9382f Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as symbolic integer
- Update gen_variable_type.py script to call the symint version of zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-20 13:09:19 +00:00
17489784b6 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING by torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-20 13:09:19 +00:00
eb1d6ed9f9 [Inductor] fix addmm fusion check (#121953)
Fixes #121253.

To avoid functional issues, disable the pattern match for `addmm` when `beta` is not 1 or 0, or when `alpha` is not 1, since neither `mkl_linear` nor `mkldnn_linear` accepts `beta` or `alpha` as parameters.
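
For reference, the `beta`/`alpha` semantics of `addmm` that the MKL/MKLDNN linear kernels cannot express:
```
import torch

M, a, b = torch.randn(2, 3), torch.randn(2, 4), torch.randn(4, 3)
out = torch.addmm(M, a, b, beta=0.5, alpha=2.0)  # computes 0.5 * M + 2.0 * (a @ b)
assert torch.allclose(out, 0.5 * M + 2.0 * (a @ b))
```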

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121953
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-03-20 09:22:51 +00:00
ee6ce31b1d [BE][fix] fix test_tp_random_state and add it to periodic test list (#122248)
Fixes #122184. Adds the test to the periodic test list so that CI can catch the error in the future.

**Test**:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122248
Approved by: https://github.com/wanchaol
2024-03-20 08:24:14 +00:00
a1d02b423c XFAIL detectron2_maskrcnn_r_101_c4 CPU inductor accuracy (#122263)
This started to fail in trunk after the stack https://github.com/pytorch/pytorch/pull/122066 landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122263
Approved by: https://github.com/jansel
2024-03-20 08:03:34 +00:00
477d154ffd [dynamo] Add missing _nonvar_fields annotations (#122219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122219
Approved by: https://github.com/anijain2305
ghstack dependencies: #122218
2024-03-20 07:53:18 +00:00
46bf37b3f7 [dynamo] Replace VariableTracker.apply with visit/realize_all (#122218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122218
Approved by: https://github.com/anijain2305
2024-03-20 07:53:18 +00:00
a0db2e4237 [dynamo] Fixed handling of ImportError (#122222)
Fixes #122088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122222
Approved by: https://github.com/anijain2305
2024-03-20 07:52:01 +00:00
7832efb242 [export] skip nn_module_stack verifier for non-fx.GraphModule modules (#122210)
Downstream users of torch.export may have different module classes (e.g. LoweredBackendModule), which cannot be checked for metadata in the same way. Add lines to skip this for non-fx.GraphModule modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122210
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-03-20 07:40:48 +00:00
7d2b2dec4b [Pytoch][Vulkan] Register run_conv1d_context (#122172)
Summary: We have rewritten `conv1d` as `create_conv1d_context` and `run_conv1d_context` to enable prepack of `weight` and `bias`. We have registered `create_conv1d_context` but not `run_conv1d_context`. We add the registration in this diff.

Test Plan:
```
[luwei@devbig439.ftw3 /data/users/luwei/fbsource (f89a7de33)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Using additional configuration options from /home/luwei/.buckconfig.d/experiments_from_buck_start
Recommended: For faster builds try buck2: replace 'buck' with 'buck2'
NOTE: buck-out/ has changed: look for files in fbsource/buck-out/v2/
'buck2 build --show-output //xplat/caffe2:pt_vulkan_api_test_bin' will print the new output paths.

If you are building in fbsource//xplat and have questions, post in 'Cross Platform Dev Discussions': https://fb.workplace.com/groups/xplat.qa

  Targets matching .buckconfig buck2.supported_projects:
  {'//xplat/caffe2:pt_vulkan_api_test_bin': '//xplat'}

  To suppress this warning: touch ~/.config/.dont_hint_buck2

Building: finished in 0.1 sec (100%) 394/394 jobs, 0/394 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.conv1d_simple
[       OK ] VulkanAPITest.conv1d_simple (208 ms)
[ RUN      ] VulkanAPITest.conv1d
[       OK ] VulkanAPITest.conv1d (81 ms)
[----------] 2 tests from VulkanAPITest (289 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (289 ms total)
[  PASSED  ] 2 tests.
```

full test result
```
...
[----------] 427 tests from VulkanAPITest (22583 ms total)

[----------] Global test environment tear-down
[==========] 427 tests from 1 test suite ran. (22583 ms total)
[  PASSED  ] 426 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 11 DISABLED TESTS
```

Differential Revision: D55052816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122172
Approved by: https://github.com/nathanaelsee
2024-03-20 07:36:23 +00:00
e7141d117f [IntraNodeComm] refactor rendezvous into a separate method for better code organization and error handling (#120968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120968
Approved by: https://github.com/wanchaol
2024-03-20 06:54:25 +00:00
cyy
9f572b99a6 [Clang-tidy header][29/N] Enable clang-tidy warnings in aten/src/ATen/core/*.h (#122190)
This PR enables clang-tidy in `aten/src/ATen/core/*.h`, which ends the series of patches beginning from #122015.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122190
Approved by: https://github.com/Skylion007
2024-03-20 06:17:37 +00:00
11e64b4ba8 [dtensor] aten.cat to use stack strategy approach (#122209)
This PR switches aten.cat to use the strategy approach similar to aten.stack, as these two ops share similar semantics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122209
Approved by: https://github.com/wz337
2024-03-20 04:19:25 +00:00
5b7ceab650 Support auto_functionalize in pre-dispatch (#122177)
Summary: Title

Test Plan: CI

Differential Revision: D55042061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122177
Approved by: https://github.com/zou3519
2024-03-20 04:17:58 +00:00
dc89d8b74a Fix broken lint after #116876 (#122253)
Trivial fixes, so let's do this instead of reverting the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122253
Approved by: https://github.com/clee2000
2024-03-20 04:09:00 +00:00
de950039fc Use .get in xml parsing (#122103)
Check that the `classname` attribute actually exists.
#122017
I expect this route to happen very rarely
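
A minimal sketch of the situation (element and attribute names are illustrative):
```
import xml.etree.ElementTree as ET

case = ET.fromstring('<testcase name="test_foo" time="0.01"/>')  # no classname
classname = case.attrib.get("classname", "")  # .get avoids a KeyError
```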

At a certain point, we should just remove this parsing altogether since everything uses pytest now...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122103
Approved by: https://github.com/huydhn
2024-03-20 04:07:49 +00:00
6662627c89 Add APIs for custom device using TensorIteratorBase. (#120792)
1) add operand and get_dim_names API;
2) set will_resize to true when output tensor is undefined;
3) add abs_stub for dummy device and calculate on cpu device;
4) support dummy device copy with stride;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120792
Approved by: https://github.com/ezyang
2024-03-20 03:51:09 +00:00
f8565c4a28 [sigmoid] Clean up serialization API. (#122102)
Summary: Entirely remove the old serializer code to avoid further confusion and code bloat.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D54857118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122102
Approved by: https://github.com/tugsbayasgalan
2024-03-20 03:45:36 +00:00
1f8177dedf [Inductor][CPU] fix flash attention last_stride!=1 issue (#122083)
Fixes #121174.

Conv converts the input of SDPA to channels-last, resulting in an accuracy issue. Ensure the correct layout in lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122083
Approved by: https://github.com/eellison, https://github.com/jgong5
2024-03-20 02:22:33 +00:00
cyy
55310e58a9 Use constexpr for index variables (#122178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122178
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-03-20 02:20:17 +00:00
eec8b252b7 Upgrade submodule sleef to fix build warning (#122168)
Subsequent PR to https://github.com/pytorch/pytorch/pull/118980, fix sleef build warning.

submodule sleef, include this sleef PR: https://github.com/shibatch/sleef/pull/514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122168
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-03-20 02:14:56 +00:00
cbbed46377 Defer selection of triton template (#120275)
Our prior approach to epilogue fusion was to select a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in the following ways:

- We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster
- We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing.

In this PR, we wait to select either the Triton template or the extern kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton template + epilogue is faster than the unfused choice, we finalize the MultiTemplateBuffer as that specific template. If no fusion occurs, we finalize the MultiTemplateBuffer after the fusion pass completes.

Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time.

Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275
Approved by: https://github.com/jansel
ghstack dependencies: #121996
2024-03-20 01:40:33 +00:00
e5e0685f61 Revert "[dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)"
This reverts commit 88ebdbc97c103271766203df6662240e95a09b42.

Reverted https://github.com/pytorch/pytorch/pull/122098 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the distributed failure looks legit as it is also failing in trunk 88ebdbc97c ([comment](https://github.com/pytorch/pytorch/pull/122098#issuecomment-2008483316))
2024-03-20 01:12:24 +00:00
19d6004b97 add int8 woq mm pattern matcher (#120985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120985
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/eellison
2024-03-20 00:23:41 +00:00
6fefc52a2b Set py3.x build-environment name consistently (#122247)
https://github.com/pytorch/pytorch/pull/122157 checks for the Python version using `"$BUILD_ENVIRONMENT" != *py3.8*`, but some build environments use a different style with `py3_8` instead, causing numpy 2.x to be wrongly installed there, e.g. 03b987fe3f
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122247
Approved by: https://github.com/malfet
2024-03-19 23:56:38 +00:00
6c659bbc36 [codemod][lowrisk] Remove unused exception parameter from caffe2/c10/mobile/CPUCachingAllocator.cpp (#116875)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: kimishpatel, palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116875
Approved by: https://github.com/Skylion007
2024-03-19 23:52:09 +00:00
6b95dc8884 [codemod][lowrisk] Remove unused exception parameter from caffe2/torch/csrc/jit/frontend/lexer.cpp (#116876)
Summary:
`-Wunused-exception-parameter` has identified an unused exception parameter. This diff removes it.

This:
```
try {
    ...
} catch (exception& e) {
    // no use of e
}
```
should instead be written as
```
} catch (exception&) {
```

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116876
Approved by: https://github.com/Skylion007
2024-03-19 23:51:26 +00:00
d0153ca755 use make_storage_impl to create storages for COWStorage. (#121896)
Thanks to https://github.com/pytorch/pytorch/pull/118459, `make_storage_impl` uses the function registered by other backends to create a StorageImpl.

`make_storage_impl` completely supersedes `make_intrusive<StorageImpl>`, so it makes sense to replace `make_intrusive<StorageImpl>` with `make_storage_impl` when creating storage in COW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121896
Approved by: https://github.com/ezyang
2024-03-19 23:40:15 +00:00
4aaf25bc38 delete useless cast_outputs call in unary_op_impl_float_out (#120486)
The cast_outputs function is only used for the CPU device, and it is already called in the cpu_*_vec helpers, such as cpu_kernel_vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120486
Approved by: https://github.com/ezyang
2024-03-19 23:37:06 +00:00
2980779d0b [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/tt_pad_op.h (#120177)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120177
Approved by: https://github.com/Skylion007
2024-03-19 23:36:52 +00:00
2239b55cd1 Add some more sanity asserts to checkPoolLiveAllocations (#122223)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122223
Approved by: https://github.com/eellison
2024-03-19 23:26:19 +00:00
139647d317 Fix #83241: torch.nn.TripletMarginLoss allowed margin less or equal to 0 (#121978)
Documentation states that the margin parameter of torch.nn.TripletMarginLoss must be greater than 0; however, any value was being accepted. Also fixed torch.nn.TripletMarginWithDistanceLoss, which had the same problem, and added an error test input for the new ValueError.
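
A minimal sketch of the new validation (assuming a non-positive margin now raises `ValueError`):
```
import torch

try:
    torch.nn.TripletMarginLoss(margin=0.0)  # margin must be > 0
except ValueError as e:
    print(e)
```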

Fixes #83241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121978
Approved by: https://github.com/mikaylagawarecki
2024-03-19 23:19:11 +00:00
a843bbdb21 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/graphmatcher.cc (#118116)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: malfet, dmm-fb

Differential Revision: D52981072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118116
Approved by: https://github.com/Skylion007
2024-03-19 22:45:43 +00:00
f05af9e377 [codemod] Remove unused variables in caffe2/caffe2/opt/nql/ast.h (#120176)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120176
Approved by: https://github.com/Skylion007
2024-03-19 22:44:51 +00:00
03b987fe3f [CI] Test that NumPy-2.X builds are backward compatible with 1.X (#122157)
By compiling PyTorch against 2.x RC, but running all the tests with Numpy-1.X

This has no effect on binary builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122157
Approved by: https://github.com/atalman
2024-03-19 22:40:26 +00:00
f8becb626f [codemod] Remove unused variables in caffe2/caffe2/contrib/fakelowp/spatial_batch_norm_fp16_fake_op.h (#120178)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D53779549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120178
Approved by: https://github.com/Skylion007
2024-03-19 22:36:38 +00:00
94eb940a02 [codemod] Remove unused variables in caffe2/caffe2/operators/softmax_op_cudnn.cc (#121995)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D54931224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121995
Approved by: https://github.com/Skylion007
2024-03-19 22:35:58 +00:00
a6aa3afa77 [codemod] Remove unused variables in caffe2/caffe2/video/video_decoder.cc (#122151)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D54378401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122151
Approved by: https://github.com/Skylion007
2024-03-19 22:34:17 +00:00
a80c60ad8f [codemod] Remove unused variables in caffe2/caffe2/operators/conv_op_cudnn.cc (#122161)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122161
Approved by: https://github.com/Skylion007
2024-03-19 22:33:19 +00:00
02f436da6d [codemod][bugfix] Fix addressing bug in caffe2/caffe2/video/video_input_op.h (#121856)
Summary:
# Diff Specific

The signature of `copyFrom` is
```
void Tensor::CopyFrom(const Tensor& src, bool async) {
```
so `&context` always evaluates to true.

I could dig around to see if anyone cares about what the flag should actually be, but this is old code in caffe2, so I've just used `true` and we'll keep using whatever behaviour we've been using since 2019 or so when this was written.

# General

A bug in this code was identified by `-Waddress`, which we are working to enable globally.

This diff fixes the bug. There are a few types of fixes it might employ:

The bug could be `const_char_array == "hello"` which compares two addresses and therefore is almost always false. This is fixed with `const_char_array == std::string_view("hello")` because `string_view` has an `==` operator that makes an appropriate comparison.

The bug could be `if(name_of_func)` which always returns true because the function always has an address. Likely you meant to call the function here!

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121856
Approved by: https://github.com/Skylion007
2024-03-19 22:28:06 +00:00
1c4887d52b fix dlrm accuracy test in max-autotune (#122012)
torchrec_dlrm training fails the accuracy check when max-autotune is enabled.

I found there is no real issue in PT2: we failed to get fp64 reference results for the accuracy check. In max-autotune mode, numerics may change a bit and cause the cosine similarity check to fail. Using an fp64 baseline is more reliable and makes the test pass.

The reason we were not using an fp64 baseline earlier is that torchrec uses a dataclass [Batch](99e6e669b5/torchrec/datasets/utils.py (L28)) to represent the input. We use pytree to cast the model and inputs to fp64, but pytree cannot look into a dataclass. My fix converts the dataclass to a namedtuple to be more pytree-friendly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122012
Approved by: https://github.com/jansel, https://github.com/eellison
2024-03-19 22:23:42 +00:00
c71554b944 Revert "[aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)"
This reverts commit 206da97b8b61f51041f67de68e68e9a1875589ab.

Reverted https://github.com/pytorch/pytorch/pull/122052 on behalf of https://github.com/huydhn due to Although this look fixed on OSS, it is still failing internally.  I have added the reproducible buck command in the diff D55046262 ([comment](https://github.com/pytorch/pytorch/pull/122052#issuecomment-2008253185))
2024-03-19 22:22:12 +00:00
7678be4667 Replace numel with sym_numel in is_int_or_symint (#122145)
Fixes https://github.com/pytorch/pytorch/issues/122124

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122145
Approved by: https://github.com/Skylion007
2024-03-19 21:58:43 +00:00
6915a5be70 Increase numel limit to 2^63 for replicatepad1d (#122199)
Summary: As title

Test Plan:
```
CUDA_VISIBLE_DEVICES=5 buck2 test mode/opt //caffe2/test:test_nn_cuda -- test_replicatepad_64bit_indexing
```

Also benchmarked in N5106027
```
device_ms, cpu_ms, gb/device_ms*1000
# before changes
11.058772478103638 18.912256770000006 735.4118906278957
# after changes
10.621162576675415 18.58972748 765.7121070725207
```

Differential Revision: D55030372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122199
Approved by: https://github.com/ezyang
2024-03-19 21:55:34 +00:00
b12d297b44 [AARCH64] Hide FP16 scalar arithmetic behind proper feature flag (#122204)
On Apple Silicon:
```
% sysctl machdep.cpu.brand_string; clang -dM -E - < /dev/null|grep __ARM_FEATURE_FP16
machdep.cpu.brand_string: Apple M1
#define __ARM_FEATURE_FP16_FML 1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
On Graviton2 with respective `-march` flag:
```
# ./cpuinfo/build/cpu-info |grep Microarch -A1; gcc -dM -E - -march=armv8.2-a+fp16 </dev/null | grep __ARM_FEATURE_FP16
Microarchitectures:
	8x Neoverse N1
#define __ARM_FEATURE_FP16_SCALAR_ARITHMETIC 1
#define __ARM_FEATURE_FP16_VECTOR_ARITHMETIC 1
```
Test Plan: CI

Reviewed By: dimitribouche

Differential Revision: D55033347

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122204
Approved by: https://github.com/huydhn
2024-03-19 21:18:09 +00:00
901ba2be86 [quant][pt2e] Add support for conv transpose + bn + {relu} weights fusion in PTQ (#122046)
Summary:

also added some utils in xnnpack_quantizer_utils.py
* annotate_conv_tranpsose_bn_relu and annotate_conv_transpose_bn -> this is for QAT
* annotate_conv_transpose_relu

conv_transpose + bn weights fusion is performed automatically and cannot currently be disabled; we can add support for disabling this fusion later if needed.

Test Plan:
python test/test_quantization.py -k test_conv_transpose_bn_fusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122046
Approved by: https://github.com/andrewor14
2024-03-19 21:00:57 +00:00
bc1fef113d Respect TORCH_DISABLE_ADDR2LINE in symbolizer (#121359)
If TORCH_DISABLE_ADDR2LINE is set, the symbolizer will instead report the shared library's filename as the filename and the offset within that library as the line number, and use dladdr to get the function name if possible. This is much faster than using addr2line, and the symbols can later be resolved offline with addr2line if desired.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121359
Approved by: https://github.com/aaronenyeshi
2024-03-19 20:50:26 +00:00
7718a1cd4f T159183991: Error: EXC_SOFTWARE / SIGABRT at IGPyTorchFramework:-[MPSImageWrapperTrampoline endSynchronization:] (MPSImageWrapper.mm<line_num>):cpp_exception_clas (#122132)
Summary: Prevent crash by not throwing a C++ exception.

Test Plan: spongebobsandcastle

Reviewed By: SS-JIA

Differential Revision: D55036050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122132
Approved by: https://github.com/SS-JIA
2024-03-19 20:01:33 +00:00
c0b2e56c8f Support triton.language.dtype with torch.compile -- Second Attempt (#122141)
This PR is the second attempt at supporting `triton.language.dtype`, now instead of putting it on the graph, we put it on the side table since it is a constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122141
Approved by: https://github.com/jansel
ghstack dependencies: #122140
2024-03-19 19:40:52 +00:00
58a805da71 [UserDefinedTriton] Move constant args out of the fx graph (#122140)
@ezyang mentioned that we should not put constant args on the graph. Especially when there are args that would be trickier to put on the graph. E.g. next PR needs `triton.language.dtype` as an argument on the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122140
Approved by: https://github.com/jansel
2024-03-19 19:40:52 +00:00
c5ffebebab [export] allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910)
Creating this after [PR](https://github.com/pytorch/pytorch/pull/121642) got reverted.

The current dynamic shapes implementation fixes the lower bound of Dims to 2 for analysis but allows 0/1 shapes at runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0 and avoids erroring out on conflicts with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121910
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-03-19 19:08:05 +00:00
d56ab7b020 Revert "[torch export][serialize] create a more compact stacktrace format for serialization (#121675)"
This reverts commit eae89138d891d0310483c4d86dcb69b16de0a6b5.

Reverted https://github.com/pytorch/pytorch/pull/121675 on behalf of https://github.com/jeanschmidt due to It seems that this PR broke lint jobs, I am reverting to confirm if this is the case ([comment](https://github.com/pytorch/pytorch/pull/121675#issuecomment-2007919486))
2024-03-19 19:02:09 +00:00
36e5c1dcab Revert "Teach dynamo about torch.func.jvp (#119926)"
This reverts commit edd04b7c16cc6715411119bb7db234a9df59065f.

Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))
2024-03-19 18:59:46 +00:00
88999674a0 Revert "Update jvp to support symbolic execution. (#120338)"
This reverts commit 39877abee2c3ad1956013d467b0f6e86cd20acfb.

Reverted https://github.com/pytorch/pytorch/pull/120338 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/120338#issuecomment-2007898831))
2024-03-19 18:50:12 +00:00
e0d57001ef [codemod] Remove unused variables in caffe2/caffe2/experiments/operators/fully_connected_op_prune.h (#122165)
Summary:
LLVM-15 has a warning `-Wunused-but-set-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code, or (b) qualifies the variable with `[[maybe_unused]]`, mostly in cases where the variable _is_ used, but, eg, in an `assert` statement that isn't present in production code.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D54380402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122165
Approved by: https://github.com/Skylion007
2024-03-19 18:41:16 +00:00
6bd2d12bc7 release gil in prepareProfiler (#121949)
Initializing the profiler while holding the GIL can lead to deadlocks, as it makes some presumably synchronizing CUDA calls

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121949
Approved by: https://github.com/aaronenyeshi
2024-03-19 18:05:21 +00:00
7fb2d69282 [PT2] - Fix cat backwards wrapping on symints (#121527)
Summary:
Wrapping was comparing SymInts and ints, forcing a guard (see the stack trace below). Rewrote it with TORCH_GUARD_SIZE_OBLIVIOUS; a Python-side sketch of the pattern follows the trace.
```
[trainer0|0]:  File "<invalid>", line 0, in THPEngine_run_backward(_object*, _object*, _object*)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute(std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&, std::vector<at::Tensor, std::allocator<at::Tensor>> const&, bool, bool, bool, std::vector<torch::autograd::Edge, std::allocator<torch::autograd::Edge>> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::python::PythonEngine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::execute_with_graph_task(std::shared_ptr<torch::autograd::GraphTask> const&, std::shared_ptr<torch::autograd::Node>, torch::autograd::InputBuffer&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::CatBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor>>&&)
[trainer0|0]:  File "<invalid>", line 0, in torch::autograd::generated::details::cat_tensors_backward(at::Tensor const&, std::vector<std::vector<c10::SymInt, std::allocator<c10::SymInt>>, std::allocator<std::vector<c10::SymInt, std::allocator<c10::SymInt>>>> const&, std::vector<c10::ScalarType, std::allocator<c10::ScalarType>> const&, long)
[trainer0|0]:  File "<invalid>", line 0, in c10::operator==(c10::SymInt const&, int)
[trainer0|0]:  File "<invalid>", line 0, in c10::SymBool::guard_bool(char const*, long) const
[trainer0|0]:  File "<invalid>", line 0, in torch::impl::PythonSymNodeImpl::guard_bool(char const*, long)
```
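
For context, a minimal sketch of the equivalent pattern on the Python side, assuming the `guard_size_oblivious` helper from `torch.fx.experimental.symbolic_shapes` (the Python counterpart of the C++ macro):

```python
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def cat_dim_is_empty(size) -> bool:
    # With a plain bool this is a no-op; with a SymBool under dynamic shapes
    # it evaluates the expression size-obliviously, i.e. without installing
    # a guard that specializes 0/1-sized dimensions.
    return guard_size_oblivious(size == 0)

print(cat_dim_is_empty(0))  # True
```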

Test Plan: Regular CI

Differential Revision: D54667300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121527
Approved by: https://github.com/ezyang
2024-03-19 18:03:02 +00:00
8de4d86479 Back out "[fx] Preserve Fx graph node order in partitioner across runs (#115621)" (#122113)
Summary:
Original commit changeset: 6578f47abfdb

Original Phabricator Diff: D54913931

Differential Revision: D55027171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122113
Approved by: https://github.com/osalpekar
2024-03-19 18:00:37 +00:00
eae89138d8 [torch export][serialize] create a more compact stacktrace format for serialization (#121675)
Summary:
- we want fx nodes' stack trace format to be backward compatible and the same as before in the exported program
- however, in the serialized format we want to show a more compact stack_trace format, since otherwise the node attributes are dominated by stack traces
- the diff implements a minimal change in the serialization process to dedupe node stack traces by introducing a fileinfo_list and a filename_to_abbrev map, so an index can represent each filename and a lineno each line.

Test Plan:
# llm
based on D54497918
```
buck2 run @//mode/dev-nosan fbcode//executorch/examples/models/llama2:export_llama -- -c ~/stories110M.pt -p ~/params.json
```
set up breakpoint after serialization/deserialization
- serialize
```
(Pdb) v_meta = [n.meta for n in exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193647450
(Pdb) json_program = json.dumps(_dataclass_to_dict(serialized_graph.co_fileinfo_ordered_list),cls=EnumEncoder)
(Pdb) json_bytes = json_program.encode('utf-8')
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(json_bytes)).number
1193604333
(Pdb) sys.getsizeof(json_bytes)
3846
(Pdb) compressed_bytes = zstd.ZstdCompressor().compress(json_bytes)
(Pdb) sys.getsizeof(compressed_bytes)
1139
```
in P1193647450 (before serialization), search for `stack_trace`
in P1193604333 (after serialization), search for `stack_trace` and `co_fileinfo_ordered_list`

[note: didn't do compression in this diff since the size is pretty small and compression would add complexity]
- deserialize
```
(Pdb) v_meta = [n.meta for n in deserialized_exported_program.graph_module.graph.nodes]
(Pdb) paste_client.create_phabricator_paste_object(paste_creation_client_id=1093956601162697, content=str(v_meta)).number
1193629435
```
in P1193629435, search for `stack_trace`

# ads

Differential Revision: D54654443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121675
Approved by: https://github.com/angelayi
2024-03-19 17:58:12 +00:00
eqy
271b12c790 [Functorch] Bump tolerances for test_per_sample_grads_embeddingnet_mechanism_functional_call_cuda (#122014)
the `rtol` was indeed a problem on Grace Hopper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122014
Approved by: https://github.com/zou3519
2024-03-19 17:52:39 +00:00
ba9a1d96a4 Add scuba logging for TorchScript usage (#121936)
Summary: Infra to log live usage of TorchScript internally

Test Plan: manually tested

Differential Revision: D54923510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121936
Approved by: https://github.com/zhxchen17
2024-03-19 17:38:27 +00:00
4819da60ab [TD] Add LLM retrieval + heuristic (#121836)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121836
Approved by: https://github.com/osalpekar
2024-03-19 17:31:47 +00:00
cec0fd6f2f [pt2] add symbolic shape support for decompose mm and expose max_block to user config (#121440)
Summary:
1) As described in https://fb.workplace.com/groups/1075192433118967/permalink/1381918665779674/
As a follow-up, we can increase max_block["y"] to solve the issue
2) Add symbolic shape support for the decompose mm pass. I did not find a good way to compare a symint with an int, so when there is a symbolic shape, I assume it is a "large" dim.

Test Plan:
Without change block: aps-pt2-7c23cea900

increase y_block: aps-pt2_dynamic_shape-25a027423c

Differential Revision: D54525453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121440
Approved by: https://github.com/mengluy0125, https://github.com/Yuzhen11
2024-03-19 17:31:16 +00:00
764eae9c4e Revert "Add Flash Attention support on ROCM (#121561)"
This reverts commit a37e22de7059d06b75e4602f0568c3154076718a.

Reverted https://github.com/pytorch/pytorch/pull/121561 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this needs more work to be able to land in fbcode because https://github.com/ROCm/aotriton is not available there atm.  We are working to reland this change before 2.3 release ([comment](https://github.com/pytorch/pytorch/pull/121561#issuecomment-2007717091))
2024-03-19 17:14:28 +00:00
88ebdbc97c [dynamo] Forward OptimizedModule.__setattr__ to the wrapped module (#122098)
Fixes #114844

In the linked issue we have
```
compiled_module = torch.compile(module)
compiled_module.x = ...
compiled_module(...)  # Mutates self.x
```
Since the module mutates `self.x`, you would expect `compiled_module.x`
to be updated, but actually `compiled_module.x = ...` sets an attribute "x"
on the `OptimizedModule` object while the forward method of the module mutates
`module.x`.

This gives the expected behavior by forwarding `compiled_module.__setattr__`
down to `module.__setattr__`. There is already a corresponding `__getattr__`
so now `compiled_module.x` becomes an alias for `module.x`.
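
A minimal standalone sketch of the forwarding pattern (an illustration, not the actual `OptimizedModule` code, which keeps a few attributes local):

```python
class Wrapper:
    def __init__(self, mod):
        # Bypass our own __setattr__ so the wrapped module itself stays local.
        object.__setattr__(self, "_orig_mod", mod)

    def __getattr__(self, name):
        # Called only when normal lookup on the wrapper fails.
        return getattr(self._orig_mod, name)

    def __setattr__(self, name, value):
        # Forward all attribute writes to the wrapped module.
        setattr(self._orig_mod, name, value)

class M:
    pass

w = Wrapper(M())
w.x = 1
print(w.x, w._orig_mod.x)  # 1 1
```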

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122098
Approved by: https://github.com/ezyang, https://github.com/lezcano
2024-03-19 16:51:43 +00:00
2164b7f746 Flatten/Unflatten micro optimization in proxy_tensor.py (#121993)
Lowers compile time by 1s across all suites on average
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121993
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/zou3519
2024-03-19 16:49:28 +00:00
42624bceb6 Fixes nan with large bf16 values (#122135)
Fixes #121558

Performance on main:
``` Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.608132004970683 | 65.90210803551601  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.75877740024589  | 64.83824399765581  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.465420153690506 |  67.6770955324173  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.398148600477725 | 68.19829455344006  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 29.053532000398263 | 99.58901099162175  |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 |  27.826815698063   | 98.05690299253911  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 49.89655229728669  | 178.24282555375248 |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 48.840098950313404 | 174.5950729819015  |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 505.66218036692584 | 1865.9265094902366 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 295.0534054543823  | 967.3831606050952  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.496030446141958 | 55.11070846114308  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.47399884648621  | 55.452342028729625 |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.216444296995178 | 55.14447903260589  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 12.763233599252999 | 55.142355500720434 |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 19.409965351223946 |  74.9107634765096  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.02470579952933  | 74.84168506925926  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 46.37695319834165  | 172.19150450546294 |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 45.225963747361675 | 185.19691249821335 |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 634.3090848531574  | 2249.057865119539  |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 320.47313248040155 | 1053.0515247955916 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.448987301671878 | 63.63581650657579  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.509283400140703 | 63.059300999157124 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.71098779467866  | 105.55780201684684 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.264925852417946 | 105.12311349157244 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.218703348655254 | 222.87272597895935 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 43.55393464793451  | 230.63290398567915 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.02968645095825 | 514.6893998607993  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 157.13709802366793 | 624.5892751030624  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1776.7079547047617 | 6353.551096981391  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1143.6000745743513 | 3811.8767354171723 |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.717129248427227 | 55.35991647047922  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.746983398916198 | 55.76716404175386  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.255573300644752 | 106.47456656442955 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.46409669774584  | 108.07770595420152 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 46.63354124641045  | 213.74862996162847 |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 47.01801469782367  | 240.78139301855117 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 127.76448752265424 | 508.08745552785695 |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 168.6308984644711  | 667.2996102133766  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2268.1598202325404 | 7727.2648515645415 |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1242.8469699807465 | 4161.965740495361  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.340955897932872 | 93.72280450770633  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.25262250029482  |  93.2030284893699  |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.598425600444898 | 183.23776399483904 |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.362583553418514 | 183.51862096460536 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.52303148806094  | 383.50319798337296 |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.41743348259479  | 432.5502900755964  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 217.76640450116247 | 943.9354750793427  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 303.0781910638325  | 1225.4394043702632 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.8542854059488 | 12194.579601055011 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2268.1174043100327 | 7608.0941944383085 |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.289720651460811 | 95.88620596332476  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.618648946750909 | 95.56685149436818  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 31.567946751601994 | 180.62468653079122 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 28.611703700153157 | 189.4215695792809  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 84.11306998459621  | 385.25596749968827 |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 93.82540901424363  | 455.77428903197875 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 226.80530551588163 | 965.8026450779289  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 327.4116570246406  | 1312.5067745568228 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.5064804060385 | 15020.768146496266 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2433.0302356975153 | 8300.016750581563  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

Performance on this branch:
```Markdown
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
| batch_size | num_heads | q_seq_len | kv_seq_len | embed_dim | is_causal |     dtype      |    forward_time    |   backward_time    |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
|     1      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.783618393586949 | 65.59692794689909  |
|     1      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.064015300711617 | 56.99719698168337  |
|     1      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 16.629025398287922 | 68.65267595276237  |
|     1      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 17.462356004398313 | 68.35797848179936  |
|     1      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 |  29.5476081490051  | 101.22994752600789 |
|     1      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 28.395320149138573 | 98.62275794148445  |
|     1      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 50.50016101449728  | 181.4357690163888  |
|     1      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 49.450615647947416 | 175.86063902126625 |
|     1      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 506.06461532879626 | 1866.0613044630736 |
|     1      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 299.9336270149797  | 976.4662646921353  |
|     1      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.45752210286446  | 58.79682704107836  |
|     1      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.407129396684468 | 58.14061599085107  |
|     1      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 13.822759891627355 | 56.56979401828722  |
|     1      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 13.39154909946956  |  56.7130644340068  |
|     1      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 20.282494352431968 | 77.29688903782517  |
|     1      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 19.899454596452415 |  75.4446149803698  |
|     1      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 48.494275606935844 | 177.5322465109639  |
|     1      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 46.84524350450374  | 189.1778860008344  |
|     1      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 635.1026654010639  | 2248.0451600858937 |
|     1      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 335.1591735263355  | 1080.4320796160027 |
|     4      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 13.63953539985232  | 65.50709309522063  |
|     4      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 12.858113402035087 | 63.021871959790595 |
|     4      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 19.98318645055406  | 105.87883047992364 |
|     4      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 18.619045056402683 | 104.90188701078296 |
|     4      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 45.91175540117546  | 226.00732848513871 |
|     4      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 44.39614630537107  | 232.39317198749632 |
|     4      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 135.5409600073472  | 522.7949097752571  |
|     4      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 158.79383607534692 | 628.5856699105352  |
|     4      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 1775.9978299727663 | 6343.203847063706  |
|     4      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1160.680354805663  | 3842.235009651631  |
|     4      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 11.553713708417488 | 65.50691701704638  |
|     4      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.486379051348194 |  56.9980075233616  |
|     4      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 17.56585600087419  | 107.89892700267956 |
|     4      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 16.828144202008843 | 109.05519902007653 |
|     4      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 48.23235589428805  | 217.8974545095116  |
|     4      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 49.09284680034033  | 244.73925953498107 |
|     4      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 134.77827049791813 | 522.7259948151186  |
|     4      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 176.60772847011688 | 681.5171707421541  |
|     4      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 2267.821540008299  | 7720.425300067291  |
|     4      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 1295.3941145678982 | 4272.425139788538  |
|     8      |    16     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 14.514714101096615 |  94.2192979855463  |
|     8      |    16     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 13.553097198018804 |  93.244242540095   |
|     8      |    16     |    256    |    256     |   2048    |   True    | torch.bfloat16 | 27.95821905019693  | 185.0469880155288  |
|     8      |    16     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 26.709681446664035 | 184.22623950755226 |
|     8      |    16     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 85.85420495364815  | 388.3417735341937  |
|     8      |    16     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 89.97473795898259  | 434.4228169647977  |
|     8      |    16     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 | 220.6919804448262  | 958.9654899900779  |
|     8      |    16     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 306.55586952343583 | 1233.2170095760375 |
|     8      |    16     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 3470.7326447824016 | 12183.611298678443 |
|     8      |    16     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2299.064100370742  | 7669.618452200666  |
|     8      |    32     |    128    |    128     |   2048    |   True    | torch.bfloat16 | 12.427107692928985 | 96.96270158747211  |
|     8      |    32     |    128    |    128     |   2048    |   False   | torch.bfloat16 | 11.856995843118057 | 96.38117247959599  |
|     8      |    32     |    256    |    256     |   2048    |   True    | torch.bfloat16 |  32.9956392000895  | 182.52741603646427 |
|     8      |    32     |    256    |    256     |   2048    |   False   | torch.bfloat16 | 29.397601098753512 | 191.0755339777097  |
|     8      |    32     |    512    |    512     |   2048    |   True    | torch.bfloat16 | 89.06024845782667  | 392.2585004474967  |
|     8      |    32     |    512    |    512     |   2048    |   False   | torch.bfloat16 | 97.78487798757851  | 462.07307645818213 |
|     8      |    32     |   1024    |    1024    |   2048    |   True    | torch.bfloat16 |  240.521906001959  | 992.4693452194335  |
|     8      |    32     |   1024    |    1024    |   2048    |   False   | torch.bfloat16 | 341.98952303268015 | 1339.2950996058062 |
|     8      |    32     |   4096    |    2048    |   2048    |   True    | torch.bfloat16 | 4445.311005110853  | 15001.030603889374 |
|     8      |    32     |   4096    |    2048    |   2048    |   False   | torch.bfloat16 | 2535.9767401823774 | 8528.990152990447  |
+------------+-----------+-----------+------------+-----------+-----------+----------------+--------------------+--------------------+
```

```
{'avg_forward_time_nan_fix': 399.7900972732653,
 'avg_backward_time_nan_fix': 1409.652114014413,
 'avg_forward_time_main_branch': 394.6807206988645,
 'avg_backward_time_main_branch': 1399.4055472857629,
 'geo_mean_nan_fix': 150.95049601244946,
 'geo_mean_main_branch': 148.3381648508822}
 ```

The y-axis label is wrong (the units are microseconds), but the relative comparison still holds
<img width="790" alt="Screenshot 2024-03-18 at 3 34 15 PM" src="https://github.com/pytorch/pytorch/assets/32754868/ca278c15-b815-4535-bdcd-07e522055466">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122135
Approved by: https://github.com/cpuhrsch
2024-03-19 16:32:00 +00:00
e26280ad8b Fix typing for autograd.Function with ctx-less forward (#122167)
Previously, typing an autograd.Function like the following would lead to
a mypy error (mypy expects the first arg of `forward` to be named `ctx`).

This PR fixes that by deleting the ctx arg.

```py
import torch

class MySin(torch.autograd.Function):
    @staticmethod
    def forward(x: torch.Tensor) -> torch.Tensor:
        return x.sin()

    @staticmethod
    def setup_context(*args, **kwargs):
        pass

    @staticmethod
    def backward(ctx, grad):
        if grad.stride(0) > 1:
            return grad.sin()
        return grad.cos()
```
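
For reference, a hedged usage sketch continuing the snippet above: a ctx-less `forward` requires `setup_context` to be defined, and the Function is still invoked through `.apply`.

```python
x = torch.randn(3, requires_grad=True)
y = MySin.apply(x)  # routes through forward + setup_context
y.sum().backward()
print(x.grad.shape)  # torch.Size([3])
```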

Test Plan:
- tested locally (I don't know how to put up a test in CI for this).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122167
Approved by: https://github.com/soulitzer
2024-03-19 16:15:23 +00:00
f9ed1c432d Revert "Refactor gpu trace to be device-agnostic (#121794)"
This reverts commit 0ff1109e2688b8c841c9dd0eeecfba16f027b049.

Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))
2024-03-19 15:40:26 +00:00
c05bf0037d [dynamo] Remove copy_graphstate/restore_graphstate (#122067)
Some dead code cleanup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122067
Approved by: https://github.com/oulgen
2024-03-19 15:37:53 +00:00
7673cb534a Revert "Skip nonzero unbacked SymInt memo in inference mode (#122147)"
This reverts commit 5e2687391229cee6e4dc0214f9208b4ecbe058c1.

Reverted https://github.com/pytorch/pytorch/pull/122147 on behalf of https://github.com/jeanschmidt due to Reverting to see if trunk error in inductor are related ([comment](https://github.com/pytorch/pytorch/pull/122147#issuecomment-2007513000))
2024-03-19 15:37:24 +00:00
cyy
6c01c25319 [Clang-tidy header][28/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122175)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following https://github.com/pytorch/pytorch/pull/122023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122175
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-03-19 14:08:54 +00:00
6c50308801 [ATen-Vulkan][EZ] Small fixes: fix gpu size calculation and Half scalartype ctype mapping (#122096)
Summary:
## Context

Some small fixes to the ATen-Vulkan backend.

The first is that the GPU size calculation for a 4-dimensional tensor with width packing had a small bug:

```
      case 4:
        switch (memory_layout) {
          case api::GPUMemoryLayout::TENSOR_WIDTH_PACKED:
            gpu_sizes.at(0) = sizes.at(0);
            gpu_sizes.at(1) = sizes.at(1);
            // bug: this should be gpu_sizes.at(2) = sizes.at(2)
            gpu_sizes.at(2) = sizes.at(3);
            gpu_sizes.at(3) = api::utils::align_up(sizes.at(3), INT64_C(4));
            break;
```

This was fixed by simplifying the logic of GPU size calculation for texture storage.

The second was to modify the ctype mapping of the `api::kHalf` scalar type to be `float` instead of `unsigned short`. This is because GLSL does not natively support `float16`, so even with a FP16 texture type CPU/GPU transfer shaders will have to read from and write to `float` buffers.

In the future, we will look into integrating [VK_KHR_shader_float16_int8](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_KHR_shader_float16_int8.html) into ATen-Vulkan to allow for 16 bit and 8 bit types to be referenced explicitly.

Test Plan: CI

Differential Revision: D55018171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122096
Approved by: https://github.com/jorgep31415
2024-03-19 13:27:27 +00:00
39877abee2 Update jvp to support symbolic execution. (#120338)
Previously, all jvp tests under dynamo/test_dynamic_shapes would fail because symbolic execution wasn't supported in some autograd functions.

List of changes:
- Update `_has_same_storage_numel` to use `sym_nbytes`
- Symintify `_efficientzerotensor_meta`
- Introduce `empty_generic_symint` with the first argument `size` as a symbolic integer
- Update the gen_variable_type.py script to call the symint version of the zeros_fn function (zeros_symint / _efficientzerotensor_symint)
- Update `has_same_meta` to call `sym_*` functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120338
Approved by: https://github.com/soulitzer
ghstack dependencies: #119926
2024-03-19 13:06:42 +00:00
edd04b7c16 Teach dynamo about torch.func.jvp (#119926)
List of changes:
- Replace JVP_NESTING with torch._C._functorch.maybe_current_level()
- Remove all increment nesting functions from wrap_fx_proxy_cls
- fwAD.make_dual receives the dual_level as a keyword argument
- Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926
Approved by: https://github.com/zou3519
2024-03-19 13:06:42 +00:00
6b5259e507 [lint] bump lint dependency PyYAML to 6.0.1 to support Python 3.12 (#122022)
[PyYAML 6.0.0](https://pypi.org/project/PyYAML/6.0) was released 2.5 years ago and it is not installable with Python 3.12.

This PR bumps the version of [PyYAML to 6.0.1](https://pypi.org/project/PyYAML/6.0.1) in `lintrunner` configuration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122022
Approved by: https://github.com/Skylion007
2024-03-19 12:23:49 +00:00
8168338063 Add CPU implementation for torch._int_mm (s8*s8->s32) (#121792)
Fixes #121647

**Description**
Currently, the op `torch._int_mm` only supports CUDA devices. This PR adds a CPU implementation for it.
Besides the request from the issue, this op may also be useful for planned CPU implementations of [LLM.int8()](https://arxiv.org/abs/2208.07339) in [Bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.
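
For illustration, a hedged usage sketch (the shapes here are arbitrary; the CUDA path has stricter shape requirements than the CPU path added in this PR):

```python
import torch

a = torch.randint(-128, 128, (32, 64), dtype=torch.int8)
b = torch.randint(-128, 128, (64, 32), dtype=torch.int8)
c = torch._int_mm(a, b)  # s8 x s8 -> s32
print(c.dtype, c.shape)  # torch.int32 torch.Size([32, 32])
```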

**Test plan**
`python test/test_linalg.py -k test__int_mm_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121792
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-03-19 08:44:33 +00:00
0d845f7b07 Fix auto_functionalize (#121990)
Differential Revision: D54964130

When we re-export, the auto_functionalize HOP will be in the graph. Therefore, we need to implement a proper functionalization rule for it. Since the content inside auto_functionalize is guaranteed to be functional, it is OK to just fall through it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121990
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-03-19 07:11:11 +00:00
a2a88f39ee Avoid COW materialize in conv, log sigmoid, repeat, group_norm, batch_norm (#121537)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121537
Approved by: https://github.com/ezyang
2024-03-19 06:15:00 +00:00
0ff1109e26 Refactor gpu trace to be device-agnostic (#121794)
# Motivation
Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared among device backends.

# Solution
Move `_cuda_trace.py` to `_gpu_trace.py`, which lets each device backend own its callbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2024-03-19 06:02:28 +00:00
09ce76809c Improve compiler detection on MacOS (#121406)
By relying on the `is_apple_clang` helper function rather than on the compiler name (as `gcc` is Clang on macOS):
```
% which gcc; gcc -v
/usr/bin/gcc
Apple clang version 15.0.0 (clang-1500.3.9.4)
Target: arm64-apple-darwin23.3.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
```
But
```
% /opt/homebrew/bin/gcc-13 -v
Using built-in specs.
COLLECT_GCC=/opt/homebrew/bin/gcc-13
COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper
Target: aarch64-apple-darwin23
Configured with: ../configure --prefix=/opt/homebrew/opt/gcc --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls --enable-checking=release --with-gcc-major-version-only --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --with-system-zlib --build=aarch64-apple-darwin23 --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Homebrew GCC 13.2.0)
```
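
For illustration, a minimal sketch of such a check (an assumption for exposition, not the actual helper): detect Apple Clang from the version banner rather than the binary name.

```python
import subprocess

def is_apple_clang(cxx: str = "gcc") -> bool:
    # On macOS, `gcc` is Apple Clang, so inspect the version banner
    # instead of trusting the compiler name.
    banner = subprocess.check_output([cxx, "--version"], text=True)
    return "Apple clang" in banner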

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121406
Approved by: https://github.com/malfet, https://github.com/jansel
2024-03-19 05:32:08 +00:00
FEI
8499767e96 add sdpa choice for DeviceType::PrivateUse1 (#121409)
Fixes  #116854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121409
Approved by: https://github.com/drisspg
2024-03-19 05:08:46 +00:00
5bc7f7f977 [dynamo] Make tx.next_instruction lazy (#122066)
Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 2.5s to 2.4s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122066
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060, #122063
2024-03-19 04:23:30 +00:00
153a01833b [dynamo] Optimize SourcelessBuilder (#122063)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 2.7s to 2.5s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122063
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055, #122058, #122060
2024-03-19 04:23:30 +00:00
8082adcf65 [dynamo] Only rename a proxy once (#122060)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 3.9s to 2.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122060
Approved by: https://github.com/oulgen
ghstack dependencies: #122039, #122043, #122055, #122058
2024-03-19 04:23:27 +00:00
2bec55c5f9 [dynamo] Remove VariableTracker.parents_tracker (#122058)
This is leftover from mutable variable tracker days and no longer needed.

Improves benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py
from 4.2s to 3.9s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122058
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043, #122055
2024-03-19 04:23:24 +00:00
3c706bf483 [dynamo] Optimize BuiltinVariable (#122055)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.1s to 4.2s (compared to 2 PRs ago).

This works by precomputing (and caching) the parts of `BuiltinVariable.call_function` that don't depend on the values of args/kwargs.
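
A generic illustration of the caching idea (an assumption for exposition, not dynamo's actual code): the handler for a builtin depends only on the argument *types*, so it can be computed once per (builtin, types) key and reused across calls.

```python
import functools

@functools.lru_cache(maxsize=None)
def handler_for(builtin, arg_types):
    # Imagine expensive dispatch logic here; it runs once per type signature.
    return lambda args: builtin(*args)

def call_builtin(builtin, *args):
    return handler_for(builtin, tuple(map(type, args)))(args)

print(call_builtin(len, [1, 2, 3]))  # 3
print(call_builtin(max, 2, 5))       # 5
```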

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122055
Approved by: https://github.com/oulgen, https://github.com/anijain2305
ghstack dependencies: #122039, #122043
2024-03-19 04:23:20 +00:00
07caea5c12 [dynamo] Refactor COMPARE_OP and comparison builtins (#122043)
This removes the duplicate handling of comparison ops between symbolic_convert and builtin, and refactors the handling to use the binop infrastructure. This change regresses overheads a bit, but that is fixed in the next PR.

The new test skips are variants of `type(e) is np.ndarray` that previously fell back to eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122043
Approved by: https://github.com/anijain2305
ghstack dependencies: #122039
2024-03-19 04:23:17 +00:00
769ff86b91 [dynamo] Optimize COMPARE_OP (#122039)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 5.6 to 5.1s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122039
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-03-19 04:23:14 +00:00
cyy
e1706bba3b [Clang-tidy header][27/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122023)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h, following #122015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122023
Approved by: https://github.com/ezyang
2024-03-19 03:26:15 +00:00
5e26873912 Skip nonzero unbacked SymInt memo in inference mode (#122147)
Summary: In `torch.inference_mode()`, fake tensors don't have `_version`s. This breaks unbacked SymInt memoization in `torch.nonzero` tracing. Here we disable the latter in inference mode.
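
For illustration, a hedged sketch of the underlying limitation (the exact error message may differ):

```python
import torch

with torch.inference_mode():
    t = torch.ones(3)
    try:
        _ = t._version  # inference tensors do not track a version counter
    except RuntimeError as e:
        print(e)
```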

Test Plan:

```
$ python test/inductor/test_unbacked_symints.py -k test_nonzero_in_inference_mode
...
----------------------------------------------------------------------
Ran 2 tests in 14.060s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122147
Approved by: https://github.com/ezyang
2024-03-19 03:20:33 +00:00
8860c625ea [dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726
Approved by: https://github.com/jansel
2024-03-19 03:11:31 +00:00
f84d560236 [dynamo] Raise accumulated cache size limit (#122130)
Fixes #114511

This was raised by IBM folks, where an LLM compile was failing because it had more than 64 layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122130
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #121954, #122005
2024-03-19 02:35:48 +00:00
7084528eb9 [dynamo][model_output] Do not include none for CustomizedDictVariable (#122005)
Fixes https://github.com/pytorch/pytorch/issues/120923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122005
Approved by: https://github.com/weifengpy, https://github.com/jansel
ghstack dependencies: #121954
2024-03-19 02:35:48 +00:00
2b06098380 Enable x86 CPU vectorization on windows [submodule sleef] (#118980)
Enable VEC on Windows OS.
1. Fix some type definition gaps between Windows and Linux.
2. Fix some operators not supported on Windows, such as `[]` and `/`.
3. Enable static sleef library build on Windows.
4. Disable unsupported function overloads on MSVC.
5. Upgrade the sleef submodule, which fixes a build issue on Windows.
6. Fix bazel build issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118980
Approved by: https://github.com/jgong5, https://github.com/ezyang, https://github.com/malfet
2024-03-19 02:22:04 +00:00
6502c888cf Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010
Approved by: https://github.com/eellison
2024-03-19 02:17:10 +00:00
18d94d7165 Make FX nodes sortable (#122071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122071
Approved by: https://github.com/oulgen
2024-03-19 01:40:56 +00:00
1f4d4d3b78 [fx] preserve partitioner order fix (#122111)
Summary:
The previous implementation seems to introduce a key-value pair of {"node": None}. This causes an error in logging later on, because we extract the name from "node" but it is a string instead of a torch.fx.Node.

This fix seems to make the tests pass.

Test Plan:
CI

ExecuTorch CI:
buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_models

Reviewed By: larryliu0820

Differential Revision: D55026133

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122111
Approved by: https://github.com/mikekgfb
2024-03-19 01:00:44 +00:00
34f36a28df [MPS] Fwd-fix for clamp regression (#122148)
Forward fix for regressions introduced by https://github.com/pytorch/pytorch/pull/121381 as we failed to run MPS CI twice on it

- Do not call `minimumWithNaNPropagationWithPrimaryTensor` for integral tensors as it will crash with
  ```
    /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Utility/MPSKernelDAG.mm:805: failed assertion `Error getting visible function: (null) Function isNaN_i16_i8 was not found in the library'
   ```
- Change the order of the max and min calls, as it's apparently important for
  consistency: `min(max(a, b), c)` might not equal `max(min(a, c), b)` if `c` is less than `b`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122148
Approved by: https://github.com/huydhn
2024-03-19 00:52:45 +00:00
ae983d2d6e Fix typo in sparse.rst (#121826)
Change word "on" to "one" when talking in the third person.

Fixes #121770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826
Approved by: https://github.com/janeyx99
2024-03-19 00:17:19 +00:00
e6cf3e90a5 [AOTAutograd / Functionalization] Fix incorrect expand_inverse (#122114)
This is a rebase of https://github.com/pytorch/pytorch/pull/114538,
originally submitted by @jon-chuang.

Fixes #114302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122114
Approved by: https://github.com/bdhirsh
2024-03-18 22:52:57 +00:00
ba69dc6675 [Easy] add option to print compilation time (#121996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121996
Approved by: https://github.com/davidberard98
2024-03-18 22:42:41 +00:00
2ab8b34433 Error out in case of in-source builds (#122037)
Such builds could not succeed, as the arch-specific ATen dispatch mechanism creates temporary files that get added to the build system on every rebuild, which results in build failures

Fixes https://github.com/pytorch/pytorch/issues/121507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122037
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-03-18 21:48:18 +00:00
e6a461119a [functorch] Add batch rule for linalg.lu_unpack (#121811)
Fixes: https://github.com/pytorch/pytorch/issues/119998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121811
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-03-18 21:24:16 +00:00
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend-agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is CUDA numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
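
A hedged call sketch — the argument order and return tuple here are assumptions inferred from this description, and the authoritative schema is in native_functions.yaml:

```python
import torch

x = torch.randn(2, 3, 4, 4)
weight, bias = torch.ones(3), torch.zeros(3)
running_mean, running_var = torch.zeros(3), torch.ones(3)
# Assumed schema: (input, weight, bias, running_mean, running_var, momentum, eps)
# returning (output, save_mean, save_rstd, reserve).
out, save_mean, save_rstd, reserve = torch.ops.aten.batch_norm_with_update(
    x, weight, bias, running_mean, running_var, 0.1, 1e-5
)
print(out.shape)  # torch.Size([2, 3, 4, 4])
```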

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
a17cd226d6 [inductor] Enable FX graph caching on another round of inductor tests (#121994)
Summary: Enabling caching for these tests was blocked by https://github.com/pytorch/pytorch/pull/121686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121994
Approved by: https://github.com/eellison
2024-03-18 20:55:18 +00:00
7c5e29ae71 Back out "Support triton.language.dtype with torch.compile (#121690)" (#122108)
Summary: Some hard-to-deal-with package import/export-related problems. Let's revert and start with a clean slate.

Test Plan: CI

Differential Revision: D55024877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122108
Approved by: https://github.com/ezyang
2024-03-18 20:50:28 +00:00
685ace3834 [compiled autograd] add dynamo segfault test (#122004)
To catch issues like https://github.com/pytorch/pytorch/issues/121862 in CI. This passes because we reverted the PRs, and https://github.com/pytorch/pytorch/pull/121870 confirms that this test can catch it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122004
Approved by: https://github.com/eellison
2024-03-18 20:07:15 +00:00
40acc84aaf Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri
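
For illustration, the expected NaN-propagation semantics (matching CPU/CUDA) that this change brings to MPS:

```python
import torch

x = torch.tensor([1.0, float("nan"), 5.0])
# NaN propagates through clamp instead of being coerced to a bound:
print(torch.clamp(x, min=0.0, max=2.0))  # tensor([1., nan, 2.])
```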

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-18 19:38:15 +00:00
0a1b3be216 chore: add unit test to verify split_by_tags output_type (#121262)
Add a test case as per https://github.com/pytorch/pytorch/pull/120361#issuecomment-1979163324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121262
Approved by: https://github.com/atalman
2024-03-18 19:19:26 +00:00
676a77177e Revert "[BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)"
This reverts commit 4cbf963894e78d1cfedffe4f829740dc99163caa.

Reverted https://github.com/pytorch/pytorch/pull/121908 on behalf of https://github.com/jeanschmidt due to this is due to OIDC can't work on forked PR due to token write permissions can't be shared ([comment](https://github.com/pytorch/pytorch/pull/121908#issuecomment-2004707582))
2024-03-18 19:03:11 +00:00
df1cdaedeb Log restart reasons and extra compile time in CompilationMetrics (#121827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121827
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-03-18 18:59:25 +00:00
74c09a757b Simplify Storage meta conversion with PyObject preservation (#122018)
Thanks to https://github.com/pytorch/pytorch/pull/109039 we can rely on
finalizers on Storage PyObject to handle removal from dict.

Irritatingly, we still have to attach finalizer, because we don't have
a weak key AND value dict (only one or the other).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122018
Approved by: https://github.com/eellison, https://github.com/kurtamohler
2024-03-18 18:55:58 +00:00
32410f80ec [Caffe2 CPU tests] Update CMakeLists.txt (#119643)
I was trying to build PyTorch with USE_GLOG=ON (so we could get better timestamps around the nccl logging) and ran into this error

```
[1/7] Linking CXX executable bin/verify_api_visibility
FAILED: bin/verify_api_visibility
: && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O2 -g -DNDEBUG -rdynamic     -Wl,--no-as-needed caffe2/CMakeFiles/verify_api_visibility.dir/__/aten/src/ATen/test/verify_api_visibility.cpp.o -o bin/verify_api_visibility -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/usr/local/cuda/lib64:/root/conda/lib:/mnt/code/pytorch/build/lib:  lib/libgtest_main.a  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /root/conda/lib/libmkl_intel_lp64.so  /root/conda/lib/libmkl_gnu_thread.so  /root/conda/lib/libmkl_core.so  -fopenmp  /usr/lib64/libpthread.so  -lm  /usr/lib64/libdl.so  -Wl,--no-as-needed,"/mnt/code/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  lib/libc10_cuda.so  lib/libc10.so  /root/conda/lib/libglog.so.0.4.0  /root/conda/lib/libgflags.so.2.2.2  -lpthread  /usr/local/cuda/lib64/libcudart.so  /usr/local/cuda/lib64/libnvToolsExt.so  lib/libgtest.a  -pthread && /root/conda/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/verify_api_visibility && :
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /mnt/code/pytorch/build/lib/libtorch.so: undefined reference to symbol '_ZTVN10__cxxabiv117__class_type_infoE@@CXXABI_1.3'
/opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/lib64/libstdc++.so.6: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
```

Adding stdc++ explicitly to the list of libraries to link seems to fix the build, and I was able to get a working build of PyTorch.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119643
Approved by: https://github.com/zdevito
2024-03-18 18:35:32 +00:00
5d52b163d1 [dynamo] Optimize load/store/const op handling (#122038)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 6.7s to 5.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122038
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034, #122035
2024-03-18 18:08:06 +00:00
4034873a31 [dynamo] Optimize builtin handling (#122035)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 7.3s to 6.7s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122035
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033, #122034
2024-03-18 18:08:06 +00:00
6ca0323615 [dynamo] Optimize VariableTracker.__post_init__ (#122034)
Improves `benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
from 8.6s to 7.3s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122034
Approved by: https://github.com/Skylion007
ghstack dependencies: #122032, #122033
2024-03-18 18:08:06 +00:00
115c9c6d6b Remove __getattribute__ on autograd.Function (#122033)
Improves `benchmarks/dynamo/microbenchmarks/overheads.py` from 38.7us to
34.3us.

See #122029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122033
Approved by: https://github.com/zou3519, https://github.com/soulitzer
ghstack dependencies: #122032
2024-03-18 18:08:06 +00:00
5a10b56083 [dynamo] Small microbenchmark changes (#122032)
Used to generate numbers in #122029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032
Approved by: https://github.com/yanboliang
2024-03-18 18:08:06 +00:00
1a58e9d357 [TD] LLM indexer to run daily (#121835)
Run indexer daily
Run indexer in docker container

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121835
Approved by: https://github.com/osalpekar, https://github.com/malfet
2024-03-18 16:34:01 +00:00
ceb1910bad Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 11b36e163df66196d24fbded4b37ef8f8c032640.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to New action is breaking current ci in not rebased PRs ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2004393980))
2024-03-18 16:33:23 +00:00
11b36e163d [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes to enable ARC to run the build for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

The old-style `runs-on` definition accepts a string; the new style requires an object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group, the format of the YAML needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of, because GH Actions YAML is not templatable and supports only minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-18 15:40:43 +00:00
c4d24b5b7f special-case cuda array interface of zero size (#121458)
Fixes #98133
retry of #98134
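A minimal repro sketch of the behavior being fixed (requires a CUDA build; the expected values are assumptions based on the linked issue, not taken from the patch):

```python
import torch

# Querying __cuda_array_interface__ on a zero-size CUDA tensor previously
# needed special-casing; after the fix it should return a well-formed dict.
t = torch.empty(0, device="cuda")
iface = t.__cuda_array_interface__
print(iface["shape"], iface["data"])  # expected: (0,) and a (ptr, read_only) pair
```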
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121458
Approved by: https://github.com/bdice, https://github.com/ptrblck, https://github.com/mikaylagawarecki
2024-03-18 15:21:38 +00:00
f7908d9fa8 enable reshape+linear+reshape fusion for dynamic shapes (#121116)
reshape+linear+reshape fusion for dynamic shapes has been disabled in https://github.com/pytorch/pytorch/pull/107123.
Re-enable it by comparing the symbolic values in case of dynamic shapes. This will improve the performance for dynamic shape cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121116
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 14:46:27 +00:00
f2f8eeea94 Inductor: fix Conv output stride for dynamic shapes (#121400)
Fixes https://github.com/pytorch/pytorch/issues/120873.
Fixes the output stride of Conv in the case of dynamic shapes. The previous logic in inductor assumed that the output stride of Conv is always channels last while it is actually contiguous if `dynamic_shapes and is_contiguous_storage_and_layout(x)`.

### Static shape
In static shape cases, since weight is prepacked (`weight_t.is_mkldnn()` will be `true`), we'll always force output to be channels last in the Conv kernel, thus it's fine to have the assumption in Inductor that the output stride of Conv is always channels last.
96ed37ac13/aten/src/ATen/native/mkldnn/Conv.cpp (L357-L358)

### Dynamic shape
In dynamic shape cases, we won't do weight prepack for Conv, in this case, the Conv kernel decides the output layout based on the input and weight layout.
96ed37ac13/torch/_inductor/fx_passes/mkldnn_fusion.py (L1024-L1025)

For input with `channels = 1`, like a tensor of size `(s0, 1, 28, 28)` and stride `(784, 784, 28, 1)`, in Inductor, with `req_stride_order` in channels-last order, calling `require_stride_order` on `x` of such size and stride won't change the stride of the tensor, since the stride of size-1 dimensions is ignored
96ed37ac13/torch/_inductor/ir.py (L5451)

The Conv kernel, however, considers such a tensor **contiguous** instead of channels last, so the output of the Conv kernel will be in contiguous format.
96ed37ac13/aten/src/ATen/native/ConvUtils.h (L396-L404)

To align the behavior of the Conv kernel, we set the output_stride in such case to be contiguous instead of channels last.
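A hypothetical repro sketch of the scenario above (shapes taken from the description; the exact compile flags are assumptions):

```python
import torch

conv = torch.nn.Conv2d(1, 8, 3)
x = torch.randn(2, 1, 28, 28)  # stride (784, 784, 28, 1): both contiguous and "channels last"
eager_out = conv(x)
compiled_out = torch.compile(conv, dynamic=True)(x)
# After the fix, inductor's assumed output stride matches the eager Conv kernel's choice.
print(eager_out.stride(), compiled_out.stride())
```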

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121400
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-18 10:56:58 +00:00
206da97b8b [aot_inductor][easy] enable test_triton_kernel_multi_output_arg (#122052)
Looks like we already support aoti_torch_cuda_sort in the C shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122052
Approved by: https://github.com/oulgen
2024-03-18 09:14:35 +00:00
65ccac6f17 Fix triton import time cycles (#122059)
Summary: `has_triton` causes some import-time cycles. Let's use `has_triton_package`, which is sufficient.
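A sketch of the lighter-weight check (assuming `torch.utils._triton` exposes both helpers, as in current PyTorch):

```python
# has_triton_package only checks that the triton module is importable, so it
# avoids the heavier device probing in has_triton that can create import cycles.
from torch.utils._triton import has_triton_package

if has_triton_package():
    print("triton is importable")
```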

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test -- --exact 'fblearner/flow/projects/model_processing/pytorch_model_export_utils/logical_transformations/tests:filter_inference_feature_metadata_test - test_collect_features_from_graph_module_nodes (fblearner.flow.projects.model_processing.pytorch_model_export_utils.logical_transformations.tests.filter_inference_feature_metadata_test.FilterInferenceFromFeatureMetadataTest)'
```
now passes

Differential Revision: D55001430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122059
Approved by: https://github.com/aakhundov
2024-03-18 05:50:32 +00:00
bc9d054260 [executorch hash update] update the pinned executorch hash (#122061)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122061
Approved by: https://github.com/pytorchbot
2024-03-18 05:02:27 +00:00
7380585d97 [vision hash update] update the pinned vision hash (#122062)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122062
Approved by: https://github.com/pytorchbot
2024-03-18 03:41:50 +00:00
e39aedfcc5 Fix fx graph triton import bug (#122041)
Summary: Unless we register triton as a special import, the FX graph import mechanism imports it as `from fx-generated._0 import triton as triton`, which is obviously broken.

Test Plan:
I could not figure out how to write a test for this but
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//tgif/lib/tests/gpu_tests:lowering_pass_test -- -r test_default_ait_lowering_multi_hardwares
```
now passes

Differential Revision: D54990782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122041
Approved by: https://github.com/aakhundov
2024-03-17 22:48:51 +00:00
5030913d6a [test] Delete variables that have been declared but not referenced di… (#121964)
Delete variables that have been declared but not referenced in aten/src/ATen/test/cuda_distributions_test.cu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121964
Approved by: https://github.com/janeyx99
2024-03-17 09:45:05 +00:00
cyy
d9460758df [Clang-tidy header][26/N] Fix clang-tidy warnings in aten/src/ATen/core/*.h (#122015)
This PR fixes various clang-tidy warnings on aten/src/ATen/core/*.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122015
Approved by: https://github.com/ezyang
2024-03-17 07:56:45 +00:00
c568b84794 [dynamo][guards] Move backend match to eval_frame (#121954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121954
Approved by: https://github.com/jansel
2024-03-17 06:52:10 +00:00
fc504d719f [executorch hash update] update the pinned executorch hash (#122036)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122036
Approved by: https://github.com/pytorchbot
2024-03-17 04:56:37 +00:00
6f74b76072 Move get_unwrapped outside of disable_functorch (#121849)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121849
Approved by: https://github.com/albanD
2024-03-16 22:25:07 +00:00
3bd38928ba [export] Improve consistency for nn_module_stack metadata, add checks to _trace.py (#120661)
We would like to improve consistency for nn_module_stack metadata in torch.export.

This PR ensures that all tests in test/export/test_export.py satisfy the following constraints:
- Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules
- Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields)
- Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there)
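A minimal sketch of checking the first two constraints on a toy module (the module and assertions are illustrative, not the actual test code):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = torch.export.export(M(), (torch.randn(2),))
for node in ep.graph.nodes:
    if node.op in ("placeholder", "output"):
        assert "nn_module_stack" not in node.meta  # removed for these node types
    else:
        assert "nn_module_stack" in node.meta      # present for all other nodes
```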

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661
Approved by: https://github.com/avikchaudhuri
2024-03-16 21:44:52 +00:00
6d9588a12b [inductor] disable linear weight prepacking pass on double (#121478)
Fix #121175

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121478
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-03-16 13:24:21 +00:00
9990d1bc22 Add 'profiler/python' to the package.' (#121892)
Fixes #ISSUE_NUMBER
Expose the `py_symbolize` interface for use. Thank you.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121892
Approved by: https://github.com/zdevito
2024-03-16 11:11:26 +00:00
5f601a41e0 Pin protobuf to 3.20.2 on macOS (#121918)
The newer protobuf 5.26.0, released on March 13th, is causing failures with `test_hparams_*` from `test_tensorboard`, in which the stringified metadata is escaped incorrectly for double quotes. For example, 3bc2bb6781.  This looks like an upstream issue from Tensorboard, where it doesn't work with this brand-new protobuf version https://github.com/tensorflow/tensorboard/blob/master/tensorboard/pip_package/requirements.txt#L29

The package has been pinned on Docker https://github.com/pytorch/pytorch/blob/main/.ci/docker/requirements-ci.txt#L155, so it should be pinned on macOS too.  We want to eventually just have one requirements.txt file.

Fixes https://github.com/pytorch/pytorch/issues/122008
Fixes https://github.com/pytorch/pytorch/issues/121927
Fixes https://github.com/pytorch/pytorch/issues/121946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121918
Approved by: https://github.com/kit1980
2024-03-16 09:48:05 +00:00
4d9d5fe540 [executorch hash update] update the pinned executorch hash (#122009)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122009
Approved by: https://github.com/pytorchbot
2024-03-16 04:46:45 +00:00
4d92928fe2 [dynamo] Add tests for fake FSDP (#121610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121610
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735, #120965
2024-03-16 04:29:59 +00:00
0b7d9711d4 [dynamo] Add support for nn.Parameter constructor (part 2) (#120965)
This handles the case where the tensor isn't an input.

The changes to dynamo tests are cases where we would previously fall back to eager.
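A hedged sketch of the newly supported pattern (the function body is illustrative):

```python
import torch

@torch.compile(fullgraph=True)
def make_param():
    t = torch.ones(3)              # the tensor is created inside the graph, not an input
    return torch.nn.Parameter(t) * 2

print(make_param())  # previously this pattern fell back to eager
```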

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120965
Approved by: https://github.com/yanboliang
ghstack dependencies: #121735
2024-03-16 04:29:58 +00:00
040b925753 [Compiled Autograd] Reorder accumulate grad nodes (#121735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121735
Approved by: https://github.com/xmfan
2024-03-16 04:29:56 +00:00
f0b9a8344a [vision hash update] update the pinned vision hash (#121177)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121177
Approved by: https://github.com/pytorchbot
2024-03-16 03:25:08 +00:00
b94691700e [FSDP] Avoided CPU sync in clip_grad_norm_ (#122001)
Copying a scalar 0 tensor on CPU to GPU or constructing a scalar 0 tensor on GPU requires a CPU sync with the GPU. Let us avoid doing ops that involve it.

`FSDP.clip_grad_norm_` already first checks if all parameters are not sharded and calls into `nn.utils.clip_grad_norm_`, so at the point of the code changes, there is guaranteed to be some sharded parameters.
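An illustrative sketch of the kind of op being avoided versus the on-device alternative (not the FSDP code itself; assumes a CUDA device):

```python
import torch

# Avoided: materializing a scalar zero from the CPU forces a host/device sync.
zero = torch.tensor(0.0).to("cuda")

# Preferred: keep the whole computation on tensors that already live on the GPU.
grads = [torch.randn(16, device="cuda"), torch.randn(8, device="cuda")]
total_norm = torch.linalg.vector_norm(
    torch.stack([torch.linalg.vector_norm(g) for g in grads])
)
```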

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122001
Approved by: https://github.com/wanchaol
2024-03-16 03:01:49 +00:00
7bc91d5dc2 [mergebot][BE] If we don't have any required checks, don't run required checks (#121921)
This PR addresses the issue identified in #121920. The existing problem is that all tests are deemed mandatory if none are selected as required. This behavior is particularly noticeable during a force merge operation.

In the context of a force merge, it may not be necessary to execute any tests which are not required (imo). However, this proposed change could be seen as controversial, hence it has been separated from the main update for further discussion and review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121921
Approved by: https://github.com/huydhn
ghstack dependencies: #121920
2024-03-16 01:35:21 +00:00
2b71b21a3f Don't use Proxy torch function in the sym size calls (#121981)
Fixes #ISSUE_NUMBER

Changes from https://github.com/pytorch/pytorch/pull/121938 + adds test

@bypass-github-pytorch-ci-checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121981
Approved by: https://github.com/davidberard98
2024-03-16 01:20:26 +00:00
37e563276b Document complex optimizer semantic behavior (#121667)
<img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667
Approved by: https://github.com/albanD
2024-03-16 00:43:47 +00:00
12662900f9 [inductor] FX graph cache: Fix bug handling constants (#121925)
Summary: During key calculation for FX graph caching: Rather than specialize on "small" vs. "large" tensor constants (i.e., inlined vs. not inlined), always hash on the tensor value. Doing so avoids the complication of trying to later attach the constant values as attributes to an already-compiled module. Instead, different constants will cause an FX graph cache miss and we'll just compile.
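An illustrative sketch of the keying idea (the helper name is hypothetical; the real implementation lives in inductor's cache-key logic):

```python
import hashlib
import torch

def constant_cache_key(t: torch.Tensor) -> str:
    # Hashing the raw bytes means a different constant value yields a
    # different key, i.e., a cache miss followed by a fresh compile.
    return hashlib.sha256(t.detach().cpu().numpy().tobytes()).hexdigest()

print(constant_cache_key(torch.ones(2)) == constant_cache_key(torch.zeros(2)))  # False
```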

Test Plan: New unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121925
Approved by: https://github.com/eellison
2024-03-16 00:11:51 +00:00
cyy
6b0f61891f [Clang-tidy header][25/N] Fix clang-tidy warnings and enable clang-tidy on c10/cuda/*.{cpp,h} (#121952)
This PR enables clang-tidy to code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121952
Approved by: https://github.com/Skylion007
2024-03-16 00:09:54 +00:00
0cc60a05da Revert "Fix torch.clamp in MPS to handle NaN correctly (#121381)"
This reverts commit ca80d07ac71c1bfc9b13c3281a713fed89f15e0f.

Reverted https://github.com/pytorch/pytorch/pull/121381 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think its test is failing in trunk https://github.com/pytorch/pytorch/actions/runs/8302739752/job/22725865151#step:7:644, we should have ciflow/mps to run the test on PR.  Please take a look a reland the change ([comment](https://github.com/pytorch/pytorch/pull/121381#issuecomment-2000685856))
2024-03-15 23:53:05 +00:00
07ec3356b9 Revert "Force upsample to be float32 (#121324)"
This reverts commit 2770e3addd9f05101705f0fef85a163e0034b8a5.

Reverted https://github.com/pytorch/pytorch/pull/121324 on behalf of https://github.com/huydhn due to I think it is better to revert and reland this next week 2770e3addd ([comment](https://github.com/pytorch/pytorch/pull/121324#issuecomment-2000617536))
2024-03-15 23:20:01 +00:00
256c0ec1e5 [docs] Added comment on replicate -> partial for _NormPartial (#121976)
Add a version of https://github.com/pytorch/pytorch/pull/121945#discussion_r1525697167 as a comment in the code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121976
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869, #121945
2024-03-15 23:04:06 +00:00
b717aa6f36 Revert "[BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)"
This reverts commit 2c33e3a372c077badc561b4aad4997e52c03610a.

Reverted https://github.com/pytorch/pytorch/pull/121930 on behalf of https://github.com/huydhn due to I am seeing lots of inductor jobs failing after this change 2c33e3a372.  They looks unrelated though but this change updates Docker image so may be something sneaks in.  I will try to revert this to see if it helps and will reland the change after ([comment](https://github.com/pytorch/pytorch/pull/121930#issuecomment-2000547641))
2024-03-15 22:05:21 +00:00
ca80d07ac7 Fix torch.clamp in MPS to handle NaN correctly (#121381)
Fixes #120899

So this is interesting. There are methods that specifically propagate NaN instead of clamping to real numbers.
https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/3857573-maximumwithnanpropagationwithpri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121381
Approved by: https://github.com/malfet
2024-03-15 21:54:50 +00:00
26aaabb979 [c10d] initialize lastEnqueuedSeq_ and lastCompletedSeq_ (#121980)
Summary:
It is found that these two uninitialized numbers were logged with some very
large or negative values, which is confusing, so we need to initialize them.
Now -1 indicates that the number is invalid, i.e., no work has been completed
or enqueued yet; 0 could be a legit seq id.
Test Plan:
Build

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121980
Approved by: https://github.com/xw285cornell, https://github.com/wconstab, https://github.com/kwen2501, https://github.com/XilunWu
2024-03-15 21:45:15 +00:00
dfc5e9325d format caffe2/torch/_export/serde/serialize.py (#121670)
Summary: black caffe2/torch/_export/serde/serialize.py

Test Plan: tests

Differential Revision: D54654847

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121670
Approved by: https://github.com/angelayi
2024-03-15 21:30:16 +00:00
53d2188df9 Update get_aten_graph_module (#121937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937
Approved by: https://github.com/andrewor14
2024-03-15 20:35:55 +00:00
af86d67d61 [Doc][NVTX] Add documentation for nvtx.range (#121699)
The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663).
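A minimal usage sketch of the context manager being documented (requires CUDA and an NVTX-aware profiler such as Nsight Systems to see the range):

```python
import torch

with torch.cuda.nvtx.range("my_region"):   # emits nvtx range_push/range_pop
    x = torch.randn(512, 512, device="cuda")
    y = x @ x
```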

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699
Approved by: https://github.com/janeyx99
2024-03-15 20:26:44 +00:00
b92daff6e9 [DTensor] Enable ASGD foreach optimizer and add the associated unit test (#121942)
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.

Note that we need to investigate why, when using ASGD, we need higher atol and rtol when comparing model parameters. Listing it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
2024-03-15 20:21:27 +00:00
f4dd2fda51 [DTensor] Supported 2D clip_grad_norm_ (#121945)
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.

Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
2024-03-15 20:11:24 +00:00
2c33e3a372 [BE] Enables support for pytorch ci build in ARC + introduces _linux-build-rg.yml. (#121930)
Introduce changes to enable ARC to run builds for linux-jammy-py3.8-gcc11

Depends on:
* https://github.com/pytorch/pytorch/pull/121908
* https://github.com/pytorch/pytorch/pull/121907
* Force docker to update credentials: https://github.com/pytorch/test-infra/pull/4991
* Add permissions to role to access ECR: acc0154aa0
* Add permissions to the role to access relevant S3 bucket: 496b0422c3

## Reasoning for introducing a new `_linux-build-rg.yml`

The old-style `runs-on` definition accepts a string; the new style requires an object in the format:

```
--- old
...
  runs-on: "linux.2xlarge"
...
--- new
...
  runs-on:
    group: "running-group"
...
```

In other words, to specify a group, the format of the YAML needs to be changed. Unfortunately, there is no way to accomplish this change using any trick in the book that I am aware of, because GH Actions YAML is not templatable and supports only minimal functions / replacements. A few examples that did not work:
* [`e234f25` (#119544)](e234f25ba1 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`087de4a` (#119544)](087de4ad8b (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`f03512e` (#119544)](f03512e344 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))
* [`67581fb` (#119544)](67581fb737 (diff-b317d4da565a9e329ccf67e669c2ff1f4d4bc5fb0ffa4d74132545ad66f84339R76))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121930
Approved by: https://github.com/seemethere
2024-03-15 20:09:50 +00:00
6f4fa8e9a1 [inductor] FX graph cache: simplify "current callable" logic (#121903)
Summary: The handling of the current_callable and compiled_artifact fields in the CompiledFxGraph object is unnecessarily complicated and confusing. We can simplify by storing only the callable. That field is not serializable, so the caching approach is to store a path to the generated artifact and reload from disk on a cache hit. We can just reload inline in the FX cache hit path. This change has the added benefit that it makes it easier to fallback to a "cache miss" if the path somehow doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121903
Approved by: https://github.com/eellison
2024-03-15 20:00:08 +00:00
d0d09f5977 Fix torch.compile links (#121824)
Fixes https://github.com/pytorch/pytorch.github.io/issues/1567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824
Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet
ghstack dependencies: #121823
2024-03-15 19:49:37 +00:00
8a5a377190 Move doc links to point to main (#121823)
The previous links were pointing to an outdated branch

Command: `find . -type f -exec sed -i "s:docs/master:docs/main:g" {} + `

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121823
Approved by: https://github.com/albanD, https://github.com/malfet
2024-03-15 19:49:37 +00:00
535bc71d03 Enable FX graph caching in another batch of inductor tests (#121697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121697
Approved by: https://github.com/eellison
2024-03-15 19:38:51 +00:00
3ee319c49c Fall back to eager mode when viewing with differing bitwidths (#120998) (#121786)
The inductor lowering code for viewing a tensor as a type with a different bitwidth currently doesn't generate valid triton code. This change looks at the source and destination dtypes and, if their sizes differ, falls back to the eager-mode aten implementation. Prior to this change, this condition would throw an exception.
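A sketch of the pattern that now falls back instead of raising (the function is illustrative):

```python
import torch

@torch.compile
def reinterpret(x):
    return x.view(torch.int16)  # float32 (32-bit) viewed as int16 (16-bit)

print(reinterpret(torch.randn(4, 4)).shape)  # torch.Size([4, 8]): last dim doubles
```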

Fixes #120998.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121786
Approved by: https://github.com/peterbell10, https://github.com/bertmaher
2024-03-15 19:33:30 +00:00
409b1a6081 Add lowering for cummax, cummin (#120429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120429
Approved by: https://github.com/peterbell10
2024-03-15 19:04:38 +00:00
d04faf4531 [dynamo][compile-time] Remove preserve rng state per op (#121923)
We already have one globally - 02bb2180f4/torch/_dynamo/convert_frame.py (L477)

I don't think we need one per op.

Saves ~2 seconds on this benchmark

~~~
def fn(x):
    for _ in range(10000):
        x = torch.ops.aten.sin(x)
    return x
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121923
Approved by: https://github.com/jansel
2024-03-15 18:24:46 +00:00
67ec870234 Fix FakeTensorUpdater logic for updating fake tensors (#116168)
Fixes https://github.com/pytorch/pytorch/issues/114464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116168
Approved by: https://github.com/peterbell10
2024-03-15 18:22:24 +00:00
239d87af5e combine loops so fn_name correct in error message (#121601)
The error message shown when input aliasing is detected in `while_loop_func` may not have the correct `fn_name`, as it is set only in the previous for loop. This change merges the two loops so that `fn_name` has the correct value.

No Issue Number for this minor change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121601
Approved by: https://github.com/albanD
2024-03-15 17:14:56 +00:00
39fdde7f84 [release] Increase version 2.3.0->2.4.0 (#121974)
The branch cut for 2.3.0 is complete, hence advancing the main version to 2.4.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121974
Approved by: https://github.com/jeanschmidt
2024-03-15 17:09:33 +00:00
565d1e28ab update kineto submodule commit id (#121843)
Summary: Update kineto submodule commit id so that pytorch profiler can pick up kineto changes from https://github.com/pytorch/kineto/pull/880

Test Plan: CI

Differential Revision: D54828357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121843
Approved by: https://github.com/aaronenyeshi
2024-03-15 16:55:25 +00:00
3c3d7455a3 Disable inductor (default) and inductor (dynamic) by default in the perf run launcher (#121914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121914
Approved by: https://github.com/desertfire
2024-03-15 16:46:24 +00:00
ef25d83a62 [export] Add serialization support for tokens (#121552)
Differential Revision: [D54906766](https://our.internmc.facebook.com/intern/diff/D54906766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121552
Approved by: https://github.com/zhxchen17
2024-03-15 16:15:11 +00:00
014f91a9d9 [FSDP2] implement HSDP (#121569)
support HSDP in per-parameter sharding FSDP: https://github.com/pytorch/pytorch/issues/121023

HSDP is a hybrid of FSDP and DDP: reduce-scatter grads intra-node (FSDP), and all-reduce grads inter-node (DDP)
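A hypothetical sketch of the 2D mesh such a setup uses (assumes an initialized 4-rank process group; the dim names are illustrative):

```python
from torch.distributed.device_mesh import init_device_mesh

# Outer dim: replicate across nodes (DDP-style all-reduce);
# inner dim: shard within a node (FSDP-style reduce-scatter).
mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("replicate", "shard"))
```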

For the unit test, we test 2 + 2 GPUs in a single node: ``pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp``

In profiler traces, all-reduce overlaps with the next reduce-scatter:
<img width="886" alt="Screenshot 2024-03-14 at 3 02 52 PM" src="https://github.com/pytorch/pytorch/assets/134637289/98f1f2b5-c99d-4744-9938-10d0431487e5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121569
Approved by: https://github.com/awgu
2024-03-15 10:00:18 +00:00
4cbf963894 [BE] Migrate pull.yml to use S3 pytorch-ci-artifacts bucket for linux-jammy-py3_8-gcc11 and docs builds/tests (#121908)
Switch to using the LF S3 bucket for pull on linux-jammy-py3_9-gcc and docs jobs. This is required to migrate to ARC and to move to LF resources.

Depends on https://github.com/pytorch/pytorch/pull/121907
Follow up issue https://github.com/pytorch/pytorch/issues/121919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121908
Approved by: https://github.com/malfet
2024-03-15 09:09:53 +00:00
2770e3addd Force upsample to be float32 (#121324)
Fixes #121072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121324
Approved by: https://github.com/ezyang
2024-03-15 07:50:45 +00:00
e25054b248 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved the compilation code into _compiled_autograd_impl, which frees stack-allocated objects (e.g., AutogradCompilerCall) before the compiled graph is called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
2024-03-15 07:12:38 +00:00
5a2b4fc8f0 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
2024-03-15 06:51:27 +00:00
fc33bbf827 better support set_default_dtype(torch.float16), update doc (#121730)
1. Fixes #121300
2. Previously, calling `torch.tensor([2j])` after `torch.set_default_dtype(torch.float16)` would cause a runtime error. This PR also fixes that and enables the test (see the sketch below).
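A minimal sketch of the fixed behavior (the expected dtype is an assumption based on float16's complex counterpart):

```python
import torch

torch.set_default_dtype(torch.float16)
t = torch.tensor([2j])                   # previously raised a runtime error
print(t.dtype)                           # expected: torch.complex32
torch.set_default_dtype(torch.float32)   # restore the usual default
```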

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121730
Approved by: https://github.com/peterbell10
2024-03-15 06:48:42 +00:00
8fdd8125b6 [executorch hash update] update the pinned executorch hash (#121871)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121871
Approved by: https://github.com/pytorchbot
2024-03-15 05:25:36 +00:00
cyy
fb10e13000 [Clang-tidy header][24/N] Fix clang-tidy warnings on c10/cuda/*.{cpp,h} (#120781)
This PR begins to clean clang-tidy warnings of code in c10/cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120781
Approved by: https://github.com/ezyang
2024-03-15 05:03:22 +00:00
e4fda049c2 DTensor: add comm tests to test_tp_examples (#121669)
This adds some basic comm tests to test_tp_examples. This validates that the expected distributed calls are being made for `test_transformer_training`.

Fixes #121649

Test plan:

```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121669
Approved by: https://github.com/wanchaol
2024-03-15 03:37:48 +00:00
02083f5452 [DCP][DSD] Add AdamW to distributed state dict unit tests (#121774)
Thanks @fegin for removing the fsdp root module check in DCP to unblock test updates. https://github.com/pytorch/pytorch/pull/121544

This PR adds "optimizer_class" as a kwarg for the subtests of the following tests to add AdamW as an option.

- test_fsdp
- test_compiled_fsdp
- test_fsdp2
- test_ddp
- test_fsdp_ddp
- test_cpu_offload_full_state_dict

In addition, we temporarily remove the two _verify_osd_by_load calls in _test_save_load, as state dict loading seems to affect parameters. Creating an issue https://github.com/pytorch/pytorch/issues/121186 to keep track.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121774
Approved by: https://github.com/Skylion007
ghstack dependencies: #121773
2024-03-15 03:33:33 +00:00
efbeefbb84 [executorch] Make trymerge force merges actually work with executorch (#121920)
This PR addresses an issue with the trymerge function for executorch, which currently uses Facebook CLA instead of Easy CLA. This bug has been patched in #121921. However, the patch is potentially controversial, and we still want to verify Facebook CLA if it exists. Therefore, this PR includes Facebook CLA in our set of mandatory checks.

Additionally, this PR removes Facebook CLA from one of the mocks. This change is necessary because the specific PR used for testing fails due to the presence of Facebook CLA in the mock.

## Testing:
We run `find_matching_merge_rule(pr = GitHubPR("pytorch", "executorch", 2326), skip_mandatory_checks=True, skip_internal_checks=True)` to check if things work

https://pastebin.com/HHSFp2Gw

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121920
Approved by: https://github.com/huydhn
2024-03-15 03:21:44 +00:00
a623666066 [dynamo][compile-time] Make output_graph new_var linear (#121858)
Fixes https://github.com/pytorch/pytorch/issues/121679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121858
Approved by: https://github.com/jansel
2024-03-15 03:20:04 +00:00
3bc2bb6781 use two pass reduction for deterministic reduction order (#115620)
## Motivation
Address the [non-deterministic reduction order](https://github.com/pytorch/pytorch/issues/93542#issuecomment-1411294181) issue for `omp parallel reduction`.
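A small Python illustration of why the reduction order matters at all: floating-point addition is not associative, so a different order can change the result.

```python
vals = [1e16, -1e16, 1.0]
print(sum(vals))                      # 1.0  (left-to-right)
print(vals[0] + (vals[1] + vals[2]))  # 0.0  (a different grouping loses the 1.0)
```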

## Latest update on 1.15:
55d81901bc.
Do not reduce into the array inside the loop. Instead, reduce into a local scalar and write it to the array after the local reduction is done. This allows the compiler to keep the reduction variable in a register instead of reading/writing it from memory. If the working set of the loop body is quite large, the gap between register and memory accumulation is significant.
```
vaddss %xmm0, %xmm11, %xmm11 -> accumulate in register %xmm0
vaddss (%rdx, %rdi, 4), %xmm0, %xmm0 -> accumulate in memory address (%rdx, %rdi, 4)
```
Example code:
```
float tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    #pragma omp for
    for(...){
        ....
        tmp0_acc_arr[tid] = tmp0_acc_arr[tid] + tmp_x;  // array accesses always go through memory
    }
}
```
will be changed to
```
float tmp0_acc_arr[64];
#pragma omp parallel num_threads(64)
{
    auto tid = omp_get_thread_num();
    **auto tmp0_acc_local = 0;**
    #pragma omp for
    for(...){
        ....
        **tmp0_acc_local**  = tmp0_acc_local + tmp_x;
    }
    **tmp0_acc_arr[tid] = tmp0_acc_local;**
}
```

## Descriptions
Following ATen, use a `two pass reduction` with `omp parallel` for a deterministic reduction order.
9c3ae37fc4/aten/src/ATen/Parallel-inl.h (L39)
9c3ae37fc4/aten/src/ATen/native/TensorIteratorReduce.cpp (L24)
```
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            // init reduction buffer per thread
            float tmp_acc0_arr[64];
            at::vec::Vectorized<float> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(3964928L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0));
                    auto tmp1 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0));
                    auto tmp2 = tmp0 - tmp1;
                    auto tmp3 = tmp2 * tmp2;
                    // reduce to per thread buffers
                    tmp_acc0_vec_arr[tid] = tmp_acc0_vec_arr[tid] + tmp3;
                }
            }
            // second pass reduce
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 + tmp_acc0_arr[tid];
                tmp_acc0_vec = tmp_acc0_vec + tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 + at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>& y) { return x + y; }, tmp_acc0_vec);
            out_ptr0[static_cast<long>(0L)] = static_cast<float>(tmp_acc0);
```

## Test results
I tested this PR with the dynamo benchmarks on a 32-core ICX system.
Result (avg speedup):
| |  before this PR   | after this PR  |
| ---- |  ----  | ----  |
| torchbench | 1.303  | 1.301 |
| huggingface | 1.346  | 1.343 |
| timm | 1.971 | 1.970 |

```
export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so:${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libjemalloc.so
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1

multi_threads_test() {
    CORES=$(lscpu | grep Core | awk '{print $4}')
    export OMP_NUM_THREADS=$CORES
    end_core=$(expr $CORES - 1)
    numactl -C 0-${end_core} --membind=0 python benchmarks/dynamo/${SUITE}.py --${SCENARIO} --${DT} -dcpu -n50 --no-skip --dashboard --only "${MODEL}" ${Channels_extra} ${BS_extra} ${Shape_extra} ${Mode_extra} ${Wrapper_extra} ${Flag_extra} --timeout 9000 --backend=inductor --output=${LOG_BASE}/${SUITE}.csv
}

SCENARIO=performance
DT=float32
export TORCHINDUCTOR_FREEZING=1
Flag_extra="--freezing"
Mode_extra="--inference"

for suite in timm_models huggingface torchbench
do
  export SUITE=$suite
  echo $SUITE
  export LOG_BASE=`date +%m%d%H%M%S`
  mkdir $LOG_BASE
  multi_threads_test
done
```
System info
```
ubuntu@ip-172-31-18-205:~/hz/pytorch$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            6
    BogoMIPS:            5800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mo
                         vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xs
                         aveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.5 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    40 MiB (32 instances)
  L3:                    54 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115620
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-03-15 02:03:10 +00:00
0cd094a4fd Revert "[aoti] Fix compilation bug for buffer mutations (#121688)"
This reverts commit 9f314d4aa82169ee552ae2a8ad701bd0441a12b7.

Reverted https://github.com/pytorch/pytorch/pull/121688 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121688#issuecomment-1998740094))
2024-03-15 01:34:04 +00:00
01d7c948e2 Make torch/_inductor/comms.py recognize native funcol IRs as collective IRs (#118498)
### Summary

As title. After this PR, Inductor should recognize native funcol IRs as collectives wherever the existing funcol IRs are recognized as collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118498
Approved by: https://github.com/wanchaol
2024-03-15 01:24:36 +00:00
60ccf81490 [dynamo] Refactor update_block_stack into a separate function (#121810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121810
Approved by: https://github.com/williamwen42
ghstack dependencies: #121790
2024-03-15 01:01:05 +00:00
1e9a7df8fe [dynamo] Compile time optimizations in tx.step() (#121790)
`python benchmarks/dynamo/microbenchmarks/dynamo_microbenchmarks.py`
- Before: `symbolic_convert_overhead_stress_test: 10.7s`
- After: `symbolic_convert_overhead_stress_test: 8.6s`

`tx.step()` is a small part of that benchmark, so likely the speedup in that isolated function is larger than the top line.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121790
Approved by: https://github.com/oulgen
2024-03-15 01:01:05 +00:00
1afa8e0985 Fix #83153: torch.nn.hardtanh allowed min_val to be greater than max_val (#121627)
Fixes #83153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121627
Approved by: https://github.com/albanD
2024-03-15 00:57:45 +00:00
710446b1eb [dtensor] refactor and generalize stack strategy (#121869)
This PR rewrites the stack strategy to be more generalized. Basically,
the follow pattern for stack/cat-like strategies needs to be smarter,
i.e. it should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS

So this PR refactors how the follow strategy works, and makes sure we
start following the strategy that incurs the lowest cost, i.e. for
multiple PR, RP placements, we should be able to further delay the
pending sum reductions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
2024-03-15 00:34:25 +00:00
92ed8553a6 Revert "Switch cudagraph backend to cudagraph trees (#121019)" and "Add Cudagraphs disable checking (#121018)" (#121864)
This reverts commit 9373ad0bb87b364375a468c296d2daef0e8817d7.

Revert "Add Cudagraphs disable checking (#121018)"

This reverts commit 4af0e634bf02309583dfe3b5c3421442fda5ec7e.

Causes compilation time increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121864
Approved by: https://github.com/eellison
2024-03-15 00:03:09 +00:00
d604ab81a2 [PyTorch] Fix static runtime sigrid_hash precomputed multiplier pass (#120851)
This pass was broken.

Differential Revision: [D54336561](https://our.internmc.facebook.com/intern/diff/D54336561/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D54336561/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120851
Approved by: https://github.com/houseroad
2024-03-15 00:02:38 +00:00
cceabe873f [jit] ClassType hashing: hash on compilation_unit as well (#121928)
Following up on #121874 - it turns out that in our case, we're seeing repeated class names that are from different compilation units.  Our previous hash function wasn't considering the compilation unit, leading to hash collisions (and then exponential memory usage in the number of copies of this class name)

Differential Revision: [D54916455](https://our.internmc.facebook.com/intern/diff/D54916455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121928
Approved by: https://github.com/eellison
ghstack dependencies: #121874
2024-03-14 23:16:08 +00:00
2d9cee20a2 [jit] AliasDB type hash - don't always return 0 (#121874)
This hash was missing an assignment, so for almost all types it was returning "0".

c10::flat_hash_map turns out to have really bad behavior with a terrible hash like this, nearly exponential in memory usage.
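A Python sketch of that failure mode (the actual bug is in C++; the class here is illustrative):

```python
class AlwaysZeroHash:
    """Every instance hashes to 0, mimicking the missing assignment."""
    def __init__(self, v):
        self.v = v
    def __hash__(self):
        return 0
    def __eq__(self, other):
        return self.v == other.v

# Every key lands in one bucket: insertion degrades from amortized O(1)
# toward O(n) per key, and probing/memory behavior blows up accordingly.
table = {AlwaysZeroHash(i): i for i in range(2000)}
```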

Differential Revision: [D54916424](https://our.internmc.facebook.com/intern/diff/D54916424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121874
Approved by: https://github.com/eellison
2024-03-14 23:16:08 +00:00
57b20c51b9 Don't record autograd state ops while torch.compile in pre-dispatch export (#121736)
Summary: Refer to OSS PR for details

Test Plan: CI

Differential Revision: D54812833

In pre-dispatch export, we have a special proxy torch mode where we intercept the torch._C._set_grad_enabled op to correctly capture the user's intention on train/eval. However, this is a bit problematic when we are tracing torch.cond during export, as it calls torch.compile internally. As a result, we end up capturing unwanted autograd context manager calls that happen inside dynamo framework code because the top-level tracer is still active. We fix it by turning off this proxy torch mode. We can still capture autograd ops inside cond branches because dynamo will translate them into HOPs for us, so we don't have to intercept them with the special proxy mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121736
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
2024-03-14 23:06:10 +00:00
bd7beef529 [Inductor] Update the cpp_wrapper entry function signature (#121745)
Summary: Update the entry function to use AtenTensorHandle instead of at::Tensor. This makes the compilation of the generated cpp wrapper code much faster: test_cpu_cpp_wrapper.py from 35 min to 21 min, and test_cuda_cpp_wrapper.py from 21 min to 14 min.

Differential Revision: [D54818715](https://our.internmc.facebook.com/intern/diff/D54818715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121745
Approved by: https://github.com/chenyang78, https://github.com/jansel
ghstack dependencies: #121523, #121743, #121744
2024-03-14 22:23:00 +00:00
8be80706b4 [AOTI] Add pybind for tensor_converter util functions (#121744)
Differential Revision: [D54818716](https://our.internmc.facebook.com/intern/diff/D54818716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121744
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523, #121743
2024-03-14 22:20:51 +00:00
46493ee9b5 [AOTI][refactor] Update tensor_converter util functions (#121743)
Summary: Update the signature of unsafe_alloc_new_handles_from_tensors and alloc_tensors_by_stealing_from_handles. This is a preparation step towards adding pybind for these two functions, which will be used by cpp_wrapper JIT Inductor.

Differential Revision: [D54818717](https://our.internmc.facebook.com/intern/diff/D54818717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121743
Approved by: https://github.com/chenyang78
ghstack dependencies: #121523
2024-03-14 22:17:54 +00:00
3df1b3b0ad [jit] support getattr/hasattr on NamedTuple (#121863)
getattr is already supported on objects and, for the most part, on NamedTuples. The only remaining gap seems to be that hasattr only accepted objects, not NamedTuples. This PR adds support and some basic tests.
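A sketch of what now scripts (the NamedTuple and function are illustrative):

```python
import torch
from typing import NamedTuple

class Point(NamedTuple):
    x: float
    y: float

@torch.jit.script
def first_coord(p: Point) -> float:
    if hasattr(p, "x"):   # hasattr on a NamedTuple previously failed to script
        return p.x
    return 0.0

print(first_coord(Point(1.0, 2.0)))
```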

Differential Revision: [D54888612](https://our.internmc.facebook.com/intern/diff/D54888612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121863
Approved by: https://github.com/eellison
2024-03-14 22:07:28 +00:00
818b14025a [AOTI][refactor] Remove is_legacy_abi_kernel and abi_compatible_kernel (#121523)
Summary: is_legacy_abi_kernel was used for _scaled_dot_product_flash_attention fallback. It is only needed for C shim kernel name matching now, and the name matching is done with a direct string comparison. Also consolidate the fallback cpp kernel naming logic in CppWrapperCpu.

Differential Revision: [D54727789](https://our.internmc.facebook.com/intern/diff/D54727789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121523
Approved by: https://github.com/chenyang78
2024-03-14 22:05:38 +00:00
43e243180b Add gpt-fast as a static benchmark (#121886)
Run:
```
python benchmarks/gpt_fast/benchmark.py
```
It generated a csv file ```gpt_fast_benchmark.csv``` with content like:
```
name,mode,target,actual,percentage
Llama-2-7b-chat-hf,bfloat16,104,103.458618,99.48%
Llama-2-7b-chat-hf,int8,155,158.964615,102.56%
Mixtral-8x7B-v0.1,int8,97,99.760132,102.85%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121886
Approved by: https://github.com/Chillee
2024-03-14 21:46:59 +00:00
0e68eb1505 Add privateuseone flags for c10::EventFlag (#121118)
Fixes #117341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121118
Approved by: https://github.com/albanD
2024-03-14 20:07:12 +00:00
9f314d4aa8 [aoti] Fix compilation bug for buffer mutations (#121688)
I realized there's a bug when unlifting buffer mutations in AOTI.
However, there seems to be a bug during tracing where AOTI mutates the buffer. I didn't take the time to investigate, so I left it as a TODO for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121688
Approved by: https://github.com/chenyang78
2024-03-14 19:35:26 +00:00
0636c11811 [AOTInductor] Include build cmds at the end of wrapper file (#121872)
Summary:
For easier debugging, include build commands at the end of the codegen wrapper.

Test Plan: CI

Differential Revision: D54882164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121872
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-03-14 18:41:17 +00:00
c409292197 [sigmoid] Use deserializer from oss. (#121839)
Summary:
Old path:
thrift -> thrift deserializer -> graph module.
new path:
thrift -> python dataclass -> oss deserializer -> graph_module

Test Plan:
CI
buck2 test mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Reviewed By: SherlockNoMad

Differential Revision: D54855251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121839
Approved by: https://github.com/angelayi
2024-03-14 18:38:58 +00:00
499136a4dd [Inductor] Fix a dynamic shape problem when lowering diagonal (#121881)
Summary: when computing the diagonal size, we need to use the correct symbolic min/max function.

Differential Revision: [D54884899](https://our.internmc.facebook.com/intern/diff/D54884899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121881
Approved by: https://github.com/aakhundov
2024-03-14 18:36:37 +00:00
5b1642516f [with_effects] Skip over profiler.record_function_exit (#121829)
Summary:
tldr: User calls to `torch.autograd.profiler.record_function` fail when tracing with non-strict pre-dispatch export due to an effect-token failure, so the solution is to skip over these operators 😅

Some user code contains calls to a `torch.autograd.profiler.record_function` context, like https://fburl.com/code/uesgknbq and https://fburl.com/code/iogbnsfw, which is used for adding user-defined events into the profiler.

Currently these function calls will be skipped/removed in dynamo (https://fburl.com/code/fkf7qmai) but **non-strict pre-dispatch export** will hit these operators during tracing. However, it seems that although these operators get hit by the dispatcher, they don't actually show up in the final graph (maybe they get DCE-d).

However, an issue comes up with a recent change with effect tokens (D54639390) which creates tokens if it sees a ScriptObject during tracing. The operator `torch.ops.profiler.record_function_exit` takes in a ScriptObject, so the effect tokens framework now tries to add an effect token to this operator, but results in the following error: (https://www.internalfb.com/intern/everpaste/?handle=GI-hvBknzj2ZxYkBABNzdztDxJVAbsIXAAAB, P1195258619)

The reason is that this operator only gets hit during pre-dispatch tracing, not post-dispatch tracing. During pre-dispatch tracing, we first trace using post-dispatch to collect the metadata needed for functionalization, and then we do pre-dispatch tracing to construct the graph. The metadata collection phase is also when we determine which operators need effect tokens and create those tokens. However, since the operator only shows up in pre-dispatch tracing, we do not create any tokens for it. During the actual pre-dispatch tracing to create the graph, we then run into this operator and try to get a token, but none exists, causing an error :(

This PR just blocks the record_function operator from being looked at by the effect tokens framework. But a proper fix might be to have functionalization run on the pre-dispatch graph or have the operator also show up in the post-dispatch graph. But since in the PT2 stack dynamo just gets rid of this operator so that it won't show up anywhere downstream, I think we can also just ignore this operator.
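A hedged sketch of the user pattern in question (module and inputs are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        with torch.autograd.profiler.record_function("user_region"):
            return x.sin()

# Non-strict pre-dispatch export hits the record_function ops during tracing;
# with this change the effect-tokens pass skips them instead of erroring.
ep = torch.export.export(M(), (torch.randn(4),), strict=False)
print(ep.graph)
```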

Test Plan: Fixed test for P1195258619

Differential Revision: D54857444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121829
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
2024-03-14 18:09:43 +00:00
f1f7c5c31e [ez] Document for add_var_to_val (#121850)
Summary: Add doc for ShapeEnv.add_var_to_val

Test Plan: doc only change

Reviewed By: izaitsevfb

Differential Revision: D54872335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121850
Approved by: https://github.com/izaitsevfb
2024-03-14 18:01:09 +00:00
4c3a052acf [BE] Add S3 bucket argument to number of workflows (#121907)
Namely, it adds the `s3-bucket` argument to the following workflows, with the default value set to `gha-artifacts`:
- _docs
- _linux-test workflows
- download-build-artifacts
- pytest-cache-download
- upload-test-artifacts

This prerequisite is required in order to start migrating to other S3 buckets for asset storage; it is one of the required steps to migrate to ARC and move our assets away from our S3 to the Linux Foundation S3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121907
Approved by: https://github.com/malfet
2024-03-14 17:57:05 +00:00
38d7d366b9 [FSDP2] Added 2D DCP save/load test (#121747)
To prepare for FSDP2 + TP/SP in torchtrain, we should verify that we can resume training correctly with DCP save/load. For loading into a new model/optimizer instance, torchtrain uses lightweight `ModelWrapper` and `OptimizerWrapper`. In the added unit test, we use `get_optimizer_state_dict` directly to show the minimal requirement for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121747
Approved by: https://github.com/wz337
2024-03-14 17:24:17 +00:00
443444dc7f [c10d] Add generic scuba logging capability into c10d (#121859)
Summary:
This diff tries to periodically (e.g., every 30s) log critical collective
progress status to a Scuba table, starting with a few metrics such as the
last enqueued seq id.

With the Scuba table, it is our hope that we can easily detect the straggler of a PG,
e.g., a rank that has not progressed its seq_ for X seconds while other ranks in the same PG have a larger seq_.

The implementation needs to make sure that Scuba will be used only for FB internal use
cases.

For OSS, we still provide a generic logger data struct and logger that can be
easily extended. If users do not register the logger, nothing will be logged.

Test Plan:
Re-use the existing unit test for the FB side of operations, such as
test_register_and_dump in test_c10d_manifold, and change the dump period to a
very small number, e.g., 1ms; verified that the logs are correctly shown in the Scuba table:
https://fburl.com/scuba/c10d_work_update/9trhwnmy

Reviewed By: wconstab

Differential Revision: D54556219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121859
Approved by: https://github.com/wconstab
2024-03-14 16:03:45 +00:00
83f8e51404 Add CUTLASS kernel as choice for (u)int8/(b)float16 mixed MM autotuning (#119986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119986
Approved by: https://github.com/kadeng
ghstack dependencies: #119685
2024-03-14 16:03:10 +00:00
be0bdf111c relax tol for flaky nansum_out_dtype_cuda_float32 test (#121550)
TestReductionsCUDA.test_nansum_out_dtype_cuda_float32 would fail or pass depending on the random inputs, as observed by ROCm internal QA testing. The same problematic random inputs break the test for CUDA as well, verified on a V100.

There is precedent in another test within the same file to relax tolerance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121550
Approved by: https://github.com/albanD
2024-03-14 15:28:45 +00:00
7e13b5ba29 Checkout release branch rather then commit_hash when building triton release (#115379) (#121901)
Cherry pick of https://github.com/pytorch/pytorch/pull/115379 from Release 2.2 that should be applied to main and Release 2.3 as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121901
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt
2024-03-14 14:42:29 +00:00
956059fa2e [Fix] Fixed behaviour for the conversion of complex tensors to bool (#121803)
Fixes #120875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121803
Approved by: https://github.com/lezcano
2024-03-14 13:35:15 +00:00
1251f0fa31 Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119685
Approved by: https://github.com/cpuhrsch, https://github.com/kadeng
2024-03-14 13:25:23 +00:00
38d9bb5abc Make PyTorch compilable against upcoming Numpy-2.0 (#121880)
Test plan:
```
% python -c "import torch;import numpy;print(numpy.__version__, torch.tensor(numpy.arange(3, 10)))"
2.1.0.dev0+git20240312.9de8a80 tensor([3, 4, 5, 6, 7, 8, 9])
% python -c "import torch;print(torch.rand(3, 3).numpy())"
[[0.0931946  0.44874293 0.8480404 ]
 [0.93877375 0.10188377 0.67375803]
 [0.02520031 0.89019287 0.5691561 ]]

```
Fixes https://github.com/pytorch/pytorch/issues/121798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121880
Approved by: https://github.com/albanD
2024-03-14 05:36:50 +00:00
b4c53aa0ec Do not compile FP16 arith internally (#121844)
Also, decorate unused args with `C10_UNUSED` to fix linter warnings
Test Plan: `buck2 build -c fbcode.arch=aarch64  //caffe2:ATen-cpu`

Differential Revision: D54870507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121844
Approved by: https://github.com/osalpekar
2024-03-14 05:19:02 +00:00
3eb322ff29 Handle transitive replacements in Triton kernel mutation analysis (#121867)
Summary: Previously, we didn't handle transitive replacements in MLIR walk-based function info mining in the Triton kernel mutation analysis pass. As a result, for the TTIR below:

```
tt.func private @cumsum__fp32S1_16S__1cconstexpr_1__2cconstexpr_False_(%arg0: tensor<1x16xf32> loc("...":296:0)) -> tensor<1x16xf32> attributes {noinline = false} {
    %0 = "tt.scan"(%arg0) <{axis = 1 : i32, reverse = false}> ({
    ^bb0(%arg1: f32 loc(unknown), %arg2: f32 loc(unknown)):
      %1 = tt.call @_sum_combine__fp32_fp32__(%arg1, %arg2) : (f32, f32) -> f32 loc(#loc16)
      tt.scan.return %1 : f32 loc(#loc16)
    }) : (tensor<1x16xf32>) -> tensor<1x16xf32> loc(#loc16)
    tt.return %0 : tensor<1x16xf32> loc(#loc18)
  } loc(#loc15)
```

the mined function dict looked like this:

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Intermediate(idx=26),
                                 Intermediate(idx=26)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

whereas it should look like this (note the `Param(idx=0)` arguments of the `tt.call`):

```
{Intermediate(idx=25): [Op(name='tt.call',
                           fn_call_name='_sum_combine__fp32_fp32__',
                           args=[Param(idx=0),
                                 Param(idx=0)])],
 Intermediate(idx=27): [Op(name='tt.scan.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=25)])],
 Intermediate(idx=-4): [Op(name='tt.return',
                           fn_call_name=None,
                           args=[Intermediate(idx=27)])]}
```

This is fixed in the PR.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_cumsum
.
----------------------------------------------------------------------
Ran 1 test in 1.771s

OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121867
Approved by: https://github.com/oulgen
2024-03-14 04:06:37 +00:00
4cd503c1f3 Enable FX graph cache for a batch of inductor tests (#121696)
Summary: Get more FX graph cache coverage by enabling it for these unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121696
Approved by: https://github.com/eellison
2024-03-14 03:39:59 +00:00
15abc56bd5 Graph break on step closure in optimizer (#121777)
Fixes https://github.com/pytorch/pytorch/issues/116494
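
A minimal sketch of the closure pattern this affects; LBFGS is the canonical closure-taking optimizer, and the names below are placeholders:

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.LBFGS(model.parameters())

def closure():
    opt.zero_grad()
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()
    return loss

@torch.compile
def train_step():
    # dynamo now inserts a graph break around the closure-taking step()
    # rather than erroring out.
    return opt.step(closure)

train_step()
```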

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121777
Approved by: https://github.com/yanboliang
2024-03-14 03:18:23 +00:00
f85f58bf86 Fix quantized linear vulkan tests (#120960)
Summary: Fixed quantized linear Vulkan tests by using an old pack_biases function.

Test Plan:
**Vulkan quantized api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

...
...
...
[ RUN      ] VulkanAPITest.linear_2d_flat
[       OK ] VulkanAPITest.linear_2d_flat (5 ms)
[ RUN      ] VulkanAPITest.linear_2d_small
[       OK ] VulkanAPITest.linear_2d_small (0 ms)
[ RUN      ] VulkanAPITest.linear_2d_large
[       OK ] VulkanAPITest.linear_2d_large (4 ms)
[ RUN      ] VulkanAPITest.linear_3d_flat
[       OK ] VulkanAPITest.linear_3d_flat (2 ms)
[ RUN      ] VulkanAPITest.linear_3d_small
[       OK ] VulkanAPITest.linear_3d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_3d_large
[       OK ] VulkanAPITest.linear_3d_large (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_flat
[       OK ] VulkanAPITest.linear_4d_flat (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_small
[       OK ] VulkanAPITest.linear_4d_small (1 ms)
[ RUN      ] VulkanAPITest.linear_4d_large
[       OK ] VulkanAPITest.linear_4d_large (2 ms)
...
...
[----------] 85 tests from VulkanAPITest (1704 ms total)

[----------] Global test environment tear-down
[==========] 85 tests from 1 test suite ran. (1704 ms total)
[  PASSED  ] 85 tests.

  YOU HAVE 8 DISABLED TESTS

**Vulkan api tests**
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

[----------] Global test environment tear-down
[==========] 426 tests from 1 test suite ran. (4997 ms total)
[  PASSED  ] 423 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] VulkanAPITest.log_softmax_underflow
[  FAILED  ] VulkanAPITest.log_softmax

Differential Revision: D54396367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120960
Approved by: https://github.com/yipjustin
2024-03-14 02:23:00 +00:00
a37caa6ed3 [Quant][Inductor] Enable quantization linear pattern fusion with int8_mixed_bf16 for gelu (#116004)
**Summary**
Enable QLinear Unary pattern for gelu with int8_mixed_bf16

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_int8_mixed_bf16

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116004
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #114853, #114854
2024-03-14 01:52:12 +00:00
43d68e9c8f [Quant][Inductor] Enable quantization linear pattern fusion for gelu inside inductor (#114854)
**Summary**
Enable QLinear Unary pattern for gelu with int8

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_gelu_cpu

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114854
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #114853
2024-03-14 01:49:14 +00:00
25e00545bb [Quant][PT2E] Enable linear and linear-unary post-op gelu quant recipe for x86 inductor quantizer (#114853)
**Summary**
Add Gelu for linear-unary post-op quantization recipe to x86 inductor quantizer.
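
A rough sketch of how such a recipe would be exercised end-to-end through the PT2E flow; the toy module and shapes are placeholders, and the capture API shown is the one documented for PT2E at the time:

```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

class LinearGelu(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        # linear followed by gelu is the linear-unary pattern this recipe targets
        return torch.nn.functional.gelu(self.linear(x))

m = LinearGelu().eval()
example_inputs = (torch.randn(2, 16),)
m = capture_pre_autograd_graph(m, example_inputs)

quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)            # calibration
m = convert_pt2e(m)
```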

**Test plan**
python -m pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_unary_gelu
python test/test_quantization.py -k test_linear_unary_with_quantizer_api
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114853
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2024-03-14 01:46:35 +00:00
a04e7fca8e Use memcache versioning for autotune remote cache (#121748)
Summary: Internal training platform doesn't get updated very frequently, so lets use versioning for memcache.

Test Plan: existing tests

Differential Revision: D54818197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121748
Approved by: https://github.com/aakhundov, https://github.com/jansel
2024-03-14 00:36:10 +00:00
7e076c75bd [C10D] Fix coalescedCollective op Flight Recording (#120430)
Also noticed and filed https://github.com/pytorch/pytorch/issues/120516 during this work. May land this as is and then test/fix the other varieties of coalesced collectives later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120430
Approved by: https://github.com/kwen2501
2024-03-13 23:55:00 +00:00
bf7ac4ddf7 Revert "[export] allow Dim(1,2) for export dynamic shapes (#121642)"
This reverts commit a8dcbf2749f2081f939621db2d38fd15ab7e34a3.

Reverted https://github.com/pytorch/pytorch/pull/121642 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/121642#issuecomment-1996121710))
2024-03-13 23:51:20 +00:00
3e02a7efcd Only FA2 doesn't support attn-mask (#121825)
Fixes #121783
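
A hedged sketch of the dispatch behavior described above: with an `attn_mask`, SDPA should route around flash attention rather than failing (shapes and dtypes are arbitrary):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))
# Boolean mask where True means "attend"; here a causal lower-triangular mask.
mask = torch.ones(128, 128, device="cuda", dtype=torch.bool).tril()

# Flash attention cannot take an attn_mask, so the dispatcher should pick
# the efficient-attention or math backend here instead of erroring out.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```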

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121825
Approved by: https://github.com/cpuhrsch
2024-03-13 23:03:39 +00:00
a8dcbf2749 [export] allow Dim(1,2) for export dynamic shapes (#121642)
Current dynamic shapes implementation fixes lower range of Dims to be 2 for analysis, but allows 0/1 shapes during runtime. This leads to failures when initializing Dim(1,2). This PR sets the lower bound to 0, and avoids erroring out when conflicting with the generated (2, maxsize) constraint during analysis.

Also resolves a derived dim constraints issue with the following code:
```
class Bar(torch.nn.Module):
    def forward(self, x, y):
        return x + y[1:]

dx = Dim("dx", min=1, max=3)
ep = export(
    Bar(),
    (torch.randn(2, 2), torch.randn(3, 2)),
    dynamic_shapes=({0: dx, 1: None}, {0: dx+1, 1: None})
)
print(ep.range_constraints)
```

In main:
```
{s0: ValueRanges(lower=2, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=3, upper=4, is_bool=False)}
```

This PR:
```
{s0: ValueRanges(lower=1, upper=3, is_bool=False), s0 + 1: ValueRanges(lower=2, upper=4, is_bool=False)}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121642
Approved by: https://github.com/avikchaudhuri
2024-03-13 22:59:07 +00:00
70c6f542f2 Revert "[dynamo] Convert invalid args into graph breaks (#121784)"
This reverts commit 0df39480f6a74c9094555e8a61a8c8bb01716d4e.

Reverted https://github.com/pytorch/pytorch/pull/121784 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks ONNX test in trunk 0c1ac4484d ([comment](https://github.com/pytorch/pytorch/pull/121784#issuecomment-1995979435))
2024-03-13 22:12:43 +00:00
aaff8d274a CUDA fast path for _chunk_cat() (#120678)
This PR provides CUDA fast path implementation for ATen Op `_chunk_cat` (#121081).

Performance on a production benchmark:

- Float16 in, Float16 out: 249 -> 500
- BFloat16 in, BFloat16 out: 248 -> 500
- BFloat16 in, Float32 out: 126 -> 278
- Float32 in, Float32 out: 153 -> 260
- Float64 in, Float64 out: 79 -> 132
- int8 in, int8 out: 332 -> 908
- int16 in, int16 out: 250 -> 489
- int32 in, int32 out: 153 -> 260
- int64 in, int64 out: 79 -> 132

Unit: Billion elements per second. Hardware: H100. Baseline: [Existing FSDP implementation](7b3febdca7/torch/distributed/_composable/fsdp/_fsdp_collectives.py (L176))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120678
Approved by: https://github.com/yifuwang
2024-03-13 22:02:06 +00:00
c53e3f57b5 allow fp16 in quant/dequant decompositions (#121738)
Test Plan:
```
buck2 run mode/dev-nosan mode/inplace executorch/examples/models/llama2:export_llama -- -c ~/llama/ultra_new_checkpoint.pt -p ~/llama/params.json -kv -E 8,8 -d fp16 --pt2e_quantize "xnnpack_dynamic" -2
```

Reviewed By: kirklandsign

Differential Revision: D54785950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121738
Approved by: https://github.com/jerryzh168
2024-03-13 21:45:08 +00:00
c7193f4099 [DDP][PT2D][2D] Enable DDP + TP and add test for compiled DDP + TP (#120479)
This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict.

This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce.

**TODOs**

- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [x] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of all_reduces.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479
Approved by: https://github.com/wz337
ghstack dependencies: #113209
2024-03-13 21:41:22 +00:00
dd568f4207 [Export, AOTInductor] Populate ShapeEnv's var_to_val during deserialization (#121759)
Summary:
Deserialization didn't populate ShapeEnv's `var_to_val` field properly, and AOTInductor relies on this field to compile dynamic shapes properly.
As a result, AOTI failed at compiling a deserialized ExportedProgram.

Test Plan: buck2 test  mode/dev-nosan caffe2/test/inductor/fb:test_aot_inductor_pt2_inference

Differential Revision: D54559494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121759
Approved by: https://github.com/avikchaudhuri
2024-03-13 21:28:25 +00:00
a2a4693c1b Revert "Init CUDA instead of faking memory stats (#121698)"
This reverts commit 2460f0b1c7bb6e088aca1f6e9bb62c834053d71b.

Reverted https://github.com/pytorch/pytorch/pull/121698 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
45a835cef2 Revert "[compiled autograd] free stack objects before calling compiled graph (#121707)"
This reverts commit 5b90074540577267c29f5f784be123ee54f6491d.

Reverted https://github.com/pytorch/pytorch/pull/121707 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it breaks inductor CPU tests 5b90074540 ([comment](https://github.com/pytorch/pytorch/pull/121698#issuecomment-1995868090))
2024-03-13 21:23:42 +00:00
8b1b61bc70 [compiled autograd] support custom ops backed by c++ autograd::Function (#120681)
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports

limitations:
- requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and maybe an additional compiled_args + apply_with_saved implementation. This was the only way I could think of for soundness
- will throw if we can't hash the saved_data, i.e., for any non-implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function that nonetheless has an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collision I can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.

Differential Revision: [D54818488](https://our.internmc.facebook.com/intern/diff/D54818488)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
2024-03-13 21:13:21 +00:00
58ff55aac5 Add support for tt.scan to triton kernel mutation analysis (#121828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121828
Approved by: https://github.com/aakhundov, https://github.com/Skylion007
2024-03-13 20:37:56 +00:00
8e6d572b4e [DDP][PT2D] Allreduce fusion fx pass using concat and all_reduce_coalesced (#113209)
Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/)

**TL;DR**
This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with concat op and 2) fusion with all_reduce_coalesced. When DDP detects that Python reducer is being used, DDP will automatically turn on the fusion.

This PR does not invent any algorithm and simply reflects the bucket size users set to DDP.

**Implementation Details**
*Fusion with concat op*
The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping will also be performed to get the individual gradients.

Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we can perform the gradient scaling over the concatenated buffer.

*Fusion with `all_reduce_coalesced`*
The idea of this fusion is to use the `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoids the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we cannot do one simple gradient scaling but have to rely on `foreach_div` to help with the gradient scaling.
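
A minimal sketch of the concat-based variant, assuming an initialized process group; `fused_allreduce` is a hypothetical helper illustrating the pattern, not the actual Inductor pass:

```python
import torch
import torch.distributed as dist

def fused_allreduce(grads, world_size):
    # Concatenate all gradients into one flat buffer and reduce it once.
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat)
    flat.div_(world_size)  # gradient scaling over the concatenated buffer
    # Split and reshape to recover the individual (averaged) gradients.
    out, offset = [], 0
    for g in grads:
        n = g.numel()
        out.append(flat[offset:offset + n].view_as(g))
        offset += n
    return out
```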

**Limitations**
Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. The support of multiple DDP instances with different PG and data type will come in the later PRs.

**TODOs**
- [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass.
- [ ] Add unit tests to ensure the fusion doesn't break DDP + TP.
- [ ] Group different PG and data type of `all_reduce`s.
- [ ] Mixed precision supports and tests
- [ ] Implement the fusions with Inductor IR.
- [ ] Add auto bucketing based on Inductor profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209
Approved by: https://github.com/yf225
2024-03-13 20:37:09 +00:00
0c1ac4484d Support call_method in DDPOptimizer (#121771)
This PR fixes Issue #111279.

While #111279 reported the issue with `MultiheadAttention`, a minimal reproduction would be:
```python
class ToyModel(nn.Module):
    def __init__(self,):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear.forward(x) # Error
        # return self.linear(x) # OK
```

Dynamo treats `self.linear(x)` as `call_module` while treating `self.linear.forward(x)` as a [`get_attr` and a `call_method`](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/nn_module.py#L358-L378). However, existing DDPOptimizer assumes, for a `get_attr` node, `getattr(gm, node.target)` gives a tensor with the `requires_grad` attribute. Existing DDPOptimizer also does not support `call_method` nodes.

This PR adds support for `call_method` and check on `get_attr`. It also checks if a module's parameters have been added to a bucket to support multiple method calls from the same module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121771
Approved by: https://github.com/yf225
2024-03-13 20:03:15 +00:00
0df39480f6 [dynamo] Convert invalid args into graph breaks (#121784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121784
Approved by: https://github.com/yanboliang
ghstack dependencies: #121615, #121616
2024-03-13 20:02:33 +00:00
5b90074540 [compiled autograd] free stack objects before calling compiled graph (#121707)
Moved compilation code into _compiled_autograd_impl, which frees stack-allocated objects, e.g. AutogradCompilerCall

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121707
Approved by: https://github.com/jansel
ghstack dependencies: #121698
2024-03-13 19:31:44 +00:00
2460f0b1c7 Init CUDA instead of faking memory stats (#121698)
This is very confusing when checking memory usage while allocations are only happening via the C API. We should change it to a warning/error or just init CUDA. Codepaths that run on non-CUDA environments shouldn't call into these functions in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121698
Approved by: https://github.com/jansel
2024-03-13 19:31:44 +00:00
cd949d133e Support setUpClass & tearDownClass with instantiate_device_type_tests() (#121686)
Summary: instantiate_device_type_tests() creates dynamic test case classes that derive from a "template class". By default, the test harness will call the setUpClass() and tearDownClass() methods defined by the template class (if the template class defines them). We can explicitly create these methods in the dynamic class and arrange to call those methods in both base classes. That allows us to support setUpClass & tearDownClass in test classes used with instantiate_device_type_tests().
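
A minimal sketch of what this enables, assuming the standard internal test helpers; the template class below is a placeholder:

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExpensiveSetup(TestCase):
    @classmethod
    def setUpClass(cls):
        super().setUpClass()
        cls.data = torch.arange(10)  # runs once per generated device class

    @classmethod
    def tearDownClass(cls):
        super().tearDownClass()

    def test_uses_shared_state(self, device):
        self.assertEqual(self.data.to(device).numel(), 10)

# Generates TestExpensiveSetupCPU / TestExpensiveSetupCUDA, each of which
# now calls the template's setUpClass/tearDownClass as well.
instantiate_device_type_tests(TestExpensiveSetup, globals())

if __name__ == "__main__":
    run_tests()
```
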
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121686
Approved by: https://github.com/ezyang, https://github.com/eellison
2024-03-13 18:28:42 +00:00
ffabb25c48 Count the number of entries directly in avg_pool2d lowering (#121429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121429
Approved by: https://github.com/peterbell10
ghstack dependencies: #116085
2024-03-13 18:19:47 +00:00
a19a05fd1d Add lowering for avg_pool{1, 3}d (#116085)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116085
Approved by: https://github.com/peterbell10
2024-03-13 18:19:47 +00:00
79fac48bb3 Use pytorch bot's labeler (#121762)
Change corresponds to https://github.com/pytorch/test-infra/pull/4995
Testing (very light) in https://github.com/malfet/deleteme/pull/81
Should help with https://github.com/pytorch/test-infra/issues/4950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121762
Approved by: https://github.com/huydhn
2024-03-13 17:16:49 +00:00
05df03ec1b Allow custom attributes for torch function subclasses (#121693)
Added custom attribute access with test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121693
Approved by: https://github.com/anijain2305
2024-03-13 17:01:57 +00:00
92a2b214f8 Make translation validation more user friendly (#120880)
Two main changes:

- Don't rethrow the exception when we fail in TV, just throw the entire
  thing and trust the user will inspect logs / backtrace to see we
  failed in TV

- Don't add an event to the TV logs until we've confirmed that the event
  actually runs without erroring.  This prevents us from putting events
  that e.g., fail because the guard on data dependent size, and the
  failing in TV.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120880
Approved by: https://github.com/lezcano, https://github.com/ysiraichi
2024-03-13 15:21:59 +00:00
b1d5998956 Upgrade to tlparse 0.3.7 (#121772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121772
Approved by: https://github.com/Skylion007
2024-03-13 15:21:20 +00:00
5498804ec2 [MPS] Fix naive matmul for BFloat16 (#121731)
Will only work on MacOS14 or newer, so compile the shader with `MTLLanguageVersion_3_1` when appropriate

Fixes https://github.com/pytorch/pytorch/issues/121583
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121731
Approved by: https://github.com/albanD
2024-03-13 14:34:03 +00:00
559ca13b3f [dynamo] Refactor TorchInGraphFunctionVariable for compile time (#121616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121616
Approved by: https://github.com/oulgen
ghstack dependencies: #121615
2024-03-13 14:21:21 +00:00
51cf57c6c6 Revert "Include torch warn in each error in cudnn/Conv_v8.cpp (#120719)"
This reverts commit 5fd7f5c4e336c2c3041e10529990c620cc8cf9a5.

Reverted https://github.com/pytorch/pytorch/pull/120719 on behalf of https://github.com/janeyx99 due to sorry but am reverting as this prints unwanted warnings even when an exception is not thrown  ([comment](https://github.com/pytorch/pytorch/pull/120719#issuecomment-1994491826))
2024-03-13 14:09:38 +00:00
a157a0d00d [constraints] Fix scalar type for constraint_range to Long (#121752)
Differential Revision: [D54822125](https://our.internmc.facebook.com/intern/diff/D54822125)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121752
Approved by: https://github.com/ezyang
2024-03-13 11:11:09 +00:00
7fe0cc53e9 make _process_dynamic_shapes an implementation detail (#121713)
Summary: `_process_dynamic_shapes` converts new dynamic shapes to old constraints, but in the future may not need to do so. Preparing for that future.

Test Plan: CI

Differential Revision: D54780374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121713
Approved by: https://github.com/tugsbayasgalan
2024-03-13 08:33:00 +00:00
5088e4956e Add quantized conv transpose2d op (#120151)
Test Plan:
Run vulkan api test:
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 418 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 418 tests from VulkanAPITest
....
[----------] Global test environment tear-down
[==========] 418 tests from 1 test suite ran. (4510 ms total)
[  PASSED  ] 417 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 9 DISABLED TESTS

Run quantized vulkan api test: Note the quantized linear tests are failing but all the convolution tests still pass. Linear failures are being debugged.
# buck2 build --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 --show-output
# buck-out//v2/gen/fbsource/xplat/caffe2/pt_vulkan_quantized_api_test_binAppleMac
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
[==========] Running 86 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 86 tests from VulkanAPITest
...
[  PASSED  ] 77 tests.
[  FAILED  ] 9 tests, listed below:
[  FAILED  ] VulkanAPITest.linear_2d_flat
[  FAILED  ] VulkanAPITest.linear_2d_small
[  FAILED  ] VulkanAPITest.linear_2d_large
[  FAILED  ] VulkanAPITest.linear_3d_flat
[  FAILED  ] VulkanAPITest.linear_3d_small
[  FAILED  ] VulkanAPITest.linear_3d_large
[  FAILED  ] VulkanAPITest.linear_4d_flat
[  FAILED  ] VulkanAPITest.linear_4d_small
[  FAILED  ] VulkanAPITest.linear_4d_large

 9 FAILED TESTS
  YOU HAVE 8 DISABLED TESTS

Differential Revision: D52344261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120151
Approved by: https://github.com/yipjustin
2024-03-13 08:09:57 +00:00
e99fa0042c Back out "[DeviceMesh] Add support for nD slicing (#119752)" (#121763)
Summary:
Original commit changeset: e52b8809c8d8

Original Phabricator Diff: D54778906

We have to back out this diff.
D54778906 seems to be causing test failures for APF, blocking trunk health and hence the release. Just starting to look at the issue. T182209248

Test Plan: Sandcastle

Reviewed By: satgera

Differential Revision: D54825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121763
Approved by: https://github.com/osalpekar
2024-03-13 07:22:08 +00:00
be33d31ae2 add std::ostream& operator<< for BFloat16 in BFloat16.h (#121302)
This PR moves the `operator<<` of `BFloat16` to `BFloat16.h`.

Previously, this function was in `TensorDataContainer.h`. If one needs to `std::cout` a `BFloat16` variable when debugging, `TensorDataContainer.h` has to be included. This is inconvenient and counterintuitive.

Other dtypes, such as `Half`, define their `operator<<` in the headers where they are defined, such as `Half.h`. Therefore, I think it makes more sense to move the `operator<<` of `BFloat16` to `BFloat16.h`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121302
Approved by: https://github.com/ezyang
2024-03-13 06:47:34 +00:00
5986552ebe [nit][DCP][DSD] Remove variables not being used in test_state_dict.py #121204 (#121773)
Replacing https://github.com/pytorch/pytorch/pull/121204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121773
Approved by: https://github.com/Skylion007
2024-03-13 06:35:04 +00:00
da2a9a0512 _foreach_copy with different src/dst dtypes (#121717)
Fixes #115171
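
For reference, a small usage sketch of the now-supported cross-dtype copy (assumes a CUDA device; sizes are arbitrary):

```python
import torch

dst = [torch.empty(512, 512, device="cuda", dtype=torch.bfloat16) for _ in range(8)]
src = [torch.randn(512, 512, device="cuda", dtype=torch.float32) for _ in range(8)]
# Previously this required matching dtypes; now the copy casts per tensor.
torch._foreach_copy_(dst, src)
```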

```
torch.version.git_version = '6bff6372a922fe72be5335c6844c10e2687b967d', torch.cuda.get_device_name() = 'NVIDIA RTX 6000 Ada Generation'
[------------------ foreach copy - self: torch.float32 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          14.2        |          12.6        |           12.7
      num_tensors: 256   |         688.0        |         510.3        |          514.0
      num_tensors: 1024  |        2768.0        |        2053.3        |         2047.7

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.9        |            8.8
      num_tensors: 256   |         497.6        |         344.3        |          348.3
      num_tensors: 1024  |        1991.9        |        1392.0        |         1389.0

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (512, 512) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          10.0        |           8.8        |            8.8
      num_tensors: 256   |         497.5        |         344.5        |          348.0
      num_tensors: 1024  |        1993.2        |        1390.4        |         1387.5

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float32 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          19.0        |          17.9        |           18.1
      num_tensors: 256   |         707.2        |         540.2        |          543.1
      num_tensors: 1024  |        2900.6        |        2156.6        |         2159.2

Times are in microseconds (us).

[------------------ foreach copy - self: torch.float16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.8        |          13.7        |           13.1
      num_tensors: 256   |         513.2        |         352.6        |          350.4
      num_tensors: 1024  |        2047.6        |        1404.4        |         1400.4

Times are in microseconds (us).

[----------------- foreach copy - self: torch.bfloat16 - shape: (515, 515) -----------------]
                         |  src: torch.float32  |  src: torch.float16  |  src: torch.bfloat16
1 threads: ----------------------------------------------------------------------------------
      num_tensors: 32    |          13.6        |          12.8        |           14.2
      num_tensors: 256   |         511.9        |         351.8        |          350.6
      num_tensors: 1024  |        2045.4        |        1402.2        |         1401.4

Times are in microseconds (us).

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121717
Approved by: https://github.com/janeyx99
2024-03-13 05:42:28 +00:00
a13dd92d88 [dynamo] Minor compile time optimizations in torch.py (#121615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121615
Approved by: https://github.com/oulgen
2024-03-13 05:36:22 +00:00
d619be57c0 [executorch hash update] update the pinned executorch hash (#121056)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121056
Approved by: https://github.com/pytorchbot
2024-03-13 04:54:16 +00:00
0c1d59b72f CI: Fix flaky artifact upload step (#121733)
This PR changes the upload artifact step of the wheels and conda build to write
each matrix entry to a different file. This is because updating the same file
from multiple jobs can be flaky as is warned in the docs for upload-artifact

> Warning: Be careful when uploading to the same artifact via multiple jobs as artifacts may become corrupted. When uploading a file with an identical name and path in multiple jobs, uploads may fail with 503 errors due to conflicting uploads happening at the same time. Ensure uploads to identical locations to not interfere with each other.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121733
Approved by: https://github.com/huydhn
ghstack dependencies: #121268
2024-03-13 04:42:52 +00:00
52ed35bb64 [inductor] Update triton pin (#121268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121268
Approved by: https://github.com/oulgen, https://github.com/malfet
2024-03-13 04:42:52 +00:00
07330ff7b6 [MPS][BE] Define _compute_tolerances (#121754)
Right now the logic is mostly duplicated between `test_output_match` and `test_output_gradient_match`,
so move the tolerance definition logic into a shared `_compute_tolerances` function and
only keep the differences (for example, grad checks are completely skipped for `torch.unique`) in the respective test functions.

Also, increase the tolerance for `pow` and `__rpow__` only on MacOS-13.3 or older and remove the GRAD xfail list entries for those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121754
Approved by: https://github.com/albanD
2024-03-13 04:08:06 +00:00
f83392b677 cublasLt workspace warning info is misleading, the unit of measuremen… (#121073)
cublasLt workspace warning info is misleading, the unit of measurement should be KiB instead of bytes

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121073
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-03-13 03:37:40 +00:00
e755dab0d1 [ROCm] Enable several test_unary_ufuncs UTs on ROCm (#121104)
Enabled:
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex64
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_small__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atan_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal_atanh_cuda_complex128
test_unary_ufuncs::TestUnaryUfuncsCUDA::test_reference_numerics_extremal__refs_atanh_cuda_complex128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121104
Approved by: https://github.com/jeffdaily, https://github.com/ezyang
2024-03-13 03:34:22 +00:00
f24ae66abf [AOTInductor] Skip tests on RoCM for duplicate_constant_folding (#121750)
Summary: Skip AMD tests for duplicated kernels in constant folding

Test Plan: Diff is test

Differential Revision: D54820804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121750
Approved by: https://github.com/huydhn
2024-03-13 03:21:21 +00:00
9f235971f0 Gate tt.reduce Triton mutation tests on Triton version (#121753)
Summary: The goal is to make `test_argmax` and `test_reduce_sum` work both before and after https://github.com/openai/triton/pull/3191 is included into the Triton pin. This is important to make those tests work during the Triton pin update process, both in OSS and internally.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_reduce_sum -k test_argmax
..
----------------------------------------------------------------------
Ran 2 tests in 1.906s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121753
Approved by: https://github.com/Skylion007
2024-03-13 01:43:02 +00:00
7d05c4c093 Remove error anti-pattern when dealing with dynamic shape output (#121681)
There are cases where capture_dynamic_output_shape_ops=True and we will still see DynamicOutputShapeException. For example, when an op doesn't have a meta kernel implemented to return the correct dynamic shape output. If we blindly give users instructions to set capture_dynamic_output_shape_ops to True, users would try it and see no change. As witnessed in this issue:
https://github.com/pytorch/pytorch/issues/121036#issuecomment-1985221435
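
For contrast, a minimal sketch of the case where the flag does help, i.e. an op such as `nonzero` whose meta kernel can produce a dynamic output shape:

```python
import torch

torch._dynamo.config.capture_dynamic_output_shape_ops = True

@torch.compile(fullgraph=True)
def f(x):
    return x.nonzero()  # data-dependent output shape

f(torch.tensor([0.0, 1.0, 0.0, 2.0]))
# If an op lacks a meta kernel for its data-dependent output, the
# exception persists no matter how this flag is set.
```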

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121681
Approved by: https://github.com/tugsbayasgalan
2024-03-13 00:45:23 +00:00
9df0dca7f6 Revert "[ Inductor ] Shape padding honors output stride preservation (#120797)"
This reverts commit 57fc35a3af09f7657b2be593a1046f0ac2dd50ab.

Reverted https://github.com/pytorch/pytorch/pull/120797 on behalf of https://github.com/williamwen42 due to perf regression on dashboard ([comment](https://github.com/pytorch/pytorch/pull/120797#issuecomment-1992857428))
2024-03-13 00:43:34 +00:00
02bb2180f4 [torch export] replace traceback.extract_stack with CapturedTraceback.extract (#121449)
Summary:
with a simple bench in TestDeserializer.test_basic function:
```
time_start = time.time()
for i in range(1000):
    self.check_graph(MyModule(), inputs)
warnings.warn(f"time_taken: {time.time() - time_start}")
```
and forcing FakeTensorConfig.debug to True, record_stack_traces to True, and the logging level to debug, it shows that the changed code is consistently around 20 secs faster (~90s vs. originally ~110s)

Test Plan:
test passed, see summary

compared debug trace before and after:
- exactly the same for fake tensor and proxy callsite https://www.internalfb.com/intern/diffing/?paste_number=1189883685
- slightly different for the user frame in proxy node https://www.internalfb.com/intern/diffing/?paste_number=1189884347

Differential Revision: D54237017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121449
Approved by: https://github.com/angelayi
2024-03-13 00:19:05 +00:00
7a53dedb07 CI: Specify libc and libstdcxx versions in conda environments (#121556)
Without this we get mismatches between the GLIBC and GLIBCXX ABI used
by conda packages vs pytorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121556
Approved by: https://github.com/isuruf, https://github.com/malfet
2024-03-13 00:12:54 +00:00
68be750e17 Cleanup some exception handling in triton mutation tracking (#121739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121739
Approved by: https://github.com/Skylion007
ghstack dependencies: #121690
2024-03-13 00:02:36 +00:00
a9274c9a2c Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672)
This PR corrects the example in the AOTInductor example which currently fails with:
```
/home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’
   21 |     std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl;
      |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672
Approved by: https://github.com/desertfire
2024-03-12 23:43:40 +00:00
79ee6bbde3 Support triton.language.dtype with torch.compile (#121690)
Putting this PR as an RFC since I have resorted to some horrible hacks in order to make this work.
```
(Pdb) p triton.language.float32
triton.language.fp32
(Pdb) p str(triton.language.float32)
'fp32'
(Pdb) p repr(triton.language.float32)
'triton.language.fp32'
```
This means that we need to "rewrite" them for fx graph and inductor execution.

This PR allows Mamba2 to work with `torch.compile`.
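
A hedged sketch of the now-working pattern: passing a `triton.language.dtype` object as a kernel argument from a compiled region (the kernel and shapes are illustrative):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def cast_kernel(x_ptr, out_ptr, n, DTYPE: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x.to(DTYPE), mask=mask)

@torch.compile
def f(x):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    # tl.float32 is a triton.language.dtype object passed as a constexpr arg,
    # which previously could not be handled through torch.compile.
    cast_kernel[grid](x, out, n, DTYPE=tl.float32, BLOCK=1024)
    return out
```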

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121690
Approved by: https://github.com/Skylion007
2024-03-12 23:21:46 +00:00
22bb24986d [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121388
Approved by: https://github.com/jansel
2024-03-12 22:48:48 +00:00
519151a062 [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on each recompilation run.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/izaitsevfb
2024-03-12 22:18:43 +00:00
a95ceb51a2 Release fix pinning slow-tests.json (#121746)
The apply-release-changes script adds a version to SLOW_TESTS_FILE, which should not change.

Test:
```
SLOW_VER=test
sed -i -e s#/slow-tests.json#"/slow-tests.json?versionId=${SLOW_VER}"#  tools/stats/import_test_stats.py
```
Output:
```
SLOW_TESTS_FILE = ".pytorch-slow-tests.json"
...
url = "https://ossci-metrics.s3.amazonaws.com/slow-tests.json?versionId=test"
```

related to: https://github.com/pytorch/pytorch/pull/121726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121746
Approved by: https://github.com/huydhn
2024-03-12 22:04:55 +00:00
a5ec45f2ec [Inductor Cutlass backend] Move tests to separate file (#121489)
Move Cutlass backend related tests to test/inductor/test_cutlass_backend.py - no changes to the tests themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121489
Approved by: https://github.com/jansel
2024-03-12 21:59:48 +00:00
844bfbbd2e feat: Update Dockerfile default versions for Python, OS, and CUDA arch list (#121560)
- Update Dockerfile default versions for Python, OS, and CUDA arch list
	- Python 3.8 is EOL later this year; the `docker.Makefile` has 3.10 as default
	- `docker.Makefile` is using 22.04, so this just aligns that
	- The GPU feature list is quite dated, most of those architectures are long past EOL and we aren't getting the newer cards (A100, H100) into that list until now https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121560
Approved by: https://github.com/seemethere, https://github.com/Neilblaze, https://github.com/atalman, https://github.com/malfet
2024-03-12 21:43:26 +00:00
d62bdb087d [Profiler] add missing field device_resource_id (#121480)
Fixes #121479

Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480
Approved by: https://github.com/aaronenyeshi
2024-03-12 21:42:53 +00:00
5478a4e348 Don't run non-strict for test case that doesn't need non-strict (#121710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121710
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #121652, #121678, #121687
2024-03-12 21:32:33 +00:00
5b506c8bce Revert "[dynamo][guards] Use lazy variable tracker for func defaults (#121388)"
This reverts commit 04a5d6e8d3f09ee6741484bcfea022228f747b09.

Reverted https://github.com/pytorch/pytorch/pull/121388 on behalf of https://github.com/osalpekar due to causing executorch model-test failures internally. See [D54707529](https://www.internalfb.com/diff/D54707529) ([comment](https://github.com/pytorch/pytorch/pull/121388#issuecomment-1992619251))
2024-03-12 21:31:18 +00:00
522d972924 [eazy] add more log when accuracy check fail (#121656)
Add these logs to debug the regression of the accuracy test for the dm_nfnet_f0 model for training.

With these extra logs, when the accuracy check fails we can verify whether it is close to succeeding or not. If so, that indicates there is no real issue, just flakiness, and we can probably tune the tolerance to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-12 20:58:20 +00:00
f50c652422 avoid aten dispatch shadowing type with variable (#121659)
Summary:
`DECLARE_DISPATCH` is shadowing the variable data with the data type:
`extern TORCH_API struct name name` -> `extern TORCH_API struct gemm_stub gemm_stub` for instance.
This is probably dangerous behavior to rely on, as the compiler needs to always resolve to the type and/or the data based on context. The previous macro fails with VS2022.

Test Plan: `buck2 build arvr/mode/win/vs2022/cpp20/opt //xplat/caffe2:aten_pow_ovrsource`

Differential Revision: D54699849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121659
Approved by: https://github.com/albanD
2024-03-12 20:50:47 +00:00
6d8a7d6e58 [pytorch] optional zero points on dequantize per channel (#121724)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2364

bypass-github-export-checks

Test Plan: sandcastle

Reviewed By: mikekgfb

Differential Revision: D54709217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121724
Approved by: https://github.com/mikekgfb
2024-03-12 19:54:11 +00:00
a6149eba12 [easy] Refactor MultiOutput. codegen_list_tuple_access to use subclass type checks (#121662)
Summary:
# Why?

Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list`, which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`) and `immutable_list` isn't explicitly supported here.

Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well.
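
A tiny illustration of the difference, assuming nothing beyond `torch.fx`:

```python
from torch.fx.immutable_collections import immutable_list

itype = immutable_list
print(itype is list)            # False: a concrete-type check misses subclasses
print(issubclass(itype, list))  # True: the runtime subclass check accepts it
```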

Test Plan: ci

Differential Revision: D54764829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662
Approved by: https://github.com/aakhundov
2024-03-12 19:27:56 +00:00
90e886aa6c Sanity check for non-strict (#121687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #121652, #121678
2024-03-12 18:21:32 +00:00
443e241cc5 Don't cache predispatch kernels (#121712)
Summary: Title

Test Plan: CI

Differential Revision: D54791087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712
Approved by: https://github.com/ydwu4
2024-03-12 18:05:59 +00:00
a26480a4d1 [dtensor] move early return check into redistribute autograd function (#121653)
This PR fixes a redistribute bug by moving the early-return check into the
redistribute autograd function: even when we redistribute to the
same placement, the grad_placements from the `to_local` call might be
different, so the redistribute backward still needs to happen.
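
A hedged sketch of the scenario, assuming an initialized process group; the placements are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import distribute_tensor, Replicate, Shard
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (dist.get_world_size(),))
dt = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Replicate()])

# Forward is a no-op (same placements), but the grad_placements differ from
# the current placements, so the backward redistribute must still run.
out = dt.redistribute(mesh, [Replicate()])
local = out.to_local(grad_placements=[Shard(0)])
```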

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
2024-03-12 17:37:30 +00:00
00a53b58dd Refactor release only changes to two step execution (#121728)
Refactor release only changes to two step execution.

1. Step ``tag-docker-images.sh``. Tags the latest docker images for the current release. This step takes about 30 min to complete. It may fail due to space issues on the local host or HTTP connection issues when pulling images, and hence should be rerun if it fails.

2. Apply release-only changes. ``apply-release-changes.sh`` prepares a PR with the release-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121728
Approved by: https://github.com/jeanschmidt
2024-03-12 17:22:22 +00:00
4e63d9065a [dynamo] Delete record replay tests as they are not maintained (#121705)
Fixes https://github.com/pytorch/pytorch/issues/115518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121705
Approved by: https://github.com/mlazos
2024-03-12 17:16:34 +00:00
cd1751b14f [dynamo] Measure Dynamo cache latency lookup (#121604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121604
Approved by: https://github.com/jansel
ghstack dependencies: #121614, #121622
2024-03-12 17:09:11 +00:00
22489bfe70 [dynamo][guards-cpp-refactor] Directly call root guard manager in eval_frame (#121622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121622
Approved by: https://github.com/jansel
ghstack dependencies: #121614
2024-03-12 17:09:11 +00:00
2348e8e4e7 [dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614)
Use NO_HASATTR guard for the common part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614
Approved by: https://github.com/jansel
2024-03-12 17:08:56 +00:00
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec5fb9bccb49bf4ee4ab561e872c0f50.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
b84f94f6a3 Restore timestamps on C++ logs without glog (#121384)
It looks like it was commented out because the original implementation was not sufficiently portable. I had to do some rewrites to the innards to make it portable. No Windows nanoseconds support because I'm lazy.

I tested by running `build/bin/TCPStoreTest` and observing the log messages there.  I am actually not sure how to look at the log messages from Python though.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121384
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-03-12 17:01:32 +00:00
704e15307e [caffe2] replace refernces to np.asscalar (#121332) (#121545)
Summary:

`np.asscalar` was deprecated and removed in a recent NumPy release. It used to be implemented the following way, and the recommended alternative is to call `item()` directly:
```python
def asscalar(a):
    return a.item()
```
This fixes all of the references.

Test Plan: visual inspection and automated tests

Differential Revision: D54697760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121545
Approved by: https://github.com/malfet
2024-03-12 16:58:47 +00:00
d1715c3adb [export] Update error message for set_grad (#121666)
Context: https://fb.workplace.com/groups/222849770514616/posts/381979051268353/?comment_id=383334957799429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121666
Approved by: https://github.com/ydwu4
2024-03-12 16:41:45 +00:00
3c8c7e2a46 [dynamo] Tweak naming for module hook bw_state (#121609)
Some minor changes not related to the other PRs in the stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121609
Approved by: https://github.com/yanboliang
2024-03-12 16:27:56 +00:00
7a68e0a3e8 [DCP][state_dict] Remove the check of FSDP has root (#121544)
Root may not exist due to FSDP lazy initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121544
Approved by: https://github.com/Skylion007
ghstack dependencies: #121273, #121276, #121290
2024-03-12 15:43:19 +00:00
85dc254364 [DTensor] Moved Transformer sharding to staticmethod (#121660)
To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests.

Test Plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #121360, #121357
2024-03-12 15:08:57 +00:00
cc51e100f5 [ET-VK] Enable Dynamic shape support via tensor virtual and physical resizing (#121598)
Summary:
## Context

This changeset lays the foundations for supporting dynamic shapes in the ExecuTorch Vulkan delegate by allowing Tensors to be resized in one of two ways:

1. Discarding underlying `vkImage` or `vkBuffer` and reallocating a new `vkImage` or `vkBuffer` with updated sizes. This method is intended to be used when the current `vkImage` or `vkBuffer` is not large enough to contain the new sizes.
2. Update the tensor's size metadata without reallocating any new resources. This allows shaders to interpret the underlying `vkImage` or `vkBuffer` as if it were smaller than it actually is, and allows command buffers to be preserved when sizes are changed.

Test Plan: Check CI. Tests have also been added to `vulkan_compute_api_test` that test the two methods of tensor resizing.

Differential Revision: D54728401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121598
Approved by: https://github.com/jorgep31415
2024-03-12 14:32:00 +00:00
2a99e6f299 Update error message (#121644)
Summary:
We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead.

Update the error message to explicitly say that sparse_allreduce is not supported.

Test Plan: sandcastle

Differential Revision: D54759307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
2024-03-12 13:04:21 +00:00
edf22f3a48 Modify signature of dequantize ops for decomposed quantized Tensor (#119173) (#121450)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308

Note: The initial purpose of this PR is to draw suggestions and feedback regarding a better alternative, if any.

At present, the dequantize op for the decomposed quantized Tensor representation, e.g. dequantize_per_tensor(), assumes the output dtype is torch.float and hence does not have the output dtype in its operator argument list. However, this op signature becomes unusable when the assumption breaks: if the output dtype is different from torch.float, there is no way to specify it during dequantization.

This change is aimed at generalizing the signature of dequantize ops like dequantize_per_tensor() for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also like to have suggestions and feedback regarding any better alternative that can be used instead.
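
A hedged sketch of the proposed signature in use; the exact argument name (`output_dtype`, taken from the description above) and op namespace may differ in the final implementation:

```python
import torch

x = torch.randn(4)
q = torch.ops.quantized_decomposed.quantize_per_tensor(
    x, 0.1, 0, -128, 127, torch.int8)
# Previously the result was always torch.float; the extra argument lets
# the caller request a different output dtype (name assumed from the text).
y = torch.ops.quantized_decomposed.dequantize_per_tensor(
    q, 0.1, 0, -128, 127, torch.int8, output_dtype=torch.bfloat16)
```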

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel

Reviewed By: digantdesai

Differential Revision: D53590486

Pulled By: manuelcandales

Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
2024-03-12 12:36:31 +00:00
06d2392003 Support tt.reduce in Triton kernel analysis pass (#121706)
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.

Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path which is active only on when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run the `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When pin updates, we'll make those tests official (left a TODO comment).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
2024-03-12 11:38:28 +00:00
78b4793c96 [dynamo][compile-time] Caching VTs to reduce compile-time (#121031)
Reduces the `torch.compile(backend="eager")` compile time for this code

~~~
def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

From 18 seconds to 12 seconds.
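
A rough harness for reproducing the measurement, assuming the first (tracing) call dominates; absolute numbers will vary by machine:

```python
import time
import torch

def fn(x):
    for _ in range(10000):
        x = torch.ops.aten.sin(x)
    return x

compiled = torch.compile(fn, backend="eager")
start = time.time()
compiled(torch.randn(8))  # first call triggers Dynamo tracing
print(f"first call (includes compile): {time.time() - start:.1f}s")
```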

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031
Approved by: https://github.com/jansel
2024-03-12 09:19:50 +00:00
52ad2b682c Generate predispatch tests (#121678)
In this PR, we create another dynamic test class for TestExport tests that serializes/deserializes pre-dispatch IR. I encountered 4 additional failures, but 3 of them are due to a different operator showing up in the graph; the only legitimate failure is tracked by another task internally.
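
A hedged sketch of the serialize/deserialize round trip such tests perform, using the public `torch.export` API (the generated tests target the pre-dispatch IR variant, which may go through different entry points):

```python
import io
import torch
from torch.export import export, save, load

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin() + 1

ep = export(M(), (torch.randn(3),))   # exported program
buf = io.BytesIO()
save(ep, buf)                         # serialize
buf.seek(0)
ep2 = load(buf)                       # deserialize

x = torch.randn(3)
assert torch.allclose(ep.module()(x), ep2.module()(x))
```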

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121678
Approved by: https://github.com/angelayi
ghstack dependencies: #121652
2024-03-12 08:34:50 +00:00
656134c38f [ROCm] enable complex128 in test_addmm_sizes_all_sparse_csr for rocm for trivial (k,n,m) cases (#120504)
This PR enables `test_addmm_sizes_all_sparse_csr_k_*_n_*_m_*_cuda_complex128` for ROCm for trivial cases  (m or n or k = 0)

CUSPARSE_SPMM_COMPLEX128_SUPPORTED is also used for `test_addmm_all_sparse_csr` and `test_sparse_matmul`, both of which are skipped for ROCm by `@skipIfRocm` or `@skipCUDAIf(not _check_cusparse_spgemm_available())`
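
A hedged illustration of one such trivial shape (k = 0); this requires a ROCm/CUDA build, and the exact test parametrization may differ:

```python
import torch

m, n, k = 3, 2, 0
A = torch.zeros(m, k, dtype=torch.complex128, device="cuda").to_sparse_csr()
B = torch.zeros(k, n, dtype=torch.complex128, device="cuda")
C = torch.randn(m, n, dtype=torch.complex128, device="cuda")

out = torch.addmm(C, A, B)  # the k == 0 matmul term contributes nothing
assert torch.equal(out, C)
```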

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120504
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2024-03-12 07:29:57 +00:00
983 changed files with 31082 additions and 10574 deletions

View File

@ -1 +1 @@
e2a8f9548aecb62a68e264607174a7d207ed2929
7f96f5a852ba452670255d28d59f1e6398141fbb

View File

@ -1 +1 @@
a9bc1a36470eefafe0e2ab2503b8698f1e89e7e3
989adb9a29496c22a36ef82ca69cad5dad536b9c

View File

@ -57,8 +57,21 @@ fi
# Uncomment the below when resolved to track the latest conda update
# as_jenkins conda update -y -n base conda
if [[ $(uname -m) == "aarch64" ]]; then
export SYSROOT_DEP="sysroot_linux-aarch64=2.17"
else
export SYSROOT_DEP="sysroot_linux-64=2.17"
fi
# Install correct Python version
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y python="$ANACONDA_PYTHON_VERSION"
# Also ensure sysroot is using a modern GLIBC to match system compilers
as_jenkins conda create -n py_$ANACONDA_PYTHON_VERSION -y\
python="$ANACONDA_PYTHON_VERSION" \
${SYSROOT_DEP}
# libstdcxx from the conda default channels is too old; we need GLIBCXX_3.4.30
# which is provided in libstdcxx 12 and up.
conda_install libstdcxx-ng=12.3.0 -c conda-forge
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
@ -110,14 +123,5 @@ fi
pip_install -r /opt/conda/requirements-docs.txt
fi
# HACK HACK HACK
# gcc-9 for ubuntu-18.04 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu
# Pulls libstdc++6 13.1.0-8ubuntu1~18.04 which is too new for conda
# So remove libstdc++6.so.3.29 installed by https://anaconda.org/anaconda/libstdcxx-ng/files?version=11.2.0
# Same is true for gcc-12 from Ubuntu-22.04
if grep -e [12][82].04.[623] /etc/issue >/dev/null; then
rm /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/libstdc++.so.6
fi
popd
fi

View File

@ -33,12 +33,12 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.17.0
pip_install onnx==1.15.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240301 --no-deps
pip_install onnxscript==0.1.0.dev20240315 --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2");' > "${IMPORT_SCRIPT_FILENAME}"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

View File

@ -11,7 +11,8 @@ mkdir -p $pb_dir
ln -s /usr/lib64 "$pb_dir/lib64"
curl -LO "https://github.com/protocolbuffers/protobuf/releases/download/v3.17.3/protobuf-all-3.17.3.tar.gz" --retry 3
tar -xvz -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
tar -xvz --no-same-owner -C "$pb_dir" --strip-components 1 -f protobuf-all-3.17.3.tar.gz
NPROC=$[$(nproc) - 2]
pushd "$pb_dir" && ./configure && make -j${NPROC} && make -j${NPROC} check && sudo make -j${NRPOC} install && sudo ldconfig
popd

View File

@ -223,6 +223,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
sudo chown -R jenkins /var/lib/jenkins/workspace
git config --global --add safe.directory /var/lib/jenkins/workspace
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
@ -248,13 +252,17 @@ else
( ! get_exit_code python setup.py clean bad_argument )
if [[ "$BUILD_ENVIRONMENT" != *libtorch* ]]; then
# rocm builds fail when WERROR=1
# XLA test build fails when WERROR=1
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install a numpy-2.0 pre-release for builds,
# which should be backward compatible with numpy 1.x
python -mpip install --pre numpy==2.0.0b1
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py bdist_wheel
@ -355,3 +363,5 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
fi
print_sccache_stats
sudo chown -R "$WORKSPACE_ORIGINAL_OWNER_ID" /var/lib/jenkins/workspace

View File

@ -159,7 +159,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.5"
pip_install --user "tlparse==0.3.7"
PATH="$(python -m site --user-base)/bin:$PATH"
}

View File

@ -45,6 +45,7 @@ time python test/run_test.py --verbose -i distributed/test_device_mesh
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state.py
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx

View File

@ -299,6 +299,8 @@ test_inductor_distributed() {
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp
pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume
pytest test/distributed/_composable/fsdp/test_fully_shard_frozen.py
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype
pytest test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype

View File

@ -8,7 +8,7 @@ body:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the
existing and past issues](https://github.com/pytorch/pytorch/issues)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/master/dynamo/index.html)
It's likely that your bug will be resolved by checking our FAQ or troubleshooting guide [documentation](https://pytorch.org/docs/main/dynamo/index.html)
- type: textarea
attributes:
label: 🐛 Describe the bug
@ -33,7 +33,7 @@ body:
label: Minified repro
description: |
Please run the minifier on your example and paste the minified code below
Learn more here https://pytorch.org/docs/master/compile/troubleshooting.html
Learn more here https://pytorch.org/docs/main/torch.compiler_troubleshooting.html
placeholder: |
env TORCHDYNAMO_REPRO_AFTER="aot" python your_model.py
or

View File

@ -9,6 +9,10 @@ inputs:
use-gha:
description: If set to any value, use GHA to download the artifact. Otherwise use s3.
required: false
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -18,6 +22,7 @@ runs:
uses: seemethere/download-artifact-s3@v4
with:
name: ${{ inputs.name }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Download PyTorch Build Artifacts from GHA
if: inputs.use-gha
@ -29,6 +34,10 @@ runs:
shell: bash
run: unzip -o artifacts.zip
- name: Remove artifacts.zip
shell: bash
run: rm artifacts.zip
- name: Output disk space left
shell: bash
run: df -H

207
.github/actions/linux-build/action.yml vendored Normal file
View File

@ -0,0 +1,207 @@
name: linux-build
inputs:
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
default: "true"
description: If set, upload generated build artifacts.
build-with-debug:
required: false
default: "false"
description: If set, build in debug mode.
sync-tag:
required: false
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
default: "5.2"
description: |
List of CUDA architectures CI build should target.
runner:
required: false
default: "linux.2xlarge"
description: Runner label to select worker type.
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
HUGGING_FACE_HUB_TOKEN:
description: Hugging Face Hub token
required: false
default: ""
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ steps.filter.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
role-duration-seconds: 10800
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use the following to pull a public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
shell: bash
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

View File

@ -9,6 +9,10 @@ inputs:
job_identifier:
description: Text that uniquely identifies a given job type within a workflow. All shards of a job should share the same job identifier.
required: true
s3_bucket:
description: S3 bucket to upload/download PyTest cache
required: false
default: ""
runs:
using: composite
@ -30,6 +34,7 @@ runs:
CACHE_DIR: ${{ inputs.cache_dir }}
JOB_IDENTIFIER: ${{ inputs.job_identifier }}
REPO: ${{ github.repository }}
BUCKET: ${{ inputs.s3_bucket }}
run: |
python3 .github/scripts/pytest_cache.py \
--download \
@ -38,3 +43,4 @@ runs:
--job_identifier $JOB_IDENTIFIER \
--temp_dir $RUNNER_TEMP \
--repo $REPO \
--bucket $BUCKET \

View File

@ -26,8 +26,14 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Check if in an ARC runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT
- name: Start docker if the docker daemon is not running
shell: bash
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";

View File

@ -11,6 +11,10 @@ inputs:
Suffix to add to the filename of the artifacts. This should include the
workflow job id, see [Job id in artifacts].
required: true
s3-bucket:
description: S3 bucket to download builds
required: false
default: "gha-artifacts"
runs:
using: composite
@ -87,6 +91,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -97,6 +102,7 @@ runs:
uses: seemethere/upload-artifact-s3@v5
if: ${{ !inputs.use-gha }}
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14
@ -108,6 +114,7 @@ runs:
if: ${{ !inputs.use-gha }}
continue-on-error: true
with:
s3-bucket: ${{ inputs.s3-bucket }}
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 14

View File

@ -1 +1 @@
87aeb554d3e2f7855b7abe5120c282f59648ed7a
17a70815259222570feb071034acd7bae2adc019

View File

@ -1 +1 @@
2c127da8b5e2e8f44b50994c6cb931bcca267cfe
a0c79b399b75368208464b2c638708165cca7ef1

View File

@ -21,3 +21,4 @@ retryable_workflows:
- trunk
- linux-binary
- windows-binary
labeler_config: labeler.yml

View File

@ -27,3 +27,6 @@ rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.9.1
# NB: test_hparams_* from test_tensorboard fails with protobuf 5.26.0, in
# which the stringified metadata is wrong when escaping double quotes
protobuf==3.20.2

View File

@ -99,7 +99,14 @@ def build_triton(
triton_repo = "https://github.com/openai/triton"
triton_pkg_name = "pytorch-triton"
check_call(["git", "clone", triton_repo], cwd=tmpdir)
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
if release:
ver, rev, patch = version.split(".")
check_call(
["git", "checkout", f"release/{ver}.{rev}.x"], cwd=triton_basedir
)
else:
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
if build_conda:
with open(triton_basedir / "meta.yaml", "w") as meta:
print(

View File

@ -1,6 +1,7 @@
#!/usr/bin/env python3
import json
import logging
import os
import re
import subprocess
@ -8,6 +9,7 @@ import sys
import warnings
from enum import Enum
from functools import lru_cache
from logging import info
from typing import Any, Callable, Dict, List, Optional, Set
from urllib.request import Request, urlopen
@ -17,33 +19,7 @@ REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://gi
PREFIX = "test-config/"
# Same as shard names
VALID_TEST_CONFIG_LABELS = {
f"{PREFIX}{label}"
for label in {
"backwards_compat",
"crossref",
"default",
"deploy",
"distributed",
"docs_tests",
"dynamo",
"force_on_cpu",
"functorch",
"inductor",
"inductor_distributed",
"inductor_huggingface",
"inductor_timm",
"inductor_torchbench",
"jit_legacy",
"multigpu",
"nogpu_AVX512",
"nogpu_NO_AVX2",
"slow",
"tsan",
"xla",
}
}
logging.basicConfig(level=logging.INFO)
def is_cuda_or_rocm_job(job_name: Optional[str]) -> bool:
@ -155,19 +131,25 @@ def get_labels(pr_number: int) -> Set[str]:
}
def filter_labels(labels: Set[str], label_regex: Any) -> Set[str]:
"""
Return the labels that match the given regex
"""
return {l for l in labels if re.match(label_regex, l)}
def filter(test_matrix: Dict[str, List[Any]], labels: Set[str]) -> Dict[str, List[Any]]:
"""
Select the list of test config to run from the test matrix. The logic works
as follows:
If the PR has one or more labels as specified in the VALID_TEST_CONFIG_LABELS set, only
these test configs will be selected. This also works with ciflow labels, for example,
if a PR has both ciflow/trunk and test-config/functorch, only trunk functorch builds
and tests will be run
If the PR has one or more test-config labels as specified, only these test configs
will be selected. This also works with ciflow labels, for example, if a PR has both
ciflow/trunk and test-config/functorch, only trunk functorch builds and tests will
be run.
If the PR has none of the test-config label, all tests are run as usual.
"""
filtered_test_matrix: Dict[str, List[Any]] = {"include": []}
for entry in test_matrix.get("include", []):
@ -177,18 +159,19 @@ def filter(test_matrix: Dict[str, List[Any]], labels: Set[str]) -> Dict[str, Lis
label = f"{PREFIX}{config_name.strip()}"
if label in labels:
print(
f"Select {config_name} because label {label} is presented in the pull request by the time the test starts"
)
msg = f"Select {config_name} because label {label} is present in the pull request by the time the test starts"
info(msg)
filtered_test_matrix["include"].append(entry)
valid_test_config_labels = labels.intersection(VALID_TEST_CONFIG_LABELS)
if not filtered_test_matrix["include"] and not valid_test_config_labels:
# Found no valid label and the filtered test matrix is empty, return the same
test_config_labels = filter_labels(labels, re.compile(f"{PREFIX}.+"))
if not filtered_test_matrix["include"] and not test_config_labels:
info("Found no test-config label on the PR, so all test configs are included")
# Found no test-config label and the filtered test matrix is empty, return the same
# test matrix as before so that all tests can be run normally
return test_matrix
else:
msg = f"Found {test_config_labels} on the PR so only these test configs are run"
info(msg)
# When the filtered test matrix contains matches or if a valid test-config label
# is found in the PR, return the filtered test matrix
return filtered_test_matrix
@ -374,30 +357,33 @@ def process_jobs(
# - If the target record has the job (config) name, only that test config
# will be skipped or marked as unstable
if not target_job_cfg:
print(
msg = (
f"Issue {target_url} created by {author} has {issue_type.value} "
+ f"all CI jobs for {workflow} / {job_name}"
)
info(msg)
return _filter_jobs(
test_matrix=test_matrix,
issue_type=issue_type,
)
if target_job_cfg == BUILD_JOB_NAME:
print(
msg = (
f"Issue {target_url} created by {author} has {issue_type.value} "
+ f"the build job for {workflow} / {job_name}"
)
info(msg)
return _filter_jobs(
test_matrix=test_matrix,
issue_type=issue_type,
)
if target_job_cfg in (TEST_JOB_NAME, BUILD_AND_TEST_JOB_NAME):
print(
msg = (
f"Issue {target_url} created by {author} has {issue_type.value} "
+ f"all the test jobs for {workflow} / {job_name}"
)
info(msg)
return _filter_jobs(
test_matrix=test_matrix,
issue_type=issue_type,
@ -497,7 +483,7 @@ def perform_misc_tasks(
# Obviously, if the job name includes unstable, then this is an unstable job
is_unstable = job_name and IssueType.UNSTABLE.value in job_name
if not is_unstable and test_matrix:
if not is_unstable and test_matrix and test_matrix.get("include"):
# Even when the job name doesn't mention unstable, we will also mark it as
# unstable when the test matrix only includes unstable jobs. Basically, this
# logic allows build or build-and-test jobs to be marked as unstable too.

20
.github/scripts/td_llm_indexer.sh vendored Normal file
View File

@ -0,0 +1,20 @@
#!/bin/bash
set -euxo pipefail
# Download requirements
cd llm-target-determinator
pip install -q -r requirements.txt
cd ../codellama
pip install -e .
# Run indexer
cd ../llm-target-determinator
torchrun \
--standalone \
--nnodes=1 \
--nproc-per-node=1 \
indexer.py \
--experiment-name indexer-files \
--granularity FILE

View File

@ -17,7 +17,6 @@ from filter_test_configs import (
remove_disabled_jobs,
set_periodic_modes,
SUPPORTED_PERIODICAL_MODES,
VALID_TEST_CONFIG_LABELS,
)
@ -273,13 +272,13 @@ class TestConfigFilter(TestCase):
testcases = [
{
"test_matrix": '{include: [{config: "default", runner: "linux"}]}',
"expected": '{"include": [{"config": "default", "runner": "linux"}]}',
"description": "No match, keep the same test matrix",
"expected": '{"include": []}',
"description": "Request test-config/cfg but the test matrix doesn't have it",
},
{
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "plain-cfg"}]}',
"expected": '{"include": [{"config": "default", "runner": "linux"}, {"config": "plain-cfg"}]}',
"description": "No match because there is no prefix or suffix, keep the same test matrix",
"expected": '{"include": []}',
"description": "A valid test config label needs to start with test-config/",
},
{
"test_matrix": '{include: [{config: "default", runner: "linux"}, {config: "cfg", shard: 1}]}',
@ -294,9 +293,8 @@ class TestConfigFilter(TestCase):
)
self.assertEqual(case["expected"], json.dumps(filtered_test_matrix))
def test_filter_with_valid_label(self) -> None:
def test_filter_with_test_config_label(self) -> None:
mocked_labels = {f"{PREFIX}cfg", "ciflow/trunk"}
VALID_TEST_CONFIG_LABELS.add(f"{PREFIX}cfg")
testcases = [
{

View File

@ -205,7 +205,6 @@ def mocked_read_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule
approved_by=["pytorch/metamates", "ngimel"],
mandatory_checks_name=[
"Lint",
"Facebook CLA Check",
"pull / linux-xenial-cuda11.3-py3.7-gcc7 / build",
],
ignore_flaky_failures=True,

View File

@ -1398,7 +1398,10 @@ def find_matching_merge_rule(
)
required_checks = list(
filter(
lambda x: "EasyCLA" in x or not skip_mandatory_checks, mandatory_checks
lambda x: ("EasyCLA" in x)
or ("Facebook CLA Check" in x)
or not skip_mandatory_checks,
mandatory_checks,
)
)
pending_checks, failed_checks, _ = categorize_checks(
@ -1409,6 +1412,13 @@ def find_matching_merge_rule(
else 0,
)
# categorize_checks assumes all tests are required if required_checks is empty.
# this is a workaround as we want to keep that behavior for categorize_checks
# generally.
if not required_checks:
pending_checks = []
failed_checks = []
hud_link = f"https://hud.pytorch.org/{pr.org}/{pr.project}/commit/{pr.last_commit()['oid']}"
if len(failed_checks) > 0:
if reject_reason_score < 30000:

View File

@ -28,7 +28,21 @@ on:
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
upload-aws-role-to-assume:
description: role to assume for uploading artifacts
required: false
type: string
default: ""
secrets:
GH_PYTORCHBOT_TOKEN:
required: false
@ -82,6 +96,14 @@ jobs:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
if: ${{ inputs.aws-role-to-assume != '' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -97,6 +119,7 @@ jobs:
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Generate netrc (only for docs-push)
if: inputs.push
@ -156,6 +179,14 @@ jobs:
uses: ./.github/actions/chown-workspace
if: always()
- name: configure aws credentials
if: ${{ inputs.upload-aws-role-to-assume != '' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: ${{ inputs.upload-aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Upload Python Docs Preview
uses: seemethere/upload-artifact-s3@v5
if: ${{ github.event_name == 'pull_request' && matrix.docs_type == 'python' && steps.build-docs.outcome == 'success' }}

109
.github/workflows/_linux-build-label.yml vendored Normal file
View File

@ -0,0 +1,109 @@
name: linux-build
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
type: string
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
type: string
default: "5.2"
description: |
List of CUDA architectures CI build should target.
runner:
required: false
type: string
default: "linux.2xlarge"
description: Runner label to select worker type.
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ jobs.build.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
jobs:
build:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: ${{ inputs.runner }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.linux-build.outputs.docker-image }}
test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Build
id: linux-build
uses: ./.github/actions/linux-build
with:
build-environment: ${{ inputs.build-environment }}
docker-image-name: ${{ inputs.docker-image-name }}
build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
build-with-debug: ${{ inputs.build-with-debug }}
sync-tag: ${{ inputs.sync-tag }}
cuda-arch-list: ${{ inputs.cuda-arch-list }}
test-matrix: ${{ inputs.test-matrix }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

105
.github/workflows/_linux-build-rg.yml vendored Normal file
View File

@ -0,0 +1,105 @@
name: linux-build-rg
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
type: string
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
type: string
default: "5.2"
description: |
List of CUDA architectures CI build should target.
runner-group:
required: false
type: string
default: "arc-lf-linux.2xlarge"
description: Runner group to select group type
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ jobs.build.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
jobs:
build:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on:
group: ${{ inputs.runner-group }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.linux-build.outputs.docker-image }}
test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
steps:
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Build
id: linux-build
uses: ./.github/actions/linux-build
with:
build-environment: ${{ inputs.build-environment }}
docker-image-name: ${{ inputs.docker-image-name }}
build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
build-with-debug: ${{ inputs.build-with-debug }}
sync-tag: ${{ inputs.sync-tag }}
cuda-arch-list: ${{ inputs.cuda-arch-list }}
test-matrix: ${{ inputs.test-matrix }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@ -47,6 +47,16 @@ on:
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
@ -87,6 +97,14 @@ jobs:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -133,6 +151,7 @@ jobs:
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
@ -197,6 +216,7 @@ jobs:
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
@ -207,6 +227,7 @@ jobs:
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main

View File

@ -37,6 +37,16 @@ on:
required: false
type: string
default: ""
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
@ -71,6 +81,14 @@ jobs:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
if: ${{ inputs.aws-role-to-assume != '' }}
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -116,6 +134,7 @@ jobs:
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Download TD artifacts
continue-on-error: true
@ -290,6 +309,7 @@ jobs:
with:
file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
use-gha: ${{ inputs.use-gha }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Collect backtraces from coredumps (if any)
if: always()

View File

@ -119,8 +119,7 @@ jobs:
- uses: actions/upload-artifact@v3
with:
# NB: Use the same name here and all wheels can be downloaded by referring to the same artifact
name: pytorch-triton-wheel
name: pytorch-triton-wheel-${{ matrix.py_vers }}-${{ matrix.device }}
if-no-files-found: error
path: ${{ runner.temp }}/artifacts/*
@ -157,8 +156,15 @@ jobs:
- name: Download Build Artifacts
uses: actions/download-artifact@v3
with:
name: pytorch-triton-wheel
path: ${{ runner.temp }}/artifacts/
# Download all available artifacts
path: ${{ runner.temp }}/artifacts-all
- name: Select Wheel Artifacts
shell: bash
run: |
set -x
mkdir -p "${RUNNER_TEMP}/artifacts/"
mv "${RUNNER_TEMP}"/artifacts-all/pytorch-triton-wheel-*/* "${RUNNER_TEMP}/artifacts/"
- name: Set DRY_RUN (only for tagged pushes)
if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) }}
@ -246,8 +252,7 @@ jobs:
- uses: actions/upload-artifact@v3
with:
# NB: Use the same name here and all wheels can be downloaded by referring to the same artifact
name: pytorch-triton-conda
name: pytorch-triton-conda-${{ matrix.py_vers }}
if-no-files-found: error
path: ${{ runner.temp }}/artifacts/*
@ -267,8 +272,15 @@ jobs:
- name: Download Build Artifacts
uses: actions/download-artifact@v3
with:
name: pytorch-triton-conda
path: ${{ runner.temp }}/artifacts/
# Download all available artifacts
path: ${{ runner.temp }}/artifacts-all
- name: Select Conda Artifacts
shell: bash
run: |
set -x
mkdir -p "${RUNNER_TEMP}/artifacts/"
mv "${RUNNER_TEMP}"/artifacts-all/pytorch-triton-conda-*/* "${RUNNER_TEMP}/artifacts/"
- name: Set DRY_RUN (only for tagged pushes)
if: ${{ github.event_name == 'push' && (github.event.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) }}

View File

@ -20,12 +20,12 @@ on:
description: Run inductor_default?
required: false
type: boolean
default: true
default: false
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: true
default: false
cudagraphs:
description: Run inductor_cudagraphs?
required: false

View File

@ -111,7 +111,7 @@ jobs:
name: linux-jammy-cpu-py3.8-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3_8-gcc11-build
build-environment: linux-jammy-py3.8-gcc11-build
docker-image-name: pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
@ -135,7 +135,7 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_8-gcc11-inductor-build
with:
build-environment: linux-jammy-py3_8-gcc11-build
build-environment: linux-jammy-py3.8-gcc11-build
docker-image: ${{ needs.linux-jammy-cpu-py3_8-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_8-gcc11-inductor-build.outputs.test-matrix }}
secrets:

View File

@ -1,37 +0,0 @@
name: Labeler
on:
# We need pull_request_target to be able to add labels to PRs from forks.
# Only allow pull_request_target when targeting main, not some historical branch.
#
# Make sure not to introduce explicit checkout or installation/running of
# untrusted user code into this workflow!
pull_request_target:
types: [opened, synchronize, reopened]
branches: [main]
# To add labels on ghstack PRs.
# Note: as pull_request doesn't trigger on PRs targeting main,
# to test changes to the workflow itself one needs to create
# a PR that targets a gh/**/base branch.
pull_request:
types: [opened, synchronize, reopened]
branches: [gh/**/base]
jobs:
triage:
permissions:
contents: read
pull-requests: write
runs-on: ubuntu-latest
# Do not auto-label nightly builds PR
if: ${{ github.event.pull_request.number != 26921 && github.repository_owner == 'pytorch' }}
steps:
- uses: actions/labeler@v4
with:
repo-token: "${{ secrets.GITHUB_TOKEN }}"
sync-labels: ''
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true

120
.github/workflows/llm_td_retrieval.yml vendored Normal file
View File

@ -0,0 +1,120 @@
name: Retrieval PyTorch Tests for Target Determination
on:
workflow_call:
permissions:
id-token: write
contents: read
jobs:
llm-retrieval:
runs-on: linux.4xlarge
continue-on-error: true
steps:
- name: Clone PyTorch
uses: actions/checkout@v3
with:
repository: pytorch/pytorch
fetch-depth: 0
path: pytorch
- name: Setup Linux
uses: ./pytorch/.github/actions/setup-linux
- name: Clone CodeLlama
uses: actions/checkout@v3
with:
repository: osalpekar/codellama
ref: main
path: codellama
- name: Clone Target Determination Code
uses: actions/checkout@v3
with:
repository: osalpekar/llm-target-determinator
ref: v0.0.2
path: llm-target-determinator
- name: Setup Conda
uses: conda-incubator/setup-miniconda@v2.1.1
with:
miniconda-version: "py39_4.12.0"
python-version: 3.9
- name: Install Requirements
shell: bash -l {0}
run: |
set -euxo pipefail
conda create \
--yes \
--quiet \
--name "tdenv" \
"python=3.9"
conda activate tdenv
cd "${GITHUB_WORKSPACE}/llm-target-determinator"
pip install -r requirements.txt
cd ../codellama
pip install -e .
- name: Fetch CodeLlama Checkpoint
shell: bash -l {0}
run: |
set -euxo pipefail
conda activate tdenv
cd codellama/
mkdir "CodeLlama-7b-Python"
aws s3 cp "s3://target-determinator-assets/CodeLlama-7b-Python" "CodeLlama-7b-Python" --recursive --no-progress
- name: Fetch indexes
uses: nick-fields/retry@v2.8.2
with:
max_attempts: 3
retry_wait_seconds: 10
timeout_minutes: 5
shell: bash
command: |
set -euxo pipefail
python3 -m pip install awscli==1.29.40
cd "${GITHUB_WORKSPACE}"/llm-target-determinator/assets
aws s3 cp "s3://target-determinator-assets/indexes/latest" . --recursive
unzip -o indexer-files\*.zip
rm indexer-files*.zip
- name: Run Retriever
id: run_retriever
continue-on-error: true # ghstack not currently supported due to problems getting git diff
shell: bash -l {0}
run: |
set -euxo pipefail
conda activate tdenv
cd "${GITHUB_WORKSPACE}"/llm-target-determinator
torchrun \
--standalone \
--nnodes=1 \
--nproc-per-node=1 \
retriever.py \
--experiment-name indexer-files \
--pr-parse-format GITDIFF
cd assets
zip -r mappings.zip mappings
- name: Upload results to s3
uses: seemethere/upload-artifact-s3@v5
if: ${{ steps.run_retriever.outcome == 'success' }}
with:
name: llm_results
retention-days: 14
if-no-files-found: warn
path: llm-target-determinator/assets/mappings.zip
env:
AWS_ACCESS_KEY_ID: ""
AWS_SECRET_ACCESS_KEY: ""
AWS_SESSION_TOKEN: ""
AWS_DEFAULT_REGION: ""
AWS_REGION: ""
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

View File

@ -23,9 +23,17 @@ concurrency:
permissions: read-all
jobs:
llm-td:
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read

View File

@ -20,9 +20,17 @@ concurrency:
permissions: read-all
jobs:
llm-td:
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read
@ -311,7 +319,7 @@ jobs:
name: linux-focal-py3_8-clang9-xla
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-py3_8-clang9-xla
build-environment: linux-focal-py3.8-clang9-xla
docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.1-lite
test-matrix: |
{ include: [
@ -323,7 +331,7 @@ jobs:
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-py3_8-clang9-xla-build
with:
build-environment: linux-focal-py3_8-clang9-xla
build-environment: linux-focal-py3.8-clang9-xla
docker-image: ${{ needs.linux-focal-py3_8-clang9-xla-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-py3_8-clang9-xla-build.outputs.test-matrix }}

View File

@ -21,9 +21,17 @@ concurrency:
permissions: read-all
jobs:
llm-td:
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read

View File

@ -2,7 +2,8 @@ name: Index PyTorch Tests for Target Determination
on:
workflow_dispatch:
# TODO: Trigger every few hours
schedule:
- cron: '0 0 * * *'
permissions:
id-token: write
@ -13,14 +14,20 @@ jobs:
runs-on: linux.g5.4xlarge.nvidia.gpu # 1 GPU A10G 24GB each
environment: target-determinator-env
steps:
- name: Clone PyTorch
uses: actions/checkout@v3
with:
path: pytorch
- name: Setup Linux
uses: ./.github/actions/setup-linux
uses: ./pytorch/.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9
working-directory: pytorch
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
@ -40,112 +47,97 @@ jobs:
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
- name: Clone PyTorch
uses: actions/checkout@v3
with:
path: pytorch
- name: Clone CodeLlama
uses: actions/checkout@v3
with:
repository: osalpekar/codellama
ref: main
ref: 1ec50e0cfc0fadc3b6ceb146617e2119ab26eb34
path: codellama
- name: Clone Target Determination Code
uses: actions/checkout@v3
with:
repository: osalpekar/llm-target-determinator
ref: v0.0.1
ref: v0.0.2
path: llm-target-determinator
- name: Install Requirements
shell: bash -l {0}
run: |
set -euxo pipefail
conda create \
--yes \
--quiet \
--name "tdenv" \
"python=3.9"
conda activate tdenv
cd "${GITHUB_WORKSPACE}"
pwd
cd llm-target-determinator
pip install -r requirements.txt
cd ../codellama
pip install -e .
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v3
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_target_determinator_s3_read_write
aws-region: us-east-1
- name: Fetch CodeLlama Checkpoint
shell: bash -l {0}
- name: Download checkpoint
shell: bash
env:
AWS_DEFAULT_REGION: us-east-1
run: |
set -euxo pipefail
conda activate tdenv
pip install awscli==1.32.18
cd codellama/
# Do this outside of docker so I don't have to put env vars in
pip3 install awscli==1.29.40
cd codellama
mkdir "CodeLlama-7b-Python"
aws s3 cp \
"s3://target-determinator-assets/CodeLlama-7b-Python" \
"CodeLlama-7b-Python" \
--recursive
- name: Run Indexer
id: indexer
- name: Run indexer
shell: bash -l {0}
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
GITHUB_RUN_ID: ${{ github.run_id }}
AWS_DEFAULT_REGION: us-east-1
run: |
set -euxo pipefail
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
${GPU_FLAG:-} \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
chmod +x pytorch/.github/scripts/td_llm_indexer.sh
docker exec -t "${container_name}" sh -c 'pytorch/.github/scripts/td_llm_indexer.sh'
conda activate tdenv
cd "${GITHUB_WORKSPACE}"/llm-target-determinator
python create_filelist.py
torchrun \
--standalone \
--nnodes=1 \
--nproc-per-node=1 \
indexer.py \
--experiment-name indexer-files
- name: Upload Index to S3
- name: Upload to s3
shell: bash -l {0}
if: ${{ steps.indexer.outcome == 'success' }}
env:
AWS_DEFAULT_REGION: us-east-1
run: |
set -euxo pipefail
conda activate tdenv
cd "${GITHUB_WORKSPACE}"/llm-target-determinator/assets
cd llm-target-determinator/assets
TIMESTAMP=$(date -Iseconds)
ZIP_NAME = "indexer-files-${TIMESTAMP}.zip"
ZIP_NAME="indexer-files-${TIMESTAMP}.zip"
# Create a zipfile with all the generated indices
zip -r "${ZIP_NAME}" indexer-files
# Note that because the below 2 operations are not atomic, there will
# be a period of a few seconds between these where there is no index
# present in the latest/ folder. To account for this, the retriever
# should have some retry logic with backoff to ensure fetching the
# index doesn't fail.
# Move the old index into the archived/ folder
aws s3 cp \
"s3://target-determinator-assets/indexes/latest/*" \
"s3://target-determinator-assets/indexes/archived/"
aws s3 mv \
"s3://target-determinator-assets/indexes/latest" \
"s3://target-determinator-assets/indexes/archived" \
--recursive
# Move the new index into the latest/ folder
aws s3 cp \
"${ZIP_NAME}" \
"s3://target-determinator-assets/indexes/latest/${ZIP_NAME}"
# Note that because the above 2 operations are not atomic, there will
# be a period of a few seconds between these where there is no index
# present in the latest/ folder. To account for this, the retriever
# should have some retry logic with backoff to ensure fetching the
# index doesn't fail.
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}

View File

@ -35,6 +35,13 @@ jobs:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}
- name: Download LLM Artifacts from S3
uses: seemethere/download-artifact-s3@v4
continue-on-error: true
with:
name: llm_results
path: .additional_ci_files/llm_results
- name: Do TD
id: td
continue-on-error: true
@ -50,6 +57,7 @@ jobs:
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
PR_NUMBER: ${{ github.event.pull_request.number }}
run: |
unzip -o .additional_ci_files/llm_results/mappings.zip -d .additional_ci_files/llm_results || true
python3 -m pip install boto3==1.19.12
python3 tools/testing/do_target_determination_for_s3.py

View File

@ -19,9 +19,17 @@ concurrency:
permissions: read-all
jobs:
llm-td:
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read

View File

@ -25,7 +25,7 @@ jobs:
with:
repo-name: xla
branch: master
pin-folder: .ci/docker/ci_commit_pins
pin-folder: .github/ci_commit_pins
test-infra-ref: main
updatebot-token: ${{ secrets.UPDATEBOT_TOKEN }}
pytorchbot-token: ${{ secrets.GH_PYTORCHBOT_TOKEN }}

View File

@ -147,7 +147,7 @@ init_command = [
'filelock==3.13.1',
'junitparser==2.1.1',
'rich==10.9.0',
'pyyaml==6.0',
'pyyaml==6.0.1',
'optree==0.10.0',
]
@ -186,11 +186,12 @@ command = [
[[linter]]
code = 'CLANGTIDY'
include_patterns = [
'aten/src/ATen/core/*.cpp',
# Enable coverage of headers in aten/src/ATen
# and exclude most sub-directories for now.
'aten/src/ATen/*.h',
'aten/src/ATen/*.cpp',
'aten/src/ATen/core/*.h',
'aten/src/ATen/core/*.cpp',
'c10/**/*.cpp',
'c10/**/*.h',
'torch/csrc/*.h',
@ -204,9 +205,7 @@ exclude_patterns = [
# CUDA files are also excluded.
'**/fb/**',
'**/*pb.h',
'c10/**/cuda/*pp',
'aten/**/cuda/*pp',
'**/cuda/*pp',
'c10/xpu/**/*.h',
'c10/xpu/**/*.cpp',
'c10/cuda/CUDAAlgorithm.h',
@ -225,8 +224,8 @@ exclude_patterns = [
'third_party/**/*',
'torch/csrc/api/**',
'torch/csrc/autograd/generated/**',
'torch/csrc/dynamo/*',
'torch/csrc/distributed/**/*',
'torch/csrc/dynamo/eval_frame.h',
'torch/csrc/inductor/**/*',
'torch/csrc/jit/**/*',
'torch/csrc/jit/serialization/import_legacy.cpp',
@ -979,7 +978,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'PyYAML==6.0',
'PyYAML==6.0.1',
]
# Black + usort

View File

@ -19,6 +19,12 @@ cmake_policy(SET CMP0069 NEW)
# nice when it's possible, and it's possible on our Windows configs.
cmake_policy(SET CMP0092 NEW)
# Prohibit in-source builds
if(${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_BINARY_DIR})
message(FATAL_ERROR "In-source builds are not supported")
endif()
# ---[ Project and semantic versioning.
project(Torch CXX C)
@ -736,28 +742,13 @@ if(MSVC)
append_cxx_flag_if_supported("/utf-8" CMAKE_CXX_FLAGS)
endif()
# Note for ROCM platform:
# 1. USE_ROCM is always ON until include(cmake/Dependencies.cmake)
# 2. USE_CUDA will become OFF during re-configuration
# Truth Table:
# CUDA 1st pass: USE_CUDA=True;USE_ROCM=True, FLASH evaluates to ON by default
# CUDA 2nd pass: USE_CUDA=True;USE_ROCM=False, FLASH evaluates to ON by default
# ROCM 1st pass: USE_CUDA=True;USE_ROCM=True, FLASH evaluates to ON by default
# ROCM 2nd pass: USE_CUDA=False;USE_ROCM=True, FLASH evaluates to ON by default
# CPU 1st pass: USE_CUDA=False(Cmd Option);USE_ROCM=True, FLASH evaluates to OFF by default
# CPU 2nd pass: USE_CUDA=False(Cmd Option);USE_ROCM=False, FLASH evaluates to OFF by default
# Thus we cannot tell ROCM 2nd pass and CPU 1st pass
#
# The only solution is to include(cmake/Dependencies.cmake), and defer the
# aotriton build decision later.
include(cmake/Dependencies.cmake)
# CAVEAT: do NOT check USE_ROCM here, because USE_ROCM is always True until
# include(cmake/Dependencies.cmake)
cmake_dependent_option(
USE_FLASH_ATTENTION
"Whether to build the flash_attention kernel for scaled dot product attention.\
Will be disabled if not supported by the platform" ON
"USE_CUDA OR USE_ROCM;NOT MSVC" OFF)
"USE_CUDA AND NOT MSVC" OFF)
# We are currently not using alibi attention for Flash
# So we disable this feature by default
@ -773,6 +764,8 @@ cmake_dependent_option(
Will be disabled if not supported by the platform" ON
"USE_CUDA" OFF)
include(cmake/Dependencies.cmake)
if(DEBUG_CUDA)
string(APPEND CMAKE_CUDA_FLAGS_DEBUG " -lineinfo")
string(APPEND CMAKE_CUDA_FLAGS_RELWITHDEBINFO " -lineinfo")

View File

@ -7,8 +7,8 @@
#
# For reference:
# https://docs.docker.com/develop/develop-images/build_enhancements/
ARG BASE_IMAGE=ubuntu:20.04
ARG PYTHON_VERSION=3.8
ARG BASE_IMAGE=ubuntu:22.04
ARG PYTHON_VERSION=3.11
FROM ${BASE_IMAGE} as dev-base
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
@ -26,7 +26,7 @@ RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
ENV PATH /opt/conda/bin:$PATH
FROM dev-base as conda
ARG PYTHON_VERSION=3.8
ARG PYTHON_VERSION=3.11
# Automatically set by buildx
ARG TARGETPLATFORM
# translating Docker's TARGETPLATFORM into miniconda arches
@ -57,12 +57,12 @@ COPY --from=submodule-update /opt/pytorch /opt/pytorch
RUN make triton
RUN --mount=type=cache,target=/opt/ccache \
export eval ${CMAKE_VARS} && \
TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX 8.0" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
TORCH_CUDA_ARCH_LIST="7.0 7.2 7.5 8.0 8.6 8.7 8.9 9.0 9.0a" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
python setup.py install
FROM conda as conda-installs
ARG PYTHON_VERSION=3.8
ARG PYTHON_VERSION=3.11
ARG CUDA_VERSION=12.1
ARG CUDA_CHANNEL=nvidia
ARG INSTALL_CHANNEL=pytorch-nightly

View File

@ -13,6 +13,10 @@ namespace at {
TORCH_API ScalarType toScalarType(const DLDataType& dtype);
TORCH_API DLManagedTensor* toDLPack(const Tensor& src);
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
C10_DEPRECATED_MESSAGE("Please migrate to a non-const variant")
inline Tensor fromDLPack(const DLManagedTensor* src) {
return fromDLPack(const_cast<DLManagedTensor*>(src));
}
TORCH_API Tensor
fromDLPack(DLManagedTensor* src, std::function<void(void*)> deleter);
TORCH_API DLDataType getDLDataType(const Tensor& t);
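
A minimal migration sketch for the deprecation above (`dlpack_roundtrip` is a hypothetical call site): the const overload now just const_casts through to the mutable variant, so new code should hold a mutable `DLManagedTensor*` from the start.
```
#include <ATen/ATen.h>
#include <ATen/DLConvertor.h>

// Hypothetical round-trip helper: export to DLPack, then re-import via the
// preferred non-const overload instead of the deprecated const one.
at::Tensor dlpack_roundtrip(const at::Tensor& src) {
  DLManagedTensor* managed = at::toDLPack(src);  // mutable handle from the start
  return at::fromDLPack(managed);                // non-deprecated overload
}
```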


@ -198,6 +198,15 @@ TensorBase empty_generic(
return _empty_generic(size, allocator, ks, scalar_type, memory_format_opt);
}
TensorBase empty_generic_symint(
SymIntArrayRef size,
c10::Allocator* allocator,
c10::DispatchKeySet ks,
ScalarType scalar_type,
c10::optional<c10::MemoryFormat> memory_format_opt) {
return _empty_generic(size, allocator, ks, scalar_type, memory_format_opt);
}
template <typename T>
TensorBase _empty_strided_generic(
T size,


@ -51,6 +51,13 @@ TORCH_API TensorBase empty_generic(
ScalarType scalar_type,
c10::optional<c10::MemoryFormat> memory_format_opt);
TORCH_API TensorBase empty_generic_symint(
SymIntArrayRef size,
c10::Allocator* allocator,
c10::DispatchKeySet ks,
ScalarType scalar_type,
c10::optional<c10::MemoryFormat> memory_format_opt);
TORCH_API TensorBase empty_strided_generic(
IntArrayRef size,
IntArrayRef stride,


@ -174,8 +174,8 @@ Tensor FunctionalInverses::expand_inverse(const Tensor& base, const Tensor& muta
return mutated_view.as_strided_symint(
base.sym_sizes(), base.sym_strides(), base.sym_storage_offset());
} else {
return at::sum_to(
mutated_view,
return base + at::sum_to(
mutated_view - base,
base.sym_sizes(),
/*always_return_non_view=*/inverse_return_mode == InverseReturnMode::NeverView
);
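
To see why the delta form matters, here is a minimal sketch of the new inverse (`expand_inverse_sketch` and the shapes are illustrative, not the actual functionalization code): reducing only `mutated_view - base` back to base's shape means elements the user never wrote to contribute zero, instead of being summed once per broadcast copy.
```
#include <ATen/ATen.h>
#include <ATen/ExpandUtils.h>

// Sketch of the corrected inverse for expand: only the change relative to
// base is sum-reduced, then base is added back. With base of shape {1}
// expanded to {3} and a single element incremented, the old form
// sum_to(mutated, {1}) would return 3*b + 1, while this returns b + 1.
at::Tensor expand_inverse_sketch(const at::Tensor& base,
                                 const at::Tensor& mutated_view) {
  return base + at::sum_to(mutated_view - base, base.sizes());
}
```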
@ -303,6 +303,29 @@ Tensor FunctionalInverses::_nested_view_from_buffer_inverse(const Tensor& base,
return Tensor();
}
Tensor FunctionalInverses::_nested_view_from_jagged_inverse(const Tensor& base, const Tensor& mutated_view, InverseReturnMode inverse_return_mode, const Tensor& offsets, const Tensor& dummy, const std::optional<Tensor>& lengths, int64_t ragged_idx) {
auto values = at::_nested_get_values(mutated_view);
if (inverse_return_mode != InverseReturnMode::NeverView) {
return values;
} else {
return values.clone(/*memory_format=*/at::MemoryFormat::Contiguous);
}
}
Tensor FunctionalInverses::_nested_get_values_inverse(const Tensor& base, const Tensor& mutated_view, InverseReturnMode inverse_return_mode) {
auto offsets = at::_nested_get_offsets(base);
auto lengths = at::_nested_get_lengths(base);
auto ragged_idx = at::_nested_get_ragged_idx(base);
auto dummy = at::_nested_get_jagged_dummy(base);
auto nt = at::_nested_view_from_jagged(mutated_view, offsets, dummy, lengths, ragged_idx);
if (inverse_return_mode != InverseReturnMode::NeverView) {
return nt;
} else {
return nt.clone(/*memory_format=*/at::MemoryFormat::Contiguous);
}
}
Tensor FunctionalInverses::unsqueeze_inverse(const Tensor& base, const Tensor& mutated_view, InverseReturnMode inverse_return_mode, int64_t dim) {
if (inverse_return_mode != InverseReturnMode::NeverView) {
return at::squeeze(mutated_view, dim);


@ -101,7 +101,7 @@ inline std::vector<int64_t> construct_opt_sizes(const at::Tensor& sizes) {
}
// assume contiguous, we can construct stride from size
inline at::Tensor construct_nested_strides(const at::Tensor& sizes) {
at::Tensor construct_nested_strides(const at::Tensor& sizes) {
// empty `sizes` means empty nested tensor, so return empty strides
if (sizes.dim() == 0) {
return sizes;
@ -139,7 +139,7 @@ inline at::Tensor construct_nested_strides(const at::Tensor& sizes) {
*
* @return A tensor of offsets
*/
inline at::Tensor construct_offsets(const at::Tensor& sizes) {
at::Tensor construct_offsets(const at::Tensor& sizes) {
// empty `sizes` means empty nested tensor, so return empty strides
if (sizes.dim() == 0) {
return at::empty({0}, sizes.options().dtype(kLong));


@ -14,6 +14,8 @@ namespace at::native {
struct NestedTensorImpl;
inline bool nested_tensor_impl_is_contiguous(const NestedTensorImpl* nt);
int64_t get_numel_from_nested_size_tensor(const at::Tensor& tensor);
at::Tensor construct_nested_strides(const at::Tensor& nested_size);
at::Tensor construct_offsets(const at::Tensor& nested_size);
struct TORCH_API NestedTensorImpl : public c10::TensorImpl {
explicit NestedTensorImpl(


@ -152,7 +152,7 @@ void invoke_parallel(
std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
std::exception_ptr eptr;
std::mutex mutex;
volatile size_t remaining{0};
std::atomic_size_t remaining{0};
std::condition_variable cv;
} state;
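
`volatile` only stops the compiler from caching loads; it neither makes read-modify-write operations atomic nor establishes happens-before ordering between threads. A minimal sketch of the corrected completion-counter pattern (names are illustrative, not the actual invoke_parallel code):
```
#include <atomic>
#include <condition_variable>
#include <mutex>

struct ParallelState {
  std::atomic_size_t remaining{0};
  std::mutex mutex;
  std::condition_variable cv;
};

// Each worker calls this when it finishes; the last one wakes the waiter.
void signal_done(ParallelState& state) {
  if (state.remaining.fetch_sub(1) == 1) {  // atomic decrement, no lost updates
    std::lock_guard<std::mutex> lock(state.mutex);
    state.cv.notify_one();
  }
}
```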


@ -835,7 +835,7 @@ void TensorIteratorBase::cast_outputs() {
// and tensor, this condition should no longer ever be true
const auto &original_tensor = op.original_tensor();
const auto &tensor = op.tensor();
if (original_tensor.sizes() != tensor.sizes()){
if (original_tensor.sizes() != tensor.sizes()) {
original_tensor.resize_as_(tensor).as_strided_(tensor.sizes(), tensor.strides());
}
original_tensor.copy_(tensor);
@ -1196,6 +1196,9 @@ void TensorIteratorBase::mark_resize_outputs(const TensorIteratorConfig& config)
}
for (const auto i : c10::irange(num_outputs_)) {
const auto& output = tensor(i);
if (!output.defined()) {
operands_[i].will_resize = true;
}
if (output.defined() && !output.sizes().equals(shape_)) {
if (config.resize_outputs_ && !operands_[i].is_read_write) {
operands_[i].will_resize = true;


@ -167,6 +167,17 @@ struct TORCH_API OperandInfo {
bool is_output = false;
// will_resize is only for output tensor.
// 1) Functional call (like torch.add(self, other)): the output tensor is
// undefined, and PyTorch creates a new tensor using the common shape
// and computed stride in TensorIterator;
// 2) Inplace call (like torch.add_(self, other)): the output tensor is the
// same as the input tensor, so its size and stride can't be modified;
// 3) Op call with output (like torch.add(self, other, out=output)): the
// output tensor is defined, but its shape may differ from the common
// shape. If the shapes differ, this output tensor will be resized using
// the common shape and computed stride in TensorIterator; otherwise its
// size and stride can't be modified.
bool will_resize = false;
bool is_read_write = false;
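
A short sketch of the three call shapes the comment enumerates, in their C++ ATen spellings (the function name is illustrative):
```
#include <ATen/ATen.h>

void tensor_iterator_output_cases() {
  at::Tensor a = at::rand({2, 3});
  at::Tensor b = at::rand({2, 3});

  at::Tensor r = at::add(a, b);   // 1) functional: output starts undefined
  a.add_(b);                      // 2) in-place: output aliases the input
  at::Tensor out = at::empty({0});
  at::add_out(out, a, b);         // 3) out=: mismatched shape, so will_resize
}
```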
@ -472,6 +483,21 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
operands_[arg].data = data;
}
// Helper functions for custom device, custom device can get OperandInfo and
// NameVector in their side.
const OperandInfo& operand(int arg = 0) const {
return operands_[arg];
}
OperandInfo& operand(int arg = 0) {
return operands_[arg];
}
NameVector& get_dim_names() {
return names_;
}
const NameVector& get_dim_names() const {
return names_;
}
/// true if the stride computation can use 32-bit arithmetic. Used by GPU
/// kernels
bool can_use_32bit_indexing() const;


@ -104,37 +104,40 @@ inline void maybe_wrap_dims(
// dimension behavior and dimension size checking). We maintain this behavior
// for backwards compatibility, but only for this specific size (i.e. other
// empty sizes are not skipped).
template <typename T>
inline int64_t _legacy_cat_wrap_dim(
inline int64_t legacy_cat_wrap_dim(
int64_t dim,
const std::vector<std::vector<T>>& tensor_sizes) {
const std::vector<std::vector<int64_t>>& tensor_sizes) {
for (auto& sizes : tensor_sizes) {
if (sizes.size() == 1 && sizes[0] == 0) {
continue;
}
return maybe_wrap_dim(dim, sizes.size());
return maybe_wrap_dim(dim, static_cast<int64_t>(sizes.size()));
}
return dim;
}
inline int64_t legacy_cat_wrap_dim_symint(
int64_t dim,
const std::vector<std::vector<c10::SymInt>>& tensor_sizes) {
for (auto& sizes : tensor_sizes) {
if (sizes.size() == 1) {
if (TORCH_GUARD_SIZE_OBLIVIOUS(sizes[0].sym_eq(0))) {
continue;
}
}
return maybe_wrap_dim(dim, static_cast<int64_t>(sizes.size()));
}
return dim;
}
inline int64_t legacy_cat_wrap_dim(
int64_t dim,
const std::vector<std::vector<int64_t>>& tensor_sizes) {
return _legacy_cat_wrap_dim<int64_t>(dim, tensor_sizes);
}
inline int64_t legacy_cat_wrap_dim_symint(
int64_t dim,
const std::vector<std::vector<c10::SymInt>>& tensor_sizes) {
return _legacy_cat_wrap_dim<c10::SymInt>(dim, tensor_sizes);
}
inline int64_t legacy_cat_wrap_dim(
int64_t dim,
const MaterializedITensorListRef& tensors) {
for (const Tensor& tensor : tensors) {
if (tensor.dim() == 1 && tensor.sizes()[0] == 0) {
continue;
if (tensor.dim() == 1) {
if (TORCH_GUARD_SIZE_OBLIVIOUS(tensor.sym_sizes()[0].sym_eq(0))) {
continue;
}
}
return maybe_wrap_dim(dim, tensor.dim());
}
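
For context, a hypothetical usage sketch of the legacy behavior this helper preserves: a 1-D tensor of size 0 is skipped when choosing the tensor whose rank wraps `dim`, so it can ride along in a cat of higher-rank tensors.
```
#include <ATen/ATen.h>

void legacy_cat_sketch() {
  at::Tensor t = at::rand({2, 3});
  at::Tensor legacy_empty = at::empty({0});  // 1-D, size 0: skipped by the wrap
  // dim is wrapped against t's rank only; the empty tensor contributes nothing.
  at::Tensor r = at::cat({t, legacy_empty}, /*dim=*/1);
}
```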


@ -6,10 +6,11 @@
#include <c10/macros/Macros.h>
#include <c10/util/irange.h>
namespace at { namespace detail {
namespace at::detail {
template <typename T, int size_>
struct Array {
// NOLINTNEXTLINE(*c-array*)
T data[size_];
C10_HOST_DEVICE T operator[](int i) const {
@ -27,7 +28,9 @@ struct Array {
Array(const Array&) = default;
Array& operator=(const Array&) = default;
#endif
static constexpr int size(){return size_;}
static constexpr int size() {
return size_;
}
// Fill the array with x.
C10_HOST_DEVICE Array(T x) {
for (int i = 0; i < size_; i++) {
@ -36,4 +39,4 @@ struct Array {
}
};
}}
} // namespace at::detail
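
Many of the hunks that follow are the same mechanical cleanup; a minimal sketch of the C++17 feature involved (namespace and struct names are illustrative):
```
// Pre-C++17: nested namespaces had to be opened one at a time.
namespace outer { namespace inner {
struct OldStyle {};
}} // namespace outer::inner

// C++17: a nested namespace definition, identical in meaning.
namespace outer::inner {
struct NewStyle {};
} // namespace outer::inner
```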


@ -1,6 +1,6 @@
#include <c10/core/TensorOptions.h>
namespace c10 { namespace impl {
namespace c10::impl {
inline c10::optional<MemoryFormat>
check_tensor_options_and_extract_memory_format(
@ -22,4 +22,4 @@ check_tensor_options_and_extract_memory_format(
}
}
}} // namespace impl namespace c10
} // namespace c10::impl


@ -22,6 +22,7 @@ class TORCH_API DeprecatedTypePropertiesRegistry {
DeprecatedTypeProperties& getDeprecatedTypeProperties(Backend p, ScalarType s) const;
private:
// NOLINTNEXTLINE(*c-array*)
std::unique_ptr<DeprecatedTypeProperties> registry
[static_cast<int>(Backend::NumOptions)]
[static_cast<int>(ScalarType::NumOptions)];


@ -1,7 +1,7 @@
#include <ATen/core/Dict.h>
namespace c10 {
namespace detail {
namespace c10::detail {
bool operator==(const DictImpl& lhs, const DictImpl& rhs) {
bool isEqualFastChecks =
*lhs.elementTypes.keyType == *rhs.elementTypes.keyType &&
@ -25,5 +25,4 @@ bool operator==(const DictImpl& lhs, const DictImpl& rhs) {
return true;
}
} // namespace detail
} // namespace c10
} // namespace c10::detail


@ -207,7 +207,7 @@ template<class Key, class Value> Dict<IValue, IValue> toGenericDict(Dict<Key, Va
template<class Key, class Value>
class Dict final {
private:
static_assert((std::is_same<IValue, Key>::value && std::is_same<IValue, Value>::value) || guts::typelist::contains<impl::valid_dict_key_types, Key>::value, "Invalid Key type for Dict. We only support int64_t, double, bool, and string.");
static_assert((std::is_same_v<IValue, Key> && std::is_same_v<IValue, Value>) || guts::typelist::contains<impl::valid_dict_key_types, Key>::value, "Invalid Key type for Dict. We only support int64_t, double, bool, and string.");
// impl_ stores the underlying map as a ska_ordered::order_preserving_flat_hash_map.
// We intentionally don't offer conversion from/to


@ -20,7 +20,7 @@ bool Dimname::isValidName(const std::string& name) {
// letters A through Z, the underscore _ and, except for the first
// character, the digits 0 through 9" (at least length 1)
// https://docs.python.org/3/reference/lexical_analysis.html#identifiers
if (name.length() == 0) {
if (name.empty()) {
return false;
}
for (auto it = name.begin(); it != name.end(); ++it) {


@ -160,7 +160,7 @@ static void __printIndent(std::ostream &stream, int64_t indent)
static void printScale(std::ostream & stream, double scale) {
FormatGuard guard(stream);
stream << defaultfloat << scale << " *" << std::endl;
stream << defaultfloat << scale << " *" << '\n';
}
static void __printMatrix(std::ostream& stream, const Tensor& self, int64_t linesize, int64_t indent)
{
@ -178,7 +178,7 @@ static void __printMatrix(std::ostream& stream, const Tensor& self, int64_t line
}
if(nColumnPerLine < self.size(1)) {
if(firstColumn != 0) {
stream << std::endl;
stream << '\n';
}
stream << "Columns " << firstColumn+1 << " to " << lastColumn+1;
__printIndent(stream, indent);
@ -193,7 +193,7 @@ static void __printMatrix(std::ostream& stream, const Tensor& self, int64_t line
for (const auto c : c10::irange(firstColumn, lastColumn+1)) {
stream << std::setw(sz) << row_ptr[c]/scale;
if(c == lastColumn) {
stream << std::endl;
stream << '\n';
if(l != self.size(0)-1) {
if(scale != 1) {
__printIndent(stream, indent);
@ -239,7 +239,7 @@ static void __printTensor(std::ostream& stream, Tensor& self, int64_t linesize)
if(start) {
start = false;
} else {
stream << std::endl;
stream << '\n';
}
stream << "(";
Tensor tensor = self;
@ -247,7 +247,7 @@ static void __printTensor(std::ostream& stream, Tensor& self, int64_t linesize)
tensor = tensor.select(0, counter[i]);
stream << counter[i]+1 << ",";
}
stream << ".,.) = " << std::endl;
stream << ".,.) = " << '\n';
__printMatrix(stream, tensor, linesize, 1);
}
}
@ -279,7 +279,7 @@ std::ostream& print(std::ostream& stream, const Tensor & tensor_, int64_t linesi
tensor = tensor_.to(kCPU, kDouble).contiguous();
}
if(tensor.ndimension() == 0) {
stream << defaultfloat << tensor.data_ptr<double>()[0] << std::endl;
stream << defaultfloat << tensor.data_ptr<double>()[0] << '\n';
stream << "[ " << tensor_.toString() << "{}";
} else if(tensor.ndimension() == 1) {
if (tensor.numel() > 0) {
@ -289,7 +289,7 @@ std::ostream& print(std::ostream& stream, const Tensor & tensor_, int64_t linesi
}
double* tensor_p = tensor.data_ptr<double>();
for (const auto i : c10::irange(tensor.size(0))) {
stream << std::setw(sz) << tensor_p[i]/scale << std::endl;
stream << std::setw(sz) << tensor_p[i]/scale << '\n';
}
}
stream << "[ " << tensor_.toString() << "{" << tensor.size(0) << "}";
@ -329,7 +329,7 @@ std::ostream& print(std::ostream& stream, const Tensor & tensor_, int64_t linesi
if (tensor.getIntrusivePtr()->autograd_meta()) {
auto& fw_grad = tensor._fw_grad(/* level */ 0);
if (fw_grad.defined()) {
stream << ", tangent:" << std::endl << fw_grad;
stream << ", tangent:" << '\n' << fw_grad;
}
}
stream << " ]";


@ -1,12 +1,9 @@
#pragma once
#include <mutex>
#include <deque>
#include <atomic>
#include <typeinfo>
#include <utility>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <c10/util/Exception.h>
#include <c10/util/intrusive_ptr.h>


@ -307,10 +307,10 @@ class IListRefTagImplBase {};
* reference type, then it's left unchanged.
*/
template <typename T>
using _MaterializedIListRefElem = typename std::conditional<
std::is_reference<T>::value,
typename std::reference_wrapper<typename std::remove_reference<T>::type>,
T>::type;
using _MaterializedIListRefElem = std::conditional_t<
std::is_reference_v<T>,
typename std::reference_wrapper<std::remove_reference_t<T>>,
T>;
template <typename T>
using MaterializedIListRefElem = _MaterializedIListRefElem<IListRefConstRef<T>>;
@ -540,7 +540,7 @@ class IListRef {
template <
typename... UnboxedConstructorArgs,
typename = std::enable_if_t<
std::is_constructible<unboxed_type, UnboxedConstructorArgs...>::value>>
std::is_constructible_v<unboxed_type, UnboxedConstructorArgs...>>>
IListRef(UnboxedConstructorArgs&&... args) : tag_(IListRefTag::Unboxed) {
payload_.unboxed = unboxed_type(std::forward<UnboxedConstructorArgs>(args)...);
}
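
The `_v` / `_t` rewrites across these files are pure shorthand; a small sketch showing the equivalences:
```
#include <type_traits>

static_assert(std::is_same<int, int>::value, "C++11 long form");
static_assert(std::is_same_v<int, int>, "C++17 variable template");

using Long  = typename std::conditional<false, int, long>::type;  // long form
using Short = std::conditional_t<false, int, long>;               // alias template
static_assert(std::is_same_v<Long, Short>, "identical results");
```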


@ -8,8 +8,8 @@ class Tensor;
class OptionalTensorRef;
}
namespace c10 {
namespace detail {
namespace c10::detail {
/*
* Specializations of `IListRefTagImplBase` that implement the default
@ -184,8 +184,8 @@ class IListRefTagImpl<IListRefTag::Materialized, at::OptionalTensorRef>
at::OptionalTensorRef,
MaterializedIListRefElem<at::OptionalTensorRef>> {};
} // namespace detail
} // namespace c10
} // namespace c10::detail
namespace at {


@ -103,7 +103,7 @@ TEST(ITensorListRefTest, Boxed_GetConstRefTensor) {
const List<at::Tensor> boxed(vec);
at::ITensorListRef list(boxed);
static_assert(
std::is_same<decltype(*list.begin()), const at::Tensor&>::value,
std::is_same_v<decltype(*list.begin()), const at::Tensor&>,
"Accessing elements from List<Tensor> through a ITensorListRef should be const references.");
EXPECT_TRUE(boxed[0].is_same(*list.begin()));
EXPECT_TRUE(boxed[1].is_same(*(++list.begin())));
@ -113,7 +113,7 @@ TEST(ITensorListRefTest, Unboxed_GetConstRefTensor) {
auto vec = get_tensor_vector();
at::ITensorListRef list(vec);
static_assert(
std::is_same<decltype(*list.begin()), const at::Tensor&>::value,
std::is_same_v<decltype(*list.begin()), const at::Tensor&>,
"Accessing elements from ArrayRef<Tensor> through a ITensorListRef should be const references.");
EXPECT_TRUE(vec[0].is_same(*list.begin()));
EXPECT_TRUE(vec[1].is_same(*(++list.begin())));


@ -1,7 +1,7 @@
#include <ATen/core/List.h>
namespace c10 {
namespace detail {
namespace c10::detail {
bool operator==(const ListImpl& lhs, const ListImpl& rhs) {
return *lhs.elementType == *rhs.elementType &&
lhs.list.size() == rhs.list.size() &&
@ -16,5 +16,4 @@ bool operator==(const ListImpl& lhs, const ListImpl& rhs) {
ListImpl::ListImpl(list_type list_, TypePtr elementType_)
: list(std::move(list_))
, elementType(std::move(elementType_)) {}
} // namespace detail
} // namespace c10
} // namespace c10::detail


@ -44,7 +44,7 @@ template<class T, class Iterator> class ListIterator;
template<class T, class Iterator> class ListElementReference;
template<class T, class Iterator>
void swap(ListElementReference<T, Iterator>&& lhs, ListElementReference<T, Iterator>&& rhs);
void swap(ListElementReference<T, Iterator>&& lhs, ListElementReference<T, Iterator>&& rhs) noexcept;
template<class T, class Iterator>
bool operator==(const ListElementReference<T, Iterator>& lhs, const T& rhs);
@ -68,8 +68,8 @@ template<class T, class Iterator>
class ListElementReference final {
public:
operator std::conditional_t<
std::is_reference<typename c10::detail::
ivalue_to_const_ref_overload_return<T>::type>::value,
std::is_reference_v<typename c10::detail::
ivalue_to_const_ref_overload_return<T>::type>,
const T&,
T>() const;
@ -84,7 +84,7 @@ public:
return *iterator_;
}
friend void swap<T, Iterator>(ListElementReference&& lhs, ListElementReference&& rhs);
friend void swap<T, Iterator>(ListElementReference&& lhs, ListElementReference&& rhs) noexcept;
ListElementReference(const ListElementReference&) = delete;
ListElementReference& operator=(const ListElementReference&) = delete;


@ -120,8 +120,8 @@ namespace impl {
template <class T, class Iterator>
ListElementReference<T, Iterator>::operator std::conditional_t<
std::is_reference<typename c10::detail::ivalue_to_const_ref_overload_return<
T>::type>::value,
std::is_reference_v<typename c10::detail::ivalue_to_const_ref_overload_return<
T>::type>,
const T&,
T>() const {
return iterator_->template to<T>();
@ -146,7 +146,7 @@ ListElementReference<T, Iterator>& ListElementReference<T, Iterator>::operator=(
}
template<class T, class Iterator>
void swap(ListElementReference<T, Iterator>&& lhs, ListElementReference<T, Iterator>&& rhs) {
void swap(ListElementReference<T, Iterator>&& lhs, ListElementReference<T, Iterator>&& rhs) noexcept {
std::swap(*lhs.iterator_, *rhs.iterator_);
}


@ -1118,7 +1118,7 @@ TEST(ListTestNonIValueBasedList, sameValueDifferentStorage_thenIsReturnsFalse) {
TEST(ListTest, canAccessStringByReference) {
List<std::string> list({"one", "two"});
const auto& listRef = list;
static_assert(std::is_same<decltype(listRef[1]), const std::string&>::value,
static_assert(std::is_same_v<decltype(listRef[1]), const std::string&>,
"const List<std::string> access should be by const reference");
std::string str = list[1];
const std::string& strRef = listRef[1];
@ -1130,7 +1130,7 @@ TEST(ListTest, canAccessOptionalStringByReference) {
List<c10::optional<std::string>> list({"one", "two", c10::nullopt});
const auto& listRef = list;
static_assert(
std::is_same<decltype(listRef[1]), c10::optional<std::reference_wrapper<const std::string>>>::value,
std::is_same_v<decltype(listRef[1]), c10::optional<std::reference_wrapper<const std::string>>>,
"List<c10::optional<std::string>> access should be by const reference");
c10::optional<std::string> str1 = list[1];
c10::optional<std::string> str2 = list[2];
@ -1148,7 +1148,7 @@ TEST(ListTest, canAccessTensorByReference) {
List<at::Tensor> list;
const auto& listRef = list;
static_assert(
std::is_same<decltype(listRef[0]), const at::Tensor&>::value,
std::is_same_v<decltype(listRef[0]), const at::Tensor&>,
"List<at::Tensor> access should be by const reference");
}


@ -121,9 +121,9 @@ void internal_set_names_inplace(TensorImpl* impl, std::vector<Dimname>&& names,
}
auto* meta = get_named_tensor_meta(impl);
if (meta == nullptr) {
impl->set_named_tensor_meta(std::make_unique<NamedTensorMeta>(NamedTensorMeta::HasNonWildcard, names));
impl->set_named_tensor_meta(std::make_unique<NamedTensorMeta>(NamedTensorMeta::HasNonWildcard, std::move(names)));
} else {
meta->set_names(NamedTensorMeta::HasNonWildcard, names);
meta->set_names(NamedTensorMeta::HasNonWildcard, std::move(names));
}
}


@ -44,7 +44,7 @@ struct TORCH_API NamedTensorMeta final : public c10::NamedTensorMetaInterface {
// Used for an assertion in TensorImpl.h
int64_t slow_dim() const override {
return names_.size();
return static_cast<int64_t>(names_.size());
}
void check_invariants() const {
@ -79,7 +79,7 @@ struct TORCH_API NamesMode {
// A RAII, thread local (!) guard that enables or disables names upon
// construction, and sets it back to the original value upon destruction.
struct TORCH_API NoNamesGuard {
NoNamesGuard() : prev_mode(NamesMode::is_enabled()), initialized(true) {
NoNamesGuard() : prev_mode(NamesMode::is_enabled()) {
NamesMode::set_enabled(false);
}
~NoNamesGuard() {
@ -93,7 +93,7 @@ struct TORCH_API NoNamesGuard {
}
private:
bool prev_mode;
bool initialized;
bool initialized{true};
};
void check_names_valid_for(const TensorBase& tensor, DimnameList names);
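
The `initialized{true}` change above is a default member initializer; a minimal sketch of why it is preferred (the type is hypothetical):
```
// With a default member initializer, every constructor that does not set
// `initialized` explicitly still gets the correct value; with a plain member
// plus a mem-initializer list, each new constructor must remember to set it.
struct GuardSketch {
  explicit GuardSketch(bool prev) : prev_mode(prev) {}  // initialized -> true
  bool prev_mode;
  bool initialized{true};
};
```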


@ -67,9 +67,7 @@ c10::SymNode NestedIntSymNodeImpl::le(const c10::SymNode& other) {
}
c10::SymNode NestedIntSymNodeImpl::mul(const c10::SymNode& other) {
if (auto mb_si = other->nested_int()) {
TORCH_CHECK(false, "nested int cannot be multiplied by nested int");
}
TORCH_CHECK(!other->nested_int(), "nested int cannot be multiplied by nested int");
c10::optional<int64_t> c = other->constant_int();
TORCH_CHECK(c.has_value());
return SymNode(c10::make_intrusive<NestedIntSymNodeImpl>(val_, coeff_ * *c));


@ -120,8 +120,8 @@ void preDispatchFallback(const c10::OperatorHandle& op, c10::DispatchKeySet disp
} // anonymous namespace
namespace at {
namespace impl {
namespace at::impl {
RestorePythonTLSSnapshot::RestorePythonTLSSnapshot() : saved_(safe_get_tls_on_entry()), guard_(safe_get_tls_on_entry()) {
tls_on_entry = c10::nullopt;
@ -148,8 +148,7 @@ MaybeSetTLSOnEntryGuard::~MaybeSetTLSOnEntryGuard() {
}
} // namespace impl
} // namespace at
} // namespace at::impl
TORCH_LIBRARY_IMPL(_, Python, m) {
m.fallback(torch::CppFunction::makeFromBoxedFunction<&pythonFallback>());


@ -1,8 +1,8 @@
#pragma once
#include <ATen/core/TorchDispatchUtils.h>
namespace at {
namespace impl {
namespace at::impl {
struct TORCH_API RestorePythonTLSSnapshot {
RestorePythonTLSSnapshot();
@ -24,5 +24,4 @@ private:
bool value_set_;
};
} // namespace impl
} // namespace at
} // namespace at::impl


@ -1,7 +1,6 @@
#include <ATen/core/PythonOpRegistrationTrampoline.h>
namespace at {
namespace impl {
namespace at::impl {
// The strategy is that all python interpreters attempt to register themselves
// as the main interpreter, but only one wins. Only that interpreter is
@ -9,14 +8,15 @@ namespace impl {
// logic on that interpreter, we do so hermetically, never setting pyobj field
// on Tensor.
std::atomic<c10::impl::PyInterpreter*> PythonOpRegistrationTrampoline::interpreter_{nullptr};
std::atomic<c10::impl::PyInterpreter*>
PythonOpRegistrationTrampoline::interpreter_{nullptr};
c10::impl::PyInterpreter* PythonOpRegistrationTrampoline::getInterpreter() {
return PythonOpRegistrationTrampoline::interpreter_.load();
}
bool PythonOpRegistrationTrampoline::registerInterpreter(c10::impl::PyInterpreter* interp) {
bool PythonOpRegistrationTrampoline::registerInterpreter(
c10::impl::PyInterpreter* interp) {
c10::impl::PyInterpreter* expected = nullptr;
interpreter_.compare_exchange_strong(expected, interp);
if (expected != nullptr) {
@ -29,5 +29,4 @@ bool PythonOpRegistrationTrampoline::registerInterpreter(c10::impl::PyInterprete
}
}
} // namespace impl
} // namespace at
} // namespace at::impl
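
The registration logic above is a lock-free "first writer wins" pattern; a minimal standalone sketch (names and the `int*` payload are illustrative):
```
#include <atomic>

std::atomic<int*> g_interpreter{nullptr};

// Returns true only for the first caller; later callers find the winner
// already stored and leave it untouched.
bool register_once(int* candidate) {
  int* expected = nullptr;
  return g_interpreter.compare_exchange_strong(expected, candidate);
}
```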


@ -4,8 +4,8 @@
// TODO: this can probably live in c10
namespace at {
namespace impl {
namespace at::impl {
class TORCH_API PythonOpRegistrationTrampoline final {
static std::atomic<c10::impl::PyInterpreter*> interpreter_;
@ -19,5 +19,4 @@ public:
static c10::impl::PyInterpreter* getInterpreter();
};
} // namespace impl
} // namespace at
} // namespace at::impl


@ -37,6 +37,7 @@ using QuantizerPtr = c10::intrusive_ptr<Quantizer>;
* share the same Quantizer. Quantizer should be immutable.
*/
struct TORCH_API Quantizer : public c10::intrusive_ptr_target {
// NOLINTNEXTLINE(cppcoreguidelines-avoid-const-or-ref-data-members)
const ScalarType scalar_type_;
explicit Quantizer(ScalarType scalar_type) : scalar_type_(scalar_type) {}
~Quantizer() override;


@ -1,16 +1,14 @@
#pragma once
namespace at {
namespace Reduction {
namespace at::Reduction {
// NB: Keep this in sync with Reduction class in torch/nn/_reduction.py
// These constants control the reduction behavior of loss functions.
// Ideally, this would be a scoped enum, but jit doesn't support that
enum Reduction {
None, // Do not reduce
Mean, // (Possibly weighted) mean of losses
Sum, // Sum losses
None, // Do not reduce
Mean, // (Possibly weighted) mean of losses
Sum, // Sum losses
END
};
} // namespace Reduction
} // namespace at
} // namespace at::Reduction


@ -72,9 +72,9 @@ void TensorBase::enforce_invariants() {
void TensorBase::print() const {
if (defined()) {
std::cerr << "[" << toString() << " " << sizes() << "]" << std::endl;
std::cerr << "[" << toString() << " " << sizes() << "]" << '\n';
} else {
std::cerr << "[UndefinedTensor]" << std::endl;
std::cerr << "[UndefinedTensor]" << '\n';
}
}


@ -68,6 +68,7 @@ class TORCH_API TensorRef {
};
template <typename T>
// NOLINTNEXTLINE(cppcoreguidelines-missing-std-forward)
auto Tensor::register_hook(T&& hook) const -> Tensor::hook_return_void_t<T> {
// Return the grad argument in case of a hook with void return type to have an
// std::function with Tensor return type
@ -81,6 +82,7 @@ auto Tensor::register_hook(T&& hook) const -> Tensor::hook_return_void_t<T> {
}
template <typename T>
// NOLINTNEXTLINE(cppcoreguidelines-missing-std-forward)
auto Tensor::register_hook(T&& hook) const -> Tensor::hook_return_var_t<T> {
return _register_hook([fn=std::forward<T>(hook)](const TensorBase& grad_base) {
TensorRef grad(grad_base);


@ -7,6 +7,7 @@
#include <c10/util/irange.h>
#include <cstddef>
#include <cstdint>
#include <type_traits>
namespace at {
@ -131,7 +132,7 @@ public:
}
// if index_t is not int64_t, we want to have an int64_t constructor
template <typename source_index_t, class = typename std::enable_if<std::is_same<source_index_t, int64_t>::value>::type>
template <typename source_index_t, class = std::enable_if_t<std::is_same_v<source_index_t, int64_t>>>
// NOLINTNEXTLINE(cppcoreguidelines-pro-type-member-init)
C10_HOST GenericPackedTensorAccessorBase(
PtrType data_,
@ -184,7 +185,7 @@ public:
: GenericPackedTensorAccessorBase<T, N, PtrTraits, index_t>(data_, sizes_, strides_) {}
// if index_t is not int64_t, we want to have an int64_t constructor
template <typename source_index_t, class = typename std::enable_if<std::is_same<source_index_t, int64_t>::value>::type>
template <typename source_index_t, class = std::enable_if_t<std::is_same_v<source_index_t, int64_t>>>
C10_HOST GenericPackedTensorAccessor(
PtrType data_,
const source_index_t* sizes_,
@ -231,7 +232,7 @@ public:
: GenericPackedTensorAccessorBase<T, 1, PtrTraits, index_t>(data_, sizes_, strides_) {}
// if index_t is not int64_t, we want to have an int64_t constructor
template <typename source_index_t, class = typename std::enable_if<std::is_same<source_index_t, int64_t>::value>::type>
template <typename source_index_t, class = std::enable_if_t<std::is_same_v<source_index_t, int64_t>>>
C10_HOST GenericPackedTensorAccessor(
PtrType data_,
const source_index_t* sizes_,


@ -28,11 +28,11 @@ namespace c10 {
class Scalar;
}
namespace torch { namespace autograd {
namespace torch::autograd {
struct Node;
}} // namespace torch::autograd
} // namespace torch::autograd
namespace at {
@ -594,10 +594,10 @@ class TORCH_API TensorBase {
return mutable_data_ptr();
}
template <typename T, std::enable_if_t<!std::is_const<T>::value, int> = 0>
template <typename T, std::enable_if_t<!std::is_const_v<T>, int> = 0>
const T* const_data_ptr() const;
template <typename T, std::enable_if_t<std::is_const<T>::value, int> = 0>
template <typename T, std::enable_if_t<std::is_const_v<T>, int> = 0>
const std::remove_const_t<T>* const_data_ptr() const;
template <typename T>
@ -831,9 +831,9 @@ class TORCH_API TensorBase {
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
template <typename T>
using hook_return_void_t = std::enable_if_t<std::is_void<typename c10::invoke_result_t<T&, TensorBase>>::value, unsigned>;
using hook_return_void_t = std::enable_if_t<std::is_void_v<typename c10::invoke_result_t<T&, TensorBase>>, unsigned>;
template <typename T>
using hook_return_var_t = std::enable_if_t<std::is_same<typename c10::invoke_result_t<T&, TensorBase>, TensorBase>::value, unsigned>;
using hook_return_var_t = std::enable_if_t<std::is_same_v<typename c10::invoke_result_t<T&, TensorBase>, TensorBase>, unsigned>;
/// Registers a backward hook.
///
@ -925,10 +925,11 @@ inline DeviceIndex get_device(const TensorBase& self) {
}
template <typename T>
// NOLINTNEXTLINE(cppcoreguidelines-missing-std-forward)
auto TensorBase::register_hook(T&& hook) const -> TensorBase::hook_return_void_t<T> {
// Return the grad argument in case of a hook with void return type to have an
// std::function with Tensor return type
static_assert(std::is_same<decltype(hook(TensorBase())), void>::value,
static_assert(std::is_same_v<decltype(hook(TensorBase())), void>,
"Expected hook to return void");
return _register_hook([fn=std::forward<T>(hook)](const TensorBase& grad) {
fn(grad);
@ -1026,9 +1027,9 @@ inline c10::MaybeOwned<TensorBase> TensorBase::expect_contiguous(MemoryFormat me
namespace symint {
template <typename T>
using enable_if_symint = std::enable_if_t<std::is_same<T, c10::SymInt>::value>;
using enable_if_symint = std::enable_if_t<std::is_same_v<T, c10::SymInt>>;
template <typename T>
using enable_if_int = std::enable_if_t<std::is_same<T, int64_t>::value>;
using enable_if_int = std::enable_if_t<std::is_same_v<T, int64_t>>;
template <typename T, typename = enable_if_symint<T>>
c10::SymIntArrayRef sizes(const TensorBase& t) { return t.sym_sizes(); }


@ -1,7 +1,7 @@
#include <ATen/core/TorchDispatchUtils.h>
namespace at {
namespace impl {
namespace at::impl {
bool tensor_has_dispatch(const at::Tensor& t) {
DispatchKeySet key_set({DispatchKey::Python, DispatchKey::PythonTLSSnapshot});
@ -27,5 +27,4 @@ bool tensorlist_has_dispatch(const c10::List<c10::optional<at::Tensor>>& li) {
return false;
}
} // namespace impl
} // namespace at
} // namespace at::impl


@ -6,12 +6,11 @@
#include <c10/util/Optional.h>
#include <c10/core/impl/TorchDispatchModeTLS.h>
namespace at {
namespace impl {
namespace at::impl {
TORCH_API bool tensor_has_dispatch(const at::Tensor& t);
TORCH_API bool tensorlist_has_dispatch(at::ITensorListRef li);
TORCH_API bool tensorlist_has_dispatch(const c10::List<c10::optional<at::Tensor>>& li);
using c10::impl::dispatch_mode_enabled;
}}
}


@ -1,11 +1,13 @@
#include <ATen/NumericUtils.h>
#include <c10/macros/Macros.h>
#include <c10/util/Half.h>
#include <c10/util/BFloat16.h>
#include <c10/util/MathConstants.h>
#include <ATen/NumericUtils.h>
#include <limits>
#include <cmath>
#include <cstdint>
#include <cassert>
#include <limits>
#include <type_traits>
namespace at {
@ -54,12 +56,12 @@ C10_HOST_DEVICE inline T uniform_int_full_range(V val) {
* in this overloaded version
*/
template <typename T, typename V>
C10_HOST_DEVICE inline typename std::enable_if<!(std::is_floating_point<T>::value), T>::type uniform_int(V val) {
C10_HOST_DEVICE inline std::enable_if_t<!(std::is_floating_point_v<T>), T> uniform_int(V val) {
if constexpr (std::is_same_v<T, bool>) {
return static_cast<bool>(val & 1);
} else if constexpr (std::is_same_v<T, int64_t>) {
return static_cast<T>(val % (static_cast<uint64_t>(std::numeric_limits<T>::max()) + 1));
} else if constexpr (std::is_same_v<T, at::Half> || std::is_same<T, at::BFloat16>::value) {
} else if constexpr (std::is_same_v<T, at::Half> || std::is_same_v<T, at::BFloat16>) {
return static_cast<T>(val % static_cast<uint64_t>((1ULL << std::numeric_limits<T>::digits) + 1));
} else if constexpr (std::is_integral_v<T>) {
return static_cast<T>(val % (static_cast<uint64_t>(std::numeric_limits<T>::max()) + 1));
@ -74,7 +76,7 @@ C10_HOST_DEVICE inline typename std::enable_if<!(std::is_floating_point<T>::valu
* added to fix compiler warnings reported in GitHub issue 46391. T is either float or double in this version.
*/
template<typename T, typename V>
C10_HOST_DEVICE inline typename std::enable_if<std::is_floating_point<T>::value, T>::type uniform_int(V val) {
C10_HOST_DEVICE inline std::enable_if_t<std::is_floating_point_v<T>, T> uniform_int(V val) {
return static_cast<T>(val % static_cast<uint64_t>((1ULL << std::numeric_limits<T>::digits) + 1));
}
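
For the floating-point branch, the modulus keeps the raw draw inside the exactly-representable integer range of `T`; a sketch of the same mapping outside the template machinery (function name is hypothetical):
```
#include <cstdint>
#include <limits>

// For float, digits == 24, so results fall in [0, 2^24]; every value in that
// range is exactly representable, avoiding rounding bias in the cast.
template <typename T>
T fold_into_representable(uint64_t val) {
  return static_cast<T>(
      val % static_cast<uint64_t>((1ULL << std::numeric_limits<T>::digits) + 1));
}
```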


@ -1,6 +1,6 @@
#include <ATen/core/VariableHooksInterface.h>
namespace at { namespace impl {
namespace at::impl {
namespace {
VariableHooksInterface* hooks = nullptr;
@ -17,4 +17,4 @@ bool HasVariableHooks() {
return hooks != nullptr;
}
}} // namespace at::impl
} // namespace at::impl


@ -1,7 +1,7 @@
#pragma once
#include <c10/macros/Export.h>
#include <ATen/core/Tensor.h>
#include <c10/macros/Export.h>
// A little explanation about why this file exists at all. We have
// a few methods on Tensor class which require access to reified access to
@ -29,20 +29,20 @@
// have weird signatures that are not supported by autograd, and (2)
// see this bug https://github.com/pytorch/pytorch/issues/30102
namespace torch { namespace autograd {
namespace torch::autograd {
struct Node;
}} // namespace torch::autograd
} // namespace torch::autograd
namespace at {
namespace impl {
namespace at::impl {
struct TORCH_API VariableHooksInterface {
virtual ~VariableHooksInterface() = default;
virtual TensorBase tensor_data(const TensorBase&) const = 0;
virtual TensorBase variable_data(const TensorBase&) const = 0;
virtual const std::shared_ptr<torch::autograd::Node>& grad_fn(const TensorBase&) const = 0;
virtual const std::shared_ptr<torch::autograd::Node>& grad_fn(
const TensorBase&) const = 0;
virtual unsigned _register_hook(
const TensorBase&,
std::function<TensorBase(const TensorBase&)> hook) const = 0;
@ -57,9 +57,17 @@ struct TORCH_API VariableHooksInterface {
virtual int64_t _version(const TensorBase&) const = 0;
virtual void retain_grad(const TensorBase&) const = 0;
virtual bool retains_grad(const TensorBase&) const = 0;
virtual void _backward(const Tensor&, TensorList, const c10::optional<Tensor>&, c10::optional<bool>, bool) const = 0;
virtual void _backward(
const Tensor&,
TensorList,
const c10::optional<Tensor>&,
c10::optional<bool>,
bool) const = 0;
virtual void requires_grad_(const TensorBase&, bool) const = 0;
virtual void basic_autograd_not_implemented_fallback(const c10::OperatorHandle& op, c10::DispatchKeySet dispatch_keys, torch::jit::Stack* stack) const = 0;
virtual void basic_autograd_not_implemented_fallback(
const c10::OperatorHandle& op,
c10::DispatchKeySet dispatch_keys,
torch::jit::Stack* stack) const = 0;
};
TORCH_API void SetVariableHooks(VariableHooksInterface* hooks);
@ -72,4 +80,4 @@ struct TORCH_API VariableHooksRegisterer {
}
};
}} // namespace at::impl
} // namespace at::impl


@ -1,8 +1,5 @@
#pragma once
#include <cstdint>
#include <tuple>
#include <type_traits>
#include <utility>
#include <c10/util/ArrayRef.h>


@ -2,8 +2,7 @@
#include <cstdlib>
#include <iostream>
namespace at {
namespace vitals {
namespace at::vitals {
APIVitals VitalsAPI;
@ -78,8 +77,7 @@ bool APIVitals::setVital(
auto iter = name_map_.find(vital_name);
TorchVital* vital = nullptr;
if (iter == name_map_.end()) {
auto r =
name_map_.emplace(vital_name, TorchVital(vital_name));
auto r = name_map_.emplace(vital_name, TorchVital(vital_name));
vital = &r.first->second;
} else {
vital = &iter->second;
@ -95,5 +93,4 @@ APIVitals::APIVitals() : vitals_enabled(false), name_map_() {
setVital("CUDA", "used", "False", /* force = */ true);
}
} // namespace vitals
} // namespace at
} // namespace at::vitals


@ -1,15 +1,11 @@
#pragma once
#include <cstring>
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <unordered_map>
#include <c10/core/impl/LocalDispatchKeySet.h>
namespace at {
namespace vitals {
namespace at::vitals {
TORCH_API bool torchVitalEnabled();
@ -82,8 +78,7 @@ class TORCH_API APIVitals {
extern TORCH_API APIVitals VitalsAPI;
} // namespace vitals
} // namespace at
} // namespace at::vitals
#define TORCH_VITAL_DECLARE(name) \
TORCH_API at::vitals::TorchVital TorchVital_##name;


@ -1,15 +1,13 @@
#include <ATen/core/op_registration/adaption.h>
namespace c10 {
namespace impl {
namespace c10::impl {
void common_device_check_failure(Device common_device, const at::Tensor& tensor, at::CheckedFrom methodName, at::CheckedFrom argName) {
TORCH_CHECK(false,
"Expected all tensors to be on the same device, but "
// NOLINTNEXTLINE(bugprone-unchecked-optional-access)
"found at least two devices, ", common_device, " and ", tensor.device(), "! "
"(when checking argument for argument ", argName, " in method ", methodName, ")");
}
} // namespace impl
} // namespace c10
} // namespace c10::impl


@ -1,10 +1,6 @@
#pragma once
#include <cstddef>
#include <sstream>
#include <type_traits>
#include <typeinfo>
#include <vector>
#include <c10/util/intrusive_ptr.h>
#include <c10/util/typeid.h>
@ -26,7 +22,7 @@ class TORCH_API Blob final : public c10::intrusive_ptr_target {
/**
* Initializes an empty Blob.
*/
Blob() noexcept : meta_(), pointer_(nullptr), has_ownership_(false) {}
Blob() noexcept : meta_() {}
~Blob() override {
Reset();
}
@ -148,11 +144,11 @@ class TORCH_API Blob final : public c10::intrusive_ptr_target {
* call is made or the blob is destructed.
*/
template <class T>
typename std::remove_const<T>::type* ShareExternal(
typename std::remove_const<T>::type* allocated) {
std::remove_const_t<T>* ShareExternal(
std::remove_const_t<T>* allocated) {
return static_cast<T*>(ShareExternal(
static_cast<void*>(allocated),
TypeMeta::Make<typename std::remove_const<T>::type>()));
TypeMeta::Make<std::remove_const_t<T>>()));
}
void* ShareExternal(void* allocated, const TypeMeta meta) {
@ -176,7 +172,7 @@ class TORCH_API Blob final : public c10::intrusive_ptr_target {
/**
* @brief Swaps the underlying storage of two blobs.
*/
void swap(Blob& rhs) {
void swap(Blob& rhs) noexcept {
using std::swap;
swap(meta_, rhs.meta_);
swap(pointer_, rhs.pointer_);
@ -191,13 +187,13 @@ class TORCH_API Blob final : public c10::intrusive_ptr_target {
}
TypeMeta meta_;
void* pointer_;
bool has_ownership_;
void* pointer_{nullptr};
bool has_ownership_{false};
C10_DISABLE_COPY_AND_ASSIGN(Blob);
};
inline void swap(Blob& lhs, Blob& rhs) {
inline void swap(Blob& lhs, Blob& rhs) noexcept {
lhs.swap(rhs);
}
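
A `noexcept` swap matters to generic code; a minimal sketch of the convention adopted here (the type is hypothetical):
```
#include <type_traits>
#include <utility>

struct BlobSketch {
  void* pointer{nullptr};
  void swap(BlobSketch& rhs) noexcept { std::swap(pointer, rhs.pointer); }
};
inline void swap(BlobSketch& a, BlobSketch& b) noexcept { a.swap(b); }

// Standard algorithms and containers can query this trait to choose the
// strong-exception-safety fast path.
static_assert(std::is_nothrow_swappable_v<BlobSketch>, "noexcept swap");
```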


@ -7,8 +7,7 @@
#include <functional>
#include <utility>
namespace torch {
namespace jit {
namespace torch::jit {
struct BuiltinOpFunction : public Function {
BuiltinOpFunction(
@ -62,12 +61,16 @@ struct BuiltinOpFunction : public Function {
return *this;
}
bool call(Stack& stack, c10::optional<size_t>, c10::function_ref<void(const Code&)>) override {
bool call(
Stack& stack,
c10::optional<size_t>,
c10::function_ref<void(const Code&)>) override {
run(stack);
return false;
}
bool call(Stack& stack, c10::function_ref<void(const mobile::Code&)>) override {
bool call(Stack& stack, c10::function_ref<void(const mobile::Code&)>)
override {
run(stack);
return false;
}
@ -84,5 +87,4 @@ struct BuiltinOpFunction : public Function {
std::string doc_string_;
};
} // namespace jit
} // namespace torch
} // namespace torch::jit


@ -6,12 +6,12 @@
#include <ATen/core/jit_type_base.h>
#include <c10/util/Optional.h>
namespace torch {
namespace jit {
namespace torch::jit {
struct CompilationUnit;
struct Function;
} // namespace jit
} // namespace torch
} // namespace torch::jit
namespace c10 {
@ -390,7 +390,7 @@ struct TORCH_API ClassType : public NamedType {
std::string doc_string = "",
std::vector<std::string> unresolved_class_attributes = {});
std::string annotation_str_impl(C10_UNUSED TypePrinter printer = nullptr) const override {
std::string annotation_str_impl(C10_UNUSED const TypePrinter& printer = nullptr) const override {
const auto& n = name().value();
return n.qualifiedName();
}


@ -187,7 +187,7 @@ class DynamicType : public SharedType {
bool equals(const DynamicType& other) const;
template <typename F>
bool compareArguments(const DynamicType& other, F&& f) const {
bool compareArguments(const DynamicType& other, const F& f) const {
if (arguments_.elems.size() != other.arguments_.elems.size()) {
return false;
}


@ -88,7 +88,7 @@ struct TORCH_API EnumType : public NamedType {
cu_(std::move(cu)) {}
std::string annotation_str_impl(
C10_UNUSED TypePrinter printer = nullptr) const override {
C10_UNUSED const TypePrinter& printer = nullptr) const override {
const auto& n = name().value();
return n.qualifiedName();
}


@ -14,8 +14,7 @@ namespace at {
TORCH_API void launch(std::function<void()> func);
}
namespace torch {
namespace jit {
namespace torch::jit {
struct Graph;
struct Code;
@ -29,7 +28,9 @@ using Kwargs = std::unordered_map<std::string, at::IValue>;
struct RecursiveMethodCallError : public std::exception {};
using TaskLauncher = std::function<void(std::function<void()>)>;
TORCH_API void preoptimizeGraph(std::shared_ptr<Graph>& graph, bool disable_autocast=false);
TORCH_API void preoptimizeGraph(
std::shared_ptr<Graph>& graph,
bool disable_autocast = false);
// A Function is a pure Graph with no implicit `self` object bound.
// It contains schema information and the executor that manages the
@ -54,14 +55,13 @@ struct TORCH_API Function {
virtual c10::intrusive_ptr<c10::ivalue::Future> runAsync(
Stack& /*stack*/,
// NOLINTNEXTLINE(performance-unnecessary-value-param)
C10_UNUSED TaskLauncher taskLauncher = at::launch) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(false);
return {};
}
at::IValue operator()(
Stack stack,
const Kwargs& kwargs = Kwargs()) {
at::IValue operator()(Stack stack, const Kwargs& kwargs = Kwargs()) {
getSchema().checkAndNormalizeInputs(stack, kwargs);
run(stack);
return stack.front();
@ -93,8 +93,12 @@ struct TORCH_API Function {
// If call() returns true, then callback completes successfully, otherwise
// call() returns false.
// Overload for server interpreter, a bailout size is needed for graph executor.
virtual bool call(Stack&, c10::optional<size_t>, c10::function_ref<void(const Code&)>) {
// Overload for server interpreter, a bailout size is needed for graph
// executor.
virtual bool call(
Stack&,
c10::optional<size_t>,
c10::function_ref<void(const Code&)>) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(false);
return false;
}
@ -107,5 +111,4 @@ struct TORCH_API Function {
virtual ~Function() = default;
};
} // namespace jit
} // namespace torch
} // namespace torch::jit

View File

@ -143,10 +143,10 @@ struct Argument {
inferred_type_hint);
}
Argument cloneWithType(TypePtr new_type) const {
Argument cloneWithType(const TypePtr& new_type) const {
return Argument(
name_,
std::move(new_type),
new_type,
N_,
default_value_,
kwarg_only_,


@ -1,9 +1,4 @@
#pragma once
#include <vector>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <algorithm>
#include <c10/macros/Macros.h>

Some files were not shown because too many files have changed in this diff.