Compare commits

...

1153 Commits

Author SHA1 Message Date
758b86c1c8 [dynamo] Remove some files from dynamo_expected_failures
Some tests in `test/dynamo` are marked as "expected failure" when testing
with `PYTORCH_TEST_WITH_DYNAMO=1`, i.e., we added files named after those
tests to the `dynamo_expected_failures` folder.

However, a lot of those dynamo tests seem to be passing with
`PYTORCH_TEST_WITH_DYNAMO=1`, so this patch removes them from
`dynamo_expected_failures`.
2024-10-25 13:26:41 -07:00
de54246c42 Recommend pip install -r requirements in the unit testing guidelines. (#137797)
Somehow `make setup-env`, as recommended in CONTRIBUTING.md, does not install all the dependencies required to run tests.

This change makes it slightly clearer what is needed when running tests.

Specific repro on my side was
```
git checkout e7679663070e3149ae7cd6e28d376d86852ce9e4
make setup-env
conda activate pytorch-deps
python test/test_utils_internal.py
```

which is what my reading of the instructions implies should be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137797
Approved by: https://github.com/albanD
2024-10-25 18:47:44 +00:00
03f9136870 Add wait counter on cuda::device_synchronize (#138883)
The wait counter typically has only minute-level precision, but if there is a collective in the queue it will show up. We think this explains up to eight minutes of delay in some compile traces we're looking at, but the counter would definitively prove it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64944970](https://our.internmc.facebook.com/intern/diff/D64944970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138883
Approved by: https://github.com/eqy
2024-10-25 18:13:57 +00:00
dbbdfd9df5 Add pytorch.wait_counter.dynamo_compile (#138072)
I was discussing with James March how the current fx_codegen_and_compile
counter doesn't actually capture all compile time.  This one is more
accurate and corresponds closely to the existing events in dynamo_compile
table.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138072
Approved by: https://github.com/markkm
2024-10-25 18:12:34 +00:00
77587f43d2 Add one more shard for CPU pull jobs (#138894)
The first shard is close to 3.5 hours and timing out flakily in trunk now, for example https://github.com/pytorch/pytorch/actions/runs/11509141659/job/32039126506.  So, I think we could just add one more shard in the same spirit as https://github.com/pytorch/pytorch/pull/137433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138894
Approved by: https://github.com/Skylion007
2024-10-25 18:09:50 +00:00
ba6526814a Add dtype attribute to CSEVariable (#136778)
Summary:
- This diff introduces a `dtype` attribute to `TritonCSEVariable` and a dtype propagation helper function to infer the dtype of each op's output from its inputs (a rough sketch of the idea is shown after this list).

- There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.
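
As a rough, hypothetical illustration of what per-op dtype propagation means (this is not the actual inductor helper; the function name and op grouping below are made up for the example):

```python
import torch

# Hypothetical sketch of per-op dtype inference: arithmetic ops follow
# torch's type-promotion rules, comparisons always produce bool.
def infer_output_dtype(op: str, *input_dtypes: torch.dtype) -> torch.dtype:
    if op in ("eq", "ne", "lt", "le", "gt", "ge"):
        return torch.bool
    out = input_dtypes[0]
    for dt in input_dtypes[1:]:
        out = torch.promote_types(out, dt)
    return out

assert infer_output_dtype("add", torch.float16, torch.float32) == torch.float32
assert infer_output_dtype("lt", torch.int64, torch.float32) == torch.bool
```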

Test Plan: CI

Differential Revision: D61815079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2024-10-25 18:00:30 +00:00
d0640b945b [inductor][nit] removing unnecessary else statements (#138789)
Summary: while reading through inductor template code I found a few places where else statements were driving me crazy. Fixing them as I read

Test Plan: CI

Differential Revision: D64882385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138789
Approved by: https://github.com/aakhundov
2024-10-25 17:59:25 +00:00
69af467d4f Eliminate c10::value_or_else (#138818)
Test Plan: Sandcastle

Differential Revision: D64857418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138818
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-10-25 17:59:01 +00:00
a6287b5c27 Fixing issue in move pass for copying Parameter (#138855)
Summary: Fixing a bug in Parameter copying during the move pass of the exported graph.

Test Plan:
UT

runs on APS models.

Differential Revision: D64876951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138855
Approved by: https://github.com/pianpwk

Co-authored-by: Gagan Jain <gaganj@meta.com>
2024-10-25 17:57:27 +00:00
375d71cc5a plumb is_export flag to FunctionalTensorMode in analysis pass (#138836)
Summary: there is an issue with functionalization V2 in export. This is a quick fix that plumbs `is_export` through to `run_functionalized_fw_and_collect_metadata`.

Test Plan: CI

Differential Revision: D64915263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138836
Approved by: https://github.com/tugsbayasgalan
2024-10-25 17:56:14 +00:00
3d0aa6f049 Update readme with std::optional (#138914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138914
Approved by: https://github.com/malfet
2024-10-25 17:40:58 +00:00
6f66398ab8 Revert "[aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731)"
This reverts commit 245026af2d2f26c74993cb90e01bddbd627c6797.

Reverted https://github.com/pytorch/pytorch/pull/138731 on behalf of https://github.com/jeanschmidt due to introduced regressions on linux-focal-cuda12.1-py3.10-gcc9-bazel-test ([comment](https://github.com/pytorch/pytorch/pull/138731#issuecomment-2438417669))
2024-10-25 17:37:32 +00:00
447bb72822 Revert "[c10d][CI] Improve world size setting in some tests (#138846)"
This reverts commit 9c35e33d9b02e384f0d504f942a916e9e849b163.

Reverted https://github.com/pytorch/pytorch/pull/138846 on behalf of https://github.com/jeanschmidt due to introduced breaks in linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138846#issuecomment-2438415315))
2024-10-25 17:35:27 +00:00
2980aed65b [inductor][memory] restructuring memory.py and turn on the flag (#137205)
Addressing additional comments given in PR https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137205
Approved by: https://github.com/eellison
2024-10-25 17:19:34 +00:00
817b4988e4 [dynamo][config-cleanup] Remove enable_cpp_guard_manager=False codepath (#138512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138512
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-25 16:41:55 +00:00
fe18a221eb Add debug backend that applies CrossRefFakeMode, use in compiler bisector (#138651)
I was debugging an internal NE divergence for a while that ended up being because of a bad meta. I added a config option and an explicit backend `aot_eager_decomp_partition_crossref` to enable the FakeCrossRefMode when running the graph. I added an explicit backend because I suspect it will be useful for internal models, but I'm also happy to leave it as a config option.

It will only test ops that have a meta implementation, to avoid the memory overhead of hitting the fallback path and running in eager.
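
A minimal usage sketch, assuming the backend is selected by the string name given above (the function and tensor here are just placeholders):

```python
import torch

def f(x):
    return torch.nn.functional.gelu(x) * x

# Backend name taken from the description above; intended for debugging
# suspected bad meta kernels rather than for performance.
compiled = torch.compile(f, backend="aot_eager_decomp_partition_crossref")
compiled(torch.randn(8))
```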

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138651
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-10-25 15:58:36 +00:00
6cadf616ae [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two "cuda" strings will pickle differently depending on whether they are the same object or not.
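
A small standalone demonstration of the underlying pickle behavior (not PyTorch-specific): equal strings pickle to different byte streams depending on whether they are the same object, because the pickler memoizes by object identity.

```python
import pickle

a = "cuda"
b = "".join(["cu", "da"])         # equal value, but a distinct (non-interned) object

same_obj = pickle.dumps([a, a])   # second element becomes a memo reference
diff_obj = pickle.dumps([a, b])   # both strings are serialized in full

print(a == b, a is b)             # True False
print(same_obj == diff_obj)       # False, so hashes of the pickles differ too
```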

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-25 15:52:58 +00:00
78a0158540 [Dynamo] Improve args in higher_order_ops [1/N] (#138799)
Replaced hard-coded argument indices with meaningful variable names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138799
Approved by: https://github.com/zou3519
2024-10-25 13:55:41 +00:00
45b8155a07 [CI] Run periodic jobs only on pytorch/pytorch repo (#138874)
GitHub by default tries not to run periodic jobs on forks, see https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/disabling-and-enabling-a-workflow
But there is a special test repo called `pytorch/canary` that would keep running those workflows for the next 60 days, which is a waste of resources.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138874
Approved by: https://github.com/huydhn
2024-10-25 13:42:37 +00:00
245026af2d [aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138731
Approved by: https://github.com/bdhirsh
2024-10-25 12:35:52 +00:00
2c82f73647 [Pipelining] Clean up hooks in zero bubble (#138720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138720
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504, #138735
2024-10-25 12:06:54 +00:00
12755f45ff [Pipelining] small comments and variable renames (#138735)
Addressing the comments in previous PRs to update the variable names and add additional code comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138735
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504
2024-10-25 12:06:54 +00:00
9c35e33d9b [c10d][CI] Improve world size setting in some tests (#138846)
Following change in #137161 , bumping world size for some test suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138846
Approved by: https://github.com/fduwjj
2024-10-25 10:40:21 +00:00
a1175e3437 [BE] Strides are always non-negative, remove pointless test (#138784)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138784
Approved by: https://github.com/Chillee
2024-10-25 10:39:32 +00:00
22d2e2d9a0 Set RUNPATH so installed tests can find the required shared libraries (#136627)
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.

For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
2024-10-25 09:38:08 +00:00
86d4b7d60b [FX][export][dynamo] use tuple instead of list in normalized args_spec (#138212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138212
Approved by: https://github.com/jansel
2024-10-25 06:43:55 +00:00
ce631939f0 [Distributed] [18/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138692)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138692
Approved by: https://github.com/ezyang
2024-10-25 05:32:38 +00:00
b999daf7a9 Add sets to list of safe objects to de-serialize (#138866)
Lists, dicts and tuples are already allowed; it's a bit weird to exclude sets from the list of basic containers.

Test plan (in addition to unittest):
```python
torch.save({1, 2, 3}, "foo.pt")
torch.load("foo.pt", weights_only=True)
```

Fixes https://github.com/pytorch/pytorch/issues/138851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138866
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-25 05:23:08 +00:00
907f001a68 Bump onnx from 1.16.1 to 1.17.0 in /.ci/docker (#138719)
Bumps [onnx](https://github.com/onnx/onnx) from 1.16.1 to 1.17.0.
Release notes (sourced from [onnx's releases](https://github.com/onnx/onnx/releases)):

v1.17.0

ONNX v1.17.0 is now available with exciting new features! We would like to thank everyone who contributed to this release!
Please visit [onnx.ai](https://onnx.ai/) to learn more about ONNX and associated projects.

Key Updates

ai.onnx Opset 22
- Update to support bfloat16: Acos, Acosh, Asin, Asinh, Atan, Atanh, AveragePool, Bernoulli, Conv, ConvTranspose, Cos, Cosh, DeformConv, Det, Dropout, Elu, EyeLike, GRU, GlobalAveragePool, GlobalLpPool, GlobalMaxPool, GridSample, HardSigmoid, HardSwish, InstanceNormalization, LSTM, LpNormalization, LpPool, MaxPool, MaxRoiPool, MaxUnpool, Mish, Multinomial, NegativeLogLikelihoodLoss, RNN, RandomNormal, RandomNormalLike, RandomUniform, RandomUniformLike, RoiAlign, Round, Selu, Sin, Sinh, Softplus, Softsign, Tan, ThresholdedRelu

Python Changes
- Support for numpy >= 2.0

Bug fixes and infrastructure improvements
- Fix Check URLs errors (onnx/onnx#5972)
- Use CMAKE_PREFIX_PATH in finding libprotobuf (onnx/onnx#5975)
- Bump main VERSION_NUMBER to 1.17.0 (onnx/onnx#5968)
- Fix source and pip tar.gz builds on s390x systems (onnx/onnx#5984)
- Fix unique_name (onnx/onnx#5992)
- Fix SegFault bug in shape inference (onnx/onnx#5990)
- Fix onnx.compose when connecting subgraphs (onnx/onnx#5991)
- Fix conversion from split 11 to split 18 (onnx/onnx#6020)
- Update error messages for NegativeLogLikelihoodLoss inference function (onnx/onnx#6021)
- Generalize input/output number check in shape inference (onnx/onnx#6005)
- Replace rank inference with shape inference for Einsum op (onnx/onnx#6010)
- Build from source instruction with latest cmake change (onnx/onnx#6038)
- Handle OneHot's depth value during shape inference (onnx/onnx#5963)
- Not to install cmake in pyproject.toml on Windows (onnx/onnx#6045)
- Fix a skipped shape infer code (onnx/onnx#6049)
- Include the ".onnxtext" extension in supported serialization format (onnx/onnx#6051)
- Allow ReferenceEvaluator to return intermediate results (onnx/onnx#6066)
- Fix 1 typo in numpy_helper.py (onnx/onnx#6041)
- Remove benchmarking code (onnx/onnx#6076)
- Prevent crash on import after GCC 8 builds (onnx/onnx#6048)
- Check graph outputs are defined (onnx/onnx#6083)
- Enable additional ruff rules (onnx/onnx#6032)
- Add missing shape inference check for DequantizeLinear (onnx/onnx#6080)
- Add bfloat16 to all relevant ops (onnx/onnx#6099)
- fix(ci): install python dependencies with --only-binary :all: in manylinux (onnx/onnx#6120)
- fix: install google-re2 with --only-binary option (onnx/onnx#6129)
- Specify axis parameter for DequantizeLinear when input rank is 1 (onnx/onnx#6095)
- Pin onnxruntime to 1.17.3 for release CIs (onnx/onnx#6143)
- Fix INT4 TensorProto byte size is 5x larger than expected with negative values (onnx/onnx#6161)
- Mitigate tarball directory traversal risks (onnx/onnx#6164)
- Fix reference implementation for ScatterND with 4D tensors (onnx/onnx#6174)
- Addition of group > 1 in test and in backend for ConvTranspose (onnx/onnx#6175)
- Support for bfloat16 for binary, unary operators in reference implementation (onnx/onnx#6166)
- Refactor windows workflow to work on standard windows (onnx/onnx#6190)
- Fix a few crashes while running shape inference (onnx/onnx#6195)
- Update onnx to work with numpy>=2.0 (onnx/onnx#6196)
- Use sets to improve performance of dfs search (onnx/onnx#6213)

... (truncated)
Commits
- b8baa84 Set version 1.17.0 for official release (onnx/onnx#6405)
- 6d77b80 [Cherry-Pick] Fix main url checks (onnx/onnx#6312) (onnx/onnx#6327)
- 174938d [Cherry-Pick] Fix protobuf pkg 5.28.0 failing on Windows (onnx/onnx#6342) (onnx/onnx#6347)
- f18d593 [Cherry-Pick] Remove unused variables (onnx/onnx#6303) (onnx/onnx#6324)
- c588905 Set version in rel-1.17.0 to 1.17.0rc1 (onnx/onnx#6317)
- 4392c2c Prepare for rel-1.17.0 (onnx/onnx#6281)
- cb54169 Update ort filter to 1.20.0 to skip tests known to fail with ort 1.19.0 (onnx/onnx#6306)
- 99e1fd3 Bump reviewdog/action-misspell from 1.21.0 to 1.23.0 (onnx/onnx#6268)
- 1920565 Bump ossf/scorecard-action from 2.3.3 to 2.4.0 (onnx/onnx#6273)
- 2e8f228 Bump mypy from 1.10.1 to 1.11.1 (onnx/onnx#6275)
- Additional commits viewable in the [compare view](https://github.com/onnx/onnx/compare/v1.16.1...v1.17.0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138719
Approved by: https://github.com/ezyang

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-25 03:53:25 +00:00
94e341c6a3 [user triton] fix codegen for tl.constexpr globals (#138757)
Fixes #138509

tl.constexpr globals would be codegen-ed as `constexpr()` instead of `tl.constexpr()` if they were un-annotated. This fixes the issue (and adds a test). The correct handling was already added, but the corrected string was not being used in the un-annotated branch.
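
A minimal sketch of the pattern being fixed: an un-annotated module-level `tl.constexpr` captured by a user-defined Triton kernel that is then wrapped with torch.compile (requires a GPU and triton installed; the kernel itself is illustrative only):

```python
import torch
import triton
import triton.language as tl

BLOCK = tl.constexpr(128)  # un-annotated global constexpr, the case from #138509

@triton.jit
def double_kernel(x_ptr, out_ptr, n_elements):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    tl.store(out_ptr + offsets, tl.load(x_ptr + offsets, mask=mask) * 2, mask=mask)

def double(x):
    out = torch.empty_like(x)
    n = x.numel()
    double_kernel[(triton.cdiv(n, 128),)](x, out, n)
    return out

# Inductor codegens the call to this kernel; the fix ensures the captured global
# is emitted as tl.constexpr(...) rather than constexpr(...).
compiled = torch.compile(double, fullgraph=True)
print(compiled(torch.randn(1024, device="cuda")))
```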

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138757
Approved by: https://github.com/oulgen
2024-10-25 03:00:42 +00:00
36c6ad71ba [tlparse] Add dynamo_graph_break_reason logging to trace_structured (#138778)
A common challenge during torch.compile enablement is answering the user's question: "where is the graph break?". This PR makes that easier to answer by surfacing graph breaks and their corresponding user stack trace / compiler stack trace in a direct link, e.g. `0_0_0/dynamo_graph_break_reason_0.txt`, from the tlparse index.html.

![image](https://github.com/user-attachments/assets/79cd43f5-af14-4d08-9d5b-cb47d8203851)

![image](https://github.com/user-attachments/assets/23233ee2-0d56-4526-bf9a-d22c337c4d18)
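
For context, a toy snippet that produces a graph break; running it with structured tracing enabled (e.g. `TORCH_TRACE=<dir>`) and feeding the trace directory to tlparse should surface the new artifact described above (the artifact filename is taken from the PR description):

```python
import torch
import torch._dynamo

def fn(x):
    x = x + 1
    torch._dynamo.graph_break()   # deliberate graph break for illustration
    return x * 2

torch.compile(fn)(torch.randn(4))
```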

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138778
Approved by: https://github.com/ezyang
2024-10-25 02:00:04 +00:00
9425c0767d Fix free symbol handling in FlexAttention (#138794)
Fixes https://github.com/pytorch/pytorch/issues/136196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138794
Approved by: https://github.com/Skylion007
ghstack dependencies: #138733
2024-10-25 01:20:42 +00:00
f737e3fe2f [inductor] Fix ReinterpretView call in TMADescriptor IR (#138759)
As a result of #137768, the `ReinterpretView` call in `TMADescriptor` has
become invalid, which breaks some TMA tests in
test_triton_kernels.py. In this PR, we fix this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138759
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-10-25 00:45:44 +00:00
ed9169df98 Removed the typing information for already deleted ProcessGroupCudaP2P (#138753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138753
Approved by: https://github.com/weifengpy
2024-10-25 00:32:07 +00:00
2f4af0f4e6 [Profiler] Disable Dynamo-Sensitive Profiler Tests (#138762)
Summary: During compilation, the profiler context gets ignored, so we should temporarily turn off tests that are failing due to dynamo. Once profiler integration with dynamo is introduced, we can reintroduce these tests.

Test Plan: Make sure CI is passing again

Differential Revision: D64867447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138762
Approved by: https://github.com/davidberard98
2024-10-25 00:25:49 +00:00
1d98a526dd preserve signatures with multiple calls + buffer mutations (#138669)
As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures).

This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead.

Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition.
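
A minimal sketch of the scenario this enables (multiple calls to a submodule that mutates a buffer, with its call signature preserved through unflattening); the module names and shapes are illustrative:

```python
import torch
from torch.export import export, unflatten

class Sub(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("count", torch.zeros(1))

    def forward(self, x):
        self.count.add_(1)          # buffer mutation
        return x + self.count

class Top(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.sub = Sub()

    def forward(self, x):
        return self.sub(x) + self.sub(x)   # the submodule is called twice

ep = export(Top(), (torch.randn(2),), preserve_module_call_signature=("sub",))
unflat = unflatten(ep)
print(unflat(torch.randn(2)))
```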

Differential Revision: D64806175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669
Approved by: https://github.com/tugsbayasgalan
2024-10-25 00:13:25 +00:00
4c91481656 [c10d] allow sub group to be eagerly inited even if default one is not (#138665)
Summary:
Currently, eager mode is applied either to all PGs or NONE of them.
There are cases where we don't want to initialize the comms for default
PG, but we still want to initialize the comms for sub PG. Now with a
device_id passed to new group, we can achieve this case
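
A rough usage sketch based on the summary above; the `device_id` argument to `new_group` is an assumption drawn from this description, and the snippet needs a multi-rank launcher (e.g. torchrun):

```python
import torch
import torch.distributed as dist

# Default PG initialized lazily: no device_id, so no eager comm init here.
dist.init_process_group("nccl")

rank = dist.get_rank()
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

# Sub-PG initialized eagerly by passing a device_id (assumption per this PR).
subgroup = dist.new_group(ranks=[0, 1], device_id=device)
```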
Test Plan:
newly added UT

Resolves https://github.com/pytorch/pytorch/issues/137018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665
Approved by: https://github.com/kwen2501
ghstack dependencies: #138781
2024-10-24 23:51:28 +00:00
277b32c930 fix unflatten training ir test suffix (#138840)
Test Plan: none

Differential Revision: D64917214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138840
Approved by: https://github.com/zhxchen17
2024-10-24 23:42:54 +00:00
425ce2a7ee [c10d] use a promise to delay watchdog shutdown (#138828)
Summary:
We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread before it has time to write out the Flight Recorder logs.
This change:
1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes.
2. Uses a promise between the watchdog thread and the heartbeat monitor thread to delay for at most one minute, giving Flight Recorder time to write out its log on timeout (sketched below).
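
The pattern in point 2, sketched in Python for illustration only (the real implementation is C++ inside ProcessGroupNCCL; the one-minute bound is the value quoted above):

```python
import threading
import time

dump_done = threading.Event()          # stands in for the promise

def heartbeat_monitor():
    # ... detect the timeout and write the flight recorder dump ...
    time.sleep(2)                      # placeholder for dump-writing work
    dump_done.set()                    # fulfil the promise

threading.Thread(target=heartbeat_monitor, daemon=True).start()

# Watchdog side: wait at most one minute for the dump before shutting down.
finished_in_time = dump_done.wait(timeout=60)
print("dump finished in time:", finished_in_time)
```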

Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives heartbeat thread time to write out the logs.

In the logs we see:
```
[trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish.
```

Differential Revision: D64857928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138828
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
2024-10-24 23:42:29 +00:00
751987eed1 [pt2] improve error logs for torch.cond and aoti package (#138647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138647
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2024-10-24 23:38:07 +00:00
3e4ba18eb5 [aoti] fix typo in codegen_dynamic_scalar (#138760)
Summary: appears to be a typo

Test Plan: ci

Differential Revision: D64867271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138760
Approved by: https://github.com/ezyang
2024-10-24 23:16:30 +00:00
09848c892a [aot_compile] propagate ShapeEnv during lowering (#138362)
We found that in the `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This PR plumbs through the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule's example values, so the original ShapeEnv is just reused.
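
A minimal sketch of the lowering path being described (export with a dynamic dimension, then AOTInductor compilation), assuming the `torch._inductor.aot_compile(ep.module(), args)` entry point; the module and shapes are placeholders:

```python
import torch
import torch._inductor
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin() + x.cos()

x = torch.randn(8)
ep = export(M(), (x,), dynamic_shapes={"x": {0: Dim("batch")}})

# The ShapeEnv attached to the exported graph's example values should be
# reused during this lowering rather than recreated (the point of this PR).
so_path = torch._inductor.aot_compile(ep.module(), (x,))
print(so_path)
```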

Differential Revision: D64613290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
2024-10-24 22:22:14 +00:00
51f6b946ae [torchbind] Add generic __deepcopy__ method (#137613)
Summary: Added a generic `__deepcopy__` method which will use the torchbind object's existing `__getattr__` and `__setattr__` to copy the torchbind object. This will later be used in [D64124825](https://www.internalfb.com/diff/D64124825)

Differential Revision: D64124826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137613
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-10-24 22:14:55 +00:00
282e6383c1 Add inductor cache metrics (#138603)
Each inductor event should have exactly one cache outcome: hit, miss, bypass, etc. Add it to the inductor compile event.

Add triton_compile as a compiler phase with `dynamo_timed`. This way, we get PT2 Compile Event Logs for triton as well.

Here's what triton events look like:  {F1941513932}
And this on a cache hit (since we still redo this work):
 {F1941514350}

Inductor cache info:
 {F1941528530}

Differential Revision: [D64703392](https://our.internmc.facebook.com/intern/diff/D64703392/)

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138603
Approved by: https://github.com/oulgen
2024-10-24 22:09:34 +00:00
e78a3e260b [export] Add serdes_non_strict to tests (#138662)
Summary: We expand the tests to cover serdes_non_strict. Currently failing tests are skipped.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _serdes_non_strict
```

Differential Revision: D64709285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138662
Approved by: https://github.com/avikchaudhuri
2024-10-24 21:35:32 +00:00
500b2bc781 Have as_tensor always return a float64 tensor in dynamo (#138598)
As discussed with @ezyang, this set of diffs extracts fixes to problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan being the global CI. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138598
Approved by: https://github.com/ezyang
ghstack dependencies: #138595
2024-10-24 20:50:28 +00:00
5b50b0a9bc remove dead code (#138690)
Fixes issue-138673: [issue](https://github.com/pytorch/pytorch/issues/138673)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138690
Approved by: https://github.com/Aidyn-A, https://github.com/colesbury
2024-10-24 20:29:24 +00:00
10a34dcd57 [PyTorch] Fix out-of-bounds array access in atomic_add_vec (#138744)
There is no guarantee that `len` here is enough for a full vector. This was causing at least one test failure on https://github.com/pytorch/pytorch/pull/137426.

Differential Revision: [D64857786](https://our.internmc.facebook.com/intern/diff/D64857786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138744
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716
2024-10-24 19:37:12 +00:00
0af7632c10 [PyTorch] Fix ASAN failures for vec_test_all_types Cast test (#138716)
The size of the destination array was too small.

Differential Revision: [D64843491](https://our.internmc.facebook.com/intern/diff/D64843491/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138716
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655
2024-10-24 19:37:12 +00:00
cbafe1e7f3 [PyTorch] Unbreak VectorizedN fmadd/fmsub/clamp (#138655)
These are ternary ops, not binary ops.

Differential Revision: [D64794253](https://our.internmc.facebook.com/intern/diff/D64794253/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138655
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542
2024-10-24 19:37:02 +00:00
ead5738ff2 [PyTorch] Fix inductor bug with unrolled vectorized prod (#138542)
This issue is one of two inductor bugs blocking land of #137426. Turned out to be simple

Differential Revision: [D64734116](https://our.internmc.facebook.com/intern/diff/D64734116/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138542
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-10-24 19:36:51 +00:00
6aa673377b [PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where (#138486)
In this case, it looks like we expect the body to be a VecMask (unify_mask_base_type is called by where()), but we didn't make it a VecMask. Now we do.

Differential Revision: [D64702918](https://our.internmc.facebook.com/intern/diff/D64702918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138486
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
2024-10-24 19:36:41 +00:00
239a21f37e [Inductor] don't set XBLOCK larger than xnumel (#138730)
When an fp8 dtype is involved, Inductor may set min_elem_per_thread to a positive value. This forces increasing XBLOCK even for a small xnumel (e.g. 1). Inductor will report an error later when sanity-checking the triton config.

The simple fix here is to just not let XBLOCK be larger than xnumel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138730
Approved by: https://github.com/Chillee
ghstack dependencies: #136782
2024-10-24 18:31:10 +00:00
e7f1e306df Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763)"
This reverts commit 362ca54f03f9bb72ba7633ed580fb788b1a8dea9.

Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/wdvr due to this change is breaking our prod training pipeline (verified with bisect) by increasing memory consumption 4x and causing OOM ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2435962833))
2024-10-24 17:46:09 +00:00
8197e4c70d Revert "[sparse] add search for optimal alg_id to torch.compile (#137427)"
This reverts commit 39bfba3f561e3125ce035de0bf90c8c7bcccd3ce.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/jcaip due to this PR breaks AO tests ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2435906592))
2024-10-24 17:27:06 +00:00
5ea6777861 [subclass] Unwrap_tensor_subclasses micro optimization (#138498)
`unwrap_tensor_subclasses` -> `get_plain_tensors` is used at runtime. For small models this overhead is non-negligible in comparison with the small compiled kernel.

1. Remove asserts from the runtime path.
2. Avoid list creation by using an optional output list to append arguments to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138498
Approved by: https://github.com/bdhirsh
2024-10-24 16:54:54 +00:00
fe458eef80 [c10d] fix a logic of using ncclCommSplit (#138781)
Summary:
Currently, whether split should be used depends on the size of the subgroup.
It's possible that the default PG is not eagerly initialized yet, but split is still
called.

This PR fixes this issue by removing split's dependency on the subgroup size.
Test Plan:
Modified UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138781
Approved by: https://github.com/kwen2501
2024-10-24 16:16:35 +00:00
b021486405 Enable Windows Arm64 (#133088)
This PR enables PyTorch for Windows on Arm64 - CPU only.
Currently, there aren't any checks in place to build and test for Windows on Arm64, but we're working to implement those as soon as possible.
We recommend using [Arm Performance Libraries (APL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries) as a BLAS option, which is introduced in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133088
Approved by: https://github.com/malfet

Co-authored-by: cristian panaite <panaite.cristian2000@gmail.com>
Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2024-10-24 16:10:44 +00:00
eqy
f7bb11dcc2 [cuDNN][cuDNN Frontend] Check in test for previously broken dBias check (#138725)
see https://github.com/pytorch/pytorch/issues/137347, let's try to land before https://github.com/pytorch/pytorch/pull/138709

CC @malfet @drisspg @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138725
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-10-24 15:33:58 +00:00
8f62832189 c10::nullopt -> std::nullopt (#138701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138701
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-10-24 15:03:32 +00:00
7e62ac51a1 [pt2] [testing] Skip inductor_freezing - test_cpp_wrapper_cuda internally (#138366)
Summary: It's been failing CI since probably forever; skip for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138366
Approved by: https://github.com/eellison
2024-10-24 14:40:13 +00:00
5c88a9f6c0 Assume that indices are non-negative in _unsafe_masked_index (#137315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137315
Approved by: https://github.com/eellison
2024-10-24 12:39:31 +00:00
0d9fb51028 Fix lru_cache where config is used (#134235)
Ensure that any use of functools.lru_cache does not prevent config from being changed after the function has already run.
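
A self-contained illustration of the pitfall (generic Python, not the actual inductor code): an `lru_cache`-ed function that reads config bakes in the value from its first call, so a cached variant needs the config value as part of its arguments instead.

```python
import functools

CONFIG = {"threshold": 4}

@functools.lru_cache(None)
def padded_size_stale(n):
    # Reads config inside a cached function: the first call's value sticks.
    return n + CONFIG["threshold"]

@functools.lru_cache(None)
def padded_size(n, threshold):
    # Config value is part of the cache key, so changes are respected.
    return n + threshold

print(padded_size_stale(10))                   # 14
CONFIG["threshold"] = 8
print(padded_size_stale(10))                   # still 14: stale cached result
print(padded_size(10, CONFIG["threshold"]))    # 18
```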

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134235
Approved by: https://github.com/masnesral
2024-10-24 10:43:34 +00:00
e7d4de0e59 Eliminate C10_TYPENAME_CONSTEXPR (#138702)
Test Plan: Sandcastle

Differential Revision: D64833560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138702
Approved by: https://github.com/malfet
2024-10-24 10:21:01 +00:00
0efa590d43 [CI] Fix XPU CI failure (#138548)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.

# Solution
1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170
2. The UT in `test/inductor/test_pattern_matcher.py` introduced by https://github.com/pytorch/pytorch/pull/138089 will be skipped due to the unsupported feature `max_autotune_gemm_backends:Triton`.
3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
5. CUDA-biased code was introduced by https://github.com/pytorch/pytorch/issues/138472; we just generalize it to `GPU_TYPE`.

# Additional Context
> Why update torch-xpu-ops commit pin here?

We have to update the commit pin to avoid the build failure caused by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364).

> What does the torch-xpu-ops update include?

1. Adds some foreach ops, like unary ops and `foreach_clamp_max`, etc.
2. Adds forward and backward for some pooling ops, like `avg_pool3d` and `max_pool3d`.
3. Adds some other ops, like `log_normal_`, `index_copy`, and `mode`, etc.
4. Fixes a build failure related to `C10_UNUSED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-24 07:56:26 +00:00
dbf0fa811a Remove C10_HOST_CONSTEXPR_EXCEPT_WIN_CUDA and CONSTEXPR_EXCEPT_WIN_CUDA (#138479)
BC linter suppressed due to removal of `tools/linter/adapters/constexpr_linter.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138479
Approved by: https://github.com/eqy, https://github.com/malfet
2024-10-24 07:51:05 +00:00
96b30dcb25 [Windows][cpu] mkl use mimalloc as allocator on Windows (#138419)
We did a lot of optimization for PyTorch on Windows and made good progress, but some models still have a performance gap between PyTorch on Windows and PyTorch on Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that `ResNet50` is a typical case.

Let's focus on `ResNet50` and collect the profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```
We found that the major kernel consuming CPU resources is `aten::mkldnn_convolution`. It was dispatched to `MKLDNN`.
Actually, we had already optimized memory allocation by integrating mimalloc into the PyTorch c10 module. That helps PyTorch on Windows a lot, but it does not cover `MKL` and `MKLDNN`'s intermediary temporary memory.
We still have the potential to improve PyTorch Windows performance by optimizing `MKL` and `MKLDNN`'s intermediary temporary memory.

So I discussed with the Intel MKL team and got a method to register a high-performance memory allocation API with MKL, which helps MKL boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add cmake option `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in C10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON`.

For `oneDNN`, it is still tracking in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-24 05:29:47 +00:00
a94c501b84 Fixed max-autotune in FlexAttention to reset kernel options appropriately (#138733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138733
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng
2024-10-24 05:18:09 +00:00
cyy
2bcfbf2505 [Distributed] [17/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138465)
Follows  #137404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138465
Approved by: https://github.com/ezyang
2024-10-24 04:58:49 +00:00
cyy
53e356a1c0 [2/N] Enable cppcoreguidelines-special-member-functions (#138670)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138670
Approved by: https://github.com/sraikund16
2024-10-24 04:35:18 +00:00
cfdf658a91 [dynamo][modules] Support overridden __call__ on nn modules (#138619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138619
Approved by: https://github.com/williamwen42
ghstack dependencies: #138657
2024-10-24 03:49:26 +00:00
b1acd0978e [dynamo] Support range_iterator as a function input (#138657)
Fixes https://github.com/pytorch/pytorch/issues/138654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138657
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-24 03:49:26 +00:00
e5c3d7ab77 [ROCm] Improve performance of reductions on 1D and 2D tensors. (#137737)
This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types:
- 1D tensors
- 2D tensors when reducing along dimension 0
- 2D tensors when reducing along dimension 1

Runtime reduction between 0 and 75% depending on tensor shape.

The patch uses the maximum number of threads per CU and the number of CUs itself to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137737
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell
2024-10-24 03:41:16 +00:00
d8f22a1141 [c10d] Reorder GIL checker and c++ stack trace print with comments (#138734)
We found one case where a GIL deadlock happens and then FR times out. I am wondering if we can do the GIL check before the C++ stack trace print, which can otherwise lead to a hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138734
Approved by: https://github.com/c-p-i-o
2024-10-24 02:21:37 +00:00
0b9320b7c5 fx_graph_cache: Remove custom amd JK (#137501)
This split in JKs was never actually used (we just set both JKs to the same values, except when we accidentally didn't due to being humans who make mistakes). This simplifies the overall JK structure and, eventually, will let us delete the duplicate JK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137501
Approved by: https://github.com/oulgen
2024-10-24 01:30:39 +00:00
32a3dbc645 [Pipelining] Free memory usage earlier in last stage (#138504)
This fix is similar to the one in #138119, except this is an edge case for the last stage. For the last stage we perform backward on the `loss`, which we detached in the previous PR. However, we also keep the `stage_outputs` alive because we return all the output chunks in `merge_output_chunks()` after the step is over. This also keeps the autograd graph alive, so detaching these tensors frees the memory earlier.

pre-fix:
<img width="1780" alt="image" src="https://github.com/user-attachments/assets/bb78bde7-fd5c-4eba-bfc9-f0359e20bbab">

post-fix:
<img width="1788" alt="image" src="https://github.com/user-attachments/assets/a26102d9-9db2-4fc8-946c-336b8430657c">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138504
Approved by: https://github.com/wconstab
ghstack dependencies: #138119
2024-10-24 00:44:03 +00:00
8945309c08 [Pipelining] fix extra memory usage in zero bubble (#138119)
Full debugging details in here: https://docs.google.com/document/d/1Pe_E0KWAfsJ6MCvKZ5aR28rTXX-rYLg13XxwXd6AALw/edit?usp=sharing

In zero bubble, we have two methods, `stage_backward_input` and `stage_backward_weight`. During `stage_backward_input` we compute the gradients of the stage outputs with respect to the inputs and also retain the autograd graph (unlike 1F1B, where `retain_graph=False`). The output / loss was still being retained across the next schedule step() because we return the loss to the user and use the output in the next step. To allow autograd to free the variables in the graph, we need to detach the output/loss once we no longer need it for autograd.
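
A toy, non-pipelined illustration of the memory effect described above (detaching drops the reference to the autograd graph while keeping the tensor values):

```python
import torch

model = torch.nn.Linear(1024, 1024)
x = torch.randn(64, 1024)

out = model(x)
loss = out.sum()
loss.backward(retain_graph=True)   # zero bubble keeps the graph alive on purpose

# Holding `out`/`loss` across the next step keeps the whole autograd graph
# (and its saved activations) alive.  Detaching keeps the values but lets
# the graph be freed once it is no longer needed for autograd.
out = out.detach()
loss = loss.detach()
```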

Pre-fix:
<img width="1021" alt="image" src="https://github.com/user-attachments/assets/6c8bf469-32b1-4dac-85ff-b97991f9f0e3">

Post-fix:
<img width="1039" alt="image" src="https://github.com/user-attachments/assets/a1875038-e80b-4dd4-84f2-38727d7792dc">

without AC (7B model on titan):
10% memory improvement

with AC (7B model on titan):
50% memory improvement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138119
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-10-24 00:44:03 +00:00
889717aabd [CI/CD] Disable split build (#138752)
See https://github.com/pytorch/pytorch/issues/138750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138752
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-10-23 22:38:30 +00:00
1b31248933 [EZ] Fix typo in test_mps.py (#138738)
s/emedding_weight/embedding_weight/

Stolen from 074766d9b4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738
Approved by: https://github.com/atalman
2024-10-23 22:15:35 +00:00
c92459488b Fix test on windows (#138641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138641
Approved by: https://github.com/huydhn
2024-10-23 21:53:32 +00:00
dd4dd85210 [hierarchical-compilation][inductor] Support invoke_subgraph HOP (#138031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138031
Approved by: https://github.com/eellison
ghstack dependencies: #137538, #138036, #137965
2024-10-23 21:32:14 +00:00
7622ede3cd Add dump_launch_params config in triton/inductor (#137143)
Summary: Moves the checking of TORCHINDUCTOR_DUMP_LAUNCH_PARAMS into the config module to pull it out of the critical path.

Test Plan: Existing unit tests cover this env variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137143
Approved by: https://github.com/eellison
2024-10-23 21:20:46 +00:00
9eadd7434e Refactor: Move _nested_int_aware_sort top level (#138693)
I need to use it from some other places later in the PR stack

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138693
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-23 21:15:05 +00:00
9b77d3109b [export] fix test_unbacked_bindings_for_divisible_u_symint (#138607)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138607
Approved by: https://github.com/angelayi
2024-10-23 21:10:05 +00:00
dbd6ada8c3 Clean up a c10::optional and fix documentation (#138700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138700
Approved by: https://github.com/Skylion007
2024-10-23 20:42:28 +00:00
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
cd9c6e9408 Do not run CI on forks (#138714)
Add `if: github.repository_owner == 'pytorch'` for some jobs that were missing it

Fixes #138564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138714
Approved by: https://github.com/huydhn, https://github.com/kit1980
2024-10-23 18:23:05 +00:00
ed313a5ca2 Introduce torch.sym_add, variadic add (#138660)
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
main change is
```
 if min is None and max is None:
        torch._check_is_size(size)
        return
```

Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
2024-10-23 17:42:41 +00:00
72ea7ba89f Generate slice.Tensor view operations instead of as_strided when split is used in the original program. (#137225)
test_recompile asserts that the changes do not add more recompilations, by comparing with the eager backend.
The reason for this change is that slice can be lowered in a more efficient way.
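
For illustration, a small sketch (not from the PR) of the kind of program this affects: `torch.split` produces views, which can now be represented with `slice.Tensor` ops rather than `as_strided`.

```python
import torch

def f(x):
    a, b, c = torch.split(x, 2, dim=0)  # each piece is a view of x
    return a + b + c

compiled = torch.compile(f)
compiled(torch.randn(6, 4))
```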

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137225
Approved by: https://github.com/zou3519
2024-10-23 17:42:16 +00:00
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
c272526ea5 [SJD] [RFC] force setting last progress time (#138615)
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but the watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling the watchdog via SJD config bypasses the TorchElasticWatchdog initialization.

The workaround is to update the healthcheck time when calling `get_last_progress_time`

Test Plan:

Logs show that the progress time value is being changed despite client not being set

Behavior when watchdog is enabled with SJD config is left unchanged

Differential Revision: D64733766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
2024-10-23 15:29:00 +00:00
cdfe1bffd1 Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527)"
This reverts commit 8fbf866904661b16cba4c799af81121557ba9da8.

Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/jeanschmidt due to Seems to have introduce regressions on main, pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.g4dn.12xlarge.nvidia.gpu) checking if revert will do ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2432479338))
2024-10-23 14:49:49 +00:00
2f007e5de5 Make trace log dir persist through multiple set_logs() calls (#137793)
Summary: Currently, calling `torch._logging.set_logs()` resets the log directory, leading to multiple tlparse outputs. This change prevents the dir from resetting after the first call.

Reviewed By: ezyang

Differential Revision: D64118047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793
Approved by: https://github.com/ezyang
2024-10-23 14:23:03 +00:00
ecf2240243 [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-10-23 14:13:49 +00:00
75c6787a16 [CI] Introduces experiment awsa100 to inductor-perf-compare.yml workflow using _runner-determinator.yml (#138204)
Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.

It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.

Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate via experimentation from `linux.gcp.a100`. After successfully experimenting with those instances, we will remove those labels, update the workflows to use `linux.aws.a100`, and decommission the GCP fleet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-10-23 13:47:26 +00:00
04103f6ae9 Eliminate c10 string_utils (#138499)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138499
Approved by: https://github.com/swolchok
2024-10-23 13:40:19 +00:00
c2d26418c3 [Quant][Inductor] expand quantization conv-binary(-unary) pattern fusion inside inductor (#138051)
### Summary
Expand quantization conv-binary(-unary) pattern fusion inside inductor to support the following two patterns:
Pattern 1:
```
    Conv(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```
Pattern 2:
```
    extra input   Conv(X)
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```
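
In eager terms, the two fused patterns correspond to something like the following sketch (module and tensor shapes are illustrative only):

```python
import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 16, 16)
extra = torch.randn(1, 8, 16, 16)

y1 = F.relu(conv(x) + extra)   # pattern 1: Conv output on the left of Add
y2 = F.relu(extra + conv(x))   # pattern 2: Conv output on the right of Add
```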

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138051
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-10-23 13:12:17 +00:00
2f1842fa83 [CD] fix xpu support packages version (#138189)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138189
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/atalman
2024-10-23 12:25:43 +00:00
8fbf866904 [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can apply its preferred logic. The torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (that is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective and we will block there waiting for the comm to be ready, so it has the same effect as blocking init -- no "opening" compared to eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
2024-10-23 08:51:54 +00:00
2d7e586c13 Fixed dead lock in execution trace (#136892)
Summary:
This diff fixes a deadlock issue in execution trace. ExecutionTraceObserver acquires a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/output values are evaluated, which can trigger taking the Python GIL in some use cases. In this case, the lock order is: ET lock -> GIL.

One of the ads applications takes the GIL first, then calls all-gather to collect some metrics from all ranks. When ET is on, the all-gather is captured by the ET observer. In this case, the lock order is: GIL -> ET lock.

That is why the deadlock happens. To fix it, I changed the scope of the ET lock so that the input/output evaluation is no longer inside it.
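
A schematic of the inverted lock ordering described above, as a plain-Python sketch (the lock names are stand-ins, not the real objects):

```python
import threading

gil = threading.Lock()      # stand-in for the Python GIL
et_lock = threading.Lock()  # stand-in for the ExecutionTraceObserver lock

def observer_callback_before_fix():
    with et_lock:
        with gil:           # evaluating inputs/outputs needs the GIL
            pass            # order: ET lock -> GIL

def ads_metric_collection():
    with gil:
        with et_lock:       # the all-gather is captured by the ET observer
            pass            # order: GIL -> ET lock

def observer_callback_after_fix():
    with et_lock:
        pass                # bookkeeping only under the ET lock
    with gil:
        pass                # input/output evaluation moved outside the ET lock
```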

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda

Differential Revision: D63556608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892
Approved by: https://github.com/aaronenyeshi
2024-10-23 07:53:56 +00:00
cab5f54dee [ONNX] Fix sequence handling in graph building (#138656)
Prior to this PR, op.Concat was called without its required attribute `axis`, and `val` and `arg` seemed wrongly coded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138656
Approved by: https://github.com/justinchuby
2024-10-23 07:47:58 +00:00
5402677021 add CUDA 12.6 to conda docker image (#138417)
Adds cuda 12.6 to common installation script.
Adds cuda 12.6 to conda docker image build matrix.

fixes #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138417
Approved by: https://github.com/cyyever, https://github.com/atalman
2024-10-23 07:30:51 +00:00
5ceef8c470 Add support for SymFloats in split_module fx pass (#138599)
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
2024-10-23 06:56:13 +00:00
96c86758e2 Support conditionals on sym node variables in the __bool__ and __len__ case (#138595)
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the primary test plan as the global CI. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595
Approved by: https://github.com/ezyang
2024-10-23 06:44:09 +00:00
72dde6e84b [ONNX] Avoid optimize onnx_dynamo-fallback (#138265)
Prior to this PR, when a model failed to export, we fell back to the legacy TorchScript exporter. However, we didn't stop there: even when the model was exported with the TorchScript exporter, an optimization was applied to the graph.

Ideally the optimization could also boost the performance of models exported with the legacy TorchScript exporter, but currently, for benchmarking purposes and for the fallback guarantee to users, we should keep it simple and only return the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265
Approved by: https://github.com/xadupre, https://github.com/justinchuby
2024-10-23 04:13:32 +00:00
bb65c9b883 [PyTorch] Classify Unsupported mutated Dynamic Shapes as User Error (#137054)
Summary: We don't need an assert for unsupported dynamic shape inputs; this removes the assert and raises a user exception instead.

Differential Revision: D63661569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137054
Approved by: https://github.com/bdhirsh
2024-10-23 03:15:37 +00:00
cyy
fbd14315f9 Update ruff to 0.7.0 (#138597)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138597
Approved by: https://github.com/ezyang
2024-10-23 03:00:30 +00:00
06b5330674 [easy] Log subproc pool creation (#138642)
Summary: Request from internal to log subproc pool creation

Test Plan:
```
$ TORCH_LOGS=+torch._inductor.async_compile python ~/add.py
I1022 14:12:41.915000 444394 torch/_inductor/async_compile.py:165] Creating subprocess pool with 32 workers
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138642
Approved by: https://github.com/eellison
2024-10-23 02:41:42 +00:00
cyy
86cca3fb05 [1/N] Don't skip ASAN on some tests (#138571)
Clang15's ASAN is new enough so that it's possible to re-evaluate the disabled ASAN  tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138571
Approved by: https://github.com/ezyang
2024-10-23 02:38:45 +00:00
d437df342b [tests] fix broken tests caused by AotEagerAndRecordGraphs typo (#138492)
Summary:
Name change happened in https://github.com/pytorch/pytorch/pull/138231

AttributeError: module 'torch._dynamo.testing' has no attribute 'AOTEagerAndRecordGraphs'. Did you mean: 'AotEagerAndRecordGraphs'?

Test Plan: ci

Differential Revision: D64704686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138492
Approved by: https://github.com/aakhundov
2024-10-23 02:25:21 +00:00
fee2f331ce Update torchbench.txt (#138569)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138569
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-10-23 01:42:25 +00:00
f2ebf6d94a [PGNCCL] Ensure comm is ready before all accesses (#138384)
Previously we only waited for the comm to become ready after its initialization.
That's not enough. There are other NCCL APIs that can cause the comm to be InProgress, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we now ensure the comm is ready every "next time" we need to access ncclComm.
The place to add such a gatekeeper is `getNcclComm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
2024-10-23 01:36:58 +00:00
37149d032c Fix .to(cpu) for Storage (#138011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138011
Approved by: https://github.com/albanD
2024-10-23 01:31:48 +00:00
555bddbef7 [AOTI][refactor] Move use_minimal_arrayref_interface logic (#138250)
Summary: Move the use_minimal_arrayref_interface-specific logic from CppWrapperCpu to CppWrapperCpuArrayRef. This is a copy-on-write style refactor, to simplify the default AOTI generated code.

Differential Revision: [D64598715](https://our.internmc.facebook.com/intern/diff/D64598715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138250
Approved by: https://github.com/chenyang78
ghstack dependencies: #138544, #138379
2024-10-23 01:00:34 +00:00
2cee5a39ad [AOTI] Fix check_model_with_multiple_inputs in test_aot_inductor (#138379)
Summary: Add missing use_minimal_arrayref_interface setting to check_model_with_multiple_inputs.

Differential Revision: [D64635211](https://our.internmc.facebook.com/intern/diff/D64635211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138379
Approved by: https://github.com/hl475
ghstack dependencies: #138544
2024-10-23 00:54:29 +00:00
d428d81c7f Remove some pre-cpp17 stuff (#138410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138410
Approved by: https://github.com/Skylion007
2024-10-23 00:38:03 +00:00
f4b3813989 Wrap autograd and autocast ops in training IR (#138516)
Differential Revision: [D64732361](https://our.internmc.facebook.com/intern/diff/D64732361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138516
Approved by: https://github.com/yushangdi
ghstack dependencies: #138261
2024-10-23 00:37:54 +00:00
9f7b987087 Revert "[Inductor] New Triton Attrs Descriptor Fixups (#138390)"
This reverts commit 215999452eb5517213b3a31f72eb9a7e843d12a0.

Reverted https://github.com/pytorch/pytorch/pull/138390 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it still has another lint error ([comment](https://github.com/pytorch/pytorch/pull/138390#issuecomment-2430566004))
2024-10-23 00:37:28 +00:00
69f18587d6 Move test_serialize to training IR (#138261)
Differential Revision: [D64572253](https://our.internmc.facebook.com/intern/diff/D64572253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138261
Approved by: https://github.com/yushangdi
2024-10-23 00:32:32 +00:00
662d07e93e Remove parallel_and and parallel_or (#138135)
Not used, suggested by @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138135
Approved by: https://github.com/ezyang
2024-10-23 00:22:22 +00:00
cyy
38d3c27849 [1/N] Enable cppcoreguidelines-special-member-functions (#137405)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137405
Approved by: https://github.com/ezyang
2024-10-23 00:16:53 +00:00
7e951c1675 [EZ][DTensor] Update DTensor readme to use the new import path (#138625)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138625
Approved by: https://github.com/XilunWu
2024-10-23 00:08:36 +00:00
3441ea7d74 [dynamo] reset compiler stance after test (#138277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138277
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-23 00:07:33 +00:00
a825667670 [executorch hash update] update the pinned executorch hash (#135287)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135287
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-22 23:40:57 +00:00
5942b29850 Disabling amp context when invoking compiler (#138624)
Fix for https://github.com/pytorch/pytorch/issues/133974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138624
Approved by: https://github.com/bdhirsh, https://github.com/drisspg
2024-10-22 23:21:55 +00:00
215999452e [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel
2024-10-22 23:16:05 +00:00
10f16cc7da Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit 8aacbee8e0d6c03096f2ce94b70e2a8fab17ee81.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/wdvr due to this one has failing internal tests, not related to a landrace with #138398 - reverting this one ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2430460176))
2024-10-22 22:53:56 +00:00
39bfba3f56 [sparse] add search for optimal alg_id to torch.compile (#137427)
Summary:

This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`

Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">

* `torch._cslt_sparse_mm_search` has been modified to return optimal
  split-k parameters as well as max alg_id.

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
2024-10-22 22:39:42 +00:00
b4cfb9c014 [EZ] Use at::detail nested namespace in Dispatch.h (#138633)
Instead of `namespace at { namespace detail {`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138633
Approved by: https://github.com/Skylion007
2024-10-22 22:13:21 +00:00
54fbd897d9 [AOTI][refactor] Clean up test_aot_inductor skip list (#138544)
Summary: Remove skips for already fixed tests. Change remaining skip to xfail so that the failure list can be more proactively maintained.

Differential Revision: [D64761257](https://our.internmc.facebook.com/intern/diff/D64761257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138544
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-22 21:32:49 +00:00
a16476b671 Add support for adding extra metadata to chromium events, log to separate columns (#138477)
This diff does a few things:

## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`.
Metadata can only be added to chromium events that have started, but not ended (so, in progress events)
- When you add the data, it is appended to the event's metadata when you call log_event_end().
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.

## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:

```
__start__
-> dynamo (dynamo compile metrics)
-> entire_frame_compile (compile.inner)
-> backend_compile (i.e. aotdispatch)
-> create_aot_dispatch_function
-> inductor_compile
-> ...
```

BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.

*FAQ: Why can't we use `entire_frame_compile` as the event?*
This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.

## Log metadata as separate columns
(Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.

Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477
Approved by: https://github.com/aorenste
2024-10-22 21:17:44 +00:00
3b2b5486ea Fixes issue with torch._dynamo.assume_constant_result with global functions (#132431)
This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten.
Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable derived from the name of the function.  This causes that function to be overwritten in the global scope.  This PR checks that the name is unique in the global scope as well, avoiding the issue of overriding the function.
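
A minimal sketch of the usage pattern this PR fixes (illustrative function names; previously the cached constant could clobber the global function with the same derived name):

```python
import torch

@torch._dynamo.assume_constant_result
def get_scale():
    return 2.0

@torch.compile
def fn(x):
    return x * get_scale()

fn(torch.randn(4))
# After this fix, `get_scale` in the global scope is still the function,
# not an overwritten cached constant.
```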

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431
Approved by: https://github.com/jansel
2024-10-22 21:14:26 +00:00
e3af290165 [export] Add retraceability_non_strict to tests (#138380)
Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability
```

Differential Revision: D64611532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380
Approved by: https://github.com/angelayi
2024-10-22 21:05:51 +00:00
d1be61ce4e Update copyrights to 2024 (#138638)
Spiritual successor of https://github.com/pytorch/pytorch/pull/119413 + CPP docs copyright update as well
Fixes https://github.com/pytorch/pytorch/issues/138630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138638
Approved by: https://github.com/atalman
2024-10-22 21:00:58 +00:00
dbd0a39c79 Bump webrick from 1.7.0 to 1.8.2 in /ios/TestApp (#136593)
Bumps [webrick](https://github.com/ruby/webrick) from 1.7.0 to 1.8.2.
- [Release notes](https://github.com/ruby/webrick/releases)
- [Commits](https://github.com/ruby/webrick/compare/v1.7.0...v1.8.2)

---
updated-dependencies:
- dependency-name: webrick
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-22 13:32:50 -07:00
f089d5ffef Improve input validation for NJT pointwise ops (#138602)
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.

I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
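
A small sketch of the behaviors described above, using a jagged-layout NJT (shapes are illustrative):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 4), torch.randn(5, 4)], layout=torch.jagged
)

nt * 2            # fine: dispatches via the mul.Tensor overload
# nt * nt.size(1) # the ragged size is a nested int (SymInt); before this PR it
#                 # was wrongly treated as a tensor, now pointwise ops reject
#                 # nested ints in their argument list.
```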
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
2024-10-22 20:13:12 +00:00
cyy
1c77b13c06 [6/N] Fix extra warnings brought by clang-tidy-17 (#138572)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138572
Approved by: https://github.com/Skylion007
2024-10-22 19:46:38 +00:00
a71723bf12 [ONNX] Add complex constant support (#138279)
Transform complex python constant to float representation as well, like what we have with tensors.

PS: I find it's not reasonable to add "complex->float" in IR side, so I put it here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138279
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-22 19:42:59 +00:00
c7a20939b4 Remove unused enforce_cond_guards_match Dynamo feature flag. (#138589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138589
Approved by: https://github.com/clee2000
2024-10-22 19:36:01 +00:00
078dca1ce8 Aarch64 binary builds - fix passing env_file to Docker (#138588)
Aarch64 builds skipped the logic of sourcing the binary env file. As a result, the PYTORCH_EXTRA_INSTALL_REQUIREMENTS passed to Aarch64 builds did not include the triton dependency constraint. This PR makes sure Aarch64 builds follow the same path as our regular manywheel builds.

To work around this issue we had to inject triton into aarch64 builds for release 2.5, which is not ideal: https://github.com/pytorch/builder/pull/2011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138588
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2024-10-22 19:04:19 +00:00
eqy
c0e8458aab [Flex Attention] Don't compute fill order to compute stride order just to get fill order back (#138376)
Was a bit confusing to read when working on #138354

"computer-assisted proof"
```
import random
from typing import List

def argsort(seq):
    # preserve original order for equal strides
    getter = seq.__getitem__
    a_r = range(len(seq))
    return list(reversed(sorted(a_r, key=getter, reverse=True)))  # noqa: C413

def stride_order2fill_order(order):
    """
    Convert stride order to fill order
    For channel last format,

    stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
    """
    lookup = {pos: idx for idx, pos in enumerate(order)}
    fill_order = [lookup[i] for i in range(len(order))]
    return fill_order

def get_stride_order(seq):
    """
    Convert strides to stride order
    """
    sorted_idx: List[int] = argsort(seq)
    out = [0 for _ in range(len(seq))]
    a = sorted_idx.copy()
    for i, elem in enumerate(sorted_idx):
        out[elem] = i
    fillorder = stride_order2fill_order(out)
    assert fillorder == sorted_idx
    return out

for _ in range(1000):
    a = [0, 1, 2, 3]
    random.shuffle(a)
    get_stride_order(a)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376
Approved by: https://github.com/drisspg
2024-10-22 18:40:39 +00:00
2dab4ccb65 [Inductor][ROCm][CK] add CK grouped conv2d fwd kernels to ROCm codegen (#137947)
Plug into lowering and end to end test in a later PR

Instance parsing companion PR https://github.com/ROCm/composable_kernel/pull/1585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137947
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-10-22 18:25:23 +00:00
6e4c19289c [EZ] [BE] Remove (now) unused scale config (#138511)
Final step of moving scale config files to test-infra repo.  Details in https://github.com/pytorch/test-infra/pull/5767

Scale configs are now read from test-infra.  This PR is just cleaning up stale files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138511
Approved by: https://github.com/clee2000
2024-10-22 18:08:42 +00:00
f7e36d8d6f Fix for MSVC problem on Windows Arm64 (#136765)
This PR proposes a workaround for an internal issue introduced in MSVC 14.37 for Windows Arm64 target. It is still an ongoing problem.
The fix will be released with the future versions of Visual Studio 2022 but until then the changes to cpu/vec/vec_base.h should be sufficient.
We also opened a new ticket on Visual Studio Developer Community, it can be found here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136765
Approved by: https://github.com/malfet

Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-10-22 18:07:58 +00:00
fc9093c3d2 Revert "Remove C10_DEPRECATED (#138406)"
This reverts commit 70ec86d7542d461ff6f01ba1a1c9a4f38637af8e.

Reverted https://github.com/pytorch/pytorch/pull/138406 on behalf of https://github.com/wdvr due to failing internal tests - see D64714374 ([comment](https://github.com/pytorch/pytorch/pull/138406#issuecomment-2429912896))
2024-10-22 18:00:41 +00:00
cc93c1e5e4 Upload artifacts during test run (#125799)
Zip and upload artifacts while run_test is running
Upgrade boto3 because I get errors about not having `botocore.vendored.six.move` if I don't
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125799
Approved by: https://github.com/huydhn
2024-10-22 16:48:57 +00:00
2e48788a35 [hierarchical-compilation][invoke_subgraph] Use tracing context to cache artifacts of dispatch keys (#137965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137965
Approved by: https://github.com/zou3519
ghstack dependencies: #137538, #138036
2024-10-22 15:33:42 +00:00
e045e8f0df [hierarchical-compilation][invoke_subgraph] Graph break on input mutation or aliasing (#138036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138036
Approved by: https://github.com/zou3519
ghstack dependencies: #137538
2024-10-22 15:33:42 +00:00
4dd4d38ca9 [hierarchical-compilation][hop] Introduce invoke_subgraph (#137538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137538
Approved by: https://github.com/zou3519
2024-10-22 15:33:34 +00:00
046f02d2de [ROCm] index_put performance improvement (#138259)
On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel.  None of the existing unit tests were exercising the large index case so a new unit test was added.
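
A sketch of the large-index case the new unit test exercises (sizes here are arbitrary; assumes a ROCm/CUDA device):

```python
import torch

dst = torch.zeros(1_000_000, device="cuda")
idx = torch.randint(0, dst.numel(), (5_000_000,), device="cuda")
src = torch.randn(idx.numel(), device="cuda")
dst.index_put_((idx,), src, accumulate=True)
```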

It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy
2024-10-22 15:21:43 +00:00
2827befe61 [AOTI][reland] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138541)
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.

Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. Also, a code check is added to regex-match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't go unnoticed.

Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei
2024-10-22 14:17:27 +00:00
bb8bc7d6b3 config: simplify most of the config handling and fix some bugs (#138377)
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.

Cleanups

- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.

For release notes, Not sure exactly how to communicate this, but something like
"ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
2024-10-22 13:40:26 +00:00
1b61313acd Add type stub for SymInt.rsub (#138543)
Fixes https://github.com/pytorch/pytorch/issues/138478

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138543
Approved by: https://github.com/malfet
2024-10-22 13:27:32 +00:00
8c840fb921 Add out_dtype kw argument to optimize_bsr_dense_addmm (#136626)
As in the title.

Addresses the task in https://github.com/pytorch/ao/pull/821#issuecomment-2373290266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136626
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-10-22 09:52:25 +00:00
5a13282c75 [compiled autograd] tls access helpers (#138061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138061
Approved by: https://github.com/yf225
ghstack dependencies: #137953, #137821
2024-10-22 08:03:52 +00:00
49fa437097 [compiled autograd] Compiled autograd configs in TLS (#137821)
The multithreaded case doesn't work yet; this adds Python-side TLS only for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-22 08:03:52 +00:00
75259145ec [compiled autograd] directly use python Logger class in cpp (#137953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953
Approved by: https://github.com/jansel, https://github.com/yf225
2024-10-22 08:03:52 +00:00
60c1433041 [aoti] Cond symint input support (#138373)
If the input is a symint, we don't want to add the aoti_torch_assign_tensors_out

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138373
Approved by: https://github.com/larryliu0820, https://github.com/desertfire
2024-10-22 07:53:22 +00:00
51045e6251 make DimHints compatible with Dims (#138490)
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`. This PR stops that and uses `Dim()` objects to guide DimHints.

The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.

Current behavior is that:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x - y

inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
    "x": (Dim.AUTO, Dim("d1", min=3)),
    "y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```

The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.

Differential Revision: D64636101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
2024-10-22 07:43:48 +00:00
9a9a0abc28 [SDPA-CUDNN] Make CuDNN Attention Opt in (#138522)
# Summary
Currently we have a `cudnn_order` that says on H100 w/ new enough CuDNN backend (we ship a 9.1 version in OSS) try to run CuDNN attention first. We have already encountered a few bugs with the release of 2.5:

1. https://github.com/pytorch/pytorch/issues/138529
2. https://github.com/huggingface/diffusers/issues/9704
3. https://github.com/pytorch/pytorch/pull/138354

In light of the above we are going to make the CuDNN backend Opt-in by default.

This can be done easily with the context manager for choosing backends I.e.:
``` Python
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

```

This PR puts the CuDNN backend as the lowest precedence in the backend list, meaning that the Math backend will always be chosen unless disabled (which is done via the context manager).

Cc @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138522
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet
2024-10-22 07:23:06 +00:00
2b4af6fa74 Mark torch.get_device as overridable at the python level (#132706)
Summary:
- add a value to `get_testing_overrides` function for `torch.get_device()`
- remove `torch.get_device()` from the `get_ignored_functions` list

Test Plan:
Existing override testing infra, which should pick up the updates to these two variables.

Closes the loop on:
https://github.com/pytorch/pytorch/pull/132567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132706
Approved by: https://github.com/ezyang
2024-10-22 07:20:42 +00:00
84e5f34fd1 bug in unbacked_bindings for a*u0 (#138136)
Summary: we were storing a*u0 instead of u0 in unbacked_bindings / unbacked_var_to_val

Test Plan: -

Differential Revision: D64508936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138136
Approved by: https://github.com/ezyang
2024-10-22 07:04:30 +00:00
a80b87353c [pt2] Log is_forward field to dynamo_compile scuba table (#138505)
Differential Revision: [D64711721](https://our.internmc.facebook.com/intern/diff/D64711721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138505
Approved by: https://github.com/oulgen
2024-10-22 05:50:49 +00:00
0b4a071a1d [CP] Implement AllGather based context parallelism (#132820)
Summary:

This implementation does not utilize the benefit that after allgather we can directly perform the SDPA without doing the ring-based SDPA, but we can overlap the communication with the first sharded kv computation. This implementation shows some performance benefit and memory saving compared to the original alltoall implementation in certain cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820
Approved by: https://github.com/XilunWu
2024-10-22 05:25:50 +00:00
6b29d40e9b [PGNCCL] Add default value for nccl_nonblocking_timeout (#138374)
- Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1).
- Reuse C10D_CHECK_TIMEOUT in other CHECK macros

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138374
Approved by: https://github.com/eqy
ghstack dependencies: #137855, #138488
2024-10-22 05:06:18 +00:00
03c72976a5 Properly uses ref-counting for torch.cuda.use_mem_pool (#133600)
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.

The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself now needs to hold an extra reference as well.

Part of https://github.com/pytorch/pytorch/issues/124807.
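
A minimal usage sketch of the API referenced above (assumes a CUDA build; error handling omitted):

```python
import torch

pool = torch.cuda.MemPool()               # the pool object holds one reference
with torch.cuda.use_mem_pool(pool):       # the context manager holds another
    x = torch.randn(1024, device="cuda")  # allocated from `pool`
# Leaving the context drops its reference; the pool's bookkeeping lives on
# until `pool` (and the tensors allocated from it) are released.
```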

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-10-22 03:21:53 +00:00
89067402d4 [easy] in ROCmTemplate set kwargs when creating Buffer (#138521)
Summary: https://github.com/pytorch/pytorch/pull/137768 makes Inductor IR kw only

Test Plan: CI

Differential Revision: D64723804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138521
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-22 03:13:16 +00:00
cyy
f881094366 Use Wmissing-prototypes on torch_cuda (#136080)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136080
Approved by: https://github.com/ezyang
2024-10-22 02:04:19 +00:00
9f7c26bef3 Fix training IR bug by changing passes order (#138292)
Inserting runtime assertions causes the gm to have different names, but the graph signature was populated earlier. To avoid this kind of error in the future, I refactored these steps into a helper function.

Differential Revision: [D64576251](https://our.internmc.facebook.com/intern/diff/D64576251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138292
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #138266
2024-10-22 01:24:14 +00:00
012ff2a0aa Don't try to load cufile (#138501)
Trying to load it caused a big issue with the 2.5.0 release - https://github.com/pytorch/pytorch/issues/138324

cufile is not actually used currently by default, see #133489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138501
Approved by: https://github.com/atalman, https://github.com/mikaylagawarecki, https://github.com/malfet
2024-10-22 01:13:27 +00:00
5adc33d3b8 Training IR should preserve custom metadata (#138266)
Differential Revision: [D64576252](https://our.internmc.facebook.com/intern/diff/D64576252)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138266
Approved by: https://github.com/yushangdi
2024-10-22 01:09:56 +00:00
0a38c0ec89 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes); if we ignore it, by default we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, getting 33% memory bandwidth savings.

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-22 00:50:00 +00:00
3b186c5659 Revert "[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303)"
This reverts commit 1417b2cd0562e0e4d4349024ef7c731b99214890.

Reverted https://github.com/pytorch/pytorch/pull/138303 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138303#issuecomment-2427991065))
2024-10-22 00:46:48 +00:00
d7e0e1dbc4 [DeviceMesh] Use split_group to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129)
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
2024-10-22 00:00:05 +00:00
a7f49de485 Fixes issue with enums in a tuple for dynamo (#133123)
Currently, when tuple values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside the tuple cannot be properly encoded: if an enum is contained inside a tuple, invalid Python code is generated.
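
A sketch of the failure mode (illustrative names; the enum's repr, e.g. "(<Mode.TRAIN: 0>,)", is not valid Python source, which is what this PR addresses):

```python
import enum
import torch

class Mode(enum.Enum):
    TRAIN = 0

@torch.compile
def fn(x, cfg):
    # cfg is a tuple holding an enum member
    return x + 1 if cfg[0] is Mode.TRAIN else x - 1

fn(torch.randn(2), (Mode.TRAIN,))
```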

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
2024-10-21 23:45:11 +00:00
e24871eb3c Add environment variable to force no weights_only load (#138225)
In preparation for the `weights_only` flip, this adds an escape hatch for users who don't have access to the `torch.load` call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138225
Approved by: https://github.com/albanD
2024-10-21 23:26:15 +00:00
ec4ce094b2 [Traceable FSDP2][CI] Skip more tests on rocm (#138497)
Some of the test checks don't work well with ROCm.

Fixes https://github.com/pytorch/pytorch/issues/138409.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138497
Approved by: https://github.com/fduwjj
2024-10-21 23:11:01 +00:00
77868697b7 [inductor][subgraph] Add size asserts (#138424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138424
Approved by: https://github.com/eellison
ghstack dependencies: #137555
2024-10-21 22:43:49 +00:00
853da168fc [AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix it here.

Test Plan: tbd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-21 21:45:13 +00:00
20a2d39557 Log all failing test repros to scuba (#138394)
This has the benefit that

1) It's much easier to aggregate test failure repros into, say, a CSV or shell script from scuba
2) We can do analysis (e.g. take the set difference of failing tests across two PRs)
3) We can get results faster at test-level granularity instead of the job-level granularity we see in the HUD/GH.

I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9

I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
2024-10-21 21:35:47 +00:00
ef52bbbf23 More appropriate socket errors and debug messages (#130347)
Fixes #128998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130347
Approved by: https://github.com/fduwjj
2024-10-21 21:28:40 +00:00
364340c7ee [Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR (#138488)
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
  508 |     int split_color{NCCL_SPLIT_NOCOLOR - 1};
      |                     ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
2024-10-21 21:14:20 +00:00
134f6cda7e Support record_stream() for NJT (#137099)
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete.
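
A usage sketch of the op on an NJT (assumes a CUDA device; shapes are illustrative):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)],
    layout=torch.jagged, device="cuda",
)

side = torch.cuda.Stream()
with torch.cuda.stream(side):
    out = nt * 2          # work on a side stream

# Mark all of nt's components as in use by `side` so the caching allocator
# doesn't hand their memory back before the side-stream work finishes.
nt.record_stream(side)
```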

This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
2024-10-21 21:10:42 +00:00
70ec86d754 Remove C10_DEPRECATED (#138406)
Looking in the code I see
```
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
```
But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_.

Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`.

Therefore, since we are finished deprecating old MSVCs we can deprecate `C10_DEPRECATED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-21 20:57:27 +00:00
bb2e090b7d [user triton] typing triton_kernel_wrap.py (#138230)
Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py and fix all the mypy errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-10-21 20:34:49 +00:00
60081c29ec Use cuda 12.4 pytorch_extra_install_requirements as default (#138458)
Since CUDA 12.4 binaries are now the default binaries on PyPI, pytorch_extra_install_requirements needs to use 12.4.
This would need to be cherry-picked to the release 2.5 branch to avoid injecting these versions into metadata during PyPI promotion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
2024-10-21 20:16:37 +00:00
c1ead6fba3 Bugfix for passing None args to user defined Triton kernel (#138472)
add test

fewer failing tests

more tests passing

tests passing

lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138472
Approved by: https://github.com/aakhundov
2024-10-21 20:00:04 +00:00
8ad191ae21 [dynamo] Replace __str__ with __repr__ in some places (#136316)
## The problem

In a typical debugger, `repr()` is used to display variables and not `str()`.

Several classes in Dynamo have a `__str__()` method that returns useful information and a  `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore.

`str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()](https://docs.python.org/3/library/functions.html#repr)".

## The solution

In the Python object model, if there is no `__str__` method, `__repr__`  is used instead (but not the other way around).

So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier.
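
A tiny illustration of that fallback rule (not from the PR itself):

```python
class OnlyRepr:
    def __repr__(self):
        return "OnlyRepr()"

class OnlyStr:
    def __str__(self):
        return "only str"

print(str(OnlyRepr()))   # "OnlyRepr()" -- __repr__ is used when __str__ is missing
print(repr(OnlyStr()))   # "<__main__.OnlyStr object at 0x...>" -- no fallback to __str__
```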

The specific classes changed were all in `torch._dynamo.variables`:

* `builtin.BuiltinVariable`
* `constant.ConstantVariable`
* `constant.EnumVariable`
* `functions.UserMethodVariable`
* `lazy.LazyVariableTracker`
* `lazy.LazySymNodeFormatString`
* `misc.GetAttrVariable`
* `misc.NullVariable`
* `user_defined.UserDefinedObjectVariable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136316
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
2024-10-21 19:50:38 +00:00
41f7d01ccf Increase Docker push timeout limit from 15 to 30m (#138487)
Some images now take more than 15 minutes to finish pushing and keep timing out, for example, https://github.com/pytorch/pytorch/actions/runs/11442231435/job/31832143440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138487
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/ZainRizvi
2024-10-21 19:44:52 +00:00
32d4582e02 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit 16caa8c1b3a02e47b5f52d3c2d40d7931cc427dc.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to checking if this will solve inductor errors ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2427565425))
2024-10-21 19:40:58 +00:00
ff2f751bfb [tools] fix nightly pull tool when the conda environment not exists (#138448)
Now, `conda env remove --name env` exits with errors if the given environment does not exist. This PR checks the existence of the environment before trying to remove it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138448
Approved by: https://github.com/ezyang
2024-10-21 19:35:48 +00:00
071f6f2de8 Revert "[ROCm] Fix ADDMM hipBLASLt regression (#138267)"
This reverts commit 14a3e12985e4550440a8a1755d3418e9b02b4950.

Reverted https://github.com/pytorch/pytorch/pull/138267 on behalf of https://github.com/jeffdaily due to this PR went to far when partially reverting #137604; the env var default should be the same on ROCm and CUDA ([comment](https://github.com/pytorch/pytorch/pull/138267#issuecomment-2427550465))
2024-10-21 19:33:13 +00:00
abbd71d29d [BE][Easy] enable PYFMT for torch.fx (#138443)
Reproduce command:

```bash
ghstack checkout https://github.com/pytorch/pytorch/pull/138443
git checkout HEAD~1 torch/
lintrunner -a --take "PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138443
Approved by: https://github.com/ezyang
2024-10-21 19:15:49 +00:00
8231180147 [dynamo][refactor] Refactor Wrap HOP to reuse it for invoke_subgraph (#137555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137555
Approved by: https://github.com/zou3519
2024-10-21 18:26:29 +00:00
c6609ece84 [ONNX] Remove deprecated export_to_pretty_string (#137790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
ghstack dependencies: #137789
2024-10-21 18:17:48 +00:00
07cc4bd3e2 typing compile_fx.py (#138033)
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
2024-10-21 18:14:59 +00:00
81738403a2 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
We chased it down to the destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we end up at:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, we destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-21 17:52:21 +00:00
6e38c87ad0 [ONNX] Remove ExportTypes (#137789)
Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward.

Differential Revision: [D64412947](https://our.internmc.facebook.com/intern/diff/D64412947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-21 17:50:28 +00:00
af0bc75460 Remove deprecated alias macro(1/3) (#137556)
**Detailed Descriptions:**
- Remove AT_ERROR Macro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137556
Approved by: https://github.com/ezyang
2024-10-21 17:32:32 +00:00
16caa8c1b3 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/
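
A small illustrative sketch of the difference (assuming `typing_extensions` is available; the function names are made up for the example):

```python
from typing import Union
from typing_extensions import TypeIs  # TypeGuard would only narrow the True branch

def is_str(x: object) -> TypeIs[str]:
    return isinstance(x, str)

def handle(x: Union[str, int]) -> None:
    if is_str(x):
        print(x.upper())   # type checkers narrow x to str here
    else:
        print(x + 1)       # with TypeIs, x is narrowed to int here as well

handle("hello")
handle(41)
```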

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-10-21 17:20:06 +00:00
9bb327bfc6 Revert "[AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)"
This reverts commit a8b912f39d36bd2e6d204808d866439d0075f1a5.

Reverted https://github.com/pytorch/pytorch/pull/137785 on behalf of https://github.com/ezyang due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/137785#issuecomment-2427295668))
2024-10-21 17:18:56 +00:00
02dd3b8e32 [dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)
This method was no longer needed after #113725; the checking logic is
now in `SideEffects.check_allowed_side_effect`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #137905
2024-10-21 16:43:34 +00:00
1032ce6bd3 Only upload test/test-reports as artifacts (#138019)
Fixes https://github.com/pytorch/pytorch/issues/137851

This is possibly too restrictive, but I spot-checked and I don't think any of the files outside of test/test-reports are important. I can't rule out, though, that someone was putting something elsewhere and expecting it to still be zipped.

Outputs can be seen on HUD by clicking show artifacts
Some examples:
Logs
<img width="293" alt="image" src="https://github.com/user-attachments/assets/9a2db9b1-0f62-4209-909b-4f56a908619d">

XMLs
<img width="234" alt="image" src="https://github.com/user-attachments/assets/a639fe38-a112-4ea5-abba-ad1d5b25bb43">

JSONs
<img width="180" alt="image" src="https://github.com/user-attachments/assets/be7a49ac-5258-4bc5-981d-3f134ebd343d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138019
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2024-10-21 16:43:30 +00:00
0a4197490c Delay mul/pow expansion for _SympyT to enable more folding (#138235)
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```

Fixes #136044.
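
An illustrative sketch with plain sympy (not the actual ShapeEnv code) of why delaying expansion helps:

```python
import sympy

a, b = sympy.symbols("a b")

# Keeping the product form lets powers of the same base cancel automatically.
folded = (a + b) ** 2 / (a + b)
print(folded)  # a + b

# Expanding eagerly blocks that cancellation at construction time.
eager = sympy.expand((a + b) ** 2) / (a + b)
print(eager)   # (a**2 + 2*a*b + b**2)/(a + b)
```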

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
2024-10-21 16:38:47 +00:00
701ddf962a [inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.

This also adds metadata to register_replacement patterns, including pad_mm.

This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.

Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
2024-10-21 16:33:12 +00:00
279ddfc6ee Add type check for dilation in torch.quantized_max_pool3d() (#137845)
Fixes #136716

repro:

```python
import torch

input = torch.randn([1, 1, 1, 1, 1])
input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32)
torch.quantized_max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1)) # crash

input = torch.randn([1, 1, 1, 1, 1])
input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32)
result = torch.nn.functional.max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1))  # crash
```

result:

```
RuntimeError: Expected dilation >= 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137845
Approved by: https://github.com/albanD
2024-10-21 16:15:57 +00:00
a8b912f39d [AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)
Summary: same as title. The plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff, D63714905, which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix it here.

Test Plan: tbd

Differential Revision: D64246495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
2024-10-21 15:30:07 +00:00
cyy
7ec21a6f0f Enable clang-tidy on torch/csrc/api (#138437)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138437
Approved by: https://github.com/r-barnes
2024-10-21 14:22:38 +00:00
8aacbee8e0 Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #138323
2024-10-21 13:51:54 +00:00
649f8117ad Add deprecated warning for lazyInitXXX API (#138323)
Detailed Descriptions:
The APIs involved are as follows:
- ``lazyInitCUDA``
- ``lazyInitHIP``
- ``lazyInitXPU``
- ``lazyInitMTIA``
- ``lazyInitPrivateUse1``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138323
Approved by: https://github.com/malfet
2024-10-21 13:51:54 +00:00
1417b2cd05 [AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303)
Summary: The problem appeared after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole.

Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303
Approved by: https://github.com/hl475
2024-10-21 13:47:50 +00:00
8f3efb8797 Update slow tests (#133203)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weeekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133203
Approved by: https://github.com/pytorchbot
2024-10-21 12:00:52 +00:00
cyy
14fc6b70ea Remove torch/csrc/api/include/torch/linalg.h (#138435)
Only one place in OSS uses it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138435
Approved by: https://github.com/r-barnes
2024-10-21 07:04:27 +00:00
5f940a44af [AMD] Fix torch ck backend build with 6.2.1 (#138434)
Summary: It's complaining about missing __hip_bfloat162 definition w/o this header.

Differential Revision: D64673284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138434
Approved by: https://github.com/yaoyj11, https://github.com/houseroad
2024-10-21 06:38:38 +00:00
362ca54f03 [c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_work_registry`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_work_registry`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D64511994](https://our.internmc.facebook.com/intern/diff/D64511994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-21 06:02:57 +00:00
cyy
a170ff4167 Prepare to enable ASAN on CUDA (#138404)
See which tests fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138404
Approved by: https://github.com/ezyang
2024-10-21 03:55:29 +00:00
9ad2736627 Remove extraneous C++14 comment (#138408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138408
Approved by: https://github.com/Skylion007
2024-10-21 03:54:41 +00:00
6987bfb40a Revert "[dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)"
This reverts commit 3c7d9d6c7fa565e811675be7dd84e5ef7c8ba7a0.

Reverted https://github.com/pytorch/pytorch/pull/137906 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137906#issuecomment-2425505452))
2024-10-21 03:42:38 +00:00
fb0da32377 [DeviceMesh] Small refactor to optimize DeviceMesh subgroup creation (#138117)
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves their creation from the inner loop of the subgroup creation (over the subgroup ranks of each mesh dimension) to the outer loop (over each mesh dimension).

For example, given a 2 * 4 DeviceMesh, we were re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, i.e., 2 times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
2024-10-21 03:04:24 +00:00
cyy
a05b64a38f [5/N] Fix extra warnings brought by clang-tidy-17 (#138403)
Follows #137983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138403
Approved by: https://github.com/ezyang
2024-10-21 02:59:54 +00:00
cyy
82eb09aafd [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-21 02:58:59 +00:00
2d3455e7d9 [c10d] try fix the unstableness of test_get_future_result (#138415)
Summary:
It seems that, depending on the platform, either a NCCL error or a timeout is raised first on rank 0. Now we try to force the timeout path by not exiting the other ranks.
Test Plan:
Tests pass locally

Tags:

Fixes https://github.com/pytorch/pytorch/issues/138397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138415
Approved by: https://github.com/kwen2501
2024-10-21 01:17:30 +00:00
cyy
e7b8a9a4c1 [5/N] Fix clang-tidy warnings in torch/csrc/api/ (#138389)
Follows #138382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138389
Approved by: https://github.com/ezyang
2024-10-21 01:12:37 +00:00
e4ad02892f Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-20 23:48:54 +00:00
4f45a052ad Fix try_solve for s1*s2 == 0 when both symbols are unknown (#137919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137919
Approved by: https://github.com/ezyang
2024-10-20 23:33:08 +00:00
09cf163ae3 Fix for mixed_mm tests failures on SM70 and lower (#138183)
This PR fixes mixed_mm tests that are failing on SM70 and lower as discussed here https://github.com/pytorch/pytorch/pull/123762#issuecomment-2406601729.

The failure occurs because some of the mixed_mm tests expect triton code to be generated, but on SM70 and lower, the generation of triton code is skipped (see https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L693). These tests will now be skipped when running on SM70 and lower. I do not have access to an SM70 GPU, so I was not able to test these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138183
Approved by: https://github.com/ezyang
2024-10-20 21:14:31 +00:00
a1899b5a9e Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 239ad73cb1c8a91f0a2de21d27af3d98f5a8dddc.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/yf225 due to Sorry for reverting your PR but I believe this PR breaks the binary builds. Example: https://ossci-raw-job-status.s3.amazonaws.com/log/31790258895, with error message: `getenv is not a member of c10::utils`, might be easier to search for `not a member of` in the log ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2425192780))
2024-10-20 19:48:14 +00:00
a9f4f89cd5 [CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-20 19:38:18 +00:00
cyy
239ad73cb1 [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-20 13:05:04 +00:00
07fd61e106 [SDPA] Fix warning message (#138278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138278
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-10-20 08:00:56 +00:00
f568d48890 Enable git long paths checkout on Windows (#138411)
Checking out PyTorch on Windows started to fail after the ROCm change https://github.com/pytorch/pytorch/pull/131004, in which one of the submodule paths, `third_party/composable_kernel`, got too long: https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376

According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround.

### Testing

Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411
Approved by: https://github.com/wdvr
2024-10-20 07:18:44 +00:00
f8303740f7 Revert "Enable git long paths checkout on Windows (#138411)"
This reverts commit 12283035f8c08cd3487bfaac25ccef7da90952ba.

Reverted https://github.com/pytorch/pytorch/pull/138411 on behalf of https://github.com/huydhn due to Opps, I forgot Windows binary build, let me revert and reland this one ([comment](https://github.com/pytorch/pytorch/pull/138411#issuecomment-2424661640))
2024-10-20 06:50:48 +00:00
12283035f8 Enable git long paths checkout on Windows (#138411)
Checking out PyTorch on Windows started to fail after the ROCm change https://github.com/pytorch/pytorch/pull/131004, in which one of the submodule paths, `third_party/composable_kernel`, got too long: https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376

According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround.

### Testing

Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411
Approved by: https://github.com/wdvr
2024-10-20 06:32:34 +00:00
d1027c2be6 Revert "Update sympy version constraint to 1.13.3 (#138338)"
This reverts commit d8279ad9d162b5ce71699f462d3664c3745b14f5.

Reverted https://github.com/pytorch/pytorch/pull/138338 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think a bunch of inductor tests and test_dynamic_shapes are failing in trunk after this lands d8279ad9d1 ([comment](https://github.com/pytorch/pytorch/pull/138338#issuecomment-2424487225))
2024-10-20 03:19:02 +00:00
3f3b692a00 [ROCm] CK-based GEMM (#131004)
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()`
- reference CK gemm implementations for float, bfloat16, and half types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony

Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-10-20 02:57:43 +00:00
0a2407b93c [dynamo] Support omegaconf DictConfig (#138378)
Fixes https://github.com/pytorch/pytorch/issues/138224
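
A hypothetical usage sketch (requires `omegaconf` to be installed; the exact pattern from the linked issue may differ):

```python
import torch
from omegaconf import OmegaConf

cfg = OmegaConf.create({"scale": 2.0})  # an omegaconf DictConfig

@torch.compile
def scale_input(x):
    # attribute-style access on the DictConfig inside the compiled function
    return x * cfg.scale

print(scale_input(torch.ones(3)))
```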

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138378
Approved by: https://github.com/jansel
ghstack dependencies: #138359
2024-10-20 02:43:17 +00:00
f892543c1f [dynamo] Support TypedDict (#138359)
Seen in vLLM.

Fixes https://github.com/pytorch/pytorch/issues/132629
Fixes https://github.com/pytorch/pytorch/issues/133613
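
A hypothetical sketch of the kind of pattern this enables (not taken from vLLM or the linked issues):

```python
from typing import TypedDict
import torch

class Batch(TypedDict):
    x: torch.Tensor
    scale: float

@torch.compile
def scale_batch(x: torch.Tensor) -> torch.Tensor:
    batch = Batch(x=x, scale=2.0)  # constructing a TypedDict inside the traced function
    return batch["x"] * batch["scale"]

print(scale_batch(torch.ones(2)))
```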

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138359
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-20 02:43:17 +00:00
cyy
1f349eed61 [4/N] Fix extra warnings brought by clang-tidy-17 (#137983)
Follows #137552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137983
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-10-20 01:02:33 +00:00
b1b7c714ed Add deprecated C10_UNUSED and C10_NODISCARD macros back (#138398)
For backwards compatibility. Disallow internal use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138398
Approved by: https://github.com/malfet
2024-10-20 00:21:19 +00:00
d8279ad9d1 Update sympy version constraint to 1.13.3 (#138338)
`sympy` was pinned to version 1.13.1 due to test failures with version 1.13.2 on Windows and mac, as reported in https://github.com/pytorch/pytorch/pull/133235. Now that a newer version, 1.13.3, has been released, this PR aims to verify whether the test failure has been resolved and also allow building with newer versions for packaging purposes (e.g., https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277#discussion_r1806721862).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138338
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-20 00:20:02 +00:00
14a3e12985 [ROCm] Fix ADDMM hipBLASLt regression (#138267)
Fixes #138067

A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604

The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267
Approved by: https://github.com/malfet
2024-10-20 00:19:10 +00:00
47e80abc7a Revert "[inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)"
This reverts commit fb44658415e50b5be6a187ff3f14243c0fdf3daf.

Reverted https://github.com/pytorch/pytorch/pull/138089 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_original_aten_preserved_pad_mm test runs OOM in trunk fb44658415 ([comment](https://github.com/pytorch/pytorch/pull/138089#issuecomment-2424297269))
2024-10-19 23:55:01 +00:00
fcedf93d1e [Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)
After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR.

Fixes https://github.com/pytorch/pytorch/issues/138177.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187
Approved by: https://github.com/xmfan, https://github.com/awgu
2024-10-19 19:10:31 +00:00
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
fb44658415 [inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.

This also adds metadata to register_replacement patterns, including pad_mm.

This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.

Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
2024-10-19 16:37:08 +00:00
38ea487338 Re-raise in _run_sympy_handler to reduce log spew (#138356)
Fixes: https://github.com/pytorch/pytorch/issues/138069

I tested this by running `python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu` before and after the change and verifying no more log spew.

I'm uncertain whether it makes sense to add a test for this PR. Question for reviewers: is there a standard paradigm for testing these log-spew fixes? Happy to add a test if someone can point me in the right direction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138356
Approved by: https://github.com/ezyang
2024-10-19 16:02:45 +00:00
c0879d0c21 Fix lint
Regression caused by fddabc6e0b that was force merged
2024-10-19 08:33:41 -07:00
cyy
cdc9f14227 [4/N] Fix clang-tidy warnings in torch/csrc/api/ (#138382)
Follows #138328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138382
Approved by: https://github.com/ezyang
2024-10-19 13:32:51 +00:00
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
cyy
2f6a70bfea Enable more UBSAN checks (#138288)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138288
Approved by: https://github.com/ezyang
2024-10-19 13:00:26 +00:00
cyy
675e16e137 [3/N] Fix clang-tidy warnings in torch/csrc/api/ (#138328)
Follows #136998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138328
Approved by: https://github.com/ezyang
2024-10-19 07:07:39 +00:00
795255a7c8 Revert "[Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)"
This reverts commit 0c913b35aaea9ca33510239e939957ec5fe66d78.

Reverted https://github.com/pytorch/pytorch/pull/138187 on behalf of https://github.com/yf225 due to linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu) test_compiled_autograd_ctx failed ([comment](https://github.com/pytorch/pytorch/pull/138187#issuecomment-2423609108))
2024-10-19 06:12:47 +00:00
de16159e56 [MPS] Fix sliced cast (#138314)
This fixes an internal crash due to an invalid buffer size computation when the sliced API is used

Not sure what was the purpose of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
  baseShape = src._base().sizes();
} else {
  baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
  flattenedShaped *= baseShape[i];
}
```
As `flattenedShaped` could be computed much more simply as `[srcBuf length]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.

When someone allocated a buffer to hold, say, uint8 and then view-casted it to float16, the attempt to compute `baseShape` returned the sizes of the original tensor in its original dtype, rather than the size in the new dtype.

Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
2024-10-19 05:17:09 +00:00
0c913b35aa [Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)
After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR.

Fixes https://github.com/pytorch/pytorch/issues/138177.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187
Approved by: https://github.com/xmfan, https://github.com/awgu
ghstack dependencies: #138245, #138174
2024-10-19 04:33:35 +00:00
8f118e53d7 [CI] Fix CompiledDDP failure when the gradient is not contiguous; Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138174)
Summary:
As title

`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174
Approved by: https://github.com/yf225, https://github.com/kwen2501
ghstack dependencies: #138245

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-19 04:33:35 +00:00
3cfd244495 Add USE_SYSTEM_NVTX option (#138287)
## Summary

We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.

This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).

## Test Plan

The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
2024-10-19 04:26:01 +00:00
a20a17fd6f [Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Fixes https://github.com/pytorch/pytorch/issues/114369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-19 04:12:45 +00:00
88eb15a3e3 [audio hash update] update the pinned audio hash (#138139)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138139
Approved by: https://github.com/pytorchbot
2024-10-19 04:02:21 +00:00
7d076b9e3a updated EC2 fetching of metadata to use IMDSv2 (#138286) 2024-10-18 20:58:47 -07:00
ac7f52b301 Revert "[inductor] add a threshold for membw saving during fusion (#136782)"
This reverts commit 6647320de2077c10309f5025a007d51c7fb542d8.

Reverted https://github.com/pytorch/pytorch/pull/136782 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_memory starts to fail after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/136782#issuecomment-2423549196))
2024-10-19 03:43:42 +00:00
fecd370ea1 [c10d] Fix color value for comm split being negative (#137855)
Fixes https://github.com/pytorch/pytorch/issues/137856.

### Issue 1
Today under `ProcessGroupNCCL::Options`, color is declared as:
```
    int64_t split_color{0};
```
When passing this variable to `ncclCommSplit` which accepts `int`, the value may overflow and become negative, as in #137856. But NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).

But that's not all.

### Issue 2
`split_color` is pybind'ed to python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]:     1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because python `int` represents a wider range than C++ `int`. So we cannot pass hash values -- which are potentially big ints -- from python to C++. The PR takes the hash value modulo `c_int`'s max value.
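
A hedged sketch of the workaround described above (not the actual C++/pybind code): keep a potentially huge Python int within the non-negative range of a C `int` before handing it across the binding.

```python
import ctypes

C_INT_MAX = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1  # typically 2**31 - 1

def to_split_color(value: int) -> int:
    # NCCL only accepts non-negative colors, so fold the value into [0, C_INT_MAX).
    # Python's % with a positive divisor is non-negative even for negative inputs.
    return value % C_INT_MAX

print(to_split_color(hash("some process group description")))
```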

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855
Approved by: https://github.com/wconstab
2024-10-19 03:17:19 +00:00
542f7c8383 Eliminate C10_NODISCARD (#138336)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138336
Approved by: https://github.com/Skylion007
2024-10-19 02:54:06 +00:00
a4b6ef178c [c10d] Reorder cpp stack dump and FR dump and add log prefix to loggings (#138368)
The rationale behind this PR is to:
1. Move the dump of C++ traces after the FR dump, because the FR dump is timed (meaning it will not block forever) while the dumping of C++ traces is likely to be blocking, so we swap the order. Ideally we also want to make the C++ stacktrace dump a future wait; if we go down that path, we can do it in another PR.
2. Add a log prefix to the logs that do not have one yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138368
Approved by: https://github.com/c-p-i-o
2024-10-19 02:43:41 +00:00
ea412d5554 [AOTI] Fix a special case compile time data type codegen for sym int variables (#138106)
Summary:
This change unblocks the CFR AOTI lowering, which was failing with a runtime error.

TL;DR:

In this model, one Triton kernel expects a scalar input with dtype i64 but gets an i32. The reason is that `auto` can infer a smaller data type if the variable passed in is, e.g., i32, thus causing a CUDA IMA (illegal memory access).
Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`.

At compile time, this diff manually casts all symbolic arguments to i64 for i64 Triton kernel inputs, instead of using `auto var_x = {arg}` in the cpp wrapper code.

Test Plan:
Verified in FLB locally:

```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16  --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"```

Differential Revision: D64490039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138106
Approved by: https://github.com/ColinPeppler
2024-10-19 02:30:53 +00:00
d5035f0aab fix codecache write_atomic path issue on Windows. (#138331)
Fixes #138211

The `Path.rename` function has Windows-specific behavior: it raises `FileExistsError` when the target file already exists.
This does not happen on Linux, so I wrote a small reproduction script to figure out what is going on.

After stepping through the repro code:
```python
import os
import shutil
import sys
from pathlib import Path

_IS_WINDOWS = sys.platform == "win32"

def test_case():
    cwd = os.getcwd()
    path1 = os.path.join(cwd, "haha1.txt")
    path2 = Path(os.path.join(cwd, "haha2.txt"))

    try:
        path2.rename(path1)
    except FileExistsError as e_file_exist:
        if _IS_WINDOWS:
            # on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename
            shutil.copy2(path2, path1)
            os.remove(path2)
        else:
            raise e_file_exist
    except BaseException as e:
        raise e

    print("run here.")

if __name__ == "__main__":
    test_case()
```
We found that `path2.rename(path1)` can be broken down into:
1. copy file2's content to file1.
2. delete file2.

So we can implement equivalent code on the Windows path:
```python
shutil.copy2(src=tmp_path, dst=path)
os.remove(tmp_path)
```

That gives us the current PR.

TODO: needs a cherry-pick to the release/2.5 branch, CC: @atalman.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-19 01:27:12 +00:00
949b6f685d Enable -Werror on s390x (#136527)
Enable -Werror on s390x

Example of original issue on s390x:
https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704

Most of the warnings are not specific to s390x, but rather to gcc-13 or gcc-14. To test this on s390x, an image with gcc-13 is needed. s390x is tested for new regressions on every merge via the trunk workflow.

`-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe.

With newer gcc, `-Wredundant-move` complains about `std::move(...)` disabling copy elision, but removing `std::move(...)` makes the clang versions in use complain about copying objects when they could be moved. For now, also disable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527
Approved by: https://github.com/malfet
2024-10-19 01:18:42 +00:00
4a3c9400fe Update cpuinfo submodule (#138351)
To suppress error on ARM systems where PR_SVE_GET_VL is missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138351
Approved by: https://github.com/Skylion007
2024-10-19 01:12:29 +00:00
ff598f2f4d [DTensorTestbase] Add an optional eager_init flag to with_comms() to support eager init nccl communicator for DeviceMesh test case (#138108)
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
2024-10-19 01:04:55 +00:00
b3ae1b1b73 [CMake] remove duplicated cmake options for Gloo and C10D (#138318)
Just a trivial fix :P
The CMake options on lines 345 to 357 are identical to those on lines 358 to 369; remove the duplicated lines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138318
Approved by: https://github.com/janeyx99
2024-10-19 00:26:25 +00:00
6647320de2 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore it, these two nodes are not fused by default. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, getting 33% memory bandwidth savings.

I think adding a threshold for membw saving in general is not bad.
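
A toy sketch of the scoring idea only (the threshold value and function name below are made up, not inductor's actual heuristic):

```python
MIN_BYTES_SAVED = 32  # hypothetical threshold

def worth_fusing_for_membw(bytes_saved: int) -> bool:
    # Ignore fusions whose only benefit is a negligible memory-bandwidth saving,
    # leaving room for better fusions after loop reordering.
    return bytes_saved >= MIN_BYTES_SAVED

print(worth_fusing_for_membw(4))     # False: a shared 4-byte scalar is not enough
print(worth_fusing_for_membw(4096))  # True
```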

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-19 00:22:43 +00:00
e8b1409dcf Revert "[user triton] typing triton_kernel_wrap.py (#138230)"
This reverts commit 2f61b69603756c1fcaef71b231e598df31e20f42.

Reverted https://github.com/pytorch/pytorch/pull/138230 on behalf of https://github.com/wdvr due to Reverting this, as it started failing tests on main ([comment](https://github.com/pytorch/pytorch/pull/138230#issuecomment-2423354596))
2024-10-18 23:12:29 +00:00
4632594546 [inductor] Move V.graph.scheduler.current_device to V.graph.current_device (#138252)
There are some places where it would be nice to use this, but the scheduler hasn't yet been created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138252
Approved by: https://github.com/eellison
ghstack dependencies: #138170
2024-10-18 23:05:54 +00:00
85a6a782e5 [inductor] Generalize WorkspaceArg for graph-level semaphores (#138170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138170
Approved by: https://github.com/Chillee
2024-10-18 23:05:54 +00:00
13bcb065f5 [compiled autograd] enable some reentrant tests (#137290)
Some seem to fail due to queue_callback usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137290
Approved by: https://github.com/yf225
2024-10-18 22:25:08 +00:00
47e4045566 Revert "[pt2] Log is_forward field to dynamo_compile scuba table (#138097)"
This reverts commit 4e9273c84edafdcfff57521dde6675b967181ba8.

Reverted https://github.com/pytorch/pytorch/pull/138097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a land race with https://github.com/pytorch/pytorch/pull/137803 ([comment](https://github.com/pytorch/pytorch/pull/138097#issuecomment-2423297516))
2024-10-18 22:00:40 +00:00
bd7cbddfe3 [CODEOWNERS] Remove aaronenyeshi from Profiler paths (#138346)
As title, remove aaronenyeshi from Profiler paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138346
Approved by: https://github.com/sraikund16
2024-10-18 21:46:00 +00:00
c88b77af9c [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 21:39:39 +00:00
7faa1284ab [ptd][amd] call alltoallv instead of send/recv (#136368)
Summary:
as $title

AMD provides an a2av API; we should just use it instead of implementing PTD's own set of send/recv.
We should not skip 0B send/recv within a2av, as it may lead to deadlock; see details at https://github.com/ROCm/rccl/pull/1349

Test Plan:
before:

mvai-job will timeout on all2all

https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240913-1426-327e119d?job_attempt=1&version=0&env=PRODUCTION

after:

https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240919-1932-ebce94e6?job_attempt=0&tab=execution_details&env=PRODUCTION

latest APS job: https://fburl.com/mlhub/vn6dj7zp

Differential Revision: D63076315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136368
Approved by: https://github.com/xw285cornell
2024-10-18 21:31:57 +00:00
5b58697cc7 [Profiler] Clang bugs in Collection [1/n] (#138296)
Summary: I have to keep bypassing issues because of these clang rules. Let's start with all of the bug-type warnings rather than the variable-naming ones, because the latter would touch a lot of lines of code and make things hard to read.

Test Plan: Format tests pass.

Differential Revision: D64411171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138296
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
2024-10-18 21:06:50 +00:00
295de00908 [PT2 Compile Events] Revamp PT2 Compile/chromium event logging [1/?] (#138093)
This diff is the starting steps of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709

It implements the following changes:

- Only log spans to scuba, so no start events are ever logged
- Log events as the full event name, without "START" or "END"
- Only log to scuba major phases from chromium events. These are:
  - entire_frame_compile (dynamo)
  - backend_compile (aotdispatch)
  - inductor_compile (inductor)
  - codegen (inductor codegen)

Tlparse chromium events stay basically the same. But I implemented a few changes to clean that up as well:
- When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace to not have two identical rows. The fn_name is available as metadata on the chromium event, if interested
- Log new events for pre and post grad passes. These do *not* log to scuba.

By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add **much** more metadata and information to each individual event type. Diffs for that will come later.

**IMPLEMENTATION NOTES:**
- The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one.

- There's an interesting space after pre grad passes within AOT autograd logic, that's between create_aot_dispatcher_function and pre grad passes. I'm not sure what we're spending time doing in that time, but I'll find out with a profile later.

Differential Revision: [D64479033](https://our.internmc.facebook.com/intern/diff/D64479033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138093
Approved by: https://github.com/ezyang
2024-10-18 20:36:08 +00:00
3c7d9d6c7f [dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)
This method was no longer needed after #113725; the checking logic is
now in `SideEffects.check_allowed_side_effect`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #137905
2024-10-18 20:20:42 +00:00
162eba2dee [dynamo] Remove mutable_local.source and index on VariableTracker rather than MutableLocalBase (#137905)
This patch addresses parts of the side-effect refactor proposed in #133027;
specifically, it does 3 things:

1. Change `SideEffects.store_attr_mutations` and `PyCodegen.tempvars`
   to index on `VariableTracker` rather than `MutableLocalBase`.
2. Remove the `source` field from `MutableSideEffects` and
   `AttributeMutation`, and use `VariableTracker.source` instead.
3. Plumb a `overridden_sources: Dict[Source, Source]` from
   `handle_aliases_for_stolen_lists` to `PyCodegen` so that we don't
   update `VariableTracker.source` in place, while still preserving what
   `handle_aliases_for_stolen_lists` needed (i.e., modifying codegen for
   certain `VariableTracker`).

(1) and (2) are merged in 1 patch because of some dependency between
a. `OutputGraph.handle_aliases_for_stolen_lists` which iterates over
   `SideEffects.store_attr_mutations.keys()`, and potentially updates
   its source field to be completely different.
b. `SideEffects.codegen_update_mutated`, which happens after the above
   and uses `cg(var.mutable_local.source)`.
where if we apply (1) only, (b) breaks, and if we apply (2) only, (a)
breaks.

(3) is needed for correctness, see comments in the PR for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137905
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-10-18 20:20:42 +00:00
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d7589fd7fa98dc247ae4b1b7ab239ca3c.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
cd1e9b0e60 [EZ] Remove canary scale config (#138361)
Removing just the LF canary scale config for now to test the changes in https://github.com/pytorch/test-infra/pull/5767

Those changes have been deployed to prod and appear to be working, but this will be the final proof that it is in fact reading the test-config version of scale-config and not the pytorch/pytorch copy.

Note: This will break the Scale config validation workflow on test-infra, but it's worth it since this test will be very short lived and that workflow only runs when someone modifies scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138361
Approved by: https://github.com/wdvr
2024-10-18 20:02:00 +00:00
1ac42b5f3e graph.py: Refine unspec variable finding (#137303)
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D.  This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
2024-10-18 20:00:25 +00:00
d5bb70afe3 [Pipelining] Remove unnecessary {0,1} qualifier from regex (#138271)
There should always be 1 action.  This may be an artifact from trying to
extend the regex to handle the fused SEND_F_RECV_B style actions, which
was abandoned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138271
Approved by: https://github.com/H-Huang
ghstack dependencies: #138142
2024-10-18 19:52:07 +00:00
f23e8a8923 [Pipelining] Fix/improve format_pipeline_order (#138142)
Fix an issue where the format fn modified the original data structure; avoid mutating the input.

Change from printing "None" to printing an empty string, for cleaner visualization
of bubbles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138142
Approved by: https://github.com/H-Huang
2024-10-18 19:52:07 +00:00
d512d0e227 Always use aten.constant_pad_nd for mm padding (#137820)
Summary: From experiments, it seems that aten.constant_pad_nd has better QPS compared to torch.cat. The QPS gain for ig ctr is ~10%, and ~5% for oc.

Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```

Differential Revision: D64271583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
2024-10-18 19:35:03 +00:00
2f61b69603 [user triton] typing triton_kernel_wrap.py (#138230)
Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fixed all the mypy errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-10-18 19:29:31 +00:00
1f32a1fb80 Replace torch.export default decomp table to be lazily populated (#137650)
In this PR, we implement a lazy dictionary for the export decomp behaviour for the following reasons:
1. Custom op loading can happen after import time; as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible.

I intentionally separated out the core_aten_decomp to not have any custom CIA ops in this PR, to mitigate the risk of getting reverted, but in the future core_aten_decomp under torch/_decomp will exist as an alias to the official export table (torch.export.default_decompositions)
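
A minimal sketch of the lazy-dictionary idea, with illustrative names (not the exact torch._decomp implementation):

```python
from collections.abc import Mapping

class LazyDecompTable(Mapping):
    """Defers building the decomposition table until the first lookup, so
    custom ops registered after import time are still picked up."""

    def __init__(self, populate_fn):
        self._populate_fn = populate_fn  # called once, as late as possible
        self._table = None

    def _materialize(self):
        if self._table is None:
            self._table = self._populate_fn()
        return self._table

    def __getitem__(self, op):
        return self._materialize()[op]

    def __iter__(self):
        return iter(self._materialize())

    def __len__(self):
        return len(self._materialize())
```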

Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
2024-10-18 19:28:52 +00:00
ea8ea2f33f Improve build_with_deb_info (#138290)
Skip over commands that do not have an output file specified.

Recently I've noticed that `generate_torch_version.py` started to run on every rebuild, and this results in a failed plan for debug-info rebuilds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138290
Approved by: https://github.com/Skylion007
2024-10-18 18:50:12 +00:00
4e9273c84e [pt2] Log is_forward field to dynamo_compile scuba table (#138097)
Summary: ^^

Test Plan:
Ran a test script out of fbcode: D64350202. Then:

```
(pytorch-3.10_4) devvm2296:~/fbcode  $ scuba -e="select time,co_filename,is_forward from \`dynamo_compile/sandbox\` where is_forward is not null"
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|    time    |                                                                                    co_filename                                                                                    | is_forward |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
| 1729032583 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |          1 |
| 1729032583 | null                                                                                                                                                                              |          0 |
| 1729032650 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |          1 |
| 1729032650 | null                                                                                                                                                                              |          0 |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
4 row(s) in set (0 warnings, 131 errors, 0.80 sec)
```

Reviewed By: ezyang

Differential Revision: D64438144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138097
Approved by: https://github.com/ezyang
2024-10-18 18:48:52 +00:00
195d0a666b [BE][Ez]: Use interned hardcoded string FURB156 (#138330)
Uses string constants from string module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138330
Approved by: https://github.com/albanD
2024-10-18 18:26:16 +00:00
9c2a80322a Add Programmable Google Search (#137716)
- Adding the code for the programmable Google search
- Adding the CSS overrides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137716
Approved by: https://github.com/seemethere, https://github.com/albanD

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-10-18 18:18:16 +00:00
8d869c9ec7 Skip test_circular_dependencies on ROCm (#138312)
The test is flaky on ROCm and has been disabled for quite a while https://github.com/pytorch/pytorch/issues/110040.  The disabled issue was opened and then closed several times, so it's better to close that issue and skip the test here.

(This doesn't really fix the issue; I just want the test to be skipped in the PR instead of being disabled, and then to close the issue)
Fixes https://github.com/pytorch/pytorch/issues/110040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138312
Approved by: https://github.com/jithunnair-amd, https://github.com/clee2000
2024-10-18 18:17:48 +00:00
620039c38c [inductor] Respect ir_dataclass(frozen=...) in Python 3.9 (#138247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138247
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2024-10-18 17:55:12 +00:00
ada7a8c217 Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)"
This reverts commit 8cb91109061648497ca09d6f1f9b9e13a2f5557e.

Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/yf225 due to because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2422961292))
2024-10-18 17:51:54 +00:00
59158f640c [dynamo] Support equality comparison between Tensor and None (#138289)
This patch updates the `wrap_fx_proxy_cls` function to allow boolean output when the operation is one of
`supported_const_comparison_op_values`.

Fixes #120907.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138289
Approved by: https://github.com/williamwen42
2024-10-18 17:49:26 +00:00
9ea271d40b Expand doc for bundled autotune cache (#138298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138298
Approved by: https://github.com/ezyang, https://github.com/oulgen
2024-10-18 17:43:47 +00:00
4bba038b2f Add diagonal_copy to torch/_decomp/__init__.py (#136730)
Fixes https://github.com/pytorch/pytorch/issues/117349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136730
Approved by: https://github.com/masnesral
2024-10-18 17:39:17 +00:00
666572d819 Update viable strict workflow (#138262)
Corresponds to https://github.com/pytorch/test-infra/pull/5775

Tested in https://github.com/pytorch/pytorch/actions/runs/11393196544/job/31700963325?pr=138262 by adding my branch to the environment and pointing the workflow at my test-infra branch and commenting out the parts that did the push + upload record to s3

Versioning would have been good for this...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138262
Approved by: https://github.com/huydhn
2024-10-18 17:28:55 +00:00
912ea5601b Move manywheel binary scripts to pytorch (#138103)
PR to remove Manywheel Scripts:
https://github.com/pytorch/builder/pull/2017

Test PR : https://github.com/pytorch/pytorch/pull/138325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138103
Approved by: https://github.com/malfet
2024-10-18 17:11:28 +00:00
358ff3b731 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 1) (#136069)
[Inductor UT] Generalize newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_autoheuristic.py`
reuse `test/inductor/test_b2b_gemm.py`
reuse `test/inductor/test_custom_lowering.py`
reuse `test/inductor/test_efficient_conv_bn_eval.py`
reuse `test/inductor/test_group_batch_fusion.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136069
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-18 16:58:09 +00:00
8dd575faf6 [BE] Modernize C10_UNUSED (#138102)
[`[[maybe_unused]]`](https://en.cppreference.com/w/cpp/language/attributes/maybe_unused) is part of C++17 standard

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138102
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet, https://github.com/eqy
2024-10-18 16:33:01 +00:00
de51ed8610 [AOTI] Add C shim for _mkl_linear (#137880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137880
Approved by: https://github.com/desertfire
2024-10-18 16:26:19 +00:00
26ac5671dc Revert "Fix CompiledDDP failure when the gradient is not contiguous (#138174)"
This reverts commit 0ecafda6024f50734118dd794ac71b86c6e6d569.

Reverted https://github.com/pytorch/pytorch/pull/138174 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but I think it fails test_compute_comm_reordering in trunk for rocm and multigpu setup ([comment](https://github.com/pytorch/pytorch/pull/138174#issuecomment-2422818971))
2024-10-18 16:17:54 +00:00
98856f7ea1 Increase max runners available for linux.12xlarge and windows.8xlarge.nvidia.gpu.nonephemeral (#138332)
Related PR on test-infra: https://github.com/pytorch/test-infra/pull/5785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138332
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-10-18 16:17:36 +00:00
af306a392c Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit 7a117f3b3eea4cfeef21da2e3a8a1e39c30fa07d.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to unfortunately the failures on the previous import are still present on the current one D64568703 ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2422789143))
2024-10-18 16:01:01 +00:00
5a81475884 Documentation Update: Fix Missing Whitespace in Optimizer Docs (#138321)
### Description:

This PR addresses a minor [formatting issue identified in a previous contribution to the Optimizer documentation](https://github.com/pytorch/pytorch/pull/134107#discussion_r1800833948).

Specifically, it fixes the missing whitespace after `param_names` in the section on utilizing named parameters to load the optimizer state dict.

You can find the related docs here:
[Optimizer Documentation](https://pytorch.org/docs/main/optim.html#how-to-utilize-named-parameters-to-load-optimizer-state-dict).

@janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138321
Approved by: https://github.com/janeyx99
2024-10-18 15:41:43 +00:00
86aefa9405 typing subproc_pool.py (#138032)
Added type annotations to subproc_pool.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138032
Approved by: https://github.com/Skylion007
2024-10-18 15:31:05 +00:00
aa3ae50c07 Fixing MPS conv1d error message for output 2**16 (#134770)
Fixes #134416 by removing the misleading message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134770
Approved by: https://github.com/malfet
2024-10-18 14:13:20 +00:00
c4ed03cea1 Add proper handling for view and factory function for csan (#138236)
In particular, properly handle that some functions only read/write metadata on the Tensor and thus should not be detected as read/write by csan.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138236
Approved by: https://github.com/ngimel
2024-10-18 14:04:18 +00:00
0ff6f7a040 Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)"
This reverts commit 1581a93e8705dc23f649573d4404cd6816d614af.

Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))
2024-10-18 13:21:17 +00:00
e027403dea ILP for Auto SAC (Selective Activation Checkpointing) (#137908)
This PR presents a mixed integer linear programming (MILP) formulation that can be utilized to determine, under a memory budget, which modules to apply activation checkpointing (AC) to and the amount of activation memory that should be discarded for each module. The MILP uses information collected from MemTracker, Runtime Estimator, and SAC Estimator, introduced in these PRs:
* https://github.com/pytorch/pytorch/pull/124688
* https://github.com/pytorch/pytorch/pull/134243
* https://github.com/pytorch/pytorch/pull/135208

End-to-end example and its sample output:

```
import copy
from typing import Tuple

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

from torch.distributed._tools.ilp_utils import (
    aggregate_stats,
    get_peak_memory_runtime_baseline,
    parse_module_info,
)
from torch.distributed._tools.mem_tracker import _ModState, MemTracker
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.distributed._tools.sac_ilp import sac_milp
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

def _init_model_input_optimizer() -> Tuple[
    torch.nn.Module, torch.optim.Optimizer, torch.Tensor
]:
    bsz = 8
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=8192,
        max_seq_len=1024,
        dim=768,
        dropout_p=0.1,
    )
    with torch.device(torch.cuda.current_device()):
        model = Transformer(model_args)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(
        0,
        model_args.vocab_size,
        (bsz, model_args.max_seq_len),
        device=torch.cuda.current_device(),
    )
    return (model, optimizer, inp)

def _run_and_get_mem_tracker(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> MemTracker:
    mem_tracker = MemTracker()
    mem_tracker.track_external(model, optimizer)
    with mem_tracker as mt:
        for iter_idx in range(2):  # running twice to initialize optimizer
            output = model(inp)
            output.sum().backward()
            if iter_idx == 1:
                last_snapshot = mt.get_tracker_snapshot("current")
            optimizer.step()
            optimizer.zero_grad()
            if iter_idx == 0:
                mt.reset_mod_stats()
    assert last_snapshot is not None
    for mod_stats in mem_tracker.memory_tracking.values():
        if _ModState.POST_BW not in mod_stats.snapshots.keys():
            mod_stats.snapshots.setdefault(_ModState.POST_BW, []).append(
                copy.deepcopy(last_snapshot)
            )
    return mem_tracker

def _run_and_get_runtime_estimator(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> RuntimeEstimator:
    def _run_one_step() -> None:
        output = model(inp)
        output.sum().backward()
        optimizer.step()
        optimizer.zero_grad()

    # Initializing optimizer states and warm-up
    _run_one_step()

    runtime_estimator = RuntimeEstimator()
    with runtime_estimator(estimate_mode_type="operator-level-cost-model"):
        _run_one_step()  # We use only one iteration for estimation
    return runtime_estimator

def _run_and_get_sac_estimator(
    model: torch.nn.Module,
    inp: torch.Tensor,
) -> SACEstimator:
    sac_estimator = SACEstimator()
    with sac_estimator(estimate_mode_type="operator-level-cost-model"):
        loss = model(inp).sum()
    loss.backward()
    return sac_estimator

def main():
    with FakeTensorMode():
        model, optimizer, inp = _init_model_input_optimizer()
        mem_tracker = _run_and_get_mem_tracker(model, optimizer, inp)
        runtime_estimator = _run_and_get_runtime_estimator(model, optimizer, inp)
        sac_estimator = _run_and_get_sac_estimator(model, inp)
        mod_info = aggregate_stats(
            model,
            mem_tracker,
            runtime_estimator,
            sac_estimator,
            torch.device(torch.cuda.current_device()),
        )
        g = parse_module_info(mod_info)

        peak_mem, compute_time = get_peak_memory_runtime_baseline(g)
        print("=== WITHOUT AC ===")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"compute_time: {round(compute_time, 2)} ms")

        ac_decisions, recomputation_time, peak_mem = sac_milp(g, memory_budget=1.75)
        print("=== WITH AC ===")
        print(f"ac_decisions: {ac_decisions}")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"recomputation_time: {recomputation_time} ms")

if __name__ == "__main__":
    main()
```

```
=== WITHOUT AC ===
peak_mem: 2.41 GiB
compute_time: 97.97 ms
=== WITH AC ===
ac_decisions: {'Transformer.layers.0': 0.5232, 'Transformer.layers.1': 0.5232, 'Transformer.layers.2': 0.6849, 'Transformer.layers.3': 0.5232}
peak_mem: 1.75 GiB
recomputation_time: 5.92 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137908
Approved by: https://github.com/weifengpy
2024-10-18 12:45:37 +00:00
7b863230ea [Docs] Optimize parameter description to declare allowed type (2/N) (#138152)
Inspired by issue #137422 and #103847

Optimize method parameter types in docs to give users a clearer idea of what is expected to be passed to methods.

Previous PR:
- [x] https://github.com/pytorch/pytorch/pull/137956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138152
Approved by: https://github.com/albanD
2024-10-18 11:18:19 +00:00
354bc3ac11 [dynamo] Remove an unused variable in repro.after_aot (#138094)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138094
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-10-18 09:37:10 +00:00
e1c4548441 [dynamo] Simplify creation of VariableTrackers (#135714)
## `VariableTracker::build()` hides the Builders

### The problem

In the current code, creating a `VariableTracker` involves choosing one of two `Builder` classes and either calling a method, or calling a constructor that creates an object that you immediately call, [like this](083c9149b7/torch/_dynamo/variables/functions.py (L761-L768)).

Variations on this code are repeated in many places.

Moreover, the `Builder` classes have a lot of dependencies, so they have to be loaded late in the overall import process to avoid circular imports, and they end up being repeatedly imported at local scope.

### The solution

In this commit, the import from `builder` and the logic of choosing and calling the Builder class are hidden in a single static factory method, `VariableTracker.build()`, easier to reason about and to import.

This commit net lowers the total lines of code by over 150 lines by removing repetitive logic and unnecessary local imports.

**CHANGES:** Originally the name of the static method was `VariableTracker.create()` but a static method on a derived class, `LazyVariableTracker.create()` now exists with a different signature that's irreconcilable, so the new static method was renamed to `VariableTracker.build()`.
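
A hedged sketch of the factory-method shape described above; aside from `VariableTracker.build()` itself, the builder classes here are simplified stand-ins, not the actual dynamo builders:

```python
class _SourcelessBuilder:            # stand-in for dynamo's SourcelessBuilder
    @staticmethod
    def create(tx, value):
        return ("sourceless", value)

class _VariableBuilder:              # stand-in for dynamo's VariableBuilder
    def __init__(self, tx, source):
        self.tx, self.source = tx, source

    def __call__(self, value):
        return ("sourced", self.source, value)

class VariableTracker:
    @staticmethod
    def build(tx, value, source=None):
        # In the real code the builders are imported locally here, so the
        # heavyweight builder module is only loaded when a tracker is built,
        # avoiding the circular-import problem described above.
        if source is not None:
            return _VariableBuilder(tx, source)(value)
        return _SourcelessBuilder.create(tx, value)

print(VariableTracker.build(None, 42))                   # ('sourceless', 42)
print(VariableTracker.build(None, 42, source="L['x']"))  # ('sourced', "L['x']", 42)
```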

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135714
Approved by: https://github.com/jansel
2024-10-18 09:36:46 +00:00
1581a93e87 [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 09:10:01 +00:00
1a8b4c65ac Fix scatter and gather shape check error message (#138310)
The error message seems incorrect based on the surrounding code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138310
Approved by: https://github.com/Microve, https://github.com/fegin
2024-10-18 07:49:07 +00:00
517012058d Move test_db to training IR (#138251)
Differential Revision: [D64560792](https://our.internmc.facebook.com/intern/diff/D64560792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138251
Approved by: https://github.com/yushangdi
ghstack dependencies: #138249
2024-10-18 07:42:13 +00:00
29264fcbef Move test_verifier to training IR (#138249)
Differential Revision: [D64560351](https://our.internmc.facebook.com/intern/diff/D64560351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138249
Approved by: https://github.com/yushangdi
2024-10-18 07:36:29 +00:00
5d01126616 preserve module signature with multiple calls (#137999)
Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places.

Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.

Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)

Differential Revision: D64406945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
2024-10-18 07:30:22 +00:00
14e6624473 Update wmic command used in collect_env.py to its counterpart in powershell due to its deprecation (#138297)
As title.
`wmic` is deprecated in Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138297
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-18 07:03:17 +00:00
d116d007ee Add host-side Triton TMA support to Inductor (#137950)
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- Due to Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to the `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950
Approved by: https://github.com/eellison
ghstack dependencies: #137677
2024-10-18 06:27:24 +00:00
82443798aa [Distributed] Refactor compress hook to remove duplicated code (#138182)
Fix TODO in code

```python
# TODO: create an internal helper function and extract the duplicate code in FP16_compress and BF16_compress.
```

1. Extract common logic in `fp16_compress_hook` and `bf16_compress_hook` to `_compress_hook` method
2. Let `fp16_compress_hook` and `bf16_compress_hook` invoke `_compress_hook` with a different `dtype` (see the sketch below)
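
A minimal sketch of the refactor's shape (illustrative, not the exact upstream code), assuming the standard DDP comm-hook signature of `(process_group, bucket) -> Future`:

```python
import torch
import torch.distributed as dist

def _compress_hook(dtype, process_group, bucket):
    # Shared helper: allreduce the gradient bucket in the compressed dtype,
    # then cast back to the bucket's original dtype once the allreduce is done.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()
    compressed = bucket.buffer().to(dtype).div_(world_size)
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        return fut.value()[0].to(bucket.buffer().dtype)

    return fut.then(decompress)

def fp16_compress_hook(process_group, bucket):
    return _compress_hook(torch.float16, process_group, bucket)

def bf16_compress_hook(process_group, bucket):
    return _compress_hook(torch.bfloat16, process_group, bucket)
```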

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138182
Approved by: https://github.com/awgu
2024-10-18 06:01:15 +00:00
80a58b7207 Use fresh cache directory in test_cudacodecache (#138243)
This test frequently times out flakily, for example, https://github.com/pytorch/pytorch/actions/runs/11377972115/job/31654107609#step:22:2376.  I still couldn't reproduce this behavior locally, running it multiple times and in parallel.  ~~So, I suspect that the error only shows up when other tests are run in parallel.~~

~~I attempt to run this serially in this PR, once land, I can monitor trunk to see if this helps.~~

Running serially still ended up timing out (https://github.com/pytorch/pytorch/actions/runs/11391445912/job/31697603438), so this is another try, with a fresh cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138243
Approved by: https://github.com/clee2000
2024-10-18 05:45:39 +00:00
0b168ceb6d Collect Nvidia libraries with collect_env.py (#138076)
Collect Nvidia libraries to diagnose issues like #133548.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138076
Approved by: https://github.com/malfet
2024-10-18 05:05:00 +00:00
8cb9110906 [CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-18 04:58:58 +00:00
a9014d2287 [BE][MPS] Compile without warnings on MacOS15 (#138238)
By guarding the calls to `-[MTLCompileOptions setFastMathEnabled]` with `C10_DIAGNOSTIC_PUSH` and `POP`
and using `-[MTLCompileOptions setMathMode:]` and `-[MTLCompileOptions setMathFloatingPointFunctions:]` on MacOS15
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138238
Approved by: https://github.com/atalman
2024-10-18 04:20:15 +00:00
cc6c248919 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 2) (#136856)
[Inductor UT] Generalize newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_inductor_freezing.py`
reuse `test/inductor/test_layout_optim.py`
reuse `test/inductor/test_loop_ordering.py`
reuse `test/inductor/test_memory_planning.py`
reuse `test/inductor/test_padding.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136856
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/jansel
2024-10-18 03:58:00 +00:00
c3cd9939fc aten | Deduplicate and silence set but unused variable warning. (#138270)
Summary:
Turns out we have two functions called slightly differently but they do exactly the same thing.
Also silence the warning if the message is stripped out.

Test Plan: Sandcastle, no behavior change.

Differential Revision: D64566719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138270
Approved by: https://github.com/boguscoder, https://github.com/cyyever
2024-10-18 03:09:46 +00:00
73a153b931 [dynamo] add compiler.set_stance raw function call test and doc example (#138276)
Followup to https://github.com/pytorch/pytorch/pull/137504#issuecomment-2420107198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138276
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-18 02:54:22 +00:00
8b426d80dc [hops][refactor] Refactor the aliasing/mutation detection functions (#138234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138234
Approved by: https://github.com/ydwu4
ghstack dependencies: #138231
2024-10-18 02:35:00 +00:00
e714ebf664 [dynamo][testing] Update AOTEagerandRecordGraphs backend (#138231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138231
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/aakhundov
2024-10-18 02:35:00 +00:00
8a5dd7f59b Allow SequentialLR to include ChainedScheduler (#133450)
This fixes #132745 and allows a `SequentialLR` to include schedulers that are compound scheduler types (i.e., a `ChainedScheduler`), which contain a list of schedulers in a `_schedulers` attribute.
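
A hedged usage sketch of what this enables (scheduler parameters are illustrative):

```python
import torch
from torch.optim.lr_scheduler import (
    ChainedScheduler, ConstantLR, ExponentialLR, SequentialLR,
)

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

warmup = ConstantLR(opt, factor=0.1, total_iters=2)
# A compound scheduler: its children live in a `_schedulers` attribute.
chained = ChainedScheduler([
    ConstantLR(opt, factor=0.5, total_iters=2),
    ExponentialLR(opt, gamma=0.9),
])

# Before this fix, including a ChainedScheduler here would error.
sched = SequentialLR(opt, schedulers=[warmup, chained], milestones=[2])

for _ in range(5):
    opt.step()
    sched.step()
```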

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133450
Approved by: https://github.com/janeyx99
2024-10-18 02:29:38 +00:00
8cda774a03 Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773)
# Motivation
Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with.
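
A hedged usage sketch (requires a PyTorch build with XPU support; outputs depend on the build):

```python
import torch

if torch.xpu.is_available():
    print(torch.xpu.get_arch_list())      # AOT device architectures the build targets
    print(torch.xpu.get_gencode_flags())  # the corresponding AOT build flags
```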

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-10-18 02:28:08 +00:00
6d8c9be54b [reland] Add int1 to int7 dtypes (#137928)
Summary:
Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases
for weight quantization

Test Plan:
python test/test_quantization.py -k test_uint4_int4_dtype

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D64344944](https://our.internmc.facebook.com/intern/diff/D64344944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137928
Approved by: https://github.com/malfet
2024-10-18 02:02:08 +00:00
7365a57dc0 [BC] Add check for core ATen opset schema BC (#137664)
Summary: Based on core ATen opset BC policy: https://dev-discuss.pytorch.org/t/core-aten-opset-backward-forward-compatibility-policy/1772

Enforcing this policy in `check_forward_backward_compatibility.py`.
Basically the script will error out if any BC-breaking schema change
occurs to core ATen operators.

Test Plan:

Run `python test/forward_backward_compatibility/dump_all_function_schemas.py --filename nightly_schemas.txt`

Manually added a argument to `nightly_schemas.txt`, `convolution`
schema, see the following error:

```
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:329] Can NOT find backward compatible schemas after changes for schema aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor from the following candidates:
[
        aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups) -> Tensor
	aten::convolution.out(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, *, Tensor(a!) out) -> Tensor(a!)
]. Please contact PyTorch team to confirm if this BC breaking change is safe or not.
...
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:342] The PR is introducing backward incompatible changes to core ATen operators. Please contact PyTorch team to confirm whether this change is wanted or not.

Broken ops: [
	aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor
]
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137664
Approved by: https://github.com/albanD
2024-10-18 01:58:33 +00:00
21a9c06ca9 [c10d] differentiate timeout errors from nccl errors (#138240)
Summary:
Our watchdog does not clearly differentiate timeouts from NCCL errors, in terms of both logs and code paths.
It's important for c10d to differentiate the different reasons for watchdog
failures, e.g., timeouts vs. NCCL errors, and possibly let users handle the
errors differently depending on the type of error
Test Plan:
UT
Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240
Approved by: https://github.com/Skylion007
2024-10-18 01:36:32 +00:00
95f869c3d7 [pytorch_operator_stats] log if using torchscript runtime (#137986)
Summary: logs if an operator is run with the TorchScript runtime, using a thread_local variable set in `InterpreterState.run()`

Test Plan: buck2 run mode/dev-nosan caffe2/torch/fb/observers:scuba_observer_runner

Reviewed By: zou3519

Differential Revision: D64200781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137986
Approved by: https://github.com/angelayi
2024-10-18 00:55:22 +00:00
ad28565ed7 Use C++17 Convention Methods in PyTorch (#137958)
Detailed Descriptions:
- `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>`
- `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>`
- and so on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137958
Approved by: https://github.com/janeyx99
2024-10-18 00:52:51 +00:00
b7cf8fb800 c10 | Silence 'deprecated-dynamic-exception-spec' warning when importing cxxabi. (#138219)
Summary: The cxxabi header, specifically from LLVM, violates this; ignore the warning when including it.

Test Plan: No runtime behavior change, sandcastle only

Differential Revision: D64540217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138219
Approved by: https://github.com/boguscoder
2024-10-18 00:42:45 +00:00
2f91d7c63f [Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.

In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
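
A hedged usage sketch of the interaction described above, using the stance API from #137504:

```python
import torch

@torch.compile
def f(x):
    return x.sin().sum()

# With the stance forced to eager, torch.compile falls back to eager, and
# (per this PR) compiled autograd falls back to eager as well.
torch.compiler.set_stance("force_eager")

x = torch.randn(4, requires_grad=True)
f(x).backward()
```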

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
2024-10-18 00:13:00 +00:00
6d473e0dda [autolint] move to use a label (#138263)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138263
Approved by: https://github.com/huydhn
2024-10-18 00:12:52 +00:00
a3172809a1 [EZ] Fix typo in Normalization.mm (#138283)
Introduced by 6b76a21ebd
One likely has to wait 125 years for the MacOS-150 release :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138283
Approved by: https://github.com/kit1980
2024-10-18 00:01:21 +00:00
b14c9b7250 [AMD] Hipify torchaudio_decoder (#138181)
Summary:
X-link: https://github.com/pytorch/audio/pull/3843

Continue to hipify more torchaudio targets.

Test Plan:
CI

  buck build mode/opt-amd-gpu pytorch/audio/src/...

Differential Revision: D64298970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138181
Approved by: https://github.com/houseroad
2024-10-17 23:37:37 +00:00
0ecafda602 Fix CompiledDDP failure when the gradient is not contiguous (#138174)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174
Approved by: https://github.com/yf225, https://github.com/kwen2501

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-17 23:08:24 +00:00
2fc6c32b4c Ensure version file is regenerated at change (#138237)
Fixes observed error where `version.py` would not be regenerated by CMake without deleting the file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138237
Approved by: https://github.com/Skylion007
2024-10-17 22:46:05 +00:00
770fcaf2ab Fix the Rank of logsumexp Tensor and mGPU support. (#137717)
The logsumexp tensor was intended for internal use only but is apparently exposed to unit tests and Inductor.

The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.

Fixes #131316 #137414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2024-10-17 21:58:14 +00:00
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
69ba89da11 Fix cuda sanitizer and as_subclass calls (#138218)
This fixes 4 main issues:
- The way the cuda sanitizer handles its state is weird. In particular, because the lifetime of the Mode is linked to the submodule, it might outlive the python runtime and other loaded modules. On my current version, this even outlives the "sys" module. Given that I'm not sure of the impact of changing this lifetime handling, I'm making the exit handler a no-op when python is already dying, since there's no point cleaning up at that point.
- Adds a "disable" method to be able to test after the mode is enabled.
- Fix `Tensor.as_subclass()` to properly disable modes when creating the new Tensor object, just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here is just to apply the exact same treatment to it.
- ~Fix `Tensor.as_subclass()` not to propagate autograd as there is no valid backward associated here.~ We have tests that check that this behavior happens, so I guess this is not an obvious bugfix but rather expected behavior. Reverted that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218
Approved by: https://github.com/ngimel
2024-10-17 21:18:32 +00:00
b14269dcfb Make Context to be Device-agnostic Step by Step (1/N) (#136519) (#138155)
Summary:
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Original pull request: https://github.com/pytorch/pytorch/pull/136519

Test Plan: contbuild & OSS CI, see 4a8e49389c

Reviewed By: malfet

Differential Revision: D64471142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet, https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
7a117f3b3e Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and only upcasting for the epilogue, but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.
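
A hedged illustration of the tradeoff described above using `torch.baddbmm` (assumes a CUDA device; shapes and dtypes are illustrative):

```python
import torch

a = torch.randn(8, 64, 32, device="cuda", dtype=torch.float16)
b = torch.randn(8, 32, 16, device="cuda", dtype=torch.float16)
inp = torch.randn(8, 64, 16, device="cuda", dtype=torch.float16)

fused = torch.baddbmm(inp, a, b)  # eager: fused fp16 kernel
# Decomposed form: bmm and epilogue upcast to fp32, downcast at the end.
decomposed = (inp.float() + torch.bmm(a.float(), b.float())).to(torch.float16)
print((fused - decomposed).abs().max())
```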

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-17 19:24:54 +00:00
54839781ed Update lint failure msg to encourage lintrunner -a locally (#138232)
This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click "approve all workflows" just for lint to fail again because the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a try once to start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-10-17 19:13:55 +00:00
dfb5ac05cc [Record Function] Add Kwargs only USER_SCOPE Macro (#138020)
Summary: Add a macro such that users can easily add a USER annotation with kwargs only

Test Plan: Will use D63801503 to test this E2E. Added unit test as well that makes sure that the kwargs get recorded correctly

Differential Revision: D64420328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138020
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi
2024-10-17 18:48:49 +00:00
0c76c68d7d [tlparse][AOTAutograd] Rename to aot_inference_graph in tlparse output (#137803)
Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which makes it hard to distinguish from the normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph", which makes it easier to tell which tlparse graph block comes from Compiled Autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803
Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang
2024-10-17 18:44:37 +00:00
d531bd509e [Docs] Fix description in torch.save docs to show default for pickle_protocol instead of variable name (#138153)
Fixes #138013

Replace `DEFAULT_PROTOCOL` with actual default value `2` in `torch.save` method document

Before
![image](https://github.com/user-attachments/assets/cdd77d14-c009-4848-8538-9256bf22c32a)

After
![image](https://github.com/user-attachments/assets/f6b1063d-c955-478a-8d42-702b988426aa)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138153
Approved by: https://github.com/mikaylagawarecki
2024-10-17 18:13:05 +00:00
8abbd1c7c7 Modernize C10_NODISCARD to [[nodiscard]] (#138151)
PyTorch is C++17 now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138151
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-17 18:07:39 +00:00
6752e7dc3e Moved some of Inductor IR nodes to be frozen (#137859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137859
Approved by: https://github.com/ezyang
2024-10-17 18:04:45 +00:00
0b2c12cb4d Support more foreach ops for tensor beta support (#134170)
Add more foreach ops so we don't have fallbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134170
Approved by: https://github.com/eellison
2024-10-17 17:51:31 +00:00
92fdea8a39 remove skips due to https://github.com/pytorch/torchdynamo/issues/1991 (#138133)
Closes https://github.com/pytorch/pytorch/issues/93479. A bunch of other dynamo-wrapped tests also exhibit "torch.* returned non-Tensor output unimplemented" making the issue seem less relevant to me. Some tests are marked as xfail as they fail for other reasons.

If these tests are indeed important, we should create a new issue to track them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138133
Approved by: https://github.com/ezyang
2024-10-17 17:42:46 +00:00
6b76a21ebd [PyTorch] Fix incorrect macOS 15.0 gating in MPS backend (#138022)
The ifdef as written just checks if the macOS 15.0-capable SDK is being used. You also need a runtime gate to make sure macOS 15 is in use.

Differential Revision: [D64429453](https://our.internmc.facebook.com/intern/diff/D64429453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138022
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137722, #138014
2024-10-17 17:35:34 +00:00
d2a6c73235 Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)"
This reverts commit 20af56d4359c3f5fed2e8f94e111a8502f2ebeb3.

Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new tests are failing inductor distributed jobs ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2420109501))
2024-10-17 17:32:06 +00:00
2a50d77823 Move test_experimental.py to training IR (#138140)
Differential Revision: [D64510938](https://our.internmc.facebook.com/intern/diff/D64510938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138140
Approved by: https://github.com/avikchaudhuri
2024-10-17 17:30:10 +00:00
ecc5e05854 Refactor NJT min / max seqlen handling for convenience (#138130)
There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist, without computing / caching them if they don't. This PR introduces private convenience functions to simplify handling this and to avoid redundant checks.
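
A hypothetical sketch of the convenience-helper idea (the helper name and cache field are assumptions for illustration, not the actual private API):

```python
def _maybe_min_seqlen(nt):
    # Return the cached min seqlen if it has already been computed, without
    # computing (and caching) it when it hasn't.
    return getattr(nt, "_metadata_cache", {}).get("min_seqlen", None)
```
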
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130
Approved by: https://github.com/soulitzer
2024-10-17 17:28:39 +00:00
66478d0cf7 Revert "[compiled autograd] directly use python Logger class in cpp (#137953)"
This reverts commit af916613687d3bcc1d15362ba2fdf9312378c500.

Reverted https://github.com/pytorch/pytorch/pull/137953 on behalf of https://github.com/clee2000 due to breaking builds internally D64479234, I think it makes the build size of a package too large? The logs link to a wiki with instructions of what to do ([comment](https://github.com/pytorch/pytorch/pull/137953#issuecomment-2420086928))
2024-10-17 17:19:36 +00:00
3b0f3059f6 Revert "[Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)"
This reverts commit ebe37b23f11e150cd3afa5464193ee036e15277f.

Reverted https://github.com/pytorch/pytorch/pull/138113 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert https://github.com/pytorch/pytorch/pull/137953, please rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/138113#issuecomment-2420079703))
2024-10-17 17:16:44 +00:00
375dcb960f Revert "Avoid some dangling reference warnings (#132535)"
This reverts commit f3d7a02716d8725dcedff86094bd7e20f73155f1.

Reverted https://github.com/pytorch/pytorch/pull/132535 on behalf of https://github.com/clee2000 due to broke some internal builds D64479234 ([comment](https://github.com/pytorch/pytorch/pull/132535#issuecomment-2419983509))
2024-10-17 16:23:36 +00:00
348f208504 Autocast re-tracibility (#138082)
Summary:
Support autocast re-tracing by giving it the same treatment as set_grad.

In re-tracing, when dynamo encounters an autocast HOP, we want it to trace through `with torch.autocast()` again, and replace the HOP with the traced subgraph.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_autocast
```

Differential Revision: D63856081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138082
Approved by: https://github.com/ydwu4
2024-10-17 16:09:11 +00:00
3087b5e431 [cond] support lifted symint inputs in subgraph (#137519)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137519
Approved by: https://github.com/eellison
2024-10-17 16:09:06 +00:00
2414c3f534 AOTI fixes for MI300 lowering (#137939)
Summary:
1) Add sleef back to enable SIMD on AMD
2) adding kpack to triton compute_meta  for AMD triton, since there will be user-defined triton kernels using this for k-dim packing

Test Plan:
```
HIP_VISIBLE_DEVICES=0 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" buck run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark --  --skip-flop-estimation --skip-trt --skip-ait --enable-aot-inductor --sync-mode=0 --gpu-trace --sample-input-tile-factor=1  --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge" --lowering-input-str='{"serialized_inference_model_input_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge","serialized_inference_model_output_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/mi300_output.merge","submodule_names_to_lower":["merge"],"inductor_lowering_context":{"aot_inductor_lowering_settings":{"use_scripting":true,"preset_lowerer":"ifu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":3,"output_precision":3, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}},"model_entity_id":925729118,"model_snapshot_id":0,"add_sample_inputs":false,"hardware_type":0,"platform_arch":1,"dense_in_place_format":2}' --precision=bf16 2>&1 | tee local_benchmark_log.txt

```

Differential Revision: D64262924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137939
Approved by: https://github.com/frank-wei
2024-10-17 16:09:04 +00:00
502c6183e0 Prevent tuple instances from being weak-referenced. (#137838)
Summary:
Currently, https://fburl.com/code/uka25j1i checks whether the guarded object supports weakref by looking at its `__class__`
```
if hasattr(guarded_object.__class__, "__weakref__") and not isinstance(
    guarded_object, enum.Enum
):
    obj_ref = weakref.ref(guarded_object)
```

However, we have reason to modify this slightly because we use classes that "pretend" to be some other classes (e.g. nn.Parameter). Example https://fburl.com/code/8bcktgoh :
```
class QuantizedWeights:
    # TODO: Ugly trick so torch allows us to replace parameters
    # with our custom weights. Do this properly.
    @property
    def __class__(self) -> Type[nn.parameter.Parameter]:
        return nn.Parameter

    @property
    def grad_fn(self) -> None:
        return None
```

For example, Fp8RowwiseWeights, which inherits from the base class above and also from namedtuple, actually does not have a `__weakref__` attribute, but its "class" will say it does.

I think the easiest change is to use instance-level checking rather than class-level
```
if hasattr(guarded_object, "__weakref__") ...
```

But I'm wondering if this will harm any of the existing behaviors.

I'd appreciate reviews from the experts

(I just added all recommended reviewers since I'm not sure who is the best person to consult...)

Test Plan: CI?

Reviewed By: YJYJLee

Differential Revision: D64140537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137838
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-17 16:08:32 +00:00
7e16c9d5f2 include bw_compiler in strobelight profile (#138060)
Summary: title + tlparse will have the phase name.

Test Plan: {F1933087525}

Differential Revision: D64450315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138060
Approved by: https://github.com/ezyang
2024-10-17 16:08:28 +00:00
20af56d435 [CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan
2024-10-17 10:51:07 +00:00
8cfe28e4e3 [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-10-17 09:06:57 +00:00
47077bfcb5 Remove an unused variable in _subclasses.fake_tensor (#138086)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138086
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-17 09:05:25 +00:00
ba10259115 Increase default COMPILE_STROBELIGHT_MAX_STACK_LENGTH to 500 (#138006)
Summary: pt2 call stacks are long; raising the limit reduces stack truncation
<img width="1363" alt="Screenshot 2024-10-15 at 11 35 11 AM" src="https://github.com/user-attachments/assets/d09a8fb5-eafc-4440-ab58-464889dc6df8">
<img width="1373" alt="Screenshot 2024-10-15 at 11 35 26 AM" src="https://github.com/user-attachments/assets/c4c9c245-54d1-4e35-b16f-029ece335e03">

Differential Revision: D64414746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138006
Approved by: https://github.com/bobrenjc93
2024-10-17 07:31:32 +00:00
5b7f4767ff Fix https://github.com/pytorch/pytorch/issues/138062 (#138137)
Fixes https://github.com/pytorch/pytorch/issues/138062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138137
Approved by: https://github.com/mlazos
2024-10-17 07:12:15 +00:00
f3c3f3a3c3 Fix assigning tensor with requires_grad as constant in export (#137997)
When we insert constants into the unlifted graph, we need to detach them if they require grad, BUT when we detach we need to preserve the original aliasing information.

Differential Revision: [D64406859](https://our.internmc.facebook.com/intern/diff/D64406859/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137997
Approved by: https://github.com/avikchaudhuri
2024-10-17 06:41:10 +00:00
38d9924bfc Disable lint suggestions on my PRs (#138054)
The suggestions unusably clog up early draft PRs that are not necessarily lint clean yet. Making matters worse, even if I fix them I have to manually click through hundreds of comments to "Resolve" them even though I've fixed it. Disabling it on ghstack helps, but I occasionally do standard PRs via fbcode export mechanism. Opt me out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138054
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/PaliC
2024-10-17 05:28:37 +00:00
cyy
af8bd323e8 Remove legacy Caffe2 pthreadpool from CMake (#134936)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936
Approved by: https://github.com/ezyang
2024-10-17 05:22:08 +00:00
9c084cccfd [Pytorch][ATEN] Enable FP8 concatenate (#138046)
Summary: Float8 is becoming an increasingly popular datatype now that it is well supported on GPUs. This diff enables FP8 to work with `torch.cat`. This is pretty straightforward since memory operations don't vary based on the input dtype, but it can be quite helpful for FP8-based models.
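
A hedged usage sketch of what this enables (assumes a build/device where `torch.float8_e4m3fn` is available):

```python
import torch

a = torch.randn(2, 4, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(3, 4, device="cuda").to(torch.float8_e4m3fn)

out = torch.cat([a, b], dim=0)
print(out.shape, out.dtype)  # torch.Size([5, 4]) torch.float8_e4m3fn
```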

Test Plan:
```
buck2 run mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12 //caffe2/test:tensor_creation -- -r test_cat_all_dtypes_and_devices
```

Differential Revision: D64443965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138046
Approved by: https://github.com/eqy, https://github.com/qchip, https://github.com/jianyuh
2024-10-17 04:58:54 +00:00
ebd60f4074 update CMAKE_PREFIX_PATH setting command (#134934)
The current command for setting the `CMAKE_PREFIX_PATH` environment variable overwrites any values that were already set. Appending with `:` adds the conda env search path to the existing values to avoid library-not-found issues.
`export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134934
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-17 04:19:18 +00:00
7db1f0b7b5 Minor assert error message improvement (#138053)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138053
Approved by: https://github.com/Skylion007
2024-10-17 03:54:15 +00:00
ebe37b23f1 [Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.

In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
ghstack dependencies: #138105
2024-10-17 03:45:10 +00:00
fe43f72be7 [AOTI] Remove the non-ABI-compatible mode (part 2) (#138047)
Summary: Continue to clean up non-ABI-compatible mode related code.

Differential Revision: [D64444327](https://our.internmc.facebook.com/intern/diff/D64444327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138047
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016, #138009
2024-10-17 02:54:24 +00:00
2e67d7cc35 [AOTI] Remove the non-ABI-compatible mode (part 1) (#138009)
Summary: The ABI-compatible mode has been turned on by default in https://github.com/pytorch/pytorch/pull/136534. Remove the non-ABI-compatible logic to greatly simplify the wrapper codegen logic.

Differential Revision: [D64439676](https://our.internmc.facebook.com/intern/diff/D64439676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138009
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016
2024-10-17 02:48:26 +00:00
7711f00553 [BE] Delete unused operator!= from the test (#138122)
If the method is unused, why not delete it altogether?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138122
Approved by: https://github.com/swolchok
2024-10-17 02:24:48 +00:00
906fe05895 Naive impls for NJT matmul (#138121)
Our matmul support is abysmal - let's at least get this working and do it performantly later.

Bonus: implements `bmm` as well.

jagged <-> padded dense conversions are utilized when possible, with an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
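A minimal usage sketch (the shapes and the jagged-layout constructor call are assumptions, not part of this change):

```python
import torch

# Jagged NJT of shape (B=2, j, 8) matmul'd with a dense (8, 4) weight; under
# the hood this routes through padded-dense conversion or the unbind fallback.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
w = torch.randn(8, 4)
out = torch.matmul(nt, w)  # jagged result of shape (2, j, 4)
```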
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
2024-10-17 01:31:46 +00:00
b4f7f4bf49 [Docs] Optimize parameter description to declare allowed type (1/N) (#137956)
Inspired by issue #137422 and #103847

Optimize method parameter types in the docs to give users a clearer idea of what is expected to be passed to each method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137956
Approved by: https://github.com/albanD
2024-10-17 01:19:55 +00:00
c69f4518ec [SymmetricMemory] fix a race condition in _pipelined_produce_and_all2all that can cause correctness issues for very small chunk_producers (#138126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138126
Approved by: https://github.com/lessw2020
2024-10-17 01:05:41 +00:00
69e125a7e9 AOTInductor: fixup test (follow-up to #137401) (#137692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137692
Approved by: https://github.com/desertfire
2024-10-17 00:40:21 +00:00
94537e70b5 Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100
Approved by: https://github.com/Skylion007
2024-10-17 00:34:56 +00:00
504904c9c6 [Traceable FSDP2] Add compiled_autograd_enabled helper function (#138105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138105
Approved by: https://github.com/awgu, https://github.com/xmfan
2024-10-17 00:04:06 +00:00
0e9708f907 tensor constant with wrapped method (#138091)
Summary:
Tensor constants can show up through wrapped methods, so they may not always be found in constant attributes. They nevertheless need to be fakified and their meta vals need to be found to create graph signatures; otherwise non-strict export barfs.

Longer term maybe we should pull this fakification up in non-strict.

Test Plan: added test

Differential Revision: D64480272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091
Approved by: https://github.com/tugsbayasgalan
2024-10-17 00:00:04 +00:00
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b409c226a1da0f597c66ece7c06de0e9e.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
5254a0d383 Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit cef6c3dcb07aafe25d62427e55442a46d7af3500.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to failing internal tests D64418200, some results not within tolerance? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2418122735))
2024-10-16 23:16:44 +00:00
ea2726452a add myself as codeowner in aot_autograd (#138075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138075
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #136670
2024-10-16 22:41:39 +00:00
a682194a11 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

In particular, we should already have this information by the time we get to inductor, because our FakeTensor compute will already have branched/guarded appropriately on whether any ops performed broadcasting.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple and easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`.
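A minimal sketch of the guard-scan idea described above (the helper name and guard attributes are assumptions, not Inductor's actual code):

```python
import sympy

def expr_is_one_from_guards(expr: sympy.Expr, shape_env) -> bool:
    """Return True if an existing shape-env guard pins `expr` to 1."""
    for guard in shape_env.guards:            # assumed: iterable of guard objects
        g = getattr(guard, "expr", guard)     # unwrap to the sympy expression
        if isinstance(g, sympy.Eq) and g.lhs == expr and g.rhs == 1:
            return True
    return False
```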

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-10-16 22:41:39 +00:00
56379e2c17 Remove an unused variable in _subclasses.fake_impls (#138085)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138085
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-10-16 22:41:04 +00:00
0bfa1bf21d [scan] support closure (#135602)
This PR adds an additional_inputs argument to support closures similar to what we've done for while_loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135602
Approved by: https://github.com/zou3519
ghstack dependencies: #135600, #135601
2024-10-16 22:28:03 +00:00
819d6b139c [scan] flatten subgraph output and make subgraph inputs to be a slice (#135601)
This pr introduces two changes:
1. Before this PR, the subgraph's output was ([], []); in this PR, we change it to a flattened list for easier codegen and consistency with other control flow operators.

2. Before this PR, the combine_fn of scan took a sliced input but kept the sliced dimension. For example, suppose xs = torch.randn(3, 4, 5) and we scan over dim 0; the combine_fn looked like:
```
# x.shape = (1, 4, 5) instead of (4, 5)
def combine_fn(carry, x):
  ...
```

In this PR, we fix this and also simplify some of the slicing logic.

3. This diff also makes sure we always stack ys on the first dimension (a pure-Python reference sketch follows).
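A pure-Python reference for the semantics described above (a sketch, not the HOP implementation itself):

```python
import torch

def scan_reference(combine_fn, init, xs, dim=0):
    # Each slice has the scanned dim removed, e.g. (4, 5) for xs of shape (3, 4, 5).
    carry = init
    ys = []
    for x in xs.unbind(dim):
        carry, y = combine_fn(carry, x)
        ys.append(y)
    # ys are always stacked on the first dimension.
    return carry, torch.stack(ys)

carry, ys = scan_reference(
    lambda c, x: (c + x, c + x), torch.zeros(4, 5), torch.randn(3, 4, 5)
)
print(ys.shape)  # torch.Size([3, 4, 5])
```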

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135601
Approved by: https://github.com/zou3519
ghstack dependencies: #135600
2024-10-16 22:28:03 +00:00
0437a22d43 [scan] fix typo in signature and remove wrapper (#135600)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135600
Approved by: https://github.com/zou3519
2024-10-16 22:27:59 +00:00
443472b1ca [AOTI] Remove explicit abi_compatible setting in tests (#138016)
Differential Revision: [D64439674](https://our.internmc.facebook.com/intern/diff/D64439674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138016
Approved by: https://github.com/malfet
ghstack dependencies: #137982
2024-10-16 21:35:46 +00:00
6bc57549f9 [AOTI] Remove non-ABI-compatible tests (#137982)
Summary: Remove the non-ABI-compatible mode tests since ABI-compatible mode has been turned on by default. Also clean up tests that explicitly set ABI-compatible to True.

Differential Revision: [D64439673](https://our.internmc.facebook.com/intern/diff/D64439673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137982
Approved by: https://github.com/malfet
2024-10-16 21:35:46 +00:00
a040c4a260 Use std::move on stringstream to prevent unnecessary copy. (#138065)
- Takes advantage of C++20's improved handling of move semantics for std::basic_stringbuf.
- Reduces unnecessary copying and improves memory efficiency, especially for long formatted strings.

Benchmark(proof of concept): https://quick-bench.com/q/qohAu0ARH3vSDyKVsoKEfXOO6BI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138065
Approved by: https://github.com/Skylion007
2024-10-16 21:35:10 +00:00
b72ff35f22 [c10d][ez] Add more inline comments to CUDAEventCache code (#138079)
Address @kwen2501 's feedback in https://github.com/pytorch/pytorch/pull/138048, add more inline comments to the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138079
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040, #138048, #138059
2024-10-16 20:43:28 +00:00
f2c96f5d87 Add AOTI test (#138043)
Summary:
Add back the test that was removed in D63916320.

It should work now as D64361273 added back the workspace change.

Test Plan: CI

Differential Revision: D64442054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138043
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-10-16 20:41:07 +00:00
f95ddf0b31 [c10d] record world size in log (#138044)
Summary:
Record the world size in the log and the Scuba table.
This helps us quickly figure out if there are missing flight recorder files from ranks.

Test Plan: Ran locally and noted that size was logged to scuba

Differential Revision: D64442949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138044
Approved by: https://github.com/Skylion007
2024-10-16 20:14:02 +00:00
24ee4af86b Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit 2b7c7a20b9c0e8e7f2773ffc5c9f79c3cae2070b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))
2024-10-16 20:05:38 +00:00
a0a978ce23 [aoti config] add raise_error_on_ignored_optimization (#138035)
Summary: Unfortunately this means adding another config.

Test Plan: ci

Differential Revision: D64437699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138035
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-10-16 18:38:47 +00:00
f1c741dbe9 Fixes GuardOnDataDependentSymNode error in masked_fill (#137060)
Fixes [P1621441513](https://www.internalfb.com/phabricator/paste/view/P1621441513) ([ref to internal post](https://fb.workplace.com/groups/6829516587176185/posts/1051474609896021/?comment_id=1055262166183932&reply_comment_id=1056583932718422))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137060
Approved by: https://github.com/ezyang
2024-10-16 18:16:33 +00:00
f173623bb2 [td] try catch exception, do not run td if not results (#138087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138087
Approved by: https://github.com/wdvr
2024-10-16 18:04:25 +00:00
dabe2a3c3b [Torch] Support meta device in random.fork_rng (#137715)
Summary:
## Why
random.fork_rng doesn't support meta device:
```
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/aps_models/ads/tools/memory_estimator/estimation_dense.py", line 655, in estimate_dense_memory_size
[rank0]:     losses.sum().backward()
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 604, in backward
[rank0]:     return handle_torch_function(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/overrides.py", line 1718, in handle_torch_function
[rank0]:     result = mode.__torch_function__(public_api, types, args, kwargs)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/_device.py", line 106, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 613, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank0]:     frame.recompute_fn(*args)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1507, in recompute_fn
[rank0]:     with torch.random.fork_rng(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/runtime/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]:     return next(self.gen)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/random.py", line 153, in fork_rng
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: torch has no module of `meta`, you should register a module by `torch._register_device_module`.
```

This blocks us from running backward() on a model with checkpointing enabled in meta mode.

## What
This diff handles the case of meta device in random.fork_rng.
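A rough illustration (the argument values are assumptions; the real trigger is the checkpoint recompute path shown in the trace above):

```python
import torch

# Previously this raised "torch has no module of `meta`"; with this change the
# meta device type is handled gracefully since it has no RNG state to fork.
with torch.random.fork_rng(devices=[], device_type="meta"):
    pass
```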

Test Plan: Tested with toy model which has checkpoint on its module: P1641201046

Differential Revision: D64161410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137715
Approved by: https://github.com/kit1980
2024-10-16 18:00:39 +00:00
a47bb4a393 Fix autocast for non-strict export (#137495)
Summary:

Add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad nodes into a HOP, but we should still have the set_grad_enabled/autocast nodes.

Add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes didn't show up in the export graph when we used `strict=False`.

- In autocast's enter and exit functions, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`. If we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as in strict mode, except strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806).
```
- torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function
- torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__
- torch.amp._enter_autocast()'s torch.overrides.handle_torch_function
- PreDispatchTorchFunctionMode.__torch_function__
```
- In `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes.
- To match the strict mode behavior, we let the input node of the `_exit_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode` (see the sketch below).
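A minimal sketch of the non-strict behavior this enables (the module and dtype are assumptions):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            return x @ x

ep = torch.export.export(M(), (torch.randn(4, 4),), strict=False)
# The graph should now contain _enter_autocast/_exit_autocast nodes even with
# strict=False.
print(ep.graph)
```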

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_autocast
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_set_grad
```

Differential Revision: D64016023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495
Approved by: https://github.com/bdhirsh
2024-10-16 17:39:00 +00:00
7ba706c74e update get start xpu (#137479)
1. Respect the comments from the community: downgrade "Beta" to "Prototype" for the first XPU release with wheels.
2. Add wheel installation of torchaudio & torchvision for nightly builds on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137479
Approved by: https://github.com/atalman, https://github.com/malfet
2024-10-16 17:36:29 +00:00
7e704c2073 [c10d] Add unit test for CUDAEventCache to ensure caching is working (#138059)
We created a simple test to validate that the cache is indeed working, including the case when the cache is used up. Reverting the fix in https://github.com/pytorch/pytorch/pull/138040 makes the test fail, as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138059
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040, #138048
2024-10-16 17:34:57 +00:00
dd32a32cb6 Revert "Expose option to disable CRC-32 computation during torch.save (#137735)"
This reverts commit 534fa96f2d9a4feb1dcdfaecb3d73990db60f819.

Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))
2024-10-16 17:03:06 +00:00
2b7c7a20b9 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 16:42:57 +00:00
0a6c40faba Fix constant returning (#137993)
When a constant is used twice in the exported graph (the second use is returned as an output), the constant-lifting pass doesn't account for the second use being the output. This PR fixes that.

Differential Revision: [D64406108](https://our.internmc.facebook.com/intern/diff/D64406108/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137993
Approved by: https://github.com/avikchaudhuri
2024-10-16 16:42:09 +00:00
189c95457d [PyTorch] Don't hardcode 4 * Vec::size() in vectorized_reduction (#138014)
This will break once we support 128-bit vectors, and there's no reason to do it.

Differential Revision: [D64421982](https://our.internmc.facebook.com/intern/diff/D64421982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138014
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #137722
2024-10-16 16:41:59 +00:00
a12c859b00 [PyTorch] Check defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) instead of defined(CPU_CAPABILITY_NEON) (#137722)
The CPU_CAPABILITY system is for rebuilding kernels multiple times with different vector ISA targets. CPU_CAPABILITY_NEON was not being used for that, just as an extra flag for inductor. As a result, CPU_CAPABILITY_NEON-gated code was unnecessarily unavailable outside inductor. Fixes #137704

Differential Revision: [D64197046](https://our.internmc.facebook.com/intern/diff/D64197046/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137722
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-16 16:41:59 +00:00
361f42bc42 Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)"
This reverts commit 9aba0b91c8df4a15654f9ccc02abca31bdd81650.

Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))
2024-10-16 16:38:29 +00:00
af27f7888b [dynamo] Remove an unused variable in AOTDispatchAutograd (#137989)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137989
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-16 16:37:19 +00:00
753ba5d30a Move basic dependencies install to requirements-ci (#138024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138024
Approved by: https://github.com/huydhn
ghstack dependencies: #137991, #137992, #138023
2024-10-16 16:21:33 +00:00
4c8718d8e7 [dynamo] add torch.compiler.set_stance (#137504)
Attempt # 2 at https://github.com/pytorch/pytorch/pull/132926 to implement https://github.com/pytorch/pytorch/issues/123771.

Implement a new `torch.compiler.set_stance` function that can force `torch.compile` regions to run eagerly.

See added tests for usage examples.
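A minimal usage sketch (the stance strings are assumptions based on the description above; see the tests for the full semantics):

```python
import torch

@torch.compile
def f(x):
    return x * 2

torch.compiler.set_stance("force_eager")   # compiled regions now run eagerly
f(torch.randn(4))
torch.compiler.set_stance("default")       # restore normal compilation
```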

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137504
Approved by: https://github.com/yf225, https://github.com/jansel
2024-10-16 16:18:25 +00:00
960c3bff98 [c10d] Refactor CUDAEventCache Create to use deque rather than stack (#138048)
We used a LIFO stack to store the CudaEvents in the cache. We like a FIFO deque better, so aside from improving the readability of the code, we use a deque instead. As @wconstab pointed out, both methods are equally correct because the moment we put an event into the stack/deque it is already ready for reuse; this change is mostly a preference change, not trying to fix anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138048
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040
2024-10-16 14:44:39 +00:00
932ae131fb Remove an unused variable in _inductor/codegen/simd.py (#138000)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138000
Approved by: https://github.com/Skylion007
2024-10-16 13:54:21 +00:00
f3d7a02716 Avoid some dangling reference warnings (#132535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132535
Approved by: https://github.com/aaronenyeshi
2024-10-16 13:41:12 +00:00
0c63de9755 [dynamo] Remove an unused variable in AutogradFunctionApplyVariable (#137985)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137985
Approved by: https://github.com/zou3519
2024-10-16 13:08:45 +00:00
15722debfb Remove two unused variables in _functorch/partitioners.py (#137998)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137998
Approved by: https://github.com/Skylion007
2024-10-16 10:58:31 +00:00
9aba0b91c8 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreading doesn't work yet; this adds Python-side TLS only for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-16 09:28:32 +00:00
af91661368 [compiled autograd] directly use python Logger class in cpp (#137953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953
Approved by: https://github.com/jansel, https://github.com/yf225
2024-10-16 09:28:32 +00:00
7f88bf96f9 test_execution_trace.py: Use instantiate_device_type_tests to run GPU tests on HPU as well (#133975)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

**CHANGES**

- Add support for HPU devices within the payload function.
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Expand the supported_activities() function to include checks for torch.profiler.ProfilerActivity.HPU.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133975
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
2024-10-16 07:53:06 +00:00
deaf0418b2 [2/N] Fix clang-tidy warnings in torch/csrc/api/ (#136998)
Follows #134545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136998
Approved by: https://github.com/ezyang
2024-10-16 07:50:59 +00:00
f4158558aa [c10d] disable watchdog thread in blockingWait mode (#138001)
Summary:
Blocking wait mode is not widely used, but it is probably useful for debugging.
In blockingWait mode, we don't need to enable the watchdog thread to check for
timeouts or NCCL errors, because the main thread will throw an exception if an
error happens; it is then obvious to the user which work failed, and it is the
user's responsibility to handle the exception.
Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #137799
2024-10-16 07:42:22 +00:00
78632b97b1 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit f43c4d28b8f955fe1f2b80f193815edadc95507b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))
2024-10-16 07:26:34 +00:00
7480e6938d [inductor] Add LoopBody.op_counts (#137945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137945
Approved by: https://github.com/eellison
ghstack dependencies: #137946
2024-10-16 06:35:10 +00:00
0d7b2118ed [inductor] Refactor triton dtype helpers (#137946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137946
Approved by: https://github.com/eellison
2024-10-16 06:35:10 +00:00
97f7fc1d31 Support retry when building Docker images (#138012)
Similar to https://github.com/pytorch/test-infra/pull/5759, I'm seeing flaky network errors from time to time when building Docker images, for example https://github.com/pytorch/pytorch/actions/runs/11352439248/job/31575206417.

So, adding retries to mitigate this class of flaky failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138012
Approved by: https://github.com/atalman
2024-10-16 06:10:41 +00:00
084657e012 [c10d] Fix data corruption bug after CUDAEventCache is enabled (#138040)
Here is why using `CUDAEventCache` causes crashes and data corruption.
1. The deleter is doing its job and appends the event to the stack.
2. In create, instead of getting a reference, we get a copy of eventsArray_[i] (which is a std::vector). This is bad because we never actually remove the element from the stack: we think we popped the last one, but it turns out it is still in the stack, so we end up reusing the same event again and again. What's worse, since we keep adding new events to the stack, the stack eventually explodes and a crash happens.

The fix is easy: just get a reference. A local torchtitan run sees a non-NaN loss.

We also want to use a deque instead of a stack and refactor the code a bit to make it more readable (in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138040
Approved by: https://github.com/kwen2501, https://github.com/shuqiangzhang
2024-10-16 05:20:29 +00:00
f43c4d28b8 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 05:03:08 +00:00
60b4858977 [BE][Docker] Don't update scikit-learn (#138023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138023
Approved by: https://github.com/huydhn
ghstack dependencies: #137991, #137992
2024-10-16 05:01:40 +00:00
7f6e85bb93 [BE] Move numpy installation logic to requirements-ci.txt (#137992)
And slightly adjust the versioning logic, as the current one seems to exist to hide version conflicts:
 - 1.21.2 for Python-3.9
 - 1.24.2 for Python-3.10 (to resolve conflict with numba-0.55.2)
 - 1.26.2 for Python-3.11 or 3.12
 - 2.1.2 for Python-3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137992
Approved by: https://github.com/Skylion007, https://github.com/huydhn
ghstack dependencies: #137991
2024-10-16 04:30:29 +00:00
12f4d91e84 Enable Python-3.13 builds on MacOS (#138037)
All logic changes happen in builder repo, namely:
 - a01e87535b
 - bcd0972459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138037
Approved by: https://github.com/huydhn
ghstack dependencies: #138041
2024-10-16 04:24:12 +00:00
66b39fd474 refactor KERNEL_MPS via resuing KERNEL (#137831)
# Motivation
Reuse `KERNEL` to simplify `KERNEL_MPS` for mps autocast code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137831
Approved by: https://github.com/malfet
2024-10-16 03:54:13 +00:00
2c94c54f10 Export XPU libs to be public (#136974)
# Motivation
Export XPU-related libs to be public. Now they are included in `TORCH_LIBRARIES`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136974
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-10-16 03:41:01 +00:00
80f3ee41dc [SymmetricMemory] fix incorrect numel caculations that are using int as std::accumulate's accumulator (#138038)
Fixes https://github.com/pytorch/pytorch/pull/137567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138038
Approved by: https://github.com/weifengpy
2024-10-16 03:34:26 +00:00
75109682b6 [Pipelining] Refactor Interleaved1F1B and ZeroBubble (#137783)
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`; let me know if there are any concerns.

`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR refactors the implementation into `Interleaved1F1B` and then removes `ScheduleFlexibleInterleaved1F1B`, since it is confusing to have both schedules with similar names. This also refactors the zero bubble logic to belong in the `ZeroBubble` schedule class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
2024-10-16 03:05:14 +00:00
809ff3b274 Add host-side Triton TMA support to Dynamo (#137677)
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.

Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
2024-10-16 02:18:48 +00:00
dd2ae7d0c9 [BE] Use x in [foo, bar] (#138041)
As shorthand for `x == foo or x == bar`
And `x not in [foo, bar]` as shorthand for `x != foo and x != bar`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138041
Approved by: https://github.com/huydhn
2024-10-16 01:57:37 +00:00
64ccebd2e0 update labeler for module: compiled autograd (#137954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137954
Approved by: https://github.com/yf225
2024-10-16 01:56:21 +00:00
aa28062169 [ROCm] TunableOp more unit test follow-up - Part 2 (#134517)
More unit tests to cover TunableOp functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134517
Approved by: https://github.com/jeffdaily
2024-10-16 01:49:47 +00:00
7fa7333299 [Distributed][Test] Fix todo in distributed test files (#136836)
Refactor distributed test code:
- Fix TODO: (rohan-varma): remove model
- Fix TODO: add comments for TestTraverse
- Migrate deprecated method call `load_state_dict` and `save_state_dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136836
Approved by: https://github.com/kwen2501
2024-10-16 01:15:12 +00:00
a1b22e369b [c10d] add an API to get the future result(success or failure) of a collective and customize error handling (#137799)
Summary:
This PR is trying to let users know which exact collective call from the Python thread is failing, and
to customize their own error-handling function, instead of the watchdog thread crashing everything.

This is potentially very useful in fault-tolerant training, in which we can have in-process restart.
E.g., when an NCCL error is detected, users can potentially abort comms, re-init comms, and go back to the previous checkpointed step and try again, instead of crashing the whole job.

This is to allow users to check the status of each collective call, using the
ivalue::future libs in PT core. This also allows users to attach their own
customized failure-handling functions via:
work.get_future_result().then(error_handling_func)

Note that the above call is also non-blocking for the CPU thread.
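A minimal usage sketch (the handler and the collective are assumptions; it presumes an already-initialized process group):

```python
import torch
import torch.distributed as dist

def error_handling_func(fut):
    # Inspect the WorkResult and, e.g., abort/re-init comms and retry from the
    # last checkpoint instead of letting the watchdog crash the job.
    result = fut.value()
    ...

tensor = torch.ones(8).cuda()                        # assumed placement
work = dist.all_reduce(tensor, async_op=True)
work.get_future_result().then(error_handling_func)   # non-blocking on the CPU thread
```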
Test Plan:
Added a new test, test_get_future_result, to verify the WorkResult is
correctly propagated to users.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-10-16 00:20:09 +00:00
8d9c9727c0 aten | Fix set but unused variables warning in release builds. (#138008)
Summary: Fixing a warning that happens only in release builds.

Test Plan: Sandcastle + dependent diffs

Reviewed By: boguscoder

Differential Revision: D64415854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138008
Approved by: https://github.com/boguscoder, https://github.com/Skylion007
2024-10-16 00:05:39 +00:00
46ec4ad021 Add code pointer to internal Meta implementation (#137984)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137984
Approved by: https://github.com/albanD
2024-10-15 23:35:22 +00:00
4557f6e339 Revert "[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)"
This reverts commit bf0b67059882933574f71a3b11b2f0127915ee5b.

Reverted https://github.com/pytorch/pytorch/pull/137669 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing test_public_bindings in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/137669#issuecomment-2415331274))
2024-10-15 23:22:58 +00:00
19665f4619 [fake_tensor][cache] Supports ops with tuple of output tensors (#137935)
This is needed for invoke_subgraph work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137935
Approved by: https://github.com/masnesral
2024-10-15 22:15:07 +00:00
5d5783a263 Improve the scheduling of _pipelined_multi_all_gather_and_consume (#137850)
```
Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.

Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers

Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.

Ideal scenario 0 - "mv" overlaps with the first shard_consumer:

stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:

stream 0:       [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer:

stream 0: [ shard_consumer ]               [ cp ][ shard_consumer ]
stream 1:                   [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 0 - "b" is scheduled after the first shard_consumer:

stream 0:       [ shard_consumer ]         [ cp ][ shard_consumer ]
stream 1: [ mv ]                  [b][ cp ][ shard_consumer ]

We haven't yet figured out a way to ensure "mv" and "b" are either
overlapped with or scheduled before the first shard_consumer. Thus, to
prevent suboptimal scenarios, we are giving up the chance to overlap "mv"
and "b" with the first shard_consumer for now.
```

This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends to not overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. The following is an example of pipelining PyTorch's cutlass-based, row-wise scaling fp8 kernel:

Before this PR:
<img width="298" alt="image" src="https://github.com/user-attachments/assets/81e0a7f4-18ee-47c6-b258-04fdaca7a6a2">

With this PR:
<img width="253" alt="image" src="https://github.com/user-attachments/assets/982de5a8-da1e-4a8f-b67e-c9c869b0a77f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137850
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805, #137836
2024-10-15 21:35:14 +00:00
2ae1a4caa1 Improve the scheduling of _pipelined_produce_and_all2all (#137836)
```
Parallelization strategy: every rank issues independent compute
-> barrier -> p2p copy sequences on two streams. In addition to
computation/communication overlapping, the strategy allows for
computation/computation overlapping, greatly reducing
quantization inefficiency.

Ideally, stream activities would look like this ("b" for
barriers, "cp" for p2p copies):

[rank 0]
stream 0:         [  chunk_producer  ][b][ cp ][  chunk_producer ][b][ cp ]
stream 1: [  chunk_producer  ][b][ cp ][  chunk_producer  ][b][ cp ]

[rank 1]
stream 0:         [  chunk_producer  ][b][ cp ][  chunk_producer ][b][ cp ]
stream 1: [  chunk_producer  ][b][ cp ][  chunk_producer  ][b][ cp ]

Note that the barriers synchronize streams with the same ID
across ranks. They don't synchronize streams on the same rank.

Since the work on both streams is independent, there's no
guarantee that the chunk_producer from stream 0 or stream 1 will
be scheduled first. If there is a scheduling mismatch across
ranks, the barrier forces all ranks to wait for the slowest.

When scheduling mismatches occur among ranks, the stream
activities might look like this (note that p2p copies from
different streams cannot overlap with each other):

[rank 0]
stream 0: [  chunk_producer  ][b        ][ cp ][  chunk_producer ][b       ][ cp ]
stream 1:         [  chunk_producer  ][b]      [ cp ][  chunk_producer  ][b]      [ cp ]

[rank 1]
stream 0:         [  chunk_producer  ][b]      [ cp ][  chunk_producer  ][b]      [ cp ]
stream 1: [  chunk_producer  ][b        ][ cp ][  chunk_producer  ][b      ][ cp ]

To prevent this, we need to ensure that the chunk_producer on
stream 1 gets scheduled first on every rank. Without access to
the underlying kernels, CUDA offers no API to control the
scheduling order of two independent, overlapping kernels. Our
solution is to issue a small sleep kernel in stream 0. The sleep
duration is insignificant, but having an extra task in stream 0
will almost guarantee that the chunk_producer on stream 1 gets
scheduled first. Once the first chunk_producer is scheduled in
the correct order, there's very little room for the scheduling
order of subsequent kernels to be inconsistent across ranks.
```

Currently, we perform stream synchronization to ensure scheduling order. The stream synchronization has no bearing on correctness, but prevents inconsistent scheduling orders across ranks.

Without the stream synchronization, ranks may have inconsistent scheduling order, and the barriers cause all ranks to wait for the slowest rank:
<img width="379" alt="image" src="https://github.com/user-attachments/assets/ffb97e76-7e19-4449-b121-83c32ec3e91d">

With stream synchronization, the inconsistent scheduling order issue is addressed, but we lose compute/compute overlapping (this is the state before this PR):
<img width="378" alt="image" src="https://github.com/user-attachments/assets/4cb76246-625f-4fc1-b49a-823ae46d3f23">

With this PR, we get both consistent scheduling order across ranks and compute/compute overlap:
<img width="327" alt="image" src="https://github.com/user-attachments/assets/51ab1bdc-4f60-46e0-b53c-6d208e2d4888">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137836
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805
2024-10-15 21:35:14 +00:00
ef541c1a65 [fused_all_gather_scaled_matmul] support rowwise scaling (#137805)
This PR adds support for `A_scale` to be a row-wise scale. The op automatically detects whether the row-wise scale is sharded or replicated. When the row-wise scale is sharded, the op all-gathers the scale in a pipelined fashion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137805
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738
2024-10-15 21:35:14 +00:00
05edaeaded [fused_scaled_matmul_reduce_scatter] support rowwise scaling (#137738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137738
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137643
2024-10-15 21:35:14 +00:00
91bc9dc2c9 [SymmetricMemory] implement timeout for barrier(), put_signal() and wait_signal() (#137643)
Suggested by @lw for better safety/reliability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137643
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-10-15 21:35:14 +00:00
eaec72d1e6 Link directly to new Custom Ops Landing Page (#137933)
e.g., click on first link in https://docs-preview.pytorch.org/pytorch/pytorch/137933/library.html#testing-custom-ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137933
Approved by: https://github.com/zou3519
2024-10-15 21:18:21 +00:00
aef4317ec8 [c10d] socket: retry connection timeout failures (#138003)
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to IPv4 for the remainder of the connection timeout.

The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.

Example failure:
```
 socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
 socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```

From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
```
ETIMEDOUT
              Timeout while attempting connection.  The server may be
              too busy to accept new connections.  Note that for IP
              sockets the timeout may be very long when syncookies are
              enabled on the server.
```

Test plan:

CI for backwards compatibility

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
2024-10-15 21:17:05 +00:00
bf0b670598 [Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Fixes https://github.com/pytorch/pytorch/issues/114369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-15 20:52:58 +00:00
28a521e29a [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow (size 4) in c10::IValue::IValue() (#137924)
Summary: Calling `pop()` on empty stack

Test Plan: CI

Differential Revision: D64332420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137924
Approved by: https://github.com/Skylion007
2024-10-15 20:42:47 +00:00
3ecec0c90c skip lintrunner install on Windows. (#137981)
`lintrunner` does not support Windows x64. Ref: https://pypi.org/project/lintrunner/#files

When we install Python dependencies via `pip install -r requirements.txt` on Windows x64, it fails on `lintrunner`.
<img width="887" alt="image" src="https://github.com/user-attachments/assets/e3815177-e893-41ae-96af-8b39d12f74a7">

Solution: skip installing `lintrunner` on Windows.
Reference doc: https://peps.python.org/pep-0508/#environment-markers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137981
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2024-10-15 20:37:26 +00:00
35fc24fbed [PGNCCL] Fix bugs in non-blocking mode (#137741)
### Fix 1: Throw async error during init wait

Previously we just busy-waited for `ncclSuccess`; if the nonblocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, nor does it block until that point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
2024-10-15 20:35:39 +00:00
370d66d7dd aten/buck | Appropriately convert clang => msvc compiler_flags. (#137944)
Summary:
fPIC is not available in clang on Windows - filter it out.
Also configure the flags appropriately for MSVC.

Reviewed By: rameshviswanathan

Differential Revision: D64365660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137944
Approved by: https://github.com/mwdavis84, https://github.com/ChristianK275, https://github.com/boguscoder
2024-10-15 20:21:01 +00:00
487873f7ca [Inductor]: Support updated Triton AttrsDescriptor (#137757)
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialization of the descriptor parameters, and creation via a static method instead of the class constructor.

Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.

Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
2024-10-15 19:34:59 +00:00
534fa96f2d Expose option to disable CRC-32 computation during torch.save (#137735)
Option only works in open source, not internal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735
Approved by: https://github.com/albanD
2024-10-15 19:30:02 +00:00
3cc8c8b944 [FSDP2] Add set_unshard_in_backward(bool) (#137922)
For some expert use cases, the user knows some parameters are not required for backward, so we can skip the unshard in backward. One example is the embedding weight.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137922
Approved by: https://github.com/weifengpy
2024-10-15 19:11:14 +00:00
60cf72e028 enable auto functionalize v2 by default (#136685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136685
Approved by: https://github.com/zou3519
ghstack dependencies: #137760
2024-10-15 19:04:42 +00:00
05b6200ccd Do not compute base in export mode (#137760)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137760
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-10-15 19:04:42 +00:00
f5e38f65c5 [FlexAttention] Support training bias for eager (#136910) (#137526)
This PR is Part 2 of the implementation started in https://github.com/pytorch/pytorch/pull/136910, rolling in the updates from https://github.com/pytorch/pytorch/pull/137451. The original was reverted due to calls to `torch.library` at `import torch` time, so a call to register at the first call to `ModIndex` was added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137526
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-10-15 18:55:22 +00:00
cd292908e5 Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c48fe8901114aa2b0a9c2d77f915a2ad8ab2098b.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/clee2000 due to breaking some internal tests, probably usages of string_view that need to be changed? ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2414775064))
2024-10-15 18:55:09 +00:00
e1e6417d4c Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a Python codegen file, similar to `caffe2/perfkernels/hp_emblookup_codegen.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-15 18:52:44 +00:00
b09d6f3a7d [EZ][BE] Delete 3.8 specific checks (#137991)
As we no longer support 3.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137991
Approved by: https://github.com/Skylion007
2024-10-15 18:45:49 +00:00
524fe784ec BundledAutotuneCache (take 2) (#137902)
Summary:
Add a cache to combine individual autotune caches into a single cached bundle. We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Attempt 2 of #134959 (D60677499).

Various configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version
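A minimal sketch of turning the bundled cache on (either knob listed above should work; the config module path is an assumption):

```python
import os

# Via the environment variable...
os.environ["TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE"] = "1"

# ...or via the Inductor config (assumed to live in torch._inductor.config).
import torch._inductor.config as inductor_config
inductor_config.bundled_autotune_remote_cache = True
```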

Test Plan:
unit tests

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- on a cold run we got 0 hits and 40 misses. On a warm run it got 40 hits and 0 miss.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<<
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D64336043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902
Approved by: https://github.com/oulgen
2024-10-15 18:39:47 +00:00
bf77f52895 Fix memory leak on masked Tensor (#137890)
Note that this reverts the change from https://github.com/pytorch/pytorch/pull/137815 as well which is not needed anymore!

Without this, you create an unbreakable reference cycle. It is unbreakable because part of the cycle is through the autograd graph, which we cannot traverse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137890
Approved by: https://github.com/atalman, https://github.com/huydhn, https://github.com/Skylion007
2024-10-15 18:37:55 +00:00
0b7ef196cd Use filelock to build extension_device backend one at a time (#137930)
Fixes https://github.com/pytorch/pytorch/issues/136125
Fixes https://github.com/pytorch/pytorch/issues/137026
Fixes https://github.com/pytorch/pytorch/issues/137027

The compilation fails during `setUpClass`, so disabling the test doesn't help.  My theory for this flaky issue is that `test_open_device_registration` from both `TritonExtensionBackendTests` and `ExtensionBackendTests` are run in parallel and one is cleaned up while the other is still in flight, causing the flaky failure.

Here is an example failure https://github.com/pytorch/pytorch/actions/runs/11331105492/job/31512603585#step:22:1710
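
A hedged sketch of the serialization idea, not the actual test code; `build_extension_backend` is a hypothetical stand-in for the expensive build done in `setUpClass`, and the lock path is arbitrary:

```python
import os
import tempfile

from filelock import FileLock  # third-party filelock package

LOCK_PATH = os.path.join(tempfile.gettempdir(), "torch_extension_backend_build.lock")

def build_extension_backend():
    # Hypothetical stand-in for the real cpp_extension build in setUpClass.
    pass

def set_up_class_like():
    # Only one test class at a time may build (and later clean up) the backend.
    with FileLock(LOCK_PATH):
        build_extension_backend()
```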

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137930
Approved by: https://github.com/malfet
2024-10-15 17:46:28 +00:00
60eb3fccfa Revert "[ONNX] Remove ExportTypes (#137789)"
This reverts commit 3e0b83ad1f0a998ef8a72c5e82d9250ab800cce5.

Reverted https://github.com/pytorch/pytorch/pull/137789 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))
2024-10-15 17:40:06 +00:00
2831af39c4 Revert "[ONNX] Remove deprecated export_to_pretty_string (#137790)"
This reverts commit d0628a7e3921639f62d6a6fec9f9f1871e087533.

Reverted https://github.com/pytorch/pytorch/pull/137790 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))
2024-10-15 17:40:06 +00:00
dac0b4e62b Revert "Add SVE implementation of embedding_lookup_idx (#133995)"
This reverts commit 770c134998d3422bc2fa3b90baa235ed0c409e62.

Reverted https://github.com/pytorch/pytorch/pull/133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I'm wondering if this just needs a targets change for buck? ([comment](https://github.com/pytorch/pytorch/pull/133995#issuecomment-2414596554))
2024-10-15 17:23:50 +00:00
d4d687ffb2 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit 4a8e49389c33934234dc89616fd17a58e760e2e7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:16 +00:00
9af4e0d2aa Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit a6eb0205225fce7ba7a75d200566613b84aff4e9.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:15 +00:00
44653895cc override bool(), is_nonzero for real tensor tracing (#136788)
Fixes bool() and is_nonzero() calls for real tensor tracing, non-strict export

Differential Revision: D63482693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136788
Approved by: https://github.com/ezyang
2024-10-15 17:13:44 +00:00
bdbe0cfffa Fix test_binary_ufuncs.py for NumPy 2 (#137937)
Related to #107302

The following tests failed in test_binary_ufuncs.py when testing with NumPy 2.

```
FAILED [0.0050s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_float32 - AssertionError
FAILED [0.0048s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_float32 - AssertionError
FAILED [0.0028s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_shift_limits_cpu_uint8 - OverflowError: Python integer -100 out of bounds for uint8
```

This PR fixes them.

More details:
* `test_shift_limits` failed because `np.left_shift()` and `np.right_shift()` no longer support negative shift values in NumPy 2.
* `test_scalar_support` failed because NumPy 2 changed its dtype promo rules. We special-cased the incompatible cases by changing the expected dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137937
Approved by: https://github.com/albanD
2024-10-15 17:04:24 +00:00
e4d7676c1b [CPU] Expand torch.special.i1 to Half and BF16 (#137899)
To match behavior of `torch.special.i0`

Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849

Also, add explicit high-precision template specialization for  `calc_i0` and `calc_i1` for `BFloat16` and `Half`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899
Approved by: https://github.com/Skylion007
2024-10-15 17:00:58 +00:00
4abe38bc94 RMSprop docs: add missing input "epsilon" (#137854)
Adds the missing input argument (epsilon) to the docs for RMSprop, like in the docs for AdamW: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html
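
For reference, a minimal example passing the now-documented argument:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
# eps is the term added to the denominator for numerical stability,
# mirroring the epsilon argument documented for AdamW.
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-8)
```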

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137854
Approved by: https://github.com/janeyx99
2024-10-15 16:40:42 +00:00
822aa588bc Fix torch_np/test_basic for NumPy 2 (#137814)
Related to #107302

`TestExport.test_exported_objects` in `test/torch_np/test_basic.py` is failing with NumPy 2.
The test is checking if all methods under `torch._numpy` exist in `numpy`.
However, some of them are removed in NumPy 2.

This PR fixes the issue by not checking the removed methods when running with NumPy 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137814
Approved by: https://github.com/albanD
2024-10-15 16:40:28 +00:00
120fbe9caa Update inductor benchmark time to avoid flakiness (#137900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137900
Approved by: https://github.com/laithsakka
2024-10-15 16:17:04 +00:00
966a1a971e [ROCm] Add AMDSMI support for UUID input (#129741)
Adds support for using UUIDs for AMDSMI utilities in PyTorch via CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129741
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2024-10-15 15:56:30 +00:00
17ed403644 [ROCm] Enable test_triton* in test_sparse_csr suite (#137712)
All test_triton* UTs are now passing on ROCm within test_sparse_csr suite. See logs here: https://ossci-raw-job-status.s3.amazonaws.com/log/31376189926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137712
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-10-15 15:41:21 +00:00
5689e33cfe [Intel GPU] Fix Windows linkage issue due to invisible structured kernel symbols (#137794)
The Intel GPU ATen library (libtorch_xpu) utilizes `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control visibility, while `TORCH_API` is controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because the macro serves more than the `TORCH_API` semantics. As a result, for this library the semantics of `TORCH_API` amount to `hidden` symbol visibility.

https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99

Therefore, we need to use `TORCH_XPU_API` to decorate the produced structured kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
2024-10-15 15:31:37 +00:00
3361908fc5 torch/ao/quantization/utils.py: Moving eps to targeted device to avoid device mismatch issue (#135204)
MOTIVATION

We recently verified some quantization tests on devices other than CPU (e.g., CUDA and Intel Gaudi devices identified as 'hpu'). We noticed a device mismatch error because eps is a tensor created on CPU while the other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the targeted _device_.

CHANGES

Move eps to the _device_ of the other tensors.
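
A minimal sketch of the idea, not the exact `utils.py` code; the helper name below is hypothetical:

```python
import torch

def scale_with_eps(min_val_neg: torch.Tensor, max_val_pos: torch.Tensor, quant_max: int):
    # Create eps on the same device as the other tensors so the clamp below
    # never mixes CPU and CUDA/HPU tensors.
    eps = torch.tensor(torch.finfo(torch.float32).eps, device=min_val_neg.device)
    scale = (max_val_pos - min_val_neg) / float(quant_max)
    return torch.max(scale, eps)
```
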
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-10-15 14:58:55 +00:00
cef6c3dcb0 Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which runs in fp16. We also tried keeping the bmm in fp16 and only upcasting for the epilogue, but that led to worse numerics because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-15 14:54:56 +00:00
b7f798caa4 Use C10_UNUSED instead of (void)X (#137239)
Summary:
Auto-generated with
```
buck run //scripts/rbarnes/regex_multiline_replacer:regex_multiline_replacer -- --find '^(\s*for\s*\()(const.*\n)\s*\(void\)[A-Za-z]+;\s*//\s*Suppress.*\s*\n(.*)'  --replace '\1C10_UNUSED \2\3' `find caffe2/ -regex ".*\.\(cpp\|h\)"`
```

Differential Revision: D33432600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137239
Approved by: https://github.com/Skylion007
2024-10-15 14:32:59 +00:00
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
5141ade8e3 [AMD] Do not skip 0-byte send/recv (#137952)
Summary: With https://github.com/ROCm/rccl/pull/1376, we can remove this hack now, and we have verified that we no longer run into the hang

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-xdwang-900def406a?job_attempt=0&version=1&env=PRODUCTION

Differential Revision: D64370817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137952
Approved by: https://github.com/eqy
2024-10-15 09:35:03 +00:00
b7be4b1e48 [AMD] Turn on fast path for index_put (#136136)
Summary:
This slow path is bad because it has a sync point, which makes the CPU side really slow. I'm not very sure AMD actually needs this with the newer ROCm version

{F1870213925}

Test Plan: CI

Differential Revision: D62731130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136136
Approved by: https://github.com/danzimm, https://github.com/jeffdaily, https://github.com/eqy
2024-10-15 08:39:17 +00:00
f42d1b6fa1 Fix Intel GPU test failure due to unsupported bool for unfold (#137873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137873
Approved by: https://github.com/etaf, https://github.com/desertfire
2024-10-15 07:58:51 +00:00
cyy
8c860aef0d [Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942)
Reland of #137328, which was reverted due to reverting a dependent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137942
Approved by: https://github.com/eqy
2024-10-15 07:47:24 +00:00
56cc22eb01 [CI][Distributed] Not to test distributed_test.py with UCC (#137932)
Some UCC tests became unstable recently, with or without the M60 to T4 upgrade.
See for example: #137855 (without upgrade), #137161 (with upgrade).
So I am extracting the disablement from #137161 here.

Failure signature:
```
RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:496] [Rank 0][ProcessGroupUCC-0][READY]failed to post triggered collective, error code -6: Unhandled error, system error code 0
```

Earlier discussed here:
https://github.com/pytorch/pytorch/pull/137161/files#r1797353294

Cc: @Aidyn-A @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137932
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/eqy
2024-10-15 07:22:57 +00:00
5b442e8e92 Time torch_key computation in overall Dynamo stats (#137877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137877
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-10-15 05:47:19 +00:00
5c3ba6faff Add fbscribelogger to Dynamo benchmark runner (#137867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867
Approved by: https://github.com/bobrenjc93
2024-10-15 04:36:41 +00:00
ed94725b8c log ViewAndMutationMeta to trace_structured (#133784)
I ended up bundling it into the existing tlparse logs for the AOT forward graph, since it looked like registering it as a separate artifact requires changes to tlparse itself (maybe that is wrong though?)

Example new fw AOT graph tlparse output for the below code: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp70zKiO/0_0_0/aot_forward_graph_2.txt

```
import torch

@torch.compile
def f(x):
    out1 = torch.view_as_complex(x)
    out2 = torch.view_as_complex(x)
    return out1, out2, x * 2

x_ = torch.randn(4, 2, requires_grad=True, dtype=torch.float64)
out = f(x_)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133784
Approved by: https://github.com/ezyang
2024-10-15 02:49:02 +00:00
cyy
70206499f1 [3/N] Fix extra warnings brought by clang-tidy-17 (#137552)
Follows #137459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137552
Approved by: https://github.com/ezyang
2024-10-15 02:33:44 +00:00
a6eb020522 Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-10-15 01:53:28 +00:00
b34db401f2 Add support for div in tensorify_python_scalars fx pass (#137623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137623
Approved by: https://github.com/ezyang
2024-10-15 01:49:46 +00:00
8316f9b2a0 Fix autograd function calls without context arg (#137809)
Fixes an issue where if the context arg is not provided, Dynamo would throw an arg mismatch error.

The skips are there because Dynamo would previously fall back to eager on those tests due to the arg mismatch error.
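
A hedged sketch of the kind of `autograd.Function` this targets, written in the ctx-less `forward` + `setup_context` style; the function itself is made up for illustration:

```python
import torch

class ScaledMul(torch.autograd.Function):
    @staticmethod
    def forward(x, scale):  # no ctx argument here
        return x * scale

    @staticmethod
    def setup_context(ctx, inputs, output):
        _, scale = inputs
        ctx.scale = scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.scale, None

@torch.compile(backend="eager")
def fn(x):
    return ScaledMul.apply(x, 2.0)

fn(torch.randn(3, requires_grad=True)).sum().backward()
```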

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137809
Approved by: https://github.com/drisspg
2024-10-15 01:25:47 +00:00
a89cf2b59a [dynamo] Don't codegen temporary cells for pre-existing cells (#137907)
This patch removes tempvar codegen for the `NewCellVariable` that has
`AttributeMutationExisting`, because these tempvars will never get used.
Note that tempvar codegen for other objects also follows this pattern,
i.e., it only fires on `AttributeMutationNew`.

To visualize, in the following program, we'll see the modified bytecode
contains redundant `make_cell` calls, and stores the result to a local
`tmp_2` which is never used again.

```python
import torch

def test_write_cell():
    count = torch.ones(1)
    def inc():
        nonlocal count
        count = count + 1

    @torch.compile()
    def fn():
        inc()

    fn()

test_write_cell()
```

```
$ TORCH_LOGS="bytecode" TORCH_LOGS_FORMAT="short" python test.py

......
    0 COPY_FREE_VARS           1
    2 RESUME                   0
    4 LOAD_GLOBAL              9 (NULL + __compiled_fn_2)
   14 LOAD_DEREF               3 (inc)
   16 LOAD_ATTR                6 (__closure__)
   36 LOAD_CONST               1 (0)
   38 BINARY_SUBSCR
   42 LOAD_ATTR                4 (cell_contents)
   62 CALL                     1
   70 STORE_FAST               0 (graph_out_0)
   72 LOAD_GLOBAL              0 (__import_torch_dot__dynamo_dot_utils)
   82 LOAD_ATTR                3 (NULL|self + make_cell)
  102 CALL                     0
  110 STORE_FAST               2 (tmp_2)
  112 LOAD_FAST                0 (graph_out_0)
  114 LOAD_CONST               1 (0)
  116 BINARY_SUBSCR
  120 LOAD_DEREF               3 (inc)
  122 LOAD_ATTR                6 (__closure__)
  142 LOAD_CONST               1 (0)
  144 BINARY_SUBSCR
  148 STORE_ATTR               2 (cell_contents)
  158 DELETE_FAST              0 (graph_out_0)
  160 RETURN_CONST             0 (None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137907
Approved by: https://github.com/anijain2305
2024-10-15 00:49:45 +00:00
1cf78bbf62 Refactored debug_extra to be on ChoiceCaller (and called description) (#137857)
Before:
<img width="644" alt="image" src="https://github.com/user-attachments/assets/17b0fa8a-37c8-494b-8914-9d42c3db4bef">

After:
<img width="1292" alt="image" src="https://github.com/user-attachments/assets/5ee59747-a34f-4dd6-b943-cb5a53d52080">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137857
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/masnesral
ghstack dependencies: #137768
2024-10-15 00:48:14 +00:00
3630398509 Move symbolic_shapes create_env back to INFO (#137926)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137926
Approved by: https://github.com/Skylion007
2024-10-15 00:37:01 +00:00
406db6a73d Improve ASAN path detection (#137865)
Follows #137335, for better adoption of latest clang to ASAN jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137865
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-14 23:54:46 +00:00
aef3591998 [Profiler] Add Test for Clear on Fork (#137511)
Summary: Tests Fix Clear On Fork by forking a process after a profile has already been done. Afterwards we check that all the PID/TID are as expected.

Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'

Differential Revision: D63992036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-10-14 23:20:33 +00:00
0786b37260 [MPS] Add i0 op (#137849)
More-or-less verbatim copy of 47c8aa8090/aten/src/ATen/native/Math.h (L101)
Plus a bit of MPS boilerplate code

Update test_mps.py to mark kaiser_window and i0 as passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137849
Approved by: https://github.com/Skylion007
2024-10-14 22:50:01 +00:00
18587f2427 [BE] Use std::enable_if_t in Math.h (#137920)
PyTorch is C++17 project, so let's use some C++17 convenience methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137920
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-14 22:20:09 +00:00
8ac06467d4 Forward fix test (#137910)
Summary: Add back in a deleted file to fix test

It was removed in https://github.com/pytorch/pytorch/pull/137404

Test Plan: `buck2 build --flagfile fbcode//mode/opt fbcode//caffe2/test/cpp/c10d:ProcessGroupGlooAsyncTest` succeeded

Differential Revision: D64341074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137910
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/kit1980
2024-10-14 22:07:29 +00:00
ad134fe038 Skip doc test internally (#137813)
Summary:
there are some path issues when we run the doc tests internally

https://www.internalfb.com/intern/test/281475143872621

Test Plan: sandcastle

Reviewed By: drisspg, msaroufim

Differential Revision: D64255824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137813
Approved by: https://github.com/HDCharles
2024-10-14 21:29:15 +00:00
7911bf591d [CUDA][Inductor] Fix some bfloat16 tests for SM70 (#137675)
Unsure about the `runtime_checks` changes as that's a pure pattern-match and guess

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137675
Approved by: https://github.com/eellison, https://github.com/jansel
2024-10-14 20:42:48 +00:00
6016b8a9be Remove CI/CD python 3.8 requirements (#137893)
Python 3.8 is deprecated from CI/CD. No reason to have these pins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137893
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD, https://github.com/kit1980
2024-10-14 20:28:48 +00:00
3b7710316c Revert "cublaslt autotuning support for TunableOp (#133896)"
This reverts commit 19bbbef79da8ed32f72d6e76517cb639d5db6c00.

Reverted https://github.com/pytorch/pytorch/pull/133896 on behalf of https://github.com/clee2000 due to this is breaking internal builds, I've copied what I think is the most relevant part of the log below. I believe the job running internally uses an old version of cuda, could you put guards to make sure compilation still works on an older version of cuda/cublaslt? ([comment](https://github.com/pytorch/pytorch/pull/133896#issuecomment-2412180893))
2024-10-14 20:28:09 +00:00
df0c2f5cae Revert "[Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)"
This reverts commit 25ac5652d003c5526f496bd1e2cdfbe697c58ba4.

Reverted https://github.com/pytorch/pytorch/pull/137328 on behalf of https://github.com/clee2000 due to need to revert this in order to revert #133896, please rebase and reland, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/137328#issuecomment-2412143739))
2024-10-14 20:22:26 +00:00
674d59359d [ROCm] Enable dist sharded_tensor test suites (#137724)
The following test suites are enabled on ROCm:
test_sharded_tensor
test_sharded_tensor_reshard
test_sharding_plan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137724
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-10-14 20:20:57 +00:00
39d21ed803 [Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458)
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes.

Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).

With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.
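
A hedged sketch of the detection pattern described above; the import path is an assumption based on Triton releases of this era, and PyTorch must still work when Triton is absent:

```python
try:
    from triton.compiler.compiler import AttrsDescriptor  # assumed location
    HAS_ATTRS_DESCRIPTOR = True
except ImportError:
    AttrsDescriptor = None  # PyTorch must still build and run without Triton
    HAS_ATTRS_DESCRIPTOR = False
```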

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
2024-10-14 20:20:29 +00:00
11e4232b42 Revert "[Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872)" (#137891)
This reverts commit e688b78791d01bd91614a61e57726c32beb46ee4.

We're reverting this because:
1) The original PR (#134872) fixed a bug but caused another one. The
   assessment is that the bug it caused is worse than the bug it fixed.
2) it was reverted on the release 2.5 branch, so we want to prevent
   divergence
3) The original author is out-of-office for a while so we don't want the
   divergence to wait until they're back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137891
Approved by: https://github.com/Skylion007
2024-10-14 20:12:58 +00:00
41c4aa9f7a [pipelining] rename prev_/next_stage vars to clarify (#137739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137739
Approved by: https://github.com/H-Huang
2024-10-14 20:12:18 +00:00
78299d75b7 [ScaledMM] More Large shape tuning (#137832)
Fixes a bug in the check from the previous PR. Also, after some more performance tuning at very large sizes, we found that when N > M it is valuable to transpose; otherwise performance is better untransposed:

If you look at the absolute Tflops I think we still have some room for improvement!
### Perf

Here are some TFLOP deltas at larger sizes where green is the positive gain in TFLops at different values of K

![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_delta_heatmap](https://github.com/user-attachments/assets/dcd009a5-1e4f-449c-b852-a92bb7db66e3)

<details>
<summary>### Different Values of K</summary>
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K24576_tflops_delta_heatmap](https://github.com/user-attachments/assets/8c043f6c-b8aa-48a9-bd5d-3ec6f39818cd)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K16384_tflops_delta_heatmap](https://github.com/user-attachments/assets/41a4b9f4-2749-4a84-b9c7-bddc2c2334c0)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K12288_tflops_delta_heatmap](https://github.com/user-attachments/assets/68d42421-cfa9-4a0a-a5a5-9f6db80bf609)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K8192_tflops_delta_heatmap](https://github.com/user-attachments/assets/c03906a0-5de7-463e-96a8-85f1774b3af6)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K6144_tflops_delta_heatmap](https://github.com/user-attachments/assets/d697b2d0-efc9-4ea8-9002-d517f3abaf50)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K4096_tflops_delta_heatmap](https://github.com/user-attachments/assets/06f8ef5c-277f-45ca-a44f-ed2e54d4133a)
</details>

<details>
<summary>### Absolute Tflops</summary>

## Old
![large_shape_old_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/8872506b-0ff1-400e-8d11-71eff6d8d59a)

## New
![update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/9fc9ec24-ff1a-4b47-8934-72d181677d14)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137832
Approved by: https://github.com/vkuzo
2024-10-14 20:02:52 +00:00
d64492e4cb Increase verbosity of inductor cache hit/miss to INFO level (#137876)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137876
Approved by: https://github.com/Skylion007
2024-10-14 19:59:31 +00:00
eqy
914c90dcea [NCCL][CUDA] Set PYTORCH_C10_DRIVER_API_SUPPORTED in ProcessGroupNCCL.cpp compilation (#137828)
Otherwise `expandable_segments()` is hardcoded to false in `CUDAAllocatorConfig.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137828
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-10-14 19:38:23 +00:00
19918a1863 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
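
A hedged illustration of the failing pattern (with this fix it should run); the function and shapes are made up:

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2, x * 3  # the second output goes unused downstream

    @staticmethod
    def backward(ctx, g1, g2):
        # g2 arrives materialized as zeros shaped like the unused NJT output;
        # allocating those zeros is what used to fail for nested ints.
        return g1 * 2

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged, requires_grad=True
)
used, _unused = TwoOutputs.apply(nt)
used.backward(torch.ones_like(used))
```
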
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer, https://github.com/huydhn
2024-10-14 19:31:50 +00:00
197601eeea Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107)
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**

(also mentioned in here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552)

## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict =
{
    'state': {
    0: {'momentum_buffer': tensor(...), ...},
    1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
        'lr': 0.01,
        'weight_decay': 0,
        ...
        'params': [0, 1],
        'param_names': ['layer.weight', 'layer.bias']  # optional
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.

## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.

## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.

#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.

## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.

## Solution Example:

A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict :
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping --  param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]

    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]

    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initialized with `named_parameters()` so that the 'param_names' key is present in the dict.
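
A hedged usage sketch wiring up the hook above; `adapt_state_dict_ids` is the function defined in the previous block and `saved_state_dict` is assumed to come from the other optimizer:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)

# The hook signature hook(optimizer, state_dict) -> state_dict matches
# Optimizer.register_load_state_dict_pre_hook.
optimizer.register_load_state_dict_pre_hook(adapt_state_dict_ids)
optimizer.load_state_dict(saved_state_dict)  # ids are remapped by the hook
```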

### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-10-14 19:24:44 +00:00
4470339fbb [dynamo] Fix an error in _dynamo.compiled_autograd.reset() (#137889)
----

* From https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137889
Approved by: https://github.com/Skylion007
2024-10-14 18:21:18 +00:00
929797dedb Fix test_matmul_offline_tunableop by writing its output files to a temp dir (#137835)
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:

```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
    os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```

For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097

The test tried to catch and ignore this, but this is Windows.  So, the fix is to:

1. Ignore the error if these files cannot be removed
2. Write them to a temp directory instead; otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy (a small sketch follows this list)
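
A minimal sketch of the two points above; the CSV name is taken from the error log, and in the real test the file is written by TunableOp rather than created by hand:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "tunableop_untuned0.csv")
    open(filename, "w").close()  # stand-in for the file TunableOp would write
    try:
        os.remove(filename)
    except PermissionError:
        pass  # on Windows another process may still be holding the file open
```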

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
2024-10-14 17:28:33 +00:00
f8a5b7170a Revert "Fix autograd.Function + NJT when an output grad is None (#136875)"
This reverts commit 76ab1ab66560213701943ecde368aedcd5de08e5.

Reverted https://github.com/pytorch/pytorch/pull/136875 on behalf of https://github.com/jbschlosser due to Caused memory leak ([comment](https://github.com/pytorch/pytorch/pull/136875#issuecomment-2411665776))
2024-10-14 16:00:44 +00:00
47bb494e49 Add support for sub in tensorify_python_scalars fx pass (#137622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622
Approved by: https://github.com/ezyang
ghstack dependencies: #137620
2024-10-14 15:37:29 +00:00
f246507f28 Add support for add in tensorify_python_scalars fx pass (#137620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620
Approved by: https://github.com/ezyang
2024-10-14 15:10:27 +00:00
a77145ae2f Selective Activation Checkpointing (SAC) Estimator for estimating memory and recomputation time trade-offs. (#135208)
This PR adds a Selective Activation Checkpointing (SAC) Estimator, built on top of the `Runtime Estimator`, for estimating memory and recomputation time trade-offs.
It provides a `TorchDispatchMode`-based context manager that estimates the memory and runtime trade-offs of functions or `torch.nn.Modules` for SAC, using the `Runtime Estimator` #134243 under the hood to support two estimation modes: 'operator-level-benchmark' and 'operator-level-cost-model' (roofline model). The SAC Estimator provides detailed statistics and metadata information for operators of each module, including greedy order for selecting operators to be recomputed/checkpointed and per-module trade-off graphs. This estimator is designed to be used under FakeTensorMode and currently supports estimation of compute time and memory usage.

It's inspired from: [XFormers SAC](https://github.com/facebookresearch/xformers/blob/main/xformers/checkpoint.py) by @fmassa

End-to-end example:

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 8, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        inp = torch.randint(
            0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev
        )

        sace = SACEstimator()
        with sace(estimate_mode_type='operator-level-cost-model'):
            loss = model(inp).sum()
        loss.backward()
        sace.pwlf_sac_tradeoff_curve(n_segments=2, save_tradeoff_graphs=True)
        sace.display_modulewise_sac_stats(depth=4, print_tabular=True)
```

  Example AC Stats for one of the transformer layers:

![Screenshot 2024-10-11 at 10 09 13 PM](https://github.com/user-attachments/assets/1cf85564-4319-4732-bba1-89d505cda6ab)

Example AC Trade-off for one of the transformer layers:

![Screenshot 2024-10-11 at 10 09 58 PM](https://github.com/user-attachments/assets/5b2f343c-7e73-4c7d-bfea-3dcef2caa362)

Example AC Trade-Off graph for one of the transformer layers:

![Transformer layers 3](https://github.com/user-attachments/assets/490d4b37-a916-4298-a14c-f78ffecbbde2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135208
Approved by: https://github.com/weifengpy
2024-10-14 13:56:40 +00:00
0e4d42634e Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-14 10:33:43 +00:00
770c134998 Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similar to `caffe2/perfkernels/hp_emblookup_codegen.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-14 10:17:27 +00:00
cyy
c48fe89011 Make c10::string_view an alias of std::string_view (#130417)
In order to facilitate the migration from c10::string_view to std::string_view, the old c10::string_view was renamed to c10::string_view_ext.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-10-14 09:28:04 +00:00
41977a0531 Revert "Port Inductor dataclasses to be kw_only (#137768)"
This reverts commit 65d665bae5b82a54b819c0c4527e7ccf88d19427.

Reverted https://github.com/pytorch/pytorch/pull/137768 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seem to fail test_loop_ordering in trunk ([comment](https://github.com/pytorch/pytorch/pull/137768#issuecomment-2409203115))
2024-10-13 22:25:19 +00:00
08ce3aac62 Cache some ValueRanges (#137438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137438
Approved by: https://github.com/ezyang
2024-10-13 19:23:34 +00:00
b361cd01f1 profiler: Fix undefined reference to unwind_c in unwind_entry while LTO is enabled (#137862)
With LTO (Link Time Optimization) enabled in CFLAGS, some compilers optimize away and strip the unwind_c function because the compiler cannot resolve the reference correctly, thus breaking the build with an undefined reference in unwind_entry. Add an attribute to avoid this bad situation.

Fixes #121282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137862
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-13 19:04:58 +00:00
c09b567a91 Fixed error string assertion in test_invalid_devices (#137772)
The ROCm distribution returns a different error string for this operation, so the test fails this assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137772
Approved by: https://github.com/Skylion007
2024-10-13 18:10:07 +00:00
65d665bae5 Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-13 14:55:45 +00:00
cfc5d18aad [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-13 14:42:58 +00:00
b181652f3d [AOTI] Handle inplace output in ProxyExecutor (#137660)
Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660
Approved by: https://github.com/chenyang78
2024-10-13 14:42:58 +00:00
cyy
a90b920284 Install llvm18 packages for ASAN workflows (#137335)
Follows #128763
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137335
Approved by: https://github.com/ezyang
2024-10-13 13:49:38 +00:00
4a8e49389c Make Context to be Device-agnostic Step by Step (1/N) (#136519)
----

- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-13 12:38:02 +00:00
563e9f99c3 Revert "Add device agnostic API for accelerator hooks (#137480)"
This reverts commit 858c91c3d8d9a71c66d0357e51a4bd805f95599f.

Reverted https://github.com/pytorch/pytorch/pull/137480 on behalf of https://github.com/albanD due to break all builds on trunk ([comment](https://github.com/pytorch/pytorch/pull/137480#issuecomment-2408954802))
2024-10-13 12:12:37 +00:00
08576b254b Fix logging in socket.cpp (#137745)
The formatter should avoid throwing exceptions as much as possible.

Fixes https://github.com/pytorch/pytorch/pull/128673#discussion_r1796226656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137745
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2024-10-13 10:38:10 +00:00
fe8d66d9a6 Faster Faster BatchSampler (#137423)
Builds upon #76951.

Benchmarking code is the same as in #76950.

AMD Ryzen Threadripper PRO 3995WX:
```
  batch_size  drop_last      origin     new  speedup
------------  -----------  --------  ------  ---------
           4  True           0.94    0.5706  64.74%
           4  False          0.9745  0.9468  2.93%
           8  True           0.7423  0.3715  99.82%
           8  False          0.7974  0.5666  40.73%
          64  True           0.5394  0.2085  158.76%
          64  False          0.6083  0.2697  125.51%
         640  True           0.5448  0.1985  174.41%
         640  False          0.7085  0.2308  206.91%
        6400  True           0.5554  0.2028  173.88%
        6400  False          0.7711  0.2109  265.60%
       64000  True           0.556   0.2091  165.82%
       64000  False          0.7803  0.2078  275.58%
```

When `drop_last == True`, it uses `zip` to speed things up.
When `drop_last == False`, it uses `itertools` to speed things up.

`itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now.

Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py).
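
A minimal sketch of the two strategies described above, not the actual `sampler.py` code:

```python
from itertools import islice

def fast_batches(sampler, batch_size, drop_last):
    if drop_last:
        # zip the same iterator with itself batch_size times; zip stops at the
        # shortest input, which drops the final incomplete batch.
        it = iter(sampler)
        yield from ([*batch] for batch in zip(*[it] * batch_size))
    else:
        # islice keeps yielding until exhaustion, so the last (possibly
        # smaller) batch is still produced.
        it = iter(sampler)
        while batch := list(islice(it, batch_size)):
            yield batch

print(list(fast_batches(range(10), 4, drop_last=True)))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(list(fast_batches(range(10), 4, drop_last=False)))  # ... plus the final [8, 9]
```
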
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423
Approved by: https://github.com/Skylion007
2024-10-13 09:36:03 +00:00
b3af359cba Log WorkNCCL exception string to C10dLogger (#137736)
Summary: In WorkNCCL::handleException, log to c10d logger with `strings["work_nccl_exception"]`.

Test Plan: Test run job to verify NCCL exception is logged.

Differential Revision: D62603322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137736
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-10-13 07:33:05 +00:00
858c91c3d8 Add device agnostic API for accelerator hooks (#137480)
Make `AcceleratorHooksInterface` consistent for multiple accelerators
- Add `showConfig` and `deviceSynchronize` method declaration in `AcceleratorHooksInterface`
- Remove unreachable lines of code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137480
Approved by: https://github.com/albanD, https://github.com/FFFrog
2024-10-13 07:19:32 +00:00
7642f6d047 [AMD] Unify cublaslt and hipblaslt path (#137604)
Differential Revision: D63967918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137604
Approved by: https://github.com/eqy
2024-10-13 07:11:12 +00:00
fa08e924ad Skip test export with fake tensor inputs on cuda devices for Intel GPU (#137847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137847
Approved by: https://github.com/etaf, https://github.com/jansel
2024-10-13 07:07:48 +00:00
e3df636580 Fix -Wsign-compare warning spam in Indexing.cu (#137842)
Detailed Descriptions:

Fix for warning spam like
```
warning: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘long int’ [-Wsign-compare]
```
![image](https://github.com/user-attachments/assets/7be3cfff-c33b-4a6e-b52d-04085e6e1bec)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137842
Approved by: https://github.com/ezyang
2024-10-13 07:03:12 +00:00
1d6932937e [dynamo] fix NamedTupleVariable for PyStructSequence (torch.return_types.*) support (#137776)
PyStructSequence is the C API equivalent of `collections.namedtuple` in Python, but they have different constructors:

```python
tuple = NamedTupleType(*args)
tuple = NamedTupleType._make(args)
tuple = StructSequenceType(args)
```
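
A small runnable example contrasting the two, assuming `torch.return_types.max` as the PyStructSequence instance:

```python
import torch
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p1 = Point(1, 2)          # namedtuple: positional args
p2 = Point._make([1, 2])  # namedtuple: from an iterable

values, indices = torch.tensor([3.0]), torch.tensor([0])
m = torch.return_types.max((values, indices))  # structseq: a single sequence arg
print(m.values, m.indices)
```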

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137776
Approved by: https://github.com/jansel
2024-10-13 06:46:41 +00:00
3050f2e5dd [dynamo] Check nn modules parameters are not overwritten before taking tracing shortcut (#137824)
Fixes https://github.com/pytorch/pytorch/issues/136257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137824
Approved by: https://github.com/jansel
2024-10-13 05:04:28 +00:00
09e2a0d7bc fix PyTorch build with Address Sanitizer enabled (#137446)
**Problem**
Building PyTorch with Address Sanitizer (ASAN) enabled was failing due to a static assertion in KernelFunction_impl.h. The compiler was unable to evaluate FuncPtr::func_ptr() as a constant expression when ASAN was enabled, causing a build error.

```
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o
/usr/bin/ccache /usr/bin/g++-11 -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D_GLIBCXX_SANITIZE_STD_ALLOCATOR -D_GLIBCXX_SANITIZE_VECTOR -Dtorch_cpu_EXPORTS -I/home/abhishekk/stantize/venv/pytorch/build/aten/src -I/home/abhishekk/stantize/venv/pytorch/aten/src -I/home/abhishekk/stantize/venv/pytorch/build -I/home/abhishekk/stantize/venv/pytorch -I/home/abhishekk/stantize/venv/pytorch/cmake/../third_party/benchmark/include -I/home/abhishekk/stantize/venv/pytorch/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/build/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/nlohmann -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api/include -I/home/abhishekk/stantize/venv/pytorch/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/../aten/src -I/home/abhishekk/stantize/venv/pytorch/torch/csrc -I/home/abhishekk/stantize/venv/pytorch/third_party/miniz-2.1.0 -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/include -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/src -I/home/abhishekk/stantize/venv/pytorch/third_party/cpp-httplib -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/.. -I/home/abhishekk/stantize/venv/pytorch/third_party/FXdiv/include -I/home/abhishekk/stantize/venv/pytorch/c10/.. 
-I/home/abhishekk/stantize/venv/pytorch/third_party/pthreadpool/include -I/home/abhishekk/stantize/venv/pytorch/third_party/cpuinfo/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/home/abhishekk/stantize/venv/pytorch/third_party/NNPACK/include -I/home/abhishekk/stantize/venv/pytorch/third_party/FP16/include -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/build/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/abhishekk/stantize/venv/pytorch/third_party/fmt/include -I/home/abhishekk/stantize/venv/pytorch/third_party/flatbuffers/include -isystem /home/abhishekk/stantize/venv/pytorch/build/third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/abhishekk/stantize/venv/pytorch/third_party/protobuf/src -isystem /home/abhishekk/stantize/venv/pytorch/third_party/XNNPACK/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/eigen -isystem /home/abhishekk/stantize/venv/pytorch/INTERFACE -isystem /home/abhishekk/stantize/venv/pytorch/third_party/nlohmann/include -isystem /home/abhishekk/stantize/venv/pytorch/build/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include/openmpi -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_PYTORCH_QNNPACK -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_SVE_CPU_DEFINITION -DHAVE_SVE256_CPU_DEFINITION -g -fno-omit-frame-pointer -Og -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -c 
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp
In file included from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction.h:260,
                 from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:4,
                 from /home/abhishekk/stantize/venv/pytorch/torch/library.h:63,
                 from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:3:
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h: In instantiation of ‘static c10::KernelFunction c10::KernelFunction::makeFromUnboxedFunction(FuncPtr) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; bool AllowLegacyTypes = false]’:
/home/abhishekk/stantize/venv/pytorch/torch/library.h:133:59:   required from ‘torch::CppFunction::CppFunction(FuncPtr, std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t>) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t> = std::nullptr_t]’
/home/abhishekk/stantize/venv/pytorch/torch/library.h:691:17:   required from ‘torch::Library& torch::Library::impl(Name, Func&&, torch::_RegisterOrVerify) & [with Name = const char*; Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’
/home/abhishekk/stantize/venv/pytorch/torch/library.h:782:16:   required from ‘torch::Library& torch::Library::impl(torch::detail::SelectiveStr<true>, Func&&) & [with Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:87:9:   required from here
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:177:39: error: non-constant condition for static assertion
  177 |     static_assert(FuncPtr::func_ptr() != nullptr, "Kernel function cannot be nullptr");
      |                   ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
```

**Testing**

- Verified that PyTorch builds successfully with USE_ASAN=ON
- Ran PyTorch test suite to ensure no regressions were introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137446
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-10-13 03:31:54 +00:00
70bd58c35f Revert "Add support for add in tensorify_python_scalars fx pass (#137620)"
This reverts commit 0430e72e755d2c1953917ffb78f00c516eb4bbd5.

Reverted https://github.com/pytorch/pytorch/pull/137620 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk 0430e72e75 ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))
2024-10-13 02:05:37 +00:00
279052ab86 Revert "Add support for sub in tensorify_python_scalars fx pass (#137622)"
This reverts commit b7924610a0c20f72657548acef7743801189444a.

Reverted https://github.com/pytorch/pytorch/pull/137622 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk 0430e72e75 ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))
2024-10-13 02:05:37 +00:00
5fee1ee3f4 [inductor] Refactor generate_workspace_allocation (#137673)
And some other small changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137673
Approved by: https://github.com/Chillee
ghstack dependencies: #137754
2024-10-13 01:25:14 +00:00
5146e6a96d [inductor] Fix reduction_hint sum to single element (#137754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137754
Approved by: https://github.com/Chillee, https://github.com/malfet
2024-10-13 01:08:23 +00:00
b7924610a0 Add support for sub in tensorify_python_scalars fx pass (#137622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622
Approved by: https://github.com/ezyang
ghstack dependencies: #137620
2024-10-13 00:30:02 +00:00
bd63ec4f45 [ROCm] LoadHIP CMake cleanup (#137112)
Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313

While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined.

This PR does not fix any bugs or provide any new functionality. Strictly clean-up.

The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-13 00:06:41 +00:00
47c8aa8090 Refactor make device agnostic in accelerator hooks (#137558)
Make `AcceleratorHooksInterface` consistent for multiple accelerators
- Add `getDeviceFromPtr` method declaration in `AcceleratorHooksInterface`
- Fix clangtidy warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137558
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2024-10-12 18:13:54 +00:00
0430e72e75 Add support for add in tensorify_python_scalars fx pass (#137620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620
Approved by: https://github.com/ezyang
ghstack dependencies: #136674, #137588
2024-10-12 17:18:27 +00:00
e89fe0bd6e Updating cuda binary build to get cusparselt from PYPI (#137653)
Fixes #137374
Update 1: this PR requires Meta to upload the PyPI package to download.pytorch.org.
See: ERROR: Could not find a version that satisfies the requirement nvidia-cusparselt-cu12==0.6.2; platform_system == "Linux" and platform_machine == "x86_64" (from torch) (from versions: none)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137653
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/atalman
2024-10-12 16:40:37 +00:00
ed55d356de [alt] fix unroll in successive unflatten (#137646)
We use nn_module_stack in unflatten to recognize when module calls begin and end. However, the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of the next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls into a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.

Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)

Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.

Test Plan: Like D64014936

Differential Revision: D64136277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
2024-10-12 15:53:52 +00:00
561f07fae7 Warn users of mkldnn device usage (#137553)
In https://github.com/pytorch/pytorch/issues/136831, a user creates a tensor on the mkldnn device, but mkldnn is no longer used as a device type; only the mkldnn layout is used.

We plan to remove mkldnn-device-related code in a future release. This PR warns users not to use the mkldnn device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137553
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-12 13:42:12 +00:00
0dbbcfa7ae [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947)
[Inductor UT] Generalize Newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_pattern_matcher.py`
reuse `test/inductor/test_snode_runtime.py`
reuse `test/inductor/test_unbacked_symints.py`
fix `test/inductor/test_triton_kernels.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-12 13:21:20 +00:00
030ba03681 Add meta functions for lerp, addcmul, and addcdiv. (#136909)
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was the root cause of significant
overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923
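
For context, a minimal sketch (not from the PR) of what a meta kernel provides: shape and dtype inference on the `meta` device without allocating or touching real data.

```python
import torch

a = torch.empty(128, 64, device="meta")
b = torch.empty(128, 64, device="meta")
w = torch.empty(128, 64, device="meta")

out = torch.lerp(a, b, w)          # only the meta function runs
print(out.shape, out.device)       # torch.Size([128, 64]) meta

acc = torch.addcmul(a, b, w, value=0.5)
print(acc.shape, acc.dtype)        # torch.Size([128, 64]) torch.float32
```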

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
2024-10-12 12:40:46 +00:00
6001b16597 Add entire _dynamo.config as a json for logging (#137216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137216
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-12 11:48:59 +00:00
a777dea3b3 Remove dtype check on meta device (#136774)
Summary:
# Latest Update

This diff is no longer needed because we did need the check to exist, to make meta behave the same as other devices, see D54526190.

---------------------------------

# Background

T176105639

| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| --- | --- | --- | --- | --- |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good | failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |

Currently we are in case A. Users need to add `use_fp32_embedding` in training to force the embedding bag dtype to be fp32. However, users actually want case B, using fp16 as the embedding bag weight. When they delete `use_fp32_embedding`, they fail the [check](https://fburl.com/code/k3n3h031) that forces `weight dtype == per_sample_weights dtype` in meta_registration.

The check is actually not necessary: the fbgemm backend does support case B. Additionally, later on in `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to have the same dtype (https://fburl.com/code/q0tho05h, weight is src, per_sample_weights is scale) for `is_fast_path_index_select`.

# This diff
Therefore, this diff removes the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in the meta forward. With this, users are able to use fp16 as the emb bag dtype without needing to force per_sample_weights to the same dtype in the meta forward (see Test Plan).

# Reference diffs to resolve this issue
Diff 1: D52591217
This passes the embedding bag dtype to feature_processor to make per_sample_weights the same dtype as the emb bag weight. However, `is_meta` also needs to be passed because of case C: fbgemm still does not support per_sample_weights = fp16 (see the table above). Therefore users are forced to make per_sample_weights fp16 only when it is on meta. The solution requires too many hacks.

Diff 2: D53232739
Basically does the same thing as diff 1 (D52591217), except that the hack is added in the TorchRec library: an if branch in EBC and PEA so that when the emb bag weight is fp16, per_sample_weights is forced to fp16 too. However, this then runs into the fbgemm issue as well and has broken a bunch of prod models.

Test Plan:
# APS
The following command will run icvr_launcher which triggers ads_launcher and run forward in meta device:
```
buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false
```

Result:
 {F1461463993}

Reviewed By: ezyang

Differential Revision: D54175438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774
Approved by: https://github.com/ezyang
2024-10-12 05:45:21 +00:00
92cc319120 Fix masked tensor test_stack memory leak (#137815)
This test is currently failing in trunk when the memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970.  When testing locally, calling `backward` on a masked tensor always causes a memory leak until I clean up the data and the mask manually.  This is probably related to this warning from masked tensor: `UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()`, but I don't know enough about the internal details here to go further.  So, let's just fix the test first.
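
For reference, a hedged sketch of the workaround the warning points at (the actual test fix may differ): build the MaskedTensor from detached data instead of a tensor that requires grad.

```python
import torch
from torch.masked import masked_tensor

data = torch.randn(4, requires_grad=True)
mask = torch.tensor([True, False, True, False])

# Follow the warning: hand MaskedTensor a detached copy of the data.
mt = masked_tensor(data.clone().detach(), mask, requires_grad=True)
stacked = torch.stack([mt, mt])    # the op exercised by test_stack
```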

### Testing

```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda
```

passes and doesn't warn about memory leak anymore.

The test itself came from https://github.com/pytorch/pytorch/pull/125262#issuecomment-2344068012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137815
Approved by: https://github.com/kit1980
2024-10-12 04:30:57 +00:00
c8609cf4b0 [inductor] Update Triton CPU pin (#137778)
This incorporates the fix in
https://github.com/triton-lang/triton/pull/4871.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137778
Approved by: https://github.com/Skylion007
2024-10-12 03:09:09 +00:00
d52b2cf92f [CUDA][SDPA] Fix TF32 handling and bump threshold for multiheadattention test (#137752)
For sm90, the main issue was that `torch.testing.assert_close` bypasses the `tf32_on_and_off` tolerance-switch decorator.
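
As a hedged illustration (not the exact test change): `torch.testing.assert_close` takes explicit `rtol`/`atol`, so TF32-sized tolerances have to be passed or bumped by hand rather than relying on the decorator.

```python
import torch

a = torch.randn(64, 64)
b = a + 1e-4 * torch.randn(64, 64)
# Tolerances loosened explicitly, independent of tf32_on_and_off.
torch.testing.assert_close(a, b, rtol=2e-3, atol=2e-3)
```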

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137752
Approved by: https://github.com/ezyang
2024-10-12 03:05:21 +00:00
2db3f85894 Fixes NumPy 2 test failures in test_torch.py (#137740)
Related to #107302

The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2.
This PR fixes all the corresponding test failures in `test_torch.py`.

1. The dtype of the return value of `np.percentile` when passed a `torch.float32` tensor.
NumPy 1: Return value of `np.float64`.
NumPy 2: Return value of `np.float32`.
Solution: Enforce it with `.astype(np.float64)`.

2. The type of `np.gradient()` when returning multiple arrays.
NumPy 1: A list of arrays.
NumPy 2: A tuple of arrays.
Solution: Cast the tuple to a list.
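
A hedged sketch of the two adjustments (the actual test code in test_torch.py may differ):

```python
import numpy as np
import torch

t = torch.rand(10)

# 1. Normalize the dtype of np.percentile's result across NumPy 1/2.
p = np.percentile(t.numpy(), 50).astype(np.float64)

# 2. np.gradient returns a tuple under NumPy 2, a list under NumPy 1;
#    cast so downstream checks see the same type.
grads = list(np.gradient(np.ones((3, 3))))
```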
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740
Approved by: https://github.com/ezyang
2024-10-12 02:40:17 +00:00
eqy
6be53d52c5 [CUDA][SDPA] Bump tolerances for grad_query in mem_eff test (#137750)
(for sm80)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137750
Approved by: https://github.com/drisspg
2024-10-12 02:15:14 +00:00
67883e70c0 change GPT2ForSequenceClassification inference accuracy tolerance (#136749)
Fixes https://github.com/pytorch/pytorch/issues/123503.

https://github.com/pytorch/pytorch/pull/121866 makes GPT2ForSequenceClassification hit SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with single-threaded BF16 inference. This PR increases the model tolerance from 4e-3 to 5e-3 to make the check pass. Note that the issue is due to small implementation differences. For example, the sdpa math backend scales q and k before matmul for stability; the flash attention backend has more differences as a new algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136749
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-12 01:12:28 +00:00
fba2c0a23a Fix comment in ProcessGroupGloo (#137746)
Summary: Algorithm caching was removed in 2018 D13111781

Test Plan: CI

Differential Revision: D64214527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137746
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-10-12 01:04:41 +00:00
69bcf1035e Updates reference to _runner-determinator.yml workflow, from current version to main version. (#137791)
Updates all references to the runner determinator workflow (`_runner-determinator.yml`) from the currently cloned version to the main version.

This enables the team to push updates to this workflow, like bug fixes or improvements, and have them immediately reflected on all open PRs, avoiding potentially breaking situations and enabling fast iteration and simple recovery in case of bugs.

From:

```
jobs:
  get-label-type:
    uses: ./.github/workflows/_runner-determinator.yml
```

To:

```
jobs:
  get-label-type:
    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro
2024-10-12 00:18:50 +00:00
e269a5cb09 [TCPStore] Throw value error if passing world_size=0 to TCPStore (#137792)
This fixes https://github.com/pytorch/pytorch/issues/137577.
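
A hedged sketch of the new validation (the exact error message may differ):

```python
from torch.distributed import TCPStore

try:
    TCPStore("127.0.0.1", 29500, world_size=0, is_master=True)
except ValueError as exc:
    print("rejected:", exc)   # world_size=0 is now rejected up front
```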

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137792
Approved by: https://github.com/fegin, https://github.com/H-Huang
ghstack dependencies: #137713, #137721
2024-10-11 23:42:57 +00:00
25ac5652d0 [Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)
Follows #124485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137328
Approved by: https://github.com/eqy
2024-10-11 23:23:57 +00:00
8486d3df69 [Profiler] Hide ProfilerStep Alignment behind Experimental Config (#137668)
Summary: Aligning the ProfilerStep# annotation can be useful for visual purposes, but it causes downstream tools like HTA to misreport how long each step took. For this reason, let's give users the option to turn on this alignment manually, but turn it off by default.

Test Plan:
Alignment off:

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_11_48.2543945.pt.trace.json.gz&bucket=gpu_traces

Alignment on:

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_08_27.2518391.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D64146115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137668
Approved by: https://github.com/aaronenyeshi
2024-10-11 22:57:05 +00:00
0121d64aa9 Revert "[AOTI] Handle inplace output in ProxyExecutor (#137660)"
This reverts commit 573101aac3b1addc0a0b945ae09fe9be9034d3a9.

Reverted https://github.com/pytorch/pytorch/pull/137660 on behalf of https://github.com/desertfire due to Fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/137660#issuecomment-2408213485))
2024-10-11 22:54:39 +00:00
c58e5c4efa Revert "[AOTI] Turn on the ABI-compatible mode as default (#136534)"
This reverts commit b0da076f0cd5957c7fe55a58876f3b74babfc1b7.

Reverted https://github.com/pytorch/pytorch/pull/136534 on behalf of https://github.com/desertfire due to The dependent PR https://github.com/pytorch/pytorch/pull/137660 fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/136534#issuecomment-2408211238))
2024-10-11 22:50:58 +00:00
e3173d8725 [pipelining] Shape Inference (#136912)
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes which is difficult and error prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates the existing inference helper in the PipelineStage constructor for several reasons: it's problematic to have to reason about the stage submod being on the right device for shape inference

The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously

Currently, this does not add a barrier after shape inference, which essentially pipelines shape inference with the subsequent schedule action for that stage.  If this complicates debugging, we could add a barrier (it comes at a cost, but only during the first step).

Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-10-11 22:49:00 +00:00
432c3fe5af Default to use training IR (#137804)
Summary: Since capture_pre_autograd_graph is deprecated and will be deleted soon, we default this option to true.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D64254236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137804
Approved by: https://github.com/tugsbayasgalan
2024-10-11 22:34:28 +00:00
c254901bdb Have Triton custom extension test use privateuseone device (#137611)
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.

This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.

Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
2024-10-11 21:27:29 +00:00
19bbbef79d cublaslt autotuning support for TunableOp (#133896)
Adds support for cublaslt autotuning to TunableOp.

Todo:
- [x] Add and test `ScaledGemmTunableOp`
- [x] Benchmarking numbers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133896
Approved by: https://github.com/eqy, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-10-11 21:16:36 +00:00
1358969fa1 Revert "BundledAutotuneCache (#134959)"
This reverts commit 709021143d9c9aa90df578a2f5abb93a91a4852a.

Reverted https://github.com/pytorch/pytorch/pull/134959 on behalf of https://github.com/albanD due to The newly added test fails on rocm CI ([comment](https://github.com/pytorch/pytorch/pull/134959#issuecomment-2408091754))
2024-10-11 20:43:56 +00:00
74e871355b Add hooks to Scheduler nodes for generating device-specific debug strings (#135015)
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug Triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output, as opposed to it being closely coupled to the debug string codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
2024-10-11 20:30:49 +00:00
8543000c27 Search through config changes in compiler bisector (#137346)
Follow-up to https://github.com/pytorch/pytorch/pull/131936.  In the original bisector you'd have to test inline whether we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root-causing issues. I've added `emulate_precision_casts` and aot_eager_decomp_partition CSE as initial ones.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137346
Approved by: https://github.com/zou3519
2024-10-11 20:24:54 +00:00
513563eb09 Fix stack named "queue" in Util::ComputePostOrder (#130526)
This function computes a topological sort using a non-recursive implementation of DFS. Upon first reading, I thought it was using Kahn’s algorithm because it uses a variable called `queue`, but upon closer reading, I noticed this variable is actually used as a stack.

This pull request improves readability by renaming the stack and changing it from `std::vector` to `std::stack`.
Note: this also changes the backing store from an `std::vector` to an `std::deque`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130526
Approved by: https://github.com/alanwaketan, https://github.com/malfet
2024-10-11 20:21:07 +00:00
d0628a7e39 [ONNX] Remove deprecated export_to_pretty_string (#137790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137789
2024-10-11 20:10:04 +00:00
5fca2fd365 Try unify training and inference (#136888)
Previously inference -> inference IR was going through a seperate flow from train -> inference decomposition. This diff unifies them so that we always retrace when decomposing. Joint IR decomp is still going through old flow (inference -> inference) but seems ok for now since it is still in experimental stage.

Differential Revision: [D63062521](https://our.internmc.facebook.com/intern/diff/D63062521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136888
Approved by: https://github.com/avikchaudhuri
2024-10-11 20:09:58 +00:00
3e0b83ad1f [ONNX] Remove ExportTypes (#137789)
Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789
Approved by: https://github.com/titaiwangms
2024-10-11 19:29:52 +00:00
460358a20f Run lint-autoformat only on PRs to main (#137802)
This is mostly to prevent it from showing up on ghstack PRs, which are not compatible with code suggestions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137802
Approved by: https://github.com/huydhn
2024-10-11 19:25:34 +00:00
2cb983ab97 [CI] Adds support for selecting experiments for workflows on runner determinator (#137614)
Adds a `default` tag to experiment configurations, allowing some experiments to be excluded by default from the random draw:

```
        experiments:
            lf:
                rollout_perc: 25
            otherExp:
                rollout_perc: 25
                default: false
        ---
```

and includes a configuration option to filter which experiments are of interest for a particular workflow (comma-separated):

```
  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"
```

The end goal is to enable us to run multiple experiments that are independent of one another. For example, while we still run the LF infra experiment, we want to migrate other runners by leveraging the current solution. An immediate use case is the A100 instances, which we want to migrate to AWS.

During the migration period, those new instances will be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first, confusing one.

```
jobs:
  get-build-label-type:
    name: get-build-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...

  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"

  linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
    name: cuda12.1-py3.10-gcc9-sm80
    uses: ./.github/workflows/_linux-build.yml
    needs:
      - get-build-label-type
      - get-test-label-type
    with:
      runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}"
      ...
      test-matrix: |
        { include: [
          { config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
          ...
        ]}
      ...
```

```
experiments:
    lf:
        rollout_perc: 50
    awsa100:
        rollout_perc: 50
        default: false
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614
Approved by: https://github.com/malfet
2024-10-11 19:20:02 +00:00
709021143d BundledAutotuneCache (#134959)
Add a cache to combine individual autotune caches into a single cached bundle.  We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Various related configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version
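
A hedged sketch of turning the new cache on using the knobs listed above (assuming the config name maps onto `torch._inductor.config` as usual):

```python
import os
os.environ["TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE"] = "1"

# or, equivalently, via the inductor config:
import torch._inductor.config as inductor_config
inductor_config.bundled_autotune_remote_cache = True
```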

Testing:

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- on a cold run we got 0 hits and 40 misses; on a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D60677499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134959
Approved by: https://github.com/oulgen
2024-10-11 19:12:41 +00:00
b82000c1b3 Removed _compile workaround for create_block_mask (#137477)
I also put in a change so that `create_block_mask` properly handles sequence lengths that are not multiples of BLOCK_SIZE.
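
A hedged sketch (the flex_attention APIs were still evolving at this point) of the case this now handles: a Q/KV length that is not a multiple of the default BLOCK_SIZE of 128.

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# 1000 is deliberately not a multiple of 128; assumes a CUDA device is
# available (the default device for create_block_mask).
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=1000, KV_LEN=1000)
print(block_mask)
```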

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137477
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng
2024-10-11 19:04:23 +00:00
2dcd69da50 [inductor] Delete dead code and lints (#137753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137753
Approved by: https://github.com/Chillee
2024-10-11 18:55:08 +00:00
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
04adb74d08 [inductor][cond] Remove redundant prefix (#137718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137718
Approved by: https://github.com/eellison
ghstack dependencies: #137200
2024-10-11 18:13:18 +00:00
cd02c85ba4 [inductor][subgraph][python-wrapper] Lift subgraph code into functions (#137200)
Earlier the subgraphs were getting inlined into the output code. This PR lifts the subgraphs into a function, and then we just call the function in the output code.

This is the output code for test `test_cond_reintepret_view_inputs_outputs`

Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/

A relevant snippet from the above paste is

~~~

def false_graph_0(args):
    false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
    args.clear()
    s0 = s0
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
        triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
        del false_graph_0_arg0_1
        del false_graph_0_arg1_1
    return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    assert_size_stride(arg1_1, (s0, 20), (20, 1))
    assert_size_stride(arg2_1, (s0, 20), (20, 1))
    assert_size_stride(arg3_1, (), ())
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = [None] * 2
        buf0 = [None] * 2
        if arg3_1.item():
            # subgraph: true_graph_0
            true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
            buf0[0] = true_graph_0_buf0
            buf0[1] = true_graph_0_buf1
        else:
            # subgraph: false_graph_0
            false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
            buf0[0] = false_graph_0_buf0
            buf0[1] = false_graph_0_buf1
        del arg1_1
        del arg2_1
        del arg3_1
        buf1 = buf0[0]
        buf2 = buf0[1]
        del buf0
    return (buf1, buf2, )

~~~

The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.

Note that this PR only works for the python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate the global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).

This work will unblock hierarchical compilation (or cold start compile time work).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
2024-10-11 17:57:10 +00:00
68272ab596 Extend cuda_flip to unsigned types (#137781)
Using AT_DISPATCH_V2

Test plan: `python3 -c "import torch;print(torch.randint(0, 100, (4, 4),  dtype=torch.uint16, device='cuda').flip(0))"`
Fixes https://github.com/pytorch/pytorch/issues/137770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137781
Approved by: https://github.com/Skylion007
2024-10-11 17:02:53 +00:00
4fa46d3bda TunableOp: Performance Improvement (#135371)
This PR reduces the overhead on the CPU side by eliminating the use of c10::str in creating signatures; instead we use the fmt library. TunableOp overhead on the CPU is reduced by around 40%. The improvement is most noticeable on small GEMMs. This PR does not contain any bug fixes or new features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135371
Approved by: https://github.com/jeffdaily
2024-10-11 16:52:40 +00:00
da578495ca [ROCm] enable gfx110x for hipblaslt (#137317)
Fixes #136347.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137317
Approved by: https://github.com/Skylion007, https://github.com/jithunnair-amd

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-11 16:51:31 +00:00
41ccfc8752 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run the following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Differential Revision: [D64184625](https://our.internmc.facebook.com/intern/diff/D64184625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-11 16:50:25 +00:00
a06d49a9f9 bump up add_loop_inductor_gpu expected instruction count. (#137672)
The diff https://github.com/pytorch/pytorch/pull/137117/files increased the instruction count for add_loop_inductor_gpu,
but not enough to fail in that diff; now it's a somewhat flaky test.

It failed on a recent merge:
<img width="1351" alt="Screenshot 2024-10-09 at 5 25 57 PM" src="https://github.com/user-attachments/assets/27178f76-c08e-4d13-9ac4-4cd70f146611">

and here is the history
<img width="1047" alt="Screenshot 2024-10-09 at 5 26 07 PM" src="https://github.com/user-attachments/assets/bd563e34-6f7f-461a-ae54-8a616be9bf09">
<img width="777" alt="Screenshot 2024-10-09 at 5 30 19 PM" src="https://github.com/user-attachments/assets/d0a1ca81-2bdb-4cf6-8ac8-ba5971d447bf">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137672
Approved by: https://github.com/masnesral
2024-10-11 16:46:38 +00:00
d41558f8d7 [BE][Ez]: Better error message for CUDNN attention attn_bias (#137702)
Follow-up to #136885. Fixes an edge case in the error condition (there should be an early exit so that expand doesn't ever run into trouble with weird cases: attn_bias with 0, 1, or > 5 dims).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137702
Approved by: https://github.com/eqy
2024-10-11 16:44:46 +00:00
5835b1af10 [FSDP2] Gated dynamo import for torch deploy (#137203)
Differential Revision: [D63777335](https://our.internmc.facebook.com/intern/diff/D63777335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137203
Approved by: https://github.com/wz337
2024-10-11 16:38:19 +00:00
bdb42e7c94 [PGNCCL] Added some missing spaces in barrier msg (#137721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137721
Approved by: https://github.com/kwen2501
ghstack dependencies: #137713
2024-10-11 15:17:25 +00:00
39c5048549 [DeviceMesh] Fixed from_group when passing tensor mesh (#137713)
This fixes https://github.com/pytorch/pytorch/issues/137676. (sorry for messing this up in the original PR 😓 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137713
Approved by: https://github.com/wz337
2024-10-11 14:53:51 +00:00
e30c55ee52 Update maintainers for inductor and x86 CPU (#136839)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136839
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
2024-10-11 07:24:07 +00:00
1c71de5b2c [ScaleMM] Add a shape dependent max_swizzle size (#137681)
# Summary

I started to explore the performance of _scaled_mm against a triton-based persistent TMA kernel for RowWise scaling.
There are more details here: https://github.com/drisspg/transformer_nuggets/pull/36

It clearly showed that there was some room for improvement on larger problem sizes compared to Triton's performance. Note that the Triton kernel only has a 128x128x128 tile shape, whereas scaled_mm has a 64x128x128 tile shape that we use for smaller problem sizes, which may explain some of the perf delta at smaller shapes.

This led to seeing if we can improve our triton codegen lowering  for _scaled_mm (I think we should still do this: https://github.com/pytorch/pytorch/pull/137517).

In the meantime @Chillee suggested I make sure swizzling is set for the large matmul shapes.

This PR makes sure that we increase the max_swizzle_size for the large matmuls.

## Performance
Note: red means the Triton-based TMA kernel beats _scaled_mm; blue means _scaled_mm is faster.

On nightly w/ Triton at (2ef33c6c4c3)
![swizzle_tst_8_full_nightly_heatmaps](https://github.com/user-attachments/assets/e92af19b-4e79-4126-b9d0-da039da5363b)

You can see that as M,K,N increase there is a clear win for the Triton Persistent TMA.

After this PR:

![swizzle_tst_8_full_heatmaps](https://github.com/user-attachments/assets/472068b3-45c2-43f8-84d3-b116da7898d5)

For example, w/ this change (power-limited GPU):

M=16384  K=16384  N=16384
TFLOPs before: `985.49`
TFLOPs after: `1304.69`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137681
Approved by: https://github.com/eqy
2024-10-11 06:44:31 +00:00
4e309899c7 [Quant] Check stride > 0 for QConv and QConvTranspose (#136739)
Fixes #136722
Fixes #136718

By default, it goes to onednn, so this PR adds a check to ensure stride > 0. Now the program exits with an error message if the stride is 0.
FBGEMM and QNNPACK can create modules with stride=0 without error, but the program crashes when calling forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136739
Approved by: https://github.com/jgong5
2024-10-11 05:50:37 +00:00
fe148024fe [c10d][experimental] Add _abort_process_group (#132291)
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797

This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide the possibility of manually handling comm errors or timeouts (and potentially recovering from them).
One can find an example at: https://github.com/NVIDIA/nccl/issues/1013

## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.

## What's new in `_abort_process_group`?
It adds support for the "group abort" semantic, which is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in the multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). The `group abort` semantic was added in NCCL 2.22.

## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to the lack of a "global view" by each PG's individual watchdog. A fairly big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e., one dog watching multiple PGs).

In any case, it may not be a bad idea to experiment with the "group abort" feature via a manual API first and then extend it to the automatic mode (watchdog).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
2024-10-11 05:04:17 +00:00
bc232e3c08 Fix custom op bug of clearing dir (#137655)
Previously, when we deleted a custom op out of the context manager, we weren't clearing the dir field of the op namespace. As a result, it was polluting other tests.

Differential Revision: [D64141465](https://our.internmc.facebook.com/intern/diff/D64141465/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137655
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-10-11 04:32:40 +00:00
ee713f80ed Enable channels_last format for FSDP (#137382)
Enable FSDP to deal with tensors in channels_last memory format. Preserving the channels_last memory format keeps FSDP compatible with the best kernels cuDNN offers; a minimal sketch of the as_strided trick follows the change list below.

Summary of changes:
1) Store strides information along with shapes
2) Replace calls to flatten() with as_strided(size=(param.numel(),), stride=(1,)) for flattening
3) Replace calls to view() with as_strided with the stored sizes and strides for unflattening
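
A minimal sketch on plain tensors (not the FSDP internals) of why the as_strided-based (un)flattening preserves channels_last, while a flatten()/view() round trip forces a contiguous NCHW layout:

```python
import torch

p = torch.randn(8, 3, 4, 4).to(memory_format=torch.channels_last)
sizes, strides = p.size(), p.stride()

flat = p.as_strided((p.numel(),), (1,))      # flatten without copying
restored = flat.as_strided(sizes, strides)   # unflatten with the saved strides

print(torch.equal(restored, p))                                    # True
print(restored.is_contiguous(memory_format=torch.channels_last))   # True
print(p.flatten().view(sizes).is_contiguous(memory_format=torch.channels_last))  # False
```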

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
2024-10-11 03:47:16 +00:00
8ee361ed13 fix test_retrace_pre_autograd (#137733)
Test Plan: fixed

Differential Revision: D64200918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137733
Approved by: https://github.com/pianpwk, https://github.com/tugsbayasgalan
2024-10-11 03:46:22 +00:00
8321eec009 [Inductor UT] Generalize device bias code in test_triton_kernels.py (#137585)
[Inductor UT] Generalize device bias code in test_triton_kernels.py introduced by #137020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137585
Approved by: https://github.com/eellison, https://github.com/jansel
2024-10-11 02:00:01 +00:00
8262f6d271 fix test_lazy_module_kwargs (#137705)
Test Plan: fixed

Differential Revision: D64185644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137705
Approved by: https://github.com/tugsbayasgalan
2024-10-11 01:53:10 +00:00
9d4cb0d3eb Fix param and buffer mapping for state_dict when there are state_dict hooks (#137609)
Resolve #137540

Summary:

We might get different state_dict and named_parameters results when the module has registered custom state_dict hooks.
For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn off the hooks when saving the model's state_dict().
Previously, ExportedProgram uses nn.Module's state_dict() method to populate its own state_dict, but it doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically.

nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.

One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition".  At runtime, we have mod.layers[3].layer.sa_norm.scale.

But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.
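
A hedged, self-contained sketch of the kind of hook described above (the wrapper module and hook here are hypothetical stand-ins for the TorchTune code): the hook makes `state_dict()` report the "static" FQNs while `named_parameters()` keeps the runtime FQNs, which is the divergence ExportedProgram now sides with.

```python
import torch
from torch import nn

class FusionLayer(nn.Module):            # stand-in for the TorchTune wrapper
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def forward(self, x):
        return self.layer(x)

def strip_layer_prefix(module, state_dict, prefix, local_metadata):
    # Drop the extra "layer." level so serialized keys match the static
    # model definition, mirroring the hook described above.
    for key in list(state_dict.keys()):
        if key.startswith(prefix + "layer."):
            state_dict[prefix + key[len(prefix + "layer."):]] = state_dict.pop(key)

m = FusionLayer()
m._register_state_dict_hook(strip_layer_prefix)

print(list(m.state_dict().keys()))            # ['weight', 'bias']
print([n for n, _ in m.named_parameters()])   # ['layer.weight', 'layer.bias']
```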

Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly.

In internal Ads and RecSys model deployments, weight swapping is where they have one model that is currently deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model's pointer to the state dict over to the new state dict. Because of this, it was previously a requirement that the FQNs match between the exported and the eager model's state dict.

The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs match between the exported program's state dict and the state_dict obtained from the `_disabled_load_state_dict_hooks(M)` context manager. One benefit of this new API is that we are now in full control within export of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so it's BC.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_for_training_with_state_dict_hooks
```

Differential Revision: D64080561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609
Approved by: https://github.com/angelayi, https://github.com/pianpwk
2024-10-11 01:33:50 +00:00
a919742149 c10::optional -> std::optional in PyTorch (#137333)
Test Plan: Sandcastle

Differential Revision: D63876535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137333
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-11 00:16:10 +00:00
4fb1fd8a51 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit b6a64dce072240c0b06d2fb03ac81b3ed1b73d92.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
b55ff476bd Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit cdd8fa98c77b052085cca65dd54769ae18b72104.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
b0da076f0c [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code by default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-10 23:44:57 +00:00
ad38bad766 [MPS] Add tri[lu]_indices (#137648)
Requested in https://github.com/pytorch/pytorch/issues/77764#issuecomment-2402365980
Copy-n-paste kernel implementation from 13cf8360d8/aten/src/ATen/native/cuda/TensorFactories.cu (L92)

though use `float` instead of `double` for square root computation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137648
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601, #137647
2024-10-10 23:41:06 +00:00
573101aac3 [AOTI] Handle inplace output in ProxyExecutor (#137660)
Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660
Approved by: https://github.com/chenyang78
2024-10-10 23:12:46 +00:00
c37bb492da [ONNX] Create an optimize method in ONNXProgram (#137667)
Move optimization from the export call to the `optimize()` method in ONNXProgram.

Users can call `optimize()` before calling `save()` to save the model. Right now, if users set `optimize=True` in `torch.onnx.export`, it has the same effect as calling `optimize()`, but in the future we can evolve the method to be more flexible (e.g. target-aware, etc.).

Example

```python
onnx_program = torch.onnx.export(..., dynamo=True)
onnx_program.optimize()
onnx_program.save("model.onnx")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137667
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137666
2024-10-10 22:44:19 +00:00
e75984cd31 [ONNX] Use torch_2_6 apis from onnxscript (#137666)
Create an `optimize=False` option in `torch.onnx.export` for model optimization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137666
Approved by: https://github.com/titaiwangms
2024-10-10 22:23:15 +00:00
93bbc8abcc [dynamo, 3.13] use 3.13 multiline traceback in get_instruction_source_311 (#137617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137617
Approved by: https://github.com/jansel
2024-10-10 20:19:27 +00:00
4551a1ee79 [dynamo, 3.13] merge 3.13 FORMAT_* and <=3.12 FORMAT_VALUE (#137656)
This was causing some 3.13 failures locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137656
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #137652
2024-10-10 19:53:42 +00:00
6b2c3508f8 [dynamo, 3.13] fix typo in remove_fused_load_store (#137652)
Whoops!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137652
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-10-10 19:53:42 +00:00
9c12198137 [PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel, Try #2 (#137377)
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. First attempt was https://github.com/pytorch/pytorch/pull/136331 .

Differential Revision: [D63923166](https://our.internmc.facebook.com/intern/diff/D63923166/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137377
Approved by: https://github.com/malfet
2024-10-10 19:44:22 +00:00
080f02ac7a [dynamo] do not raise an unimplemented error with boolean masking setitem (#134902)
Cudagraphs break on boolean masking setitem; however, the code runs fine. There is no need to raise an unimplemented error here, since it already warns that it's an incompatible op.

Fixes #134241
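
A minimal sketch of the pattern in question (not the repro from the issue); under cudagraphs (mode="reduce-overhead" on CUDA) this now only warns about the incompatible op instead of raising an unimplemented error:

```python
import torch

@torch.compile
def zero_out_negatives(t):
    t[t < 0] = 0          # boolean masking setitem
    return t

print(zero_out_negatives(torch.tensor([-1.0, 2.0, -3.0])))
```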

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134902
Approved by: https://github.com/jansel, https://github.com/henrylhtsang
2024-10-10 19:11:40 +00:00
079f909263 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit be0b75256a7e516217b059ef273901b95c022fe7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:17 +00:00
33e5921e6b Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit 72ad1b8c6c7c364c1974b82a914876dcdf73af44.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:16 +00:00
881a18f25f Set Cuda context in inductor and dont initialize wrong cuda device in fake_tensor (#137603)
Previously we would construct tensors with the "cuda" device, which defaults to device 0 if no CUDA context is set. Fix for https://github.com/pytorch/pytorch/issues/124854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137603
Approved by: https://github.com/jansel
2024-10-10 18:25:22 +00:00
dd7c2899bd [dynamo] Properly prune dead cell local variables (#136891)
This patch updates the `prune_dead_locals` logic to do slightly more aggressive pruning for cell local variables, in absence of side-effects, e.g., a cell variable can be pruned when its user function(s) will never be used again.

See added tests for examples; note that a few tests in `test/dynamo/test_higher_order_ops.py` also got updated because we are no longer returning the unnecessary graph output.

Fixes #127350, #124653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136891
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
2024-10-10 18:21:24 +00:00
bcfdb72547 Fix dtype test for NumPy 2 (#137532)
Related to #107302

The following test fails with NumPy 2.

```
_________ TestNumPyInteropCPU.test_numpy_array_interface_cpu __________
Traceback (most recent call last):
  File "/usr/local/google/home/haifengj/git/pytorch_np2/test/test_numpy_interop.py", line 415, in test_numpy_array_interface
    wrapped_x = np.array([1, -2, 3, -4], dtype=dtype)
OverflowError: Python integer -2 out of bounds for uint8

To execute this test, run the following from the base repo dir:
    python test/test_numpy_interop.py TestNumPyInteropCPU.test_numpy_array_interface_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

According to the official warning from NumPy 1, assigning a negative value to a `uint8` array is deprecated.
The recommended way is `np.array([1, -2, 3, -4]).astype(np.uint8)`.
See the following for details.
```
>>> np.array([1, -2, 3, -4], dtype=np.uint8)
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -2 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -4 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
array([  1, 254,   3, 252], dtype=uint8)
```

This PR fixes the test failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137532
Approved by: https://github.com/soulitzer
2024-10-10 18:12:25 +00:00
5e73f2d7c0 [PT2][Dynamo][Optimus] Add batch detach, clamp and nan_to_num in pre grad (#137415)
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=4 OC_CAUSE=1 buck2 test '@fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_math_op_fusion
```

Buck UI: https://www.internalfb.com/buck2/185799e1-6ea8-4bd1-b2e1-0c1a8dd92f89
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275044114335
Network: Up: 14KiB  Down: 287B  (reSessionID-d24cee56-2a22-4a90-b4c6-1d0c3ab256c1)
Jobs completed: 8. Time elapsed: 48.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt scripts/shuaiyang:test -- --optimus --flow_id 648108097 2>&1 | tee ~/local_run_shuai_interformer_cmf.txt
```

Counter({'pattern_matcher_nodes': 6626, 'pattern_matcher_count': 6396, 'extern_calls': 5340, 'benchmarking.TritonBenchmarker.benchmark_gpu': 2710, 'normalization_pass': 44, 'fxgraph_cache_miss': 37, 'scmerge_split_removed': 16, 'scmerge_cat_removed': 16, 'unbind_stack_pass': 16, 'batch_aten_mul': 15, 'batch_linear_post_grad': 12, 'batch_linear': 5, 'batch_detach': 4, 'batch_nan_to_num': 4, 'batch_clamp': 4, 'batch_aten_add': 4, 'batch_layernorm': 2, 'scmerge_cat_added': 2, 'batch_sigmoid': 1, 'scmerge_split_sections_removed': 1, 'unbind_stack_to_slices_pass': 1, 'benchmarking.TritonBenchmarker.triton_do_bench': 1, 'scmerge_split_added': 1, 'fxgraph_cache_hit': 1, 'batch_aten_sub': 1})

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-10-06-20-53-01/trace.json.gz&bucket=gpu_traces

# e2e

baseline:
f650336422

proposal:

f650336607

### QPS and NE results

 {F1914975940}
{F1914975938}
{F1914975939}
{F1914975945}

> 0.7% QPS gain with NE neutral

### trace analysis

Before
 {F1914990600}

After

{F1914990015}

We reduced the green part of the trace, which was introduced by small nan_to_num kernels.

Differential Revision: D63962711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137415
Approved by: https://github.com/Yuzhen11
2024-10-10 18:11:08 +00:00
cyy
94e12f97dc [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404)
Follows #137072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404
Approved by: https://github.com/Skylion007
2024-10-10 18:05:34 +00:00
20815c7cb9 Intel GPU: mode: add XPU to supported devices list (#137575)
The kernel for the `mode` op is being ported in https://github.com/intel/torch-xpu-ops/pull/770; this requires adding XPU to the supported device types.

Additional context: https://github.com/intel/torch-xpu-ops/issues/327

@fengyuan14 @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137575
Approved by: https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-10 17:44:40 +00:00
cdd8fa98c7 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
2024-10-10 17:16:34 +00:00
9690cacd61 [aotinductor] Add helper fn to atomically apply size_hint to an expr w/ unbacked symints (#137537)
### Context
Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.

In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))

buf812 = generate_example_value(
    (64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```

The fix is to apply the fallback on each symbol.

### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda

========= Invalid __global__ write of size 2 bytes
```

Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
2024-10-10 17:11:24 +00:00
b6a64dce07 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere
2024-10-10 17:11:21 +00:00
034af88c2d Add a microbechmark for cache read path (#137607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607
Approved by: https://github.com/jamesjwu
2024-10-10 16:36:18 +00:00
dae60075e0 [BE][MPS] Use Tensor->TensorBase in OperationUtils.h (#137647)
For the most part those helper methods need access only to base-class methods.
Also replace spurious `at::` namespace prefixes, i.e. `at::Tensor`->`Tensor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137647
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601
2024-10-10 16:11:17 +00:00
bcf15d1bb4 [AOTI] Add error check for parsing error string from error code (#137626)
Currently, there are compilation warnings as below, which are resolved after the fix

```
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp: In function ‘ihipModuleSymbol_t* loadKernel(std::string, const string&, uint32_t, const std::optional<std::__cxx11::basic_string<char> >&)’:
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:482:25: warning: ignoring returned value of type ‘hipError_t’, declared with attribute nodiscard [-Wunused-result]
  482 |     hipDrvGetErrorString(code, &msg);                  \
      |     ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:519:5: note: in expansion of macro ‘CUDA_DRIVER_CHECK’
  519 |     CUDA_DRIVER_CHECK(hipModuleLoad(&mod, filePath.c_str()));
      |     ^~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
                 from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:2369:12: note: in call to ‘hipError_t hipDrvGetErrorString(hipError_t, const char**)’, declared here
 2369 | hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
      |            ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
                 from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:399:3: note: ‘hipError_t’ declared here
  399 | } hipError_t;
      |   ^~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137626
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-10-10 15:14:39 +00:00
575f260229 Extend vectorization with SVE(ARM) with Torch Compile (Inductor) (#134672)
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571

* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths

I've done some comparisons of the existing NEON implementation against the SVE-vectorization-enabled route for `torch.compile`.
Test results are for 8 cores on ARM Neoverse_V1

<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">

It's worth mentioning that for the standalone `SiLU` op there's a `~1.8x` speedup with `torch.compile`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-10 13:20:40 +00:00
479bd1f300 Hardlock frequent periodic jobs to Meta runners (#137616)
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners; however, we saw a sudden large spike in cost once that happened last week, which would have caused us to overuse our available AWS credits. This change hardlocks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-10-10 12:32:16 +00:00
f69bf005f7 Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)"
This reverts commit 4304c68a4c4d742a3ec5266b81f64a85922509c9.

Reverted https://github.com/pytorch/pytorch/pull/137097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it seems to increase the compilation time a lot causing some jobs to timeout ([comment](https://github.com/pytorch/pytorch/pull/137097#issuecomment-2404573266))
2024-10-10 09:29:05 +00:00
eea1f79a1d [AMD] use rccl.h instead of rccl/rccl.h (#135472)
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the layout of the ROCm RPM suite (the header is in include/rccl/rccl.h); however, the source tree has just src/rccl.h. Using rccl/rccl.h makes us find the RPM's header but not the source tree's header.

Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True  --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true  //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom   data_loader.dataset.table_ds=[2024-09-04]   data_loader.dataset.batch_size=512  max_ind_range=10

w/o this diff, it'll show 2.18 nccl version

Differential Revision: D62371434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
2024-10-10 08:55:57 +00:00
eaab5cf0f9 Fix torch.compile correctness bug on aarch64+sve due to gcc bug (#137606)
Some unit tests related to argmin_vec/argmax_vec were failing due to a bug in GCC affecting versions <= 12 on aarch64 platforms with SVE

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

Fixes #137597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137606
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-10 08:44:53 +00:00
365722f606 fix test_constant_output (#137547)
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.

Test Plan: fixed test

Differential Revision: D64081786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
2024-10-10 07:48:15 +00:00
4e8997744c [inductor] Enable coordinate descent tuning with max-autotune (#136867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136867
Approved by: https://github.com/Chillee
2024-10-10 07:29:52 +00:00
383eba5229 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` is set and the input is on CUDA.
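A short usage sketch of the behavior described above (not the PR's test):

```python
import torch

torch.use_deterministic_algorithms(True)
if torch.cuda.is_available():
    x = torch.randn(1024, device="cuda")
    y = x.cumsum(dim=0)  # expected to take the deterministic decomposition path
```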

Fixes #89492
Fixes #75240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby, https://github.com/eqy
2024-10-10 06:59:08 +00:00
71010bf097 [Inductor][CPP] generalize the wgt tensor delete (#135101)
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL` linear, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR we identify nodes that are present in `input_nodes` but not in `new_input_nodes`, indicating they are no longer used by the GEMM template and can be considered candidates for deletion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-10 06:01:09 +00:00
ea83c78174 [SymmetricMemory] set the storage_offset of tensors returned by get_buffer() to 0 (#137569)
It seems that there's a bug in `TensorMaker` - it treats `storage_offset` as bytes when calculating the storage size, but as numel when setting the tensor's `storage_offset`. This seems to cause tensors returned by get_buffer() with a non-zero offset to report the wrong storage size.

Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
2024-10-10 05:05:58 +00:00
96bab021c0 ATen | Fix header namespace pollution. (#137619)
Summary: Fixing a warning, so we can enable it globally.

Test Plan: Sandcastle-only, no runtime changes.

Differential Revision: D64122115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137619
Approved by: https://github.com/Skylion007
2024-10-10 05:04:54 +00:00
1aa130e80c Avoid generating as_strided for alaising views in auto_functionalize_v2 (#137149)
During auto_functionalize_v2, if we encounter a view whose size(), stride(), and storage_offset() match the base,
we create a view that is regenerated by calling aten.alias instead of as_strided, for better performance.
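A minimal illustration of why the two regenerations are interchangeable in this case (a sketch, not the pass itself):

```python
import torch

base = torch.randn(4, 4)
v_alias = torch.ops.aten.alias(base)
v_strided = torch.ops.aten.as_strided(
    base, base.size(), base.stride(), base.storage_offset()
)
# When size/stride/offset match the base, both produce equivalent views,
# and alias avoids the more expensive as_strided regeneration.
assert v_alias.size() == v_strided.size()
assert v_alias.stride() == v_strided.stride()
```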

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
2024-10-10 05:00:41 +00:00
b5284a01a4 [CPU] remove keyword static for exp_u20 (#137571)
Remove the `static` keyword from the vec-register constants in the exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR with 56 threads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
2024-10-10 04:52:04 +00:00
d170c410f2 Clean up op BC check list (#137634)
Summary: Remove some stale items

Test Plan: CI

Differential Revision: D64133246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137634
Approved by: https://github.com/hl475
2024-10-10 04:29:21 +00:00
249152475d fix sequence number for group (#134578)
Summary:
Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay.

`ProcessGroupNCCL` has 2 sequence number counter, `seqCollective_` and `seqP2P_`.
b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L1188-L1191)
However, `WorkNCCL` only has one sequence number member `seq_`. b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L387)
We need to match collective and p2p with wait separately.
29b5a462dc

Depend on: https://github.com/pytorch/pytorch/pull/135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
2024-10-10 04:24:06 +00:00
5aa9f2b660 Fixed issue with nn.Transformer().generate_square_subsequent_mask() (#137654)
Fixed an issue where nn.Transformer().generate_square_subsequent_mask() didn't respect set_default_device() and set_default_dtype().
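A quick, hedged illustration of the fixed behavior (not the PR's test):

```python
import torch

torch.set_default_dtype(torch.float64)
mask = torch.nn.Transformer.generate_square_subsequent_mask(4)
print(mask.dtype)  # torch.float64 once the defaults are respected
```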

Fixes #137186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137654
Approved by: https://github.com/mikaylagawarecki
2024-10-10 03:10:01 +00:00
b9c9f7f0fa Document ROCm environment variables and improve CMake messaging to user (#137308)
Fixes #115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.

The PR improves the documentation and CMake's behavior for ROCM builds.

- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-10 03:08:08 +00:00
f394fb554b Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541)
Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky with 8% detected variance,
so I set it up with a 20% threshold (8% x 2, with some margin).
The others are stable within ±1.5%.

<img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf">

<img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541
Approved by: https://github.com/aorenste
2024-10-10 02:47:38 +00:00
f9ed39c989 Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637
Approved by: https://github.com/albanD
2024-10-10 01:23:30 +00:00
d50d5df2fb Add warning for non static grads in optimizer variable (#137554)
Fixes https://github.com/pytorch/pytorch/issues/112548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137554
Approved by: https://github.com/williamwen42
2024-10-10 01:23:21 +00:00
f301f6544b fix bug for fill_empty_deterministic_ not support complex half (#137488)
Fixes #133157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137488
Approved by: https://github.com/ezyang
2024-10-10 01:21:32 +00:00
361046718d Generate new expected results file when there is failures in diff time benchmarks (#137551)
The test also adds a signpost log for the benchmarks that pass.
To test, I ran python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.

REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.

PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%

MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

You can use the new reference expected result stored at path: out.csv.

a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
2024-10-10 01:09:15 +00:00
d9f4a7d3f9 Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64135230](https://our.internmc.facebook.com/intern/diff/D64135230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-10-10 00:52:50 +00:00
4f45c76806 [PGNCCL] Limit access to ncclComm_ (#137573)
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` helps us do that (thanks to a wait function inside), and is thus preferred over using `ncclComm_` directly.

To prevent `ncclComm_` from being directly used outside, e.g. in `ProcessGroupNCCL`, we also move it as a private member of `NCCLComm` class -- the external-facing wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
2024-10-10 00:34:05 +00:00
cyy
0739efbd1f Remove reference of gcc7 from CI scripts (#137339)
Because gcc7 can't be used to build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137339
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-10-10 00:29:29 +00:00
47a515d260 [c10d] simplify barrier implementation and further decouple CPU/GPU synchronization (#137516)
Summary:
Barrier is essentially intended to block the CPU thread (instead of GPU
streams). Before, we used 2 stream synchronizations (1. the current stream
blocked by the nccl stream end event, 2. the CPU thread blocked on the current
stream). This is unnecessary, as we already have CPU-thread blocking
logic in wait(). Also, adding a barrier-specific code block in the general
GPU synchronize() API is intrusive and confusing.

This PR cleans this up.

Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2024-10-09 23:55:28 +00:00
51c33c0b72 Increase the runner size of AVX* jobs to 4xlarge (#137633)
The failing test was recently moved back from slow, and it requires more RAM than what is available on the 2xlarge runner.  It looks ok to up the instance size to 4xlarge instead.  I missed the periodic jobs in https://github.com/pytorch/pytorch/pull/137447

Example periodic failures de4c2a3b4e (test_cpu_repro)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137633
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-10-09 23:43:49 +00:00
4304c68a4c In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137097
Approved by: https://github.com/angelayi
ghstack dependencies: #137091
2024-10-09 23:34:35 +00:00
6908d8d450 Enable python dispatcher for reinplacing pass (#137091)
Arguably this should be put somewhere higher up in the stack?  Not sure.

Xref: https://fb.workplace.com/groups/6829516587176185/permalink/8042762615851570/

There is a repro but I need to fix more bugs before it can be checked in

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137091
Approved by: https://github.com/bdhirsh
2024-10-09 23:34:35 +00:00
31e334ad9e [unwind] replace LONG_LONG_MAX by the portable LLONG_MAX (#125043)
This fixes a compilation error on systems with the musl c library.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125043
Approved by: https://github.com/aaronenyeshi
2024-10-09 23:34:16 +00:00
aafa02506e [CudaDMAConnectivityDetector] improve the detection robustness (#137530)
- Previously the detection would fail before the user called APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. In this PR, we added explicit nvml initialization (which is idempotent).
- Previously, any nvml issue that occurred in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
2024-10-09 23:30:16 +00:00
fbaf9b62de [SymmetricMemoryOps] use float32 as the accumulator type when accumulating bfloat16 with multimem.ld_reduce (#137529)
This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
2024-10-09 23:30:16 +00:00
39c5122a4f [IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
2024-10-09 23:30:16 +00:00
e6edfe3928 [SymmetricMemoryOps] create an out-variant for multimem_one_shot_all_reduce (#137474)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
2024-10-09 23:30:16 +00:00
b22749712c type _inductor/optimize_indexing.py (#137599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137599
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-10-09 23:29:47 +00:00
d67b4f9e5f type _inductor/quantized_lowerings.py (#137598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137598
Approved by: https://github.com/Skylion007
2024-10-09 23:29:26 +00:00
9b01d17b8d Use MetaProxy more pervasively (#137588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137588
Approved by: https://github.com/ezyang
ghstack dependencies: #136674
2024-10-09 23:22:03 +00:00
13cf8360d8 [MPS] Fix testing for generator operators (#137601)
Before these changes, tests for operators like `eye` or `triu_indices` were essentially a test that the respective CPU operators are stable, as cpu_sample and mps_sample were the same

Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`

Discovered that:
 - `torch.randn` and `torch.randint` fall into the same undefined category
 - `torch.logspace` is not implemented for MPS
 -  Allow 1.0  absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
 - `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
 - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
2024-10-09 23:17:11 +00:00
48fe0d56d6 Type _inductor/exc.py (#137595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137595
Approved by: https://github.com/Skylion007
2024-10-09 23:15:06 +00:00
7408742b67 Make ignore_fresh_unbacked_symbols reentrant (#137605)
I have a test but it requires some other feature work that isn't fully baked.  Maybe this will fix an xfail.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137605
Approved by: https://github.com/albanD
2024-10-09 23:08:05 +00:00
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When TunableOp is enabled, it is easy to hit OOM since the application usually needs a large amount of video memory, such as when running an LLM for inference. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp (a usage sketch follows the list below):

- record untuned GEMMs to a file.

- add a python API named tune_gemm_in_file that reads the untuned file and tunes the GEMMs in it.
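A hedged sketch of the offline flow described above. The tune_gemm_in_file API is named in this PR; the module path, environment-variable names, and file name below are assumptions for illustration and may differ from the actual implementation.

```python
import os
import torch

# Step 1 (assumed knobs): run the workload once with recording enabled so that
# untuned GEMM shapes are written to a file instead of being tuned in-process.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_RECORD_UNTUNED"] = "1"
# ... run the inference workload here ...

# Step 2: offline, tune every GEMM recorded in the untuned file.
torch.cuda.tunable.tune_gemm_in_file("tunableop_untuned0.csv")  # assumed file name
```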

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
839d3568b0 [compiled autograd] fix -Wuninitialized (#137539)
https://github.com/pytorch/pytorch/pull/135663#discussion_r1792408353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137539
Approved by: https://github.com/isuruf, https://github.com/Skylion007
2024-10-09 21:16:26 +00:00
38027b9b47 [SymmetricMemory] fix a bug where numel calculation overflows when the tensor size is large (#137567)
Fixes https://github.com/pytorch/pytorch/issues/137145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137567
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 20:45:57 +00:00
a93ea617b5 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-09 20:35:09 +00:00
47af7cc962 Add compiler bisector (#131936)
This is a utility to aid torch.compile debugging. You provide a function that returns True on success and False on failure (or you do something out of process), and run `bisect_helper good | bad`.

The bisector will first go through the backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` - to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :

```
    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented
```

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.

We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.

Fix for https://github.com/pytorch/pytorch/issues/126546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
2024-10-09 20:34:11 +00:00
cfe970260a Clarify opt-einsum usage, fix #127109 (#137596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137596
Approved by: https://github.com/albanD
2024-10-09 20:31:24 +00:00
c73d2634b9 Revert "Log chromium event for automatic dynamic reasons (#137491)"
This reverts commit 3c1ab9367885fdb0ead5fcc14a22d6934070ca92.

Reverted https://github.com/pytorch/pytorch/pull/137491 on behalf of https://github.com/jovianjaison due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/137491#issuecomment-2403360486))
2024-10-09 20:24:12 +00:00
16a2c2cfd4 Revert "Introduce torch.sym_sum (#136429)"
This reverts commit 90bed32b986ab1356dc376df3985497cedbe8a29.

Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))
2024-10-09 20:08:01 +00:00
572f506f9c [c10d] Improve split_group test (#137572)
Fix 1:
`backend1 = pg._get_backend`, here `pg` should be `ng1`.

Fix 2:
`dist.broadcast` should be called by ranks of subgroup `ng1` only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137572
Approved by: https://github.com/Skylion007
2024-10-09 19:43:57 +00:00
70288c3c2d Remove dependency on numpy for serialization for XLA/open registration devices without numpy (#137444)
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263

Follow ups: Do the same for maia and mtia

## Motivation

With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors  by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.

However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.

## Why is this ok?

The following comment on why numpy is necessary for serialization is legacy

c87c9f0a01/torch/_tensor.py (L303-L312)

We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content

**Instead what now happens is that CPU storage is constructed with data from the file **and then** moved onto backend device.**

Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)
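As a small CPU sketch of the general end state (not from the PR): a tensor checkpoint round-trips without numpy in the serialization path, so `weights_only` loading needs no numpy allowlisting.

```python
import torch

t = torch.arange(4)
torch.save(t, "t.pt")
loaded = torch.load("t.pt", weights_only=True)  # no numpy globals required
assert torch.equal(t, loaded)
```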

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
2024-10-09 19:35:55 +00:00
aa61e251d4 [FSDP2] Added shard_placement_fn arg (#137496)
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.

```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```

## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang  has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
2024-10-09 19:13:32 +00:00
36133f39db Tensorify compute on Python scalars (#136674)
Signed-off-by: Bob Ren <bobren@fb.com>

Commandeered from https://github.com/pytorch/pytorch/pull/130228 as I'm helping @ezyang w/ shipping dynamic float arguments in PT2. This starts with supporting torch.ops.aten.mul. I'll stack support for other operators on top in subsequent PRs to keep this scoped to the mechanics of the fx pass.
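A hedged illustration of the target pattern (not the PR's test): a Python float argument flowing into a compiled function and being multiplied with a tensor, which lowers to torch.ops.aten.mul with a scalar operand.

```python
import torch

@torch.compile(backend="eager", dynamic=True)
def scale(x, s: float):
    return x * s

print(scale(torch.ones(3), 0.5))
print(scale(torch.ones(3), 0.25))  # second call exercises the dynamic float
```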

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136674
Approved by: https://github.com/ezyang
2024-10-09 18:51:41 +00:00
f15edb291a type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-09 18:35:28 +00:00
9a957e2842 [NCCL][Profiler] Add functionality to call dump function of NCCL profiler plugin (#137523)
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.

The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.

Test Plan:
```
        tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
        # Create a list with same number of elements as world size (aka no. of ranks)
        # During allgather this list is going to be populated with tensors from all ranks (aka all gather)
        gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
        # get collective from all ranks
        if i <= 10 or RANK != 0:
            dist.all_gather(gathered_tensors, tensor)
```
My script triggers flight recoder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283

Differential Revision: D61929401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
2024-10-09 18:19:33 +00:00
394c143e4e [dynamo] Fix error when inlining certain nested closure returned by another function (#137510)
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.

In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as a variable of the cell's actual value, rather than a `NewCellVariable`.

Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.

This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.

This fixes #136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but that's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into `NestedUserFunctionVariable::closure` upon function creation, at which point Dynamo always has the cell value in `symbolic_locals` to look up.
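A hypothetical minimal shape of the failing pattern (the real repro is in the added test): a nested closure returned by another function that captures an enclosing variable.

```python
import torch

def make_fn(t):
    def inner():
        return t + 1      # `t` is captured in a cell of `inner`
    return inner

@torch.compile(backend="eager")
def f(t):
    return make_fn(t)()

print(f(torch.ones(2)))
```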

Fixes #136814.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
2024-10-09 18:13:57 +00:00
018dabff20 [ONNX] Implement patch for jit.isinstance (#137592)
Patch torch.jit.isinstance so that user models are dynamo-exportable. Replaces https://github.com/pytorch/pytorch/pull/137487.
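A small sketch of the construct being patched (assumed usage, not taken from the PR): torch.jit.isinstance performs container-aware runtime type checks inside model code.

```python
from typing import List

import torch

def pick_first(x):
    if torch.jit.isinstance(x, List[torch.Tensor]):
        return x[0]
    return x

print(pick_first([torch.ones(2), torch.zeros(2)]))
```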
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137592
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-09 18:06:52 +00:00
ceb2fcc5db [FSDP2] Fixed incorrect tensor meta after .to(dtype) (#137593)
This fixes https://github.com/pytorch/pytorch/issues/137522. After a method that changes the module parameters (like `.to(torch.float64)`), we need to update the `DTensorSpec`, whose `TensorMeta`'s dtype may have changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137593
Approved by: https://github.com/Skylion007
2024-10-09 17:57:11 +00:00
bae8d5853e [TorchRec][PT2 compile] enable dynamo in _get_user_embeddings (#136798)
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
  GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0).  (Size-like symbols: u22)

  ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
  Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

  Potential framework code culprit (scroll up for full backtrace):
    File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
      if M <= 0 or N <= 0:
```
```
    N = prod(inner_dims)  # type: ignore[arg-type]
    M = prod(outer_dims)  # type: ignore[arg-type]
    if M <= 0 or N <= 0:
        return (
            input.new_zeros(input_shape) if output_mask[0] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
        )
```
# changes
* use guard_size_oblivious, since the new_zeros return is a kind of optimization and shouldn't impact the correctness of the follow-up code logic.
* the size `ret[i][j]` could be zero, so the change in V1 isn't valid
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
    from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
    if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```

# past
* found `u22` was introduced at
```
    def _wait_impl(self) -> List[List[int]]:
        # Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
        if isinstance(self._splits_awaitable, dist.Work):
            self._splits_awaitable.wait()

        ret = self._output_tensor.view(self.num_workers, -1).T.tolist()  # <------ u22 introduced here

        if not torch.jit.is_scripting() and is_torchdynamo_compiling():
            for i in range(len(ret)):
                for j in range(len(ret[i])):
                    torch._check_is_size(ret[i][j])   # <----------  my question: why the _check_is_size isn't enough??
                    torch._check(ret[i][j] > 0)   # <------ added by diff V1
```

Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```

# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):

Differential Revision: D63424929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
2024-10-09 17:19:45 +00:00
4d45536e92 Save aot graph code in AOTAutogradCache for logging purposes (#137432)
Save the string graph code from print_readable

Differential Revision: [D63985711](https://our.internmc.facebook.com/intern/diff/D63985711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137432
Approved by: https://github.com/bdhirsh
ghstack dependencies: #137431
2024-10-09 16:59:08 +00:00
b71d0ac3b1 remove unused variable (#137565)
per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137565
Approved by: https://github.com/Skylion007
2024-10-09 16:31:43 +00:00
ae03c0cff3 Add microbenchmark for FxGraphHashDetails.debug_lines (#137506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506
Approved by: https://github.com/jamesjwu
2024-10-09 16:15:05 +00:00
e945b6600d Support 3.8 compile again (#137587)
This is not going to be very reliable since we don't have CI though...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137587
Approved by: https://github.com/Skylion007
2024-10-09 15:54:52 +00:00
1d15dd7891 Fix triton_reshape to properly expand Min keyword in triton codegen (#137357)
Summary: Previously triton_reshape would generate code with the `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand the `Min` keyword to `<`.

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```

Differential Revision: D63850158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2024-10-09 15:53:45 +00:00
de4c2a3b4e Add AsyncCollectiveTensor isinstance check to test_graph_input_is_async (#137253)
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check on the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show users that we already support `AsyncCollectiveTensor` as a graph input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
2024-10-09 08:06:16 +00:00
ac8954d1ca [pattern match][SDPA] remove contiguous in sdpa replacement (#136930)
Fixes a perf issue which was found internally.
In this case, we see query(size=[1, 16, 384, 64], stride=[393216, 64, 1024, 1]) in the model code. However, before entering SDPA, it becomes query(size=[1, 16, 384, 64], stride=[393216, 24576, 64, 1]). This is caused by the [SDPA pattern match](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L130-L132), which applies contiguous to the inputs in the replacement. This is not necessary, as the contiguous doesn't exist in the pattern, and it could sometimes cause perf issues. If needed, we can do the additional contiguous in the kernel implementation instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136930
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jgong5
2024-10-09 07:52:38 +00:00
72ad1b8c6c Make Context to be Device-agnostic Step by Step (2/N) (#136526)
- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #136519
2024-10-09 07:34:30 +00:00
a02093e824 fix test_export_constraints_error_not_in_range (#137500)
Test Plan: fixed

Differential Revision: D64052011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137500
Approved by: https://github.com/tugsbayasgalan
2024-10-09 05:48:14 +00:00
abb00efc14 Add torch.squeeze parameter description to declare allowed type (#137485)
Fixes #137422

Add the parameter type to the API docs to clarify the allowed value types and prevent users from passing `None` as the `dim` value directly.

```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
2024-10-09 05:29:13 +00:00
df114a447e Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to run OOM on a 2xlarge with a `std::bad_alloc` memory error. Maybe this would also fix that issue (pending CI testing).
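A hedged sketch of what the parametrization looks like; the parameter names and values below are placeholders, not the ones actually used in test_lstm_packed.

```python
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class ExampleTest(TestCase):
    @parametrize("training", [True, False])
    @parametrize("bidirectional", [True, False])
    def test_lstm_packed(self, training, bidirectional):
        pass  # each combination now runs and reports as a separate test

instantiate_parametrized_tests(ExampleTest)

if __name__ == "__main__":
    run_tests()
```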

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-09 05:13:53 +00:00
2fff990c16 Revert "[AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)"
This reverts commit 932b9945c0bc61a11a7db2f52c974cf283d5a2ed.

Reverted https://github.com/pytorch/pytorch/pull/137314 on behalf of https://github.com/huydhn due to The failure shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/137314#issuecomment-2401311719))
2024-10-09 04:53:30 +00:00
972822dea1 Minorly reorder optim kwargs in docs, fixes #137391 (#137531)
Closes #137391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531
Approved by: https://github.com/albanD
2024-10-09 04:14:45 +00:00
4628fcf41a Fix ir._WaitKernel (#137401)
In ABI-compatible mode, AOTInductor could not compile _WaitKernel due to
an incorrect outputs list.  Add the correct set of outputs, as done in
ir._CollectiveKernel.create_out_of_place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137401
Approved by: https://github.com/desertfire
ghstack dependencies: #136924
2024-10-09 04:02:30 +00:00
0414aeacd9 AOTInductor: silence linker warnings about executable stacks (#136924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136924
Approved by: https://github.com/desertfire
2024-10-09 04:02:30 +00:00
ddc7b6d0b4 Removes confusing note, addresses #38006 (#137535)
Fixes #38006

The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than it is helpful, so removing it is better than leaving it in, especially because most uses of closure that I know of _do_ modify the grads.
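For context, a standard closure pattern (e.g., with LBFGS) where the closure does modify the grads via backward(); a generic sketch, not taken from the docs being edited.

```python
import torch

model = torch.nn.Linear(2, 1)
opt = torch.optim.LBFGS(model.parameters())
x, y = torch.randn(8, 2), torch.randn(8, 1)

def closure():
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()  # the closure modifies .grad here
    return loss

opt.step(closure)
```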

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535
Approved by: https://github.com/albanD
2024-10-09 04:00:38 +00:00
d3edf4ebf4 [SymmetricMemoryOps] implement two-shot all-reduce (#137473)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
2024-10-09 03:49:42 +00:00
82e55b624f [SymmetricMemoryOps] implement one_shot_all_reduce (#137472)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
2024-10-09 03:49:42 +00:00
5d83ee3e32 [SymmetricMemoryOps] refine cross-device barriers (#137471)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Refine the cross-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 03:49:42 +00:00
5f1759a025 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-09 02:29:40 +00:00
d5785d4295 [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-09 02:29:40 +00:00
0a304d9048 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-09 02:29:40 +00:00
b3f30c9bc3 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fall back due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-09 02:29:40 +00:00
27dee935af [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-09 02:29:40 +00:00
38afac2917 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-09 02:29:40 +00:00
108b469f78 [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-09 02:29:40 +00:00
e41dffbedd [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode in polyfill which no-op enters but exits normally. This is needed because we still want to trace the with context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
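
For context, a minimal eager-backend sketch (illustrative only, not the test from this PR) of the kind of user code this tracing targets:

```python
import torch
from torch.overrides import TorchFunctionMode

class NoopMode(TorchFunctionMode):
    # Default enter/exit behavior: entering pushes the mode, exiting pops it.
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        return func(*args, **kwargs)

@torch.compile(backend="eager")
def f(x):
    with NoopMode():
        y = x.sin()
        torch._dynamo.graph_break()  # forces the graph-break / resume-function path described above
        return y.cos()

f(torch.randn(4))
```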
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-09 02:29:40 +00:00
0b8048c78a Fix AOTI CPP GEMM Template issue without freezing (#136421)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, there is the Inductor IR of weight
```
ReinterpretView(
  StorageBox(
    ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
  ),
  FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
  origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the used weight was before permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-10-09 02:19:07 +00:00
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
384ddab294 [c10d] fix sequence numbers for coalesced operations (#135132)
Summary:
We were erroneously incrementing seq_collective for p2p operations.
Fixes issue #134833

Test Plan:
Unit tests.
TODO: add more unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135132
Approved by: https://github.com/fduwjj
2024-10-09 01:38:12 +00:00
8cbb58cff6 [inductor] Limit cpu copies in autotuning to CUDA devices (#137509)
Summary: Missed in https://github.com/pytorch/pytorch/pull/136701#discussion_r1792328849: we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137509
Approved by: https://github.com/int3, https://github.com/eellison
2024-10-09 01:31:58 +00:00
932b9945c0 [AutoAC] Backward Pass Aware AC - changes to partitioner to accommodate SOLVER as a callable (#137314)
Summary: Making it so that the config can pass `config.activation_memory_budget_solver` as a callable, which is then invoked to determine the set of saved/recomputed nodes.

Test Plan: tbd

Reviewed By: Chillee, basilwong

Differential Revision: D63714905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314
Approved by: https://github.com/eellison, https://github.com/basilwong

Co-authored-by: Parikshit Shah <parikshit@meta.com>
2024-10-09 00:39:29 +00:00
23c531b3e9 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-10-09 00:19:03 +00:00
de7f32a205 openreg add pin_memory (#135339)
According to the `Next steps` section in test/cpp_extensions/open_registration_extension/README.md, add pinned memory and a HostAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135339
Approved by: https://github.com/albanD
2024-10-09 00:07:59 +00:00
8893881867 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-10-09 00:05:52 +00:00
eqy
cba3f4f5e3 [CUDA] Clean up asserts in test_cuda.py (#137034)
Switch some `assertTrue` tests to `assertEqual` etc for debuggability in logs
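
As a generic unittest illustration of the debuggability difference (not code from the PR):

```python
import unittest

class Example(unittest.TestCase):
    def test_assert_true(self):
        a, b = 1, 2
        # On failure the log only says "False is not true".
        self.assertTrue(a == b)

    def test_assert_equal(self):
        a, b = 1, 2
        # On failure the log says "1 != 2", showing both operands.
        self.assertEqual(a, b)
```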

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034
Approved by: https://github.com/Skylion007
2024-10-08 23:16:19 +00:00
b16167874d Minor SGD docs clarification fixing #137356, #137352 (#137528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528
Approved by: https://github.com/albanD
2024-10-08 23:05:08 +00:00
2a1829d728 Error message for allow_in_graph decorator and arbitrary function combo (#135972)
Fixes #103615

Quick error message for non-allowed allow_in_graph decorator and arbitrary function combo.
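
For reference, a minimal sketch of the supported usage (the exact combinations that now raise the new error are described in the linked issue; this example is an assumption of typical usage, not taken from the PR):

```python
import torch
from torch._dynamo import allow_in_graph

def helper(x):
    return x + 1

# Supported: registering a plain Python function so Dynamo keeps calls to it
# in the graph instead of tracing into it.
allow_in_graph(helper)

@torch.compile(backend="eager")
def f(x):
    return helper(x) * 2

f(torch.randn(3))
```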

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135972
Approved by: https://github.com/anijain2305
2024-10-08 22:48:38 +00:00
4aed81c0db Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-08 22:36:46 +00:00
02013da038 Lift restriction on training IR for unflatten (#137470)
Differential Revision: [D64025578](https://our.internmc.facebook.com/intern/diff/D64025578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137470
Approved by: https://github.com/avikchaudhuri
2024-10-08 22:30:24 +00:00
81c8a8ada6 [ONNX] Bump onnxscript in CI (#137497)
To 0.1.0.dev20241008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137497
Approved by: https://github.com/titaiwangms
2024-10-08 21:56:30 +00:00
76ab1ab665 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
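
A plain-tensor sketch of the scenario (illustrative, not NJT-specific): when one output of an `autograd.Function` is unused, the engine materializes correctly-shaped zeros for its grad before calling `backward`:

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.sin(), x.cos()  # the second output will go unused downstream

    @staticmethod
    def backward(ctx, g_sin, g_cos):
        # With grad materialization on (the default), g_cos arrives as zeros
        # shaped like the unused output; allocating those zeros is what used
        # to fail for NJT outputs whose sizes contain nested ints.
        (x,) = ctx.saved_tensors
        return g_sin * x.cos() - g_cos * x.sin()

x = torch.randn(4, requires_grad=True)
used, _unused = TwoOutputs.apply(x)
used.sum().backward()
```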
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
2024-10-08 21:01:36 +00:00
5e3e1c0151 Revert "[FSDP2] Required mesh_dim_names for HSDP (#137436)"
This reverts commit 5fb30df7d6ecc25cc7c4c17a8a33d14ddaa7c279.

Reverted https://github.com/pytorch/pytorch/pull/137436 on behalf of https://github.com/malfet due to Looks like it broke distributed testing, see https://github.com/pytorch/pytorch/actions/runs/11239761070/job/31249854217 ([comment](https://github.com/pytorch/pytorch/pull/137436#issuecomment-2400794929))
2024-10-08 20:50:49 +00:00
b499083a91 Get rid of quadratic tests to has_same_metadata (#136857)
Fixes https://github.com/pytorch/pytorch/issues/136852

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136857
Approved by: https://github.com/isuruf, https://github.com/bdhirsh
2024-10-08 20:49:23 +00:00
d34b617bb9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)"
This reverts commit 51bc839b94829f176e3c1b7f62e3448d6028c480.

Reverted https://github.com/pytorch/pytorch/pull/137114 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
8c937445ee Revert "[Dynamo] Remove ignored modes workaround (#135502) (#137115)"
This reverts commit b1fd7708bd81d8d52908bf4459ed024471abd803.

Reverted https://github.com/pytorch/pytorch/pull/137115 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
e5f9131327 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)"
This reverts commit f9d69cde88ad972ee8fc24413dd0740f4e21562d.

Reverted https://github.com/pytorch/pytorch/pull/137116 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
2d18c2d5e7 Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)"
This reverts commit 941be418d8ec3290d0e3bae0e16a443be26b3075.

Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
cc75ac084f Add test for https://github.com/pytorch/pytorch/issues/137087 (#137090)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137090
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-08 20:17:03 +00:00
5349ee2934 Revert "Parametrize test_lstm_packed (#137447)"
This reverts commit d5493ed579ba41015ffef981832a3f04f94bb6f8.

Reverted https://github.com/pytorch/pytorch/pull/137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](https://github.com/pytorch/pytorch/pull/137447#issuecomment-2400737602))
2024-10-08 20:15:24 +00:00
3c1ab93678 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-08 19:53:12 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
b41fc14072 compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
2024-10-08 18:44:13 +00:00
48b8f818b2 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
2024-10-08 18:44:13 +00:00
53af729a66 add meta for _segment_reduce_backward (#137442)
reland of https://github.com/pytorch/pytorch/pull/124988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137442
Approved by: https://github.com/albanD
2024-10-08 18:40:06 +00:00
1aac1ffce1 Don't generate implicit value ranges for missing symbols. (#136667)
Instead, call back to a missing handler when needed. This greatly speeds things up when the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.

```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```
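
A generic Python sketch of the missing-handler pattern described above (a dict subclass with a `__missing__` callback; not the actual ShapeEnv code):

```python
class RangesWithMissing(dict):
    """Value-range mapping that computes a fallback lazily instead of
    pre-populating an implicit range for every symbol up front."""

    def __init__(self, missing_handler):
        super().__init__()
        self._missing = missing_handler

    def __missing__(self, key):
        # Only invoked for symbols without an explicit range
        # (e.g. nested ints, which have no value range).
        return self._missing(key)

ranges = RangesWithMissing(lambda sym: "unbounded")
ranges["s0"] = (0, 128)
print(ranges["s0"])  # (0, 128)
print(ranges["u0"])  # "unbounded" -- computed on demand, nothing stored
```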

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
2024-10-08 18:12:57 +00:00
90bed32b98 Introduce torch.sym_sum (#136429)
Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
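
A standalone sympy sketch of the quadratic behavior being avoided (illustrative; `torch.sym_sum` itself operates on symbolic ints, not raw sympy symbols):

```python
import functools
import operator

import sympy

xs = sympy.symbols("x0:100")

# Chain of binary adds: each step rebuilds a flattened Add whose constructor
# is O(n) in its argument count, so the whole reduction is quadratic.
chained = functools.reduce(operator.add, xs)

# Single n-ary sum: one Add node over all 100 terms, built in one pass.
flat = sympy.Add(*xs)

assert chained == flat
```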

update_hint_regression benchmark, before and after:

```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
3bf6594d13 Log compile ids to pt2_remote_cache and pt2_compile_events (#137431)
Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile.

Differential Revision: [D63900826](https://our.internmc.facebook.com/intern/diff/D63900826/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137431
Approved by: https://github.com/oulgen
2024-10-08 18:04:48 +00:00
758dbac308 Add type check for ord in torch.linalg.vector_norm() and torch.linalg.matrix_norm() (#137463)
fixes #137424, fixes #137460
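
A small usage sketch (the valid calls are standard API; the exact exception raised for a wrongly-typed `ord` is an assumption based on the linked issues):

```python
import torch

x = torch.randn(3, 4)

# Numeric `ord` values keep working as before.
torch.linalg.vector_norm(x, ord=2)
torch.linalg.matrix_norm(x, ord="fro")

# A wrongly-typed `ord` (e.g. a list) should now fail up front with a clear
# type error instead of deeper inside the implementation.
try:
    torch.linalg.vector_norm(x, ord=[2])
except Exception as err:  # exact exception type is an assumption
    print(type(err).__name__, err)
```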
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463
Approved by: https://github.com/lezcano
2024-10-08 17:53:56 +00:00
d87835ac32 [Profiler] Clear Out Dangling AppendOnlyLists (#137450)
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when profiler is exiting.

Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully

Differential Revision: D64010911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
2024-10-08 17:48:59 +00:00
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d826074546558f6665a4c118335a7725503cac.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
a8047564ff Revert "[FlexAttention] Support training bias for eager (#136910)"
This reverts commit 711dacf9845cbc9ea8b3b0fa257309930106712f.

Reverted https://github.com/pytorch/pytorch/pull/136910 on behalf of https://github.com/malfet due to torch.library.custom_op looks weird here and it breaks some internal workloads ([comment](https://github.com/pytorch/pytorch/pull/136910#issuecomment-2400434833))
2024-10-08 17:29:02 +00:00
0b5ade8a12 Revert "[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)"
This reverts commit 68151fd2889c9752348c2dfdc7c175ee201c0cd3.

Reverted https://github.com/pytorch/pytorch/pull/137120 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137120#issuecomment-2400429265))
2024-10-08 17:26:19 +00:00
2570d77a26 Revert "type _dynamo/trace_wrapped_higher_order_op.py (#137354)"
This reverts commit a9f7b905de2217eedee6723b0eb83b3ac7406c26.

Reverted https://github.com/pytorch/pytorch/pull/137354 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137354#issuecomment-2400424669))
2024-10-08 17:22:40 +00:00
76c5bdd2cc Revert "[Dynamo] Handle extracted unbound tensor methods (#137227)"
This reverts commit 14eabd69152e31d059444310979625542db2aece.

Reverted https://github.com/pytorch/pytorch/pull/137227 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137227#issuecomment-2400406384))
2024-10-08 17:12:41 +00:00
c88c0e6c65 Revert "[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)"
This reverts commit d255b34c0ac6208633ed5e71d019fa9ae061e1fc.

Reverted https://github.com/pytorch/pytorch/pull/137119 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137119#issuecomment-2400401262))
2024-10-08 17:09:26 +00:00
cc10ef4645 Revert "[Dynamo] add flex attention mode test (#137121)"
This reverts commit 144665d772f7ec014a4a23f460a632a4a4774f4a.

Reverted https://github.com/pytorch/pytorch/pull/137121 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137121#issuecomment-2400389882))
2024-10-08 17:03:34 +00:00
11192ceca4 Revert "[FlexAttention] only calculate grads for buffers that require_grad (#137451)"
This reverts commit 9f9d252971ea1de04d349a0460e39e3bfe824eae.

Reverted https://github.com/pytorch/pytorch/pull/137451 on behalf of https://github.com/malfet due to Need to revert it in order to be able to backout https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137451#issuecomment-2400385858))
2024-10-08 17:00:59 +00:00
8184e202d8 Update mutation checking in pattern matcher (#137448)
Fix for https://github.com/pytorch/pytorch/issues/137229

The current mutation checking is complicated because it works on pre-grad IR. Once pre-grad IR has been traced to OpOverloads, checking is much easier. I am also special-casing the auto functional HOP, although, as discussed with @zou3519, it would be nice to have a way of querying HOPs that mimic schemas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137448
Approved by: https://github.com/zou3519
2024-10-08 16:56:40 +00:00
28493efe6e fix silly mapping issue with torch.Size (#137465)
Test Plan: added test

Differential Revision: D64022949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137465
Approved by: https://github.com/yushangdi, https://github.com/angelayi
2024-10-08 16:53:15 +00:00
7267363844 [ONNX] Insert contiguous node between transpose and view before calling run_decompositions (#137340)
Works around #136543.

This fix solves the issue only in the context of the ONNX exporter, but the issue also happens in other contexts.

The bug happens when the method `run_decompositions` is called. The failing pattern is assumed to be `view(transpose(x, ...))`. This pattern is replaced by `view(flatten(transpose(x, ...)))`. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor, but then transpose would not be used. The extra node appears in the final ONNX graph but is removed after optimization. The final ONNX graph should not be impacted and no performance loss should be observed for the ONNX model.
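
An eager-mode sketch of why the pattern fails and why the inserted node helps (illustrative; the actual fix rewrites the exported FX graph rather than user code):

```python
import torch

x = torch.randn(2, 3, 4)
t = x.transpose(1, 2)        # shape (2, 4, 3), non-contiguous strides

# t.view(2, 12) would raise: "view size is not compatible with input
# tensor's size and stride". Flattening (or .contiguous()) first produces a
# densely-strided tensor, after which the view succeeds.
y = t.flatten(1).view(2, 12)
print(y.shape)  # torch.Size([2, 12])
```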

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-08 16:45:59 +00:00
5fb30df7d6 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-08 16:31:18 +00:00
0bfedb13e7 Remove aoti_torch_zero_ codegen (#137371)
Summary: aoti_torch_zero_ codegen breaks AOTI FC, see discussion in D63281798.

Test Plan: CI

Differential Revision: D63916320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137371
Approved by: https://github.com/jingsh
2024-10-08 15:57:41 +00:00
c04b35a5ae [AOTI] Add standalone version of TORCH_CHECK (#136873)
Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained.

Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873
Approved by: https://github.com/albanD, https://github.com/chenyang78
2024-10-08 15:30:01 +00:00
d5493ed579 Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to run OOM on a 2xlarge with a `std::bad_alloc` memory error. Maybe this would also fix that issue (pending CI testing).
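
A generic sketch of the parametrization pattern using torch's test utilities (names and parameters here are illustrative, not the real test_lstm_packed parameters):

```python
from torch.testing._internal.common_utils import (
    TestCase,
    instantiate_parametrized_tests,
    parametrize,
    run_tests,
)

class ExampleTest(TestCase):
    # Each (bias, bidirectional) pair becomes its own test entry, so it can
    # finish, fail, or time out independently instead of inside one giant test.
    @parametrize("bias", [True, False])
    @parametrize("bidirectional", [True, False])
    def test_lstm_like(self, bias, bidirectional):
        self.assertIsInstance(bias, bool)
        self.assertIsInstance(bidirectional, bool)

instantiate_parametrized_tests(ExampleTest)

if __name__ == "__main__":
    run_tests()
```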

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-08 15:26:27 +00:00
3e2f276a14 Fix to() on non-contiguous NJTs (#137124)
Called out via torchrec integration: `lengths` is not handled properly.

Future work (not related to non-contiguous NJTs): #137275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030, #137031
2024-10-08 15:11:05 +00:00
a77bb8527c Make index check in applySelect support deferred runtime assert (#137046)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137046
Approved by: https://github.com/albanD
2024-10-08 14:31:47 +00:00
9b2e453e24 Migrate ARM64 Linux binary jobs to runner determinator (#136666)
Updates ARM64 Linux binary jobs to use the runner determinator.

Issue: pytorch/ci-infra#265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136666
Approved by: https://github.com/ZainRizvi
2024-10-08 12:14:06 +00:00
76dca1fef3 [c10d] separate the codes for GPU stream synchronization and CPU thread synchronization (#137295)
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In the longer term, we would like to give users the flexibility of bypassing the watchdog thread and handling collective errors by themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
2024-10-08 08:53:47 +00:00
9f9d252971 [FlexAttention] only calculate grads for buffers that require_grad (#137451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137451
Approved by: https://github.com/Chillee
2024-10-08 07:36:38 +00:00
59cdd8ddf1 Bump optree version to 0.13.0 to enable Python 3.13 and Python 3.13t support (#137396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137396
Approved by: https://github.com/albanD
2024-10-08 06:49:04 +00:00
493d0eeef3 Revert "Add support for cat memory planning mms with max autotune (#132554)"
This reverts commit d558ec07300defee24dd4a83ab4b387a39ea2176.

Reverted https://github.com/pytorch/pytorch/pull/132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/132554#issuecomment-2398946854))
2024-10-08 06:21:06 +00:00
8ca15e87f5 Update torchbind expecttest from landrace (#137453)
Update expecttest from torch function mode PR landrace (torch function mode changes output code slightly)

Attempted to revert the stack but there were conflicts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137453
Approved by: https://github.com/huydhn
2024-10-08 06:01:29 +00:00
bb31e3f57e Add original forward names to schema so that prettify pass works (#136887)
When we run_decomp, we retrace if it is training IR. As a result, we need to reliably store the original forward names when we run decomp.

Differential Revision: [D63064453](https://our.internmc.facebook.com/intern/diff/D63064453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136887
Approved by: https://github.com/angelayi
2024-10-08 04:21:02 +00:00
46525abb71 OpenReg: support multiple executors (#136249)
In PR https://github.com/pytorch/pytorch/pull/135646 we split the daemon into a driver and an executor; however, the current executor stands for all devices and allocates memory all together. In order to better simulate device behavior, here we support multiple executors, with each executor standing for one device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136249
Approved by: https://github.com/FFFrog, https://github.com/albanD
2024-10-08 01:37:08 +00:00
395e098209 type _dynamo/mutation_guard.py (#137350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137350
Approved by: https://github.com/Skylion007
2024-10-08 00:04:34 +00:00
52ba40c6f6 [ROCm][AOTI] add CK backend (#135641)
Companion to #134379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135641
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78

Co-authored-by: Colin Peppler <colinpeppler@meta.com>
2024-10-07 23:53:58 +00:00
2c0b11c79b forward-fix D63916220 breaking test_cutlass_backend in FBCode (#137435)
Summary: It seems like the import path is different between FBCode & OSS. Wondering how to consolidate them.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend

Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```

Differential Revision: D63991961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
2024-10-07 23:44:04 +00:00
812f286d4a Delete duplicate bindings in torch/csrc/autograd/python_torch_functions_manual.cpp (#136711)
This change deletes the duplicate binding of `torch._functionalize_mark_mutation_hidden_from_autograd()`; the other definition is here:

5c78c6b05a/torch/csrc/autograd/python_torch_functions_manual.cpp (L630-L636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136711
Approved by: https://github.com/soulitzer
2024-10-07 23:19:06 +00:00
d558ec0730 Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-07 22:49:29 +00:00
01bf350967 Fix bmm_sparse_cuda illegal memory access (#131977)
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch

ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```

## Details

In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1                <- start_idx, mid_idx
1 : 1                <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.
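
The arithmetic from the description, reproduced in Python for clarity (Python's `>>` is an arithmetic shift, like the kernel's):

```python
start_idx, end_idx = 0, -1        # end_idx has just been driven negative
mid_idx = (start_idx + end_idx) >> 1
print(mid_idx)                    # -1 -> indices_1D[mid_idx] reads out of bounds
```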

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
2024-10-07 22:47:34 +00:00
a6707a7303 [dynamo] log all graph breaks to graph_breaks logging artifact (#137244)
We were previously not logging all graph breaks (e.g. data dependent jumps) to the graph_breaks logging artifact.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137244
Approved by: https://github.com/jansel
2024-10-07 22:34:27 +00:00
a9f7b905de type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-07 21:57:06 +00:00
796c3c3415 Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221)"
This reverts commit 7e13e7dd7e5fc20c0420605aeecb0f902af5326c.

Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))
2024-10-07 21:46:13 +00:00
319eda9dfd [inductor] Add API to make post_grad_custom passes cache-able (#137298)
Summary: See https://github.com/pytorch/pytorch/issues/130772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-10-07 21:11:54 +00:00
8aa110cb00 [ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999)
This pull request enables the int_mm_error tests for rocm 6.0+ . since  #122431 landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-10-07 21:10:18 +00:00
46abaa3b0f Increase parallelnative shards to 4 (#137433)
The job times out flakily in trunk as its duration is approaching 3.5h https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=parallelnative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137433
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-10-07 21:06:34 +00:00
c87c9f0a01 [inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning (#136701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136701
Approved by: https://github.com/eellison
2024-10-07 19:47:04 +00:00
900f57216f [dynamo] Log a summary of frames Dynamo traced (#137297)
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.

Example:
```python
import torch

@torch.compile
def foo():
    x = torch.ones([10])
    def bar():
        y = x + x
        torch._dynamo.graph_break()
        z = y * x
        return z

    return bar(), bar

foo()
foo()
```

Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```

Fixes #118262.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
2024-10-07 19:44:41 +00:00
f33ffd01f2 [export] fix joint graph metadata (#136011)
Differential Revision: D62652832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136011
Approved by: https://github.com/tugsbayasgalan
2024-10-07 19:36:44 +00:00
08b84afda9 [inductor] Fix alignment hint for WorkspaceArg (#137429)
Alignment hints refer to the base ptr, not the size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137429
Approved by: https://github.com/eellison
2024-10-07 19:32:33 +00:00
fe44b6a67f Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835)"
This reverts commit 40b09edd87fcbe4e63c4db6399ec758d5c34e1b1.

Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/jovianjaison due to this pr is causing typecheck errors internally ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2397661940))
2024-10-07 18:59:41 +00:00
144665d772 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-07 18:55:26 +00:00
d255b34c0a [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-07 18:55:26 +00:00
14eabd6915 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-07 18:55:26 +00:00
68151fd288 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-07 18:55:26 +00:00
941be418d8 [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-07 18:55:26 +00:00
f9d69cde88 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-07 18:55:26 +00:00
b1fd7708bd [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-07 18:55:26 +00:00
51bc839b94 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode in polyfill which no-op enters but exits normally. This is needed because we still want to trace the with context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-07 18:55:26 +00:00
ff95ff5d38 type _dynamo/profiler.py (#137351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137351
Approved by: https://github.com/Skylion007
2024-10-07 18:54:33 +00:00
aa145dead8 [FSDP2] Fixed mistargeted backward prefetch (#137348)
If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly not propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
2024-10-07 18:10:09 +00:00
01c07e7864 Revert "[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)"
This reverts commit 8dddd456794f82db5b4e807e9aed1919d3a832da.

Reverted https://github.com/pytorch/pytorch/pull/136920 on behalf of https://github.com/drisspg due to Breaks sdpa with bias support, will switch to newer patch version when released ([comment](https://github.com/pytorch/pytorch/pull/136920#issuecomment-2397548622))
2024-10-07 17:56:57 +00:00
cyy
0c0d8c8ff0 [1/N] Fix extra warnings brought by clang-tidy-17 (#137407)
Before we can use clang-tidy-17
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137407
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-10-07 17:53:59 +00:00
ceb4ed8450 [AOTI][Tooling][10/n] Add scalar and symbolic type input debug printing support (#137323)
Summary:
- Further added more types for debug value dumping.

- Add a test case for symint inputs for the debug printer. In real prod model use cases, examining the values of "unbacked symints" (those 'u0', 's0', etc.) can be helpful.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```

Differential Revision: D63864708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
2024-10-07 17:41:40 +00:00
04e48ac562 [inductor] Refactor prefix to make it easy to create subclass of PythonWrapper (#137198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137198
Approved by: https://github.com/jansel
ghstack dependencies: #137191, #137193
2024-10-07 17:20:58 +00:00
e2b72348d0 [inductor] Reuse the subgraph if accessed via same get_attr node (#137193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137193
Approved by: https://github.com/jansel
ghstack dependencies: #137191
2024-10-07 17:20:58 +00:00
7a5eaecd92 [inductor] Correctly keep track of the graph_input_names (#137191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137191
Approved by: https://github.com/jansel
2024-10-07 17:20:53 +00:00
14b4099521 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-07 16:36:31 +00:00
33461592e2 [TLParse] Include cache hit/miss/bypass in the report name (#137282)
Makes it easier to tell at a glance.

https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp1xoGc1/index.html

<img width="398" alt="image" src="https://github.com/user-attachments/assets/7ed111cb-46d8-4442-a1b2-037d0a8decd8">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137282
Approved by: https://github.com/jamesjwu
2024-10-07 16:00:00 +00:00
4db199f15f Implement Remote AOTAutogradCache (#137278)
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.

Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency

See that it gets cache hits even with the local cache removed.

Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj

New unit tests

Reviewed By: oulgen

Differential Revision: D63323958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
2024-10-07 15:38:54 +00:00
f80ed0b831 [export] Custom op meta kernel generation (two pass) (#137277)
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi

Test Plan: followup diff (D63837739)

Differential Revision: D63837740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
2024-10-07 15:34:19 +00:00
e20e7a8c38 Fixed developer setup issue in open_registration_extension (#137355)
This PR fixes an issue where, when running `python setup.py develop`, the `open_registration_extension` self-contained example will not build due to the following:

```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
2024-10-07 15:25:37 +00:00
8c3ab21490 multiprocessing.spawn: allow a grace period when shutdown (#131278)
When one process fails, the others are immediately killed. This prevents the other processes from doing necessary cleanups or dumping debug information (in particular, the NCCL flight recorder).

This PR adds a grace period. Default behavior is unchanged.
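
A minimal spawn sketch of the failure scenario (illustrative; the grace period is configured via the new option added in this PR, whose exact parameter name is not shown here to avoid guessing):

```python
import torch.multiprocessing as mp

def worker(rank, world_size):
    # A real job would set up a process group here and register cleanup /
    # debug-dump hooks (e.g. the NCCL flight recorder).
    if rank == 1:
        raise RuntimeError(f"rank {rank} failed")
    print(f"rank {rank} finishing cleanly")

if __name__ == "__main__":
    # Previously, when rank 1 raised, the remaining workers were killed
    # immediately; the grace period gives them time to clean up first.
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```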
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
2024-10-07 12:37:34 +00:00
a063a82c8b [redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341
Approved by: https://github.com/eqy
2024-10-07 00:58:51 +00:00
d1b87e26e5 [BE] Delete empty files (#137376)
Discovered by running
```
 % find aten -type f -size 0
aten/src/ATen/native/quantized/cpu/qnnpack/wrappers/dummy.c
aten/src/ATen/native/vulkan/api/StringUtil.cpp
aten/src/ATen/native/LegacyBridge.cpp
aten/src/ATen/function_wrapper.py
aten/src/ATen/cudnn/Exceptions.h
```

Most of them were added by b774ce54f8

Remove reference to LegacyBridge.cpp from `aten_native_source_non_codegen_list`:
f42f63ee86/build_variables.bzl (L1317)

And reference to `native/quantized/cpu/qnnpack/wrappers/dummy.c` from f42f63ee86/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl (L440)
Which seems to be a bug from some ancient Android toolchain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137376
Approved by: https://github.com/kit1980, https://github.com/eqy, https://github.com/seemethere, https://github.com/jianyuh, https://github.com/Skylion007
2024-10-06 18:59:04 +00:00
0eba7e5451 Revert runtime numeric check in inductor due to increased compilation time (#137324)
Summary:
This diff reverts D63438718, which caused a latency regression on multiple models.

Test Plan: NA

Differential Revision: D63872515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137324
Approved by: https://github.com/xuzhao9
2024-10-06 05:23:24 +00:00
1dc1b85714 [export] Move swap to a different file (#137134)
Refactor so that unflattener doesn't become too messy

Differential Revision: [D63719648](https://our.internmc.facebook.com/intern/diff/D63719648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137134
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191, #137102
2024-10-06 04:28:18 +00:00
fa9cd46d12 [export] Update swap's forward function (#137102)
Downstream APS code was failing to run the previously swapped module because of some fx.GraphModule forward function weirdness (P1594789677). So to fix this, I just attached a custom forward function which matches the unflattened module's forward function.

Differential Revision: [D63683422](https://our.internmc.facebook.com/intern/diff/D63683422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137102
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191
2024-10-06 04:25:36 +00:00
52d7704b32 [export] Add optimization passes (#136191)
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).

Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.

Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
2024-10-06 04:22:42 +00:00
ad4e91acfe [fsdp2] based on device, use stream and Event (#136843)
Currently FSDP2 supports only CUDA; for other backends that need FSDP2 it won't work, since streams and events are CUDA-based. To support other backends, use `_get_device_handle` with the device type to get the device module and use it for streams and events, as sketched below.
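A hedged sketch of the device-agnostic pattern (generic illustration, not the actual FSDP2 helper):
```
import torch

def _device_handle(device_type: str):
    # e.g. torch.cuda for "cuda", torch.xpu for "xpu"
    return getattr(torch, device_type)

handle = _device_handle("cuda")
stream = handle.Stream()
event = handle.Event()
event.record(stream)
stream.wait_event(event)
```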

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
2024-10-06 04:17:47 +00:00
4061910ba2 Have Triton CPU backend respect max_autotune setting (#137276)
Previously, the Triton CPU backend would autotune regardless of the setting's value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137276
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-10-06 03:03:33 +00:00
711dacf984 [FlexAttention] Support training bias for eager (#136910)
Add training bias eager implementation, take over the original POC from https://github.com/pytorch/pytorch/pull/136076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136910
Approved by: https://github.com/Chillee
2024-10-05 19:34:57 +00:00
d073223663 turn CompilationCallbackHandler into dataclass (#137312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137312
Approved by: https://github.com/Skylion007
ghstack dependencies: #137181
2024-10-05 19:03:28 +00:00
f54e142c58 Remove references to Rockset in trymerge (#137207)
For the migration to ClickHouse

But also Rockset is not used in trymerge anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137207
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-10-05 12:53:22 +00:00
40d8260745 [ROCm] remove caffe2 from hipify (#137157)
- Remove all "MasqueradingAsCUDA" files and classes.
- Do not rename "CUDA" classes to "HIP".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157
Approved by: https://github.com/eqy
2024-10-05 12:48:54 +00:00
ca38f28543 [FlexAttention] Adjust BlockMask if reusing the one created at larger seqlen (#137255)
Fixes #136232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137255
Approved by: https://github.com/Chillee
2024-10-05 07:31:32 +00:00
4830bd0dd4 [Doc] Clarify that NaNs are not equal to each other (#137386)
Fixes https://github.com/pytorch/pytorch/issues/137337
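For reference, the behavior being documented:
```
import torch

nan = torch.tensor(float("nan"))
print(nan == nan)        # tensor(False): NaN is not equal to itself
print(torch.isnan(nan))  # tensor(True): use isnan to detect NaNs
```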

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137386
Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/kit1980
2024-10-05 06:19:12 +00:00
17718209ea fix specialization bug in unflatten + preserve_module_call_signature (#137363)
Summary: In unflatten, when we generate module calls whose signatures have been preserved, we do not pass the original constant args. This can cause strange effects; e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.

Test Plan: added a test

Reviewed By: angelayi

Differential Revision: D63913750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
2024-10-05 04:26:02 +00:00
6d0d7b6e37 [CI][BE] Restore cuda memory allocator setting (#137383)
By adding `finally:` clause at the end of the test

Might fix https://github.com/pytorch/pytorch/issues/137098#issuecomment-2389172552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137383
Approved by: https://github.com/ngimel
2024-10-05 04:16:38 +00:00
0067f586ba [audio hash update] update the pinned audio hash (#136968)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136968
Approved by: https://github.com/pytorchbot
2024-10-05 04:08:59 +00:00
4d8b845797 Fix overflow error when torch.bincount() handles a large tensor (#136745)
Fixes #136720

the result in this case says:

```
Traceback (most recent call last):
  File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module>
    result = torch.bincount(input)
             ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807
```
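A repro consistent with the traceback above (the exact values in the original issue may differ):
```
import torch

# A value of 2**63 - 1 makes the internal "max + 1" bin count overflow;
# after this fix, bincount raises the RuntimeError shown above instead.
input = torch.tensor([9223372036854775807])
torch.bincount(input)
```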

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745
Approved by: https://github.com/Skylion007
2024-10-05 04:04:48 +00:00
d6f340f66c Determine autograd engine ready queue based on InputMetadata instead of InputBuffer (#135633)
Thanks @awgu for raising this issue and the small repro

From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push to based on whichever input buffer we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
2024-10-04 23:59:46 +00:00
79562f3af8 [ROCm] Modify hipify script to work with Windows paths (#135360)
This change modifies the `hipify_python.py` script to properly detect all directories, `include` and `ignore` paths during hipification process on Windows, by changing the path syntax convention to a UNIX-like one.

Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate it by converting Windows paths to UNIX-like ones. Doing so limits the number of changes to the file. Moreover, this early-on unification allows the rest of the code to keep its battle-tested Linux-like behaviour.

Another option would be to use the `Path` object from `pathlib` to represent all paths in the script; however, that would impact a broader share of the code and hence require a more meticulous evaluation in terms of unaltered logic and edge cases.
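For illustration, the conversion itself can be as small as the following (helper name hypothetical):
```
import os

def _to_unix_path(path: str) -> str:
    # On Windows os.sep is "\\", so "C:\\foo\\bar" becomes "C:/foo/bar",
    # matching the forward-slash convention the rest of the script assumes.
    return path.replace(os.sep, "/")
```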

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2024-10-04 23:43:43 +00:00
8b6774d381 Clarify comment for error handling of dict getattr (#137381)
Just a small nit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137381
Approved by: https://github.com/malfet
2024-10-04 23:40:21 +00:00
f42f63ee86 Add option to disable operator profiling (#136838)
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720

For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time.

To disable operator profiling users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```

Test Plan: Added test case.

Differential Revision: D61883224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
2024-10-04 22:56:00 +00:00
f2d174c051 Update CODEOWNERS (#136278)
Swap @gokulavasan for @divyanshk as codeowner of torch/utils/data/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136278
Approved by: https://github.com/divyanshk, https://github.com/janeyx99, https://github.com/jansel
2024-10-04 22:42:05 +00:00
88e54de219 More nogil unsafe API fix (#137142)
Covers the PyDict APIs and confirms no update is needed for the PyModule ones.
The rest was already covered in https://github.com/pytorch/pytorch/pull/136899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137142
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-10-04 21:56:34 +00:00
e27c0048db Enable additional tests for MPS CI runs (#134356)
As part of the follow up for https://github.com/pytorch/pytorch/issues/133520, adapting existing unused tests for use in MPS CI runs. Focusing on nhwc & other memory formatting tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn
2024-10-04 21:52:38 +00:00
7c1d93944e Proper handling of arguments passed by in kwargs inside zip_schema (#137311)
If the function is `func(a, b, c)` and it is called as `func(a=1, b=.., c=..)`, before this change we would not iterate over `a`, `b`, `c`, since they appear in kwargs. This diff fixes that issue.

This function is used in `_inductor/ir.py` to iterate over custom op arguments; when a custom pass makes changes and passes arguments as kwargs, we previously did not process them.
```
        for info, arg in torch._library.utils.zip_schema(schema, args, kwargs):
            handle_aliasing_and_mutation(info, arg)
```
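A hedged sketch of the intended behavior (not the actual `torch._library.utils.zip_schema` implementation; it assumes a schema object with `.arguments` whose entries have `.name`):
```
def zip_schema_sketch(schema, args, kwargs):
    params = schema.arguments
    # Positional arguments pair up with the leading parameters...
    for info, arg in zip(params, args):
        yield info, arg
    # ...and the remaining parameters are looked up by name in kwargs,
    # so keyword-passed arguments are no longer skipped.
    for info in params[len(args):]:
        if info.name in kwargs:
            yield info, kwargs[info.name]
```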
Fix https://github.com/pytorch/pytorch/issues/137057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311
Approved by: https://github.com/zou3519
2024-10-04 21:50:31 +00:00
c0deec120f Fix resurrection logic to trigger early enough (#137267)
Fixes https://github.com/pytorch/pytorch/issues/136358

The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.

Before this PR, they have the following gc methods:
Tensor:
 - tp_clear subtype_clear
 - tp_traverse THPVariable_subclass_traverse
 - tp_dealloc THPVariable_subclass_dealloc

TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc

The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
2024-10-04 21:13:54 +00:00
bd48933323 Run docker builds on Meta account for now (#137358)
To fix
```
arn:aws:sts::391835788720:assumed-role/ghci-lf-github-action-runners-runner-role/i-096a3e2616140518b is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:us-east-1:308535385114:repository/pytorch/pytorch-linux-jammy-py3-clang18-asan because no resource-based policy allows the ecr:InitiateLayerUpload action
```
Which seems to be doing the trick see https://github.com/pytorch/pytorch/actions/runs/11185419440/job/31098258344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137358
Approved by: https://github.com/huydhn
2024-10-04 20:39:56 +00:00
7b3378a39a [FSDP2] Relaxed even sharding requirement for all-gather extensions (#137005)
This PR relaxes the even sharding requirement for the all-gather extensions.

The `fsdp_pre_all_gather` now expects signature:
```diff
def fsdp_pre_all_gather(
    self,
    mesh: DeviceMesh,
+    outer_size: torch.Size,
+    outer_stride: Tuple[int, ...],
    module: nn.Module,
    mp_policy: MixedPrecisionPolicy,
) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible to return a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as a NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this.
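A hedged sketch of an extension adopting the new signature (the padding logic is illustrative only; real extensions decide how to pad and unpad):
```
from typing import Any, Tuple
import torch
import torch.nn as nn

class MyShardedParam:
    """Sketch showing only the new hook signature and the padding idea."""

    def __init__(self, shard: torch.Tensor):
        self._shard = shard

    def fsdp_pre_all_gather(
        self, mesh, outer_size: torch.Size, outer_stride: Tuple[int, ...],
        module: nn.Module, mp_policy,
    ) -> Tuple[Tuple[torch.Tensor, ...], Any]:
        shard = self._shard.flatten()
        # With uneven sharding, the extension pads its shard so every rank
        # contributes the same number of elements to the all-gather.
        per_rank = -(-outer_size.numel() // mesh.size())  # ceil division
        if shard.numel() < per_rank:
            shard = torch.nn.functional.pad(shard, (0, per_rank - shard.numel()))
        return (shard,), None
```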

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
2024-10-04 20:34:20 +00:00
f4b415da11 type _dynamo/replay_record.py (#137183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137183
Approved by: https://github.com/Skylion007
2024-10-04 20:29:24 +00:00
6a6a8b17b8 handle state tensors in training ir path (#137240)
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.

Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`

Differential Revision: D63802576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
2024-10-04 20:23:48 +00:00
f0ef7fddde Add ignored/unmaintained comment for capture_autograd_function flag (#137309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137309
Approved by: https://github.com/jansel
ghstack dependencies: #136961
2024-10-04 20:02:37 +00:00
0878739b11 [AOTI] Add C shim for MKLDNN _convolution_pointwise (#137269)
Differential Revision: [D63875271](https://our.internmc.facebook.com/intern/diff/D63875271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137269
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-04 19:42:05 +00:00
a968576777 Add lowering for aten.searchsorted (#135701)
Adds lowering for `aten.searchsorted`. This entails:

1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors.
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.
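For reference, the eager semantics of the op being lowered:
```
import torch

sorted_seq = torch.tensor([1.0, 3.0, 5.0, 7.0])
values = torch.tensor([2.0, 5.0, 6.0])
print(torch.searchsorted(sorted_seq, values))              # tensor([1, 2, 3])
print(torch.searchsorted(sorted_seq, values, right=True))  # tensor([1, 3, 3])
```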

Closes #135873

Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
2024-10-04 19:26:05 +00:00
58ec6a360c force contiguity for all reduce (#137345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137345
Approved by: https://github.com/xmfan
2024-10-04 19:16:38 +00:00
c0a930b104 Change to export_for_training in quantize_pt2e tests (#137233)
Summary:
as title

also change it in `prepare_pt2e()` docstring

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```

Differential Revision: D63345059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
2024-10-04 18:33:02 +00:00
22e19bd2d7 Add link to torch.compile the missing manual in troubleshooting (#137301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137301
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-10-04 18:19:30 +00:00
79195b9453 [inductor] Add kwargs to bypass unexpected keyword argument error (#137329)
Summary:
I tried `TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/fbcode/profile.txt`.

TypeError: DebugAutotuner.run() got an unexpected keyword argument 'benchmark_run'

Test Plan: ci

Differential Revision: D63876103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137329
Approved by: https://github.com/muchulee8
2024-10-04 18:17:56 +00:00
d2d14d14e3 [RELAND] Fix unlift to preserve aliased constants (#137310)
Differential Revision: [D63864743](https://our.internmc.facebook.com/intern/diff/D63864743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137310
Approved by: https://github.com/avikchaudhuri
2024-10-04 18:15:52 +00:00
8b9cbf22c2 Enable regression test for add loop benchmarks (#136573)
The red dotted line is 1.5

<img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517">

The expected value is taken from the average.
<img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573
Approved by: https://github.com/ezyang
2024-10-04 18:12:08 +00:00
ad240018f2 [PT2][Inductor][Reliability] Add back unit test for pad_mm with BF16 (#137231)
Summary: We add back the unit test for the recently added pad_mm pattern in customized Optimus (D63040455), which resolves the long-computation-kernel issue for BF16 on A100.

Test Plan:
```
buck2 test mode/opt //caffe2/test/inductor:pad_mm -- test_pad_mm_bf16
```

Buck UI: https://www.internalfb.com/buck2/4dd4c90c-4a2a-4859-923c-a4008f78a1cd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9851624237127136
Network: Up: 100KiB  Down: 4.3GiB  (reSessionID-87f11454-d920-47af-9af5-39ca0572b7c6)
Jobs completed: 7079. Time elapsed: 3:34.3s.
Cache hits: 99%. Commands: 7061 (cached: 7024, remote: 19, local: 18)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D63794727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137231
Approved by: https://github.com/henrylhtsang
2024-10-04 17:49:55 +00:00
b2979f4382 Allow autocast in training ir export (#137287)
Summary: hardcode "val" field for autocast (similar to set_grad_enabled), to bypass the verifier check.

Test Plan: CI

Differential Revision: D63345767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137287
Approved by: https://github.com/angelayi
2024-10-04 17:38:51 +00:00
42adadf2f2 [aotinductor] enable CUTLASS backend (#134379)
### Context
This PR allows CUTLASS kernels usage in AOTI. It does this by:
* For any CUTLASS kernels that win during autotuning, compile them as a .so & .o
* When creating the final model .so, link all the CUTLASS kernels .o files
* Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel)

### Example
https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549

```
TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134379
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-04 17:32:41 +00:00
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
cyy
67908e9111 Enable clang-tidy on torch/csrc/distributed/rpc (#137320)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137320
Approved by: https://github.com/Skylion007
2024-10-04 15:34:05 +00:00
15c3479db7 [AOTI] Fix _scaled_mm ABI-compatible codegen (#137132)
Summary: Similar to https://github.com/pytorch/pytorch/pull/137008, but for supporting _scaled_mm in the ABI-compatible mode.

Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137132
Approved by: https://github.com/chenyang78
ghstack dependencies: #137008
2024-10-04 14:05:18 +00:00
5d24ea81d3 [AOTI] Fix cpp wrapper codegen for _scaled_mm (#137008)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136209. Because _scaled_mm has an out variant, the generated cpp fallback call should call _scaled_mm_out. The ABI-compatible mode needs more work.

Differential Revision: [D63757728](https://our.internmc.facebook.com/intern/diff/D63757728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137008
Approved by: https://github.com/hl475
2024-10-04 14:02:46 +00:00
f56f7476d3 Revert "Add meta functions for lerp, addcmul, and addcdiv. (#136909)"
This reverts commit e4b98b11493914769d15ca8b124c0b5fa1fdd364.

Reverted https://github.com/pytorch/pytorch/pull/136909 on behalf of https://github.com/albanD due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/136909#issuecomment-2393774694))
2024-10-04 14:01:54 +00:00
cd17b2645c Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit a93d3873e97973fbc0009245579ee4e4fa7f155a.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))
2024-10-04 13:58:57 +00:00
5509207543 Revert "[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)"
This reverts commit 592e3a3d4069029946ec4c8d103a313806c53a88.

Reverted https://github.com/pytorch/pytorch/pull/136331 on behalf of https://github.com/albanD due to Breaks aarch64 builds, see link below ([comment](https://github.com/pytorch/pytorch/pull/136331#issuecomment-2393760135))
2024-10-04 13:52:37 +00:00
e80f47fb4d Pass special arguments to user-defined Triton kernels if required (#137236)
Summary:

Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](762a7d197c/python/triton/runtime/autotuner.py (L106)) in the Triton code includes those special configs (names and values) among the potential arguments to the kernel. [Here](762a7d197c/python/triton/runtime/jit.py (L613)) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters.

This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves #136550.
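A hedged sketch of a user-defined kernel relying on this (whether the special parameters may be used as plain runtime values or must be `tl.constexpr` depends on the Triton version):
```
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 128}, num_warps=4, num_stages=3)],
    key=[],
)
@triton.jit
def scale_by_warps(x_ptr, out_ptr, n_elements, num_warps, BLOCK: tl.constexpr):
    # Because `num_warps` is named in the parameter list, the value chosen by
    # the autotuner (4 here) is passed in as a regular kernel argument.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * num_warps, mask=mask)
```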

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236
Approved by: https://github.com/zou3519
2024-10-04 07:36:55 +00:00
cyy
6327a71880 [Environment Variable][2/N] Use thread-safe setenv wrapper (#124485)
This follows #119449 to make setenv thread-safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485
Approved by: https://github.com/eqy
2024-10-04 07:30:51 +00:00
6dcd773c57 [export] clean up dynamic markers from tensors (#137230)
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.

I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.

We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.
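A hedged sketch of the cleanup idea (the attribute list is illustrative; the actual set of markers is defined by dynamo's decorators):
```
_DYNAMO_MARKERS = ("_dynamo_dynamic_indices",)  # illustrative, not exhaustive

def _clean_dynamic_markers(tensors):
    # Remove the marker attributes export attached to user-provided tensors
    # so they do not leak out of the export call (and stay picklable).
    for t in tensors:
        for attr in _DYNAMO_MARKERS:
            if hasattr(t, attr):
                delattr(t, attr)
```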

Test Plan: test_export

Differential Revision: D63773534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
2024-10-04 06:50:45 +00:00
a408cfcbf1 [torch.compile] torch.vmap supports dynamic shapes + enable flex attention create_block_mask dynamic shapes (#137163)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137163
Approved by: https://github.com/Chillee
2024-10-04 05:16:04 +00:00
40b09edd87 Add back DistributedDataParallel types that were lost when pyi was removed (#136835)
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
2024-10-04 04:44:20 +00:00
97634e4f82 Rollout infra for executorch migration to training IR (#132703)
Title

Differential Revision: [D60432217](https://our.internmc.facebook.com/intern/diff/D60432217/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132703
Approved by: https://github.com/tarun292
2024-10-04 04:33:08 +00:00
f500cb43bb Fix torch.library.register_vmap (#137306)
We didn't support multiple levels of vmap. The main problem is that, during the batching rule, we need to exclude the vmap dispatch key (FuncTorchBatched), the same way our C++ batching rules do.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
2024-10-04 03:46:35 +00:00
cfc51c858a type _dynamo/callback.py (#137181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137181
Approved by: https://github.com/Skylion007
2024-10-04 03:28:52 +00:00
9670e9e5b0 Revert "Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)"
This reverts commit 4f93de895138cc3cb8c4383b480a2d0ecf407e1b.

Reverted https://github.com/pytorch/pytorch/pull/136899 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136899#issuecomment-2392721534))
2024-10-04 03:28:31 +00:00
e4b98b1149 Add meta functions for lerp, addcmul, and addcdiv. (#136909)
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was the root cause of significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA backend. Running the meta functions resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923
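For illustration, the affected ops on the meta device perform shape/dtype-only computation:
```
import torch

a = torch.empty(1024, device="meta")
b = torch.empty(1024, device="meta")
c = torch.empty(1024, device="meta")
print(torch.lerp(a, b, 0.5).shape)   # torch.Size([1024])
print(torch.addcmul(a, b, c).shape)  # torch.Size([1024])
print(torch.addcdiv(a, b, c).shape)  # torch.Size([1024])
```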

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
2024-10-04 02:47:25 +00:00
a1f1f585ab clean up error_on_nested_jit_trace flag (#136961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136961
Approved by: https://github.com/jansel
2024-10-04 02:07:54 +00:00
d32696249a [IntraNodeComm] fix a race condition in one-shot all-reduce (#137257)
One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.

Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257
Approved by: https://github.com/Chillee
2024-10-04 01:41:14 +00:00
3d3b394e94 [MTIA](3/n) Implement CPU pins functions for MTIA hooks (#137283)
Summary: Link CPU pins function in MTIA hooks to the host allocator implementation

Test Plan:
signals
unit test in D63709424

Differential Revision: D63352770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137283
Approved by: https://github.com/egienvalue
2024-10-04 01:26:21 +00:00
15e127bc3b [numpy2.0 compat] Fix test_parse_numpy_int_overflow for NumPy 2.0 (#137135)
NumPy now throws an OverflowError when trying to create np.uint64(-1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137135
Approved by: https://github.com/Skylion007
2024-10-04 01:21:12 +00:00
13ec343afe clean up capture_func_transforms flag (#136960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136960
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel
2024-10-04 01:10:52 +00:00
6b9b2a126e Build clang18 image for ASAN tests (#128763)
Use the latest clang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128763
Approved by: https://github.com/malfet
2024-10-04 00:53:56 +00:00
a93d3873e9 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
2024-10-04 00:44:02 +00:00
88e338f4dd [AOTI] Add C shim for MKLDNN _linear_pointwise (#136999)
Differential Revision: [D63851216](https://our.internmc.facebook.com/intern/diff/D63851216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136999
Approved by: https://github.com/leslie-fang-intel, https://github.com/chenyang78, https://github.com/hl475
2024-10-04 00:35:10 +00:00
57c02e5a00 [BE] Use helper functions in mps_extension (#137313)
This should be a no-op change, i.e. it runs the same code, but replaces verbose ObjectiveC invocation with helper function from OperationUtils.h, which this example already depends on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313
Approved by: https://github.com/atalman
2024-10-04 00:26:38 +00:00
bc916a5537 [easy] for test_ck_backend enable RE & activate remaining tests for FBCode (#137305)
Differential Revision: D63859208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137305
Approved by: https://github.com/muchulee8, https://github.com/chenyang78
2024-10-04 00:22:35 +00:00
cyy
60d19cb59e Enable clang-tidy on torch/csrc/distributed/autograd/* (#137180)
Enable clang-tidy on `torch/csrc/distributed/autograd/*` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137180
Approved by: https://github.com/Skylion007
2024-10-03 23:49:23 +00:00
7e13e7dd7e Disallow FakeTensor.data_ptr access in eager mode (#137221)
Previously we raised a deprecation warning (beginning PyTorch 2.4). Now
that we are on 2.6, we're completing the deprecation and disallowing
this behavior.
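Roughly, the now-disallowed pattern (error behavior per this PR; a sketch, not verified output):
```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    t = torch.empty(4)   # a FakeTensor
    t.data_ptr()         # previously a deprecation warning, now an error in eager
```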

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137221
Approved by: https://github.com/albanD, https://github.com/eellison
2024-10-03 23:47:55 +00:00
cfcd0e1fe9 [ONNX] Update the faketensor documentation (#137292)
Update the faketensor documentation to reflect current usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137292
Approved by: https://github.com/shubhambhokare1, https://github.com/sdpython
2024-10-03 23:27:11 +00:00
4096ed7dc2 Migrate to training ir in quantization_pt2e_qat unittests (#137232)
Summary: Change capture_pre_autograd_graph to export_for_training in unit tests.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Reviewed By: tugsbayasgalan

Differential Revision: D63336660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137232
Approved by: https://github.com/angelayi
2024-10-03 22:57:04 +00:00
b44f25e1ba [CI] Move s390 binary build to its own workflow (#137304)
It was added by https://github.com/pytorch/pytorch/pull/125399 and takes 3 hours to finish
Considering the limited number of runners, it often causes queueing; see:
<img width="402" alt="image" src="https://github.com/user-attachments/assets/5c67c1d6-af4c-4453-a089-aa1174513bfa">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137304
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/atalman
2024-10-03 22:31:36 +00:00
54094c0c26 [inductor][user triton] Check size hints to determine indexing dtype (#137234)
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.

This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.

Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
2024-10-03 22:07:26 +00:00
c83178d894 Change to export_for_training in XNNPACK tests (#137238)
Summary: as title

Test Plan: CI

Differential Revision: D63344674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137238
Approved by: https://github.com/tugsbayasgalan
2024-10-03 21:28:05 +00:00
ce14f1f0c9 [aoti] Accept constant inputs (#137197)
Fixes https://fb.workplace.com/groups/1028545332188949/posts/1056788036031345/?comment_id=1056790162697799&reply_comment_id=1057501845959964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137197
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire, https://github.com/hl475
2024-10-03 20:59:33 +00:00
eqy
46f158bfbc [cuDNN] Check shapes during graph capture in cuDNN CTCLoss (#130071)
Found out from #125952 about the existence of `_assert_async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130071
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-03 20:10:28 +00:00
592e3a3d40 [PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes https://github.com/pytorch/pytorch/pull/127488 . Includes https://github.com/pytorch/executorch/pull/5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136331
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #136445
2024-10-03 18:18:37 +00:00
c8a7da305b [PyTorch] Add attribute version of C10_ALWAYS_INLINE (#136445)
Sometimes (such as on a lambda), you need `__attribute__((always_inline))` but not `inline`.

Differential Revision: [D63266917](https://our.internmc.facebook.com/intern/diff/D63266917/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136445
Approved by: https://github.com/malfet
2024-10-03 18:18:37 +00:00
525f6715bc Revert "Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)"
This reverts commit f96020c246aec8514b945d530879635a03294f70.

Reverted https://github.com/pytorch/pytorch/pull/137162 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but many jobs are failing with NameError: name _recursive_getattr is not defined + a Lint job fails ([comment](https://github.com/pytorch/pytorch/pull/137162#issuecomment-2392036062))
2024-10-03 18:17:56 +00:00
c7714b8d8d [FR] Fix duplicate output for the case when not all ranks join on collective (#137256)
As title: when testing on an internal case, we found that we produce very similar output for the error when certain ranks do not join a collective. This is because we didn't put all ranks into `candidate_ranks`, so they didn't get wiped out from the entries and got checked again.

Ideally, for the given case, we should report this as an out-of-order case, because ranks 0 and 1 call all-to-all while all the remaining ranks call all-gather-base. But when we select entries to compare, we don't have a global view of the entries.

In this specific case, ranks 0 and 1 have a collective of PG 7 on entry 1130 with seq ID = 1130, while the other ranks have a collective of PG 0 on entry 1130 with seq ID = 2. It's hard to use the entry index to do the match, because once we later consider P2P this assumption will collapse, so for now we still defer it to users or to further-downstream debugging to figure out. To make the message clearer, I also include both the seq ID and the record_id (aka the entry index) in the message. (That does not mean this is impossible to implement in the code; for example, we could subtract from each record_id the maximum P2P seq ID before it. But users will easily see the wrong order, so we don't think that logic is necessary now.)

P1626755348

Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
2024-10-03 18:06:45 +00:00
adc48a5b52 Python CAPI cleanup (#137266)
This is unrelated to anything else, but as I was going through the code, fixing bad patterns and a refcount bug (which is unlikely to cause any real issue tbh)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137266
Approved by: https://github.com/Skylion007
2024-10-03 17:55:48 +00:00
8bb8c3997b [inductor] parallel compile: add import of thread_safe_fork for internal (#137155)
Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs. It looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`"pytorch/remote_cache:autotune_memcache_version`), but many are read by the triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import `thread_safe_fork`, which installs functionality to destroy some singletons before forking and re-enable them after. This approach works for the failing workload.

Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.

Differential Revision: D63736829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison
2024-10-03 17:37:21 +00:00
f96020c246 Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)
When we populate the unlifted graph module, we actually only "unlift" constant tensor inputs, which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two. This PR fixes that by querying the `ep.constants` table instead of `ep.graph_signature.lifted_tensor_constants`.

Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
2024-10-03 17:28:53 +00:00
4d3c0fc061 [AOTAutogradCache] add config for AOTAutograd remote cache (#137011)
Summary: This just adds a config option and JK for turning on the remote AOTAutogradCache. It does not implement anything with the new options being passed in; that will come in the next diff.

This PR also changes the flag for turning on the local AOTAutogradCache to be more consistent with that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE

Test Plan: Existing tests should pass and should build

Reviewed By: oulgen

Differential Revision: D63321965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
2024-10-03 16:03:47 +00:00
a569a8ac4c type _dynamo/external_utils.py (#137185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137185
Approved by: https://github.com/Skylion007
2024-10-03 15:18:53 +00:00
b6cb174816 Fix serialization for torch.uint16, torch.uint32, torch.uint64 (#137184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137184
Approved by: https://github.com/albanD
2024-10-03 14:56:11 +00:00
89b7a5d128 Implement AcceleratorHooksInterface's virtual functions deviceCount() and getCurrentDevice() for CUDA and XPU (#136752)
Fixes #136751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136752
Approved by: https://github.com/albanD
2024-10-03 14:44:58 +00:00
63bbf712d8 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-03 13:13:48 +00:00
38114ec860 [async-tp] fix a race condition that can cause silent correctness issue (#137199)
Details described in https://github.com/pytorch/pytorch/issues/137171:

![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b)

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:
- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137199
Approved by: https://github.com/weifengpy
2024-10-03 10:42:37 +00:00
f166d62764 Avoid __ne__ weakref comparison and use identity instead in cache_size.py (#135000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135000
Approved by: https://github.com/anijain2305
2024-10-03 07:43:58 +00:00
bd9517c1ee cond_batch_rule with boolean pred (#135009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135009
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel, https://github.com/zou3519
2024-10-03 07:43:30 +00:00
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 70019074806920f95976fedad775d7570294f635.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
87bf2a8428 [compiled autograd] initialize cudagraph tls from context manager (#136735)
FIXES https://github.com/pytorch/pytorch/issues/126934. Cudagraphs TLS is initialized on module import, but compiled autograd codepaths might not import it. This causes problems because autograd/compiled autograd will restore TLS state, and in this case will restore the TLS to an uninitialized state

Should fix flaky cudagraph tests: https://github.com/pytorch/pytorch/issues/131663, https://github.com/pytorch/pytorch/issues/132108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136735
Approved by: https://github.com/BoyuanFeng, https://github.com/eellison
ghstack dependencies: #136059
2024-10-03 06:22:11 +00:00
b86269fab5 Unify cpp_extension build directory removal (#136059)
Keeps the existing default directory-clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To clear properly, we'd need to track all the folders we compiled the extensions to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-03 06:22:11 +00:00
55c343fa3a [DTensor] Register replication strategy for a few upsampling interpolate ops (#137201)
To unblock Llama 3.2 vision's use case to resize positional embeddings for fine-tuning. Context in [workplace post](https://fb.workplace.com/groups/319878845696681/permalink/1271172040567352/).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137201
Approved by: https://github.com/XilunWu
2024-10-03 03:45:39 +00:00
84cac3585d Move _is_static_problem to mm_common (#137150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137150
Approved by: https://github.com/eellison
2024-10-03 02:55:43 +00:00
5c0ce8d0a6 Skip Flaky Test: for #134602 (#137226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137226
Approved by: https://github.com/cpuhrsch
2024-10-03 01:53:59 +00:00
b3953ff25e [inductor] Reduce block sizes when using Triton CPU backend (#136612)
This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612
Approved by: https://github.com/desertfire
ghstack dependencies: #135342
2024-10-03 01:48:32 +00:00
4513fb5c53 [Inductor] Use parametrize to break down some unit tests (#137156)
Summary: To address the issue that some tests are marked as slow, see https://github.com/pytorch/pytorch/issues/136940#issuecomment-2387227598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137156
Approved by: https://github.com/eellison
2024-10-03 01:43:36 +00:00
7631a04081 [c10d] Fix the device query story of ProcessGroup (#136790)
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.

A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`

The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.
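A hedged illustration of the intended split (both helpers are private; their module path and exact signatures are assumed here):
```
import torch.distributed as dist
from torch.distributed.distributed_c10d import (  # assumed location
    _get_object_coll_device,  # device used for object collectives / barrier
    _device_capability,       # plain list of devices the PG backend supports
)

pg = dist.group.WORLD
device = _get_object_coll_device(pg)
supported = _device_capability(pg)
```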

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
2024-10-03 01:36:22 +00:00
cd5d1fe015 unflatten with specialized graphs per submodule call (#137013)
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.

```
    class N(torch.nn.Module):
        def forward(self, x, b):
            if b:
                return x + 1
            else:
                return x + 2

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.n = N()

        def forward(self, x):
            x0 = x + 3
            x1 = self.n(x0, True)
            x2 = x1 + 4
            x3 = self.n(x2, False)
            return x3 + 5

    m = M()
    inp = (torch.ones(1),)
    print(m(*inp))  # tensor([16.])
    ep = torch.export.export(m, inp)
    print(ep.module()(*inp))  # tensor([16.])

    unflattened = torch.export.unflatten(ep)
    print(unflattened(*inp))  # tensor([15.])
```

However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.

The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.

However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.

A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.
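Continuing the example above, after this change the two calls to `self.n` get their own specialized submodules (expected names per the scheme described here):
```
unflattened = torch.export.unflatten(ep)
print(unflattened(*inp))  # tensor([16.]), now matching eager
print([name for name, _ in unflattened.named_modules()])
# expected to include both 'n' and a sibling such as 'n@1'
```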

Differential Revision: D63642479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
2024-10-03 00:55:44 +00:00
6241006c28 Fix dependency on filesystem on Linux (#137209)
Similar to: https://github.com/pytorch/pytorch/pull/134494
We are seeing a comeback of https://github.com/pytorch/pytorch/issues/133437 due to the use of filesystem on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137209
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-10-03 00:18:28 +00:00
235f7e06f4 [CI] upload_metrics function to upload to s3 instead of dynamo (#136799)
* Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo
* Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier
* Removes the dynamo_key field since it is no longer needed
* Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore
* Moves s3 resource initialization behind a function so import is faster
---
Tested by emitting a metric during run_test and seeing that documents got added to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799
Approved by: https://github.com/ZainRizvi
2024-10-02 23:19:28 +00:00
2c9e194e23 Revert "[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)"
This reverts commit b50b3b32191e7192a28c54a417891f24df4e4dda.

Reverted https://github.com/pytorch/pytorch/pull/135955 on behalf of https://github.com/PaliC due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/135955#issuecomment-2389810936))
2024-10-02 22:46:31 +00:00
bb03ef7aca [FlexAttention] Fix max-autotune when captured buffers are View nodes (#137204)
## Summary

Originally reported in https://github.com/pytorch-labs/attention-gym/issues/45

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137204
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2024-10-02 22:19:33 +00:00
759cd73adb [Profiler] Update Kineto Submodule (#137137)
Summary: Updating commits from Aug 7, 2024 to Sep 26, 2024

Test Plan: Phabricator + OSS CI

Differential Revision: D63723255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137137
Approved by: https://github.com/aaronenyeshi
2024-10-02 22:19:28 +00:00
e9e5d767b6 [inductor] Fix build_paths usage in config.py (#137187)
Summary: In https://github.com/pytorch/pytorch/pull/136234 we accidentally used the old version of `build_paths`, but in https://github.com/pytorch/pytorch/pull/136952 the API slightly changed. This diff addresses that issue by updating the API usage.

Test Plan: CI

Reviewed By: ColinPeppler

Differential Revision: D63764809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137187
Approved by: https://github.com/ColinPeppler
2024-10-02 22:06:02 +00:00
e95b230fd8 Fix NJT serialization (#137031)
Fixes #129366

Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.

This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring that both contiguous and non-contiguous NJTs work correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
2024-10-02 21:41:35 +00:00
eqy
be423a8480 [SDPA] Bump grad_query fudge factor for Flash Attention (#135711)
Tolerance issue for small GPUs, e.g. A16, A2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135711
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-10-02 21:35:00 +00:00
36fb342ffd Check for fused kernel before inplace update (#137042)
Summary:
Given an op with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of the input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion and skips inlining in that case.

Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.

Fixes #120217

Here's a diagram of the issue:
![Inline+Fusion](https://github.com/user-attachments/assets/c03308d8-fdbf-40a0-a46d-964ece5f9e6d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
2024-10-02 21:14:34 +00:00
a3f3773477 Make PT2E work with both IR simultaneously (#135769)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Differential Revision: D62449830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135769
Approved by: https://github.com/angelayi
2024-10-02 21:05:22 +00:00
4a9225fa1f improve get_schedule_class() (#137103)
Small change to make `get_schedule_class()` accept case-insensitive schedule names
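
A minimal sketch of the idea (the registry contents below are hypothetical and used only for illustration): normalize the requested name before looking it up, so "GPipe", "gpipe", and "GPIPE" all resolve to the same schedule.

```
# Hypothetical registry; the real schedules live in torch.distributed.pipelining.
_SCHEDULES = {
    "gpipe": "ScheduleGPipe",
    "1f1b": "Schedule1F1B",
}

def get_schedule_class(name: str):
    key = name.lower()  # case-insensitive lookup
    try:
        return _SCHEDULES[key]
    except KeyError:
        raise ValueError(f"Unknown schedule: {name}") from None

print(get_schedule_class("GPipe"))  # -> ScheduleGPipe
```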

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137103
Approved by: https://github.com/kwen2501
2024-10-02 20:08:25 +00:00
2d465e4d1d [non ghstack] Init threadpool with user defined num_threads before default (#137051)
Very similar to https://github.com/pytorch/pytorch/pull/136793, but adds back the `pool->set_thread_count` call, as it is still necessary (I am guessing due to the mutex)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137051
Approved by: https://github.com/albanD
2024-10-02 20:02:30 +00:00
59d7cf7342 Add _dynamo.config inline_inbuilt_nn_modules and specialize_float logging (#137139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137139
Approved by: https://github.com/ezyang
2024-10-02 19:58:38 +00:00
2b329d3bf1 Fix typo in _normalize ref (#137079)
I think this should basically make no difference numerically, but it does have some ramifications on things like CSE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137079
Approved by: https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049, #137065
2024-10-02 19:06:48 +00:00
6374a19a6e Fix wrapper subclass serialization with custom sizes / strides (#137030)
Fixes #130154

This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.

The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.

Note: I originally tried using a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
2024-10-02 18:55:03 +00:00
8962610247 [BE][clang-format] make macro PyObject_HEAD_INIT(type) and PyVarObject_HEAD_INIT(type, size) have its own line (#136949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949
Approved by: https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #136945
2024-10-02 18:39:22 +00:00
89c37be6b7 [BE][clang-format] make macro PyObject_HEAD have its own line (#136945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136945
Approved by: https://github.com/albanD
2024-10-02 18:39:21 +00:00
54f50f19eb [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-02 18:00:23 +00:00
4559cddaf9 Revert "Add py3.13t linux wheel build (#137127)"
This reverts commit 6b7adc12140d3073c5700cc1c48998556489857e.

Reverted https://github.com/pytorch/pytorch/pull/137127 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but 2 jobs are failing ([comment](https://github.com/pytorch/pytorch/pull/137127#issuecomment-2389250791))
2024-10-02 17:44:42 +00:00
b50b3b3219 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-02 17:26:45 +00:00
c318bafe9c [inductor mkldnn test][BE] Use parametrize to shorten test run time (#137153)
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish, so split them into smaller tests using `parametrize`.

I guess this means this test file has some refactoring opportunities as well. A next step would be to parametrize the add functions (see the generic sketch below).
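
As a rough illustration of the pattern (a generic example, not the actual mkldnn tests), `parametrize` turns one long test into many small cases that the runner can schedule and shard independently:

```
import torch
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class TestPatterns(TestCase):
    # Each (dtype, inplace) combination becomes its own test case.
    @parametrize("dtype", [torch.float32, torch.bfloat16])
    @parametrize("inplace", [True, False])
    def test_unary_fusion(self, dtype, inplace):
        x = torch.randn(8, dtype=dtype)
        y = x.relu_() if inplace else x.relu()
        self.assertEqual(y.dtype, dtype)

instantiate_parametrized_tests(TestPatterns)

if __name__ == "__main__":
    run_tests()
```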

Differential Revision: D63723925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
2024-10-02 17:20:27 +00:00
466623fb51 [CI] Support for CI GPU test and benchmark on containers (#137169)
Renames the arc references to container, and adds the changes required so that CI jobs needing a GPU can run on containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137169
Approved by: https://github.com/huydhn
2024-10-02 17:10:59 +00:00
e3fd4d796f [CI] Skip sccache for nvcc builds when building for A100 (#137170)
There is an unknown issue with nvcc builds and sccache; the build crashes with:

```
      /opt/cache/bin/sccache /usr/local/cuda-12.1/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_py_EXPORTS -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/asmjit/src -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/cpuinfo/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.1/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -MD -MT CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o -MF CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o.d -x cu -c /tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/src/sparse_ops/sparse_index_select.cu -o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o
      sccache: error: failed to execute compile
      sccache: caused by: error reading compile response from server
      sccache: caused by: Failed to read response header
      sccache: caused by: failed to fill whole buffer
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137170
Approved by: https://github.com/huydhn
2024-10-02 17:07:24 +00:00
d4cf90d282 [BE] [CI] Skip clean gha workspace if CI is running in a container for checkout-pytorch (#137168)
For the reusable action checkout-pytorch, this skips cleaning the workspace when running from a container environment.

The motivation for this change is twofold:
* There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated;
* In the specific case of GITHUB_WORKSPACE, to enable sharing it between multiple containers, it needs to be mounted with `-v`. This prevents the possibility of running `rm -r` and deleting this mount path;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168
Approved by: https://github.com/huydhn
2024-10-02 17:04:50 +00:00
af3e25fea7 remove capture_autograd_function flag (#136959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136959
Approved by: https://github.com/jansel
2024-10-02 16:59:19 +00:00
bcaa0f5ee9 [CI] Remove nanogpt from perf smoke test (#137176)
Summary: nanogpt's performance is not stable. Remove it from the perf smoke test. We may want to use another test instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137176
Approved by: https://github.com/eellison
2024-10-02 16:35:04 +00:00
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
a954a9ea75 [Inductor] External callable registration API for Matmul tuning candidates (#130774)
Fixes #[130769](https://github.com/pytorch/pytorch/issues/130769)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130774
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-10-02 15:38:10 +00:00
af86a6fdba [dynamo][user-defined-class] Fallback when object.__new__ fails (#137044)
Seen in https://github.com/vllm-project/vllm/pull/8949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137044
Approved by: https://github.com/jansel
2024-10-02 14:15:36 +00:00
d29094888b Use torch.Stream&torch.Event for Dynamo capature (#134850)
# Motivation
This PR aims to solve the multiple inheritance problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134850
Approved by: https://github.com/yf225, https://github.com/EikanWang
2024-10-02 14:15:33 +00:00
bf73af4b4e dont let partitioner think it can fuse pointwise ops into user triton kernels (#136878)
Previously if we had a graph like:
```
        triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
        getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
        getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
        sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
        mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```

The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:

(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)

(2) if the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)

Reviewed By: embg

Differential Revision: D63551393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
2024-10-02 13:52:44 +00:00
5c2c3ca10b [Inductor] Fix test_conv2d_unary_cpu_cpp_wrapper failure (#137158)
Summary: test_conv2d_unary_cpu_cpp_wrapper is failing on ciflow/slow because of mis-handling of inf. This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137158
Approved by: https://github.com/chenyang78
2024-10-02 13:21:35 +00:00
d117ec1d6e [3/3][Inductor] Make CK work in FBCode (#136234)
Summary:
# Context
Goal: Enable CK for Inductor in FBCode

We split this stack into three diffs to help with review & in case we need to revert anything.

# This Diff
* Gets us to have CK kernels as an option for GEMM autotuning in Inductor.

Reviewed By: zjing14

Differential Revision: D62662705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-02 12:17:38 +00:00
6b7adc1214 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-02 11:59:33 +00:00
8c29a0dd0e [pipelining] Clean up dead code (#136804)
The 'set_requires_grad' dict appears to always be full of "False" values,
and we always set requires_grad based on the value of 'has_backward'.

Setting the requires_grad field was being done repeatedly during
get_fwd_recv_ops, but it only needs to happen once, so move it to the
function that creates the recv buffers in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804
Approved by: https://github.com/kwen2501
2024-10-02 11:26:31 +00:00
cyy
862029a1ef [Distributed] [15/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#137072)
Follows  #136848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137072
Approved by: https://github.com/kwen2501
2024-10-02 10:56:15 +00:00
ed02309232 type _dynamo/create_parameter_op.py (#136958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136958
Approved by: https://github.com/jansel
2024-10-02 10:23:37 +00:00
52d29a2b94 [reland #136389] Skip kernel saving if already existed (#137073)
Summary:
We skip save_gpu_kernel if the kernel has already been saved. This
gives us a more accurate Triton profiling result. The following traces
show before/after the change when benchmarking a trivial addmm:

Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">

After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">

We can see that before the change, the benchmarking includes two parts:
   (1) The overhead of our triton_heuristic call, which includes the
   save/get and the (expensive) hash computation.
   (2) The exact computation of the Triton kernel.

   We see that (1) accounts for >50% of the time, which makes kernel selection
   during profiling choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py


Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073
Approved by: https://github.com/desertfire
2024-10-02 09:27:08 +00:00
e374d6850a [distributed][test] Remove unused variable and fix doc typo (#136943)
Refactor distributed test code:
- Fix TODO: Remove unused variable
- Fix doc typo
- Migrate deprecated method calls `load_state_dict` and `save_state_dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136943
Approved by: https://github.com/H-Huang
2024-10-02 08:31:53 +00:00
e9a55b43a1 [inductor] Support lists of tensors in operatorbench (#136911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136911
Approved by: https://github.com/eellison
2024-10-02 06:41:06 +00:00
a89e3c2490 Add compiled_autograd_kwargs_override Dynamo config (#136967)
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).

With `torch._dynamo.compiled_autograd=True`, we previously were not able to set a different `fullgraph` config value for the forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.
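
A hedged usage sketch, assuming `torch._dynamo.config.compiled_autograd` is the flag referenced above and the override is supplied as a kwargs dict (the exact shape of the config value may differ from the final API):

```
import torch

torch._dynamo.config.compiled_autograd = True
# Force fullgraph=True only for the compiled autograd (backward) trace,
# while the forward compile below keeps fullgraph=False.
torch._dynamo.config.compiled_autograd_kwargs_override = {"fullgraph": True}

model = torch.nn.Linear(4, 4)
compiled = torch.compile(model, fullgraph=False)
out = compiled(torch.randn(2, 4))
out.sum().backward()
```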

With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
2024-10-02 06:23:59 +00:00
b51d22b8bb [BE] [NEON] Use vshlq_n_u32 instead of vshlq_u32 (#137122)
As the compiler optimizes it away anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137122
Approved by: https://github.com/kit1980
2024-10-02 06:18:11 +00:00
2854d157de Add type annotations for higher order ops/flex_attention (#137065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137065
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049
2024-10-02 04:39:25 +00:00
3b8511dadf Remove python 3.8 from triton builds (#137141)
All jobs have switched to Python 3.9, so these 3.8 builds are no longer necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137141
Approved by: https://github.com/albanD
2024-10-02 03:36:54 +00:00
8e39f2a4a5 [Inductor] Enable a SDPA pattern matching for CUDA (#137085)
Summary: Fixes https://github.com/pytorch/pytorch/issues/122429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137085
Approved by: https://github.com/eellison
2024-10-02 03:22:11 +00:00
18525e185e Fix rendezvous error due to EtcdStore get method not waiting in some cases (#137056)
Fixes #132950

This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch.
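
A minimal sketch of the intended behavior (the real fix falls through to the etcd watch rather than polling; `read_prefix` is a hypothetical stand-in for the etcd read call):

```
import time

def wait_for_keys(read_prefix, keys, timeout, poll=0.5):
    # read_prefix() returns a dict of key -> value and may raise KeyError,
    # analogous to etcd.EtcdKeyNotFound when nothing has been written yet.
    deadline = time.monotonic() + timeout
    while True:
        try:
            found = read_prefix()
        except KeyError:
            found = {}  # prefix not created yet: keep waiting instead of erroring out
        if all(k in found for k in keys):
            return found
        if time.monotonic() >= deadline:
            raise LookupError(f"timed out waiting for keys: {keys}")
        time.sleep(poll)
```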

This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).

We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly.

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056
Approved by: https://github.com/fduwjj

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>
2024-10-02 01:45:00 +00:00
f108f88c40 [logging/debugging] handle None (constant) args in debug log (#137032)
Summary:
# Why

The arguments are filtered out because they are just constants in the compiled graph, but the logger still expects a non-None type

# What

When passing a filtered-out arg (None) to the debug logger, just log that it's a filtered-out argument instead of throwing a TypeError
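
A minimal sketch of the guard, assuming a hypothetical `log_arg` helper (the actual change lives in the inductor debug logger):

```
import logging

log = logging.getLogger(__name__)

def log_arg(name, arg):
    if arg is None:
        # The argument was constant-folded out of the compiled graph;
        # there is nothing to inspect, so record that instead of raising.
        log.debug("%s: <filtered out constant argument>", name)
        return
    log.debug("%s: shape=%s dtype=%s", name, tuple(arg.shape), arg.dtype)
```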

# Background

https://github.com/pytorch/pytorch/pull/131594

Test Plan: - execute repro from https://github.com/pytorch/pytorch/issues/135584#issue-2516944089 with and without the edits

Differential Revision: D63652564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137032
Approved by: https://github.com/angelayi
2024-10-02 01:43:22 +00:00
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00
c7638da558 Lowerings: remove restriction on TensorBox keyword arguments (#136055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136055
Approved by: https://github.com/eellison
2024-10-02 00:40:43 +00:00
63d6908da0 fix build error with gcc 12+ (#137092)
Fixes #127920

This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.

**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
  431 |             __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      |             ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally

By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.

**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
2024-10-02 00:37:15 +00:00
d725758210 [ts_converter] Fix prim::If buffer names (#136648)
Summary:
We previously incorrectly handled the following graph, specifically for the node `w.3` in `block0`:
```
 graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu),
       %y.1 : int):
   %2 : __torch__.___torch_mangle_1.M = prim::CreateObject()
   %3 : int = prim::Constant[value=20](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:34
   %4 : int = prim::Constant[value=10](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:34
   %5 : int = prim::Constant[value=1](), scope: M::
   %w.1 : int = prim::GetAttr[name="w"](%2), scope: M::
   %7 : int = aten::mul(%w.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:25
    = prim::SetAttr[name="w"](%2, %7), scope: M::
   %h.1 : int = prim::GetAttr[name="h"](%2), scope: M::
   %9 : int = aten::mul(%h.1, %3), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:25
    = prim::SetAttr[name="h"](%2, %9), scope: M::
   %10 : bool = aten::gt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:19
   %res.37 : Tensor = prim::If(%10), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:16
     block0():
       %w.3 : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.1 : Tensor = aten::add(%x.1, %w.3, %5), scope: M:: # <string>:5:9
       -> (%res.1)
     block1():
       %h.3 : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.3 : Tensor = aten::add(%x.1, %h.3, %5), scope: M:: # <string>:5:9
       -> (%res.3)
   %16 : bool = aten::lt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:19
   %res : Tensor = prim::If(%16), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:16
     block0():
       %w : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.15 : Tensor = aten::add(%res.37, %w, %5), scope: M:: # <string>:5:9
       -> (%res.15)
     block1():
       %h : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.21 : Tensor = aten::add(%res.37, %h, %5), scope: M:: # <string>:5:9
       -> (%res.21)
   return (%res)
```

Test Plan: CI

Differential Revision: D63399064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136648
Approved by: https://github.com/SherlockNoMad
2024-10-02 00:07:47 +00:00
8765804542 Continue on error for pytorch autolint (#137104)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137104
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-10-01 22:30:36 +00:00
f0fa460c60 [BE] Add script to keept the runner-determinator scripts in sync (#136794)
Whenever we update runner_determinator.py it needs to be copied over into _runner-determinator.yml.

This is a quick utility script to make that process less tedious
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136794
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-10-01 22:26:28 +00:00
4f93de8951 Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)
PyList_GetItem is audited, but other APIs are not yet (they will be done in a follow-up PR to keep this one small enough).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136899
Approved by: https://github.com/colesbury, https://github.com/atalman
2024-10-01 22:05:35 +00:00
6baee60e3c upload test stats: remove nan/inf when uploading (#136877)
`json.dumps(float("inf"))` returns `Infinity`, which is technically invalid JSON.

This is fine if you `json.load` it back, but ClickHouse cannot handle it.

Solution here: cast inf and nan to strings (which ClickHouse is able to cast back to float).
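
A small sketch of the idea, using a hypothetical `sanitize` helper:

```
import json
import math

def sanitize(value):
    # Cast non-finite floats to strings so the emitted JSON is strictly valid;
    # ClickHouse can cast the strings back to float on ingestion.
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)  # "inf", "-inf", or "nan"
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value

print(json.dumps(float("inf")))            # Infinity -- not valid JSON
print(json.dumps(sanitize(float("inf"))))  # "inf"
```
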
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877
Approved by: https://github.com/huydhn
2024-10-01 21:47:46 +00:00
0788d016d6 Update incompatible cudagraph ops skip message (#137015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137015
Approved by: https://github.com/BoyuanFeng
2024-10-01 21:23:36 +00:00
34c18887ad [FlexAttention] Remove restriction on QK headdim > V headdim (#135884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135884
Approved by: https://github.com/Chillee
2024-10-01 21:17:54 +00:00
99eb47fb6d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet
2024-10-01 20:43:10 +00:00
86b715c5f6 Revert "Skip kernel saving if already existed. (#136389)"
This reverts commit 2521cd387482a70d30e4ea922fa4fe3b488c9f6d.

Reverted https://github.com/pytorch/pytorch/pull/136389 on behalf of https://github.com/muchulee8 due to Issue #136940  ([comment](https://github.com/pytorch/pytorch/pull/136389#issuecomment-2386950623))
2024-10-01 20:04:43 +00:00
b53ab8b86a Revert "[dtensor][experimental] expose DTensor Context Parallel API (#137038)"
This reverts commit e23e766cc089b568aa4c0ebf0747ff9b504b8915.

Reverted https://github.com/pytorch/pytorch/pull/137038 on behalf of https://github.com/huydhn due to Sorry for reverting your changes but the doc build failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/137038#issuecomment-2386902253))
2024-10-01 19:49:28 +00:00
a00f0d5db8 [PT2][Inductor] Add runtime numeric check for the post grad pass (#136724)
Summary: Similar to D51838043, we further add a post-grad runtime numeric check, since some graph passes are performed at the aten level.

Differential Revision: D63438718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136724
Approved by: https://github.com/Yuzhen11
2024-10-01 18:56:01 +00:00
d61e45283e Properly interpolate sloc here (#137088)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137088
Approved by: https://github.com/Skylion007
2024-10-01 18:33:03 +00:00
c2dee8ea9c enable lazy init for MTIA (#136902)
Summary: As title.

Test Plan: OSS and Internal CIs

Reviewed By: nautsimon, hanzlfs

Differential Revision: D63434511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136902
Approved by: https://github.com/nautsimon
2024-10-01 18:30:56 +00:00
1f3a793790 Fix PyTorch builds on MacOS-13 (#137095)
By including SonomaOps header

Fixes https://github.com/pytorch/pytorch/issues/137094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137095
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-10-01 17:56:35 +00:00
e23e766cc0 [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-01 17:41:28 +00:00
73b07df042 Preserve custom ops via run_decomps (#136882)
This is a re-apply of https://github.com/pytorch/pytorch/pull/136773.

Note that this doesn't completely remove the _preserve_ops list from export, mainly because we want to keep this change small to address failing executorch tests. All the complications included in this PR are deleted in the next PR.

Differential Revision: [D63553086](https://our.internmc.facebook.com/intern/diff/D63553086/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136882
Approved by: https://github.com/bdhirsh
2024-10-01 17:38:00 +00:00
b1b6816e05 [testing] reenable kernel_benchmark.py tests (#136876)
Summary:
# Why

We want this to run internally

# What

- fix python path issue on the test
- reenable the test

# Background

(copied from similar issue resolved earlier)

It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark

Differential Revision: D63498897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876
Approved by: https://github.com/henrylhtsang
2024-10-01 17:16:21 +00:00
3d0cb81594 [MPS] Enable bfloat16 testing (#136987)
By further reducing the precision expectations for imprecise FP16 ops, introducing a new BF16_LOW_PRECISION_OPS category, and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the nature of the low-precision results is that MPSGraph, unlike the rest of PyTorch, does not accumulate over fp32 for reduction operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
2024-10-01 17:10:07 +00:00
cc2a66c55e [export] hook up mark_dynamic to export Dims (#137029)
Adds Dim.DYNAMIC, which calls torch._dynamo.mark_dynamic() in the backend. It is similar to Dim.AUTO in that it does automatic inference for ranges & relations, but it errors out on specializations.
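
A hedged usage sketch, assuming the dict form of the `dynamic_shapes` argument to `torch.export.export`:

```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Dim 0 is inferred as dynamic (its range and relations are discovered
# automatically); export errors out if it ends up specialized.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: Dim.DYNAMIC}})
print(ep)
```
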
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137029
Approved by: https://github.com/avikchaudhuri
2024-10-01 17:05:09 +00:00
ef6fd3d780 Fix adaptive_max_pool2d fallback (#136367)
Fixes #136332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136367
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-01 16:20:34 +00:00
8f4f7bed5d [MPS] Fix bfloat to complex casts (#137070)
For Metal cast ops to compile, one needs to explicitly cast to/from `bfloat`, unlike for other dtypes.

Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137070
Approved by: https://github.com/Skylion007
2024-10-01 15:47:29 +00:00
696d01aef3 Revert "inductor: use previous guards to know if a size is 1 for broadcasting (#136670)"
This reverts commit dfdda2f6a603ae9245f38a3e8f6365c3cb6d49ac.

Reverted https://github.com/pytorch/pytorch/pull/136670 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
951107e8c2 Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)"
This reverts commit b17cd264d38ca3381391c449bdaf9f03381caf35.

Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
923410193b Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760)"
This reverts commit c010c6099bf304bbb681af534b9f3996c33ce582.

Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
8f5c2b5f17 type _dynamo/test_case.py (#136957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136957
Approved by: https://github.com/Skylion007
2024-10-01 14:36:22 +00:00
d4cc2aaf1e type _dynamo/logging.py (#136956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136956
Approved by: https://github.com/Skylion007
2024-10-01 14:35:54 +00:00
7303716005 Revert "Simplify find_localzeros (#133325)"
This reverts commit 99f90c379ed214ab30882a87bdb3924ed6d6c899.

Reverted https://github.com/pytorch/pytorch/pull/133325 on behalf of https://github.com/ezyang due to https://fb.workplace.com/groups/gpuinference/permalink/2921405651341417/ ([comment](https://github.com/pytorch/pytorch/pull/133325#issuecomment-2385832600))
2024-10-01 13:25:03 +00:00
6bd9d37266 Remove allow-untyped-defs from torch.fx.experimental.symbolic_shapes (#137019)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137019
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935, #136972
2024-10-01 13:22:10 +00:00
cc8f1cddd4 Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935
2024-10-01 13:22:10 +00:00
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
083921852b set FlexAttention devices properly during tracing (#137049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137049
Approved by: https://github.com/zou3519, https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #136826, #137043
2024-10-01 09:08:08 +00:00
34cef1eaa7 Allow automatic dynamic shapes for closures and set current node properly in flexattention subgraph lowering (#137043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137043
Approved by: https://github.com/drisspg
ghstack dependencies: #136826
2024-10-01 09:08:08 +00:00
37dd924c2d Fix test/test_linalg.py for NumPy 2 (#136800)
Related to  #107302.

When built and tested with NumPy 2 the following unit tests failed.

```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```

This PR fixes them. The test is now compatible with both NumPy 1 & 2.

Some more details:

1. `np.linalg.solve` has changed its behavior, so I added an adapter function in the unit test to keep the behavior the same whether it is NumPy 1 or NumPy 2.
2. The cause of the failure is that when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2.
3. NumPy 2 no longer allows `np.array(obj, copy=False)` when a copy is needed; `np.asarray(obj)` is recommended instead (see the short sketch below).
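
A tiny sketch of point 3, using a CPU tensor:

```
import numpy as np
import torch

t = torch.arange(3, dtype=torch.float32)

# NumPy 2 treats copy=False strictly and raises ValueError when a copy
# cannot be avoided; np.asarray only copies when necessary and behaves
# the same under NumPy 1 and NumPy 2.
a = np.asarray(t)
print(a, np.__version__)
```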

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
2024-10-01 07:53:24 +00:00
df5bbc09d1 Make device-specific event inherits from torch.Event (#134845)
# Motivation
This PR intends to make the device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Events of different devices, like torch.cuda.Event and torch.xpu.Event.
The next PR will remove the previous, now-unneeded base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
2024-10-01 06:28:41 +00:00
cyy
47a78daf91 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap the thread-unsafe getenv and set_env functions inside an RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/eqy
2024-10-01 06:24:30 +00:00
be169f743b [Dynamo] Mark config.dead_code_elimination as deprecated (#136933)
part of #136862

For reviewers, all call sites are here: https://github.com/search?q=repo%3Apytorch%2Fpytorch+dead_code_elimination+language%3APython&type=code&l=Python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136933
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-10-01 03:51:59 +00:00
6e10f7d8c1 [compiled autograd] undo view_to_reshape inductor fx pass in node name matching (#136741)
Inductor mutates the AOT backward graph. A solution could be to copy the graph, but since we don't know whether compiled autograd is applied, it would be expensive to always clone it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741
Approved by: https://github.com/jansel
ghstack dependencies: #135663
2024-10-01 03:22:49 +00:00
40157db5a7 [compiled autograd] log placeholder origin in verbose (#135663)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135663
Approved by: https://github.com/jansel
2024-10-01 03:22:49 +00:00
6966811da6 [test] skip not omit big gpu tests for cuda_cpp_wrapper (#137055)
Summary: The problem is that when the GPU is not big enough, we omit the test cases in the test class. We expect the test to be skipped, but due to fbcode CI it can throw an error instead. This causes the test to be flaky.

Test Plan: ci

Differential Revision: D62037908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055
Approved by: https://github.com/masnesral
2024-10-01 03:03:27 +00:00
cyy
17455695d6 [Distributed] [14/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136848)
Follows  #136713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136848
Approved by: https://github.com/H-Huang
2024-10-01 02:01:13 +00:00
951af3d3d8 Format torch.fx.experimental.validator (#136935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934
2024-10-01 01:47:17 +00:00
33c2d3232f Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934
Approved by: https://github.com/Skylion007
2024-10-01 01:47:16 +00:00
d9c400bd9f Added some tests to prevent regressions in partitioning and flexattention (#136826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136826
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-10-01 01:08:44 +00:00
3f457ee1f6 Fix AOT Graph capture not propagating non_blocking copy parameter to … (#136513)
…inductor codegen.

Fixes #136260

**Note**: this is my first code contribution to torch so please let me know if there's anything I need to fix/some other convention I should follow.

Regarding the bug, re-running the issue's reproduction code:
```
import torch

def fn(x):
    return x.to(device="cuda", non_blocking=True)

inp = torch.randn(3, 4)

torch.compile(fn)(inp)
```

We now have the non_blocking being passed on to codegen properly:

```
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4]"):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         to: "f32[3, 4]" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  ===== __compiled_fn_1 =====
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4][4, 1]cpu"):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         to: "f32[3, 4][4, 1]cuda:0" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.404000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:114] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[], is_train=False, traced_tangent_metas=None, num_symints_saved_for_bw=None, grad_enabled_mutation=None, deterministic=None, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None, num_backward_tokens=0),subclass_metadata=None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] TRACED GRAPH
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  ===== Forward graph 0 =====
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]     def forward(self, arg0_1: "f32[3, 4][4, 1]cpu"):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         device_put: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.device_put.default(arg0_1, device(type='cuda', index=0), True);  arg0_1 = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         convert_element_type: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.convert_element_type.default(device_put, torch.float32);  device_put = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         return (convert_element_type,)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1134] [0/0] [__output_code] Output code written to: /tmp/torchinductor_niklasz/ha/chaai264g6ribfw3q2qhl6ayjtaqaavku5wivxtzw4nabgd6htsv.py
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] Output code:
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] # AOT ID: ['0_inference']
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import torch
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import math
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import random
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import os
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import tempfile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from math import inf, nan
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch import device, empty_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] aten = torch.ops.aten
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] _quantized = torch.ops._quantized
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile = AsyncCompile()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile.wait(globals())
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del async_compile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def call(args):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1, = args
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     args.clear()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     assert_size_stride(arg0_1, (3, 4), (4, 1))
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     with torch.cuda._DeviceGuard(0):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         torch.cuda.set_device(0)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0 = empty_strided_cuda((3, 4), (4, 1), torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0.copy_(arg0_1, True)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         del arg0_1
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return (buf0, )
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.utils import print_performance
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1 = rand_strided((3, 4), (4, 1), device='cpu', dtype=torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     fn = lambda: call([arg0_1])
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] if __name__ == "__main__":
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
```
See the line `buf0.copy_(arg0_1, True)` above. Specific log setting used: `export TORCH_LOGS="graph_code,aot_graphs,output_code"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136513
Approved by: https://github.com/eellison
2024-10-01 00:32:47 +00:00
19a4d68224 Add missing mappings to support torch.uint16 in quantization and export (#136547)
Test Plan: CI.

Differential Revision: D63142844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136547
Approved by: https://github.com/angelayi
2024-10-01 00:01:01 +00:00
18e707645c Substitute unbacked symints in expressions (#137020)
Differential Revision: [D63647095](https://our.internmc.facebook.com/intern/diff/D63647095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137020
Approved by: https://github.com/ezyang
2024-09-30 23:07:22 +00:00
af64c44b56 Revert "Don't uselessly recompute axiom dict every static eval call (#135429)"
This reverts commit 1d6e0412f5205b1cd709e034526d7f21d6f2d56f.

Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/ezyang due to try again ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2384288879))
2024-09-30 22:29:13 +00:00
c07ebaf430 [triton] Try to use triton.language.extra.libdevice when possible (#136997)
Summary:
X-link: https://github.com/facebookresearch/generative-recommenders/pull/90

In view of https://github.com/triton-lang/triton/pull/3825 we should try to use `triton.language.extra.libdevice` instead of `triton.language.extra.cuda.libdevice`.
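
For illustration, a minimal sketch (not the PR's actual code) of preferring the backend-agnostic module with a fallback to the older CUDA-specific path:

```
# Sketch only: prefer the backend-agnostic libdevice module; fall back to the
# CUDA-specific path on Triton versions that do not ship the new layout.
try:
    from triton.language.extra import libdevice
except ImportError:
    from triton.language.extra.cuda import libdevice  # older Triton layout
```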

Test Plan: CI

Reviewed By: bertmaher, karthik-man

Differential Revision: D63583965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136997
Approved by: https://github.com/bertmaher
2024-09-30 21:52:44 +00:00
b3972ee19a [triton] Unify build_paths.py for NV & AMD, fix typing (#136952)
Summary: Some build improvements.

Test Plan: CI

Differential Revision: D63583959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136952
Approved by: https://github.com/bertmaher
2024-09-30 21:51:45 +00:00
66a269afe8 Revert "Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)"
This reverts commit cf1a7eab250ea37ca8fda0327e8e38693c3c5c1a.

Reverted https://github.com/pytorch/pytorch/pull/136934 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
c94536ae74 Revert "Format torch.fx.experimental.validator (#136935)"
This reverts commit 377e4bc877a3ac4cd6d073aa513a309159ade991.

Reverted https://github.com/pytorch/pytorch/pull/136935 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
8982906502 Revert "Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)"
This reverts commit 3ff2d93d9f72fd26503ef0cf5c5956edad4c52e6.

Reverted https://github.com/pytorch/pytorch/pull/136972 on behalf of https://github.com/ezyang due to need to back out for merge conflict ([comment](https://github.com/pytorch/pytorch/pull/136972#issuecomment-2384182244))
2024-09-30 21:35:08 +00:00
b825848d85 Fix aarch64 debug build with GCC (#136990)
Fixes #136440

**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.

**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.

The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0

This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.

**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-30 21:11:50 +00:00
866a64ce9a [FSDP2] Added check for contiguous parameters (#137000)
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.

We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.
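
A minimal sketch (illustrative names, not the FSDP2 source) of the kind of construction-time check described above:

```
import torch

def _check_param_contiguous(param: torch.nn.Parameter) -> None:
    # Illustrative only: reject non-contiguous parameters up front, since the
    # sharding logic currently assumes contiguous strides.
    if not param.is_contiguous():
        raise NotImplementedError(
            "fully_shard assumes contiguous parameters; got a parameter with "
            f"shape {tuple(param.shape)} and stride {param.stride()}"
        )
```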

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
2024-09-30 21:10:47 +00:00
66e3186a48 Revert "Init threadpool with user defined num_threads before default (#136793)"
This reverts commit adbcaee950afa6697c04962096344bf0962a542f.

Reverted https://github.com/pytorch/pytorch/pull/136793 on behalf of https://github.com/janeyx99 due to Caused internal Oculus crash, and internal force landed a diff without exporting to GH =.= ([comment](https://github.com/pytorch/pytorch/pull/136793#issuecomment-2384148132))
2024-09-30 21:10:12 +00:00
bc6adb9596 [EZ][BE] Delete ISSUE_TEMPALTE.md (#137040)
As it has been superseded by [ISSUES_TEMPLATE](https://github.com/pytorch/pytorch/tree/main/.github/ISSUE_TEMPLATE) folder, per https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository#creating-issue-forms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137040
Approved by: https://github.com/ZainRizvi
2024-09-30 21:04:32 +00:00
d46ebcb31b Enable experiments for protected branches (#136785)
This is to allow the protected branches (like `main` and `nightly`) also run on the LF fleet, now that we've migrated over
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136785
Approved by: https://github.com/jeanschmidt
2024-09-30 20:58:28 +00:00
2ef1454189 Revert "Add int1 to int7 dtypes (#136301)"
This reverts commit bfa16a161d5089a9ba008f5e665f29b58dc16526.

Reverted https://github.com/pytorch/pytorch/pull/136301 on behalf of https://github.com/PaliC due to causing internal failures ([comment](https://github.com/pytorch/pytorch/pull/136301#issuecomment-2384119600))
2024-09-30 20:50:49 +00:00
0ccd39a64b Fix prefix store seg fault (#136872)
fixes https://github.com/pytorch/pytorch/issues/136723

Do not allow `None` to be passed into `PrefixStore`
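
For illustration (not from the PR), the valid pattern versus the one that previously segfaulted; after this fix the latter is rejected with a Python error instead:

```
import torch.distributed as dist

# Valid: wrap a real store with a key prefix.
base = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
good = dist.PrefixStore("trainer/", base)

# Previously crashed the process; now raises an error instead.
# bad = dist.PrefixStore("trainer/", None)
```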

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136872
Approved by: https://github.com/kwen2501
2024-09-30 20:43:08 +00:00
7b96f3c75d Fix six broken tests in test_ops.py (#136653)
## The problem.

[A commit from three weeks ago](82d00acfee) appears to have broken five tests but was not caught by CI.

[A later commit](https://github.com/pytorch/pytorch/commit/e05ea2b1797) which added a decomposition of `transpose_copy` added another broken test, also seemingly not detected, making six total (listed below).

They came to my attention when I updated some pending decomposition pull requests which passed CI, and started getting failures like [this](https://hud.pytorch.org/pr/134319) for a test unrelated to any of these pull requests, `TestCommonCPU.test_out__refs_transpose_copy_cpu_float32`

Running `python test/test_ops.py -k _copy` on `viable/strict` found failures for six `_refs` ops: `copysign`, `expand_copy`, `index_copy`, `t_copy`, `transpose_copy`, `view_copy`

## The solution

The original commit did actually cause breakage by slightly changing user-visible behavior (in a special case involving scalar tensors being copied between different devices).

This pull request fixes that breakage in a reasonable way, but I don't understand why this error didn't appear in CI until I made later changes in the same area.

## To reproduce

To reproduce the six cases in your own client:

```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=5 python test/test_ops.py TestCommonCPU.test_out__refs_view_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=2 python test/test_ops.py TestCommonCPU.test_out__refs_t_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_index_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/test_ops.py TestCommonCPU.test_out__refs_expand_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_copysign_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=4 python test/test_ops.py TestCommonCPU.test_out__refs_transpose_copy_cpu_float32
```

@amjames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136653
Approved by: https://github.com/zou3519
2024-09-30 20:32:55 +00:00
71aac59e93 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.
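
A hypothetical usage sketch; the `cpu_backend` config knob and the `"triton"` value are assumptions, not confirmed by this message:

```
import torch
import torch._inductor.config as inductor_config

# Assumed knob: route Inductor CPU codegen through the Triton CPU backend
# instead of the default C++ backend.
inductor_config.cpu_backend = "triton"

@torch.compile
def f(x):
    return torch.relu(x) + 1

print(f(torch.randn(8)))
```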

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-30 20:24:52 +00:00
dfe1d45332 Enable tracing through auto_functionalized_v2 in compiled autograd (#136806)
auto_functionalize_v2 is the same as auto_functionalize, except that its args may include additional constants or symints, and the tensors are carried in one of the input list args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136806
Approved by: https://github.com/zou3519
2024-09-30 19:16:13 +00:00
1ef5d4cdde Revert "Allow parallelize_module to get device_mesh from ambient context (#134247)"
This reverts commit 80e7478cc84919a48770ad85d6118294776fca73.

Reverted https://github.com/pytorch/pytorch/pull/134247 on behalf of https://github.com/malfet due to Broke lint, which one can clearly see in PR CI https://github.com/pytorch/pytorch/actions/runs/11112138513/job/30873604386  ([comment](https://github.com/pytorch/pytorch/pull/134247#issuecomment-2383952449))
2024-09-30 19:07:01 +00:00
4af03e54b7 [MPS][BE] Use None as alias for all types (#137004)
Tests like `new_*` and `empty_*` fail with the current implementation; see
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137004
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986, #137003
2024-09-30 19:06:13 +00:00
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
80e7478cc8 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within the model hierarchy also saves us from using *FQNs*, as needed in the out-of-model annotation case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-09-30 18:42:06 +00:00
40f80a70fa Fix lint (#137023)
By migrating some of the workflows to Python-3.9 as 3.8 has been deprecated by https://github.com/pytorch/pytorch/pull/132138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137023
Approved by: https://github.com/ZainRizvi, https://github.com/janeyx99, https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-09-30 18:29:02 +00:00
d33638588e [aoti][inplace] Support skipping model buffers (#136770)
Summary: Some AOTI tensor constants may be model buffers that never need to be updated.

Differential Revision: D62777502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136770
Approved by: https://github.com/muchulee8
2024-09-30 18:28:42 +00:00
3ff2d93d9f Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917, #136934, #136935
2024-09-30 18:04:36 +00:00
475a8a4e0c Update ci-sev.md to make merge blocking not the default 2024-09-30 10:53:31 -07:00
76a57568de Update windows maintainers (#136901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136901
Approved by: https://github.com/albanD
2024-09-30 16:12:49 +00:00
ae3d5ed589 [MPS] Enable nan_to_num for bfloat16 (#136986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136986
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985
2024-09-30 16:09:44 +00:00
d8d3aeae59 [MPS] Enable Renorm for bfloat16 (#136985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136985
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984
2024-09-30 16:09:44 +00:00
538fcd7579 [MPS] Enable torch.linalg.cross for bfloat16 (#136984)
By adding explicit instantiation. Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136984
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983
2024-09-30 16:09:40 +00:00
c13c7e11c5 Revert "[Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit 6931c1644afdba53e63ce5671455e4e1b7265dd9.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its test_cpu_repro test is failing in trunk 6931c1644a ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2383563919))
2024-09-30 15:47:04 +00:00
33d3d6e42a [MPS] Enable bucketization for bfloat16 (#136983)
By simply adding explicit instantiation
Tested in https://github.com/pytorch/pytorch/pull/136987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136983
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982
2024-09-30 14:45:57 +00:00
3ed2969889 [MPS] Extend fmin/fmax/copysign and nextafter to bfloat16 (#136982)
Just adds instantiations of the kernels and, where needed, explicit casts.
Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136982
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981
2024-09-30 14:45:57 +00:00
797092b263 [MPS] Fix Gamma for bfloat16 dtypes (#136981)
Before this change, the test failed with unable-to-compile errors, as `bfloat16` requires an explicit cast.
Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136981
Approved by: https://github.com/Skylion007
2024-09-30 14:45:52 +00:00
a15f3f51bc [AOTI] Update sam_fast from timeout to fail_to_run (#136996)
Summary: sam_fast changes from timeout to fail_to_run after https://github.com/pytorch/pytorch/pull/136591, which "regressed" in a good way. Update the expected result file and continue investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136996
Approved by: https://github.com/ezyang
2024-09-30 14:05:49 +00:00
c010c6099b compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
2024-09-30 13:25:02 +00:00
b17cd264d3 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
2024-09-30 13:25:02 +00:00
dfdda2f6a6 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
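
A rough sketch of that check, assuming `shape_env.guards` holds guard objects carrying a sympy `expr` (names are illustrative, not the exact Inductor code):

```
import sympy

def statically_known_one(expr: sympy.Expr, shape_env) -> bool:
    # Illustrative: a symbolic size is known to be 1 if the shape env already
    # recorded a guard of the form Eq(expr, 1) during tracing.
    for guard in shape_env.guards:
        g = guard.expr
        if isinstance(g, sympy.Eq) and g.lhs == expr and g.rhs == 1:
            return True
    return False
```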

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-09-30 13:24:57 +00:00
cyy
05b15dba7e [1/N] Fix clang-tidy warnings in torch/csrc/api/ (#134545)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134545
Approved by: https://github.com/ezyang
2024-09-30 09:06:30 +00:00
d6d9183456 [Inductor] Switch cpp_wrapper tests to ABI-compatible (#136904)
Summary: Switch test_cpu_cpp_wrapper and test_cuda_cpp_wrapper to test the ABI-compatible mode only. Fixed a missing Py_NewRef issue for python 3.9.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136904
Approved by: https://github.com/Yoggie9477, https://github.com/chenyang78
2024-09-30 05:44:52 +00:00
ad8fae2aa9 [AOTI] Support test_open_device_registration in ABI-compatible (#136906)
Summary: Add a device type C shim interface to support test_open_device_registration in the ABI-compatible mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136906
Approved by: https://github.com/chenyang78
2024-09-30 05:08:16 +00:00
8dddd45679 [BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)
Updates cudnn frontend submodule to v1.7.0 which has some bugfixes and a couple new features.

https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.7.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136920
Approved by: https://github.com/ezyang
2024-09-30 02:50:16 +00:00
80393c90b3 docs: clarify alias usage for x parameter in vector_norm function (#136921)
- Added a note in the documentation specifying that the `input` parameter can be used as an alias for `x`.

Fixes #136560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136921
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-30 02:50:06 +00:00
377e4bc877 Format torch.fx.experimental.validator (#136935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917, #136934
2024-09-30 02:20:40 +00:00
cf1a7eab25 Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917
2024-09-30 02:20:40 +00:00
0a26851601 [Inductor] Handle device property warp_size is None but used on XPU. (#136834)
Fix #136820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136834
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-09-30 02:08:45 +00:00
6931c1644a [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick the ISA based on the environment variable ATEN_CPU_CAPABILITY to control the CPU vec ISA level for Inductor, as in eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-30 00:53:18 +00:00
9dbc6bacff Propagate detailed location information of shape guards to guards/recompiles output (#136917)
To see the payoff, look at test/dynamo/test_logging.py

The general idea is to refactor produce_guards into produce_guards_verbose which also returns verbose code parts, which have our annotations.

The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917
Approved by: https://github.com/anijain2305
2024-09-30 00:43:12 +00:00
e205193e1c Enable failing diffs on regression (#136551)
1. example of failing diff
https://github.com/pytorch/pytorch/pull/136740

2. test this by running
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail, but it is logged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
2024-09-29 22:31:26 +00:00
d33a5e2a57 [ROCm] fastSpecializedAtomicAdd for MI300 (#135770)
MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135770
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-09-29 21:52:09 +00:00
c9653bf2ca [Elastic][fix] Use the right env variable TORCH_ELASTIC_WORKER_IDENTICAL for unit test (#136916)
As the title says, this is an easy fix for the unit test.

Differential Revision: [D63577774](https://our.internmc.facebook.com/intern/diff/D63577774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136916
Approved by: https://github.com/wz337
ghstack dependencies: #136865
2024-09-29 03:55:10 +00:00
cyy
156ca01e51 Enable clang-tidy on torch/csrc/lazy (#136851)
Enable clang-tidy on  torch/csrc/lazy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136851
Approved by: https://github.com/Skylion007
2024-09-28 21:16:40 +00:00
d3c2123ea6 [BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686)
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side. Still, it's better to set the CMake variable properly.
* Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR before I realized the basic MMA kernels were accidentally disabled, since we didn't go through the submodule's CMake/Bazel files.
* Adds a bit to compile time and code size, but well worth it considering it speeds up our internal flash attention significantly on H100s at the cost of some minor additional compile time.
* These kernels and settings will be needed for Flash Attention 3 whenever we add that too.

Fixes #133695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
2024-09-28 21:11:15 +00:00
1d6e0412f5 Don't uselessly recompute axiom dict every static eval call (#135429)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429
Approved by: https://github.com/isuruf
2024-09-28 20:59:59 +00:00
6ecb73bafd Limit the option value of TORCH_SHOW_DISPATCH_TRACE (#136510)
It's more convenient for users to enable or disable the dispatch trace by
setting TORCH_SHOW_DISPATCH_TRACE to 1 or 0, especially when debugging in an IDE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136510
Approved by: https://github.com/shink, https://github.com/ezyang
2024-09-28 20:59:05 +00:00
28224329ad [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.
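
Usage sketch (assuming the public `flex_attention` API; shapes are arbitrary), showing a rectangular block size passed in `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)` order:

```
import torch
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# (Q_BLOCK_SIZE, KV_BLOCK_SIZE) = (64, 128); before this fix the order/shape
# could come out wrong for non-default block sizes.
block_mask = create_block_mask(causal, B=1, H=1, Q_LEN=2048, KV_LEN=2048,
                               device="cuda", BLOCK_SIZE=(64, 128))
```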

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-28 19:56:53 +00:00
cf53ab95dc [halide-backend] Fix ops.fma codegen (#136810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136810
Approved by: https://github.com/eellison
ghstack dependencies: #136808, #136809
2024-09-28 19:26:04 +00:00
8da9c4178c [inductor] Benchmark Halide in operatorbench.py (#136809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809
Approved by: https://github.com/eellison
ghstack dependencies: #136808
2024-09-28 19:26:04 +00:00
a54b69279b Bump triton pin to latest 3.1.x release branch (#136874)
Moves pin to latest in release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136874
Approved by: https://github.com/bertmaher, https://github.com/drisspg, https://github.com/kit1980, https://github.com/malfet
2024-09-28 13:47:07 +00:00
b35f70da05 [ez] fixup the export of D62879819 (#136900)
a line from D62879819 (#136190) went missing somehow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136900
Approved by: https://github.com/atalman
2024-09-28 13:46:17 +00:00
c4ae45104f [PyTorch Pinned Allocator] Move background thread init from constructor to allocate function (#136879)
Differential Revision: D63553138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136879
Approved by: https://github.com/zyan0
2024-09-28 07:24:44 +00:00
375921b755 [inductor] Improve operatorbench.py (#136808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808
Approved by: https://github.com/eellison
2024-09-28 06:22:02 +00:00
96104db132 [easy] fix typo in debug logs for fx graph cache (#136889)
Summary: Accidentally messed up the debug logging here, fixing typo (scuba + tlparse logging is unaffected)

Test Plan: existing tests

Differential Revision: D63555766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136889
Approved by: https://github.com/oulgen
2024-09-28 03:56:09 +00:00
9e4f24f8e5 Fix PT2 Source Code Annotations (#136460)
Summary: In D60803317, we added CompileContext (trace_id) information to Kineto traces using caching when a CompileContext exits. As pointed out by some users, this gives inaccurate IDs because we are not getting the context that is actually being looked up within the eval_frame. For this reason, we decided to revert that change and go with an approach that involves getting the trace_id associated with a given CacheEntry. To do this, we add a trace_id to the GuardedCode so that it can be passed onto a CacheEntry. Then, we change the lookup function to return said trace_id alongside the code so that we can pass both into our eval function. Once we get to a Torch-Compiled Region, we can just append the context information to the name of the annotation, thus bypassing any need for kwargs.

Test Plan: Added more comprehensive unit test. Saw that all the trace_ids appeared within the graph.

Differential Revision: D63138786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136460
Approved by: https://github.com/ezyang
2024-09-28 03:54:43 +00:00
8df97d78c2 [QAT] Make Fused modules torchscriptable (#136285)
Summary:
Same as title.

Inspired by: https://pytorch.org/tutorials/recipes/script_optimized.html#fix-common-errors-when-using-the-script-method

Test Plan: CI

Differential Revision: D62980019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136285
Approved by: https://github.com/jerryzh168
2024-09-28 03:46:19 +00:00
93dcb92bae [DeviceMesh][EZ] Add group description to new group (#136558)
Add group description to new_group in device_mesh to help with debuggability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136558
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-09-28 03:09:41 +00:00
99f90c379e Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-09-28 02:38:31 +00:00
bfa16a161d Add int1 to int7 dtypes (#136301)
Summary:
Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases
for weight quantization (https://www.internalfb.com/diff/D62464487)

Test Plan:
python test/test_quantization.py -k test_uint4_int4_dtype

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136301
Approved by: https://github.com/ezyang
2024-09-28 02:08:33 +00:00
e4571e7025 Add abi flags to cpp_extension cache folder (#136890)
This is to avoid cache confusion between normal vs pydebug vs nogil builds in cpp extensions which can lead to catastrophic ABI issues.
It is rare today for people to run both normal and pydebug on the same machine, but we expect quite a few people to run normal and nogil on the same machine going forward.

This is tested locally by running each version alternatively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136890
Approved by: https://github.com/colesbury
2024-09-28 00:49:56 +00:00
f42e88fea5 [reland][Elastic] Skip store barrier and store get in host assign (#136865)
As the title says, this relands https://github.com/pytorch/pytorch/pull/136579, as it broke some OSS CI.

Differential Revision: [D63542918](https://our.internmc.facebook.com/intern/diff/D63542918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136865
Approved by: https://github.com/atalman
2024-09-27 23:40:42 +00:00
ef3142d2a0 [user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.

However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).

This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)
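
For context, a minimal hypothetical kernel of the shape being discussed: `BLOCK` is a `tl.constexpr`, so a traced size fed into it must be specialized to its concrete value at compile time:

```
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # BLOCK must be a compile-time constant; passing a SymInt here forces
    # specialization rather than keeping the dimension symbolic.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2, mask=mask)
```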

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
2024-09-27 23:02:46 +00:00
9d67c31758 Cast device index to int before logging (#135405)
DeviceIndex is an int8_t, which is interpreted by cout as a char and then shows up as a control character in logs (e.g. ^A).

Explicitly casting to int to have the numbers printed out correctly.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135405
Approved by: https://github.com/wconstab
2024-09-27 23:01:12 +00:00
fe158cfb47 [aoti] Add warning to ask users to switch to new API (#135893)
Instead of the following:
```
so_path = torch._export.aot_compile(...)
torch._export.aot_load(so_path)
```

The recommended path is to:
```
ep = torch.export.export(...)
pt2_path = torch._inductor.aoti_compile_and_package(ep, ...)
torch._inductor.package.load_package(pt2_path)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135893
Approved by: https://github.com/desertfire
2024-09-27 22:38:11 +00:00
adbcaee950 Init threadpool with user defined num_threads before default (#136793)
Fixes #134714 (or attempts to, idk how to test yet)

For posterity, how one can test:
1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary
2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python
3. import torch
4. torch.set_num_threads(1), make sure this does not trigger any additional threads getting created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793
Approved by: https://github.com/albanD
2024-09-27 22:22:37 +00:00
bc21689136 [sparse][semi-structured] Add float8 dtype support to 24 sparsity (#136397)
Summary:

This PR adds `torch.float8_e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cuSPARSELt >= 0.6.2, via the `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397
Approved by: https://github.com/drisspg
2024-09-27 21:37:34 +00:00
a28b40fa74 Improve is_fbcode functionality (#136871)
Summary: Previously, is_fbcode just checked whether the checkout was git or not. This is extremely error-prone. Let's make it fool-proof.

Test Plan: unit tests

Reviewed By: masnesral

Differential Revision: D63545169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
2024-09-27 21:19:01 +00:00
283bda01aa [MPS] Error checking/bf16 support for torch.normal (#136863)
Before that attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
Resulted in hard error
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a nice type error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
2024-09-27 21:11:59 +00:00
f7ab0e9989 Revert "[Flex Attention] fix block size order (#136657)"
This reverts commit b42f1e3641314c8dc369255b850450acddf3477c.

Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/ZainRizvi due to Sorry, this seems to break ROCm builds. inductor/test_flex_attention.py::TestFlexAttention::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_float16_score_mod1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/11069782242/job/30759299713) [HUD commit link](b42f1e3641) ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2380031525))
2024-09-27 20:47:54 +00:00
6e70ec9aa5 [SymmetricMemory] expose the multicast_ptr (#136840)
This allows writing triton kernels using the `multimem` ptx instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136840
Approved by: https://github.com/Chillee
2024-09-27 20:41:33 +00:00
f21b471978 Revert "Fix numerical instability for norm (#129352)"
This reverts commit 66340e67515cd3592bda6bdd9bfe2ffa22fe7413.

Reverted https://github.com/pytorch/pytorch/pull/129352 on behalf of https://github.com/atalman due to Breaks Internal CI ([comment](https://github.com/pytorch/pytorch/pull/129352#issuecomment-2379989485))
2024-09-27 20:18:47 +00:00
d55eef5c59 [SymmetricMemory] improve multicast initialization/fallback logic (#136577)
Fixes https://github.com/pytorch/pytorch/issues/136494

Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fall back gracefully.

- In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device.
- Before initializing multicast for a block, ensure all ranks/devices have multicast support.
- This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block.
- Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-09-27 20:04:21 +00:00
e512eac410 Companion PR to https://github.com/pytorch/pytorch/pull/134022 (#136818)
Note:[ cusparselt 0.6.0](https://docs.nvidia.com/cuda/cusparselt/release_notes.html#cusparselt-v0-6-0)+ supports SM90 (Hopper). Thanks @xwang233 for catching this bug while testing upstream binaries!

Fixes the issues like:

  ```
  A_compressed = torch._cslt_compress(A)
RuntimeError: CUDA error: architecture mismatch when calling `cusparseLtInit(&handle)`
```

@kit1980 Could we get this cherry-picked to 2.5.0 please?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136818
Approved by: https://github.com/eqy, https://github.com/jcaip, https://github.com/malfet
2024-09-27 19:57:15 +00:00
e5a57932f0 [Pytorch][AO] Update choose_qparams_per_token op to output correct shape for scales and zp (#136807)
- Also makes the scales and zero-point dtypes consistent with the meta impl, as well as with how other quantized ops represent scales and zero points
- Makes sure quantize_per_token's output_dtype is respected

There are a few places where we still need to reconcile the scale and zero-point dtypes, but that will come later. These fixes are mainly being done to enable the quantized KV cache through the ET stack.

Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807
Approved by: https://github.com/jerryzh168
2024-09-27 18:46:17 +00:00
6075f566cc [export] simplify automatic dynamic shapes processing (#136591)
Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620.

This reverts back to `assume_static_by_default=True`, using dynamo decorators (e.g. `maybe_mark_dynamic`, `mark_static`) to handle Dim.AUTO & Dim.STATIC instead. This is easier to maintain, as it doesn't require reasoning about "inverting" the dynamic_shapes specs, and it also opens up usage of other decorators (`mark_dynamic`, `mark_unbacked`).

On the user side this change has no effect, but internally this means dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators following https://github.com/pytorch/pytorch/pull/135536), but transferring this information for _DimHints via decorators, for Dynamo/non-strict to create symbolic_contexts accordingly, e.g. 7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)

One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked.
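
A small usage sketch under these semantics (the module and shapes are made up): only the `dynamic_shapes` spec decides what ends up dynamic:

```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sum(dim=1)

ep = export(
    M(),
    (torch.randn(4, 8),),
    dynamic_shapes={"x": {0: Dim.AUTO, 1: Dim.STATIC}},
)
```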

Differential Revision: D63358628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591
Approved by: https://github.com/avikchaudhuri
2024-09-27 18:28:51 +00:00
a8b5adcdd5 add types to _dynamo/code_context.py (#136665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136665
Approved by: https://github.com/williamwen42
2024-09-27 18:27:42 +00:00
287dc36395 Revert "[user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)"
This reverts commit 9f5b97a0065dfc4a7978a0fdf3fac2df8aef9519.

Reverted https://github.com/pytorch/pytorch/pull/136686 on behalf of https://github.com/davidberard98 due to breaks lint on main. Please rebase to see and fix the error ([comment](https://github.com/pytorch/pytorch/pull/136686#issuecomment-2379830921))
2024-09-27 18:25:49 +00:00
2208ff64ba Fix RMSNorm doc per #136597 (#136727)
Fixes #136597 (remove incorrect sqrt around `RMS(x)`)

<img width="857" alt="Screenshot 2024-09-26 at 11 46 32 AM" src="https://github.com/user-attachments/assets/21ea26ad-bd9f-4b9b-8b60-f52a1dc16da6">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136727
Approved by: https://github.com/albanD
2024-09-27 18:21:48 +00:00
2157e396a3 [dynamo] attempt run only mode when dynamo cache limit is hit (#136655)
Implement https://github.com/pytorch/pytorch/issues/135458.

Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively.
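
For illustration, a repro-style sketch of hitting the limit (the limit value and shapes are arbitrary); after this change, frames past the limit try existing compiled entries before the code is skipped:

```
import torch
import torch._dynamo as dynamo

dynamo.config.cache_size_limit = 1  # deliberately tiny for the demo

@torch.compile(dynamic=False)
def f(x):
    return x + 1

for n in (2, 3, 4, 5):
    f(torch.randn(n))  # each new static shape recompiles, quickly exceeding the limit
```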

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655
Approved by: https://github.com/jansel
2024-09-27 17:15:05 +00:00
36428f91e9 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit 31c0467594c7c41c8e8ff1828bf01fa31fc4454f.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))
2024-09-27 16:54:27 +00:00
17f396b0b4 Delete project.default_flavors_mode buckconfig (#136772)
Summary: Buck1 only buckconfig

Test Plan: CI

Reviewed By: JakobDegen

Differential Revision: D63430482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136772
Approved by: https://github.com/malfet
2024-09-27 16:24:50 +00:00
cyy
cbc182d2e0 Remove problematic constructor (#136708)
Since it calls a pure virtual function and it is not used elsewhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136708
Approved by: https://github.com/ezyang
2024-09-27 16:16:58 +00:00
dc8c0aaf4d [AOTAutogradCache] Log time taken_ns (#136529)
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.

This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events.

Test Plan: Run benchmark, see that the times are in the chromium events.

Reviewed By: aorenste

Differential Revision: D62590077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
2024-09-27 16:14:09 +00:00
9f5b97a006 [user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.

However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).

This fixes these scenarios: implement the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
2024-09-27 16:11:02 +00:00
ad51995468 Add a nightly hotpatch utils for python only PR (#136535)
I think this could help many teams, especially compile/export teams (/cc @ezyang), by letting end users/bug reporters quickly test a WIP PR when reporting a related bug.

This could quickly run in an official nightly Docker container or in a nightly venv/conda env.

Let me know what you think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136535
Approved by: https://github.com/ezyang
2024-09-27 15:58:48 +00:00
9d72f7481b [MPS] Fix AvgPool2d for float16 (#136822)
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
2024-09-27 15:32:18 +00:00
2b6f4e9e24 [BE][MPS] Delete MacOS12 low-precision ops (#136821)
`norm` and `masked.normalize` still have to stay in the list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136821
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755
2024-09-27 15:32:18 +00:00
45a8b5682e [inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136858)
This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
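
To illustrate the distinction in isolation (hypothetical kernel, not the generated code):

```
import triton
import triton.language as tl

@triton.jit
def demo_kernel(out_ptr, XBLOCK: tl.constexpr):
    xindex = tl.arange(0, XBLOCK)
    scalar_c = tl.full([], 1.0, tl.float64)   # 0-d scalar: broadcasts against any tile
    vector_c = tl.full([1], 1.0, tl.float64)  # rank-1 tensor of shape [1] (shown for contrast)
    val = scalar_c + xindex.to(tl.float64)    # no rank mismatch with a true scalar
    tl.store(out_ptr + xindex, val.to(tl.float32))
```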

Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858
Approved by: https://github.com/atalman
2024-09-27 15:14:12 +00:00
34d788ffb0 [aotd] Do not force contiguous() for channels_last (#135225)
Original Issue: https://github.com/pytorch/pytorch/issues/134644

We assume trace_tangents have the same memory_format as the inputs, outputs, and intermediates during the first tracing.

=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format

Runtime:
- Coerce tangents to tracing memory format from metadata

Subclasses logic:
 - Previously, the tangent-coercion logic did not handle the nested-subclasses case; this is fixed here.

For Subclasses we deduce memory format for subclass_tensor first, then for each element of subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]

If a subclass element (one of the __tensor_flatten__[0] tensors) is itself a subclass, then in its place we will have a nested list of the same structure.

The recursive traversal of the subclass tree is expensive, so we do memory-format deduction and coercion at the same time to keep it to a single traversal. With this approach there is no regression compared with the previous logic, which also does one traversal (see the `coerce_tangent_and_suggest_memory_format` method).

Other small change:
Remove a duplicated, unrelated comment.

Testing

```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```

Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```

No significant change in processing time.

(We do a single traversal of the subclass tree to collect memory_formats and coerce during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
2024-09-27 15:01:20 +00:00
de159f0c8d Revert "Deal with size oblivious before going into worker (#135137)"
This reverts commit 285fa03b5e1540a52b354664f609f8576c5b5431.

Reverted https://github.com/pytorch/pytorch/pull/135137 on behalf of https://github.com/ezyang due to this is the one that actually broke main ([comment](https://github.com/pytorch/pytorch/pull/135137#issuecomment-2379438566))
2024-09-27 14:41:27 +00:00
1be3d62237 [ONNX] Remove unused functions (#136609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136609
Approved by: https://github.com/Skylion007
2024-09-27 14:34:05 +00:00
e5228a7771 Revert "Don't uselessly recompute axiom dict every static eval call (#135429)"
This reverts commit 507c69e20f645fdb0fbf43b05be0c5117971464e.

Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/malfet due to It(or it's parent) broke trunk CI, see 507c69e20f ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2379422971))
2024-09-27 14:33:25 +00:00
a55aa71b04 Limit number of cores to 16 when benchmarking Inductor on ARM (#136424)
Sets OMP_NUM_THREADS to 16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136424
Approved by: https://github.com/malfet
2024-09-27 14:22:49 +00:00
e9d2765ec8 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d1bb8e828f280d1c66fff193c043d5bc36154577.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2379214226))
2024-09-27 12:54:47 +00:00
c2637a7b26 [inductor] [cpp] fix gemm_output_name conflict (#136419)
Fixes the max-autotune failure of `soft_actor_critic` in Torchbench for the FP32, single-thread, dynamic-shape case:
```log
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
    C_ptr = f"&({kernel.index(C, [0, 0])})"
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
    else self.args.input(node.get_name())
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
    assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```

The 1st and 2nd linears do not need a local buffer, while the 3rd linear does.
The 3rd linear, which uses a local buffer, adds its global buffer (named `buf_GemmOut`) to `V.graph.removed_buffers`.

When scheduling the nodes, the 1st linear (which does not use a local buffer) gets its output buffer (also named `buf_GemmOut`) from the input, finds it in `V.graph.removed_buffers`, and raises the AssertionError. The root issue is that the output buffers of all these linears share the name `buf_GemmOut`, which conflicts.

Fix: rename these buffers by adding the name of the `template_buffer` as a prefix.
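A toy sketch of the renaming (hypothetical helper, not the Inductor code): prefixing with the owning template buffer's name keeps the three GEMM outputs from colliding on `buf_GemmOut`:
```python
# Each template buffer gets its own GEMM output name, so no two linears
# share the same removed-buffer bookkeeping entry.
def gemm_output_name(template_buffer_name: str) -> str:
    return f"{template_buffer_name}_GemmOut"

names = [gemm_output_name(b) for b in ("buf0", "buf1", "buf2")]
assert len(set(names)) == 3  # previously all three collapsed to "buf_GemmOut"
```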

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
2024-09-27 12:23:17 +00:00
b42f1e3641 [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives the wrong BLOCK_SIZE and shape when using a non-default block size (the default is `(128, 128)`).
This PR fixes the issue by using the BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.
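A minimal usage sketch (assuming a CUDA build; not the PR's test) of passing a non-default block size in `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)` order:
```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

if torch.cuda.is_available():
    # BLOCK_SIZE order is (query block size, key/value block size)
    block_mask = create_block_mask(causal, B=1, H=1, Q_LEN=2048, KV_LEN=2048,
                                   device="cuda", BLOCK_SIZE=(64, 128))
    print(block_mask)
```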

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-27 11:26:47 +00:00
9581508383 [aotd] Cleanup on subclasses in inductor freezing (#136549)
Cleanup:
1/ We do not need to call unwrap_subclasses() in the freezing wrapper, as it will be wrapped by the AOTD wrappers, which include SubclassesWrapper.
2/ No need to use weak references for the unwrapped list; dynamo optimizers need to clean the unwrapped list along with the original params_flat.
Verified with the fbcode compiled_optimizers tests.

Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
2024-09-27 11:20:03 +00:00
cyy
bbff667e32 [Distributed] [13/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136713)
Follows #136528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136713
Approved by: https://github.com/kwen2501
2024-09-27 10:11:53 +00:00
48c18ff850 [dynamo] Added support for tensor's is_inference method (#136450)
Fixes #135439

This PR adds support for the `is_inference` method on torch tensors, which lets the following example fn compile without graph breaks:
```python
def fn_simple(x):
    if x.is_inference():
        return x.sum()
    else:
        return x.min()
```

I've also tried to add guards on the tensor to guard on `is_inference`. I wasn't 100% sure where these should go, so please don't hesitate to correct me.
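A hedged usage sketch of the example above (assuming the support added here):
```python
import torch

compiled = torch.compile(fn_simple, fullgraph=True)

with torch.inference_mode():
    x = torch.randn(4)
    print(compiled(x))   # x.is_inference() is True -> sum() branch

y = torch.randn(4)
print(compiled(y))       # regular tensor -> min() branch
```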

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136450
Approved by: https://github.com/ezyang
2024-09-27 09:15:07 +00:00
e14b58ffbd Using device-agnostic autocast api (#136613)
- using torch.autocast(device_type="cuda") instead of torch.cuda.amp.autocast()
- using torch.autocast(device_type="cpu") instead of torch.cpu.amp.autocast()
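A minimal sketch of the device-agnostic spelling:
```python
import torch

a, b = torch.randn(8, 8), torch.randn(8, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    c = a @ b
print(c.dtype)  # torch.bfloat16

if torch.cuda.is_available():
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        d = a.cuda() @ b.cuda()
    print(d.dtype)  # torch.float16
```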

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136613
Approved by: https://github.com/shink, https://github.com/cyyever, https://github.com/kwen2501
2024-09-27 07:16:24 +00:00
ad6c70b656 [PP] Remove modifications to autograd nodes in ZB (#136678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136678
Approved by: https://github.com/wconstab, https://github.com/kwen2501
ghstack dependencies: #136507, #136584
2024-09-27 07:07:58 +00:00
9529d018e9 Refactor offset logic and work for nD (#135861)
Addresses TODO items in the distributed test files:

- TODO: make this test cleaner and work for nD
- TODO: add comments for create_plan/TestDedupTensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135861
Approved by: https://github.com/wz337
2024-09-27 06:13:06 +00:00
69bd13d12e [EZ][BE] Add torch.complex to MPS_DTYPES (#136755)
As the minimum supported OS has been raised to macOS 13, some basic complex operations should be supported, and the rest can be `xfail`ed.
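A hedged smoke-test sketch (assuming an MPS build on macOS 13+; op coverage may still vary):
```python
import torch

if torch.backends.mps.is_available():
    z = torch.tensor([1 + 2j, 3 - 1j], dtype=torch.complex64, device="mps")
    print(z + z)   # basic complex arithmetic on the MPS device
```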
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136755
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754
2024-09-27 05:01:40 +00:00
73f038c5b3 Log total miss inplaced bytes (#136684)
Summary: title.

Test Plan: add tests. run existing tests.

Differential Revision: D63411459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136684
Approved by: https://github.com/zou3519
2024-09-27 04:57:57 +00:00
0200bea562 Delete grid reduction optimization that is causing specialization (#136783)
Summary:
https://fb.workplace.com/groups/1075192433118967/posts/1510513706253502

Creating a set is causing symexpr to specialize

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D63432357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136783
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-09-27 04:39:39 +00:00
a63d7cb54c add typing to _dynamo/current_scope_id.py (#136676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136676
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/Skylion007
2024-09-27 04:09:15 +00:00
5eb68d565f Revert "[inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594)"
This reverts commit 2c5f5e303a8d6fd55b6651f4d965fafaa6a540a7.

Reverted https://github.com/pytorch/pytorch/pull/136594 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136594#issuecomment-2378358302))
2024-09-27 04:06:05 +00:00
2131 changed files with 70750 additions and 30225 deletions

View File

@ -21,6 +21,3 @@
cxx = /usr/bin/clang++
cxxpp = /usr/bin/clang++
ld = /usr/bin/clang++
[project]
default_flavors_mode=all

View File

@ -355,6 +355,12 @@ case "$image" in
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -381,6 +387,13 @@ case "$image" in
HALIDE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
@ -510,6 +523,7 @@ docker build \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "TRITON_CPU=${TRITON_CPU}" \
--build-arg "ONNX=${ONNX}" \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

View File

@ -1 +1 @@
cd1c833b079adb324871dcbbe75b43d42ffc0ade
ca4783992ed7602a39528ba304d61f00396b2a5a

View File

@ -0,0 +1 @@
c7711371cace304afe265c1ffa906415ab82fc66

View File

@ -1 +1 @@
5fe38ffd73c2ac6ed6323b554205186696631c6f
cf34004b8a67d290a962da166f5aa2fc66751326

View File

@ -13,11 +13,18 @@ if [ -n "$CLANG_VERSION" ]; then
elif [[ $UBUNTU_VERSION == 22.04 ]]; then
# work around ubuntu apt-get conflicts
sudo apt-get -y -f install
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
if [[ $CLANG_VERSION == 18 ]]; then
apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"
fi
fi
sudo apt-get update
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION"
apt-get install -y --no-install-recommends llvm-"$CLANG_VERSION"
if [[ $CLANG_VERSION -ge 18 ]]; then
apt-get install -y libomp-${CLANG_VERSION}-dev libclang-rt-${CLANG_VERSION}-dev clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
else
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
fi
# Install dev version of LLVM.
if [ -n "$LLVMDEV" ]; then

View File

@ -65,23 +65,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
NUMPY_VERSION=1.24.4
else
NUMPY_VERSION=1.26.2
fi
conda_install "openblas==0.3.25=*openmp*"
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
NUMPY_VERSION=1.26.0
else
NUMPY_VERSION=1.21.2
fi
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
fi
conda_install ${CONDA_COMMON_DEPS}
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
@ -103,8 +90,6 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt
pip_install numpy=="$NUMPY_VERSION"
pip_install -U scikit-learn
if [ -n "$DOCS" ]; then
apt-get update

View File

@ -105,7 +105,7 @@ function install_121 {
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
@ -137,6 +137,39 @@ function install_124 {
ldconfig
}
function install_126 {
echo "Installing CUDA 12.6.2 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.2 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda_12.6.2_560.35.03_linux.run
chmod +x cuda_12.6.2_560.35.03_linux.run
./cuda_12.6.2_560.35.03_linux.run --toolkit --silent
rm -f cuda_12.6.2_560.35.03_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_062
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
@ -227,12 +260,46 @@ function prune_124 {
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
@ -243,6 +310,8 @@ do
;;
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
*) echo "bad argument $1"; exit 1
;;
esac

View File

@ -5,19 +5,19 @@ set -ex
NCCL_VERSION=v2.21.5-1
function install_cusparselt_052 {
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
@ -44,7 +44,7 @@ function install_124 {
cd ..
rm -rf nccl
install_cusparselt_052
install_cusparselt_062
ldconfig
}

View File

@ -32,7 +32,7 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
pip_install onnxscript==0.1.0.dev20241009 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -15,8 +15,11 @@ conda_reinstall() {
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
elif [ -n "${TRITON_CPU}" ]; then
TRITON_REPO="https://github.com/triton-lang/triton-cpu"
TRITON_TEXT_FILE="triton-cpu"
else
TRITON_REPO="https://github.com/openai/triton"
TRITON_REPO="https://github.com/triton-lang/triton"
TRITON_TEXT_FILE="triton"
fi
@ -44,9 +47,10 @@ chown -R jenkins /var/lib/jenkins/triton
chgrp -R jenkins /var/lib/jenkins/triton
pushd /var/lib/jenkins/
as_jenkins git clone ${TRITON_REPO} triton
as_jenkins git clone --recursive ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

View File

@ -41,13 +41,16 @@ function install_ubuntu() {
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
apt-get install -y intel-ocloc
fi
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev-0.9
else
apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev
apt-get install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9
fi
# Cleanup
@ -97,7 +100,7 @@ EOF
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel Support Packages
yum install -y intel-for-pytorch-gpu-dev intel-pti-dev
yum install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9
# Cleanup
dnf clean all
@ -131,7 +134,7 @@ function install_sles() {
zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel
# Install Intel Support Packages
zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev
zypper install -y intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9
}

View File

@ -70,6 +70,10 @@ FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
@ -79,6 +83,7 @@ FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
# Final step
FROM ${BASE_TARGET} as final

View File

@ -1,10 +1,12 @@
# cf. https://github.com/pypa/manylinux/issues/53
import sys
from urllib.request import urlopen
GOOD_SSL = "https://google.com"
BAD_SSL = "https://self-signed.badssl.com"
import sys
print("Testing SSL certificate checking for Python:", sys.version)
@ -12,14 +14,8 @@ if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
if sys.version_info[0] >= 3:
from urllib.request import urlopen
EXC = OSError
else:
from urllib import urlopen
EXC = IOError
EXC = OSError
print(f"Connecting to {GOOD_SSL} should work")
urlopen(GOOD_SSL)

View File

@ -5,7 +5,7 @@
#Pinned versions: 1.6
#test that import:
boto3==1.19.12
boto3==1.35.42
#Description: AWS SDK for python
#Pinned versions: 1.19.12, 1.16.34
#test that import:
@ -118,7 +118,7 @@ numba==0.55.2 ; python_version == "3.10"
#numpy
#Description: Provides N-dimensional arrays and linear algebra
#Pinned versions: 1.20
#Pinned versions: 1.26.2
#test that import: test_view_ops.py, test_unary_ufuncs.py, test_type_promotion.py,
#test_type_info.py, test_torch.py, test_tensorexpr_pybind.py, test_tensorexpr.py,
#test_tensorboard.py, test_tensor_creation_ops.py, test_static_runtime.py,
@ -128,6 +128,10 @@ numba==0.55.2 ; python_version == "3.10"
#test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py,
#test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py,
#test_binary_ufuncs.py
numpy==1.21.2; python_version == "3.9"
numpy==1.22.4; python_version == "3.10"
numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
numpy==2.1.2; python_version >= "3.13"
#onnxruntime
#Description: scoring engine for Open Neural Network Exchange (ONNX) models
@ -139,9 +143,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.12.1
optree==0.13.0
#Description: A library for tree manipulation
#Pinned versions: 0.12.1
#Pinned versions: 0.13.0
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
@ -322,13 +326,12 @@ lxml==5.0.0
PyGithub==2.3.0
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
onnx==1.16.1
onnx==1.17.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -342,3 +345,26 @@ parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 1.24.0
#test that import: test_sac_estimator.py
pwlf==2.2.1 ; python_version >= "3.8"
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
astunparse
PyYAML
setuptools
ninja==1.11.1 ; platform_machine == "aarch64"
scons==4.5.2 ; platform_machine == "aarch64"
pulp==2.9.0 ; python_version >= "3.8"
#Description: required for testing ilp formulaiton under torch/distributed/_tools
#Pinned versions: 2.9.0
#test that import: test_sac_ilp.py

View File

@ -147,6 +147,13 @@ COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG TRITON_CPU
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt
RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-cpu.txt
ARG EXECUTORCH
# Build and install executorch
COPY ./common/install_executorch.sh install_executorch.sh

10
.ci/libtorch/build.sh Normal file
View File

@ -0,0 +1,10 @@
#!/usr/bin/env bash
# This is mostly just a shim to manywheel/build.sh
# TODO: Make this a dedicated script to build just libtorch
set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh

21
.ci/manywheel/LICENSE Normal file
View File

@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2016 manylinux
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

25
.ci/manywheel/build.sh Executable file
View File

@ -0,0 +1,25 @@
#!/usr/bin/env bash
set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
case "${GPU_ARCH_TYPE:-BLANK}" in
BLANK)
# Legacy behavior for CircleCI
bash "${SCRIPTPATH}/build_cuda.sh"
;;
cuda)
bash "${SCRIPTPATH}/build_cuda.sh"
;;
rocm)
bash "${SCRIPTPATH}/build_rocm.sh"
;;
cpu | cpu-cxx11-abi | cpu-s390x | xpu)
bash "${SCRIPTPATH}/build_cpu.sh"
;;
*)
echo "Un-recognized GPU_ARCH_TYPE '${GPU_ARCH_TYPE}', exiting..."
exit 1
;;
esac

View File

@ -0,0 +1,505 @@
#!/usr/bin/env bash
# meant to be called only from the neighboring build.sh and build_cpu.sh scripts
set -ex
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
# Require only one python installation
if [[ -z "$DESIRED_PYTHON" ]]; then
echo "Need to set DESIRED_PYTHON env variable"
exit 1
fi
if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then
echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"
echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"
exit 1
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# TODO: Remove this once nvidia package repos are back online
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
fi
# We use the package name to test the package by passing this to 'pip install'
# This is the env variable that setup.py uses to name the package. Note that
# pip 'normalizes' the name first by changing all - to _
if [[ -z "$TORCH_PACKAGE_NAME" ]]; then
TORCH_PACKAGE_NAME='torch'
fi
if [[ -z "$TORCH_NO_PYTHON_PACKAGE_NAME" ]]; then
TORCH_NO_PYTHON_PACKAGE_NAME='torch_no_python'
fi
TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"
TORCH_NO_PYTHON_PACKAGE_NAME="$(echo $TORCH_NO_PYTHON_PACKAGE_NAME | tr '-' '_')"
echo "Expecting the built wheels to all be called '$TORCH_PACKAGE_NAME' or '$TORCH_NO_PYTHON_PACKAGE_NAME'"
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
# PYTORCH_BUILD_NUMBER > 1
build_version="$PYTORCH_BUILD_VERSION"
build_number="$PYTORCH_BUILD_NUMBER"
if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then
# This will be the *exact* version, since build_number<1
build_version="$OVERRIDE_PACKAGE_VERSION"
build_number=0
fi
if [[ -z "$build_version" ]]; then
build_version=1.0.0
fi
if [[ -z "$build_number" ]]; then
build_number=1
fi
export PYTORCH_BUILD_VERSION=$build_version
export PYTORCH_BUILD_NUMBER=$build_number
export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"
export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"
if [[ -e /opt/openssl ]]; then
export OPENSSL_ROOT_DIR=/opt/openssl
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
# If given a python version like 3.6m or 2.7mu, convert this to the format we
# expect. The binary CI jobs pass in python versions like this; they also only
# ever pass one python version, so we assume that DESIRED_PYTHON is not a list
# in this case
if [[ -n "$DESIRED_PYTHON" && $DESIRED_PYTHON =~ ([0-9].[0-9]+)t ]]; then
python_digits="$(echo $DESIRED_PYTHON | tr -cd [:digit:])"
py_majmin="${DESIRED_PYTHON}"
DESIRED_PYTHON="cp${python_digits}-cp${python_digits}t"
elif [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then
python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"
DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"
if [[ ${python_nodot} -ge 310 ]]; then
py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:2}"
else
py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:1}"
fi
fi
pydir="/opt/python/$DESIRED_PYTHON"
export PATH="$pydir/bin:$PATH"
echo "Will build for Python version: ${DESIRED_PYTHON} with ${python_installation}"
mkdir -p /tmp/$WHEELHOUSE_DIR
export PATCHELF_BIN=/usr/local/bin/patchelf
patchelf_version=$($PATCHELF_BIN --version)
echo "patchelf version: " $patchelf_version
if [[ "$patchelf_version" == "patchelf 0.9" ]]; then
echo "Your patchelf version is too old. Please use version >= 0.10."
exit 1
fi
########################################################
# Compile wheels as well as libtorch
#######################################################
if [[ -z "$PYTORCH_ROOT" ]]; then
echo "Need to set PYTORCH_ROOT env variable"
exit 1
fi
pushd "$PYTORCH_ROOT"
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in
cp31*)
retry pip install -q --pre numpy==2.1.0
;;
# Should catch 3.9+
*)
retry pip install -q --pre numpy==2.0.2
;;
esac
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
fi
# This value comes from binary_linux_build.sh (and should only be set to true
# for master / release branches)
BUILD_DEBUG_INFO=${BUILD_DEBUG_INFO:=0}
if [[ $BUILD_DEBUG_INFO == "1" ]]; then
echo "Building wheel and debug info"
else
echo "BUILD_DEBUG_INFO was not set, skipping debug info"
fi
if [[ "$DISABLE_RCCL" = 1 ]]; then
echo "Disabling NCCL/RCCL in pyTorch"
USE_RCCL=0
USE_NCCL=0
USE_KINETO=0
else
USE_RCCL=1
USE_NCCL=1
USE_KINETO=1
fi
echo "Calling setup.py bdist at $(date)"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
fi
echo "Finished setup.py bdist at $(date)"
# Build libtorch packages
if [[ -n "$BUILD_PYTHONLESS" ]]; then
# Now build pythonless libtorch
# Note - just use whichever python we happen to be on
python setup.py clean
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
fi
mkdir -p build
pushd build
echo "Calling tools/build_libtorch.py at $(date)"
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \
python ../tools/build_libtorch.py
echo "Finished tools/build_libtorch.py at $(date)"
popd
mkdir -p libtorch/{lib,bin,include,share}
cp -r build/build/lib libtorch/
# for now, the headers for the libtorch package will just be copied in
# from one of the wheels (this is from when this script built multiple
# wheels at once)
ANY_WHEEL=$(ls /tmp/$WHEELHOUSE_DIR/torch*.whl | head -n1)
unzip -d any_wheel $ANY_WHEEL
if [[ -d any_wheel/torch/include ]]; then
cp -r any_wheel/torch/include libtorch/
else
cp -r any_wheel/torch/lib/include libtorch/
fi
cp -r any_wheel/torch/share/cmake libtorch/share/
rm -rf any_wheel
echo $PYTORCH_BUILD_VERSION > libtorch/build-version
echo "$(pushd $PYTORCH_ROOT && git rev-parse HEAD)" > libtorch/build-hash
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
fi
popd
#######################################################################
# ADD DEPENDENCIES INTO THE WHEEL
#
# auditwheel repair doesn't work correctly and is buggy
# so manually do the work of copying dependency libs and patchelfing
# and fixing RECORDS entries correctly
######################################################################
fname_with_sha256() {
HASH=$(sha256sum $1 | cut -c1-8)
DIRNAME=$(dirname $1)
BASENAME=$(basename $1)
# Do not rename nvrtc-builtins.so as they are dynamically loaded
# by libnvrtc.so
# Similarly don't mangle libcudnn and libcublas library names
if [[ $BASENAME == "libnvrtc-builtins.s"* || $BASENAME == "libcudnn"* || $BASENAME == "libcublas"* ]]; then
echo $1
else
INITNAME=$(echo $BASENAME | cut -f1 -d".")
ENDNAME=$(echo $BASENAME | cut -f 2- -d".")
echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"
fi
}
fname_without_so_number() {
LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')
echo "$LINKNAME"
}
make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# if the RECORD file, then
echo "$FPATH,,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "$FPATH,sha256=$HASH,$FSIZE"
fi
}
replace_needed_sofiles() {
find $1 -name '*.so*' | while read sofile; do
origname=$2
patchedname=$3
if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
set +e
origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "patching $sofile entry $origname to $patchedname"
$PATCHELF_BIN --replace-needed $origname $patchedname $sofile
fi
fi
done
}
echo 'Built this wheel:'
ls /tmp/$WHEELHOUSE_DIR
mkdir -p "/$WHEELHOUSE_DIR"
mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true
fi
if [[ -n "$BUILD_PYTHONLESS" ]]; then
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
rm -rf /tmp/$LIBTORCH_HOUSE_DIR
fi
rm -rf /tmp/$WHEELHOUSE_DIR
rm -rf /tmp_dir
mkdir /tmp_dir
pushd /tmp_dir
for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.whl /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do
# if the glob didn't match anything
if [[ ! -e $pkg ]]; then
continue
fi
rm -rf tmp
mkdir -p tmp
cd tmp
cp $pkg .
unzip -q $(basename $pkg)
rm -f $(basename $pkg)
if [[ -d torch ]]; then
PREFIX=torch
else
PREFIX=libtorch
fi
if [[ $pkg != *"without-deps"* ]]; then
# copy over needed dependent .so files over and tag them with their hash
patched=()
for filepath in "${DEPS_LIST[@]}"; do
filename=$(basename $filepath)
destpath=$PREFIX/lib/$filename
if [[ "$filepath" != "$destpath" ]]; then
cp $filepath $destpath
fi
# ROCm workaround for roctracer dlopens
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
# Keep the so number for XPU dependencies
elif [[ "$DESIRED_CUDA" == *"xpu"* ]]; then
patchedpath=$destpath
else
patchedpath=$(fname_with_sha256 $destpath)
fi
patchedname=$(basename $patchedpath)
if [[ "$destpath" != "$patchedpath" ]]; then
mv $destpath $patchedpath
fi
patched+=("$patchedname")
echo "Copied $filepath to $patchedpath"
done
echo "patching to fix the so names to the hashed names"
for ((i=0;i<${#DEPS_LIST[@]};++i)); do
replace_needed_sofiles $PREFIX ${DEPS_SONAME[i]} ${patched[i]}
# do the same for caffe2, if it exists
if [[ -d caffe2 ]]; then
replace_needed_sofiles caffe2 ${DEPS_SONAME[i]} ${patched[i]}
fi
done
# copy over needed auxiliary files
for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do
srcpath=${DEPS_AUX_SRCLIST[i]}
dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}
mkdir -p $(dirname $dstpath)
cp $srcpath $dstpath
done
fi
# set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib
find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'}"
$PATCHELF_BIN --set-rpath ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# set RPATH of lib/ files to $ORIGIN
find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to ${LIB_SO_RPATH:-'$ORIGIN'}"
$PATCHELF_BIN --set-rpath ${LIB_SO_RPATH:-'$ORIGIN'} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# regenerate the RECORD file with new hashes
record_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g')
if [[ -e $record_file ]]; then
echo "Generating new record file $record_file"
: > "$record_file"
# generate records for folders in wheel
find * -type f | while read fname; do
make_wheel_record "$fname" >>"$record_file"
done
fi
if [[ $BUILD_DEBUG_INFO == "1" ]]; then
pushd "$PREFIX/lib"
# Duplicate library into debug lib
cp libtorch_cpu.so libtorch_cpu.so.dbg
# Keep debug symbols on debug lib
strip --only-keep-debug libtorch_cpu.so.dbg
# Remove debug info from release lib
strip --strip-debug libtorch_cpu.so
objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg
# Zip up debug info
mkdir -p /tmp/debug
mv libtorch_cpu.so.dbg /tmp/debug/libtorch_cpu.so.dbg
CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch_cpu.so)
pushd /tmp
PKG_NAME=$(basename "$pkg" | sed 's/\.whl$//g')
zip /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip /tmp/debug/libtorch_cpu.so.dbg
cp /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip "$PYTORCH_FINAL_PACKAGE_DIR"
popd
popd
fi
# zip up the wheel back
zip -rq $(basename $pkg) $PREIX*
# replace original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
cd ..
rm -rf tmp
done
# Copy wheels to host machine for persistence before testing
if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
if [[ -n "$BUILD_PYTHONLESS" ]]; then
cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
else
cp /$WHEELHOUSE_DIR/torch*.whl "$PYTORCH_FINAL_PACKAGE_DIR"
fi
fi
# remove stuff before testing
rm -rf /opt/rh
if ls /usr/local/cuda* >/dev/null 2>&1; then
rm -rf /usr/local/cuda*
fi
# Test that all the wheels work
if [[ -z "$BUILD_PYTHONLESS" ]]; then
export OMP_NUM_THREADS=4 # on NUMA machines this takes too long
pushd $PYTORCH_ROOT/test
# Install the wheel for this Python version
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true
fi
pip uninstall -y "$TORCH_PACKAGE_NAME"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
fi
pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
# Print info on the libraries installed in this wheel
# Rather than adjust find command to skip non-library files with an embedded *.so* in their name,
# since this is only for reporting purposes, we add the || true to the ldd command.
installed_libraries=($(find "$pydir/lib/python${py_majmin}/site-packages/torch/" -name '*.so*'))
echo "The wheel installed all of the libraries: ${installed_libraries[@]}"
for installed_lib in "${installed_libraries[@]}"; do
ldd "$installed_lib" || true
done
# Run the tests
echo "$(date) :: Running tests"
pushd "$PYTORCH_ROOT"
#TODO: run_tests.sh and check_binary.sh should be moved to pytorch/pytorch project
LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
"/builder/run_tests.sh" manywheel "${py_majmin}" "$DESIRED_CUDA"
popd
echo "$(date) :: Finished tests"
fi

99
.ci/manywheel/build_cpu.sh Executable file
View File

@ -0,0 +1,99 @@
#!/usr/bin/env bash
set -ex
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
export TH_BINARY_BUILD=1
export USE_CUDA=0
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
DIR_SUFFIX=cpu
if [[ "$GPU_ARCH_TYPE" == "xpu" ]]; then
DIR_SUFFIX=xpu
# Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
export USE_STATIC_MKL=1
fi
WHEELHOUSE_DIR="wheelhouse$DIR_SUFFIX"
LIBTORCH_HOUSE_DIR="libtorch_house$DIR_SUFFIX"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$DIR_SUFFIX"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$DIR_SUFFIX"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
if [[ "$(uname -m)" == "s390x" ]]; then
LIBGOMP_PATH="/usr/lib/s390x-linux-gnu/libgomp.so.1"
else
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
)
DEPS_SONAME=(
"libgomp.so.1"
)
if [[ "$GPU_ARCH_TYPE" == "xpu" ]]; then
echo "Bundling with xpu support package libs."
DEPS_LIST+=(
"/opt/intel/oneapi/compiler/latest/lib/libsycl-preview.so.7"
"/opt/intel/oneapi/compiler/latest/lib/libOpenCL.so.1"
"/opt/intel/oneapi/compiler/latest/lib/libxptifw.so"
"/opt/intel/oneapi/compiler/latest/lib/libsvml.so"
"/opt/intel/oneapi/compiler/latest/lib/libirng.so"
"/opt/intel/oneapi/compiler/latest/lib/libimf.so"
"/opt/intel/oneapi/compiler/latest/lib/libintlc.so.5"
"/opt/intel/oneapi/compiler/latest/lib/libpi_level_zero.so"
"/opt/intel/oneapi/pti/latest/lib/libpti_view.so.0.9"
"/opt/intel/oneapi/pti/latest/lib/libpti.so.0.9"
)
DEPS_SONAME+=(
"libsycl-preview.so.7"
"libOpenCL.so.1"
"libxptifw.so"
"libsvml.so"
"libirng.so"
"libimf.so"
"libintlc.so.5"
"libpi_level_zero.so"
"libpti_view.so.0.9"
"libpti.so.0.9"
)
fi
rm -rf /usr/local/cuda*
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source ${SOURCE_DIR}/${BUILD_SCRIPT}

290
.ci/manywheel/build_cuda.sh Normal file
View File

@ -0,0 +1,290 @@
#!/usr/bin/env bash
set -ex
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P ))"
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
export NCCL_ROOT_DIR=/usr/local/cuda
export TH_BINARY_BUILD=1
export USE_STATIC_CUDNN=1
export USE_STATIC_NCCL=1
export ATEN_STATIC_CUDA=1
export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # dont install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Determine CUDA version and architectures to build for
#
# NOTE: We should first check `DESIRED_CUDA` when determining `CUDA_VERSION`,
# because in some cases a single Docker image can have multiple CUDA versions
# on it, and `nvcc --version` might not show the CUDA version we want.
if [[ -n "$DESIRED_CUDA" ]]; then
# If the DESIRED_CUDA already matches the format that we expect
if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then
CUDA_VERSION=${DESIRED_CUDA}
else
# cu90, cu92, cu100, cu101
if [[ ${#DESIRED_CUDA} -eq 4 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"
elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"
fi
fi
echo "Using CUDA $CUDA_VERSION as determined by DESIRED_CUDA"
# There really has to be a better way to do this - eli
# Possibly limiting builds to specific cuda versions be delimiting images would be a choice
if [[ "$OS_NAME" == *"Ubuntu"* ]]; then
echo "Switching to CUDA version ${DESIRED_CUDA}"
/builder/conda/switch_cuda_version.sh "${DESIRED_CUDA}"
fi
else
CUDA_VERSION=$(nvcc --version|grep release|cut -f5 -d" "|cut -f1 -d",")
echo "CUDA $CUDA_VERSION Detected"
fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.4)
if [[ "$GPU_ARCH_TYPE" = "cuda-aarch64" ]]; then
TORCH_CUDA_ARCH_LIST="9.0"
else
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"
fi
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.1)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.[67])
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
*)
echo "unknown cuda version $CUDA_VERSION"
exit 1
;;
esac
export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}
echo "${TORCH_CUDA_ARCH_LIST}"
# Package directories
WHEELHOUSE_DIR="wheelhouse$cuda_version_nodot"
LIBTORCH_HOUSE_DIR="libtorch_house$cuda_version_nodot"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$cuda_version_nodot"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$cuda_version_nodot"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
)
DEPS_SONAME=(
"libgomp.so.1"
)
if [[ $USE_CUSPARSELT == "1" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
if [[ $CUDA_VERSION == "12.1" || $CUDA_VERSION == "12.4" ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.12"
"libcublasLt.so.12"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.11"
"/usr/local/cuda/lib64/libcublasLt.so.11"
"/usr/local/cuda/lib64/libcudart.so.11.0"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.11.2" # this is not a mistake, it links to more specific cuda version
"/usr/local/cuda/lib64/libnvrtc-builtins.so.11.8"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.11"
"libcublasLt.so.11"
"libcudart.so.11.0"
"libnvToolsExt.so.1"
"libnvrtc.so.11.2"
"libnvrtc-builtins.so.11.8"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
else
echo "Unknown cuda version $CUDA_VERSION"
exit 1
fi
# builder/test.sh requires DESIRED_CUDA to know what tests to exclude
export DESIRED_CUDA="$cuda_version_nodot"
# Switch `/usr/local/cuda` to the desired CUDA version
rm -rf /usr/local/cuda || true
ln -s "/usr/local/cuda-${CUDA_VERSION}" /usr/local/cuda
# Switch `/usr/local/magma` to the desired CUDA version
rm -rf /usr/local/magma || true
ln -s /usr/local/cuda-${CUDA_VERSION}/magma /usr/local/magma
export CUDA_VERSION=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev) # 10.0.130
export CUDA_VERSION_SHORT=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev | cut -f1,2 -d".") # 10.0
export CUDNN_VERSION=$(ls /usr/local/cuda/lib64/libcudnn.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev)
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source $SCRIPTPATH/${BUILD_SCRIPT}

View File

@ -0,0 +1,353 @@
#!/usr/bin/env bash
# meant to be called only from the neighboring build.sh and build_cpu.sh scripts
set -e pipefail
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
# Require only one python installation
if [[ -z "$DESIRED_PYTHON" ]]; then
echo "Need to set DESIRED_PYTHON env variable"
exit 1
fi
if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then
echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"
echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"
exit 1
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# TODO move this into the Docker images
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# TODO: Remove this once nvidia package repos are back online
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
fi
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
# PYTORCH_BUILD_NUMBER > 1
build_version="$PYTORCH_BUILD_VERSION"
build_number="$PYTORCH_BUILD_NUMBER"
if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then
# This will be the *exact* version, since build_number<1
build_version="$OVERRIDE_PACKAGE_VERSION"
build_number=0
fi
if [[ -z "$build_version" ]]; then
build_version=1.0.0
fi
if [[ -z "$build_number" ]]; then
build_number=1
fi
export PYTORCH_BUILD_VERSION=$build_version
export PYTORCH_BUILD_NUMBER=$build_number
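# Worked example (hypothetical values): PYTORCH_BUILD_VERSION=2.6.0 with
# PYTORCH_BUILD_NUMBER=3 yields package version "2.6.0.post3"; when
# OVERRIDE_PACKAGE_VERSION is set, build_number is 0 and the version is used as-is.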
export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"
export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"
# set OPENSSL_ROOT_DIR=/opt/openssl if it exists
if [[ -e /opt/openssl ]]; then
export OPENSSL_ROOT_DIR=/opt/openssl
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
# If given a python version like 3.6m or 2.7mu, convert this to the format we
# expect. The binary CI jobs pass in python versions like this; they also only
# ever pass one python version, so we assume that DESIRED_PYTHON is not a list
# in this case
if [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then
python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"
DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"
fi
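# Example: DESIRED_PYTHON=3.10 becomes cp310-cp310, matching the
# /opt/python/<tag> layout used just below.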
pydir="/opt/python/$DESIRED_PYTHON"
export PATH="$pydir/bin:$PATH"
export PATCHELF_BIN=/usr/local/bin/patchelf
patchelf_version=`$PATCHELF_BIN --version`
echo "patchelf version: " $patchelf_version
if [[ "$patchelf_version" == "patchelf 0.9" ]]; then
echo "Your patchelf version is too old. Please use version >= 0.10."
exit 1
fi
########################################################
# Compile wheels as well as libtorch
#######################################################
if [[ -z "$PYTORCH_ROOT" ]]; then
echo "Need to set PYTORCH_ROOT env variable"
exit 1
fi
pushd "$PYTORCH_ROOT"
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
# TODO remove this work-around once pytorch sources are updated
export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr
fi
echo "Calling setup.py install at $(date)"
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
fi
(
set -x
mkdir -p build
# TODO: Remove the CFLAGS flag below once https://github.com/pytorch/pytorch/issues/55952 is closed
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \
CFLAGS='-Wno-deprecated-declarations' \
BUILD_LIBTORCH_CPU_WITH_DEBUG=1 \
python setup.py install
mkdir -p libtorch/{lib,bin,include,share}
# Make debug folder separate so it doesn't get zipped up with the rest of
# libtorch
mkdir debug
# Copy over all lib files
cp -rv build/lib/* libtorch/lib/
cp -rv build/lib*/torch/lib/* libtorch/lib/
# Copy over all include files
cp -rv build/include/* libtorch/include/
cp -rv build/lib*/torch/include/* libtorch/include/
# Copy over all of the cmake files
cp -rv build/lib*/torch/share/* libtorch/share/
# Split libtorch into debug / release version
cp libtorch/lib/libtorch_cpu.so libtorch/lib/libtorch_cpu.so.dbg
# Keep debug symbols on debug lib
strip --only-keep-debug libtorch/lib/libtorch_cpu.so.dbg
# Remove debug info from release lib
strip --strip-debug libtorch/lib/libtorch_cpu.so
# Add a debug link to the release lib to the debug lib (debuggers will then
# search for symbols in a file called libtorch_cpu.so.dbg in some
# predetermined locations) and embed a CRC32 of the debug library into the .so
cd libtorch/lib
objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg
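# (Debuggers such as gdb resolve the .gnu_debuglink by looking for
# libtorch_cpu.so.dbg next to the stripped library, in a .debug/ subdirectory,
# or in the global debug directories, and verify it against the embedded CRC32.)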
cd ../..
# Move the debug symbols to its own directory so it doesn't get processed /
# zipped with all the other libraries
mv libtorch/lib/libtorch_cpu.so.dbg debug/libtorch_cpu.so.dbg
echo "${PYTORCH_BUILD_VERSION}" > libtorch/build-version
echo "$(pushd $PYTORCH_ROOT && git rev-parse HEAD)" > libtorch/build-hash
)
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
(
set -x
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
# objcopy embedded a CRC32 of the debug file into libtorch_cpu.so above, so add that CRC32 to the debug zip name here
CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch/lib/libtorch_cpu.so)
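# (The process substitution above dumps the .gnu_debuglink section, whose last
# 4 bytes are the CRC32 that objcopy appended, and formats them as hex so the
# debug zip can be matched back to the stripped library.)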
# Zip debug symbols
zip /tmp/$LIBTORCH_HOUSE_DIR/debug-libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION-$CRC32.zip debug/libtorch_cpu.so.dbg
# Zip and copy libtorch
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
)
popd
#######################################################################
# ADD DEPENDENCIES INTO THE WHEEL
#
# `auditwheel repair` does not handle these packages correctly, so manually
# copy the dependency libs, patchelf the binaries, and fix the RECORD
# entries ourselves
######################################################################
fname_with_sha256() {
HASH=$(sha256sum $1 | cut -c1-8)
DIRNAME=$(dirname $1)
BASENAME=$(basename $1)
if [[ $BASENAME == "libnvrtc-builtins.so" || $BASENAME == "libcudnn"* ]]; then
echo $1
else
INITNAME=$(echo $BASENAME | cut -f1 -d".")
ENDNAME=$(echo $BASENAME | cut -f 2- -d".")
echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"
fi
}
fname_without_so_number() {
LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')
echo "$LINKNAME"
}
make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# the RECORD file itself is listed with empty hash and size fields
echo "$FPATH,,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "$FPATH,sha256=$HASH,$FSIZE"
fi
}
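# Illustration of the helpers above (hash values are hypothetical):
#   fname_with_sha256 libtorch/lib/libgomp.so.1   -> libtorch/lib/libgomp-ab12cd34.so.1
#   fname_with_sha256 libtorch/lib/libcudnn.so.9  -> libtorch/lib/libcudnn.so.9  (cudnn/nvrtc-builtins are left untouched)
#   fname_without_so_number libgomp.so.1          -> libgomp.so
#   make_wheel_record libtorch/lib/libgomp-ab12cd34.so.1
#     -> "libtorch/lib/libgomp-ab12cd34.so.1,sha256=<urlsafe base64 digest>,<file size>"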
echo 'Built this package:'
(
set -x
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
rm -rf /tmp/$LIBTORCH_HOUSE_DIR
)
TMP_DIR=$(mktemp -d)
trap "rm -rf ${TMP_DIR}" EXIT
pushd "${TMP_DIR}"
for pkg in /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do
# if the glob didn't match anything
if [[ ! -e $pkg ]]; then
continue
fi
rm -rf tmp
mkdir -p tmp
cd tmp
cp $pkg .
unzip -q $(basename $pkg)
rm -f $(basename $pkg)
PREFIX=libtorch
if [[ $pkg != *"without-deps"* ]]; then
# copy the needed dependent .so files over and tag them with their hash
patched=()
for filepath in "${DEPS_LIST[@]}"; do
filename=$(basename $filepath)
destpath=$PREFIX/lib/$filename
if [[ "$filepath" != "$destpath" ]]; then
cp $filepath $destpath
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
else
patchedpath=$(fname_with_sha256 $destpath)
fi
patchedname=$(basename $patchedpath)
if [[ "$destpath" != "$patchedpath" ]]; then
mv $destpath $patchedpath
fi
patched+=("$patchedname")
echo "Copied $filepath to $patchedpath"
done
echo "patching to fix the so names to the hashed names"
for ((i=0;i<${#DEPS_LIST[@]};++i)); do
find $PREFIX -name '*.so*' | while read sofile; do
origname=${DEPS_SONAME[i]}
patchedname=${patched[i]}
if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
set +e
origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "patching $sofile entry $origname to $patchedname"
$PATCHELF_BIN --replace-needed $origname $patchedname $sofile
fi
fi
done
done
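# Example of what the loop above does (hypothetical hash): if a library has a
# "DT_NEEDED libgomp.so.1" entry and the bundled copy was renamed to
# libgomp-ab12cd34.so.1, then
#   $PATCHELF_BIN --replace-needed libgomp.so.1 libgomp-ab12cd34.so.1 <sofile>
# rewrites the entry so the dynamic loader picks up the bundled copy.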
# copy over needed auxiliary files
for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do
srcpath=${DEPS_AUX_SRCLIST[i]}
dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}
mkdir -p $(dirname $dstpath)
cp $srcpath $dstpath
done
fi
# set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib
find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to " '$ORIGIN:$ORIGIN/lib'
$PATCHELF_BIN --set-rpath '$ORIGIN:$ORIGIN/lib' $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# set RPATH of lib/ files to $ORIGIN
find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to " '$ORIGIN'
$PATCHELF_BIN --set-rpath '$ORIGIN' $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# regenerate the RECORD file with new hashes
record_file=`echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g'`
if [[ -e $record_file ]]; then
echo "Generating new record file $record_file"
rm -f $record_file
# generate RECORD entries for every file in the package
find * -type f | while read fname; do
echo $(make_wheel_record $fname) >>$record_file
done
fi
# zip up the wheel back
zip -rq $(basename $pkg) $PREFIX*
# replace original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
cd ..
rm -rf tmp
done
# Copy wheels to host machine for persistence before testing
if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
cp /$LIBTORCH_HOUSE_DIR/debug-libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
fi

.ci/manywheel/build_rocm.sh Executable file
View File

@ -0,0 +1,263 @@
#!/usr/bin/env bash
set -ex
export ROCM_HOME=/opt/rocm
export MAGMA_HOME=$ROCM_HOME/magma
# TODO: libtorch_cpu.so is broken when building with Debug info
export BUILD_DEBUG_INFO=0
# TODO Are these all used/needed?
export TH_BINARY_BUILD=1
export USE_STATIC_CUDNN=1
export USE_STATIC_NCCL=1
export ATEN_STATIC_CUDA=1
export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # don't install test binaries into site-packages
# Set RPATH instead of RUNPATH when using patchelf to avoid LD_LIBRARY_PATH override
export FORCE_RPATH="--force-rpath"
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Determine ROCm version and architectures to build for
#
# NOTE: We should first check `DESIRED_CUDA` when determining `ROCM_VERSION`
if [[ -n "$DESIRED_CUDA" ]]; then
if ! echo "${DESIRED_CUDA}"| grep "^rocm" >/dev/null 2>/dev/null; then
export DESIRED_CUDA="rocm${DESIRED_CUDA}"
fi
# rocm3.7, rocm3.5.1
ROCM_VERSION="$DESIRED_CUDA"
echo "Using $ROCM_VERSION as determined by DESIRED_CUDA"
else
echo "Must set DESIRED_CUDA"
exit 1
fi
# Package directories
WHEELHOUSE_DIR="wheelhouse$ROCM_VERSION"
LIBTORCH_HOUSE_DIR="libtorch_house$ROCM_VERSION"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$ROCM_VERSION"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$ROCM_VERSION"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
# To make version comparison easier, create an integer representation.
ROCM_VERSION_CLEAN=$(echo ${ROCM_VERSION} | sed s/rocm//)
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION_CLEAN})
IFS="$save_IFS"
if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=0
elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
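# Worked example: ROCM_VERSION=rocm6.2.1 -> major 6, minor 2, patch 1
# -> ROCM_INT = 6*10000 + 2*100 + 1 = 60201, so the version gates below
# (e.g. ">= 60100") reduce to plain integer comparisons.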
# Required ROCm libraries
ROCM_SO_FILES=(
"libMIOpen.so"
"libamdhip64.so"
"libhipblas.so"
"libhipfft.so"
"libhiprand.so"
"libhipsolver.so"
"libhipsparse.so"
"libhsa-runtime64.so"
"libamd_comgr.so"
"libmagma.so"
"librccl.so"
"librocblas.so"
"librocfft.so"
"librocm_smi64.so"
"librocrand.so"
"librocsolver.so"
"librocsparse.so"
"libroctracer64.so"
"libroctx64.so"
"libhipblaslt.so"
"libhiprtc.so"
)
if [[ $ROCM_INT -ge 60100 ]]; then
ROCM_SO_FILES+=("librocprofiler-register.so")
fi
if [[ $ROCM_INT -ge 60200 ]]; then
ROCM_SO_FILES+=("librocm-core.so")
fi
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
LIBNUMA_PATH="/usr/lib64/libnuma.so.1"
LIBELF_PATH="/usr/lib64/libelf.so.1"
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
LIBDRM_PATH="/opt/amdgpu/lib64/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/opt/amdgpu/lib64/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBSUITESPARSE_CONFIG_PATH="/lib64/libsuitesparseconfig.so.4"
LIBCHOLMOD_PATH="/lib64/libcholmod.so.2"
# Below libs are direct dependencies of libcholmod
LIBAMD_PATH="/lib64/libamd.so.2"
LIBCAMD_PATH="/lib64/libcamd.so.2"
LIBCCOLAMD_PATH="/lib64/libccolamd.so.2"
LIBCOLAMD_PATH="/lib64/libcolamd.so.2"
LIBSATLAS_PATH="/lib64/atlas/libsatlas.so.3"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.3"
LIBQUADMATH_PATH="/lib64/libquadmath.so.0"
fi
MAYBE_LIB64=lib64
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
LIBNUMA_PATH="/usr/lib/x86_64-linux-gnu/libnuma.so.1"
LIBELF_PATH="/usr/lib/x86_64-linux-gnu/libelf.so.1"
if [[ $ROCM_INT -ge 50300 ]]; then
LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.6"
else
LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.5"
fi
LIBDRM_PATH="/usr/lib/x86_64-linux-gnu/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/usr/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBCHOLMOD_PATH="/lib/x86_64-linux-gnu/libcholmod.so.3"
# Below libs are direct dependencies of libcholmod
LIBSUITESPARSE_CONFIG_PATH="/lib/x86_64-linux-gnu/libsuitesparseconfig.so.5"
LIBAMD_PATH="/lib/x86_64-linux-gnu/libamd.so.2"
LIBCAMD_PATH="/lib/x86_64-linux-gnu/libcamd.so.2"
LIBCCOLAMD_PATH="/lib/x86_64-linux-gnu/libccolamd.so.2"
LIBCOLAMD_PATH="/lib/x86_64-linux-gnu/libcolamd.so.2"
LIBMETIS_PATH="/lib/x86_64-linux-gnu/libmetis.so.5"
LIBLAPACK_PATH="/lib/x86_64-linux-gnu/liblapack.so.3"
LIBBLAS_PATH="/lib/x86_64-linux-gnu/libblas.so.3"
# Below libs are direct dependencies of libblas
LIBGFORTRAN_PATH="/lib/x86_64-linux-gnu/libgfortran.so.5"
LIBQUADMATH_PATH="/lib/x86_64-linux-gnu/libquadmath.so.0"
fi
MAYBE_LIB64=lib
fi
OS_SO_PATHS=($LIBGOMP_PATH $LIBNUMA_PATH\
$LIBELF_PATH $LIBTINFO_PATH\
$LIBDRM_PATH $LIBDRM_AMDGPU_PATH\
$LIBSUITESPARSE_CONFIG_PATH\
$LIBCHOLMOD_PATH $LIBAMD_PATH\
$LIBCAMD_PATH $LIBCCOLAMD_PATH\
$LIBCOLAMD_PATH $LIBSATLAS_PATH\
$LIBGFORTRAN_PATH $LIBQUADMATH_PATH\
$LIBMETIS_PATH $LIBLAPACK_PATH\
$LIBBLAS_PATH)
OS_SO_FILES=()
for lib in "${OS_SO_PATHS[@]}"
do
file_name="${lib##*/}" # Strip the directory path, keeping only the filename
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
# PyTorch-version specific
# AOTriton dependency only for PyTorch >= 2.4
if (( $(echo "${PYTORCH_VERSION} 2.4" | awk '{print ($1 >= $2)}') )); then
ROCM_SO_FILES+=("libaotriton_v2.so")
fi
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace the ';'-separated arch list with '|' so grep -E can match any arch
ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
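# For illustration (hypothetical arch list): PYTORCH_ROCM_ARCH="gfx90a;gfx942"
# yields the grep -E pattern "gfx90a|gfx942", so ROCBLAS_LIB_FILES contains the
# gfx90a/gfx942 kernel files plus the non-gfx (generic) files.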
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library
HIPBLASLT_LIB_DST=lib/hipblaslt/library
ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
# ROCm library files
ROCM_SO_PATHS=()
for lib in "${ROCM_SO_FILES[@]}"
do
file_path=($(find $ROCM_HOME/lib/ -name "$lib")) # First search in lib
if [[ -z $file_path ]]; then
if [ -d "$ROCM_HOME/lib64/" ]; then
file_path=($(find $ROCM_HOME/lib64/ -name "$lib")) # Then search in lib64
fi
fi
if [[ -z $file_path ]]; then
file_path=($(find $ROCM_HOME/ -name "$lib")) # Then search in ROCM_HOME
fi
if [[ -z $file_path ]]; then
echo "Error: Library file $lib is not found." >&2
exit 1
fi
ROCM_SO_PATHS[${#ROCM_SO_PATHS[@]}]="$file_path" # Append lib to array
done
DEPS_LIST=(
${ROCM_SO_PATHS[*]}
${OS_SO_PATHS[*]}
)
DEPS_SONAME=(
${ROCM_SO_FILES[*]}
${OS_SO_FILES[*]}
)
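# Note: DEPS_LIST[i] is the absolute source path and DEPS_SONAME[i] the matching
# SONAME used by the patchelf step in the sourced build script, so the two
# arrays must stay index-aligned.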
DEPS_AUX_SRCLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_SRC/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_SRC/}"
"/opt/amdgpu/share/libdrm/amdgpu.ids"
)
DEPS_AUX_DSTLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_DST/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_DST/}"
"share/libdrm/amdgpu.ids"
)
# MIOpen library files
MIOPEN_SHARE_SRC=$ROCM_HOME/share/miopen/db
MIOPEN_SHARE_DST=share/miopen/db
MIOPEN_SHARE_FILES=($(ls $MIOPEN_SHARE_SRC | grep -E $ARCH))
DEPS_AUX_SRCLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_DST/})
# RCCL library files
RCCL_SHARE_SRC=$ROCM_HOME/share/rccl/msccl-algorithms
RCCL_SHARE_DST=share/rccl/msccl-algorithms
RCCL_SHARE_FILES=($(ls $RCCL_SHARE_SRC))
DEPS_AUX_SRCLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_DST/})
echo "PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH}"
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source $SCRIPTPATH/${BUILD_SCRIPT}

.ci/manywheel/test_wheel.sh Executable file
View File

@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -e
yum install -y wget git
rm -rf /usr/local/cuda*
# Install Miniconda
if ! ls /py
then
echo "Miniconda needs to be installed"
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p /py
else
echo "Miniconda is already installed"
fi
export PATH="/py/bin:$PATH"
# Anaconda token
if ls /remote/token
then
source /remote/token
fi
conda install -y conda-build anaconda-client

View File

@ -178,7 +178,7 @@ fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]] || [[ "$BUILD_ENVIRONMENT" == *gcc7* ]]; } && which sccache > /dev/null; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; } && which sccache > /dev/null; then
export MAX_JOBS=$(($(nproc) - 1))
fi
fi
@ -203,10 +203,12 @@ if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export LDSHARED="clang --shared"
export USE_CUDA=0
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export USE_CUDA=1
fi
export USE_ASAN=1
export UBSAN_FLAGS="-fno-sanitize-recover=all;-fno-sanitize=float-divide-by-zero;-fno-sanitize=float-cast-overflow"
export REL_WITH_DEB_INFO=1
export UBSAN_FLAGS="-fno-sanitize-recover=all"
unset USE_LLVM
fi
@ -218,10 +220,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
export USE_PRECOMPILED_HEADERS=1
fi
if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then
export USE_GLOO_WITH_OPENSSL=ON
fi
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
@ -278,7 +276,6 @@ else
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *s390x* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X

View File

@ -191,9 +191,22 @@ function install_torchrec_and_fbgemm() {
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it
# seems to be an sccache-related issue
if [[ "$IS_A100_RUNNER" == "1" ]]; then
unset CMAKE_CUDA_COMPILER_LAUNCHER
sudo mv /opt/cache/bin /opt/cache/bin-backup
fi
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
if [[ "$IS_A100_RUNNER" == "1" ]]; then
export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
sudo mv /opt/cache/bin-backup /opt/cache/bin
fi
}
function clone_pytorch_xla() {

View File

@ -45,8 +45,7 @@ def create_cert(path, C, ST, L, O, key):
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc)
+ timedelta(days=10)
datetime.now(timezone.utc) + timedelta(days=10)
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
@ -91,8 +90,7 @@ def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc)
+ timedelta(days=10)
datetime.now(timezone.utc) + timedelta(days=10)
# Sign our certificate with our private key
)
.sign(private_ca_key, hashes.SHA256())

View File

@ -49,16 +49,16 @@ NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"
export VALGRIND=ON
# export TORCH_INDUCTOR_INSTALL_GXX=ON
if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
# clang9 appears to miscompile code involving c10::optional<c10::SymInt>,
# clang9 appears to miscompile code involving std::optional<c10::SymInt>,
# such that valgrind complains along these lines:
#
# Conditional jump or move depends on uninitialised value(s)
# at 0x40303A: ~optional_base (Optional.h:281)
# by 0x40303A: call (Dispatcher.h:448)
# by 0x40303A: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:10)
# by 0x40303A: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::SymInt>) (basic.cpp:10)
# by 0x403700: main (basic.cpp:16)
# Uninitialised value was created by a stack allocation
# at 0x402AAA: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, c10::optional<c10::SymInt>) (basic.cpp:6)
# at 0x402AAA: call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::optional<c10::SymInt>) (basic.cpp:6)
#
# The problem does not appear with gcc or newer versions of clang (we tested
# clang14). So we suppress valgrind testing for clang9 specifically.
@ -72,7 +72,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
#
# using namespace at;
#
# Tensor call(const at::Tensor & self, c10::SymIntArrayRef size, c10::SymIntArrayRef stride, c10::optional<c10::SymInt> storage_offset) {
# Tensor call(const at::Tensor & self, c10::SymIntArrayRef size, c10::SymIntArrayRef stride, std::optional<c10::SymInt> storage_offset) {
# auto op = c10::Dispatcher::singleton()
# .findSchemaOrThrow(at::_ops::as_strided::name, at::_ops::as_strided::overload_name)
# .typed<at::_ops::as_strided::schema>();
@ -81,7 +81,7 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* ]]; then
#
# int main(int argv) {
# Tensor b = empty({3, 4});
# auto z = call(b, b.sym_sizes(), b.sym_strides(), c10::nullopt);
# auto z = call(b, b.sym_sizes(), b.sym_strides(), std::nullopt);
# }
export VALGRIND=OFF
fi
@ -196,6 +196,9 @@ install_tlparse
# ASAN test is not working
if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
export ASAN_OPTIONS=detect_leaks=0:symbolize=1:detect_stack_use_after_return=true:strict_init_order=true:detect_odr_violation=1:detect_container_overflow=0:check_initialization_order=true:debug=true
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export ASAN_OPTIONS="${ASAN_OPTIONS}:protect_shadow_gap=0"
fi
export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp
export PYTORCH_TEST_WITH_ASAN=1
export PYTORCH_TEST_WITH_UBSAN=1
@ -233,8 +236,8 @@ if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
# it depends on a ton of dynamic libraries that most programs aren't gonna
# have, and it applies to child processes.
# TODO: get rid of the hardcoded path
export LD_PRELOAD=/usr/lib/llvm-15/lib/clang/15.0.7/lib/linux/libclang_rt.asan-x86_64.so
LD_PRELOAD=$(clang --print-file-name=libclang_rt.asan-x86_64.so)
export LD_PRELOAD
# Disable valgrind for asan
export VALGRIND=OFF
@ -281,7 +284,7 @@ test_python_shard() {
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
@ -307,7 +310,8 @@ test_dynamo_shard() {
--exclude-distributed-tests \
--exclude-torch-export-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
--verbose \
--upload-artifacts-while-running
assert_git_not_dirty
}
@ -320,6 +324,7 @@ test_inductor_distributed() {
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose
python test/run_test.py -i distributed/_composable/test_replicate_with_compiler.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
@ -331,11 +336,12 @@ test_inductor_distributed() {
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_compile.py --verbose
python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose
# this runs on both single-gpu and multi-gpu instances. It should be smart about skipping tests that aren't supported
# when the required number of GPUs isn't available
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives --verbose
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives distributed/test_compute_comm_reordering --verbose
assert_git_not_dirty
}
@ -369,21 +375,27 @@ test_inductor_aoti() {
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
}
test_inductor_cpp_wrapper_abi_compatible() {
export TORCHINDUCTOR_ABI_COMPATIBLE=1
test_inductor_cpp_wrapper() {
export TORCHINDUCTOR_CPP_WRAPPER=1
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper inductor/test_cpu_repro
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -403,7 +415,7 @@ pr_time_benchmarks() {
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks python benchmarks/dynamo/pr_time_benchmarks/check_results.py "benchmarks/dynamo/pr_time_benchmarks/expected_results.csv" "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "$TEST_REPORTS_DIR/new_expected_results.csv"
}
if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then
@ -511,7 +523,7 @@ test_perf_for_dashboard() {
"${target_flag[@]}" --"$mode" --"$dtype" --export --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_export_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
@ -566,13 +578,6 @@ test_single_dynamo_benchmark() {
test_perf_for_dashboard "$suite" \
"${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"
else
if [[ "${TEST_CONFIG}" == *aot_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then
# Test AOTInductor with the ABI-compatible mode on CI
# This can be removed once the ABI-compatible mode becomes default.
# For CPU device, we perfer non ABI-compatible mode on CI when testing AOTInductor.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
if [[ "${TEST_CONFIG}" == *_avx2* ]]; then
TEST_CONFIG=${TEST_CONFIG//_avx2/}
fi
@ -606,6 +611,11 @@ test_inductor_halide() {
assert_git_not_dirty
}
test_inductor_triton_cpu() {
python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose
assert_git_not_dirty
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -643,32 +653,12 @@ test_inductor_torchbench_smoketest_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
# Test some models in the cpp wrapper mode
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \
@ -712,6 +702,10 @@ test_inductor_set_cpu_affinity(){
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
# Set number of cores to 16 on Aarch64 for performance runs.
if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then
cores=16
fi
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
@ -748,19 +742,9 @@ test_inductor_torchbench_cpu_smoketest_perf(){
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
# Allow 1% variance for CPU perf to accommodate perf fluctuation
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target" -s 0.99
done
# Add a few ABI-compatible accuracy tests for CPU. These can be removed once we turn on ABI-compatible as default.
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only adv_inception_v3 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only beit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/aot_inductor_timm_inference.csv"
}
test_torchbench_gcp_smoketest(){
@ -1371,7 +1355,7 @@ test_executorch() {
echo "Run ExecuTorch regression tests for some models"
# TODO(huydhn): Add more coverage here using ExecuTorch's gather models script
# shellcheck disable=SC1091
source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''
source .ci/scripts/test_model.sh mv3 cmake xnnpack-quantization-delegation ''
popd
@ -1435,6 +1419,8 @@ elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-triton-cpu* ]]; then
test_inductor_triton_cpu
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
@ -1451,14 +1437,13 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
else
install_torchaudio cuda
fi
install_torchtext
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
@ -1477,9 +1462,11 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchaudio cuda
install_torchvision
test_inductor_cpp_wrapper_abi_compatible
checkout_install_torchbench hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"

View File

@ -26,7 +26,7 @@ fi
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers
set +ex
grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=eval_frame.c torch/
grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=pythoncapi_compat.h --exclude=eval_frame.c torch/
PYLONG_API_CHECK=$?
if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"

View File

@ -46,6 +46,9 @@ python -m pip install tlparse==0.3.25
# Install parameterized
python -m pip install parameterized==0.8.1
# Install pulp for testing ilps under torch\distributed\_tools
python -m pip install pulp==2.9.0
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

View File

@ -27,12 +27,11 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
source activate testenv >/dev/null
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
python_path="/opt/python/cp\$python_nodot-cp\${python_nodot}"
# Prior to Python 3.8 paths were suffixed with an 'm'
if [[ -d "\${python_path}/bin" ]]; then
export PATH="\${python_path}/bin:\$PATH"
elif [[ -d "\${python_path}m/bin" ]]; then
export PATH="\${python_path}m/bin:\$PATH"
if [[ "\$python_nodot" = *t ]]; then
python_digits="\$(echo $DESIRED_PYTHON | tr -cd [:digit:])"
python_path="/opt/python/cp\$python_digits-cp\${python_digits}t"
fi
export PATH="\${python_path}/bin:\$PATH"
fi
EXTRA_CONDA_FLAGS=""

View File

@ -114,6 +114,12 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B
fi
fi
USE_GLOO_WITH_OPENSSL="ON"
if [[ "$GPU_ARCH_TYPE" =~ .*aarch64.* ]]; then
USE_GLOO_WITH_OPENSSL="OFF"
USE_GOLD_LINKER="OFF"
fi
cat >"$envfile" <<EOL
# =================== The following code will be executed inside Docker container ===================
export TZ=UTC
@ -153,7 +159,7 @@ export DOCKER_IMAGE="$DOCKER_IMAGE"
export USE_GOLD_LINKER="${USE_GOLD_LINKER}"
export USE_GLOO_WITH_OPENSSL="ON"
export USE_GLOO_WITH_OPENSSL="${USE_GLOO_WITH_OPENSSL}"
# =================== The above code will be executed inside Docker container ===================
EOL

View File

@ -44,7 +44,9 @@ ContinuationIndentWidth: 4
Cpp11BracedListStyle: true
DerivePointerAlignment: false
DisableFormat: false
ForEachMacros: [ FOR_EACH_RANGE, FOR_EACH, ]
ForEachMacros:
- FOR_EACH_RANGE
- FOR_EACH
IncludeCategories:
- Regex: '^<.*\.h(pp)?>'
Priority: 1
@ -58,6 +60,24 @@ IndentWrappedFunctionNames: false
KeepEmptyLinesAtTheStartOfBlocks: false
MacroBlockBegin: ''
MacroBlockEnd: ''
Macros:
- >-
PyObject_HEAD_INIT(type)={
/* this does not exactly match PyObject_HEAD_INIT in the Python source code,
* but it is close enough for clang-format */
{ 0xFFFFFFFF },
(type)
},
- >-
PyVarObject_HEAD_INIT(type, size)={
{
/* manually expand PyObject_HEAD_INIT(type) above
* because clang-format does not support recursive expansion */
{ 0xFFFFFFFF },
(type)
},
(size)
},
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
PenaltyBreakBeforeFirstCallParameter: 1
@ -79,7 +99,11 @@ SpacesInContainerLiterals: true
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp11
Standard: c++17
StatementMacros:
- PyObject_HEAD
- PyObject_VAR_HEAD
- PyException_HEAD
TabWidth: 8
UseTab: Never
---

View File

@ -1,38 +0,0 @@
If you have a question or would like help and support, please ask at our
[forums](https://discuss.pytorch.org/).
If you are submitting a feature request, please preface the title with [feature request].
If you are submitting a bug report, please fill in the following details.
## Issue description
Provide a short description.
## Code example
Please try to provide a minimal example to repro the bug.
Error messages and stack traces are also helpful.
## System Info
Please copy and paste the output from our
[environment collection script](https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py)
(or fill out the checklist below manually).
You can get the script and run it with:
```
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
- PyTorch or Caffe2:
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- OS:
- PyTorch version:
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- GCC version (if compiling from source):
- CMake version:
- Versions of any other relevant libraries:

View File

@ -5,7 +5,8 @@ about: Tracking incidents for PyTorch's CI infra.
> NOTE: Remember to label this issue with "`ci: sev`"
**MERGE BLOCKING** <!-- remove this line if you don't want this SEV to block merges -->
<!-- uncomment the below line if you want this SEV to block merges -->
<!-- **MERGE BLOCKING** -->
## Current Status
*Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (i.e. rebase)*.

View File

@ -18,8 +18,14 @@ inputs:
runs:
using: composite
steps:
- name: Check if in a container runner
shell: bash
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Clean workspace
shell: bash
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
env:
NO_SUDO: ${{ inputs.no-sudo }}
run: |

View File

@ -85,15 +85,25 @@ runs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Lock NVIDIA A100 40GB Frequency
shell: bash
@ -101,7 +111,7 @@ runs:
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Start monitoring script
id: monitor-script
@ -172,6 +182,7 @@ runs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
@ -181,6 +192,9 @@ runs:
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
shell: bash
run: |
set -x
@ -199,6 +213,7 @@ runs:
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
@ -227,6 +242,7 @@ runs:
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e SCCACHE_REGION \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
@ -234,7 +250,9 @@ runs:
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -305,7 +323,7 @@ runs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does

View File

@ -26,7 +26,7 @@ runs:
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install boto3==1.19.12
python3 -m pip install boto3==1.35.42
- name: Download the cache
shell: bash

View File

@ -33,7 +33,7 @@ runs:
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install boto3==1.19.12
python3 -m pip install boto3==1.35.42
- name: Upload the cache
shell: bash

View File

@ -20,7 +20,7 @@ runs:
elif [[ $runner_name_str == *"gcp"* ]]; then
echo "Runner is from Google Cloud Platform, No info on ec2 metadata"
else
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
fi
}
echo "ami-id: $(get_ec2_metadata ami-id)"
@ -28,14 +28,14 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Start docker if docker daemon is not running
shell: bash
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";
@ -73,7 +73,7 @@ runs:
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Kill any existing containers, clean up images
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
shell: bash
run: |
# ignore expansion of "docker ps -q" since it could be empty
@ -116,7 +116,7 @@ runs:
- name: Check that the docker daemon is running
shell: bash
continue-on-error: true
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'true' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
run: |
set +x

View File

@ -18,7 +18,7 @@ runs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"

View File

@ -28,7 +28,7 @@ runs:
run: |
# Remove any previous test jsons if they exist
rm -f test-jsons-*.zip
zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json'
zip -r "test-jsons-${FILE_SUFFIX}.zip" test/test-reports -i '*.json'
- name: Zip test reports for upload
if: runner.os != 'Windows' && !inputs.use-gha
@ -38,7 +38,7 @@ runs:
run: |
# Remove any previous test reports if they exist
rm -f test-reports-*.zip
zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' -i '*.csv'
zip -r "test-reports-${FILE_SUFFIX}.zip" test/test-reports -i '*.xml' -i '*.csv'
- name: Zip usage log for upload
if: runner.os != 'Windows' && !inputs.use-gha
@ -53,8 +53,8 @@ runs:
if [ -f 'usage_log.txt' ]; then
zip "logs-${FILE_SUFFIX}.zip" 'usage_log.txt'
fi
if ls test/**/*.log 1> /dev/null 2>&1; then
zip -r "logs-${FILE_SUFFIX}.zip" test -i '*.log'
if find "test/test-reports" -name "*.log" 2>/dev/null | grep -q .; then
zip -r "logs-${FILE_SUFFIX}.zip" test/test-reports -i '*.log'
fi
- name: Zip debugging artifacts for upload
@ -77,7 +77,7 @@ runs:
FILE_SUFFIX: ${{ inputs.file-suffix }}
run: |
# -ir => recursive include all files in pattern
7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\*.json'
7z a "test-jsons-$Env:FILE_SUFFIX.zip" -ir'!test\test-reports\*.json'
- name: Zip test reports for upload
if: runner.os == 'Windows' && !inputs.use-gha
@ -86,7 +86,7 @@ runs:
FILE_SUFFIX: ${{ inputs.file-suffix }}
run: |
# -ir => recursive include all files in pattern
7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\*.xml' -ir'!test\*.csv'
7z a "test-reports-$Env:FILE_SUFFIX.zip" -ir'!test\test-reports\*.xml' -ir'!test\test-reports\*.csv'
- name: Zip usage log for upload
if: runner.os == 'Windows' && !inputs.use-gha
@ -96,7 +96,7 @@ runs:
FILE_SUFFIX: ${{ inputs.file-suffix }}
run: |
# -ir => recursive include all files in pattern
7z a "logs-$Env:FILE_SUFFIX.zip" 'usage_log.txt' -ir'!test\*.log'
7z a "logs-$Env:FILE_SUFFIX.zip" 'usage_log.txt' -ir'!test\test-reports\*.log'
# S3 upload
- name: Store Test Downloaded JSONs on S3

View File

@ -1 +1 @@
ba696ea3dfec4cbe693bf06a84c75dc196077f5b
79047bf6bdec9e32c4cffd0f9835b347781fefbf

View File

@ -1 +1 @@
23512dbebd44a11eb84afbf53c3c071dd105297e
e522b45cd4535b9dfe067aa68d7315755df38f48

.github/labeler.yml vendored
View File

@ -98,3 +98,9 @@
"module: distributed_checkpoint":
- torch/distributed/checkpoint/**
- test/distributed/checkpoint/**
"module: compiled autograd":
- torch/csrc/dynamo/python_compiled_autograd.cpp
- torch/csrc/dynamo/compiled_autograd.h
- torch/_dynamo/compiled_autograd.py
- torch/inductor/test_compiled_autograd.py

View File

@ -1,251 +0,0 @@
# This file is generated by .github/scripts/validate_scale_config.py in test-infra
# It defines runner types that will be provisioned by LF Self-hosted runners
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
#
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstances calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]``
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
# disk_size: 50
# is_ephemeral: true
runner_types:
lf.c.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
variants:
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.c.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 500
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: true
max_available: 100
os: windows
lf.c.windows.g4dn.xlarge.nonephemeral:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: false
max_available: 100
os: windows
lf.c.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: true
max_available: 420
os: windows
lf.c.windows.4xlarge.nonephemeral:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: false
max_available: 420
os: windows
lf.c.windows.8xlarge.nvidia.gpu:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 300
os: windows
lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: false
max_available: 150
os: windows
lf.c.windows.g5.4xlarge.nvidia.gpu:
disk_size: 256
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 250
os: windows

View File

@ -1,251 +0,0 @@
# This file is generated by .github/scripts/validate_scale_config.py in test-infra
# It defines runner types that will be provisioned by LF Self-hosted runners
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
#
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the number of CreateInstances calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]``
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
# disk_size: 50
# is_ephemeral: true
runner_types:
lf.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
variants:
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 500
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: true
max_available: 100
os: windows
lf.windows.g4dn.xlarge.nonephemeral:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: false
max_available: 100
os: windows
lf.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: true
max_available: 420
os: windows
lf.windows.4xlarge.nonephemeral:
disk_size: 256
instance_type: c5d.4xlarge
is_ephemeral: false
max_available: 420
os: windows
lf.windows.8xlarge.nvidia.gpu:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 300
os: windows
lf.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: false
max_available: 150
os: windows
lf.windows.g5.4xlarge.nvidia.gpu:
disk_size: 256
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 250
os: windows

View File

@ -16,11 +16,13 @@ ciflow_push_tags:
- ciflow/nightly
- ciflow/periodic
- ciflow/rocm
- ciflow/s390
- ciflow/slow
- ciflow/trunk
- ciflow/unstable
- ciflow/xpu
- ciflow/torchbench
- ciflow/autoformat
retryable_workflows:
- pull
- trunk

View File

@ -4,7 +4,7 @@
# docs/cpp/requirements.txt
# functorch/docs/requirements.txt
# .ci/docker/requirements-ci.txt
boto3==1.19.12
boto3==1.35.42
jinja2==3.1.4
lintrunner==0.10.7
ninja==1.10.0.post1

View File

@ -1,4 +1,4 @@
# iOS simulator requirements
coremltools==5.0b5
protobuf==3.20.2
optree==0.12.1
optree==0.13.0

View File

@ -1,4 +1,4 @@
boto3==1.19.12
boto3==1.35.42
hypothesis==6.56.4
expecttest==0.2.1
fbscribelogger==0.1.6
@ -27,7 +27,7 @@ pytest-cpp==2.3.0
rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.12.1
optree==0.13.0
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2

View File

@ -77,6 +77,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -333,7 +334,7 @@ def generate_wheels_matrix(
package_type = "manywheel"
if python_versions is None:
python_versions = FULL_PYTHON_VERSIONS + ["3.13"]
python_versions = FULL_PYTHON_VERSIONS + ["3.13", "3.13t"]
if arches is None:
# Define default compute architectures
@ -368,8 +369,15 @@ def generate_wheels_matrix(
# TODO: Enable python 3.13 on rocm, aarch64, windows
if (
gpu_arch_type == "rocm" or (os != "linux" and os != "linux-s390x")
) and python_version == "3.13":
gpu_arch_type == "rocm"
or os not in ["linux", "linux-s390x", "macos-arm64"]
) and python_version in ["3.13", "3.13t"]:
continue
# TODO: Enable python 3.13t on xpu and cpu-s390x or MacOS
if (
gpu_arch_type in ["xpu", "cpu-s390x"] or os == "macos-arm64"
) and python_version == "3.13t":
continue
if use_split_build and (
@ -403,7 +411,7 @@ def generate_wheels_matrix(
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version] # fmt: skip
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]
if os != "linux-aarch64"
else ""
),
@ -451,7 +459,7 @@ def generate_wheels_matrix(
".", "_"
),
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"] # fmt: skip
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.4"]
if os != "linux" and gpu_arch_type != "xpu"
else ""
),
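
The Python 3.13/3.13t skip conditions added earlier in this file can be read as a single predicate over a wheel configuration. A minimal sketch follows (the standalone function and its name are illustrative only; in the real script this logic lives inline in generate_wheels_matrix):

def skip_python_313_config(gpu_arch_type: str, os: str, python_version: str) -> bool:
    # CPython 3.13 / 3.13t wheels are not yet built for rocm, nor for OSes
    # other than linux, linux-s390x and macos-arm64
    if (
        gpu_arch_type == "rocm" or os not in ["linux", "linux-s390x", "macos-arm64"]
    ) and python_version in ["3.13", "3.13t"]:
        return True
    # The free-threaded 3.13t build is additionally skipped on xpu, cpu-s390x and macOS
    if (
        gpu_arch_type in ["xpu", "cpu-s390x"] or os == "macos-arm64"
    ) and python_version == "3.13t":
        return True
    return False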

View File

@ -114,20 +114,21 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
use_split_build=True,
arches=["11.8", "12.1", "12.4", "cpu"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
use_split_build=True,
),
# See https://github.com/pytorch/pytorch/issues/138750
# BinaryBuildWorkflow(
# os=OperatingSystem.LINUX,
# package_type="manywheel",
# build_configs=generate_binary_build_matrix.generate_wheels_matrix(
# OperatingSystem.LINUX,
# use_split_build=True,
# arches=["11.8", "12.1", "12.4", "cpu"],
# ),
# ciflow_config=CIFlowConfig(
# labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
# isolated_workflow=True,
# ),
# use_split_build=True,
# ),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="conda",
@ -180,21 +181,22 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
),
branches="main",
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1", "12.4"],
python_versions=["3.9"],
use_split_build=True,
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_PERIODIC},
),
branches="main",
use_split_build=True,
),
# See https://github.com/pytorch/pytorch/issues/138750
# BinaryBuildWorkflow(
# os=OperatingSystem.LINUX,
# package_type="manywheel",
# build_configs=generate_binary_build_matrix.generate_wheels_matrix(
# OperatingSystem.LINUX,
# arches=["11.8", "12.1", "12.4"],
# python_versions=["3.9"],
# use_split_build=True,
# ),
# ciflow_config=CIFlowConfig(
# labels={LABEL_CIFLOW_PERIODIC},
# ),
# branches="main",
# use_split_build=True,
# ),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="libtorch",

View File

@ -41,7 +41,8 @@ RC=0
if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
echo ""
echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"
echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"
echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions. To apply suggested patches automatically, use the -a flag. Before pushing another commit,\e[0m"
echo -e "\e[1m\e[36mplease verify locally and ensure everything passes.\e[0m"
RC=1
fi

View File

@ -1,5 +1,9 @@
# flake8: noqa: G004
# Note: Copies of this script in runner_determinator.py and _runner-determinator.yml
# must be kept in sync. You can do it easily by running the following command:
# python .github/scripts/update_runner_determinator.py
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
@ -35,7 +39,8 @@ Example config:
experiments:
lf:
rollout_percent: 25
all_branches: false
default: true
---
# Opt-ins:
@ -53,7 +58,7 @@ import os
import random
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Dict, Iterable, List, NamedTuple, Tuple
from typing import Any, Dict, FrozenSet, Iterable, List, NamedTuple, Tuple
import yaml
from github import Auth, Github
@ -79,6 +84,12 @@ class Experiment(NamedTuple):
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
all_branches: bool = (
False # If True, the experiment is also enabled on the exception branches
)
default: bool = (
True # If True, the experiment is enabled by default for all queries
)
# Add more fields as needed
@ -133,6 +144,12 @@ def set_github_output(key: str, value: str) -> None:
f.write(f"{key}={value}\n")
def _str_comma_separated_to_set(value: str) -> FrozenSet[str]:
return frozenset(
filter(lambda itm: itm != "", map(str.strip, value.strip(" \n\t").split(",")))
)
def parse_args() -> Any:
parser = ArgumentParser("Get dynamic rollout settings")
parser.add_argument("--github-token", type=str, required=True, help="GitHub token")
@ -167,6 +184,13 @@ def parse_args() -> Any:
required=True,
help="Current GitHub ref type, branch or tag",
)
parser.add_argument(
"--eligible-experiments",
type=_str_comma_separated_to_set,
required=False,
default="",
help="comma separated list of experiments to check, if omitted all experiments marked with default=True are checked",
)
return parser.parse_args()
@ -212,7 +236,7 @@ def get_potential_pr_author(
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
Branches that get opted out of experiments by default, until they're explicitly enabled.
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
@ -338,7 +362,11 @@ def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
rollout_state: str,
workflow_requestors: Iterable[str],
branch: str,
eligible_experiments: FrozenSet[str] = frozenset(),
is_canary: bool = False,
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
@ -346,7 +374,24 @@ def get_runner_prefix(
fleet_prefix = ""
prefixes = []
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if not experiment_settings.all_branches and is_exception_branch(branch):
log.info(
f"Branch {branch} is an exception branch. Not enabling experiment {experiment_name}."
)
continue
if eligible_experiments:
if experiment_name not in eligible_experiments:
exp_list = ", ".join(eligible_experiments)
log.info(
f"Skipping experiment '{experiment_name}', as it is not in the eligible_experiments list: {exp_list}"
)
continue
elif not experiment_settings.default:
log.info(
f"Skipping experiment '{experiment_name}', as it is not a default experiment"
)
continue
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
@ -355,11 +400,13 @@ def get_runner_prefix(
if is_user_opted_in(requestor, user_optins, experiment_name)
]
enabled = False
if opted_in_users:
log.info(
f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."
)
enabled = True
elif experiment_settings.rollout_perc:
# If no user is opted in, then we randomly enable the experiment based on the rollout percentage
if random.uniform(0, 100) <= experiment_settings.rollout_perc:
@ -407,35 +454,35 @@ def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
runner_label_prefix = DEFAULT_LABEL_PREFIX
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
is_canary = args.github_repo == "pytorch/pytorch-canary"
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
runner_label_prefix = get_runner_prefix(
rollout_state,
(args.github_issue_owner, username),
args.github_branch,
args.eligible_experiments,
is_canary,
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
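
Taken together, the new branch, default, and eligible-experiments gating can be exercised roughly as follows. This is a minimal sketch only: the settings text and user names are invented for illustration, and the expected prefixes mirror the unit tests in the next file.

import runner_determinator as rd

settings_text = """
experiments:
    lf:
        rollout_perc: 0
    otherExp:
        rollout_perc: 0
        default: false
---

Users:
@User2,lf,otherExp
"""

# Only default experiments are considered: otherExp is skipped (default: false)
rd.get_runner_prefix(settings_text, ["User2"], "somebranch")  # -> "lf."
# Explicitly listing eligible experiments overrides the default filter
rd.get_runner_prefix(settings_text, ["User2"], "somebranch", frozenset(["lf", "otherExp"]))  # -> "lf.otherExp."
# Exception branches such as "main" stay opted out unless all_branches is true
rd.get_runner_prefix(settings_text, ["User2"], "main")  # -> ""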

View File

@ -4,6 +4,10 @@ from unittest.mock import Mock, patch
import runner_determinator as rd
USER_BRANCH = "somebranch"
EXCEPTION_BRANCH = "main"
class TestRunnerDeterminatorIssueParser(TestCase):
def test_parse_settings(self) -> None:
settings_text = """
@ -12,6 +16,7 @@ class TestRunnerDeterminatorIssueParser(TestCase):
rollout_perc: 25
otherExp:
rollout_perc: 0
default: false
---
Users:
@ -28,7 +33,7 @@ class TestRunnerDeterminatorIssueParser(TestCase):
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
rd.Experiment(rollout_perc=0, default=False),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
@ -42,7 +47,7 @@ class TestRunnerDeterminatorIssueParser(TestCase):
rollout_perc: 25
otherExp:
rollout_perc: 0
default: false
```
---
@ -61,7 +66,41 @@ class TestRunnerDeterminatorIssueParser(TestCase):
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
rd.Experiment(rollout_perc=0, default=False),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_all_branches_setting(self) -> None:
settings_text = """
```
experiments:
lf:
rollout_perc: 25
all_branches: true
otherExp:
all_branches: True
rollout_perc: 0
```
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25, all_branches=True),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTrue(settings.experiments["otherExp"].all_branches)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0, all_branches=True),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
@ -119,7 +158,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")
def test_opted_in_user_two_experiments(self) -> None:
@ -136,9 +175,67 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")
def test_opted_in_user_two_experiments_default(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
default: false
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for User2")
def test_opted_in_user_two_experiments_default_exp(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
default: false
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(
settings_text, ["User2"], USER_BRANCH, frozenset(["lf", "otherExp"])
)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")
def test_opted_in_user_two_experiments_default_exp_2(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
default: false
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(
settings_text, ["User2"], USER_BRANCH, frozenset(["otherExp"])
)
self.assertEqual("otherExp.", prefix, "Runner prefix not correct for User2")
@patch("random.uniform", return_value=50)
def test_opted_out_user(self, mock_uniform: Mock) -> None:
settings_text = """
@ -154,7 +251,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User3"])
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
@ -174,9 +271,80 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
# User3 is opted out, but is pulled into both experiments by the 10% rollout
prefix = rd.get_runner_prefix(settings_text, ["User3"])
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
def test_opted_out_user_was_pulled_in_by_rollout_excl_nondefault(
self, mock_uniform: Mock
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
default: false
---
Users:
@User1,lf
@User2,lf,otherExp
"""
# User3 is opted out, but is pulled into default experiments by the 10% rollout
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
def test_opted_out_user_was_pulled_in_by_rollout_filter_exp(
self, mock_uniform: Mock
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
default: false
---
Users:
@User1,lf
@User2,lf,otherExp
"""
# User3 is opted out, but is pulled into default experiments by the 10% rollout
prefix = rd.get_runner_prefix(
settings_text, ["User3"], USER_BRANCH, frozenset(["otherExp"])
)
self.assertEqual("otherExp.", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=25)
def test_opted_out_user_was_pulled_out_by_rollout_filter_exp(
self, mock_uniform: Mock
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 10
otherExp:
rollout_perc: 50
default: false
---
Users:
@User1,lf
@User2,lf,otherExp
"""
# User3 is opted out, but is pulled into default experiments by the 10% rollout
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_lf_prefix_always_comes_first(self) -> None:
settings_text = """
experiments:
@ -192,7 +360,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_ignores_commented_users(self) -> None:
@ -210,7 +378,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_ignores_extra_experiments(self) -> None:
@ -229,9 +397,44 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_disables_experiment_on_exception_branches_when_not_explicitly_opted_in(
self,
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 100
---
Users:
@User,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_allows_experiment_on_exception_branches_when_explicitly_opted_in(
self,
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 100
all_branches: true
---
Users:
@User,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for user")
if __name__ == "__main__":
main()

View File

@ -12,7 +12,7 @@ import json
import os
import warnings
from hashlib import sha256
from typing import Any, Dict, List, Optional
from typing import Any, List, Optional
from unittest import main, mock, skip, TestCase
from urllib.error import HTTPError
@ -24,7 +24,6 @@ from trymerge import (
find_matching_merge_rule,
get_classifications,
get_drci_classifications,
get_rockset_results,
gh_get_team_members,
GitHubPR,
JobCheckState,
@ -42,7 +41,6 @@ if "GIT_REMOTE_URL" not in os.environ:
os.environ["GIT_REMOTE_URL"] = "https://github.com/pytorch/pytorch"
GQL_MOCKS = "gql_mocks.json.gz"
ROCKSET_MOCKS = "rockset_mocks.json.gz"
DRCI_MOCKS = "drci_mocks.json.gz"
@ -77,16 +75,11 @@ def mock_query(
if err.code == 401 or err.code == 403:
err_msg = f"If you are seeing this message during workflow run, please make sure to update {file_name}"
err_msg += f" locally, by deleting it and running {os.path.basename(__file__)} with"
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN,"
err_msg += " the rockset api key passed via ROCKSET_API_KEY,"
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN"
err_msg += " and drci api key passed via DRCI_BOT_KEY environment variables"
if (
os.getenv("GITHUB_TOKEN") is None
or os.getenv("ROCKSET_API_KEY") is None
or os.getenv("DRCI_BOT_KEY") is None
):
if os.getenv("GITHUB_TOKEN") is None or os.getenv("DRCI_BOT_KEY") is None:
err_msg = (
"Failed to update cached queries as GITHUB_TOKEN or ROCKSET_API_KEY or DRCI_BOT_KEY "
"Failed to update cached queries as GITHUB_TOKEN or DRCI_BOT_KEY "
+ "is not defined. "
+ err_msg
)
@ -110,16 +103,6 @@ def mocked_gh_graphql(query: str, **kwargs: Any) -> Any:
return mock_query(gh_graphql_wrapper, GQL_MOCKS, key_function, query, kwargs)
def mocked_rockset_results(head_sha: str, merge_base: str, num_retries: int = 3) -> Any:
return mock_query(
get_rockset_results,
ROCKSET_MOCKS,
lambda x, y: f"{x} {y}",
head_sha,
merge_base,
)
def mocked_drci_classifications(pr_num: int, project: str, num_retries: int = 3) -> Any:
return mock_query(
get_drci_classifications,
@ -273,10 +256,6 @@ def xla_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule]:
]
def empty_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:
return []
class DummyGitRepo(GitRepo):
def __init__(self) -> None:
super().__init__(get_git_repo_dir(), get_git_remote_name())
@ -288,7 +267,6 @@ class DummyGitRepo(GitRepo):
return "super awsome commit message"
@mock.patch("trymerge.get_rockset_results", side_effect=empty_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch(
"trymerge.get_drci_classifications", side_effect=mocked_drci_classifications
@ -604,7 +582,6 @@ class TestTryMerge(TestCase):
mocked_gh_fetch_merge_base.assert_called_once()
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(
@ -843,7 +820,7 @@ class TestBypassFailures(TestCase):
checks = pr.get_checkrun_conclusions()
# Known flaky failure takes precedence over ignore current (need to set the
# merge base here to get the results from Rockset, and that categorize the
# merge base here to get the results from Dr. CI, and that categorize the
# broken trunk failure too
checks = get_classifications(
pr.pr_num,
@ -929,7 +906,6 @@ class TestBypassFailures(TestCase):
)
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch("trymerge.get_drci_classifications", return_value={})
@ -1008,7 +984,6 @@ class TestBypassFailuresOnSandCastle(TestCase):
self.assertTrue(len(failed) == 2)
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(

View File

@ -452,8 +452,6 @@ RE_DIFF_REV = re.compile(r"^Differential Revision:.+?(D[0-9]+)", re.MULTILINE)
CIFLOW_LABEL = re.compile(r"^ciflow/.+")
CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk")
MERGE_RULE_PATH = Path(".github") / "merge_rules.yaml"
ROCKSET_MERGES_COLLECTION = "merges"
ROCKSET_MERGES_WORKSPACE = "commons"
REMOTE_MAIN_BRANCH = "origin/main"
DRCI_CHECKRUN_NAME = "Dr.CI"
INTERNAL_CHANGES_CHECKRUN_NAME = "Meta Internal-Only Changes Check"
@ -1180,7 +1178,7 @@ class GitHubPR:
merge_commit_sha = repo.rev_parse(name=self.default_branch())
if comment_id and self.pr_num:
# Finally, upload the record to Rockset. The list of pending and failed
# Finally, upload the record to s3. The list of pending and failed
# checks are at the time of the merge
save_merge_record(
comment_id=comment_id,
@ -1202,7 +1200,7 @@ class GitHubPR:
ignore_current=bool(ignore_current_checks),
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
print("Missing comment ID or PR number, couldn't upload to s3")
# Usually Github will see that the commit has "resolves <pr_num>" in the
# commit message and close the PR, but sometimes it doesn't, leading to
@ -1481,7 +1479,7 @@ def find_matching_merge_rule(
# Categorize all checks when skip_mandatory_checks (force merge) is set. Do it here
# where the list of checks is readily available. These records will be saved into
# Rockset merge records
# s3 merge records
(
pending_mandatory_checks,
failed_mandatory_checks,
@ -1508,7 +1506,7 @@ def checks_to_str(checks: List[Tuple[str, Optional[str]]]) -> str:
def checks_to_markdown_bullets(
checks: List[Tuple[str, Optional[str], Optional[int]]]
checks: List[Tuple[str, Optional[str], Optional[int]]],
) -> List[str]:
return [
f"- [{c[0]}]({c[1]})" if c[1] is not None else f"- {c[0]}" for c in checks[:5]
@ -1568,7 +1566,7 @@ def save_merge_record(
This saves the merge records as a json, which can later be uploaded to s3
"""
# Prepare the record to be written into Rockset
# Prepare the record to be written into s3
data = [
{
"comment_id": comment_id,
@ -1590,7 +1588,8 @@ def save_merge_record(
"ignore_current": ignore_current,
"error": error,
# This is a unique identifier for the record for deduping purposes
# in rockset. Any unique string would work
# in Rockset. Any unique string would work. This will not be used
# after we migrate off Rockset
"_id": f"{project}-{pr_num}-{comment_id}-{os.environ.get('GITHUB_RUN_ID')}",
}
]
@ -1600,36 +1599,6 @@ def save_merge_record(
json.dump(data, f)
@retries_decorator(rc=[])
def get_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:
query = f"""
SELECT
w.name as workflow_name,
j.id,
j.name,
j.conclusion,
j.completed_at,
j.html_url,
j.head_sha,
j.torchci_classification.captures as failure_captures,
LENGTH(j.steps) as steps,
FROM
commons.workflow_job j join commons.workflow_run w on w.id = j.run_id
where
j.head_sha in ('{head_sha}','{merge_base}')
"""
try:
import rockset # type: ignore[import]
res = rockset.RocksetClient(
host="api.usw2a1.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]
).sql(query)
return cast(List[Dict[str, Any]], res.results)
except ModuleNotFoundError:
print("Could not use RockSet as rocket dependency is missing")
return []
@retries_decorator()
def get_drci_classifications(pr_num: int, project: str = "pytorch") -> Any:
"""
@ -2067,7 +2036,7 @@ def categorize_checks(
pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on s3
failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)
# If required_checks is not set or empty, consider all names are relevant
@ -2126,7 +2095,7 @@ def categorize_checks(
):
failed_checks = failed_checks + flaky_or_broken_trunk
# The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record
# The list of failed_checks_categorization is returned so that it can be saved into the s3 merge record
return (pending_checks, failed_checks, failed_checks_categorization)
@ -2410,7 +2379,7 @@ def main() -> None:
handle_exception(e)
if args.comment_id and args.pr_num:
# Finally, upload the record to Rockset, we don't have access to the
# Finally, upload the record to s3, we don't have access to the
# list of pending and failed checks here, but they are not really
# needed at the moment
save_merge_record(
@ -2433,7 +2402,7 @@ def main() -> None:
error=str(e),
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
print("Missing comment ID or PR number, couldn't upload to s3")
finally:
if not args.check_mergeability:
gh_remove_label(

.github/scripts/update_runner_determinator.py (new vendored executable file, 31 lines)
View File

@ -0,0 +1,31 @@
#!/usr/bin/env python3
import re
# Read the contents of runner_determinator.py
with open(".github/scripts/runner_determinator.py") as script_file:
script_content = script_file.read()
# Indent the script content by 10 spaces to match destination indentation
indented_script_content = "\n".join(
[" " * 10 + line if line else line for line in script_content.splitlines()]
)
# Read the contents of _runner-determinator.yml
with open(".github/workflows/_runner-determinator.yml") as yml_file:
yml_content = yml_file.read()
# Replace the content between the markers
new_yml_content = re.sub(
r"(cat <<EOF > runner_determinator.py\n)(.*?)(\n\s+EOF)",
lambda match: match.group(1) + indented_script_content + match.group(3),
yml_content,
flags=re.DOTALL,
)
# Save the modified content back to _runner-determinator.yml
with open(".github/workflows/_runner-determinator.yml", "w") as yml_file:
yml_file.write(new_yml_content)
print("Updated _runner-determinator.yml with the contents of runner_determinator.py")

View File

@ -25,7 +25,7 @@ concurrency:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -40,6 +40,16 @@ concurrency:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell

View File

@ -54,7 +54,7 @@ env:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -68,6 +68,7 @@ jobs:
needs: get-label-type
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
@ -102,6 +103,7 @@ jobs:
build_name: !{{ config["build_name"] }}
build_environment: !{{ build_environment }}
{%- if "aarch64" in build_environment %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}

View File

@ -55,7 +55,7 @@ env:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

View File

@ -91,14 +91,14 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ inputs.cuda-version != 'cpu' && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ inputs.cuda-version != 'cpu' && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Output disk space left
run: |

View File

@ -271,7 +271,9 @@ jobs:
)
docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
if [[ ${BUILD_ENVIRONMENT} == *"aarch64"* ]]; then
docker exec -t "${container_name}" bash -c "bash /builder/aarch64_linux/aarch64_ci_build.sh"
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/aarch64_linux/aarch64_ci_build.sh"
elif [[ ${{ inputs.PACKAGE_TYPE }} == "manywheel" || ${{ inputs.PACKAGE_TYPE }} == "libtorch" ]]; then
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
else
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /builder/${{ inputs.PACKAGE_TYPE }}/build.sh"
fi

View File

@ -114,22 +114,32 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Lock NVIDIA A100 40GB Frequency
run: |
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Start monitoring script
id: monitor-script
@ -208,6 +218,7 @@ jobs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
@ -218,7 +229,8 @@ jobs:
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
ARTIFACTS_FILE_SUFFIX: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
run: |
set -x
@ -236,6 +248,7 @@ jobs:
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
@ -265,6 +278,7 @@ jobs:
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e SCCACHE_REGION \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
@ -274,6 +288,8 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
-e ARTIFACTS_FILE_SUFFIX \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -343,7 +359,7 @@ jobs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does

View File

@ -3,6 +3,11 @@ name: Check whether the workflow owner can use ARC runners
on:
workflow_call:
inputs:
check_experiments:
required: false
type: string
description: |
List of experiments for this workflow. If not defined, all default experiments are included.
triggering_actor:
required: true
type: string
@ -35,6 +40,8 @@ on:
jobs:
runner-determinator:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: ubuntu-latest
outputs:
label-type: ${{ steps.set-condition.outputs.label-type }}
@ -43,6 +50,7 @@ jobs:
ISSUE_NUMBER: ${{ inputs.issue_number }}
TRIGGERING_ACTOR: ${{ inputs.triggering_actor }}
ISSUE_OWNER: ${{ inputs.issue_owner }}
CHECK_EXPERIMENTS: ${{ inputs.check_experiments }}
steps:
# - name: Checkout PyTorch
# uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
@ -59,6 +67,10 @@ jobs:
cat <<EOF > runner_determinator.py
# flake8: noqa: G004
# Note: Copies of this script in runner_determinator.py and _runner-determinator.yml
# must be kept in sync. You can do it easily by running the following command:
# python .github/scripts/update_runner_determinator.py
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
@ -94,7 +106,8 @@ jobs:
experiments:
lf:
rollout_percent: 25
all_branches: false
default: true
---
# Opt-ins:
@ -112,7 +125,7 @@ jobs:
import random
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Dict, Iterable, List, NamedTuple, Tuple
from typing import Any, Dict, FrozenSet, Iterable, List, NamedTuple, Tuple
import yaml
from github import Auth, Github
@ -138,6 +151,12 @@ jobs:
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
all_branches: bool = (
False # If True, the experiment is also enabled on the exception branches
)
default: bool = (
True # If True, the experiment is enabled by default for all queries
)
# Add more fields as needed
@ -192,6 +211,12 @@ jobs:
f.write(f"{key}={value}\n")
def _str_comma_separated_to_set(value: str) -> FrozenSet[str]:
return frozenset(
filter(lambda itm: itm != "", map(str.strip, value.strip(" \n\t").split(",")))
)
def parse_args() -> Any:
parser = ArgumentParser("Get dynamic rollout settings")
parser.add_argument("--github-token", type=str, required=True, help="GitHub token")
@ -226,6 +251,13 @@ jobs:
required=True,
help="Current GitHub ref type, branch or tag",
)
parser.add_argument(
"--eligible-experiments",
type=_str_comma_separated_to_set,
required=False,
default="",
help="comma separated list of experiments to check, if omitted all experiments marked with default=True are checked",
)
return parser.parse_args()
@ -271,7 +303,7 @@ jobs:
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
Branches that get opted out of experiments by default, until they're explicitly enabled.
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
@ -397,7 +429,11 @@ jobs:
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
rollout_state: str,
workflow_requestors: Iterable[str],
branch: str,
eligible_experiments: FrozenSet[str] = frozenset(),
is_canary: bool = False,
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
@ -405,7 +441,24 @@ jobs:
fleet_prefix = ""
prefixes = []
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if not experiment_settings.all_branches and is_exception_branch(branch):
log.info(
f"Branch {branch} is an exception branch. Not enabling experiment {experiment_name}."
)
continue
if eligible_experiments:
if experiment_name not in eligible_experiments:
exp_list = ", ".join(eligible_experiments)
log.info(
f"Skipping experiment '{experiment_name}', as it is not in the eligible_experiments list: {exp_list}"
)
continue
elif not experiment_settings.default:
log.info(
f"Skipping experiment '{experiment_name}', as it is not a default experiment"
)
continue
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
@ -414,11 +467,13 @@ jobs:
if is_user_opted_in(requestor, user_optins, experiment_name)
]
enabled = False
if opted_in_users:
log.info(
f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."
)
enabled = True
elif experiment_settings.rollout_perc:
# If no user is opted in, then we randomly enable the experiment based on the rollout percentage
if random.uniform(0, 100) <= experiment_settings.rollout_perc:
@ -466,35 +521,35 @@ jobs:
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
runner_label_prefix = DEFAULT_LABEL_PREFIX
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
is_canary = args.github_repo == "pytorch/pytorch-canary"
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
runner_label_prefix = get_runner_prefix(
rollout_state,
(args.github_issue_owner, username),
args.github_branch,
args.eligible_experiments,
is_canary,
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
@ -523,4 +578,5 @@ jobs:
--github-actor "$TRIGGERING_ACTOR" \
--github-issue-owner "$ISSUE_OWNER" \
--github-ref-type "$curr_ref_type" \
--github-repo "$GITHUB_REPOSITORY"
--github-repo "$GITHUB_REPOSITORY" \
--eligible-experiments "$CHECK_EXPERIMENTS" \

View File

@ -68,9 +68,10 @@ jobs:
shell: bash
steps:
# Duplicated in win-test because this MUST go before a checkout
- name: Enable git symlinks on Windows and disable fsmonitor daemon
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock

View File

@ -46,9 +46,10 @@ jobs:
shell: bash
steps:
# Duplicated in win-build because this MUST go before a checkout
- name: Enable git symlinks on Windows and disable fsmonitor daemon
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
@ -189,7 +190,7 @@ jobs:
run: |
pushd "${PYTORCH_FINAL_PACKAGE_DIR}"
# shellcheck disable=SC2046,SC2102
python3 -mpip install $(echo *.whl)[opt-einsum,optree] optree==0.12.1
python3 -mpip install $(echo *.whl)[opt-einsum,optree] optree==0.13.0
popd
.ci/pytorch/win-test.sh

View File

@ -35,7 +35,7 @@ jobs:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
cuda_version: ["11.8", "12.1", "12.4", "cpu"]
cuda_version: ["11.8", "12.1", "12.4", "12.6", "cpu"]
env:
CUDA_VERSION: ${{ matrix.cuda_version }}
steps:
@ -62,5 +62,11 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/conda/build.sh conda-builder${{ matrix.cuda_version == 'cpu' && ':' || ':cuda' }}${{matrix.cuda_version}}
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/conda/build.sh conda-builder${{ matrix.cuda_version == 'cpu' && ':' || ':cuda' }}${{matrix.cuda_version}}

View File

@ -31,7 +31,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -72,8 +72,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:cuda${{matrix.cuda_version}}
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:cuda${{matrix.cuda_version}}
build-docker-rocm:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -108,8 +114,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:rocm${{matrix.rocm_version}}
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:rocm${{matrix.rocm_version}}
build-docker-cpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -138,5 +150,11 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:cpu
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/libtorch/build.sh libtorch-cxx11-builder:cpu

View File

@ -35,7 +35,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -78,8 +78,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux-builder:cuda${{matrix.cuda_version}}
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux-builder:cuda${{matrix.cuda_version}}
# NOTE: manylinux_2_28 are still experimental, see https://github.com/pytorch/pytorch/issues/123649
build-docker-cuda-manylinux_2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
@ -117,8 +123,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:cuda${{matrix.cuda_version}}
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:cuda${{matrix.cuda_version}}
build-docker-cuda-aarch64:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -151,8 +163,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cuda${{matrix.cuda_version}}
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cuda${{matrix.cuda_version}}
build-docker-rocm:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -187,8 +205,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux-builder:rocm${{matrix.rocm_version}}
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux-builder:rocm${{matrix.rocm_version}}
build-docker-cpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -217,8 +241,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux-builder:cpu
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux-builder:cpu
build-docker-cpu-manylinux_2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -249,8 +279,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:cpu
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:cpu
build-docker-cpu-aarch64:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -281,8 +317,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cpu-aarch64
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinuxaarch64-builder:cpu-aarch64
build-docker-cpu-aarch64-2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -316,8 +358,14 @@ jobs:
env:
DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
DOCKER_ID: ${{ secrets.DOCKER_ID }}
run: |
.ci/docker/manywheel/build.sh manylinux2_28_aarch64-builder:cpu-aarch64
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux2_28_aarch64-builder:cpu-aarch64
build-docker-cpu-cxx11-abi:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -348,8 +396,14 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinuxcxx11-abi-builder:cpu-cxx11-abi
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinuxcxx11-abi-builder:cpu-cxx11-abi
build-docker-xpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -380,5 +434,11 @@ jobs:
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
run: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:xpu
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:xpu


@ -29,7 +29,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -43,7 +43,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12" ]
device: ["cuda", "rocm", "xpu"]
include:
- device: "rocm"
@ -91,9 +91,6 @@ jobs:
# Determine python executable for given version
case $PY_VERS in
3.8)
PYTHON_EXECUTABLE=/opt/python/cp38-cp38/bin/python
;;
3.9)
PYTHON_EXECUTABLE=/opt/python/cp39-cp39/bin/python
;;
@ -214,7 +211,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12" ]
timeout-minutes: 40
env:
DOCKER_IMAGE: pytorch/conda-builder:cpu


@ -18,7 +18,7 @@ on:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -32,7 +32,7 @@ permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -67,6 +67,7 @@ jobs:
pytorch-linux-jammy-py3.12-halide,
pytorch-linux-jammy-xpu-2024.0-py3,
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-jammy-py3-clang18-asan,
pytorch-linux-focal-py3-clang10-onnx,
pytorch-linux-focal-linter,
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter,
@ -78,7 +79,9 @@ jobs:
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
runner: linux.arm64.m7g.4xlarge
timeout-minutes: 600
runs-on: "${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}"
# Docker uploads fail from LF runners, see https://github.com/pytorch/pytorch/pull/137358
# runs-on: "${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}"
runs-on: "${{ matrix.runner }}"
env:
DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }}
steps:
@ -120,7 +123,7 @@ jobs:
IMAGE_NAME: ${{ matrix.docker-image-name }}
with:
shell: bash
timeout_minutes: 15
timeout_minutes: 30
max_attempts: 5
retry_wait_seconds: 90
command: |


@ -36,7 +36,7 @@ permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -39,7 +39,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -60,11 +60,12 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-aarch64-test: # Testing
@ -86,6 +87,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -130,6 +132,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64
@ -177,11 +180,12 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-aarch64-test: # Testing
@ -203,6 +207,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -247,6 +252,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64
@ -294,11 +300,12 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-aarch64-test: # Testing
@ -320,6 +327,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -364,6 +372,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64
@ -411,11 +420,12 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-aarch64-test: # Testing
@ -437,6 +447,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -481,6 +492,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64


@ -39,7 +39,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -34,7 +34,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -39,7 +39,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -34,7 +34,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -39,7 +39,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -34,7 +34,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -153,7 +153,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-test: # Testing


@ -39,7 +39,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -343,7 +343,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-test: # Testing
@ -1029,7 +1029,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_4-test: # Testing
@ -1785,7 +1785,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_4-test: # Testing
@ -2471,7 +2471,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_4-test: # Testing
@ -3157,7 +3157,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_4-test: # Testing
@ -3324,3 +3324,353 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu-cxx11-abi
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-cxx11-abi-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu-cxx11-abi
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda11_8-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml


@ -1,182 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel-split
on:
push:
branches:
- main
tags:
- 'ciflow/periodic/*'
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
ANACONDA_USER: pytorch
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel-split
BUILDER_ROOT: /builder
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-split-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_9-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda11_8
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_1
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_9-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_4
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
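Note on the PYTORCH_EXTRA_INSTALL_REQUIREMENTS values in the build jobs above: each `|`-separated entry is a standard PEP 508 requirement string, and the `platform_system == 'Linux' and platform_machine == 'x86_64'` markers restrict the extra CUDA wheels to x86_64 Linux hosts. A minimal sketch of how pip treats one such entry (illustrative command only, not part of the generated workflow):

```bash
# Hedged example, not part of the workflow: pip evaluates the PEP 508 marker
# and only installs the package when the current platform matches it;
# on other platforms the requirement is skipped.
pip install "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64'"
```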

File diff suppressed because it is too large


@ -39,7 +39,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -64,7 +64,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_9-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cpu-s390x-test: # Testing
@ -133,7 +133,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_10-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-s390x-test: # Testing
@ -202,7 +202,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_11-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-s390x-test: # Testing
@ -271,7 +271,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_12-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-s390x-test: # Testing
@ -340,7 +340,7 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_13-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cpu-s390x-test: # Testing


@ -46,7 +46,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.9"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
@ -162,7 +162,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.10"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
@ -278,7 +278,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.11"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
@ -394,7 +394,7 @@ jobs:
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.12"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
@ -496,3 +496,119 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_13-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.13"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
# shellcheck disable=SC2129
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
if [ -d "/Applications/Xcode_14.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_14.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@v3.0.0
if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
with:
timeout_minutes: 5
max_attempts: 3
retry_wait_seconds: 90
command: |
sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: wheel-py3_13-cpu
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
wheel-py3_13-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: wheel-py3_13-cpu-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
DESIRED_PYTHON: "3.13"
build_name: wheel-py3_13-cpu
use_s3: False
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml


@ -34,7 +34,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -64,7 +64,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -75,6 +75,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -178,7 +188,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -189,6 +199,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -310,7 +330,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -321,6 +341,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -425,7 +455,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -436,6 +466,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -558,7 +598,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -569,6 +609,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -673,7 +723,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -684,6 +734,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -806,7 +866,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -817,6 +877,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -921,7 +991,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -932,6 +1002,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1053,7 +1133,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1064,6 +1144,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1167,7 +1257,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1178,6 +1268,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1299,7 +1399,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1310,6 +1410,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1414,7 +1524,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1425,6 +1535,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1547,7 +1667,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1558,6 +1678,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1662,7 +1792,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1673,6 +1803,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1795,7 +1935,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1806,6 +1946,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -1910,7 +2060,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -1921,6 +2071,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2042,7 +2202,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2053,6 +2213,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2156,7 +2326,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2167,6 +2337,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2288,7 +2468,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2299,6 +2479,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2403,7 +2593,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2414,6 +2604,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2536,7 +2736,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2547,6 +2747,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2651,7 +2861,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2662,6 +2872,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2784,7 +3004,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2795,6 +3015,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -2899,7 +3129,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -2910,6 +3140,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3031,7 +3271,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3042,6 +3282,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3145,7 +3395,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3156,6 +3406,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3277,7 +3537,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3288,6 +3548,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3392,7 +3662,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3403,6 +3673,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3525,7 +3805,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3536,6 +3816,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3640,7 +3930,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3651,6 +3941,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3773,7 +4073,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3784,6 +4084,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -3888,7 +4198,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -3899,6 +4209,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
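
The recurring change to the get_ec2_metadata helper in these jobs switches the EC2 instance-metadata lookup from IMDSv1 to IMDSv2: a short-lived session token is requested with a PUT and then sent as a header on the metadata GET. A minimal standalone sketch of the same flow, assuming it runs on an EC2 instance with IMDSv2 available (the 30-second token TTL mirrors the workflow; the intermediate TOKEN variable is added here only for readability):

#!/bin/bash
# Request a short-lived IMDSv2 session token (valid for 30 seconds).
TOKEN="$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 30")"
# Pass the token when reading a metadata category, e.g. the instance id.
curl -fsSL -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/instance-id"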


@ -27,7 +27,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -61,7 +61,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -72,6 +72,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -179,7 +189,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -190,6 +200,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
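
The new "Enable git long paths and symlinks on Windows and disable fsmonitor daemon" step, added before checkout in each of these Windows jobs, raises the path-length limit, enables symlinks, and turns off the fsmonitor daemon so it cannot hold a lock on the checkout directory. A quick sketch for verifying that the settings took effect on a runner, assuming git is on the PATH (expected values shown as comments):

git config --global --get core.longpaths   # true
git config --global --get core.symlinks    # true
git config --global --get core.fsmonitor   # false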


@ -34,7 +34,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -68,7 +68,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -79,6 +79,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -186,7 +196,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -197,6 +207,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -326,7 +346,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -337,6 +357,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -445,7 +475,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -456,6 +486,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -586,7 +626,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -597,6 +637,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -705,7 +755,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -716,6 +766,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -846,7 +906,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -857,6 +917,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -965,7 +1035,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -976,6 +1046,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell


@ -27,7 +27,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -61,7 +61,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -72,6 +72,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -179,7 +189,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -190,6 +200,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell


@ -34,7 +34,7 @@ concurrency:
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -68,7 +68,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -79,6 +79,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -186,7 +196,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -197,6 +207,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -326,7 +346,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -337,6 +357,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -445,7 +475,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -456,6 +486,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -586,7 +626,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -597,6 +637,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -705,7 +755,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -716,6 +766,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -846,7 +906,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -857,6 +917,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
@ -965,7 +1035,7 @@ jobs:
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
@ -976,6 +1046,16 @@ jobs:
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell

File diff suppressed because it is too large.


@ -20,7 +20,8 @@ permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: github.event_name != 'schedule' || github.repository == 'pytorch/pytorch'
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -56,7 +57,7 @@ jobs:
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
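
Besides renaming the inductor_cpp_wrapper_abi_compatible test config to inductor_cpp_wrapper, this workflow (like several others in the diff) adds a guard so that scheduled runs of get-label-type only execute in the upstream repository. A condensed sketch of the pattern, assembled from the hunks above (the inputs shown are the ones these workflows pass to the reusable runner-determinator workflow):

jobs:
  get-label-type:
    name: get-label-type
    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
    # Skip scheduled runs on forks; other event types are unaffected.
    if: github.event_name != 'schedule' || github.repository == 'pytorch/pytorch'
    with:
      triggering_actor: ${{ github.triggering_actor }}
      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
      curr_branch: ${{ github.head_ref || github.ref_name }}
      curr_ref_type: ${{ github.ref_type }}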


@ -17,6 +17,7 @@ permissions: read-all
jobs:
linux-jammy-cpu-py3_9-gcc11-inductor-build:
if: github.event_name != 'schedule' || github.repository == 'pytorch/pytorch'
name: linux-jammy-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
with:


@ -18,7 +18,8 @@ permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: github.event_name != 'schedule' || github.repository == 'pytorch/pytorch'
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -13,30 +13,43 @@ concurrency:
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
get-default-label-prefix:
name: get-default-label-prefix
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
get-test-label-type:
name: get-test-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: github.event_name != 'schedule' || github.repository == 'pytorch/pytorch'
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
check_experiments: "awsa100"
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
needs:
- get-default-label-prefix
- get-test-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 1, num_shards: 2, runner: "linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 2, num_shards: 2, runner: "linux.gcp.a100" },
{ config: "inductor_torchbench_perf_compare", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
{ config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 1, num_shards: 2, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
{ config: "inductor_timm_perf_compare", shard: 2, num_shards: 2, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
{ config: "inductor_torchbench_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
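
In this perf-compare workflow the single label-determinator job is split in two: get-default-label-prefix supplies the runner_prefix for the build job, while get-test-label-type (gated by the schedule/fork check and requesting the "awsa100" experiment) supplies a prefix that is prepended to the A100 runner label of every test-matrix entry. A condensed sketch of how the second job's output reaches a runner label; only the fields relevant to the label flow are shown, and the build job's remaining inputs are omitted:

  linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
    uses: ./.github/workflows/_linux-build.yml
    needs:
      - get-default-label-prefix
      - get-test-label-type
    with:
      # The build runner gets the default prefix...
      runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
      test-matrix: |
        { include: [
          # ...while each test entry prepends the A100 experiment prefix to its runner label.
          { config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
        ]}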


@ -70,7 +70,8 @@ permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: github.event_name != 'schedule' || github.repository == 'pytorch/pytorch'
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@ -50,7 +50,8 @@ permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: github.event_name != 'schedule' || github.repository == 'pytorch/pytorch'
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}

Some files were not shown because too many files have changed in this diff.