Compare commits

...

584 Commits

Author SHA1 Message Date
9ad25d4c05 remove check 2024-08-05 15:22:44 -07:00
ea42027e0e [micro_pipeline_tp] support all _scaled_mm args (#131984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131984
Approved by: https://github.com/weifengpy
2024-08-05 21:44:37 +00:00
2b5e31d099 Move sigmoid run_const_graph HOP to PyTorch core (#132526)
Summary: When HOPs live out of tree, it is impossible to make breaking changes to the HOP API, yet HOP implementations are deeply entwined with PyTorch internals. Move the HOP into the PyTorch tree so that such changes are possible.

Test Plan: sandcastle and oss ci

Differential Revision: D60674861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132526
Approved by: https://github.com/SherlockNoMad
2024-08-05 21:40:56 +00:00
af8b8a47cb fsdp.set_: convey to functionalization that it mutates storage (#132322)
Fixes https://github.com/pytorch/pytorch/issues/132197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132322
Approved by: https://github.com/albanD, https://github.com/yf225
ghstack dependencies: #132243, #132337
2024-08-05 21:28:59 +00:00
1a0db29932 move torch._functionalize APIs to pybind. add one for marking storage mutations (#132337)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132337
Approved by: https://github.com/albanD, https://github.com/justinchuby
ghstack dependencies: #132243
2024-08-05 21:28:59 +00:00
4db368a475 make functorch CSE respect mutations as barriers (like fsdp.set_) (#132243)
Fixes https://github.com/pytorch/pytorch/issues/132200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132243
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/yf225
2024-08-05 21:28:55 +00:00
ee0ae11b34 Fix a typo in the example code. (#132601)
Since the backward multiplies the gradient by `n`, we must change the forward function to multiply the input tensor by `n`.
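For illustration, a minimal, hypothetical version of such a corrected example (names and exact shape of the snippet are assumed, not copied from the docs page being fixed):

```python
import torch

class MulByN(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, n):
        ctx.n = n
        return x * n  # forward multiplies by n, consistent with the backward below

    @staticmethod
    def backward(ctx, grad_output):
        # backward multiplies the incoming gradient by n; n itself gets no gradient
        return grad_output * ctx.n, None

x = torch.randn(3, requires_grad=True)
MulByN.apply(x, 2.0).sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
```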

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132601
Approved by: https://github.com/soulitzer
2024-08-05 21:04:20 +00:00
9a1ad3345f Fix periodic windows test (#132648)
This test has failed to clean up folders on Windows for the past week; see 27f61eba58 for example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132648
Approved by: https://github.com/janeyx99, https://github.com/zou3519, https://github.com/malfet
2024-08-05 20:54:20 +00:00
cyy
6b12dc0224 [Reland] [11/N] Use std::nullopt and std::optional (#132622)
Reland of #132396, which was reverted due to dependency reversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132622
Approved by: https://github.com/ezyang
2024-08-05 20:36:33 +00:00
6f4dc56735 [inductor] Default to 1 compile thread for internal (#132540)
Summary: The historical default here is "1", i.e., no parallel compilation. In order to prepare for rolling out the subprocess-based parallel compile, I had previously modified this code to allow parallelism when worker_start_method="subprocess". I realize this probably isn't the best rollout strategy. Rather than opting all internal usages into both a) parallel compile _and_ b) a new implementation of parallel compile, let's put the default back to "1" and then roll out the new parallel compile implementation only to those usages that have already opted in by explicitly setting compile_thread > 1.
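As a hedged illustration of the opt-in described above (config names are assumed from the summary; verify against `torch/_inductor/config.py`):

```python
# Sketch only: explicitly opting in to parallel compile with the new worker implementation.
import torch._inductor.config as inductor_config

inductor_config.compile_threads = 8                 # the default stays at 1 (no parallelism)
inductor_config.worker_start_method = "subprocess"  # the new subprocess-based implementation
```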

Differential Revision: [D60686105](https://our.internmc.facebook.com/intern/diff/D60686105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132540
Approved by: https://github.com/c00w
2024-08-05 20:23:16 +00:00
1471473b84 Add tests to bsr_dense_addmm_meta. Tune bsr_dense_addmm kernel for ViT shapes. (#132646)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132646
Approved by: https://github.com/cpuhrsch
2024-08-05 20:22:33 +00:00
b7bcfdaff2 Change deprecate warning on dispatch_on_subclass to warn once (#132374)
Summary:
# Problem

`TORCH_WARN` can cause massive log spam.

I output the logs for before and after adding this change.

*Before:*

* The log file size was ~61.15 MB (61,148,028 bytes).

*After:*

* The log file size was ~56.44 MB (56,444,057 bytes).

# Context

Looks like we tried to land this change earlier but it was reverted:

* D59413413
* Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function

# Testing Update

`test_warn_on_invalid_torch_function` would fail because the warning would not be emitted when handling the second torch function class, since `TORCH_WARN_ONCE` suppresses repeats globally.

Updated so that it runs separate programs. (I was not able to actually run the test; could someone help me with that?)
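A rough sketch of the "separate programs" idea, assuming each warning check is exercised in its own interpreter so the once-per-process `TORCH_WARN_ONCE` guard resets (this is not the actual test code):

```python
import subprocess
import sys

def run_in_fresh_interpreter(snippet: str) -> str:
    # Each invalid __torch_function__ handler runs in its own process,
    # so TORCH_WARN_ONCE fires once per process instead of once overall.
    result = subprocess.run(
        [sys.executable, "-c", snippet], capture_output=True, text=True
    )
    return result.stderr  # the test would assert the warning text appears here
```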

Test Plan: Need help with this...

Differential Revision: D60561181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132374
Approved by: https://github.com/ezyang
2024-08-05 20:02:33 +00:00
2764bee942 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6919e8baaba391ced7b4acaa553d6ea1f3b30e79.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/clee2000 due to Broke test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_quantized_linear_amx_batch_size_3_in_features_128_out_features_64_bias_False_cpu on sm86 jobs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10252979157/job/28367091621) [HUD commit link](6919e8baab) Not caught on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2269808857))
2024-08-05 19:59:04 +00:00
a3ea96b762 Revert "[export] Convert autocast to HOO (#131914)"
This reverts commit aec948adfc224e49213c4bc49586d4e4ba65fbbb.

Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/davidberard98 due to PR shouldn't have been relanded by the bot, phabricator diff did not have any recent changes and is still internally reverted ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2269797388))
2024-08-05 19:52:09 +00:00
1d34f33d00 Scale XBLOCK in triton reduction configs to avoid hitting max grid (#128826)
Scale XBLOCK size in triton_config_reduction to avoid hitting maxGridSize limits.

This issue was observed in gpt-fast examples with large sequence length:
Reproducer: https://gist.github.com/jataylo/8a0ba922fbf68e345d360a418b48b9f1

`RuntimeError: Triton Error [HIP]:  Code: 9, Messsage: invalid configuration argument`

Co-authored-by: Jason Ansel <jansel@jansel.net>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128826
Approved by: https://github.com/jansel, https://github.com/nmacchioni
2024-08-05 19:34:38 +00:00
e1c2bdac2f [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
2024-08-05 18:58:33 +00:00
aec948adfc [export] Convert autocast to HOO (#131914)
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
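For illustration only, the rewritten graph might look roughly like the following; the HOP name, signature, and submodule naming here are assumptions, not a dump from this PR:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        submod_0 = self.submod_0  # wraps the ops that ran between _enter_autocast/_exit_autocast
        wrap_with_autocast = torch.ops.higher_order.wrap_with_autocast(
            'cuda', torch.bfloat16, True, None, submod_0, rand, rand_1);  submod_0 = rand = rand_1 = None
        mm: "f32[8, 8]" = wrap_with_autocast[0];  wrap_with_autocast = None
        return (mm,)
```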

Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.

Test Plan:
CI

```
parsh --build-flags fbcode//mode/dev-nosan  fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```

Reviewed By: angelayi

Differential Revision: D60206382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
2024-08-05 18:52:12 +00:00
8d9c3a71f6 Support IPC for Expandable Segments (#130890)
This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed.

Differential Revision: [D60547506](https://our.internmc.facebook.com/intern/diff/D60547506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
2024-08-05 18:48:13 +00:00
618e2c9de4 fix torch rec test failure (#132437)
Summary: Fixes T192448049. The module call forms an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design to make it work.

Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export"

Reviewed By: zhxchen17

Differential Revision: D60528900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437
Approved by: https://github.com/Skylion007
2024-08-05 18:06:07 +00:00
1c7dc335f7 [ROCm][CK][Inductor] Enable addmm for CK backend to gemm max autotune (#130576)
Add functional support for torch.addmm with CK backend. See also #125453

# Implementation details
1. It turns out we can use the same template between addmm and matmul; essentially, matmul is addmm with empty bias
2. The Python generator in CK was updated to generate the shared cpp template. The pip package can be installed via `pip install git+https://github.com/rocm/composable_kernel@add-addmm` and will be merged into the `develop` branch after this PR lands, to avoid breaking the current matmul.

# Testing
`pytest test/inductor/test_ck_backend.py -k addmm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130576
Approved by: https://github.com/chenyang78
2024-08-05 17:49:09 +00:00
7b2664ece6 Temp disable MKL in DistributionKernels.cpp (#132532)
Until https://github.com/pytorch/pytorch/issues/132395 is addressed

Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential )
```python
import torch

high_bits_for_seed = 16000000000000000000           # to use "good quality" seed
_ = torch.manual_seed (high_bits_for_seed + 2024)

prob = torch.ones (26)
dups_mult = 0
perm_counts_mult = {}
for _ in range (1_000_000):
    p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist())
    if  p in perm_counts_mult:
        dups_mult += 1
        perm_counts_mult[p] += 1
    else:
        perm_counts_mult[p] = 1

print ('duplicate multinomial perms: ', dups_mult)
print ('multiple multinomial perms:  ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item())
print ('max of perm_counts_mult:     ', torch.tensor (list (perm_counts_mult.values())).max().item())
print ('len (perm_counts_mult):      ', len (perm_counts_mult))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132532
Approved by: https://github.com/albanD
2024-08-05 17:40:57 +00:00
baa2483cea Revert "Refactor thunkify to return proper thunk abstraction (#132407)"
This reverts commit c65cb37657ef4f7fcd070a7e8e5121eb299919fd.

Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to td strikes again ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2269577711))
2024-08-05 17:39:54 +00:00
cyy
d5045cceff [16/N] Fix clang-tidy warnings in jit (#132604)
Follows #132564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132604
Approved by: https://github.com/Skylion007
2024-08-05 17:36:22 +00:00
e8645fa2b9 [Doc] fix some typos (found by codespell and typos) (#132544)
Applying doc fixes from PR https://github.com/pytorch/pytorch/pull/127267 - with CLA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132544
Approved by: https://github.com/kit1980
2024-08-05 17:21:56 +00:00
3d87dfc088 Add basic OpenReg module scaffolding with autograd (#131708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131708
Approved by: https://github.com/ezyang
2024-08-05 17:07:11 +00:00
df59084012 Drop GIL around cudart APIs (#132520)
Noticed a hang where the stuck thread was blocked on a cudaHostUnregister
call, probably due to an internal CUDA deadlock caused by something
else, but it was holding the GIL at the time and blocked other Python
threads.

As far as I can tell, none of the cudart APIs require the GIL to be held, nor are
they marked as thread-unsafe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132520
Approved by: https://github.com/LucasLLC, https://github.com/kirtiteja
2024-08-05 17:04:01 +00:00
6919e8baab [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-08-05 17:02:30 +00:00
d532c00c81 [test/torch_np] Fix usages of deprecated NumPy 2.0 APIs in numpy_tests (#131909)
Migrates usages of deprecated APIs in NumPy-2.0 per [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#numpy-2-0-migration-guide).

I did a grep on the old API usages (see list below) and these were used only referenced in test files under `test/torch_np/numpy_tests/**/*.py`.

Specifically, migrates the usages of the following APIs (a short before/after illustration follows the list):

1. `np.sctypes` → Access dtypes explicitly instead
2. `np.float_` → `np.float64`
3. `np.complex_` → `np.complex128`
4. `np.longcomplex` → `np.clongdouble`
5. `np.unicode_` → `np.str_`
6. `np.product` → `np.prod`
7. `np.cumproduct` → `np.cumprod`
8. `np.alltrue` → `np.all`
9. `np.sometrue` → `np.any`
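A short, illustrative before/after for a few of the renames above (not taken from the test files themselves):

```python
import numpy as np

values = np.array([1.5, 2.5, 3.5])

total = np.prod(values)              # was: np.product(values)
running = np.cumprod(values)         # was: np.cumproduct(values)
all_positive = np.all(values > 0)    # was: np.alltrue(values > 0)
any_large = np.any(values > 3)       # was: np.sometrue(values > 3)
scalar = np.float64(1.0)             # was: np.float_(1.0)
text = np.str_("abc")                # was: np.unicode_("abc")
```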

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131909
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2024-08-05 16:21:08 +00:00
a672f6c84e [inductor] unificate SUBPROCESS_DECODE_ARGS variable in cpp_builder.py (#132615)
[inductor] unificate SUBPROCESS_DECODE_ARGS variable in cpp_builder.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132615
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 16:00:35 +00:00
9945caec65 [inductor] Fix autotune non-close attr crash on Windows (#132630)
When I enabled `autotune`-related UTs on Windows:
<img width="1364" alt="Image" src="https://github.com/user-attachments/assets/b0c9c516-419d-47d0-a4c1-e90c98109d02">

I found the missing `close` attr issue on Windows. I checked that the DLL type is `CDLL`, which doesn't have a `close` attr.
This PR checks for the `close` attr before doing the close operation.

<img width="1624" alt="Image" src="https://github.com/user-attachments/assets/14093900-4ad8-4673-839e-7ba1410c5656">

After this fix, the UTs passed.
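A minimal sketch of the guard this PR describes (the actual call sites in inductor differ; this only illustrates the `hasattr` check):

```python
import ctypes

def close_dll_if_possible(dll: ctypes.CDLL) -> None:
    # ctypes.CDLL has no `close` attribute, so only call it when the
    # wrapper object actually provides one (e.g. a DLLWrapper).
    if hasattr(dll, "close"):
        dll.close()
```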

Here are some existing issues:
1. `CDLL` doesn't have a `close` attr, so the DLL is not closed, though it didn't crash on Linux.
2. This PR just avoids the crash on Windows; it does not actually close the DLL either.

**TODO:**
We need to replace `CDLL` with `DLLWrapper` in `CppBenchmarkRequest`, like `CUDABenchmarkRequest`. I have added a tracking task: https://github.com/pytorch/pytorch/issues/124245 and will follow up with this change in a further PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132630
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 16:00:27 +00:00
a8490a0762 [traced-graph][sparse] propagate sparsity in fx graph (#131920)
This PR proceeds with implementing the feature request #117188 by generalizing more cases that already work with COO to work with the compressed sparse formats as well.

Feature request:
https://github.com/pytorch/pytorch/issues/117188

Rebranch of older PRs (for history):
https://github.com/pytorch/pytorch/pull/131474
https://github.com/pytorch/pytorch/pull/128549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131920
Approved by: https://github.com/ezyang
2024-08-05 15:49:53 +00:00
14edd986b3 Fix missing include file (#132647)
This error only appears with newer gcc releases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132647
Approved by: https://github.com/Skylion007
2024-08-05 15:49:49 +00:00
70cb16b316 [DTensor] Added naive replicate strategy for more diagonal ops (#132201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132201
Approved by: https://github.com/wz337
ghstack dependencies: #132104
2024-08-05 15:18:56 +00:00
c65cb37657 Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
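A hypothetical, minimal thunk illustrating the point above; this is not the actual implementation in the PR:

```python
class Thunk:
    """Lazily computes a value once, then drops the producing function."""

    def __init__(self, fn):
        self._fn = fn
        self._value = None
        self._forced = False

    def force(self):
        if not self._forced:
            self._value = self._fn()
            self._fn = None  # unlike lru_cache, the closure is released after forcing
            self._forced = True
        return self._value
```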

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
2024-08-05 14:42:40 +00:00
b465a5843b DTensor: add more foreach ops to supported sharding prop list (#132066)
fixes https://github.com/pytorch/pytorch/issues/132016.

Right now, if you run an op for which DTensor has no sharding prop rule, **and** that op accepts non-trivial pytrees of input tensors as arguments, DTensor can end up in an infinite loop before it has the chance to error out due to the missing sharding prop rule.

This PR doesn't fix the problem, but adds rules for the culprit ops (missing foreach ops)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132066
Approved by: https://github.com/wanchaol
2024-08-05 13:51:59 +00:00
c3ee07c71c add missing profiler include in cpp code generation (#132419)
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check config.profiler_mark_wrapper_call.

Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.

```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s

OK
```

Fixes https://github.com/pytorch/pytorch/issues/131339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 13:40:47 +00:00
b30d0916d9 [FSDP2] Added missing event wait (for future) (#132568)
Nothing is actually wrong currently, but we should add this in case we land https://github.com/pytorch/pytorch/pull/127032 in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132568
Approved by: https://github.com/weifengpy, https://github.com/Skylion007
2024-08-05 12:44:46 +00:00
fb87796d4f [DeviceMesh] Add supports for non-continuous slicing (#132310)
Removes the constraint of continuous slicing to allow non-continuous slicing, and adds a unit test for 3D non-continuous slicing.
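A hedged illustration of the kind of slicing this enables (assumed usage; requires a distributed launch such as `torchrun` to actually run):

```python
# Sketch only: slice out non-adjacent mesh dims ("dp" and "tp"), skipping "pp".
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))
dp_tp_mesh = mesh["dp", "tp"]  # a non-continuous slice across the 3-D mesh
```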

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310
Approved by: https://github.com/wanchaol
2024-08-05 09:30:07 +00:00
27f61eba58 serde sympy functions (#132493)
Summary: Sympy functions appearing in symbolic expressions inside tensor metadata were not being deserialized properly.

Test Plan: updated test

Differential Revision: D60573150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132493
Approved by: https://github.com/pianpwk
2024-08-05 08:08:50 +00:00
55b0c39d82 Reland "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132182)
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases  (#124969)""

The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.

The root cause: the logic for generating the mask in the Triton kernel needed an update after a recent refactoring of triton.py. This diff includes the fix for the root cause.

See D54134695 or #124969 for more details.

Test Plan:
Originally failed tests
f585704630
f585733786

Diff patched:
f586664028
f586663820

Differential Revision: D60458597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
2024-08-05 06:57:30 +00:00
ae44b8f410 [inductor] support vectorization for torch.argmax/min(float/int64_t)-> int64_t (#131016)
Support reduction argmin/max by scalar implementation.
TestPlan:
```
python test/inductor/test_cpu_repro.py -k test_argmax_argmin_with_nan_value
python test/inductor/test_cpu_repro.py -k test_argmin
python test/inductor/test_cpu_repro.py -k test_reduction_cpu_only
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131016
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-05 04:31:53 +00:00
1fb498d6e3 Add try except for _maybe_evaluate_static call in IndexPropagation (#132128)
Fixes the Inductor max-autotune mode failures of the below models:
- GPT2ForSequenceClassification
- PegasusForConditionalGeneration
- XGLMForCausalLM
- hf_GPT2
- tnt_s_patch16_224
```log
  File "/pytorch/torch/_inductor/index_propagation.py", line 329, in statically_true
    evaluated = self.shape_env._maybe_evaluate_static(
  File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1499, in wrapper
    return fn_cache(self, *args, **kwargs)
  File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4539, in _maybe_evaluate_static
    vr = var_ranges[k]
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
KeyError: m_start
```

The `_maybe_evaluate_static` call in `IndexPropagation` may fail. This PR adds a try/except following the approach in `torch/_inductor/sizevars.py`, by adding a common utility function.
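A minimal sketch of the kind of shared helper described above (the helper name and exact arguments are assumptions):

```python
def try_evaluate_static(shape_env, expr):
    # Mirrors the guarded pattern in torch/_inductor/sizevars.py: if the
    # evaluation raises (e.g. KeyError for a symbol missing from var_ranges),
    # treat the expression as not statically known instead of crashing.
    try:
        return shape_env._maybe_evaluate_static(expr)
    except Exception:
        return None
```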

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132128
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-05 01:02:51 +00:00
c7cfa51721 Always use high precision for SDPA math backend (#128922)
Summary:
feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.

Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, it is worth upcasting FP16/BF16 inputs to FP32, doing FP32/TF32 computations, and then downcasting the FP32 output back to FP16/BF16.

Differential Revision: D58710805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
2024-08-04 23:58:14 +00:00
01cdcbf7c8 [dynamo] revert map/zip iterator related changes (#132528)
Need to revert due to internal hangs: S437700

This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.

Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"

This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.

Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"

This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.

Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"

This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
2024-08-04 18:46:55 +00:00
09f9c256ad Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
6e79932543 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-04 18:43:36 +00:00
3558a8cf4a Revert "Add basic mypy annotations to dynamo (#132415)"
This reverts commit 71e22e0959eb8d5a66833bf5c6b5903536a5bef1.

Reverted https://github.com/pytorch/pytorch/pull/132415 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
f2ddd5e9e0 Revert "Add basic mypy annotations to inductor (#132416)"
This reverts commit 78927d37f6085a0b30269cceb731d8097302c091.

Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
9be33bc584 Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820)"
This reverts commit 6c65fd03942415b68040e102c44cf5109d2d851e.

Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/ZainRizvi due to Sorry, had to revert this to revert another PR that depends on this change ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2267629534))
2024-08-04 18:30:59 +00:00
0a25666f92 Revert "[dynamo] revert map/zip iterator related changes (#132528)"
This reverts commit e81e74ca6cb45e1ab831ddfe9a2ba5c7e17fa03f.

Reverted https://github.com/pytorch/pytorch/pull/132528 on behalf of https://github.com/ZainRizvi due to This stack entered a weird state in the diff train. Reverting and relanding to clean the state ([comment](https://github.com/pytorch/pytorch/pull/132528#issuecomment-2267628475))
2024-08-04 18:26:09 +00:00
fd4b649e6c [BE]: Simplify some list comps to generators C419 (#132578)
Simplifies some list comprehensions to generators, which is more efficient. The diffs were for the most part applied automatically with ruff.
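An illustrative example of the kind of C419 rewrite being applied:

```python
values = [1, 2, 3]

# Before: builds a throwaway list just to pass it to all()
ok = all([v > 0 for v in values])

# After: a generator expression avoids the intermediate list allocation
ok = all(v > 0 for v in values)
```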

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132578
Approved by: https://github.com/ezyang
2024-08-04 17:46:26 +00:00
4226ed1585 [BE] Format uncategorized Python files with ruff format (#132576)
Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #132574
2024-08-04 17:13:31 +00:00
c35061c542 Migrate Python code formatter from black to ruff format (#132574)
See also:

- #124845
- #123062

Closes #124845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132574
Approved by: https://github.com/ezyang
2024-08-04 17:13:31 +00:00
09fcd792eb [Fix]: ScriptObject lifting issue (#130952)
#### Issue
ScriptObject was previously treated as a normal attribute by the converter. This PR lifts it to be a constant and converts it directly to a GetAttr fx node. ScriptObject can also trigger `CallMethod`, and this PR adds that support as well.

#### Test Plan
Add test case for ScriptObject.
`pytest test/export/test_converter.py -s -k test_convert_script_object`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130952
Approved by: https://github.com/angelayi
2024-08-04 16:52:45 +00:00
5dac4d2c78 Revert "[easy] fix f-string messages in torch/_ops.py (#132531)"
This reverts commit 908d2a153b14cbb7a39c1f4ef9a77534cf2c71bf.

Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to still breaks tests ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2267584289))
2024-08-04 15:41:56 +00:00
cyy
105ba7b58c [5/N] Fix clang-tidy warnings in aten/src/ATen (#132565)
Follows #132001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132565
Approved by: https://github.com/Skylion007
2024-08-04 14:39:16 +00:00
908d2a153b [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
ghstack dependencies: #132356, #132466
2024-08-04 14:30:42 +00:00
87d46d70d7 [inductor] export kernel for gemm template. (#132580)
Changes:
1. Move `get_export_declaration` to `cpp_utils.py` as basic function.
2. Export kernel for gemm template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132580
Approved by: https://github.com/ezyang
2024-08-04 11:17:19 +00:00
d2dc173664 Remove lint dependency ufmt (#132573)
`ufmt` is a combination of `black + usort`.

This PR removes `ufmt` and run `black` and `usort` separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132573
Approved by: https://github.com/ezyang
ghstack dependencies: #129769, #132572
2024-08-04 10:24:09 +00:00
f7aeb394b6 [BE][Easy] Remove empty ISORT_SKIPLIST (#132572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132572
Approved by: https://github.com/ezyang, https://github.com/justinchuby
ghstack dependencies: #129769
2024-08-04 10:24:09 +00:00
f3fce597e9 [BE][Easy][17/19] enforce style for empty lines in import segments in torch/[a-c]*/ and torch/[e-n]*/ (#129769)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129769
Approved by: https://github.com/ezyang
2024-08-04 10:24:09 +00:00
2714adce20 [caffe2] Fix compiling ATen-hip in non-opt mode (#132581)
Summary:
It looks like https://github.com/pytorch/pytorch/pull/131894 accidentally broke non-opt hip builds. I.e., `is_flash_attention_available` doesn't get inlined in non-opt mode, so all of `can_use_flash_attention` is compiled into the final object file. This includes a reference to `aotriton::v2::flash::check_gpu`, which we haven't set up yet for HIP builds.

Test Plan:
CI

Differential Revision: D60720707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132581
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
2024-08-04 07:51:18 +00:00
cyy
522fa03e91 [Submodule] Bump ONNX to v1.16.2 (#132566)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132566
Approved by: https://github.com/justinchuby
2024-08-04 07:01:54 +00:00
2a8e94347f [TP] verify numeric parity on Transfromers for multiple iterations (#132543)
Before setting up the float8 numeric parity test, I have to set up a regular TP numeric parity test, preferably testing 10 iterations.

This PR sets a baseline of TP numerics. I can verify fp8 on top of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543
Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
2024-08-04 06:43:27 +00:00
8ff310392e add __torch_function__ handler to get_device cpp (#132567)
From the issue:
```
import torch

class CustomParameter(torch.nn.Parameter):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
         return func.__name__

x = CustomParameter(torch.rand(2))

print(x.square()) # 'square'
print(torch.square(x)) # 'square'
print(x.get_device()) # 'get_device'
print(torch.get_device(x)) # -1
```
after fix:
```
$ python repro.py
square
square
get_device
get_device
```

Fixes: https://github.com/pytorch/pytorch/issues/131944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132567
Approved by: https://github.com/ezyang
2024-08-04 04:26:30 +00:00
7f8a384a8f [inductor] add msvc_cl compiler check (#132571)
add `msvc_cl` compiler check.
Local test:
<img width="880" alt="image" src="https://github.com/user-attachments/assets/fe4da5e0-dd52-4dbc-831e-c32479e27a29">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132571
Approved by: https://github.com/ezyang
2024-08-04 03:48:25 +00:00
81b8d3586f Update torch-xpu-ops pin (ATen XPU implementation) (#132390)
Regular update.
1. 69 new ATen operators and variants are added. See https://github.com/intel/torch-xpu-ops/blob/main/yaml/xpu_functions.yaml.
2. Align with PyTorch in-tree to use safe data pointer access APIs.
3. Enable FP64 conversion emulation for some platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132390
Approved by: https://github.com/EikanWang
2024-08-04 02:22:46 +00:00
6ec4af6865 [Inductor][CPP] Add vectorization support for double (#131886)
Before:
```
extern "C"  void kernel(const double* in_ptr0, double* out_ptr0)
{
     #pragma omp parallel num_threads(112)
     {
         int tid = omp_get_thread_num();
         {
             #pragma omp for
             for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(1L))
             {
                 auto tmp0 = in_ptr0[static_cast<long>(x0)];
                 auto tmp1 = decltype(tmp0)(tmp0 * tmp0);
                 out_ptr0[static_cast<long>(x0)] = tmp1;
             }
         }
     }
 }
```

After:
```
extern "C"  void kernel(const double* in_ptr0, double* out_ptr0)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<long>(x0), 16);
                auto tmp1 = tmp0 * tmp0;
                tmp1.store(out_ptr0 + static_cast<long>(x0), 16);
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131886
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-04 02:13:21 +00:00
d984105748 Revert "[export] Convert autocast to HOO (#131914)"
This reverts commit b28c01d90d6575522d2240ce485d7dd87a7242aa.

Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/ezyang due to Failing lint, but was covered up by master failure on lint ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2267248773))
2024-08-04 02:10:35 +00:00
6c65fd0394 [inductor] Add type hints to functions in mkldnn_fusion.py (#131820)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820
Approved by: https://github.com/eellison
2024-08-03 22:11:47 +00:00
cyy
bc46f205c4 [15/N] Fix clang-tidy warnings in jit (#132564)
Follows  #132477

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132564
Approved by: https://github.com/Skylion007
2024-08-03 19:33:24 +00:00
00097f3458 Revert "C++ network flow implementation in c10 (#132188)"
This reverts commit dccce77935bb023f225b9972929fd9213e754e84.

Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be failing internal tests. Please see D60702564 to investigate ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2267098420))
2024-08-03 18:44:28 +00:00
e3387c6712 [inductor] use uint64_t replace long to add Windows support. (#132491)
The `long` type differs between Windows and Linux.
This PR uses `int64_t` instead of `long` on Windows. The `LL` suffix is used to initialize `int64_t` values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132491
Approved by: https://github.com/malfet
2024-08-03 18:38:30 +00:00
bbce517221 [Inductor][FlexAttention] TestFlexAttention -> TestFlexDecoding (#132547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132547
Approved by: https://github.com/Chillee
ghstack dependencies: #132015
2024-08-03 17:26:44 +00:00
21d02f8b4b Revert "[easy] fix f-string messages in torch/_ops.py (#132531)"
This reverts commit 25903f3932b3a24d4edf323484d2159f3ac92999.

Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to broke lint and tests due to conflict with 132377 ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2266743391))
2024-08-03 14:49:07 +00:00
a896fb1b36 check unsupported sympy functions for runtime asserts (#132457)
Some sympy Functions aren't supported by sympy_interp(); we can't turn them into FX nodes, so currently the runtime asserts CSE pass avoids CSE'ing on any expression containing a sympy Function. https://github.com/pytorch/pytorch/pull/132325 started tracking unsupported functions, so we switch the check to that to be more precise. We also check for and skip unsupported functions when adding asserts; previously we only did the check for CSE, not when adding new expressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132457
Approved by: https://github.com/avikchaudhuri
2024-08-03 10:17:25 +00:00
0e7e61f7ce Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-03 09:43:38 +00:00
159d508f03 [Fix]: prim::If with multiple outputs and input return directly (#131779)
#### Issue
The test is not working for prim::Loop with multiple outputs. Additionally, this fixes an issue where an input is directly returned, which is not supported by HigherOrderOp.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_convert_if_multiple_out`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131779
Approved by: https://github.com/angelayi, https://github.com/SherlockNoMad
2024-08-03 08:07:21 +00:00
36ec0fdf10 [inductor] check compiler exist on Windows. (#132533)
In the current Windows env, if we have not activated the MSVC env, it does not raise a clear error pointing to the compiler:
<img width="904" alt="image" src="https://github.com/user-attachments/assets/725ea608-d181-40b1-8930-42fe2b32643a">

With this PR, we can help users see that the issue comes from the compiler.
<img width="1034" alt="image" src="https://github.com/user-attachments/assets/8515a796-e3e9-4909-a68f-8a14d4864951">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132533
Approved by: https://github.com/jansel
2024-08-03 07:47:11 +00:00
8ad9f89ccc [inductor] Reland: Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#132562)
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.

We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
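As a hedged illustration of using the escape hatch (the flag name is assumed from the original PR and may differ; check `torch/_inductor/config.py` for the exact name):

```python
# Sketch only: ignore @triton.autotune arguments that PT2 cannot handle yet.
import torch._inductor.config as inductor_config

inductor_config.unsafe_ignore_unsupported_triton_autotune_args = True
```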

Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Differential Revision: D60701839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
2024-08-03 06:31:28 +00:00
06581c277a [dynamo][stable-diffusion] Support dict(obj) on constrained subclasses of dict and OrderedDict (#132558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132558
Approved by: https://github.com/jansel
2024-08-03 06:31:00 +00:00
b28c01d90d [export] Convert autocast to HOO (#131914)
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.

Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.

Test Plan:
CI

```
parsh --build-flags fbcode//mode/dev-nosan  fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```

Reviewed By: angelayi

Differential Revision: D60206382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
2024-08-03 05:48:57 +00:00
ed4493de0e dim name is identifier (#132557)
Summary: Dim names appear in suggested fixes, so they should be valid Python identifiers.
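An illustrative example of why this matters (usage assumed from `torch.export.Dim`):

```python
from torch.export import Dim

batch = Dim("batch", min=1, max=1024)   # "batch" is a valid identifier
# Dim("batch size") would produce suggested fixes that are not valid Python.
```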

Test Plan: none

Differential Revision: D60696854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132557
Approved by: https://github.com/pianpwk
2024-08-03 05:28:50 +00:00
1f5dfe00da Subtracer should always be real to inherit fake/real tensors from parent config (#132488)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132488
Approved by: https://github.com/zou3519
2024-08-03 04:55:42 +00:00
6966d44eda [ONNX] Rename _internal/exporter to _exporter_legacy (#132429)
The next PR will be creating an `exporter` directory to house logic from `torch-onnx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132429
Approved by: https://github.com/titaiwangms
2024-08-03 04:23:05 +00:00
5973aec671 [fx] python_code(verbose=True): show size/strides for all tensors (#132192)
python_code(verbose=True) (or print_readable()) generates a string with the code representing the fx graph, with extra annotations indicating the size or stride of the tensor. Currently, it only shows sizes/strides for FakeTensors provided in metadata. For subclass tensors like NestedTensor, the outer class (provided in the node metadata) will be a non-FakeTensor and the inner tensors will be fake. This PR expands the conditional to show sizes/strides for all tensors, not just FakeTensors.

Testing: I ran this test script (below), ran it with `TORCH_LOGS=+dynamo` and found in the logs the graph shown below - we see that the input nested tensor has sizes and strides associated with it. Also, I stacked a diff on top of this one that forces the readable graph to be generated whenever PT2 is in use in tests, which should hopefully find any issues; https://github.com/pytorch/pytorch/pull/132195 shows no significant failures except for preexisting failures.

test script:
```python
import torch

def fn(x):
    return x.cos()

nt = torch.nested.nested_tensor_from_jagged(
    torch.randn(10, 10),
    torch.tensor([0, 1, 3, 6, 10]),
)

torch.compile(fn)(nt)
```

logs excerpt:
```
[0/0] [__graph_code] TRACED GRAPH
[0/0] [__graph_code]  ===== __compiled_fn_1 =====
[0/0] [__graph_code]  /data/users/dberard/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.M

[0/0] [__graph_code]     def forward(self, L_x_: "f32[4, zf1, 10][10*zf1, 10, 1]cpu", zf1: "Sym(zf1)"):
[0/0] [__graph_code]         l_x_ = L_x_
[0/0] [__graph_code]
[0/0] [__graph_code]          # File: /data/users/dberard/scripts/nt_print_graph.py:4 in fn, code: return x.c

[0/0] [__graph_code]         cos: "f32[4, zf1, 10][10*zf1, 10, 1]cpu" = l_x_.cos();  l_x_ = None
[0/0] [__graph_code]         return (cos,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132192
Approved by: https://github.com/Chillee
2024-08-03 02:54:32 +00:00
0b571b1058 [codemod][pyre] Add missing Pyre mode headers (#132548)
Reviewed By: connernilsen

Differential Revision: D59849027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132548
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-03 02:32:53 +00:00
373e9be457 [Inductor][FlexAttention] Add kwarg to top level for users to specify kernel params (#132015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132015
Approved by: https://github.com/Chillee
2024-08-03 02:27:02 +00:00
25903f3932 [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
ghstack dependencies: #132356, #132466
2024-08-03 02:23:44 +00:00
419b76c4ac [dynamo] Reland 132308, 132314, 132318, 132334 - Make builtin nn modules attributes static (#132539)
Relanding 4 PRs ending at https://github.com/pytorch/pytorch/pull/132334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132539
Approved by: https://github.com/Skylion007, https://github.com/yanboliang, https://github.com/mlazos
2024-08-03 02:08:22 +00:00
841cadd555 Fix discrepancies from 129973 (#132545)
#129973 ([D59132793](https://www.internalfb.com/diff/D59132793)) was exported missing changes in `test/cpp/jit/CMakeLists.txt` this PR remediates that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132545
Approved by: https://github.com/kit1980
2024-08-03 01:57:49 +00:00
243a763e1b ci: Remove split-build CUDA testing from pull.yml (#132537)
This is already represented in trunk.yml so it seems a bit redundant to include this level of testing in pull.yml.

I've been observing a large spike in our usage of `g3.4xlarge` which seems to correspond to these builds in particular so removing these from `pull.yml` since they are already covered in `trunk.yml`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132537
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2024-08-03 01:24:17 +00:00
a503136583 [export] Detect whether case_name is registered in exportdb (#132420)
Summary:
- Moves logging functionalities into the `torch/_export/db/logging.py` file.
- Adds a check in `_dynamo/eval_frame.py` to check for optional input and error out with `UnsupportedError`.
- Changes the case name of `torch_sym_int` to `unsupported_operator`.
- Checks if the case name is registered in exportdb; if so, we give a link to the case in exportdb.
- TODO: add test

Test Plan:
CI

Running the example in https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input gives the following error logging:

```
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] Parameter y is optional with a default value of tensor([[-0.1633,  1.2414, -0.1071],
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086]         [-0.1936, -0.9425, -0.0824]])
E0730 10:53:33.688000 4155538 torch/export/_trace.py:1043] See optional_input in exportdb for unsupported case.                 https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input
......
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/389acaeb40d57230/tutorials/pytorch/nntest/__torchtest__/torchtest#link-tree/torch/_dynamo/eval_frame.py", line 1091, in produce_matching
    raise Unsupported(
torch._dynamo.exc.Unsupported: Tracing through optional input is not supported yet
```

It also logs a `export.error.classified` event in Scuba.

Reviewed By: zhxchen17

Differential Revision: D60427208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132420
Approved by: https://github.com/zhxchen17
2024-08-03 01:08:48 +00:00
64720f3b89 Introduce checks to validate public API tests (#131390)
This PR introduces a new sanity check for the public API tests in `.ci/pytorch/test.sh`.
* Validates two public API tests:
    1. Ensures `test_correct_module_names` fails when a new file OR an existing file adds an invalid public API function (e.g. one whose `__module__` is unset).
    2. Ensures `test_modules_can_be_imported` fails when a module underneath `torch/` cannot be imported.
* Runs this in CI as part just before the pre-existing FC / BC checks.

I've verified that re-introducing the bug that #131386 fixed causes the new check to fail:
![public_api_failure](https://github.com/user-attachments/assets/376ddef3-d14a-41f6-93e2-f935deb6555a)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131390
Approved by: https://github.com/albanD
2024-08-03 00:29:00 +00:00
cyy
fcef6cc6d1 [13/N] Fix clang-tidy warnings in jit (#132477)
Follows  #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132477
Approved by: https://github.com/Skylion007
2024-08-03 00:13:18 +00:00
705ac311aa Fix Distributed EventList usage (#132448)
Summary: Summarized here: https://github.com/pytorch/pytorch/issues/132227

Test Plan: Use suggestion in issue, should see test passing again

Differential Revision: D60614690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132448
Approved by: https://github.com/aaronenyeshi
2024-08-02 23:55:31 +00:00
e3513fb2af [ts_converter]handle python list append, list add, aten.to.dtype+mutation_op pattern (#132529)
Summary:
#### Description
Add support for aten::append with a python function that returns a new list with the appended element. We then update the `fx_node` in the `name_to_node` mapping.

aten::append contributed by Jiashen Cao <jiashenc@meta.com>

Fix conversion for csr_ranker_test

```
    model_name: csr_ranker_test_4.ptl
    has_ts_model: True
    has_sample_inputs: True
    ops_maybe_missing_meta: set()
    script_objects: set()
    ts_can_run: True
    ts_run_exception: None
    can_convert: True
    convert_exception: None
    ep_result_correct: True
    ep_run_exception: None
    can_package: True
    package_exception: None
    sigmoid_can_run: False
    sigmoid_run_exception: RuntimeError('not for symbolics')
    sigmoid_result_correct: None
```

Test Plan:
test_aten_add_t
test_aten_append_t
test_aten_to_dtype_with_mutating_storage

buck2 run mode/opt sigmoid/inference/ts_migration:main -- --mode test_one --model_name csr_ranker_test

Differential Revision: D60635893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132529
Approved by: https://github.com/jiashenC
2024-08-02 23:32:37 +00:00
85f19ce14a Support meta["val"] that is a dict, for triton kernels and for the partitioner (#132466)
Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]`  to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner.

Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466
Approved by: https://github.com/zou3519
ghstack dependencies: #132356
2024-08-02 23:24:29 +00:00
bcac71517c [Profiler] Test Logging for Empty Traces (#132444)
Summary: Tests D60311331. Please see that diff for explanation

Test Plan: This diff is adding a test itself

Reviewed By: aaronenyeshi

Differential Revision: D60311555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132444
Approved by: https://github.com/aaronenyeshi
2024-08-02 22:04:15 +00:00
1962f9475f [NJT][flop counter] attention: if offsets are fake, use max seqlen (#132356)
The flop counter is used by the partitioner, in which case the tensors passed in can be fake.

The flop computations for nested attention use the offsets to determine the actual amount of compute that will be done. But when the offsets are fake, we end up with unbacked symints (from `(offsets[1:] - offsets[:-1]).to_list()`). If we find that the offsets are fake or functional tensors, then use the max sequence length instead.
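A hedged sketch of the fallback described above (helper and structure assumed, not the PR's actual code):

```python
from torch._subclasses.fake_tensor import FakeTensor

def sequence_lengths(offsets, max_seqlen, num_sequences):
    if isinstance(offsets, FakeTensor):
        # Under fake tensors the offsets values are unknown (they would become
        # unbacked symints), so assume every sequence has the maximum length.
        return [max_seqlen] * num_sequences
    return (offsets[1:] - offsets[:-1]).tolist()
```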

Repro: https://gist.github.com/davidberard98/903fb3e586edb6d1d466786e1a610eba

Differential Revision: [D60597463](https://our.internmc.facebook.com/intern/diff/D60597463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132356
Approved by: https://github.com/soulitzer
2024-08-02 20:42:29 +00:00
37c3d503b7 [pipelining] Make test_schedule quiet (#132369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132369
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810, #130378
2024-08-02 20:38:17 +00:00
7c1cca9fda [pipelining] Add schedule send/recv pass (#130378)
Inserts send/recv ops where needed in a compute-only pipeline schedule.

Any F or B action will require a recv op for its input and a send op
for its output, except for at the ends of the pipeline.

To avoid hangs caused by mixed-up orderings of sends/recvs across ranks,
we pick one compute action at a time and insert both its send op (on
that rank's schedule), and the matching recv op for the recipient stage
(on the schedule for the rank for that stage).

TODO
Currently ignores a couple of edge cases
- ignores batching (which is an optimization)
- ignores cases where a stage sends to another stage on the same rank,
  and should skip the send/recv and directly access memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810
2024-08-02 20:38:17 +00:00
625f494619 [Pipelining] Add schedule unshard/reshard pass (#129810)
Adds fsdp unshard/reshard ops to a compute-only schedule.

Operates on one pp-rank's schedule at a time, since there is no
cross-pp-rank coordination needed for FSDP.  (Unshard/Reshard is across
DP ranks within a PP group).

Uses a heuristic based on examining the next N stages to run compute
operations on this rank, evicting (resharding) and fetching (unsharding)
ahead of time to give unshard operations a chance to overlap with
compute and PP comms.
- this heuristic has not been validated and may not be optimal
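
As a rough, unvalidated sketch of that lookahead (names and tuple shapes are illustrative):

```python
def add_unshard_reshard(rank_actions, lookahead=2):
    # Keep the stages needed by the next `lookahead` compute actions unsharded;
    # reshard anything no longer needed before running the current action.
    out, resident = [], set()
    for i, action in enumerate(rank_actions):      # action = (kind, stage, microbatch)
        upcoming = {stage for _, stage, _ in rank_actions[i:i + lookahead]}
        for stage in sorted(resident - upcoming):
            out.append(("RESHARD", stage))
            resident.discard(stage)
        for stage in sorted(upcoming - resident):
            out.append(("UNSHARD", stage))
            resident.add(stage)
        out.append(action)
    return out
```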

Makes the assumption that it's fine to add the UNSHARD/RESHARD actions
to the schedule regardless of if FSDP will actually be used.
- this way, users do not have to tell us at PP schedule creation time if
  they plan to use FSDP or DDP
- it is trivial to implement UNSHARD/RESHARD as no-ops inside the
  runtime, if FSDP is not detected on the stage module

TODO
- also add FSDP's reduce-scatter? or is it sufficient to leave this
  handled by PipelineStage at 'last backward' time
- validate 'next N stages' heuristic and expose an API if needed
- add an e2e test

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-08-02 20:38:17 +00:00
f379bbd46d [dynamo] support inspect.signature.bind (#132330)
Fixes https://github.com/pytorch/pytorch/issues/93760.

This was not that small of a task...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132330
Approved by: https://github.com/jansel
ghstack dependencies: #132329
2024-08-02 20:37:05 +00:00
642257db1a Update the FQN for auto_functionalized HOO. (#132171)
Summary:
as title.

torch._higher_order_ops.auto_functionalize.auto_functionalized is a Python FQN that should NOT be used to talk to the backends; we should use the standard FQN torch.ops.higher_order.auto_functionalized instead.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_custom_op_auto_functionalize_pre_dispatch

Differential Revision: D60468759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132171
Approved by: https://github.com/SherlockNoMad
2024-08-02 20:34:50 +00:00
dccce77935 C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
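
For reference, this is the kind of networkx usage the partitioners rely on today and that the C++ implementation is meant to mirror (the exact c10 bindings are not shown in this commit message):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("source", "a", capacity=3.0)
G.add_edge("a", "sink", capacity=2.0)
G.add_edge("source", "sink", capacity=1.0)

flow_value, flow_dict = nx.maximum_flow(G, "source", "sink")
cut_value, (reachable, non_reachable) = nx.minimum_cut(G, "source", "sink")
```
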
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-02 20:30:59 +00:00
f49d5e30eb Change owners of test/test_transformers.py to module: multi-headed-attention (#132519)
So flaky tests get tagged with `module: multi-headed-attention` instead of `module: nn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132519
Approved by: https://github.com/Skylion007
2024-08-02 20:12:33 +00:00
e81e74ca6c [dynamo] revert map/zip iterator related changes (#132528)
Need to revert due to internal hangs: S437700

This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.

Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"

This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.

Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"

This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.

Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"

This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
2024-08-02 19:40:57 +00:00
b71cd149ce Fix file lock issue in AotCodeCompiler (#132343)
Summary:
It looks like there are several places in AotCodeCompiler that write files in a way that aren't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in the lock path name, but the path for the "consts_path" is computed separately. Therefore, I see things like this:

- The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z`
- The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock`
- The cpp path (which also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp`
- The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin`

So we have different test instances using different lock paths, but touching the same consts_path and therefore stomping on each other's consts_path. To fix, include the key in the consts_paths.
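
A minimal sketch of the intended layout (paths and helper name are illustrative):

```python
import os

def artifact_paths(cache_dir: str, key: str):
    # Derive every artifact, including the consts file, from the same content
    # key that names the lock, so runs with different keys never collide.
    lock_path = os.path.join(cache_dir, "locks", f"{key}.lock")
    cpp_path = os.path.join(cache_dir, key, f"{key}.cpp")
    consts_path = os.path.join(cache_dir, key, f"{key}.bin")  # now keyed too
    return lock_path, cpp_path, consts_path
```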

Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it.

Differential Revision: D60552021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343
Approved by: https://github.com/desertfire
2024-08-02 19:01:37 +00:00
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79eaa63e2f423d925bbd42330582983f.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
afca6f5b47 [PT2][Optimus] Add missing example value for introduced nodes (#132297)
Summary:
We observed that many nodes introduced during the split-cat and batch-fusion pattern optimizations did not have example value metadata, which causes problems in our follow-up pattern optimizations, so we add all the missing values.

We also fix bugs in some meta updates and a corner-case bug in the old pattern, which caused problems in the follow-up pattern optimization.

We delete the merge_stack_tahn_unbind_pass pattern, which was designed for the cmf model; it can be replaced by the more advanced pattern we added, so we remove it for easier maintenance.

Test Plan:
# unit test
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/15481123762720165
Network: Up: 230KiB  Down: 702KiB  (reSessionID-756346bf-6da3-4fa0-8d03-1b4fd61e0a7a)
Jobs completed: 30. Time elapsed: 7:23.9s.
Cache hits: 20%. Commands: 5 (cached: 1, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

```
buck2 test @mode/opt pytorch/diff_train_tests/ads/optimus:local_pt2_runner
```

Network: Up: 1.3GiB  Down: 84MiB  (reSessionID-ff135cdd-e42c-4ab5-8217-907ada465f01)
Jobs completed: 61. Time elapsed: 21:56.5s.
Cache hits: 0%. Commands: 39 (cached: 0, remote: 0, local: 39)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 752, 'pattern_matcher_count': 732, 'normalization_pass': 328, 'normalization_aten_pass': 12, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1, 'fxgraph_cache_miss': 1})

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132297
Approved by: https://github.com/jackiexu1992
2024-08-02 18:57:12 +00:00
24d0a32f98 Revert "[dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308)"
This reverts commit aa0ed2496f5bf38768c9eda13112fd43359548bb.

Reverted https://github.com/pytorch/pytorch/pull/132308 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132308#issuecomment-2265959993))
2024-08-02 18:55:51 +00:00
e696f17467 Revert "[dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314)"
This reverts commit d6a82ce39bd8e705a4cc2cebb886f4476a7250cf.

Reverted https://github.com/pytorch/pytorch/pull/132314 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132314#issuecomment-2265953367))
2024-08-02 18:52:38 +00:00
e4e3575fb0 Revert "[11/N] Use std::nullopt and std::optional (#132396)"
This reverts commit d7d61904936617a6a43782868d0b1004cb70dfc0.

Reverted https://github.com/pytorch/pytorch/pull/132396 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/132396#issuecomment-2265952528))
2024-08-02 18:49:42 +00:00
59b73079a0 Revert "Always use high precision for SDPA math backend (#128922)"
This reverts commit fbf3bc0a602b4ec1eab169202d5b1158fe2c1def.

Reverted https://github.com/pytorch/pytorch/pull/128922 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128922#issuecomment-2265949958))
2024-08-02 18:46:50 +00:00
193a19ee91 Revert "[dynamo] Treat attr of unspecialized buiitin nn modules as static (#132318)"
This reverts commit 7b816d7d6d5d521f913c78f897790f66112c7d84.

Reverted https://github.com/pytorch/pytorch/pull/132318 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132318#issuecomment-2265945433))
2024-08-02 18:43:32 +00:00
b8f7019df0 Revert "[dynamo] Track params/buffers and mark them as static (#132334)"
This reverts commit babb249a89b51931afe16db8b498ff72cd433afc.

Reverted https://github.com/pytorch/pytorch/pull/132334 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132334#issuecomment-2265942261))
2024-08-02 18:41:19 +00:00
e0514a5b99 [AOTI][refactor] Consolidate how python_kernel_name is set (#132320)
Summary: Similar to the refactoring of set_cpp_kernel, consolidate the ways of setting python_kernel_name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132320
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #132319
2024-08-02 18:34:25 +00:00
a9e1133faa [AOTI][refactor] Move set_cpp_kernel to base class (#132319)
Summary: Consolidate how cpp_kernel_name is set and make it a method in the base ExternKernel class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132319
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-08-02 18:34:24 +00:00
df781343e2 Link libc10 to pthreads (#132484)
It gets linked as a transitive dependency of `libmkl` on x86_64, but it must be specified explicitly on s390x.

The linking issue only appears when using gcc-13 with the gold linker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132484
Approved by: https://github.com/malfet
2024-08-02 18:03:44 +00:00
19897a1647 [export] change deepcopy to copy in _replace_set_grad_with_hop pass.. (#132181)
Summary:
Fixes T197371132.

Previously, we called copy.deepcopy to avoid mutating the original signature. However, this causes errors when the signature references a FakeScriptObject, which in turn references a real torch.ScriptObject, failing with "The tensor has a non-zero number of elements, but its data is not allocated yet."

We therefore just change it to a shallow copy. This should be good enough for guarding the signature.

Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_ebc_non_strict_export"

Differential Revision: D60476839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132181
Approved by: https://github.com/BoyuanFeng
2024-08-02 17:57:09 +00:00
cyy
87d58cc81f [4/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132001)
Follows #132000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132001
Approved by: https://github.com/Skylion007
2024-08-02 17:42:02 +00:00
cyy
207e24ff83 Enable clang-tidy on aten/src/ATen/cudnn/* (#130133)
Continued work of applying clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130133
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-08-02 17:39:37 +00:00
0c491702c4 [ONNX] Define the TORCH_ONNX_USE_EXPERIMENTAL_LOGIC flag (#132299)
Define the `TORCH_ONNX_USE_EXPERIMENTAL_LOGIC` flag to allow for enabling the new torch.onnx logic and hiding them during migration and testing. The actual logic migration will happen after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132299
Approved by: https://github.com/titaiwangms
2024-08-02 17:06:11 +00:00
9167113c16 [easy][MPS] add torch.mps.is_available() (#132426)
Just return "torch.mps.device_count() > 0", which, based on the implementation of device_count(), seems to be equivalent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132426
Approved by: https://github.com/malfet
2024-08-02 17:05:49 +00:00
fc32732596 Don't attempt to compute hints for unbacked expressions (#132060)
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
2024-08-02 16:39:14 +00:00
8fff976355 Revert "Refactor thunkify to return proper thunk abstraction (#132407)"
This reverts commit d903e664c6b70ad17e0b316ef39d71be5edddc87.

Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))
2024-08-02 16:32:43 +00:00
1197550876 Revert "Don't attempt to compute hints for unbacked expressions (#132060)"
This reverts commit d342dc0179944dd317b509b3432da81701836444.

Reverted https://github.com/pytorch/pytorch/pull/132060 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))
2024-08-02 16:32:43 +00:00
296c339f98 Ensure compiler collective is called even when no graph is compiled (#132163)
It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163
Approved by: https://github.com/jansel
2024-08-02 16:31:54 +00:00
82b6480b0a Update SavedTensorHooks TLS stack to use SafePyObject (#131700)
Previously, we must manually manage refcounting when updating the TLS saved variable stack. With this PR, things should be handled automatically by the SafePyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131700
Approved by: https://github.com/albanD
2024-08-02 16:27:16 +00:00
9eeb5eebab Revert "Ensure compiler collective is called even when no graph is compiled (#132163)"
This reverts commit 0d9c9716b2db52281f6f10a113e07936deeb6e0a.

Reverted https://github.com/pytorch/pytorch/pull/132163 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132163#issuecomment-2265729449))
2024-08-02 16:16:31 +00:00
fca2dba7ca [pytorch][counters] Pybind for WaitCounter (#132357)
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).

Test Plan: unit test

Differential Revision: D60557660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132357
Approved by: https://github.com/jamesperng, https://github.com/asiab4
2024-08-02 16:08:10 +00:00
d224857b3a Revert "Change signature of CompilerFn for register_backend decorator (#131880)"
This reverts commit ccf9ce8e8c3c86269003547d976da5ed1fc9511b.

Reverted https://github.com/pytorch/pytorch/pull/131880 on behalf of https://github.com/albanD due to Breaking lint ([comment](https://github.com/pytorch/pytorch/pull/131880#issuecomment-2265682757))
2024-08-02 15:49:09 +00:00
63eb06c051 Disable SymDispatchMode when torch.compile'ing (#132433)
Partially addresses https://github.com/pytorch/pytorch/issues/132417

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132433
Approved by: https://github.com/ydwu4
2024-08-02 15:23:49 +00:00
cyy
5aafdc2f87 [3/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132000)
Follows #131834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132000
Approved by: https://github.com/ezyang
2024-08-02 15:00:38 +00:00
78f4a3919f Remove duplicate XPU switch case in DispatchStub (#132480)
This PR fixes the issue mentioned in https://github.com/pytorch/pytorch/issues/132481. Duplicated XPU switch cases exist in `DispatchStub.cpp` and this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132480
Approved by: https://github.com/nautsimon, https://github.com/malfet
2024-08-02 14:39:00 +00:00
ccf9ce8e8c Change signature of CompilerFn for register_backend decorator (#131880)
## Description
Add `...` to show that the CompilerFn for a custom backend can take additional options.
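
For illustration, a custom backend whose compiler function takes an extra keyword option (the backend itself is a toy sketch, not part of this PR):

```python
import torch
from torch._dynamo import register_backend

@register_backend
def debug_backend(gm: torch.fx.GraphModule, example_inputs, *, verbose: bool = False):
    if verbose:
        gm.graph.print_tabular()
    return gm.forward  # fall back to eager execution of the captured graph

compiled = torch.compile(torch.sin, backend="debug_backend")
compiled(torch.randn(4))
```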

Re: Recreated closed PR https://github.com/pytorch/pytorch/pull/110006
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131880
Approved by: https://github.com/jansel
2024-08-02 14:30:58 +00:00
053e5080f6 Enable exception chaining in call_user_compiler (#131186)
Enable exception chaining for the BackendCompilerFailed exception in call_user_compiler. This prevents the original exception and traceback, which are often the most useful for debugging, from being discarded.
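
The underlying pattern is plain `raise ... from e`; a minimal sketch (the class here is a stand-in, not the real torch._dynamo.exc type):

```python
class BackendCompilerFailed(RuntimeError):
    pass

def call_user_compiler(compiler_fn, gm):
    try:
        return compiler_fn(gm)
    except Exception as e:
        # Chaining with "from e" keeps the original exception and traceback
        # attached via __cause__ instead of discarding them.
        raise BackendCompilerFailed(f"backend raised: {e}") from e
```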

Example output without the patch
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(]
> [Trace back from call_user_compiler to  _inplace_generalized_scatter raise RuntimeError]
>  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
>  RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

Example output with the patch
> Traceback (most recent call last):
> [Traceback from_inplace_generalized_scatter to raise error_type(message_evaluated)]
> RuntimeError: expand: attempting to expand a dimension of length 2!
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from  call_user_compiler to  _inplace_generalized_scatter raise RuntimeError]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e) with e]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131186
Approved by: https://github.com/jansel
2024-08-02 14:07:06 +00:00
48929184e9 AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
2024-08-02 13:54:37 +00:00
cyy
b9cb1abf65 [12/N] Use std::optional (#132361)
Follows #132396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132361
Approved by: https://github.com/eqy
2024-08-02 13:46:46 +00:00
56f2917bef [dynamo] Bugfix for recently added str handler (#132461)
There is probably more work needed to improve support, but this is a hot fix so we don't fail on `.__func__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132461
Approved by: https://github.com/williamwen42
ghstack dependencies: #132425
2024-08-02 13:16:39 +00:00
0d9c9716b2 Ensure compiler collective is called even when no graph is compiled (#132163)
It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163
Approved by: https://github.com/jansel
2024-08-02 12:18:34 +00:00
d342dc0179 Don't attempt to compute hints for unbacked expressions (#132060)
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
ghstack dependencies: #131649, #132407
2024-08-02 12:09:37 +00:00
d903e664c6 Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
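
A minimal sketch of such a thunk abstraction (illustrative, not the exact implementation in this PR):

```python
class Thunk:
    def __init__(self, fn):
        self._fn, self._value, self._forced = fn, None, False

    def force(self):
        if not self._forced:
            # Compute once, cache the result, and drop the reference to the
            # original callable so it can be garbage collected.
            self._value, self._fn, self._forced = self._fn(), None, True
        return self._value

def thunkify(fn, *args, **kwargs):
    return Thunk(lambda: fn(*args, **kwargs))
```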

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
2024-08-02 12:09:37 +00:00
290f09f829 Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-02 12:00:46 +00:00
8668bc279d [inductor] continue to fix restrict keyword. (#132463)
This is continued work from https://github.com/pytorch/pytorch/pull/132394; all `restrict` keywords in `cpp_micro_gemm.py` are now fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132463
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-02 11:09:17 +00:00
d2e9a8bf6d [Reland] Fix inlining module-scoped store global (#132439)
Reland https://github.com/pytorch/pytorch/pull/132224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132439
Approved by: https://github.com/anijain2305
2024-08-02 09:13:52 +00:00
a4ea776881 Add pinned memory support to sparse COO/CSR/CSC/BSR/BSC tensors (#129645)
As in the title:

To register indices/values of a sparse XYZ tensor with CUDA, the following methods are supported
- `sparse_xyz_tensor(indices, values, pin_memory=True)`
- `sparse_xyz_tensor(indices, values).pin_memory()`
- `sparse_xyz_tensor(indices.pin_memory(), values.pin_memory())`
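
A minimal usage sketch of the three spellings for the COO layout (assumes a CUDA-enabled build, since pinning requires one; the other sparse layouts follow the same pattern):

```python
import torch

indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1.0, 2.0])

a = torch.sparse_coo_tensor(indices, values, (2, 2), pin_memory=True)
b = torch.sparse_coo_tensor(indices, values, (2, 2)).pin_memory()
c = torch.sparse_coo_tensor(indices.pin_memory(), values.pin_memory(), (2, 2))
```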

Fixes https://github.com/pytorch/pytorch/issues/115330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129645
Approved by: https://github.com/amjames, https://github.com/cpuhrsch, https://github.com/eqy
2024-08-02 08:55:55 +00:00
babb249a89 [dynamo] Track params/buffers and mark them as static (#132334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132334
Approved by: https://github.com/ezyang, https://github.com/mlazos
2024-08-02 08:55:43 +00:00
2ee9895304 Support optimizer capturable on hpu and xpu (#132119)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132119
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-08-02 08:19:52 +00:00
f936e68506 [CI] Update CPU inductor smoke test model list and target (#132221)
Fixes #132097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132221
Approved by: https://github.com/desertfire
2024-08-02 07:09:54 +00:00
eqy
e5560d10f4 [CUDA][SDPA] Fix expect export on sm90+ (#132194)
CC @drisspg not sure what is causing the scale=0.125 to be omitted here...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132194
Approved by: https://github.com/drisspg
2024-08-02 05:43:58 +00:00
7d8b95e8fb [easy] more debug in partitioner assert (#132456)
Print the name of the node that didn't have a good meta['val']. An internal model is failing with this assert; we need this info to debug further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132456
Approved by: https://github.com/Chillee
2024-08-02 05:07:01 +00:00
cyy
35d14d22a0 Fix some issues detected by static analysis tools (#131989)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131989
Approved by: https://github.com/ezyang
2024-08-02 04:18:57 +00:00
5ea0f51187 [Dynamo] Support abc.MutableMapping.get (#132363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132363
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2024-08-02 04:17:35 +00:00
2b86a7fcc7 fix printing of scores and mods names (#132424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132424
Approved by: https://github.com/Skylion007
2024-08-02 03:30:23 +00:00
cyy
07fe1dd58f [13/N] Fix clang-tidy warnings in jit (#132411)
Follows  #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132411
Approved by: https://github.com/Skylion007
2024-08-02 03:14:09 +00:00
1250171866 Use fresh inductor cache on unit tests (#132432)
Summary: This makes it so that stress tests in separate processes on the same machine don't clobber each other's directories. InductorTestCase will automatically make a fresh tmpdir for each unit test.

Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results
```

Now passes

Differential Revision: D60604811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432
Approved by: https://github.com/masnesral
2024-08-02 03:02:36 +00:00
6c4ce4331c [dynamo][exception] Raise Observed KeyError exception for dict __getitem__ (#132425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132425
Approved by: https://github.com/yanboliang, https://github.com/Skylion007
2024-08-02 02:58:31 +00:00
cd5452aace [CUDA] is_bf16_supported() should not crash if there are no GPUs (#132313)
`False` is the correct answer on a system that does not have any CUDA GPUs.
- Added regression test to TestTorch.

Fixes https://github.com/pytorch/pytorch/issues/132303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132313
Approved by: https://github.com/eqy, https://github.com/syed-ahmed
2024-08-02 02:50:43 +00:00
3a355c1891 Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630)
Fixes #130916
Per the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), the bins tensor must be an increasing sequence; random input doesn't make sense for torch.histogram.
The test compares the CPU backend against another backend. When the input is random, the kernel implementations in other backends have to align exactly with the CPU kernel, or the test fails.
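
A minimal example of a well-formed sample under these semantics (bin edges must be increasing):

```python
import torch

x = torch.randn(1000)
edges = torch.linspace(-3.0, 3.0, steps=11)   # 10 bins, strictly increasing edges
hist, bin_edges = torch.histogram(x, bins=edges)
```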

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-08-02 01:51:09 +00:00
bc510916fa Only make wait_tensor as a side_effect op (#132341)
Summary:
https://github.com/pytorch/pytorch/pull/131023 added all the collective ops to the side-effect list, but we should only mark wait_tensor as a side_effect op because every collective op should have a corresponding wait_tensor.

We should switch to using the higher-order effect token.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132341
Approved by: https://github.com/yf225
2024-08-02 01:24:40 +00:00
ef426d5183 [nccl] Wrap nccl code update with version check (#130419)
Fixes the issue where PyTorch cannot be built with nccl < 2.13 after https://github.com/pytorch/pytorch/issues/128756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130419
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-02 01:22:07 +00:00
50ed6ce277 Support built-in id function for TensorVariable on parameters (#130100)
Fixes #130087

This patch tries to provide a built-in id function implementation for TensorVariable when the id function is called on tensors like module parameters. The id function call on intermediate tensors is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130100
Approved by: https://github.com/anijain2305
2024-08-02 01:19:25 +00:00
64235c6a71 Skip test_fp8 in test_aot_inductor temporarily (#132453)
https://github.com/pytorch/pytorch/pull/130422 caused the test `test.inductor.test_aot_inductor.AOTInductorTestABICompatibleCuda. test_fp8_abi_compatible_cuda` to fail (unclear why it was not run in GitHub) with `torch/csrc/inductor/aoti_torch/c/shim.h:390:34: note: candidate function not viable: requires 9 arguments, but 6 were provided`. We suspect that the kernel produced by the lowering function, which is no longer a fallback choice, has a schema issue at codegen. Fp8 is not used through AOTI currently and it is difficult to revert the PR (BE week), so we'll skip the test temporarily while making the new lowering compatible with AOTI.

Testing: the failed test on internal diff is now skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132453
Approved by: https://github.com/henrylhtsang
2024-08-02 01:18:03 +00:00
cyy
56334c854c [2/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#131834)
Follows #130798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131834
Approved by: https://github.com/ezyang
2024-08-02 00:49:30 +00:00
ee1ef066fd add src map to data-dependent errors (#132393)
Summary: Currently suggested fixes pick a map from symbols to user variables. However it is possible that many user variables  point to the same symbol, and some may be preferred over others. Thus we dump this info as well.

Test Plan: updated test

Sample error with new format:
```
Could not guard on data-dependent expression u2 >= 0 (unhinted: u2 >= 0).  (Size-like symbols: none)

<snip>

The following call raised this error:
  File "test/export/test_export.py", line 1950, in forward
    return r.view(items[0], items[2])

To fix the error, insert one of the following checks before this call:
  1. torch._check(items[2] >= 0)
  2. torch._check(items[2] < 0)

(These suggested fixes were derived by replacing `u2` with items[2] in u2 >= 0 and its negation.)
```

Differential Revision: D60574478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132393
Approved by: https://github.com/BoyuanFeng
2024-08-02 00:31:12 +00:00
625af2d27c [dynamo] fix add_push_null callsites with CALL_FUNCTION_EX (#132329)
Also fix a bug in `PyCodegen.add_push_null` where in Python <= 3.12, we may accidentally duplicate a NULL instead of the object on the stack before it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132329
Approved by: https://github.com/anijain2305
2024-08-02 00:29:21 +00:00
0016be8051 [Docker] Replace epel release rpm by yum install (#132449)
URL: https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm is not available anymore, hence we replace it with a yum epel-release install.

As a backup plan, this is still available: https://archives.fedoraproject.org/pub/archive/epel/7/x86_64/Packages/e/epel-release-7-14.noarch.rpm

Saved on our s3 path, just in case: https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm

Please note, we are still using it for installs like this:
```
RUN yum install -y \
    https://repo.ius.io/ius-release-el7.rpm \
	https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
```

Test in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132449
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-08-02 00:16:03 +00:00
3855ac5a5d Revert "[export] Add print_readable to unflattener (#128617)"
This reverts commit ab9791c0e342753013181eeeab300a05774fc456.

Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/angelayi due to never got landed internally due to weird flow... sorry ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2264224466))
2024-08-01 23:47:29 +00:00
0c3ac428a2 [BE][typing] fix types in common pruning (#132309)
BE task. Add typings and remove mypy errors in torch/testing/_internal/common_pruning.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132309
Approved by: https://github.com/ColinPeppler
2024-08-01 23:34:33 +00:00
87ddf70fc6 Set weights_only=False in export deserialize_torch_artifact (#132348)
Context:

We are planning to make a BC-breaking change to `torch.load` by flipping the default for `weights_only` from `False` to `True` in a future release. With `weights_only=True`, a custom unpickler is used that limits what can be loaded to state_dicts containing tensors (there is also a way for the user to allowlist specific things to be loaded). The goal of this is to attempt to prevent remote execution of arbitrary code when using `torch.load`.

To my understanding, in export, `torch.load` is used internally to load arbitrary objects, so we should set `weights_only=False` here to prevent the flip from breaking export.
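
A minimal illustration of the distinction (file name is arbitrary):

```python
import torch

torch.save({"w": torch.randn(2, 2)}, "artifact.pt")

# Safe-by-default path: restricted unpickler, tensors/state_dicts only.
state = torch.load("artifact.pt", weights_only=True)

# What export's internal loads of arbitrary objects must opt into explicitly.
obj = torch.load("artifact.pt", weights_only=False)
```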

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132348
Approved by: https://github.com/angelayi
2024-08-01 23:25:07 +00:00
1362d51e7d [AOTI] Fix number type for AOTI (#132180)
Fixes #131338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132180
Approved by: https://github.com/desertfire
2024-08-01 22:43:28 +00:00
35400f750f [torchbind] don't warn for certain skippable methods. (#132306)
Summary:
Skip the warning if the fake script object doesn't implement a fake method for:
1. __obj_flatten__: for real script objects only.
2. __set_state__ and __get_state__: for serialization; we don't expect them to be used during tracing.

Test Plan: Existing tests.

Reviewed By: angelayi

Differential Revision: D60478460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132306
Approved by: https://github.com/angelayi
2024-08-01 22:40:42 +00:00
2f54c38594 [AOTI] Fix bfloat16 in CPU (#132150)
Fixes #122986

- add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file

- Supress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare]
  436 |   if (tensor.numel() != numel) {

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-08-01 22:26:30 +00:00
a356a03f4a Fix DEBUG=1 asserts for mvlgamma backward with NJT (#132422)
mvlgamma backward trips DEBUG=1 asserts when trying to construct an empty tensor with `layout=torch.jagged`. This happens due to passing `self.options()` to `arange()` in `mvlgamma_backward()`. The fix in this PR unconditionally constructs the `arange()` with the strided layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132422
Approved by: https://github.com/albanD
2024-08-01 21:53:16 +00:00
92bebb46fa Support XPU ABI=0 build (#130110)
# Motivation
This PR intends to support ABI=0 build for XPU backend.

# Additional Context
The major change is adding a compilation option: `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). Why?
Because we use
- gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL; and
- icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code. Besides,
- `libsycl-preview.so` is an ABI-neutral library but `libsycl.so` is not.

This PR depends on https://github.com/pytorch/pytorch/pull/131643.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-08-01 21:42:14 +00:00
997f64af38 fastpath FunctionalTensor sizes() (#132084)
Another attempt at fast-pathing sizes() in FunctionalTensor, since it appears to improve compile time perf by up to ~10%. See the investigation from https://github.com/pytorch/pytorch/issues/125977#issuecomment-2122915602.

After looking at some failing tests locally, I realized that we need to manually handle metadata mutations now, since the previous "smarter" size dispatch was handling the updates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132084
Approved by: https://github.com/ezyang
2024-08-01 21:09:22 +00:00
c8958f8f84 Revert "Ban decorator usage of dynamo_timed (#132328)"
This reverts commit 9853c048eb53946eb505424b17ac42ce46b66ac1.

Reverted https://github.com/pytorch/pytorch/pull/132328 on behalf of https://github.com/clee2000 due to seems to have broken functorch/test_aotdispatch.py::TestAOTAutograd::test_input_data_and_metadata_mutation_aliases_other_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10204547165/job/28233976446) [HUD commit link](9853c048eb).  Test passed on PR, probably a landrace, base is only 10 hours old ([comment](https://github.com/pytorch/pytorch/pull/132328#issuecomment-2263909337))
2024-08-01 20:20:28 +00:00
78927d37f6 Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-01 20:14:25 +00:00
71e22e0959 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-01 20:14:25 +00:00
12f61e65eb [mtia][sdpa] MTIA SDPA dispatch via _fused_sdp_choice_stub (#132008)
Summary: as title

Differential Revision: D59823335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132008
Approved by: https://github.com/mortzur
2024-08-01 20:01:40 +00:00
596f568592 [dtensor][debug] adding js script to pytorch github so that i can host the browser visualizer on pytorch (#132185)
**Summary**
This is the JavaScript portion used in CommDebugMode's visual browser. I have placed it here so that I can host the browser on PyTorch, following the same procedure used to host memory_viz: https://github.com/pytorch/pytorch.github.io/blob/site/memory_viz.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132185
Approved by: https://github.com/XilunWu
ghstack dependencies: #132070
2024-08-01 19:50:23 +00:00
9853c048eb Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-01 19:27:58 +00:00
40c8f73099 Revert "Fix inlining module-scoped store global (#132224)"
This reverts commit c3a31d90e7d10a9b89b11396b6f8b20ed52bf394.

Reverted https://github.com/pytorch/pytorch/pull/132224 on behalf of https://github.com/ZainRizvi due to Looks like the new import mock_store_global_crossfile_inline fails internally. Please see D60567756 for details ([comment](https://github.com/pytorch/pytorch/pull/132224#issuecomment-2263768729))
2024-08-01 19:06:36 +00:00
93979e7063 Skip frame if torch dispatch mode enabled (#131828)
Fixes https://github.com/pytorch/pytorch/issues/105929

We now skip frames if a dispatch mode is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131828
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2024-08-01 19:06:20 +00:00
fbf3bc0a60 Always use high precision for SDPA math backend (#128922)
Summary:
feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. It's mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.

Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, it is worth upcasting FP16/BF16 inputs to FP32, doing the computation in FP32/TF32, and then downcasting the FP32 output back to FP16/BF16.
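
A sketch of the upcast-compute-downcast pattern described above (a reference illustration, not the actual ATen math-backend code):

```python
import math
import torch

def sdpa_math_ref(q, k, v):
    out_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()            # upcast FP16/BF16 -> FP32
    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
    return (attn @ v).to(out_dtype)                       # downcast back
```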

Differential Revision: D58710805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
2024-08-01 18:55:48 +00:00
0eea2b3947 Cast inputs to low precision kernels in emulate low precision mode (#132345)
Together with https://github.com/pytorch/pytorch/pull/132238, this is sufficient to eliminate the divergence in https://github.com/pytorch/pytorch/issues/132301:

Although we should discuss that issue more at length.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132345
Approved by: https://github.com/zou3519
2024-08-01 18:02:10 +00:00
Ryo
ce61300141 Enable oneDNN for tanh based GELU on aarch64 (#130925)
Provides a speedup for GELU on aarch64 compared to the native PyTorch implementation, e.g.

  8.5x speedup compared to the native implementation for 1x1x16384 on 32 threads on Graviton 3
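
The affected call is the tanh-approximated GELU, e.g.:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16384)
y = F.gelu(x, approximate="tanh")   # tanh-based GELU, the variant routed to oneDNN on aarch64
```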

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130925
Approved by: https://github.com/malfet
2024-08-01 17:54:48 +00:00
97eba8e174 [AOTI] Fix a typo in ExternKernel.codegen_const_args (#132191)
Differential Revision: [D60513923](https://our.internmc.facebook.com/intern/diff/D60513923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132191
Approved by: https://github.com/chenyang78
2024-08-01 17:46:25 +00:00
f467d55329 Disable remote cache on test_aot_autograd_cache (#132409)
Summary:
AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if the remote cache is turned on, it will miss the cache.

Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here.

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled
passes

Differential Revision: D60588615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409
Approved by: https://github.com/masnesral
2024-08-01 17:26:11 +00:00
010fc7858a [export] Fix serialization of OpOverload w/ SymInt outputs (#132126)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1473575486613991/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132126
Approved by: https://github.com/ydwu4
2024-08-01 17:22:04 +00:00
ff4ca0d02a [Easy] Fix argument name collision in HigherOrderOperator dispatched functions (#132377)
Share the same spirit of #129562

- #129562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132377
Approved by: https://github.com/zou3519
2024-08-01 17:13:37 +00:00
7b816d7d6d [dynamo] Treat attr of unspecialized builtin nn modules as static (#132318)
This fixes the huge increase in compile time when using +dynamic with inline_inbuilt_nn_modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132318
Approved by: https://github.com/yanboliang, https://github.com/mlazos, https://github.com/ezyang
ghstack dependencies: #132302, #132304, #132312, #132308, #132314
2024-08-01 17:11:18 +00:00
69cbf05529 Fix recent build error on ppc64le (#129736)
This PR will fix the recent build issue observed on ppc64le.
Fixes #128130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129736
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-01 17:09:42 +00:00
30293319a8 [BE][Easy][19/19] enforce style for empty lines in import segments in torch/[o-z]*/ (#129771)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2024-08-01 17:07:14 +00:00
c59f3fff52 [PP] Forward only schedule (#132177)
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_forward_only`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132177
Approved by: https://github.com/lessw2020
2024-08-01 16:35:56 +00:00
ee09d066d3 [dynamo] Add line number to _warn_capture_scalar_outputs() (#132333)
Fixes #127667.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132333
Approved by: https://github.com/anijain2305
2024-08-01 16:11:21 +00:00
35fcd59fd8 [inductor] make restrict_keyword cross OSs. (#132394)
Error Msg:
<img width="862" alt="image" src="https://github.com/user-attachments/assets/51fef188-bce8-42a5-8ed4-d11802c6ca89">

<img width="347" alt="image" src="https://github.com/user-attachments/assets/0eafe38e-1c7c-427d-82f5-16a31bccc476">

Handle the `restrict` keyword per OS; ref: https://learn.microsoft.com/en-us/cpp/cpp/extension-restrict?view=msvc-170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132394
Approved by: https://github.com/desertfire
2024-08-01 16:03:10 +00:00
920f0426ae Add None return type to init -- tests rest (#132376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335, #132351, #132352
2024-08-01 15:44:51 +00:00
221350e3a4 Add None return type to init -- tests (#132352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352
Approved by: https://github.com/ezyang
ghstack dependencies: #132335, #132351
2024-08-01 15:44:51 +00:00
a6985c09cb Add None return type to init -- functorch and torchgen (#132351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335
2024-08-01 15:26:45 +00:00
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
30d7f0b15a Remove wget call to builder install_cuda.sh (#132410)
This file ``install_cuda.sh`` now lives in ``.ci/docker/common`` and will be removed from builder repo.
Here is PR that removes it from builder: https://github.com/pytorch/builder/pull/1949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132410
Approved by: https://github.com/Skylion007
2024-08-01 15:22:08 +00:00
cyy
c99adce9a1 [12/N] Fix clang-tidy warnings in jit (#132209)
Follows #132131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132209
Approved by: https://github.com/Skylion007
2024-08-01 15:12:12 +00:00
0d88dd0f77 [TS2E] Remove reference to torch.onnx internals (#132186)
Instead, this PR moves the code to the converter to avoid the dependency. Feel free to refactor it afterward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132186
Approved by: https://github.com/angelayi
2024-08-01 15:08:02 +00:00
cyy
d7d6190493 [11/N] Use std::nullopt and std::optional (#132396)
Follows #132364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132396
Approved by: https://github.com/ezyang
2024-08-01 14:46:33 +00:00
a4013e8b72 [inductor] cpp codegen alignas for all OSs. (#132387)
Changes:
1. Make cpp codegen alignas works for all OSs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132387
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-01 14:30:09 +00:00
6c1f1563e1 [inductor] fix UndefinedTensorImpl singleton can't export on Windows. (#132326)
This PR fixes the issue that `UndefinedTensorImpl::_singleton` can't be exported on Windows.
Snapshot:
<img width="1346" alt="image" src="https://github.com/user-attachments/assets/b34256ac-a0ae-473b-89e6-10d755eaad24">

The reason is that MSVC can't export class static data with external linkage; ref: https://learn.microsoft.com/en-us/cpp/cpp/using-dllimport-and-dllexport-in-cpp-classes?view=msvc-170#_pluslang_using_dllimport_and_dllexport_in_c2b2bselectivememberimportexport

I use another singleton implementation on Windows to avoid the issue.

With this PR, cpp_wrapper on Windows starts to work.
<img width="1916" alt="image" src="https://github.com/user-attachments/assets/c1d7d7e7-64ca-4c6d-9fb7-e3b91e675b58">

Next step, I will enable the cpp_wrapper UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132326
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-01 13:37:12 +00:00
6ff1e43a41 [BE][Easy][13/19] enforce style for empty lines in import segments in test/j*/ (#129764)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129764
Approved by: https://github.com/ezyang
2024-08-01 12:13:42 +00:00
672ce4610e Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively registers the submodules of the C extension in `sys.modules` (e.g., `_C._dynamo.eval_frame`).
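
A rough sketch of that recursive registration (the real code in torch/__init__.py may filter differently):

```python
import sys
import types

def register_submodules(mod: types.ModuleType, prefix: str, _seen=None) -> None:
    _seen = set() if _seen is None else _seen
    if id(mod) in _seen:
        return
    _seen.add(id(mod))
    for name in dir(mod):
        attr = getattr(mod, name)
        if isinstance(attr, types.ModuleType):
            fqn = f"{prefix}.{name}"
            sys.modules.setdefault(fqn, attr)   # e.g. "torch._C._dynamo.eval_frame"
            register_submodules(attr, fqn, _seen)
```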

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-01 12:04:59 +00:00
d95756f6a5 [Quantizer][Add] Fix add annotation with constant (#132092)
Summary:
Occasionally we run into a partition that looks like this for Add:

```
SourcePartition(nodes=[_constant2, add_2], source=<built-in function add>, input_nodes=[x], output_nodes=[_constant2, add_2], params=[_constant2])
```

In this case we are adding a constant to an input and reusing the constant later down the line. This causes our constant to be an output of our SourcePartition. The assumption that:

```
        add_node = add_partition.output_nodes[0]
```
will not necessarily hold. As a result, we must check that the output node is indeed a call_function and not a constant.
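
A minimal sketch of that check (using the partition from the example above):

```python
# Pick the node that is actually the add call, skipping constants that ended
# up in the partition's outputs.
add_node = next(
    node for node in add_partition.output_nodes if node.op == "call_function"
)
```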

Test Plan: buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_ops -- test_qs8_add_constant

Differential Revision: D60413221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132092
Approved by: https://github.com/jerryzh168
2024-08-01 09:57:43 +00:00
bdd83c4c7f Add Full block support to flex_decoding (#131404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131404
Approved by: https://github.com/yanboliang
2024-08-01 07:28:52 +00:00
cyy
043e41f4f4 [10/N] Use std::nullopt and std::make_optional (#132364)
Follows #130674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132364
Approved by: https://github.com/ezyang
2024-08-01 07:02:35 +00:00
d6a82ce39b [dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132314
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304, #132312, #132308
2024-08-01 06:21:05 +00:00
aa0ed2496f [dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132308
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304, #132312
2024-08-01 06:21:05 +00:00
612ea35395 [dynamo] Introduce UnspecializedBuiltinNNModuleSource (#132312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132312
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304
2024-08-01 06:21:05 +00:00
4c29c1a96a [EZ] adjust test to accept training IR input (#131999)
When we do predispatch functional export, sometimes we get harmless additional detach calls. In the new training IR, it actually outputs a slightly different (arguably more correct) result.

Differential Revision: [D60348764](https://our.internmc.facebook.com/intern/diff/D60348764/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131988, #131995
2024-08-01 06:20:38 +00:00
7a779b5257 Add functions from torch.masked._ops to __all__ for torch.masked (#131288)
Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error:

```
"mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage]
```

Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export
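
An illustrative sketch of the change (the exact set of re-exported operators is an assumption):

```python
# torch/masked/__init__.py (sketch)
from torch.masked._ops import amax, amin, mean, sum

# Listing the names in __all__ marks them as public exports for pyright/Pylance.
__all__ = ["amax", "amin", "mean", "sum"]
```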

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288
Approved by: https://github.com/ezyang
2024-08-01 05:45:08 +00:00
928adb7cc2 Fix empty fake mode problem (#131995)
Title

Differential Revision: [D60348541](https://our.internmc.facebook.com/intern/diff/D60348541/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131995
Approved by: https://github.com/angelayi
ghstack dependencies: #131988
2024-08-01 04:55:37 +00:00
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non-deterministic. There is an internal failure which we recently ran into that did not consistently fail.

See, repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
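
A small illustration of the underlying problem (plain Python, unrelated to Inductor internals):

```python
# With CPython's default hash randomization for str, iteration order of a plain
# set can differ between interpreter runs, so codegen that iterates a set of
# buffer names may emit differently ordered output from run to run.
names = {"buf0", "buf1", "buf2", "arg0_1"}
print(list(names))  # order is not guaranteed to be stable across processes
```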

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
bcd1d2e832 [dynamo] Introduce UnspecializedNNModule guard source (#132304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132304
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302
2024-08-01 04:35:43 +00:00
e772547d70 [dynamo][rename/refactor] Rename guard_source NN_MODULE to SPECIALIZED_NN_MODULE (#132302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132302
Approved by: https://github.com/yanboliang
2024-08-01 04:35:43 +00:00
90fa64bd7e [torch][take2] Implement BFloat16 __hip_bfloat16 overloads (#132234)
Summary:
In D60024830 I attempted to define these overloads, but gated the implementation on the wrong macros. Namely I used `__CUDACC__` instead of `__HIPCC__` (facepalm).

It might be worth merging this with the nvidia case via typedefs (e.g. `typedef __hip_bfloat16 __gpu_bfloat16` and `typedef __nv_bfloat16 __gpu_bfloat16`), but that seems like an entirely new paradigm for torch, so I'll punt that change to the future so we can focus on supporting `BFloat16(__hip_bfloat16)` here

Test Plan: CI

Differential Revision: D60362079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132234
Approved by: https://github.com/houseroad
2024-08-01 04:25:46 +00:00
7911b7bfb7 [inductor][cpp] stabilize do_bench_cpu (#131873)
This PR stabilizes the `do_bench_cpu` by using milliseconds for warmup and benchmark runs, aligning with that of Trtion's do_bench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131873
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/eellison
2024-08-01 04:25:31 +00:00
b25ef91bf1 [BE][Easy][18/19] enforce style for empty lines in import segments in torch/d*/ (#129770)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129770
Approved by: https://github.com/wconstab
2024-08-01 04:22:50 +00:00
bc7ed1fbdc [FSDP2] add __repr__ to FSDPParamGroup and FSDPParam (#132350)
In pdb, it's pretty common to print `FSDPParamGroup` and `FSDPParam`; this makes sure they are human readable.

print `FSDPParam` in pdb
```
FSDPParam(fqn=layers.6._checkpoint_wrapped_module.attention.wq.weight, orig_size=torch.Size([128, 256]))
```
print `FSDPParamGroup` in pdb
```
FSDPParamGroup(fqn=layers.6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132350
Approved by: https://github.com/awgu
2024-08-01 04:21:57 +00:00
46ed33b207 add decomposition_table as an arg to get_isolated_graphmodule (#130886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130886
Approved by: https://github.com/wanchaol
2024-08-01 04:21:43 +00:00
073430ebea Don't check for autograd state when lowering to inference IR (#131988)
When lowering to inference IR, we shouldn't error on autograd state changes because we will have preserved the autograd state change at the training level. I think the more correct way of implementing it would be to wrap autograd ops in HOP before decomposing, but that seems low ROI.

Differential Revision: [D60346235](https://our.internmc.facebook.com/intern/diff/D60346235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131988
Approved by: https://github.com/angelayi
2024-08-01 04:15:37 +00:00
81db69278d unsupported sympy functions in export solver (#132325)
Summary:
A bunch of issues around support for sympy functions like `TruncToInt` and `ToFloat` are uncovered by https://github.com/pytorch/pytorch/issues/131897. This PR addresses only one of them (as the title suggests). Another issue is deserialization, filed as a task: T197567691.

However, the most important issue is that adding runtime assertions is broken right now: specifically, sympy_interp with `PythonReferenceAnalysis` currently doesn't work because the implementations of some of these sympy functions in `PythonReferenceAnalysis` (or falling through to its base class) do not expect proxies. This means things like `math.trunc`, `math.floor`, `round`, etc. don't work, and the failure can easily be reproduced by using them inside `torch._check`. According to ezyang, these implementations need to point to new torch functions that can expect proxies (see how minimum and maximum are implemented, e.g.).

Test Plan: added test (original repro provided)

Differential Revision: D60540951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132325
Approved by: https://github.com/ezyang
2024-08-01 04:11:52 +00:00
10344d76bd Revert "[AOTI] Fix bfloat16 in CPU (#132150)"
This reverts commit a488113062b7231197ace8522ab3cab535c77d0b.

Reverted https://github.com/pytorch/pytorch/pull/132150 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_unspec_inputs_cuda_dynamic_shapes_cuda_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10189155341/job/28189531216) [HUD commit link](a488113062). Test was not run on PR due to being skipped for being slow ([comment](https://github.com/pytorch/pytorch/pull/132150#issuecomment-2261895048))
2024-08-01 03:35:39 +00:00
a28cda11ef Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613)"
This reverts commit 344c15a0bb66409ec5e576992090d127cbfa2cff.

Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))
2024-08-01 03:22:11 +00:00
589aef4bb0 Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-08-01 03:18:37 +00:00
718c13cd39 [inductor] Reinplacing should not allow an op to mutate the same input multiple times (#132238)
Fixes #132196

Let's say we have:
- op(x, y) that mutates both x and y
- new_x, new_y = functional_op(x, y) is the functional variant

If we are presented with functional_op(x, x), we must not reinplace
this into op(x, x), because then it would be writing to the same Tensor.
Instead, it's OK to reinplace one of them and to clone the other:
```
>>> y = x.clone()
>>> op(x, y)
```
This also applies if we have views: functional_op(x, x[0])
should not reinplace into op(x, x[0]).

The fix is to avoid reinplacing an arg if a view of it already has been
reinplaced.

Test Plan:
- new and existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132238
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-08-01 02:37:03 +00:00
344c15a0bb AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
2024-08-01 02:25:54 +00:00
2276d9045a [cpu] add more VecConvert for 8bits (#131876)
Adds more intrinsic specializations for 8-bit conversions, in order to speed up 8-bit SDPA in the future.
- u8 -> i16
- i32 -> f32
- f32 -> i32
- i32 -> i8 (only the vec512 variant is added because avx512vl is unavailable for vec256)
- i16 -> i8 (only the vec512 variant is added because avx512vl is unavailable for vec256)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131876
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-08-01 01:38:39 +00:00
7c89ec0f7c Implements torch.cuda.MemPool() API (#131152)
In this PR:
- Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change.
- MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator.
- MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/
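
A rough usage sketch based on the description above; treat the constructor arguments, the `MemPoolContext` spelling, and the allocator library names as assumptions rather than the final API:

```python
import torch
from torch.cuda.memory import CUDAPluggableAllocator

# a pool backed by a user-provided allocator (library/function names are placeholders)
allocator = CUDAPluggableAllocator("libmyalloc.so", "my_malloc", "my_free")
pool = torch.cuda.MemPool(allocator.allocator())

# make the pool active so subsequent CUDA allocations can be routed through it
ctx = torch.cuda.MemPoolContext(pool)
x = torch.empty(1024, device="cuda")
del ctx
```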

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-01 01:29:30 +00:00
4e966e8a1c Update inference_mode doc (#132321)
Fix https://github.com/pytorch/pytorch/issues/132288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132321
Approved by: https://github.com/awgu, https://github.com/soulitzer
2024-07-31 23:50:03 +00:00
a488113062 [AOTI] Fix bfloat16 in CPU (#132150)
Fixes #122986

- add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file

- Suppress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare]
  436 |   if (tensor.numel() != numel) {

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-07-31 23:28:24 +00:00
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It gives meaning to the third-to-last (num_heads) dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check that Query.size(-3) == Key.size(-3) == Value.size(-3), or that Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their numbers of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
f0da167ce5 Add fx graph runnable to tl parse (#130976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976
Approved by: https://github.com/ezyang
2024-07-31 22:19:35 +00:00
645c1052a6 Refactor local autotune remote cache to make the code less error prone (#132289)
Fixes #132241

This PR refactors local autotune cache so that disabling it is easier and cleaner.

Differential Revision: [D60537196](https://our.internmc.facebook.com/intern/diff/D60537196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132289
Approved by: https://github.com/aorenste
ghstack dependencies: #132285
2024-07-31 22:12:22 +00:00
b0e06d9d6a Make config.autotune_remote_cache be a three-way option (#132285)
Similar to fx_graph_cache config, make autotune config be three-way so we can hard enable/disable via config options.

Differential Revision: [D60537105](https://our.internmc.facebook.com/intern/diff/D60537105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132285
Approved by: https://github.com/aorenste
2024-07-31 22:12:22 +00:00
260c991e20 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-31 21:32:20 +00:00
e74ba1b34a [BE][Easy][15/19] enforce style for empty lines in import segments in torch/_d*/ (#129767)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129767
Approved by: https://github.com/anijain2305
2024-07-31 21:18:11 +00:00
ad9826208c Remove string length limit in ET (#132169)
Summary: ET sets the length limit of a string input variable to 8192 characters. However, the node process_group::init has more than 8192 characters for an Ads 128-rank job. This diff temporarily removes this limit so ET can capture the complete information of the process group.

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTrace

Reviewed By: sanrise

Differential Revision: D60341306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132169
Approved by: https://github.com/sraikund16, https://github.com/sanrise
2024-07-31 20:54:39 +00:00
d3cefc9e3a AutoHeuristic: Collect data for mixed_mm (#131611)
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:

- Move pad_mm-related AutoHeuristic files into a subdirectory
- Introduce an interface, benchmark_runner.py, that can be subclassed to introduce new scripts that run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).

The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
2024-07-31 20:45:45 +00:00
f8b6e91840 Add sequoia runner to mac-mps (#132190)
Adds MacOS 15 runners to GitHub actions for Mac-mps test suite

Co-authored-by: Joona Havukainen <jhavukainen@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132190
Approved by: https://github.com/malfet
2024-07-31 20:26:04 +00:00
d72e863b3e Fix lint after PR #130572 (#132316)
Fix lint after https://github.com/pytorch/pytorch/pull/130572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132316
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi
2024-07-31 20:00:31 +00:00
aeb78c9849 [TD] More files for test_public_bindings (#132284)
It relies on that file

Also we care about .cpp files too apparently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132284
Approved by: https://github.com/ZainRizvi
2024-07-31 19:53:40 +00:00
cb4c107d70 [pytorch][counters] DynamicCounter (#132166)
Summary:
Implement a callback-based dynamic counter with pluggable backends.
The backend API and integration is similar to WaitCounter. Note that this counter should only be used with C++ callbacks, since making it safe to be used for GIL-requiring callbacks would be pretty challenging and may defeat the whole purpose of this counter (since the duration of the callback can no longer be guaranteed).

Test Plan: unit test

Differential Revision: D60464055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132166
Approved by: https://github.com/asiab4
2024-07-31 19:52:51 +00:00
dc38646c58 Revert "[pytorch][counters] Pybind for WaitCounter (#132167)"
This reverts commit 2c7bd61afa4b762e00b26bbde43685de080af32a.

Reverted https://github.com/pytorch/pytorch/pull/132167 on behalf of https://github.com/clee2000 due to broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183687967/job/28172929836) [HUD commit link](2c7bd61afa) not tested on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132167#issuecomment-2261328275))
2024-07-31 19:51:56 +00:00
6955bc170d Some updates to merge rules (#132296)
The added people from metamates don't actually make a material
difference right now but I added some for fun.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132296
Approved by: https://github.com/albanD, https://github.com/malfet
2024-07-31 19:49:08 +00:00
2138a710eb enable test_max_pool2d6 after resolving empty array (#132219)
Related to Issue: https://github.com/pytorch/pytorch/issues/131335
Resolving PR: https://github.com/pytorch/pytorch/pull/132023

Test output:
```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (enable-test-max-pool2d6)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cpu_cpp_wrapper.py -k test_max_pool2d6
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
.
----------------------------------------------------------------------
Ran 2 tests in 8.668s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132219
Approved by: https://github.com/desertfire
2024-07-31 19:13:54 +00:00
cfe61e84ac Add a 'to' method for moving to and from device for BlockMask (#132087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132087
Approved by: https://github.com/yanboliang
2024-07-31 19:05:30 +00:00
898a431a46 Dump files that look like FX graphs to structured log (#132100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132100
Approved by: https://github.com/oulgen
2024-07-31 18:45:28 +00:00
f9e4d05c15 Save and run post compilation steps within FXGraphCache (#130572)
This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch.

What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.

## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run

To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object.

## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.

Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
2024-07-31 18:32:40 +00:00
b40249b462 propagate XLA's metadata after functional sync (#131076)
Fixes https://github.com/pytorch/xla/issues/7174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131076
Approved by: https://github.com/bdhirsh
2024-07-31 18:20:00 +00:00
7eb2a99585 Fix to support unary pointwise ops when an NJT is not the first arg (#131937)
**Background:** NJT utilizes a `jagged_unary_pointwise()` fallback that historically has assumed blindly that the first arg is an NJT. This assumption breaks certain ops; for example `pow(scalar, Tensor)` has an NJT as the second arg.

This PR expands `jagged_unary_pointwise()` and the associated schema validation logic to handle an NJT in args other than the first position.
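
A hedged example of the case being fixed (shapes are arbitrary):

```python
import torch

# Build a jagged NestedTensor (NJT) and call an op where the NJT is the
# *second* argument, as in pow(scalar, Tensor).
nt = torch.nested.nested_tensor(
    [torch.randn(3, 4), torch.randn(5, 4)], layout=torch.jagged
)
out = torch.pow(2.0, nt)  # the fallback previously assumed the NJT was args[0]
```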
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131937
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898, #131704
2024-07-31 17:51:03 +00:00
c3a31d90e7 Fix inlining module-scoped store global (#132224)
Fixes https://github.com/pytorch/pytorch/issues/132165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132224
Approved by: https://github.com/anijain2305
2024-07-31 17:37:43 +00:00
6214b5388b typing ir.py - part 1 (#131845)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131845
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-31 17:37:14 +00:00
144639797a Improve side effects error message (#132223)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132223
Approved by: https://github.com/anijain2305
2024-07-31 17:29:26 +00:00
784a6ec5a3 Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)"
This reverts commit 13d744464f10e35c0de50feb4e2340d4dae8e05f.

Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](13d744464f) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))
2024-07-31 16:49:21 +00:00
9826c542f0 [inductor] skip remote fx caching in failing pattern matcher tests (#132206)
Summary: These tests are failing internally with remote caching enabled because the installed pattern increments a nonlocal counter, which we skip with a cache hit.

Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_with_mutation (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations1 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations2 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations3 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
```

Differential Revision: D60491503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132206
Approved by: https://github.com/oulgen
2024-07-31 16:41:04 +00:00
bdd7a0322d [Dynamo] Fix - str handler for UserDefinedObjectVariable (#130506)
Fixes #130301

Adjusted the call_str method to handle str conversion for UserDefinedObjectVariable.
Re-attempted in a clean branch because of unrelated test errors.
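
A hedged repro sketch of the pattern this enables (class and function names are made up):

```python
import torch

class Settings:
    def __str__(self):
        return "Settings(mode=fast)"

@torch.compile
def fn(x, settings):
    # str() on a user-defined object is now handled by Dynamo's call_str
    # for UserDefinedObjectVariable instead of erroring out.
    return x + len(str(settings))

fn(torch.ones(3), Settings())
```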

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130506
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2024-07-31 16:39:59 +00:00
fe4f8e97cd [Intel GPU] xpu-ops codegen via backend whitelist (#130082)
# Motivation

This PR intends to enhance the codegen to allow generate codes for XPU backend.

XPU operators currently need to be registered in a hand-written way. Developers have no chance to take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort.

We utilize the backend_whitelist argument in `gen.py` to generate XPU needed headers and source codes.

# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.

We use the following commands to generate XPU operators

` python -m torchgen.gen --source-path path/to/yaml/of/xpu   --install-dir  build/xpu    --per-operator-headers    --static-dispatch-backend     --backend-whitelist=XPU`

The diff lies at `backend-whitelist=XPU`.  The backend-whitelist key is an existent argument in torchgen.

The inputs of `gen.py` are code templates and an operators yaml. We share the same templates as `aten`. A simplified yaml lives in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added, and the format is the same as the one in `aten`.

# Result

All operator headers are generated in `build/xpu/ATen/ops` independently, which does not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers in this folder.

# Verification

* In `third-party/torch-xpu-ops`, we migrate all supported kernels to structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
2024-07-31 16:31:38 +00:00
aec8bc5e4c [easy] fix type annotation on constraint_violations variable (#127064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127064
Approved by: https://github.com/jananisriram
2024-07-31 16:27:10 +00:00
c85088b1f9 [ROCm] performance optimization for index select (#131713)
As observed while working on this fix (https://github.com/pytorch/pytorch/pull/130994), 128 threads per block seems quite low. This PR increases the default to improve performance, and also slightly refactors the code to replace the hard-coded 128 for better maintainability.

By increasing the default max threads per block from 128 to 256, I saw for `aten::index_select`,  its "CUDA total" time drop from 44.820ms to 33.608ms by profiling below embedding script:
```
input = torch.randint(low=0, high=16032, size=[131072], device="cuda")
w = torch.randn([16032, 16384], device="cuda")

with profiler.profile(record_shapes=True) as prof:
    x = torch.nn.functional.embedding(input, w)

```
I tested raising the default from 128 to 256, 512, and 1024 on several different types of devices, and observed the "CUDA total" time dropping further, with more latency improvement, as the number increases. Below is one example of the latency improvement ratio:

| max threads per block | latency improvement |
|---|---|
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |

Using 512 as the new default max for non-mi300x to be conservative, which is 1.44x faster than using 128 with the above profiling script.

Using 1024 for mi300x is 1.61x faster than using 128 with the same profiling script, and using 512 is 1.57x faster.

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131713
Approved by: https://github.com/jeffdaily, https://github.com/syed-ahmed, https://github.com/malfet
2024-07-31 16:24:01 +00:00
13d744464f Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non-deterministic. There is an internal failure which we recently ran into that did not consistently fail.

See, repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
2c7bd61afa [pytorch][counters] Pybind for WaitCounter (#132167)
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).

Test Plan: unit test

Reviewed By: asiab4

Differential Revision: D60463979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132167
Approved by: https://github.com/asiab4
2024-07-31 16:04:40 +00:00
39a3c98aa6 [inductor] fix scalar miss constuctor for long type. (#132117)
Fix the `long` to `c10::scalar` conversion issue.

![image](https://github.com/user-attachments/assets/fc44a170-e293-4688-a185-d189484f6638)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132117
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-07-31 15:40:48 +00:00
b2118573d6 [BE] Unify PG assignments (#132230)
Python's `or` operator returns `bar` in cases like
`foo = None or bar`
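
A tiny illustration of the pattern being unified (the default-group fallback is illustrative):

```python
import torch.distributed as dist

def resolve_group(group=None):
    # `None or fallback` evaluates to `fallback`, so this single expression
    # covers both "caller passed a group" and "use the default group".
    return group or dist.group.WORLD
```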

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132230
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2024-07-31 15:28:25 +00:00
9c52013559 [subclasses] Fix nested subclasses flattened tensors ordering (#132096)
get_plain_tensors() should result in a DFS of the leaves.
The error was that plain tensors (leaves) on the same level were returned before the plain tensors of subclasses, even if the subclasses come earlier in the "flatten" list.

Original issue from AO: https://github.com/pytorch/ao/issues/515

Test: TBD, need to make an asymmetric subclass with dense tensors and subclasses
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132096
Approved by: https://github.com/bdhirsh
2024-07-31 14:12:51 +00:00
5406e46b00 Revert "Add fx graph runnable to tl parse (#130976)"
This reverts commit 52c3af62d6fa4a0a4e22764a89f1877f3b1b28f9.

Reverted https://github.com/pytorch/pytorch/pull/130976 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/130976#issuecomment-2260579485))
2024-07-31 13:53:57 +00:00
3d7f541597 [BE][TP] Check module has bias before access (#132137)
Some linear modules, such as the ones reconstructed by `torch.export.unflatten()`, may not have the `bias` attribute, if the original linear module has `bias=None`.
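
A minimal sketch of the guard, assuming a hypothetical helper that shards a linear layer:

```python
import torch.nn as nn

def maybe_shard_bias(linear: nn.Module) -> None:
    # Modules rebuilt by torch.export.unflatten() may not define `bias` at all
    # when the original layer had bias=None, so probe for it before touching it.
    bias = getattr(linear, "bias", None)
    if bias is not None:
        ...  # shard / distribute the bias here
```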

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132137
Approved by: https://github.com/wanchaol
2024-07-31 13:45:28 +00:00
dad125a64b Address clang-tidy nits in BFloat16 (#132203)
Summary: In https://github.com/pytorch/pytorch/pull/131359 I forgot to amend with clang-tidy fixes before merging. This addresses that.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132203
Approved by: https://github.com/houseroad
2024-07-31 13:41:56 +00:00
45e6a364ee Avoid autocast deprecation warning (#132207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207
Approved by: https://github.com/awgu
2024-07-31 13:13:39 +00:00
f4f7aba75d Expose function to probe whether PyTorch was built with FlashAttention (#131894)
This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-07-31 11:33:09 +00:00
548c460bf1 [BE][Easy][7/19] enforce style for empty lines in import segments in test/[a-c]*/ and test/[q-z]*/ (#129758)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129758
Approved by: https://github.com/ezyang
2024-07-31 10:54:03 +00:00
46994e753b [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#132172)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132172
Approved by: https://github.com/davidberard98
ghstack dependencies: #132170
2024-07-31 10:51:46 +00:00
89053e382a [NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#132170)
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
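
A hedged example of the newly supported call (shapes are arbitrary; `dim=1` is the ragged dimension of a `(B, *, M)` jagged tensor):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
# reduce along the ragged (*) dimension, which this change enables
probs = torch.softmax(nt, dim=1)
```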
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132170
Approved by: https://github.com/davidberard98
2024-07-31 10:51:46 +00:00
e7eeee473c [BE][Easy][14/19] enforce style for empty lines in import segments in torch/_[a-c]*/ and torch/_[e-h]*/ and torch/_[j-z]*/ (#129765)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129765
Approved by: https://github.com/ezyang
2024-07-31 10:42:50 +00:00
9e473fd868 Make adding Buffers more like adding Parameters (#125971)
Add semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the register_buffer method has not been changed. The persistent parameter in the Buffer type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the Buffer type can be used as a drop-in replacement for register_buffer, as it just leads to register_buffer being called. This new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.
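
A short usage sketch, assuming the new class is exposed as `torch.nn.Buffer` next to `nn.Parameter`:

```python
import torch
import torch.nn as nn

class RunningStat(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # equivalent to self.register_buffer("mean", torch.zeros(dim), persistent=False)
        self.mean = nn.Buffer(torch.zeros(dim), persistent=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x - self.mean
```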

Fixes #35735

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
2024-07-31 10:32:40 +00:00
a94e507c39 [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Original issue: https://github.com/pytorch/pytorch/issues/114338

Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible mutually exclusive scenarios:

1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad.

Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).

With the current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `torch.enable_grad()`

Changes in the partitioner?

Inference and training graphs differed in their return container (list vs. tuple).
The changes in the partitioner unify this to always return a tuple.
As a result, there are some changes in test_aotdispatch.py for graph contents: list -> tuple.

Why was it reverted?

There was a regression of the hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

Because one of the compiled graphs contained outputs which are aliases of inputs that are nn.Parameter(requires_grad=True).

Even though the torchbench inference benchmarks run inside `torch.no_grad()`, alias ops (specifically `expand` for hf_Reformer) preserve requires_grad.

As a result we started compiling a training graph instead of an inference graph.

Fix for view ops:

If we have outputs that are aliases of inputs that require grad, those outputs requiring grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-31 07:25:19 +00:00
e9d1c26275 fix uniform op in dynamo (#132160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132160
Approved by: https://github.com/anijain2305
2024-07-31 06:48:43 +00:00
ae708e9791 [ONNX] Remove the deprecated SymbolicContext (#132184)
Remove the deprecated SymbolicContext class from torch.onnx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132184
Approved by: https://github.com/titaiwangms
2024-07-31 04:24:32 +00:00
cyy
89da94594e [11/N] Fix clang-tidy warnings in jit (#132131)
Follows #132122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132131
Approved by: https://github.com/Skylion007
2024-07-31 03:45:52 +00:00
91299c95ec Revert "Add functions from torch.masked._ops to __all__ for torch.masked (#131288)"
This reverts commit 78020ea55d1bc06898577887b80c15d6d2b967dc.

Reverted https://github.com/pytorch/pytorch/pull/131288 on behalf of https://github.com/kit1980 due to Broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10172945925/job/28136657243) [HUD commit link](78020ea55d) ([comment](https://github.com/pytorch/pytorch/pull/131288#issuecomment-2259581854))
2024-07-31 03:45:09 +00:00
27c9262d29 Fix stdout / stderr typing in SubprocessHandler (#132071)
Summary: Fix stdout / stderr typing in SubprocessHandler. Stdout and Stderr should be `Optional[str]` instead of `str`.
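
An illustrative sketch of the corrected annotations (the surrounding class is heavily simplified):

```python
from typing import Optional

class SubprocessHandler:
    def __init__(
        self,
        stdout: Optional[str] = None,  # was annotated as `str`
        stderr: Optional[str] = None,  # was annotated as `str`
    ) -> None:
        self.stdout = stdout
        self.stderr = stderr
```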

Test Plan: CI

Differential Revision: D60319648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132071
Approved by: https://github.com/Skylion007
2024-07-31 02:51:11 +00:00
52c3af62d6 Add fx graph runnable to tl parse (#130976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976
Approved by: https://github.com/ezyang
2024-07-31 02:27:22 +00:00
deb788f6cc Merge torch.nn.utils.rnn type stubs (#131872)
I want to re-attempt:

* #61467

See:

* https://github.com/pytorch/pytorch/issues/10536#issuecomment-2251948730

and this is one of the files I would touch.

quoting @ezyang:

* https://github.com/pytorch/pytorch/issues/91648#issuecomment-1372010129

> The back story here is that in https://github.com/pytorch/pytorch/pull/19089 we added pyi stubs for nn modules, but when we got off Python 2 we started merging the pyi stubs directly into the py files, e.g., as in https://github.com/pytorch/pytorch/pull/43044. But not all the modules got the treatment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131872
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-07-31 02:24:59 +00:00
78020ea55d Add functions from torch.masked._ops to __all__ for torch.masked (#131288)
Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error:

```
"mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage]
```

Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288
Approved by: https://github.com/ezyang
2024-07-31 02:16:38 +00:00
df0494bbba Clean redundant link libraries for XPU (#131322)
`torch_xpu` should link to `libtorch_cpu.so` instead of `torch_cpu_library`, otherwise redundant link libraries will contaminate `torch_xpu`, especially when MKL is present in both the CPU and XPU builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131322
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-07-31 02:15:15 +00:00
c07aa1c9c9 [Easy] reorder functions in torch._jit_internal (#130531)
Split from #128633.

- #128633

Move commonly used functions (e.g. `is_scripting`) to the top of the module to avoid circular dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130531
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-07-31 02:12:29 +00:00
fbe6f42dcf [BE][Easy][8/19] enforce style for empty lines in import segments in test/[k-p]*/ (#129759)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129759
Approved by: https://github.com/justinchuby, https://github.com/ezyang
2024-07-31 02:09:20 +00:00
914577569d Remove python 3.8 nightly builds (#132138)
Removing python 3.8 support in nightly builds. As per PR: https://github.com/pytorch/pytorch/issues/120718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132138
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/huydhn
2024-07-31 01:50:03 +00:00
05317cd8f7 [dtensor][be] improving readability and reducing repeating code (#132070)
**Summary**
I created functions that reduce repeated code in the console and JSON APIs, which also improves their readability for future developers.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132070
Approved by: https://github.com/XilunWu
2024-07-31 00:53:36 +00:00
f85feef127 [DTensor] add support for custom op registration (#131108)
`register_sharding` is an experimental API that allows users to register sharding strategies for an operator when the tensor inputs and outputs are :class:`DTensor`s. It can be useful when: (1) there doesn't exist a default sharding strategy for ``op``, e.g. when `op` is a custom operator that is not supported by `DTensor`; (2) when users would like to overwrite default sharding strategies of existing operators.

Here's an example:

        @register_sharding(aten._softmax.default)
        def custom_softmax_sharding(x, dim, half_to_float):
            softmax_dim = dim if dim >= 0 else dim + x.ndim
            acceptable_shardings = []

            all_replicate = ([Replicate()], [Replicate(), None, None])
            acceptable_shardings.append(all_replicate)

            for sharding_dim in range(x.ndim):
                if sharding_dim != softmax_dim:
                    all_sharded = (
                        [Shard(sharding_dim)],
                        [Shard(sharding_dim), None, None],
                    )
                    acceptable_shardings.append(all_sharded)

            return acceptable_shardings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131108
Approved by: https://github.com/wanchaol
2024-07-31 00:51:16 +00:00
31205d5198 [Inductor][CPP] Fix Local Buffer issue with inplace result line (#132018)
**Summary**
If a `global buffer` has been replaced by a `local buffer`, we add this `global buffer` to `removed_buffers` to avoid unnecessary allocation. However, a special case is when this `global buffer` can reuse a previous buffer. We didn't handle this case previously, which caused a functional failure in f151f25c0b/torch/_inductor/codegen/wrapper.py (L440)

In this PR, we resolve this issue by avoiding adding this global buffer to `V.kernel.inplace_update_buffers` when the buffer has been marked as `removed`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_local_buffer_with_line_reuse
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132018
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-31 00:38:17 +00:00
882d80fd92 Add lowering for updated _scaled_mm (fixing submodules) (#130422)
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in https://github.com/pytorch/pytorch/pull/128683.

The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in https://github.com/pytorch/pytorch/pull/125204) and Triton kernel configurations.

The Triton kernel template is based on 3ad9031d02 (D56337896) by @choutim, without using SPLIT_K, and that of mm `torch/_inductor/kernel/mm.py`

## Testing:
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast:
    - output code Evaluating m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row'
        - P1477224245 - 2 kernels
    - output code Evaluating m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row'
        - P1477227340 - 2 kernels

- UT `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking

Eager/compiled tensor-wise/row-wise scaling for various shapes:
https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the “compiled” cases are slightly slower than “eager”. It’s because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes:
https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers:
- Should the type of the accumulator `ACC_TYPE` always be in float32? If not, where is this type set (output layout?)?

## Todo:
- Make the Triton template use the improved persistent kernel version (https://github.com/pytorch/FBGEMM/pull/2735 by @htyu)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130422
Approved by: https://github.com/ipiszy
2024-07-30 23:48:48 +00:00
fdcd2f0dd1 [PT2][Optimus] Add unbind cat to view pass (#132152)
Summary: We observed a new graph transformation opportunity in IG_CTR, which can further remove the cat node.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/5061a3fe-b788-4031-b3af-66d48564a2df
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199298289131
Network: Up: 2.5GiB  Down: 5.7GiB  (reSessionID-a49b1234-c02c-4a2d-a9ad-9f5b23557522)
Jobs completed: 294061. Time elapsed: 13:47.8s.
Cache hits: 68%. Commands: 106996 (cached: 72904, remote: 33875, local: 217)
Tests finished: Pass 10. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 1649, 'pattern_matcher_count': 1538, 'normalization_pass': 343, 'extern_calls': 160, 'normalization_aten_pass': 39, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 9, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1})

before vs after graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1497865201

Differential Revision: D60325668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132152
Approved by: https://github.com/jackiexu1992
2024-07-30 23:27:18 +00:00
afb04d78c8 Don't try hard to compute alignment of unbacked expressions (#131649)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131649
Approved by: https://github.com/bdhirsh
2024-07-30 23:19:42 +00:00
5a33657b31 [micro_pipeline_tp] implement the pass for fused_scaled_matmul_reduce_scatter (#131951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131951
Approved by: https://github.com/weifengpy
2024-07-30 23:02:49 +00:00
524aac413c Initial OpInfo-based testing for NJTs (#131704)
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
    * `test_forward()` - compares forward output to an unbind-based reference
    * `test_backward()` - compares forward output and grads to an unbind-based reference
    * `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
    * `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
    * `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing

TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
2024-07-30 23:02:24 +00:00
93facac02c [NeuralNetInference] Bring up iOS builds (#131917)
Summary: Mirror Android setup to static link & use lite interpreter on iOS

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D60156611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131917
Approved by: https://github.com/cccclai
2024-07-30 23:01:09 +00:00
53a5e0f1a8 [BE] delete spmd module (#132072)
Summary:
As titled: fully delete the spmd module, as we have stopped working on it and the code is already broken, with no unit tests enabled.

We should not keep it in the codebase as it provides no value anymore, and it constantly burdens DTensor to maintain compatibility with it (i.e. code paths/imports).

Test Plan: sandcastle

Differential Revision: D60402105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132072
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/fegin, https://github.com/seemethere, https://github.com/albanD, https://github.com/yifuwang
2024-07-30 22:20:21 +00:00
a141334c88 mitigate wrong tensor.dim_order() (#131366)
Summary:
There are some issues with dim order creation; T194410923 has a detailed illustration.

One of the reasons is that the `is_contiguous` function may sometimes produce an ambiguous memory-format result (some tensors can be both channels_last and contiguous at the same time), and dim order generation relies on that memory-format result underneath as a shortcut.

To mitigate the issue, dim order now uses the shortcut if and only if the tensor belongs to a single memory format; otherwise, we still recalculate it.
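
A small illustration of the ambiguity (a sketch, not taken from the diff): a contiguous tensor with trailing size-1 dimensions also passes the channels_last check, so the memory-format shortcut alone cannot decide the dim order.

```
import torch

x = torch.randn(2, 3, 1, 1)  # contiguous, but also satisfies channels_last
print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # also True
print(x.dim_order())  # must be computed unambiguously despite the above
```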

Test Plan: CI

Differential Revision: D60056793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366
Approved by: https://github.com/ezyang
2024-07-30 21:58:15 +00:00
2b43fab555 [DTensor] Added naive support for nn.init.orthogonal_ (#132104)
Try to unblock https://github.com/pytorch/pytorch/issues/131991

- `nn.init.orthogonal_` uses `tensor.new`, which is the legacy factory function. We change this to `tensor.new_empty` (empty is okay since it will be immediately followed by `.normal_()` to fill the tensor) so that it preserves `DTensor`-ness.
- `nn.init.orthogonal_` uses QR decomposition (`aten.linalg_qr.default`) and `torch.diag` (calling into `aten.diagonal_copy.default`). For simplicity, we use naive replicate strategies for now. `aten.diagonal_copy.default` could do something more sophisticated for sharded inputs, but I would rather defer that to later due to the complexity. For `orthogonal_` support specifically, since the result of the QR decomp will be replicated, the input to `aten.diagonal_copy.default` will be replicated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132104
Approved by: https://github.com/albanD, https://github.com/wanchaol
2024-07-30 21:55:09 +00:00
3e142d766a [EZ] Make consistent with scale-config.yml (#132164)
Fix inconsistencies from test-infra's scale-config.yml file

To be followed up by https://github.com/pytorch/test-infra/pull/5513 which will catch such inconsistencies going forward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132164
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/zxiiro
2024-07-30 21:42:23 +00:00
69c34f6e4c Corrects Error Codes from cudaHostRegister (#132089)
Causing some terrible error messages, e.g.:

```
# printing directly: cudaError.???
# casting to int first: 712

Traceback (most recent call last):
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 15, in <module>
    main()
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 11, in main
    _create_cpu_state_dict(sd, share_memory=True, pin_memory=True)
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 436, in _create_cpu_state_dict
    ret = _iterate_state_dict(
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 143, in _iterate_state_dict
    ret = {
          ^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 144, in <dictcomp>
    key: _iterate_state_dict(
         ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 125, in _iterate_state_dict
    ret = tensor_func(iter_object, pg, device, companion_obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 428, in tensor_func
    succ == 0
AssertionError: Pinning shared memory failed with error-code: cudaError.???
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132089
Approved by: https://github.com/Skylion007
2024-07-30 21:42:00 +00:00
ff377e16ab Improve logging in the TSConverter (#132082)
Summary: Currently, running explain with TORCH_LOGS enabled causes duplicate log entries because explain uses the exact same code path as conversion. This PR disables logging when running explain, and moves all logging to convert() to prevent logging from __init__ when we are only using explain.

Test Plan: Manual testing with attached outputs.

Reviewed By: SherlockNoMad, angelayi

Differential Revision: D60199007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132082
Approved by: https://github.com/ydwu4
2024-07-30 21:37:44 +00:00
495d413519 Include code object of frame being compiled in stack (#132161)
This is pretty useful to have!

Test plan: https://internalfb.com/intern/fblearner/details/586653862/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132161
Approved by: https://github.com/oulgen
2024-07-30 21:33:27 +00:00
19db4f6014 [capture_triton] fix special kwargs path (#132143)
I didn't test this path when creating the orchestrator. This PR fixes
that path to work with capture_triton. The problem is that the value we
handle is an int (in the capture_triton path) but a ConstantVariable (in
the Dynamo triton path), so we abstract that difference out in the
orchestrator.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132143
Approved by: https://github.com/oulgen
2024-07-30 20:30:40 +00:00
1118c74b5f [PT2] Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes (#131902) (#132078)
Summary:

Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes

Test Plan: run new UTs

Reviewed By: frank-wei

Differential Revision: D60258724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132078
Approved by: https://github.com/frank-wei
2024-07-30 20:17:06 +00:00
d53b11bb6e Strict shape checking for NJTs with TestCase.assertEqual() (#131898)
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.

This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.

Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
2024-07-30 20:05:48 +00:00
58f76bc301 Revise skip torchrec logic (#130783)
Summary:
The previous logic added skipped files when the file was imported, which happens at a very early stage. However, we may set skip_torchrec at a later stage (e.g., in APS, we set it during trainer execution). In that case, the skip logic still takes effect since the skipped files have already been added.

So in this diff, we revise the logic so that it adapts to changes to skip_torchrec at later stages.

Test Plan:
Tested on APS models:

  buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher_live -- mode=local_ig_fm_uhm_mini model_name=ig_fm_one_sparse_benchmark features=ig_fm_one_sparse_benchmark model=ig_fm_one_sparse_benchmark training.pipeline_type=pt2

commit: 2fb485d9e

torchrec related paths were not skipped.

Differential Revision: D59779153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130783
Approved by: https://github.com/yanboliang
2024-07-30 19:55:20 +00:00
964f97539f [MPS] Correct nonzero warning and fix the test (#132127)
#125355 lifted the natively supported macOS version to 14.

Fixes #132110
Probably fixes this flaky test disabling issue: #126492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132127
Approved by: https://github.com/malfet
2024-07-30 19:46:25 +00:00
f2dedc910e Improve SpeculationLog error message (#131982)
There are some substantive changes. Instead of recording the *next* instruction in the speculation log, I record the *current* instruction. I think this is more intuitive: we always call speculation at the beginning of executing an instruction, so logically the entry is associated with the current instruction. (Note that self.instruction_pointer is the *next* instruction, as we conventionally increment the IP before calling speculate.)

The cosmetic change is to also pass in the Instruction corresponding to the IP and print it, and beef up the error message, including notes about the previous instruction that was run before it failed (this is typically the critical instruction).

At the time of submission, this test case triggered the error:

```
diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py
index 5ade17856e1..60ef89be346 100644
--- a/test/distributed/test_dynamo_distributed.py
+++ b/test/distributed/test_dynamo_distributed.py
@@ -844,6 +844,39 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
             for r in res[1:]:
                 self.assertEqual(res[0], r)

+    @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
+    @config.patch(enable_compiler_collectives=True)
+    def test_compiler_collectives_automatic_dynamic_speculation_divergence(self):
+        with _dynamo_dist_per_rank_init(self.rank, self.world_size):
+            torch._dynamo.utils.clear_compilation_metrics()
+
+            # TODO: This should be possible to do inside the function, but
+            device = f"cuda:{self.rank}"
+
+            @torch.compile()
+            def f(x, y):
+                zx = x.shape
+                zy = y.shape
+                return x.sum() + y.sum()
+
+            if self.rank == 0:
+                dataloader = [4, 4]
+            else:
+                dataloader = [3, 4]
+
+            for data in dataloader:
+                f(
+                    torch.randn(data, device=self.rank),
+                    torch.randn(data, device=self.rank),
+                )
+
+            metrics = torch._dynamo.utils.get_compilation_metrics()
+            # Number of compiles same on all nodes
+            res = [None] * self.world_size
+            torch.distributed.all_gather_object(res, len(metrics))
+            for r in res[1:]:
+                self.assertEqual(res[0], r)
+

 @requires_nccl()
```

although I plan to fix this soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131982
Approved by: https://github.com/anijain2305, https://github.com/mlazos, https://github.com/jansel
2024-07-30 19:21:31 +00:00
e6cddc9271 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-30 18:42:54 +00:00
f217b470cc [CMAKE] Avoid double setting of LDFLAGS (#130370)
It was observed that in some environments `LDFLAGS` gets directly appended to `CMAKE_SHARED_LINKER_FLAGS`. As a result, the same linker flag can appear twice in `CMAKE_SHARED_LINKER_FLAGS` due to this manual set:
1bf4a44b33/CMakeLists.txt (L541-L542)
This flag collision causes build failures at the `cmake` stage.
This PR adds an instruction to `CMakeLists.txt` to avoid setting `LDFLAGS` into `CMAKE_SHARED_LINKER_FLAGS` twice.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130370
Approved by: https://github.com/atalman, https://github.com/tinglvv, https://github.com/malfet
2024-07-30 18:16:04 +00:00
3816f6420a [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion here, where `** 0.5` is shown to be no slower than `math.sqrt`: https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075
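
A quick way to check this locally (illustrative only; absolute numbers vary by machine and Python version):

```
import timeit

# Compare exponentiation against math.sqrt on a scalar float.
print(timeit.timeit("x ** 0.5", setup="x = 2.0", number=1_000_000))
print(timeit.timeit("math.sqrt(x)", setup="import math; x = 2.0", number=1_000_000))
```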

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-30 18:08:17 +00:00
9f6d7df3d9 docs(multinomial): Add reference to Multinomial class (#131904)
This PR just adds the reference to the class
`torch.distributions.multinomial.Multinomial` in `torch.multinomial`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131904
Approved by: https://github.com/jbschlosser
2024-07-30 18:05:07 +00:00
239d4d2489 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 9606d61e0c921b886d20cb61454043c6c270ae89.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/ZainRizvi due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2258871791))
2024-07-30 17:39:41 +00:00
9027db1ab8 TCPStore: fix remote address (#131773) (#131913)
Summary:
This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo.

This relands it since it got reverted due to a fmt::format issue internally.

Original Pull Request: https://github.com/pytorch/pytorch/pull/131773
Approved by: https://github.com/kurman

Test Plan:
Enable debug logs and verify addresses are correct

```
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v
buck2 test @//mode/dev-nosan //caffe2/test/distributed:store
```

Differential Revision: D60296583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131913
Approved by: https://github.com/kurman, https://github.com/rsdcastro, https://github.com/Skylion007
2024-07-30 17:27:33 +00:00
3864a2d834 [profiler ut] Update event name in test_profiler.py (#131757)
Fixes #ISSUE_NUMBER
To support kernel names containing uppercase letters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131757
Approved by: https://github.com/aaronenyeshi
2024-07-30 17:15:31 +00:00
32c57e78ed Specialize sym node when used as device kwarg (#131811)
Fixes https://github.com/pytorch/pytorch/issues/131189.

We specialize the symint in python_arg_parser when it is used as a device kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131811
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/albanD
2024-07-30 17:11:57 +00:00
33ce9cf7f9 [FSDP2] Relaxed overlap timing check to avoid flakiness (#132116)
Trying to fix https://github.com/pytorch/pytorch/issues/131081

See https://github.com/pytorch/pytorch/issues/131081#issuecomment-2239443504 for detailed context. This PR is relaxing one assertion against the _baseline_ to try to fix the flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132116
Approved by: https://github.com/Skylion007
2024-07-30 14:28:12 +00:00
16e0868a3d [FSDP] Add hpu device to _get_remote_device_str (#132120)
When creating the chunked sharded tensor, `_get_remote_device_str` is used. By default it uses the node count to determine the device instance; for HPU, we need to use the current device to get the device instance.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132120
Approved by: https://github.com/awgu
2024-07-30 14:24:24 +00:00
a843178529 Let dynamo inline functional_call (#128646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646
Approved by: https://github.com/zou3519
2024-07-30 14:22:23 +00:00
12b67bd998 Fix pyi annotation for ProcessGroupGloo.Options (#132080)
This PR fixes the pyi annotation for `ProcessGroupGloo.Options` based on the definition in the `torch/csrc/distributed/c10d/init.cpp` file.

Fixes #132054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132080
Approved by: https://github.com/Skylion007
2024-07-30 13:52:31 +00:00
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207fe659d664c590efc06cc0a2abc96c0.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00
cyy
bdf57da6a6 [3/N] Enable clang-tidy on torch/csrc/inductor (#132101)
Follows #132040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132101
Approved by: https://github.com/Skylion007
2024-07-30 13:04:57 +00:00
cyy
eccbd408e5 [10/N] Fix clang-tidy warnings in jit (#132122)
Follows #132010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132122
Approved by: https://github.com/Skylion007
2024-07-30 12:56:31 +00:00
83db609ee5 [inductor] fix the cudagraph tree test (#132043)
Summary:
There are two kinds of exceptions:
Case #1:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 140315748992000 to 140315748993536. input stack trace:   File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1826, in forward
    return self.static_tensor + x + self.goo(x)
  File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1816, in forward
    return self.linear(x)

input name: primals_3. data pointer changed from 140315748990976 to 140315748993024. input stack trace:   File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
    self.static_tensor.add_(torch.ones((2, 2), device="cuda"))

```
Case #2:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 139852509086720 to 139852509088256. input stack trace: None
input name: primals_3. data pointer changed from 139852509085696 to 139852509087744. input stack trace:   File "/dev/shm/uid-30083/f61ee184-seed-nspid4026560782_cgpid769179-ns-4026560865/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
    self.static_tensor.add_(torch.ones((2, 2), device="cuda"))

```
The current implementation only covered case #2.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/15481123762274476

Differential Revision: D60340212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132043
Approved by: https://github.com/BoyuanFeng
2024-07-30 08:35:56 +00:00
36e8289129 [PT2][Optimus] Optimize cat node inputs pattern (#131866)
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes
```

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 1589, 'pattern_matcher_count': 1497, 'extern_calls': 393, 'normalization_pass': 342, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 12, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1})

P1496150856

Differential Revision: D60274533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131866
Approved by: https://github.com/jackiexu1992
2024-07-30 07:49:26 +00:00
54d4f6bbca [Inductor][FlexAttention] Correct partial/full blocks naming (#131993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131993
Approved by: https://github.com/drisspg
2024-07-30 06:40:40 +00:00
03e058189e [dynamo] Support dict unpack of MutableMapping objects (#131961)
Fixes https://github.com/pytorch/pytorch/issues/128067

The basic functionality was already introduced earlier. This just ensures
that we support UserDefinedObjectVariable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131961
Approved by: https://github.com/williamwen42, https://github.com/mlazos, https://github.com/yanboliang
ghstack dependencies: #131827, #131956
2024-07-30 05:49:58 +00:00
f806128619 [dynamo] Skip <frozen abc> to skip __isisintance__ check on abc objects (#131956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131956
Approved by: https://github.com/williamwen42, https://github.com/mlazos
ghstack dependencies: #131827
2024-07-30 05:49:58 +00:00
13457d1da0 [dynamo][log] Suggest to use pytree when graph-break on optree (#131827)
Discovered while working on https://github.com/pytorch/pytorch/issues/121369
On the model above, the log looks like this

~~~
/home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree._C.PyCapsule.flatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py.
  torch._dynamo.utils.warn_once(msg)
/home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree.PyCapsule.unflatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py.
  torch._dynamo.utils.warn_once(msg)
  ~~~
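
A minimal sketch of the suggested replacement (using the `torch.utils._pytree` helpers; the data structure here is made up for illustration):

```
import torch.utils._pytree as pytree

data = {"a": [1, 2], "b": (3, {"c": 4})}
leaves, spec = pytree.tree_flatten(data)   # traceable, unlike optree's C extension
rebuilt = pytree.tree_unflatten(leaves, spec)
assert rebuilt == data
```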

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131827
Approved by: https://github.com/zou3519, https://github.com/mlazos
2024-07-30 05:49:58 +00:00
fc6066b80f improve mkldnn_linear_pointwise_binary performance for contiguous tensor with non default contiguous strides (#132019)
Fixes https://github.com/pytorch/pytorch/issues/131734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132019
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-07-30 05:02:38 +00:00
40f8db5741 [audio hash update] update the pinned audio hash (#132105)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132105
Approved by: https://github.com/pytorchbot
2024-07-30 03:39:27 +00:00
aa1488fe02 [inductor] turn on enable_kernel_profile on Windows. (#132025)
Enable `TORCHINDUCTOR_CPP_ENABLE_KERNEL_PROFILE` on Windows inductor.

Tested locally and passing:
![image](https://github.com/user-attachments/assets/a82351af-cc56-4ba1-a8f4-08f1c38713d1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132025
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 03:02:09 +00:00
475da800c7 [inductor] optimize cflags for Windows. (#131980)
changes:
1. optimize cflags for Windows. Ref: https://github.com/pytorch/pytorch/blob/v2.4.0/torch/utils/cpp_extension.py#L215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131980
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:59:51 +00:00
bdc42e3fb8 [inductor] validate_can_generate_cpp_wrapper add win32 support. (#131978)
Changes:
1. `validate_can_generate_cpp_wrapper`: add win32 support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131978
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:59:48 +00:00
baa4c9ca46 Optimize aten.cat calls of a repeated element (#132081)
This was a particular problem for a model I saw, which had a large number of repeats, making compilation slow.
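
The pattern in question, sketched below (shapes and repeat count are made up):

```
import torch

@torch.compile
def f(x):
    # Concatenating the same tensor many times previously slowed compilation.
    return torch.cat([x] * 1024, dim=0)

out = f(torch.randn(4, 8))
```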

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132081
Approved by: https://github.com/shunting314
2024-07-30 02:56:00 +00:00
f8e4060484 [Inductor][CPP] Enhance cppcsevar data type deduce (#130827)
**Summary**
Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](096dc444ce/torch/_inductor/codegen/common.py (L1844))), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction:

- We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`.

- To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:51:31 +00:00
b6c1490cc0 [dynamo] make more unpack_var_sequence calls forced (#132069)
Fixes [T197204962](https://www.internalfb.com/intern/tasks/?t=197204962) (example failure: https://www.internalfb.com/intern/testinfra/diagnostics/11540474088277914.281475138576374.1722221031/)

Added tests contain a simple repro for the observed failure (`test_map_unpack_vars`).

Also fixes https://github.com/pytorch/pytorch/issues/132044

Differential Revision: [D60420335](https://our.internmc.facebook.com/intern/diff/D60420335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132069
Approved by: https://github.com/anijain2305
2024-07-30 02:30:08 +00:00
8721b21b38 Fix fake_tensor w/ non-view tensor (#132050)
Summary: This code was overly complex and was confusing some guards - basically, if a cached result tensor isn't a view, there's no reason to be messing with its storage.

Test Plan: unit tests pass

Differential Revision: D60387821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132050
Approved by: https://github.com/oulgen
2024-07-30 02:17:18 +00:00
9598c58618 Add config option to skip autotuning conv (#131839)
Requested internally because for some models the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-30 01:57:53 +00:00
5a2620302b [inductor] Replace self_cuda_time_total function calls with self_device_time_total for wrapper_bench (#131029)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131029
Approved by: https://github.com/shunting314
2024-07-30 01:57:39 +00:00
a147fa577b [MPS] Fix masked_fill_ in non_contiguous cases (#131957)
fixes #131285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131957
Approved by: https://github.com/DenisVieriu97
2024-07-30 01:34:48 +00:00
3716934b1a [Inductor] Refactor autotuning utils to compute max block sizes (#131730)
These OSS changes are part of a larger MTIA diff. The OSS part is a simple refactor that makes it easier to query max block sizes by the prefix of the grid dimension, e.g. `"X"`, as opposed to having to use separate functions for `get_xmax()`, `get_ymax()`, etc.

Differential Revision: D60195669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131730
Approved by: https://github.com/eellison
2024-07-30 01:04:53 +00:00
7a7dd8c29e Revert "[NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518)"
This reverts commit bcf5c68c18c6a109e1fa00829eea0428d44cfb6b.

Reverted https://github.com/pytorch/pytorch/pull/131518 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit (the final PR and diff must always be identical). Conflicts arise when that happens which block the diff train. Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131518#issuecomment-2257259839))
2024-07-30 00:55:10 +00:00
ab9791c0e3 [export] Add print_readable to unflattener (#128617)
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.

Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam

        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam);  x = rootparam = None

        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul);  mul = None
        bar: "f32[2, 3]" = self.bar(foo);  foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul);  mul = None

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param);  nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul);  mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer);  add = child2buffer = None
            return sub
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
2024-07-30 00:41:44 +00:00
2a4d9aa548 Disable expandable segments checkpointing internally (#132048)
Differential Revision: [D60388286](https://our.internmc.facebook.com/intern/diff/D60388286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132048
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-07-30 00:26:39 +00:00
be5e44192d Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)"
This reverts commit 8fe2bf212dc5e01b15cbe728958f940873230d64.

Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit.  Weird conflicts arise when that happens.  Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2257230717))
2024-07-30 00:18:22 +00:00
b1ccd0c407 [CI] Update environment varible setting for aarch64 (#132046)
Summary: JEMALLOC_LIB and core_number need to be set differently on aarch64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132046
Approved by: https://github.com/huydhn
2024-07-30 00:09:59 +00:00
e3dc20c94b [NJT] support cat backward (#132076)
cat_tensors_backward uses narrow_symint, so we need to support aten::narrow for NJT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132076
Approved by: https://github.com/davidberard98
2024-07-29 23:49:26 +00:00
5298acb5c7 Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132065)
Summary:
Original commit changeset: 1d8cfdcef69d

Original Phabricator Diff: D54134695

back out: D54134695

Test Plan: for more details, see https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc

Reviewed By: zw2326

Differential Revision: D60397377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065
Approved by: https://github.com/zw2326, https://github.com/qchip
2024-07-29 22:48:29 +00:00
8b507a922a Mode to emulate amp numerics (#131595)
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```

We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.

In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this much in practice; it could be done as a follow-up. Note: this only works with the CUDA backend right now.
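
A hedged sketch of turning the mode on; the flag name `emulate_precision_casts` is an assumption based on this PR's description and may not match the final config exactly.

```
import torch

# Assumed flag name; CUDA backend only, per the note above.
torch._inductor.config.emulate_precision_casts = True

@torch.compile
def fused(x, y):
    return (x * y).sigmoid()

if torch.cuda.is_available():
    out = fused(torch.randn(8, device="cuda", dtype=torch.bfloat16),
                torch.randn(8, device="cuda", dtype=torch.bfloat16))
```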

This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
2024-07-29 22:42:23 +00:00
884eadcd19 Fix multi grad hooks thread safety (#132055)
Thanks @awgu  for spotting this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132055
Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/albanD
2024-07-29 22:32:59 +00:00
e55e9d8126 Clear speculation log when restarting due to compiler collective (#131983)
The compiler collective can trigger an input to become dynamic, which
can trigger operations to be recorded to the graph, which would change
the speculation log entries (since they only start being recorded once
we have a non-empty output graph).  Test case triggers this situation.

Production instance:
https://www.internalfb.com/mlhub/pipelines/runs/mast/f584750649-TrainingApplication?job_attempt=2&version=0&env=PRODUCTION

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131983
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2024-07-29 22:32:10 +00:00
62b2e7a553 Revert "Add config option to skip autotuning conv (#131839)"
This reverts commit 3d4de8e96d0bb1fe19b25734a97a19dd85313692.

Reverted https://github.com/pytorch/pytorch/pull/131839 on behalf of https://github.com/eellison due to wrong config name ([comment](https://github.com/pytorch/pytorch/pull/131839#issuecomment-2257117221))
2024-07-29 22:31:51 +00:00
8fe2bf212d [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
2024-07-29 22:16:32 +00:00
d039b14207 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument `enable_gqa: bool` to the sdpa function call
- It gives meaning to the third-to-last (head) dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check that Query.size(-3) == Key.size(-3) == Value.size(-3), or that Query.size(-3) % Key.size(-3) == 0
- If the numbers of heads are not equal, the function adjusts the key and value tensors to match the query tensor's head count using repeat_interleave, facilitating correct and efficient computation in attention mechanisms (see the sketch after this list).
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
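
A small sketch of that equivalence (assuming the semantics described above; not code from the PR):

```
import torch
import torch.nn.functional as F

batch, q_heads, kv_heads, seq, dim = 2, 32, 8, 128, 64
q = torch.rand(batch, q_heads, seq, dim)
k = torch.rand(batch, kv_heads, seq, dim)
v = torch.rand(batch, kv_heads, seq, dim)

out_gqa = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)

# Reference: repeat each KV head for q_heads // kv_heads query heads.
rep = q_heads // kv_heads
out_ref = F.scaled_dot_product_attention(
    q, k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1), is_causal=True
)
torch.testing.assert_close(out_gqa, out_ref)
```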

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa.

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-29 21:49:06 +00:00
05a8540041 [cpp-wrapper] create null pointer for zero-size array (#132023)
Zero-size arrays are not supported by the C or C++ standards,
so we create a null pointer instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132023
Approved by: https://github.com/desertfire
2024-07-29 21:40:33 +00:00
d8358a2d86 Made register_multi_grad_hook return type RemovableHandle (#132074)
`_MultiHandle` is private. Let us return `RemovableHandle`, which is public.
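
A brief usage sketch; the returned handle is what is now typed as `RemovableHandle`.

```
import torch
from torch.autograd.graph import register_multi_grad_hook

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

def hook(grads):
    # grads is a tuple of per-tensor gradients (None for tensors without one)
    print([g is not None for g in grads])

handle = register_multi_grad_hook((a, b), hook)
(a * b).sum().backward()
handle.remove()  # RemovableHandle API
```
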
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132074
Approved by: https://github.com/soulitzer
2024-07-29 21:29:34 +00:00
d5e9fbb012 Revert "BE: reset dynamo before each test in test_module.py (#131372)"
This reverts commit 527901f054a947976dc587bb9cf72c86992b7c87.

Reverted https://github.com/pytorch/pytorch/pull/131372 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
a4723b566f Revert "BE: reset dynamo before each test in test_ops_gradients.py (#131397)"
This reverts commit ca8153ae6758fbf33cc767cfd0cb384b87b8d3ca.

Reverted https://github.com/pytorch/pytorch/pull/131397 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
bdf5a6dca9 Add decomposition for unsqueeze_copy (#130942)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942
Approved by: https://github.com/peterbell10
2024-07-29 21:13:37 +00:00
3c1562158e [BE] Fix torch.compile docstring formatting issues (#131837)
Fixes #131815

<img width="1098" alt="Screenshot 2024-07-25 at 6 58 39 PM" src="https://github.com/user-attachments/assets/d0f6edc3-419e-4096-803b-cecd45d8644b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131837
Approved by: https://github.com/williamwen42
2024-07-29 20:52:28 +00:00
dcb03106b7 [Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007)
Summary: as title

Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962

Differential Revision: D60335413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007
Approved by: https://github.com/hanzlfs, https://github.com/egienvalue
2024-07-29 20:47:18 +00:00
082d0b80ca Min and max NaN propagation fix in MPS backend (#130445)
Partial fix to issue #130295

Moves the min and max ops to use the NaN-propagating API in MPS to align with the PyTorch convention. Adds a regression test to validate that the fix achieves parity with the CPU backend.
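
A sketch of the parity being validated (the MPS availability gating is an assumption about how one would try this locally):

```
import torch

x = torch.tensor([1.0, float("nan"), 3.0])
print(torch.min(x), torch.max(x))  # NaN propagates on CPU

if torch.backends.mps.is_available():
    x_mps = x.to("mps")
    print(torch.min(x_mps), torch.max(x_mps))  # should now match the CPU result
```
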
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130445
Approved by: https://github.com/malfet
2024-07-29 20:09:15 +00:00
f44446e851 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #132053
2024-07-29 20:01:51 +00:00
4c2bcf92cb [inductor] Enable FX graph caching in OSS by default (#125863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-29 19:19:54 +00:00
484852c02b [Doc] update guide install mkl-static from conda to pip (#130026)
<img width="619" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4ac3ca68-57dc-42c7-ac7a-876dc377ebcf">

The Conda intel channel is not available now.
Use `pip` install instead of `conda`.

`Windows` and `Linux` are available:
Binary list: https://pypi.org/project/mkl-static/#files

`MacOS` is available for an older version:
https://pypi.org/project/mkl-static/2021.3.0/#files

TODO:
1. cherry-pick to `release/2.4` branch, @atalman .
2. fix it also in `release/2.3` branch: https://github.com/pytorch/pytorch/pull/131853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130026
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-07-29 19:19:15 +00:00
301ec32ae8 [EASY][TEST][CUDA] Fix typo in test_graph_make_graphed_callables_same_pool (#132059)
Per title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132059
Approved by: https://github.com/Skylion007
2024-07-29 19:15:37 +00:00
5cc34f61d1 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals` or "ci-test-showlocals"
present in the PR comment, the test config filter will set a environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
2024-07-29 18:53:14 +00:00
4694ee1ad2 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-29 18:53:14 +00:00
cyy
ab912b7fef [2/N] Fix clang-tidy warnings in inductor (#132040)
Follows #131979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132040
Approved by: https://github.com/Skylion007
2024-07-29 18:41:24 +00:00
cyy
c764ef6d53 [9/N] Fix clang-tidy warnings in jit (#132010)
Follows  #131997

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132010
Approved by: https://github.com/Skylion007
2024-07-29 18:38:35 +00:00
f389bca2e9 [dynamo][inline_inbuilt_nn_modules] Skip test_dpp_graphs for now (#132053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132053
Approved by: https://github.com/laithsakka
2024-07-29 17:59:47 +00:00
6c6fbb4691 Fix pyi annotation for ProcessGroupNCCL.Options (#130957)
Probably all the other options need updating too, but this is the one I
needed.  The accurate annotation was determined by reading
torch/csrc/distributed/c10d/init.cpp

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130957
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-07-29 17:46:01 +00:00
025242d065 [cpu-test] enable test_cpu_repro in fbcode (#132022)
Summary: This diff enables test_cpu_repro in fbcode

Test Plan: ci

Differential Revision: D60364517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132022
Approved by: https://github.com/desertfire
2024-07-29 17:45:26 +00:00
ca8153ae67 BE: reset dynamo before each test in test_ops_gradients.py (#131397)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_ops_gradients.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
2024-07-29 17:39:23 +00:00
527901f054 BE: reset dynamo before each test in test_module.py (#131372)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we did not see before merge. This PR only resets dynamo before each test in `test_module.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
2024-07-29 17:39:23 +00:00
bd1a29b158 [BE][Ez]: Update ruff to 0.5.5. Bugfixes and better LSP support (#132037)
Updates ruff to the latest and greatest, mainly better LSP support and bugfixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132037
Approved by: https://github.com/malfet
2024-07-29 16:57:13 +00:00
6cf493158e Revert "Enable FlashAttention on Windows (#131906)"
This reverts commit b90bc66766c3503c1f229660710a803488d53c16.

Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))
2024-07-29 16:49:23 +00:00
3d4de8e96d Add config option to skip autotuning conv (#131839)
Requested internally because, for some models, the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-29 16:43:58 +00:00
e73a4cb21f Revert "[pt2e][quant] Ensure BN node is erased after convert (#131651)"
This reverts commit eba2ffd278a004df8fd335328ab8ba00c978e471.

Reverted https://github.com/pytorch/pytorch/pull/131651 on behalf of https://github.com/ZainRizvi due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/131651#issuecomment-2256407968))
2024-07-29 16:42:24 +00:00
f72266ecea Revert "Let dynamo inline functional_call (#128646)"
This reverts commit 5aab1acc84ff4a4374c9ddd179be48b07c6c8a74.

Reverted https://github.com/pytorch/pytorch/pull/128646 on behalf of https://github.com/clee2000 due to the newly added test dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers [GH job link](https://github.com/pytorch/pytorch/actions/runs/10147452270/job/28058682000) [HUD commit link](5aab1acc84) is broken, probably a landrace since it passed on PR ([comment](https://github.com/pytorch/pytorch/pull/128646#issuecomment-2256375501))
2024-07-29 16:26:50 +00:00
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
e393c7fa05 Tighten torch.library.infer_schema input types (#130705)
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
  torch.library.custom_op (which makes it mandatory because it's easy to
  miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema

This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
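
A hedged usage sketch of the updated API (the exact printed schema string may differ):

```python
import torch
from torch.library import infer_schema

def my_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y

# mutates_args is keyword-only and mandatory; op_name is keyword-only.
schema = infer_schema(my_add, op_name="my_add", mutates_args=())
print(schema)  # expected to look like "my_add(Tensor x, Tensor y) -> Tensor"
```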

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
ghstack dependencies: #131777
2024-07-29 16:01:19 +00:00
957a89f56c Revert "[inductor] Fix unsoundness with negative-valued indexing expressions (#131761)"
This reverts commit 03760be2714c6ed3b4f44c4dc3ea016f557d8597.

Reverted https://github.com/pytorch/pytorch/pull/131761 on behalf of https://github.com/atalman due to Broke CI: inductor/test_cpu_cpp_wrapper.py::DynamicShapesCppWrapperCpuTests::test_linear_binary_dynamic_shapes_cpp_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10145214748/job/28051168920) [HUD commit link](03760be271) ([comment](https://github.com/pytorch/pytorch/pull/131761#issuecomment-2256287736))
2024-07-29 15:52:08 +00:00
ca254d145f [BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036)
Updates fmtlib to 11.0.2 which mainly includes minor bugfixes for edge cases such as move-only iterators and formatting on non-posix systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132036
Approved by: https://github.com/malfet
2024-07-29 15:50:00 +00:00
5aab1acc84 Let dynamo inline functional_call (#128646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646
Approved by: https://github.com/zou3519
ghstack dependencies: #129091, #130490
2024-07-29 15:41:03 +00:00
e0e4e84ef9 wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #129091
2024-07-29 15:41:03 +00:00
1e9cdf7d91 Relax constraints for creating a GenericContextWrappingVariable (#129091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-07-29 15:40:59 +00:00
6cbad37bee make _inductor.config.rocm.supported_arch set order deterministic for caching (#131921)
This fixes some AOTAutograd caching tests that were failing flakily internally because they would occasionally cache miss.

[T195598220](https://www.internalfb.com/intern/tasks/?t=195598220)

I found it by running some stress tests and diffing the AOT cache information on each run, and ended up with this diff (`rocm.supported_arch` order was changing from run to run, although apparently not in OSS):
```
--- tmpa.txt    2024-07-26 11:03:46.220924798 -0700
+++ tmpb.txt    2024-07-26 11:03:44.053586437 -0700
@@ -1,4 +1,4 @@
-Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74:
+Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh:
 [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False)
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False
@@ -184,7 +184,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False
 [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False
-[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'}
+[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'}
 [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False
@@ -231,7 +231,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[verbose_progress]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[warn_mix_layout]: False
 [a44txxznx23htuc7zxw7larc7yxpxzxmiqzloxznw7z2k2azqj3] inductor_config[worker_start_method]: fork
-Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74:
+Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh:
 [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False)
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False
@@ -417,7 +417,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False
 [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False
-[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'}
+[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'}
 [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False
```
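
A minimal sketch of the general fix pattern, assuming a hypothetical `stable_hash` helper (not the actual AOTAutograd cache code): unordered collections are normalized to a sorted form before they enter the cache key.

```python
import hashlib
import json

def stable_hash(config: dict) -> str:
    """Hypothetical helper: hash a config dict deterministically."""
    def normalize(value):
        # Sets have no defined iteration order, so sort them first, e.g.
        # {'gfx942', 'gfx940', 'gfx941'} -> ['gfx940', 'gfx941', 'gfx942'].
        if isinstance(value, (set, frozenset)):
            return sorted(value)
        return value

    payload = json.dumps({k: normalize(v) for k, v in config.items()}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Same contents, different insertion order -> same cache key.
assert stable_hash({"rocm.supported_arch": {"gfx941", "gfx942", "gfx940"}}) == \
       stable_hash({"rocm.supported_arch": {"gfx940", "gfx941", "gfx942"}})
```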

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131921
Approved by: https://github.com/jamesjwu, https://github.com/oulgen
2024-07-29 15:29:04 +00:00
14108c1677 Fix error handling in _triton.py (#132006)
On Windows, _triton.py produces a confusing error ("RuntimeError: Should never be installed") because Triton is not supported on Windows. This error is not caught by the current PyTorch exception handling. This pull request adds handling for that RuntimeError.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132006
Approved by: https://github.com/oulgen
2024-07-29 15:02:25 +00:00
be3eba382f [CI] Run perf test for perf_cpu_aarch64 (#132038)
Summary: Run perf test for perf_cpu_aarch64 instead of regular CI test (test_linux_aarch64).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132038
Approved by: https://github.com/malfet
2024-07-29 13:48:40 +00:00
c35f21e5fc Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 14158d892a2bd9b34edb5637f9a05217ea0330bd.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](14158d892a) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))
2024-07-29 13:19:38 +00:00
06fe99a097 Revert "[CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)"
This reverts commit dfa18bf3f39c5a90b48baf956e50fa7da4462d3d.

Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))
2024-07-29 13:09:41 +00:00
7ef927da15 Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 6de65d5dd4226b6bae15352b575c81a6750c819b.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/atalman due to Broke CI: dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10132084288/job/28016215101) [HUD commit link](6de65d5dd4) ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2255839646))
2024-07-29 12:48:27 +00:00
cyy
efca51e171 [8/N] Fix clang-tidy warnings in jit (#131997)
Follows #131996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131997
Approved by: https://github.com/Skylion007
2024-07-29 12:40:42 +00:00
eb9409511e Revert "support zb1p and zb2p algorithms (#130752)"
This reverts commit 8fe5b93667b60e37c12d288659a25cbd5ae53c79.

Reverted https://github.com/pytorch/pytorch/pull/130752 on behalf of https://github.com/atalman due to Broke Periodic CI: distributed/pipelining/test_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131472868/job/28014900187) [HUD commit link](8fe5b93667) ([comment](https://github.com/pytorch/pytorch/pull/130752#issuecomment-2255819078))
2024-07-29 12:40:00 +00:00
9d497887b8 Changes to support clang-19 (#131905)
Co-authored-by: pruthvistony <pruthvigithub@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131905
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2024-07-29 12:38:23 +00:00
cyy
b67811abda [1/N] Fix clang-tidy warnings in inductor (#131979)
Fixes clang-tidy warnings in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131979
Approved by: https://github.com/Skylion007
2024-07-29 12:37:56 +00:00
d47c470f47 [dynamo] implement var_getattr in UserFunctionVariable (#130413)
This PR adds `getattr` support to `UserFunctionVariable`. Although this usage is uncommon, it does appear in [Megatron's code](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/layers.py#L635).

```
def linear_with_grad_accumulation_and_async_allreduce(...):
    ....
    if not linear_with_grad_accumulation_and_async_allreduce.warned:
        ....
    ....

linear_with_grad_accumulation_and_async_allreduce.warned = False
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130413
Approved by: https://github.com/yanboliang
2024-07-29 08:29:59 +00:00
dfa18bf3f3 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is
present in a PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.
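
A rough sketch of how the flag could be consumed by a test runner (the helper name and the exact env-var value are assumptions):

```python
import os

def showlocals_args(use_pytest: bool) -> list:
    """Hypothetical helper mapping TEST_SHOWLOCALS to extra CLI flags."""
    if not os.environ.get("TEST_SHOWLOCALS"):
        return []
    # pytest and unittest use different flags for dumping local variables.
    return ["--showlocals", "--tb=long"] if use_pytest else ["--locals"]
```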

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
2024-07-29 07:40:42 +00:00
f151f25c0b BE: reset dynamo before each test in test_torch.py (#131388)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we did not see before merge. This PR only resets dynamo before each test in `test_torch.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
2024-07-29 04:57:34 +00:00
30e7fc0fe1 Cpp wrapper: set args to CppWrapperKernelArgs in cpp template kernel (#129557)
Fix the compilation error:
```cpp
/tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16*’ {aka ‘const c10::BFloat16*’}
  401 |     cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1);
      |                        ^~~~~~
      |                        |
      |                        at::Tensor
```

The generated code after the fix will be:
```cpp
cpp_fused_div_mm_0((bfloat16*)(arg2_1.data_ptr()), (bfloat16*)(constant2.data_ptr()), (bfloat16*)(_frozen_param1.data_ptr()), (bfloat16*)(buf1.data_ptr()));
```

Multiple changes are required for ABI compatible mode. Separate it into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557
Approved by: https://github.com/leslie-fang-intel
2024-07-29 04:01:17 +00:00
03760be271 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-29 03:14:13 +00:00
2a02b5cd22 [Intel GPU] Dispatch Stub support (#130019)
# Motivation
Structured codegen is beneficial because it makes it easier to decouple tensor meta setting from the kernel implementation. At present, XPU operators need to handle tensor metas in a hand-written way.

We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs. Based on that, XPU operators will be able to register kernel functors to operator stubs.

This is a prerequisite of PR #130082, where we will modify the codegen system to generate the XPU-specific source files and headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-07-29 02:18:52 +00:00
cyy
5b3b2b9cc7 [7/N] Fix clang-tidy warnings in jit (#131996)
Follows #131986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131996
Approved by: https://github.com/ezyang
2024-07-29 01:21:18 +00:00
cyy
ddd539ba6c [6/N] Fix clang-tidy warnings in jit (#131986)
Follows  #131969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131986
Approved by: https://github.com/ezyang
2024-07-29 00:49:08 +00:00
7b0e10f0e5 fix _MaskPartial when multiple embeddings coexist (#131264)
Previously, using `_MaskPartial` with multiple embeddings had the following issues:
1. Suppose an `nn.Embedding` has shape `[vocab_size, emb_size]`. When there is more than one embedding sharing the same `vocab_size` but with different `emb_size`s, they would not share `OpStrategy`, since each, when involved in computation, would have a different `OpSchema`; however, there would be a cache hit for redistribute (specifically `_gen_transform_infos` in `torch/distributed/_tensor/_redistribute.py` when doing `Replicate` -> `_MaskPartial`), because `_MaskPartial` only has `vocab_size` as `logical_dim_size` but not `emb_size` as an attribute. This cache hit is undesirable and would cause trouble when doing all-reduce/reduce-scatter on the new `_MaskPartial` in a separate `OpStrategy`. The error was reported in #130725. In this PR, we introduce `offset_shape` to represent the embedding's full shape, to avoid cache hits across embeddings of different shapes.
2. The second issue arises when we have two `nn.Embedding`s `emb1` and `emb2` with the same shape. There will be a cache hit not only in `_gen_transform_infos`, but also in `OpStrategy` generation. Previously, if we sequentially did `Replicate` -> `_MaskPartial` for both `emb1` and `emb2` and then did the reduction on the `_MaskPartial` of `emb1`, it would destroy the `MaskBuffer` and `emb2` would hit an error. This PR adds a `refcount` to the `MaskBuffer` so that it can be properly shared by multiple `nn.Embedding`s (a minimal sketch of the refcounting idea is shown below).
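
A minimal, self-contained sketch of the refcounting idea (illustrative only; the real `MaskBuffer` lives in DTensor's embedding ops):

```python
import torch

class MaskBuffer:
    """Hypothetical refcounted buffer shared by several _MaskPartial placements."""

    def __init__(self):
        self.data = None
        self.refcount = 0

    def materialize(self, mask: torch.Tensor):
        if self.data is None:
            self.data = mask
        self.refcount += 1

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            # Only the last user actually frees the buffer, so a reduction on
            # emb1's _MaskPartial no longer breaks emb2.
            self.data = None
```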

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131264
Approved by: https://github.com/wanchaol
2024-07-29 00:40:58 +00:00
0ab6551bcb [inductor] Handle NoneLayout in count_numel (#131645)
We're currently under-counting mutations from ExternKernel since they use `NoneLayout` which doesn't have an associated shape and dtype. Instead, we can get that information from the buffer being mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131645
Approved by: https://github.com/jansel
2024-07-28 23:02:22 +00:00
cyy
7c1fbc7fe9 [5/N] Remove unused parameter (#131998)
Follows #131291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131998
Approved by: https://github.com/ezyang
2024-07-28 21:29:06 +00:00
f901b02066 [Distributed] Do not expose nlohmann/json.hpp in public headers (#131925)
Move the `<nlohmann/json.hpp>` dependency, as well as the `NCCLTraceBuffer::getCollectiveTraceJson` and `NCCLTraceBuffer::dump_json` implementations introduced by https://github.com/pytorch/pytorch/pull/129505, from the header into the .cpp file. This relaxes the requirement on all downstream clients to depend on the library.

Fixes https://github.com/pytorch/pytorch/issues/130678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131925
Approved by: https://github.com/albanD, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #131922
2024-07-28 18:45:24 +00:00
75c8d59ea1 Remove mypy ignore from torch/_dynamo/variables/lazy.py (#131785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131785
Approved by: https://github.com/aorenste, https://github.com/zou3519
ghstack dependencies: #131786, #131870
2024-07-28 17:13:53 +00:00
7c29665f77 Remove mypy ignore from torch/testing/_internal/distributed/ (#131870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131870
Approved by: https://github.com/aakhundov
ghstack dependencies: #131786
2024-07-28 17:13:53 +00:00
2e4807575c Remove mypy ignore from torch/_dynamo/polyfill.py (#131786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131786
Approved by: https://github.com/aorenste, https://github.com/zou3519
2024-07-28 17:13:49 +00:00
cc512ea0f6 [inductor] Fix flaky tests in test_aot_inductor.py (#131994)
Summary:
The `test_model_modified_weights` in `test_aot_inductor.py` has been failing internally for a while. The behavior leading to the test failure was that, after updating the eager model's weights and recompiling the (CPU) model with AOTI, the output of the model was identical to the one before the weights were updated.

The root cause is here in Python:

8927fc209f/test/inductor/test_aot_inductor_utils.py (L69-L71)

which, in turn, instantiates the `Runner` object in C++, relying on `dlopen` for loading the *.so. The problem is that a repeated `dlopen` call does not reload the library from the same path unless `dlclose` is called in between the two `dlopen` calls. There is a `dlclose` in the `Runner`'s destructor, but it's not called, likely due to the way the loaded `runner` gets closed over in Python:

8927fc209f/test/inductor/test_aot_inductor_utils.py (L83-L94)

Here we add copying the *.so file to a unique temporary path right before loading it into a `runner`, to avoid the `dlopen` staleness described above. This fixes `test_model_modified_weights` and, hopefully, will help avoid similar errors in future tests.
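
A simplified sketch of the copy-before-load trick (paths and names are illustrative):

```python
import ctypes
import shutil
import tempfile
from pathlib import Path

def load_fresh(so_path: str) -> ctypes.CDLL:
    """Copy the shared library to a unique temporary path before dlopen-ing it,
    so a rebuilt .so at the original path is never served from a stale handle."""
    tmp_dir = Path(tempfile.mkdtemp())
    fresh_path = tmp_dir / Path(so_path).name
    shutil.copy(so_path, fresh_path)
    return ctypes.CDLL(str(fresh_path))
```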

Test Plan: Tested internally.

Differential Revision: D60348165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131994
Approved by: https://github.com/chenyang78
2024-07-28 16:55:22 +00:00
6de65d5dd4 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #131744, #131928, #131948
2024-07-28 13:23:00 +00:00
8927fc209f [inductor] Add type hints to functions in debug.py (#131836)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131836
Approved by: https://github.com/eellison
2024-07-28 04:54:22 +00:00
500aea8d50 Build PT aarch64 on arm runner (#131964)
Another fix is needed to address https://github.com/pytorch/pytorch/actions/runs/10118374576/job/27985575620.  The build needs to be done on an Arm runner to stay compatible with the Docker image.

### Testing

https://github.com/pytorch/pytorch/actions/runs/10118589329/job/27985670691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131964
Approved by: https://github.com/malfet
2024-07-28 04:50:38 +00:00
945bf78894 Revert "[BE] typing for decorators - fx/_compatibility (#131568)"
This reverts commit 193f62fde91ee20deb5ddcd9ff4593cd78d74c64.

Reverted https://github.com/pytorch/pytorch/pull/131568 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
b002ec61b6 Revert "[BE] typing for decorators - masked/_ops (#131569)"
This reverts commit aa58af8b43ad0e615415b4d754255f5be481d41a.

Reverted https://github.com/pytorch/pytorch/pull/131569 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a3ba405871 Revert "[BE] typing for decorators - library (#131570)"
This reverts commit 5731b486c87bedff69aa0264d6c934bf723eb513.

Reverted https://github.com/pytorch/pytorch/pull/131570 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a0abb77007 Revert "[BE] typing for decorators - distributed/_tensor/ops/utils (#131571)"
This reverts commit 4b985e6f803023ec301238d2b4bab4fbea4dd03c.

Reverted https://github.com/pytorch/pytorch/pull/131571 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a8a9882899 Implement fused_scaled_matmul_reduce_scatter for async-TP (#131950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131950
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831, #131832, #131833
2024-07-28 03:39:12 +00:00
0538a69a8d [micro_pipeline_tp] support all-gather -> _scaled_mm (#131833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131833
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831, #131832
2024-07-28 03:39:11 +00:00
492e9a4886 [micro_pipeline_tp] add support for type-erased all-gather pattern observed in DTensor + float8_experimental (#131832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131832
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831
2024-07-28 03:39:11 +00:00
fd5b7d4bf9 Revert "[BE] typing for decorators - _meta_registrations (#131572)"
This reverts commit bfe0079b72aa3ed315ae8f140c97a5826c401a65.

Reverted https://github.com/pytorch/pytorch/pull/131572 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
609447a626 Revert "[BE] typing for decorators - _jit_internal (#131573)"
This reverts commit f0f20f7e97716b4b077dca2a1a42930ccf990c1c.

Reverted https://github.com/pytorch/pytorch/pull/131573 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
4684b8e9d7 Revert "[BE] typing for decorators - _inductor/lowering (#131574)"
This reverts commit b2cbcf710b26c4cb92d810fff46b6ddcb8d10cbf.

Reverted https://github.com/pytorch/pytorch/pull/131574 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
07b7f51877 Revert "[BE] typing for decorators - _inductor/fx_passes/post_grad (#131575)"
This reverts commit 42dc5a47a157f9a441ceba53cf569cc42a640732.

Reverted https://github.com/pytorch/pytorch/pull/131575 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
6a0c3bae21 Revert "[BE] typing for decorators - fx/experimental/migrate_gradual_types/constraint_generator (#131576)"
This reverts commit 37d76c7d48353cff5ed0d868b7ca486ad092ceaf.

Reverted https://github.com/pytorch/pytorch/pull/131576 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
b1d640a2b7 Revert "[BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577)"
This reverts commit 5ee6a6dacc926da37ebe06e4206dcc307bf891f5.

Reverted https://github.com/pytorch/pytorch/pull/131577 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
d3c17fea90 Revert "[BE] typing for decorators - _library/custom_ops (#131578)"
This reverts commit c65b197b85aeee61ed4c09527a8f6eecf8c20e27.

Reverted https://github.com/pytorch/pytorch/pull/131578 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
065d0fe570 Revert "[BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579)"
This reverts commit 79f0c4dc04c7976b734767d64c4833932219dcfb.

Reverted https://github.com/pytorch/pytorch/pull/131579 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
5ced63a005 Revert "[BE] typing for decorators - utils/flop_counter (#131580)"
This reverts commit 81c26ba5ae1edf95da8f6956ae4b5ad23c9833c6.

Reverted https://github.com/pytorch/pytorch/pull/131580 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
2c4023d65f Revert "[BE] typing for decorators - _refs/nn/functional (#131581)"
This reverts commit dbf7c318b2dd4652467f11f4aaebaa3ed372e728.

Reverted https://github.com/pytorch/pytorch/pull/131581 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
e448f32944 Revert "[BE] typing for decorators - signal/windows/windows (#131582)"
This reverts commit 8689d377f9b60b70efa6608e654a3889f947f4d8.

Reverted https://github.com/pytorch/pytorch/pull/131582 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
d90f6b45c0 Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820)"
This reverts commit fb3ddafbcfe6de1c4b208c020bc5ff4c4c4faf79.

Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2254327833))
2024-07-28 03:26:14 +00:00
8f5cf46405 Revert "Fix public API tests (#131386)"
This reverts commit 91fcfd87600545c19b975bd6ea134f2f931bf84a.

Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))
2024-07-28 03:23:04 +00:00
cyy
7be0ce51b6 Fix handle serialization error (#131871)
Trying to serialise an std::string through the C API is a bug; this PR fixes the handle serialization accordingly.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131871
Approved by: https://github.com/Skylion007
2024-07-28 00:33:20 +00:00
3e0ccb3a9f Fixing fake tensor SymInt caching (#131966)
Summary: Some tests are failing because of a weird interaction between the symbolic sizes and the `set()` - back it out for now.

Differential Revision: D60320595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131966
Approved by: https://github.com/oulgen
2024-07-27 22:43:57 +00:00
d07a125af2 [Inductor] supporting pointwise intermediate nodes in B2B-GEMM (#131685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131685
Approved by: https://github.com/eellison
2024-07-27 20:11:20 +00:00
14158d892a [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-27 19:39:40 +00:00
466ea8ce54 Add fallback() to torch.library (#131707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131707
Approved by: https://github.com/zou3519
2024-07-27 18:02:35 +00:00
cyy
8e5a367311 [5/N] Fix clang-tidy warnings in jit (#131969)
Follows #131903
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131969
Approved by: https://github.com/ezyang
2024-07-27 17:54:20 +00:00
918ece4f4d [BE][Easy][11/19] enforce style for empty lines in import segments in test/dy*/ (#129762)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129762
Approved by: https://github.com/anijain2305
2024-07-27 17:43:53 +00:00
ae9f17a821 [aoti] Rename OSS DynamicArg and OpKernel (#131862)
Summary: Fixing P1495466240 which I think is due to the fact that internal also has an "OpKernel" in the same namespace, using thrift instead of json.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4785074844896831

Differential Revision: D60273354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131862
Approved by: https://github.com/desertfire
2024-07-27 17:34:50 +00:00
8cdfdb41bc Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)"
This reverts commit f862f457304f1952e75336f9f74e4ea3d2a5eb72.

Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/atalman due to broke CI: test_nestedtensor.py::TestNestedTensorSubclassCPU::test_layer_norm_with_lengths_requires_grad_False_components_require_grad_False_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10121747545/job/27996722731) [HUD commit link](f862f45730) ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2254167994))
2024-07-27 14:45:47 +00:00
07389163f0 [C10][BE] Use range loop (#131922)
Non-functional change that iterates over entries in `getCollectiveTraceJson` and uses `C10_UNUSED` rather than the `(void)i;` trick.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131922
Approved by: https://github.com/XilunWu
2024-07-27 11:26:27 +00:00
cyy
f83ef69b84 Fix typo in assignment operators (#131890)
Most typos were introduced in #131077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131890
Approved by: https://github.com/Skylion007
2024-07-27 11:13:42 +00:00
cyy
c82441e07a Fix std::optional checking bug (#131874)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131874
Approved by: https://github.com/Skylion007
2024-07-27 11:08:10 +00:00
93a4671746 Add out_dtypes to fused_all_gather_scaled_matmul's args (#131831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131831
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410
2024-07-27 11:07:43 +00:00
12cd040edd [micro_pipeline_tp] exclude simple overlappable collectives as micro-pipeline TP candidates when reorder_for_compute_comm_overlap is enabled (#131410)
When a collective can be hidden through either simple overlapping or micro-pipeline TP, we prefer simple overlapping to avoid the overhead associated with decomposition. If `reorder_for_compute_comm_overlap` is enabled, we identify collectives that can be hidden through simple overlapping and exclude them from micro-pipeline TP candidates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131410
Approved by: https://github.com/weifengpy
2024-07-27 11:07:43 +00:00
36d24925c6 [inline_inbuilt_nn_modules][inductor-cpu] More skips for dynamic shapes when inlining enabled (#131948)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131948
Approved by: https://github.com/eellison, https://github.com/leslie-fang-intel
ghstack dependencies: #131744, #131928
2024-07-27 10:03:49 +00:00
aee6bcdba4 [Traceable FSDP2][Inductor] Apply compute/comm reordering passes to achieve overlap (#131614)
This PR enables the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve overlap. Note that the overlap is not maximally optimized yet and the follow-up work will be done in subsequent PRs.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
2024-07-27 08:39:58 +00:00
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
cyy
99e13e68e9 [4/N] Fix clang-tidy warnings in jit (#131903)
Follows #131830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131903
Approved by: https://github.com/Skylion007
2024-07-27 08:08:14 +00:00
f862f45730 [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
2024-07-27 07:09:10 +00:00
bcf5c68c18 [NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518)
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
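
A hedged usage sketch of the new behavior (shapes and values are illustrative):

```python
import torch

# Two batch items with ragged lengths 3 and 5, i.e. a (B, *, M) jagged nested tensor.
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)],
    layout=torch.jagged,
)
# dim=1 is the ragged dimension; each batch item is normalized over its own length.
out = torch.softmax(nt, dim=1)
```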
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518
Approved by: https://github.com/davidberard98
2024-07-27 07:09:10 +00:00
c49e857d32 [pt] immutable accessors in graph signature (#131940)
Summary: splitting PT part of D60253955

Test Plan: existing tests

Differential Revision: D60296909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131940
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-07-27 05:32:53 +00:00
96c1862e0b Remove mypy ignore from torch/_dynamo/variables/__init__.py (#131784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131784
Approved by: https://github.com/aorenste, https://github.com/zou3519, https://github.com/Skylion007
2024-07-27 05:07:33 +00:00
1bfe7eb7e6 Update how we do sdpa testing (#131743)
## Motivation

This refactor aligns our testing methodology with the Flash Attention upstream repository while addressing several key issues:

1. **Standardized comparison**: We now compare fused kernels against float64 references, using the maximum of a calculated tolerance (based on same-precision math implementation) or standard float32 `atol`.

2. **Reduced redundancy**: Utilizing the same tensors for both same-precision math and fused kernel runs eliminates duplication.

3. **Improved maintainability**: The new approach simplifies tolerance adjustments across all affected tests.

4. **Consistency**: Standardizing tensor comparisons ensures a more uniform and reliable testing suite.

These changes collectively simplify our testing code, improve its maintainability, and provide a more robust framework for validating our attention mechanisms.
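
A simplified sketch of the tolerance recipe (the fudge factor and floor are illustrative, not the exact values used in the test suite):

```python
import torch

def sdpa_math(q, k, v):
    # Plain math implementation used only as a reference.
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def assert_fused_close(fused_out, q, k, v, fudge=4.0):
    ref_fp64 = sdpa_math(q.double(), k.double(), v.double())
    ref_lowp = sdpa_math(q, k, v)  # same-precision math reference on the same tensors
    calc_atol = fudge * (ref_lowp.double() - ref_fp64).abs().max().item()
    atol = max(calc_atol, 1e-5)    # never tighter than a standard float32 atol
    torch.testing.assert_close(fused_out.double(), ref_fp64, atol=atol, rtol=0)
```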

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131743
Approved by: https://github.com/jainapurva, https://github.com/jbschlosser
2024-07-27 03:58:49 +00:00
bcdba9f91d Added hpu backend support in fsdp utils (#127757)
In FSDP `init_utils`, add support for the HPU backend device in the `_get_device` API.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127757
Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu
2024-07-27 03:30:59 +00:00
28fd2e905d [inductor] enhance cpp_builder lint check. (#131752)
Enhance the cpp_builder `mypy` check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131752
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:27 +00:00
a90b8b967a [inductor] enable windows inductor UTs (#131767)
Changes:
1. Add `skipIfWindows` function.
2. Fix `fresh_inductor_cache` raising an error on Windows because loaded modules cannot be deleted.
3. Disable some UTs that do not pass on Windows.
4. Enable test_torchinductor in Windows CI.

I have tested this and it passes on my dev machine:
<img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb">

TODO: review and fix the skipped cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:03 +00:00
3768faec2f carry cond in data-dependent error (#131932)
Test Plan: existing

Differential Revision: D60302877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131932
Approved by: https://github.com/zhxchen17
2024-07-27 02:13:04 +00:00
9606d61e0c [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, because I can no longer debug it without access to the Meta internal environment.
3. Add `TODO` comments asking Meta employees for help to continue this work.
4. Because of item 3, only the `fb_code` use of `deprecated_cpp_compile_command` remains to be fixed, so remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 01:46:13 +00:00
fdf1451bfa Add __all__ to torch.optim to define public interface (#131959)
There was a regression in the public interface for `torch.optim` introduced in #125452 when `torch/optim/__init__.pyi` was merged into `torch/optim/__init__.py`. [The import aliases were not preserved and so now `pyright` thinks that these classes are not publicly exported from `torch/optim/__init__.py`.](https://github.com/pytorch/pytorch/pull/125452/files#diff-941595c1e1aa06bec94578499dd3654532a5183d0bc1bcd94d1f33b47e0d0adfL1-L15)

```
error: "SGD" is not exported from module "torch.optim"
```

Adding these classes/modules to `__all__` fixes this.
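
A minimal sketch of the fix in `torch/optim/__init__.py` (only two entries shown; the real list covers every optimizer and submodule):

```python
from torch.optim.adam import Adam
from torch.optim.sgd import SGD

__all__ = [
    "Adam",
    "SGD",
    # ... remaining optimizers and the lr_scheduler / swa_utils submodules
]
```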

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131959
Approved by: https://github.com/ezyang
2024-07-27 01:03:25 +00:00
8458980bbf Move benchmarks/dynamo/huggingface configuration to YAML (#131724)
Similar to https://github.com/pytorch/pytorch/pull/120299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131724
Approved by: https://github.com/shunting314
2024-07-27 00:55:04 +00:00
ef8d118c67 Sync with changes to test-infra's scale-config.yml (#131955)
This synchronizes lf-canary-scale-config and lf-scale-config with the versions in test-infra.

This really needs some automatic validation to prevent it from drifting out of sync over and over again (coming soon...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131955
Approved by: https://github.com/malfet
2024-07-27 00:25:40 +00:00
8b04edcac1 Delete unused yml files (#131298)
To be landed at least 3 days later after previous commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131298
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #130762
2024-07-27 00:21:22 +00:00
1e00f055a4 Move distributed experimental jobs back to the amazon2 for now (#131963)
Something about the new Amazon2023 AMI is making some distributed tests fail. Moving them back to the old AMI until the issue is fixed

These particular jobs are causing this test to fail:
https://github.com/pytorch/pytorch/issues/129539

More details in https://github.com/pytorch/pytorch/issues/131962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131963
Approved by: https://github.com/clee2000
2024-07-26 23:44:56 +00:00
91fcfd8760 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-26 23:38:43 +00:00
02b922900b [aoti] Fix float16 and bfloat16 for generated GPU code (#131437)
Fixes #131333

Summary:
- Add header to define `float16` and `bfloat16` as `at::Half` and `at::BFloat16`.
- Change `float16` and `bfloat16` to `float` before passing to the kernel.

code generated before:
```cpp
.....
    half var_1;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1));
....
```

code generated now:
```cpp
typedef at::Half half;
typedef at::BFloat16 bfloat16;
.....
    half var_1_tmp;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1_tmp));
    float var_1 = float(var_1_tmp);
....
```

Test plan: `TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_unspec_inputs_cuda`
Work in progress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131437
Approved by: https://github.com/desertfire
2024-07-26 23:36:11 +00:00
0272934238 [Inductor][CPU] Fix an InvalidVecISA issue on CI (#131812)
Summary: CPU CI nodes failed to find a valid VecISA because importing torch under the default pytorch directory fails with the following message, so we switch the cwd to a temporary directory.

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
    from torch.torch_version import __version__ as __version__
  File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
    from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
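
A simplified sketch of the workaround (the probe command is illustrative):

```python
import subprocess
import sys
import tempfile

# Run the "can we import torch?" probe from a temporary working directory so the
# source checkout's incomplete torch/ package is not picked up by the child interpreter.
with tempfile.TemporaryDirectory() as tmp_dir:
    subprocess.check_call(
        [sys.executable, "-c", "import torch; print(torch.__version__)"],
        cwd=tmp_dir,
    )
```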

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
2024-07-26 22:31:44 +00:00
5489ff8e94 Use Mermaid for the diagram in torch/ao/quantization/fx/README.md (#131412)
preview 3a0efcdfa3/torch/ao/quantization/fx/README.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131412
Approved by: https://github.com/jerryzh168
2024-07-26 22:01:21 +00:00
16cd1aaa1d [inductor] Improve sort kernel perf (#131719)
Closes #129507

This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the
   mask from that which saves an additional reduction in the sort
   kernel's inner loop.

In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
2024-07-26 21:56:47 +00:00
b90bc66766 Enable FlashAttention on Windows (#131906)
Let's just give this a try.

Reland of https://github.com/pytorch/pytorch/pull/131875.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906
Approved by: https://github.com/drisspg
2024-07-26 21:41:56 +00:00
d73b55d64b Support meta tensors as inputs to the triton_kernel_wrapper HOPs (#131896)
We automatically generate FakeTensor support for them (the FakeTensor
kernel for a triton kernel is "return None"). The same thing should
apply to the meta kernel.

Tests:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131896
Approved by: https://github.com/oulgen
2024-07-26 21:41:03 +00:00
fb98cd33f1 [inline_inbuilt_nn_modules][inductor-cpu] Skip test_quantized_linear_amx (#131928)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131928
Approved by: https://github.com/eellison
ghstack dependencies: #131744
2024-07-26 21:28:17 +00:00
c8626a4e1f [BE] add a list of inductor test files to skip resetting dynamo (#131551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551
Approved by: https://github.com/zou3519
2024-07-26 21:08:15 +00:00
fde577702d [TD] More synonyms for filepath (#131838)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131838
Approved by: https://github.com/PaliC, https://github.com/ZainRizvi
2024-07-26 21:02:42 +00:00
1bda3a3135 Migrate nightly.yml workflow & docs to Amazon 2023 (#131821)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

Migrates nightly jobs and the linux-docs job in pull.yml

To preserve reusability, I'm switching to a new format here that allows one to specify only the runner prefix instead of the full runner name, allowing multiple jobs to continue using the same base runner type as they did before.

**Validation:**
- Nightly builds passed in the prev commit: https://github.com/pytorch/pytorch/actions/runs/10102118461/job/27937632823?pr=131821
- Latest commit only updated the docs job in pull.yml, and that has already passed: https://github.com/pytorch/pytorch/actions/runs/10114635537/job/27974392472?pr=131821

The other in-progress jobs are irrelevant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131821
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-07-26 20:54:43 +00:00
0e6df1e0fb Disable remote cache on test (#131908)
Summary: Fixes test internally

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees -- --exact 'caffe2/test/inductor:cudagraph_trees - test_cache_hit_forward_miss_backward (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)'

Passes

Differential Revision: D60293177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131908
Approved by: https://github.com/clee2000
2024-07-26 20:19:02 +00:00
071ac38141 fast-path FakeTensor detach (#131899)
Fixes https://github.com/pytorch/pytorch/issues/128281, see investigation at https://github.com/pytorch/pytorch/issues/128281#issuecomment-2252976926.

benchmark:
```
python benchmarks/dynamo/huggingface.py --performance --timing --explain --backend aot_eager --device cuda --training --float32 --only BertForMaskedLM
```

time before:
```
TIMING: entire_frame_compile:30.85435 backend_compile:23.98599 total_wall_time:30.85435
```

time after:
```
TIMING: entire_frame_compile:24.35898 backend_compile:18.15235 total_wall_time:24.35898
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131899
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD
2024-07-26 20:16:08 +00:00
2ec8312a28 Add rerun_disabled_tests for inductor (#131681)
Test in prod?

This also turns on the mem leak check.

Briefly checked that
```
 python3 ".github/scripts/filter_test_configs.py" \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
  ]}
  " \
    --selected-test-configs "" \
    --pr-number "${PR_NUMBER}" \
    --tag "${TAG}" \
    --event-name "schedule" \
    --schedule "29 8 * * *" \
    --branch "${HEAD_BRANCH}"
```
has rerun disabled tests option in the test matrix

I don't think all these things need to run but I'm not sure which ones (probably just inductor?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681
Approved by: https://github.com/zou3519
2024-07-26 20:05:24 +00:00
da1a1fa55f Move load_yaml_file to common (#131924)
This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924
Approved by: https://github.com/shunting314, https://github.com/huydhn
2024-07-26 19:47:52 +00:00
6c95f79645 [CI] Increase the timeout for aarch64 docker build (#131926)
Summary: Increase the timeout limit for pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks. If slow builds become a problem later, we can upgrade the arm64 CI instance capability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131926
Approved by: https://github.com/avikchaudhuri
2024-07-26 19:27:45 +00:00
782efd8e5b Revert "Add rerun_disabled_tests for inductor (#131681)"
This reverts commit 85fa66be04b6f78139da4f0ec8f8b1956291e1c5.

Reverted https://github.com/pytorch/pytorch/pull/131681 on behalf of https://github.com/clee2000 due to this is the wrong file ([comment](https://github.com/pytorch/pytorch/pull/131681#issuecomment-2253318038))
2024-07-26 19:08:59 +00:00
0f9bf208ec Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 054d214c504b415b155ef2da1a70764a115e1276.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))
2024-07-26 19:03:10 +00:00
a3cdbd8189 [FlopCounterMode] Fix register_flop_formula (#131777)
Previously, FlopCounterMode would ignore any custom ops registered
through `register_flop_formula`. The problem was:
- register_flop_formula(target) requires target to be an OpOverloadPacket.
- register_flop_formula used register_decomposition to populate its registry
- register_decomposition decomposes the OpOverloadPacket into OpOverload before
  putting it into the registry
- FlopCounterMode ignores OpOverloads in its registry (it assumes the
  registry is a dictionary mapping OpOverloadPacket to flop formula).

register_decomposition is too heavy of a hammer, plus this isn't a
decomposition, so I changed the registration mechanism.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131777
Approved by: https://github.com/Chillee
2024-07-26 18:44:50 +00:00
cd53698df0 Add hpu backend support for dynamo torchVariable _in_graph_classes() function (#129948)
Fixes #ISSUE_NUMBER

A recent change from PR f657b2b1f8 (diff-4a52059570bb96333d8383ce6a9d01bbb114c5e34aff6028f820899ca39b5a26R80) hard-coded the flow to a CUDA stream in the in-graph function. For a non-CUDA backend (HPU in our case), this breaks the graph.

As part of this PR, we add HPU backend support to the dynamo variables function `_in_graph_classes()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129948
Approved by: https://github.com/yanboliang
2024-07-26 18:38:03 +00:00
5f2c80d16d Add inductor OrderedSet (#130003)
Implemented by extending `collections.abc.MutableSet` and backing it with a dictionary, which is ordered. From collections.abc.MutableSet:

```
    A mutable set is a finite, iterable container.

    This class provides concrete generic implementations of all
    methods except for __contains__, __iter__, __len__,
    add(), and discard().
```

In addition to implementing those methods I also had to define some methods of python's set which were not implemented in MutableSet.

I reused the tests from Python's standard library. A few of them didn't pass because of edge-case behavior that is not necessary to reimplement:
- support self-referencing repr
- erroring when an member's `__eq__` function would modify the set itself
- MutableSet supports Iterables as inputs, but not sequences (pretty rare..)
- Some specifics of exact equivalent type errors being thrown
- [The protocol for automatic conversion to immutable](https://docs.python.org/2/library/sets.html#protocol-for-automatic-conversion-to-immutable)
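
As a minimal sketch of the approach described above (a dict-backed `MutableSet` subclass; this is illustrative, not the actual inductor implementation):

```python
from collections.abc import MutableSet


class OrderedSet(MutableSet):
    """A set that remembers insertion order, backed by a dict (which is ordered)."""

    def __init__(self, iterable=()):
        self._dict = dict.fromkeys(iterable)

    # The abstract queries MutableSet requires us to provide.
    def __contains__(self, item):
        return item in self._dict

    def __iter__(self):
        return iter(self._dict)

    def __len__(self):
        return len(self._dict)

    # The abstract mutators.
    def add(self, item):
        self._dict[item] = None

    def discard(self, item):
        self._dict.pop(item, None)

    def __repr__(self):
        return f"OrderedSet({list(self._dict)!r})"


s = OrderedSet(["c", "a", "b", "a"])
s.add("d")
print(list(s))  # ['c', 'a', 'b', 'd'] -- insertion order preserved
```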

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130003
Approved by: https://github.com/aorenste
2024-07-26 18:16:57 +00:00
1dd10ac802 [BE] [Reland] Make nn.Module state_dict load_state_dict pre-hook and state_dict post-hook public (#131690)
Reland https://github.com/pytorch/pytorch/pull/126704

#### Fixes the issue with type of `nn.Module._state_dict_hooks` being changed in that PR which was problematic:
Instead of using `Tuple(Callable, bool)` to keep track of whether the private `_register_state_dict_hook` or the public `register_state_dict_post_hook` API was used to register the hook and toggle the behavior accordingly, I set an attribute on the Callable in the private API, which is never cleaned up.

If a callable previously registered using the private API is registered via the public API, a RuntimeError will be raised

#### Copied from previous PR description
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437

- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
   - Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`); a usage sketch of both public hooks follows after this list
    ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
 - For the issue raised by https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantic of returning a new state_dict only being respected for the root, for the private hook:
       - Document this for private `_register_state_dict_hook`
       - Remove this for the public `register_state_dict_post_hook`
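
To make the API change concrete, here is a rough usage sketch of the two public registration methods named above. The hook argument lists mirror the pre-existing private hooks and are an assumption, not a spec; treat this as a sketch.

```python
import torch
import torch.nn as nn

m = nn.Linear(2, 2)

# Post-hook: runs after state_dict() is computed; can tweak the result in place.
def drop_bias(module, state_dict, prefix, local_metadata):
    state_dict.pop(prefix + "bias", None)

m.register_state_dict_post_hook(drop_bias)

# Pre-hook: runs before load_state_dict() consumes the state_dict.
def restore_bias(module, state_dict, prefix, local_metadata, strict,
                 missing_keys, unexpected_keys, error_msgs):
    state_dict.setdefault(prefix + "bias", torch.zeros(2))

m.register_load_state_dict_pre_hook(restore_bias)

sd = m.state_dict()   # no 'bias' key thanks to the post-hook
m.load_state_dict(sd) # the pre-hook puts a bias back before loading
```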

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131690
Approved by: https://github.com/albanD
2024-07-26 18:14:07 +00:00
8158cf2f59 [c10d] Fix split_group usage when there is a single rank (#131824)
Summary:
This is a request from the xlformer team to allow single-rank PGs/comms.
Test Plan:
UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131824
Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj
2024-07-26 18:11:17 +00:00
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
e4db5dc1c4 Revert "[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)"
This reverts commit 4c7f22dee25649cd895bc382192d29f39e482215.

Reverted https://github.com/pytorch/pytorch/pull/131358 on behalf of https://github.com/janeyx99 due to Internal uses this private API and landing that has been a pain so we're reverting this first ([comment](https://github.com/pytorch/pytorch/pull/131358#issuecomment-2253190654))
2024-07-26 17:35:27 +00:00
2576dbbc35 [dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)
Fixes https://github.com/pytorch/pytorch/issues/112794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131725
Approved by: https://github.com/anijain2305
ghstack dependencies: #131413, #131716
2024-07-26 17:17:09 +00:00
35b4de32fa [dynamo] add itertools repeat/count bytecode reconstruction (#131716)
Also fix bugs in the count iterator variable implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131716
Approved by: https://github.com/anijain2305
ghstack dependencies: #131413
2024-07-26 17:17:09 +00:00
40cc5c0697 [AOT Autograd] Donated Buffer (#130580)
Implements the donated buffer feature and adds unit tests. A donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), or bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store them in `ViewAndMutationMetadata` so that they can be accessed in inductor.

Fixes #129496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580
Approved by: https://github.com/bdhirsh
2024-07-26 17:14:34 +00:00
9589d986fa [UT] Relax atol for test_non_contiguous_input_* (3 tests) (#131822)
BE task T195600898 (internal).

The 3 tests
```
test_non_contiguous_input_mm
test_non_contiguous_input_bmm
test_non_contiguous_input_addmm
```
had the following error in TestX:
```
self.assertTrue(torch.allclose(ref, act, atol=1e-2, rtol=1e-2))
AssertionError: False is not true
```

The tolerance comparing eager and compiled results is too small, perhaps because of a Triton update that changed numerics:
```
Mismatched elements: 25 / 38597376 (0.0%)
Greatest absolute difference: 0.015625 at index (3771, 509) (up to 0.01 allowed)
Greatest relative difference: 9.375 at index (13687, 48) (up to 0.01 allowed)
```

Change the absolute tolerance from 0.01 to 0.02. Also switch to use `torch.testing.assert_close` which prints out the greatest absolute/relative difference like above when the assert fails.

`test_non_contiguous_input_mm_plus_mm` has a different problem, just switching to `torch.testing.assert_close` to be uniform with the other tests.
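
For reference, the difference between the two assertion styles (tensor values below are illustrative only):

```python
import torch

ref = torch.randn(4, 4)
act = ref + 1e-3 * torch.randn(4, 4)

# Old style: a bare boolean assert, which reports nothing useful on failure.
assert torch.allclose(ref, act, atol=2e-2, rtol=1e-2)

# New style: on failure this prints the number of mismatched elements and the
# greatest absolute/relative differences, like the report quoted above.
torch.testing.assert_close(act, ref, atol=2e-2, rtol=1e-2)
```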

Test commands:
```
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_mm

python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_addmm

python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_bmm
```
Internal stress tests pass now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131822
Approved by: https://github.com/shunting314
2024-07-26 17:11:35 +00:00
161bb67116 Revert "Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)"
This reverts commit ace6decc9948e434dfe2e253bc28341bb22aa983.

Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/clee2000 due to unfortunately the internal pybind update got reverted cc @malfet ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2253147079))
2024-07-26 17:02:56 +00:00
c382fc3fea [Reland] Fix vulkan builds with missing overrides errors (#131760)
Followup after https://github.com/pytorch/pytorch/pull/131524

Add note explaining why C10 macros should not be used in that header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760
Approved by: https://github.com/atalman
2024-07-26 17:01:51 +00:00
1a2edf6dca [AOTI] Fix _mm_plus_mm codegen (#131689)
Summary: Fixes https://github.com/pytorch/pytorch/issues/128474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131689
Approved by: https://github.com/chenyang78
2024-07-26 16:50:12 +00:00
696e83a1da Revert "TCPStore: fix remote address (#131773)"
This reverts commit 9039131a89a5fdb8746bd86b0a4dd91559821e36.

Reverted https://github.com/pytorch/pytorch/pull/131773 on behalf of https://github.com/clee2000 due to broke internal builds D60265883, something about formatter ([comment](https://github.com/pytorch/pytorch/pull/131773#issuecomment-2253123800))
2024-07-26 16:47:57 +00:00
404a8ae8f6 [export] fix set_grad x tensor constant. (#131787)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/130379.

The original error is that the verifier finds the placeholder nodes' meta["val"] missing in the subgraph of the WrapSetGradEnabled HOP.

In this PR, we fix it by re-ordering the replace_set_grad_with_hop_pass to run after the lift_constant_tensor pass, because only after lift_constant_pass do all the constant attrs have meta["val"].

Test Plan: buck2 test test:test_export -- -r "test_setgrad_lifted_tensor"

Differential Revision: D60244935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131787
Approved by: https://github.com/yushangdi
2024-07-26 16:41:59 +00:00
bb64702eb3 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 520182dbffe09943be74a8a9cd58618fc171738f.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/clee2000 due to broke internal tests D60265910 ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2253113689))
2024-07-26 16:40:03 +00:00
d57de73fe0 AutoHeuristic: Add support for kernel choice selection (#131610)
This PR enables AutoHeuristic for kernel choice selection, where the feedback cannot be provided immediately when AutoHeuristic is called, but only after autotuning has happened. The steps are the following:

1. When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py.
2. After autotuning in select_algorithm.py has happened, and there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores the results.
I enabled AutoHeuristic for mixed_mm to have an example to test it on. We probably want to add more context, and also add an augment_context function. I will add support for this in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610
Approved by: https://github.com/eellison
2024-07-26 16:35:55 +00:00
a38890a53f Revert "[2/3] 3D Composability - move pp tests (#129801)"
This reverts commit 29571c5c06f6e5fd143d85c18d8a6b87d2e4e1d3.

Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](544f950d14) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2253099894))
2024-07-26 16:30:29 +00:00
13ab92b72d [dynamo][recompile-logs] Suggest force_parameter_static_shapes on the recompile log for parameter-related recomps (#131825)
Discovered in https://github.com/pytorch/pytorch/issues/121369

On the user-empathy-day model, the logs look like these
~~~
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8]    function: 'auto_repeat_tensors_for_time' (/home/anijain/local/lumiere-pytorch/lumiere_pytorch/lumiere.py:545)
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8]    last reason: 0/0: len(L['args']) == 1
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8]    function: 'forward' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:150)
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8]    last reason: 11/0: tensor 'L['x']' size mismatch at index 0. expected 16, actual 8
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8]    function: 'normalize_weight' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:127)
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8]    last reason: 40/1: tensor 'L['weight']' size mismatch at index 0. expected 64, actual 16. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8]    function: 'pack_one' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:38)
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8]    last reason: 58/1: tensor 'L['t']' stride mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8]    function: 'torch_dynamo_resume_in_pack_at_70' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/packing.py:70)
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8]    last reason: 62/0: tensor 'L['tensors'][0]' size mismatch at index 0. expected 16, actual 32. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8]    function: 'reshape' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/_backends.py:91)
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8]    last reason: 65/0: tensor 'L['x']' size mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
~~~~
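
The knob the log points at can be flipped like this (a sketch; whether allowing dynamism on parameters is acceptable depends on the model):

```python
import torch

# Allow dynamic shapes on parameters so the recompiles above stop
# (trade-off: more dynamic guards/compilation on parameter shapes).
torch._dynamo.config.force_parameter_static_shapes = False
```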

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131825
Approved by: https://github.com/ezyang
ghstack dependencies: #131795, #131801, #131804
2024-07-26 16:25:21 +00:00
7feaa73057 [export] Remove deprecated fields from ExportedProgram ctor. (#131697)
Summary: as title.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D60078426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131697
Approved by: https://github.com/ydwu4
2024-07-26 16:19:46 +00:00
546df5daf8 Revert "[3/3] 3D Composability - move tp dp tests (#129802)"
This reverts commit ec3829795dfb58a58ebc9ca241f7949efd60bfda.

Reverted https://github.com/pytorch/pytorch/pull/129802 on behalf of https://github.com/atalman due to Need to revert https://github.com/pytorch/pytorch/pull/129801 that got remerged ([comment](https://github.com/pytorch/pytorch/pull/129802#issuecomment-2253082995))
2024-07-26 16:19:25 +00:00
cyy
2988d33c80 [3/N] Fix clang-tidy warnings in jit (#131830)
Follows #131735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131830
Approved by: https://github.com/ezyang
2024-07-26 15:46:28 +00:00
5612408735 _get_operation_overload: dont raise exception when overload does not exist (#131554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131554
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #131403, #131482, #131665
2024-07-26 15:38:11 +00:00
eba2ffd278 [pt2e][quant] Ensure BN node is erased after convert (#131651)
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.

To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node

Reviewers: jerryzh168, yushangdi

Subscribers: jerryzh168, yushangdi, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi
2024-07-26 15:30:45 +00:00
9440a4824d [CI][dashboard] Add a workflow to collect A10g perf (#131816)
Summary: This is experimental work. Depending on the performance stability and benchmark coverage on A10g, we may consider using A10g for manually-triggered per-PR performance comparisons instead of exhausting expensive A100 instances.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131816
Approved by: https://github.com/huydhn
2024-07-26 14:36:14 +00:00
535c17efb3 [torch] Implement c10::BFloat16 ctor from __hip_bfloat16 (#131359)
Summary: Pretty straightforward. ROCm 6.2.0 changed the `__hip_bfloat16` API (see [this PR](481912a1fd)), so we gate the impl on the `__BF16_HOST_DEVICE__` macro to support older and newer versions of ROCm.

Test Plan: CI

Differential Revision: D60024830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131359
Approved by: https://github.com/houseroad
2024-07-26 14:30:49 +00:00
e4ace1a396 AOTDispatcher: properly bump version counter on input mutations in inference graphs (#131665)
This ensures that in an inference setting, we properly bump the VC of mutated graph inputs. Previously, we would only properly bump the VC for training graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131665
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #131403, #131482
2024-07-26 14:22:20 +00:00
5570a0da0a dont dispatch aten.conj(scalar_tensor) back to python (#131482)
https://github.com/pytorch/pytorch/issues/105290

The problem in the original flow is that:

(1) the user calls `torch.mul(complex_tensor, complex_scalar)`
(2) python arg parser wraps the complex scalar in a `scalar_tensor`, and dispatches to `aten.mul.Tensor(self, scalar_other)`
(3) autograd sees `aten.mul.Tensor`, calls `scalar_other.conj()` [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/FunctionsManual.cpp#L597)
(4) during proxy tensor tracing, this gets dispatched to `aten._conj(scalar_tensor)`
(5) when we hit __torch_dispatch__, the scalar_tensor is converted back into a plain python scalar
(6) we error during tracing, because in `FunctionalTensorMode.__torch_dispatch__` we try to redispatch on `aten._conj.default(plain_python_scalar)`, and this overload does not accept python scalars.

My attempted fix in this PR is to update `TensorBase::conj()` to check if the current tensor is a scalar tensor (wrapped number), and if so, manually:
(1) convert the scalar tensor back into a scalar
(2) call scalar.conj() directly
(3) convert the result back into a wrapped tensor

This avoids having to go through python entirely in the tracing case (which is fine, because these scalar tensors are constants that we can const-prop during tracing anyway).

Notably, I did **not** add e.g. a new `aten._conj.Scalar` overload. This would not actually fix the problem, since the bug is that we call `aten._conj.default(python_scalar)` directly. We would also need to muck with all `__torch_dispatch__` call sites to know to convert python scalars back into tensors directly.
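
A minimal sketch of the triggering pattern described in the numbered flow above (shapes, backend, and values are illustrative, not taken from the linked issue):

```python
import torch

def f(x):
    # complex tensor * complex python scalar: the arg parser wraps the scalar into
    # a scalar_tensor, and autograd's mul formula calls .conj() on that wrapped number
    return torch.mul(x, 2 + 3j)

x = torch.randn(4, dtype=torch.complex64, requires_grad=True)
out = torch.compile(f, backend="aot_eager")(x)  # hit the tracing error before this fix
```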

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131482
Approved by: https://github.com/zou3519, https://github.com/ezyang
ghstack dependencies: #131403
2024-07-26 14:22:20 +00:00
8bb9aa93a7 dynamo: mutations on .data should be invisible to autograd (#131403)
Fixes https://github.com/pytorch/pytorch/issues/121353

our handle for `.data` in dynamo today basically just converts `y = x.data` into `y = x.detach()`. The semantics of these two ops are not quite the same, because:

(1) any future mutations on `x.data` will be fully ignored by autograd
(2) any mutations on `x.detach()` will bump x's version counter

the linked model does a .data mutation that is hidden from autograd in eager, but ends up erroring during AOTDispatcher tracing.

I updated dynamo's handling so that:

(1) when dynamo sees a call to `getattr(tensor, "data")` and calls `.detach()` we set a flag on the returned `TensorVariable` indicating it came from `.data`

(2) on any tensor method that we call with an input `TensorVariable` with this flag turned on, we proxy autograd's `preserve_version_counter` logic into the graph, to properly reset the VC after the op is run.

One thing to note is that I don't actually do this on every op that we pass the tensor to: I only do it for tensor methods that appear to be mutations (by checking for a trailing underscore). My thought was that:

(1) I didn't want to do this for **every** op that you pass `y` into, since that will e.g. triple the number of nodes in the graph, and could cause compile time regressions if you use .data

(2) this situation is pretty rare in general, and I'm hoping that "tensor method mutations" cover most reasonable mutation cases. If we manage to miss a case, you will get a loud error during tracing anyway, so there is not a safety issue.
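
A small eager-mode illustration of the two semantics described above, using the internal `_version` counter only to observe the difference:

```python
import torch

x = torch.ones(3, requires_grad=True)
v0 = x._version

x.data.add_(1.0)              # hidden from autograd: x's version counter is unchanged
print(x._version == v0)       # True

x.detach().add_(1.0)          # visible to autograd: bumps x's version counter
print(x._version == v0 + 1)   # True
```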

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131403
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2024-07-26 14:22:20 +00:00
7339c8ab28 Revert "immutable accessors in graph signature (#131807)"
This reverts commit 6fd28fc228f900863d63b1c83912dcc000b084e3.

Reverted https://github.com/pytorch/pytorch/pull/131807 on behalf of https://github.com/atalman due to Broke CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10111847569/job/27965364355) [HUD commit link](608057afe2) ([comment](https://github.com/pytorch/pytorch/pull/131807#issuecomment-2252875417))
2024-07-26 14:21:12 +00:00
e76e566cfb [Dynamo] Support zip_longest (#131497)
Fixes #121348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131497
Approved by: https://github.com/mlazos, https://github.com/jansel, https://github.com/zou3519
2024-07-26 14:06:10 +00:00
c9888c2739 Revert "[BE] typing for decorators - optim/optimizer (#131583)"
This reverts commit a1dad77dfa4e244a867ca7c73e9f6b6fe36a1340.

Reverted https://github.com/pytorch/pytorch/pull/131583 on behalf of https://github.com/atalman due to Breaks CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10105959146/job/27947741162) [HUD commit link](a1dad77dfa) ([comment](https://github.com/pytorch/pytorch/pull/131583#issuecomment-2252784280))
2024-07-26 13:41:22 +00:00
7ee6831ae8 Revert "Fix vulkan builds with missing overrides errors (#131760)"
This reverts commit 7260eaeca056ffa013de769c10a2bfce9505d937.

Reverted https://github.com/pytorch/pytorch/pull/131760 on behalf of https://github.com/malfet due to Does not work with internal builds ([comment](https://github.com/pytorch/pytorch/pull/131760#issuecomment-2252783645))
2024-07-26 13:38:28 +00:00
d3e932dc10 [CI] Add inductor cpu accuracy test running on AVX2 runners (#128682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128682
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-07-26 13:24:41 +00:00
e73fa28ec8 [CI] Fix arm64 docker build arch (#131869)
Attempt to fix arm64 docker build arch on https://github.com/pytorch/pytorch/pull/131855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131869
Approved by: https://github.com/desertfire
2024-07-26 13:19:36 +00:00
608057afe2 [inductor] Fix duplicated range tree codegen in split scan (#131669)
Looks like in the halide codegen refactor, the range tree codegen was
split out from initialize_range_tree into its own function, but
triton_split_scan.py wasn't updated to reflect this change.

The result is that the codegen gets invoked twice, which is benign but makes
the kernel harder to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131669
Approved by: https://github.com/Chillee
2024-07-26 13:11:26 +00:00
945946e817 [AOTI] Fix another ABI-compatible CPU issue (#131798)
Summary: This problem is seen on AOTI CPU dashboard runs: a cpp compilation error because ConstantHandle::get doesn't exist. This PR adds ConstantHandle::get so that the interface is consistent with RAIIAtenTensorHandle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131798
Approved by: https://github.com/zou3519, https://github.com/chenyang78
ghstack dependencies: #131791
2024-07-26 11:27:58 +00:00
7d282d8755 [dynamo] add lazy IteratorVariable implementations for map and zip (#131413)
Fixes https://github.com/pytorch/pytorch/issues/130750.

Repro of lazy/eager `map` discrepancy without `islice`:
```python
    def fn(a, b):
        y = 1

        def f(x):
            nonlocal y
            y += 1
            return x

        l = list(zip([a, b], map(f, [1, 2, 3, 4])))
        return a + y
```

The major change is that we implement `MapVariable` and `ZipVariable` based on `IteratorVariable`. Before, `map` and `zip` were being traced by immediately unpacking the result as a `TupleVariable`, which is wrong in cases such as the example above.

`MapVariable`s are not allowed to be unpacked while `ZipVariable`s can only be unpacked if all of its iterables can also be unpacked.

We also add new `[has_]force_unpack_var_sequence` methods to `VariableTracker` for the case where it is safe to unpack the entire sequence lazily, e.g., when building a list from a map (i.e. `list(map(f, ...))`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131413
Approved by: https://github.com/anijain2305
2024-07-26 10:47:38 +00:00
115994fea2 [aotd] Align partitioner graph output type to tuple (#131759)
Brian debugged the difference in output type between the inference and training graphs.
The partitioner sometimes returns a list as the output type.

After this PR it will always return a tuple.

Some new graphs may land in tests between the time this PR's CI jobs finish and the time it lands.
That could easily be fixed with a fast-forward fix:
```
EXPECTTEST_ACCEPT=1 python test/test.py
```

Adding ciflows/periodic to minimize this probability

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-07-26 09:46:29 +00:00
1e24f7875e [AOTI] Fix ABI-compatible mode link issue for CPU (#131791)
Summary: Found this "cannot find -ltorch: No such file or directory" issue when collecting AOTI CPU perf for the dashboard. Debugging on the CI machine revealed two problems: 1) no valid VEC_ISA was picked; 2) when 1 happens, libtorch path is not specified in the linker path.

This PR fixes the second problem. A later PR will fix the first problem, but somehow finding the right VEC_ISA causes a performance regression, which needs more investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131791
Approved by: https://github.com/zou3519, https://github.com/chenyang78
2024-07-26 09:02:13 +00:00
6fd28fc228 immutable accessors in graph signature (#131807)
Test Plan: existing tests

Differential Revision: D60253955

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131807
Approved by: https://github.com/ydwu4
2024-07-26 08:56:19 +00:00
bceb91222c Fix meta error in _convert_weight_to_int4pack (#130915)
This PR is to fix meta error in _convert_weight_to_int4pack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130915
Approved by: https://github.com/jerryzh168
2024-07-26 08:36:30 +00:00
2bf649f5ae suggested fix for data-dependent error (#125378)
Suggests fixes for data-dependent errors in non-strict export.

Any data-dependent error has an unresolved condition on unbacked symints. A mechanizable strategy for fixing such errors, which this PR enables, is to "bash" them using `torch._check()`s. For each error we suggest using `torch._check()` on the condition or its negation. The user selects and copy-pastes the suggested fix and continues.

For example, here's an existing data-dependent error message with the suffix following `<snip>...</snip>` added by this PR:
```
Could not guard on data-dependent expression Eq(u2, u1) (unhinted: Eq(u2, u1)).  (Size-like symbols: u1)

<snip>...</snip>

User code:
  File "test/export/test_export.py", line 1944, in forward
    return r.view(items[0], items[2])

Suggested fixes (please choose one of the following):
  1. torch._check(items[2] == r.shape[1])
  2. torch._check(items[2] != r.shape[1])"
```

Tests in this PR illustrate this workflow, by taking common examples of data-dependent errors and bashing them until success, purely based on suggested fixes. In particular, we test this workflow on the "puzzlers" in https://www.internalfb.com/intern/anp/view/?id=5330476 (thanks @ezyang).

In terms of implementation, we focus on non-strict mode, where we can intercept torch function calls to install a handler that walks up the stack from the error, finding the closest non-torch frame and inspecting its locals for symints appearing in the error. The suggested fixes then access these symints through the local variables so that they can be (a) easily understood by the user (b) directly added to the code.

Implementing this idea in strict mode is follow-up work—we have already investigated what it would take, and decided to separate it out of this PR for reasons described next.

It's not too hard to map symints to locals in Dynamo (although it needs to happen elsewhere, i.e., intercepting torch function calls won't work). However, unfortunately this doesn't seem to be enough; the graph modules created by Dynamo when going through AOTAutograd can raise further data-dependent errors in some cases, and thus we need yet another mechanism to map symints to locals for graph modules, via captured source-level metadata and FX node walking. This latter component will require some care to build properly, or we might conclude it is altogether unnecessary and fix Dynamo instead.

Differential Revision: D56867432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125378
Approved by: https://github.com/ezyang
2024-07-26 08:34:50 +00:00
fb3ddafbcf [inductor] Add type hints to functions in mkldnn_fusion.py (#131820)
Summary: ATT

Test Plan: lintrunner


Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820
Approved by: https://github.com/eellison
2024-07-26 08:11:34 +00:00
13e806a591 [NestedTensor] Add support for transposed NestedTensors where ragged_idx > 1 for sum and mean operators (#131517)
Add support for transposed, non-contiguous `NestedTensor`s, where `ragged_idx > 1`, for the aten operators `sum` and `mean`. This diff enables reducing along the jagged dimension for non-contiguous `NestedTensor`s, transposed between non-batch dimensions as well as between a ragged and a non-batch dimension. For example, users can now reduce a `NestedTensor` of shape `(B, M, *, N)` along `*` or `(B, N, M, *)` along `*`.

Parametrize existing unit tests and add new unit tests verifying the accuracy of implementations on `NestedTensor`s that transpose between 2 non-batch dimensions as well as between a ragged and a non-batch dimension.

Differential Revision: [D59847927](https://our.internmc.facebook.com/intern/diff/D59847927/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131517
Approved by: https://github.com/davidberard98
2024-07-26 07:21:32 +00:00
63374dda69 [BE][Easy] explicitly define global constants in torch.testing._internal.common_utils (#129826)
This appeases IDE warnings like "torch.testing._internal.common_utils has no member TEST_WITH_ROCM".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129826
Approved by: https://github.com/Skylion007
2024-07-26 06:32:08 +00:00
aebfd3d4de [CUDAGraph] skip cudagraph if too many distinct sizes (#131387)
The current implementation records a new cudagraph for every distinct input size, which leads to significant overhead if there are too many distinct input sizes.

While we currently hint about re-recording cudagraphs due to dynamic shapes, the hint is at [info level](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py#L363-L366), which is easy to overlook and has led to several issues, such as Issue #119640 and Issue #128424.

This PR checks the number of cudagraphs recorded due to dynamic shapes and warns loudly if the count exceeds a threshold, `cudagraph_dynamic_shape_limit` (=50).

Fixes #119640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131387
Approved by: https://github.com/eellison
2024-07-26 06:17:35 +00:00
16d7cb5049 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 06:14:06 +00:00
dfba85c26b Update torch-xpu-ops pin (ATen XPU implementation) (#131643)
# Motivation
Regular update.
1. Some new ATen ops support
2. ABI=0 build support
3. Remove dispatched implementation of pin_memory&is_pinned
4. Enhance deterministic usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131643
Approved by: https://github.com/EikanWang
2024-07-26 05:51:58 +00:00
baa93e160f [MPS] Add native implementation for shift ops (#131813)
Similar to how AND/OR/XOR ops are implemented

TODO: Consider using MPS method calls rather than metal kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131813
Approved by: https://github.com/manuelcandales
2024-07-26 05:01:20 +00:00
a1dad77dfa [BE] typing for decorators - optim/optimizer (#131583)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131583
Approved by: https://github.com/janeyx99
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581, #131582
2024-07-26 05:00:07 +00:00
8689d377f9 [BE] typing for decorators - signal/windows/windows (#131582)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131582
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581
2024-07-26 05:00:07 +00:00
dbf7c318b2 [BE] typing for decorators - _refs/nn/functional (#131581)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131581
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580
2024-07-26 05:00:03 +00:00
81c26ba5ae [BE] typing for decorators - utils/flop_counter (#131580)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131580
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579
2024-07-26 04:59:58 +00:00
33069630ce [inductor] Add type hints to functions in decompositions.py (#131780)
Summary: ATT

Test Plan: lintrunner


Pull Request resolved: https://github.com/pytorch/pytorch/pull/131780
Approved by: https://github.com/eellison
2024-07-26 04:50:23 +00:00
5b05ad9697 fix non-persistent buffers (#131756)
Summary:
Dynamo doesn't track whether buffers are `persistent`. This led to some ugly code where we would mark buffers as always persistent when creating signatures, then later check whether the buffers were not in the state dict to infer whether they were non-persistent, and use this to fix up the signature.

This PR instead defines a utility to look up all the non-persistent buffers registered inside a module (this information is recorded in a private `_non_persistent_buffers_set` module attribute), and uses it to (a) correctly set the persistent flag on buffers when creating signatures (b) transfer this information to a Dynamo-traced graph module, which then causes non-persistent buffers to (correctly) not show up in the state dict.
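
For context, this is the behavior of persistent vs. non-persistent buffers in eager PyTorch (a small sketch; the module names are illustrative):

```python
import torch
import torch.nn as nn

class Sub(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("running", torch.zeros(3))                     # persistent
        self.register_buffer("scratch", torch.zeros(3), persistent=False)   # non-persistent

m = nn.Sequential(Sub())
print("0.running" in m.state_dict())      # True
print("0.scratch" in m.state_dict())      # False: non-persistent buffers are excluded
print(m[0]._non_persistent_buffers_set)   # {'scratch'}, the private set mentioned above
```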

Test Plan: existing tests + new case with non-persistent buffer in nested module

Differential Revision: D60224656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131756
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
2024-07-26 04:45:30 +00:00
a617919541 [dynamo] Do not guard on keys for _forward_hooks and _forward_pre_hooks (#131682)
Fixes https://github.com/pytorch/pytorch/issues/125836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131682
Approved by: https://github.com/bdhirsh
2024-07-26 04:39:54 +00:00
3d7c424a75 [inductor] update users to buffers instead of scheduler nodes (#131796)
After a recent refactoring of inductor, `.users` are now associated with buffers instead of scheduler nodes.

In `debug.py`, one such usage of `.users` is not updated accordingly, and the change here fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131796
Approved by: https://github.com/yf225
2024-07-26 03:34:26 +00:00
6dbf343936 Fix aten implementation for low memory max_pool2d (#131717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131717
Approved by: https://github.com/peterbell10
2024-07-26 03:23:16 +00:00
c2f3266c8e Not remove collective ops in dce since they have side-effect (#131023)
Fixes #130918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131023
Approved by: https://github.com/yf225
2024-07-26 03:03:32 +00:00
e0d3e4a498 remove unused code for XPU (#131856)
# Motivation
This PR aims to remove unused code in PyTorch for XPU, following https://github.com/pytorch/pytorch/pull/128179
Otherwise, CI will block without this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131856
Approved by: https://github.com/EikanWang
2024-07-26 02:57:12 +00:00
236d055330 [Traceable FSDP2] Add partial-graph (graph-break) unit tests (#131747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131747
Approved by: https://github.com/bdhirsh
2024-07-26 02:51:57 +00:00
03f49c9523 Revert "[CUDAGraph] Type annotation for cudagraph_trees.py (#131621)"
This reverts commit 16699c7d848fca669865d83ffff205bcbb8665be.

Reverted https://github.com/pytorch/pytorch/pull/131621 on behalf of https://github.com/atalman due to lint is failing, please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/131621#issuecomment-2251831163))
2024-07-26 02:08:45 +00:00
16699c7d84 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 01:40:23 +00:00
2ff98bc57f [inductor][autotune_at_compile_time] fix some codegen-ing for standalone autotuning file (#131726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131726
Approved by: https://github.com/desertfire
ghstack dependencies: #131253
2024-07-26 00:58:04 +00:00
b343644f3a Revert "MTIA equivalent of torch.cuda.memory_stats (#131673)"
This reverts commit 513ce5f69a7f53742b7aa5798082dd158beec2ed.

Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))
2024-07-26 00:54:37 +00:00
b893a57f96 [Dynamo] Fix guard_on_nn_modules unit tests discrepancy between OSS and fbcode (#131810)
Fixes Meta internal task: [T195592220](https://www.internalfb.com/intern/tasks/?t=195592220)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131810
Approved by: https://github.com/zou3519
2024-07-26 00:24:46 +00:00
246e32055a [benchmark] Add hf_T5_generate to inline_inbuilt_nn_modules (#131804)
Fixes https://github.com/pytorch/pytorch/issues/121989

We are turning on the flag by default in another PR. But that PR can go
through reverts. So, forcibly adding the benchmark to prevent dashboard
fluctuation in case of reverts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131804
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #131795, #131801
2024-07-26 00:20:42 +00:00
c92f2a19a4 [BE] Use assertEqual in MultiKernel tests (#127725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127725
Approved by: https://github.com/lezcano
ghstack dependencies: #131044, #127724
2024-07-26 00:12:43 +00:00
9ae288f4be [inductor] Simplify multi-kernel codegen by unifying kernel args (#127724)
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi kernel's codegen more complicated as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.

Instead, this PR updates the persistent kernel's `must_keep_buffers` so these
aren't dropped during codegen so both kernels have the same signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
2024-07-26 00:12:43 +00:00
14920c149b Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 0455344777f354dcbbd8e661a46ca2ca20e8a913.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_quantized_linear_amx_dynamic_shapes_batch_size_16_in_features_4_out_features_64_bias_True_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/10102272826/job/27938970118) [HUD commit link](0455344777) not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2251609554))
2024-07-26 00:12:40 +00:00
adbe4f5ecf TCPStore: add better logging on wait timeout (#131808)
This makes TCPStore `wait` timeout print actually useful info instead of a generic `Socket Timeout` message on timeout.

Bonus:

* fix weirdness where `connect_timeout` only supported seconds unlike the rest of our timeouts (thus the minimum timeout was 1s)
* Fixed tests that used a 10s timeout (test_store now only takes 20s instead of 40s)

Ex:

```
DistStoreError: wait timeout after 100ms, keys: /the_key
```
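
A hedged sketch of how such a timeout might be triggered from Python (host, port, and key are placeholders; the exact exception type is not specified here):

```python
from datetime import timedelta
import torch.distributed as dist

# Standalone single-rank store; host/port are placeholders for illustration.
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      wait_for_workers=False,
                      timeout=timedelta(milliseconds=100))
try:
    store.wait(["/the_key"], timedelta(milliseconds=100))
except Exception as e:
    # After this change the message names the timeout and keys,
    # e.g. "wait timeout after 100ms, keys: /the_key".
    print(e)
```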

Test plan:

```
python test/distributed/test_store.py
python test/distributed/test_c10d_gloo.py -v -k timeout
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131808
Approved by: https://github.com/kurman
2024-07-25 23:54:41 +00:00
e9443860e7 add python binding for _get_current_graph_task_keep_graph (#131038)
Inductor would like a way to have activations that do not escape the backward graph marked as "donated", so we can re-use their memory during memory planning here: https://github.com/pytorch/pytorch/pull/130580

For this to be safe though, we need to know at runtime that autograd does not plan to retain the current autograd graph (either for another call to .backward() later, or if double backward is being used). In the linked PR, the current plan is to error when we detect this situation, and ask the user to turn off the donated buffer config (although if/once we get to the point of always delaying backward compilation to runtime, we can just wait until we know the runtime value to compile).

There isn't a way to know if the currently running backward is run with `retain_graph=True` from python - @soulitzer helped me figure out where to grab it so I added a python binding for it under `ctx.is_retain_graph()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131038
Approved by: https://github.com/soulitzer
2024-07-25 23:50:40 +00:00
cyy
eac83479cc Enable Wunused-function and Wunused-result globally (#131596)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131596
Approved by: https://github.com/zou3519
2024-07-25 23:50:12 +00:00
2a4ca5ccc4 [dynamo] Pop the exception stack on handling the StopIteration natively (#131801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131801
Approved by: https://github.com/yanboliang
ghstack dependencies: #131795
2024-07-25 23:33:19 +00:00
11673851d9 [dynamo][exception][bugfix] Add a pop for < 3.11 version (#131795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131795
Approved by: https://github.com/yanboliang
2024-07-25 23:33:19 +00:00
f885a70fab [inductor][autotune_at_compile_time] support Triton kernel with sympy fn str arg (#131253)
## What is a sympy fn str arg?
It's a string such as `sqrt` that also happens to name a real sympy function (e.g. `sympy.sqrt`).

## Crash

```
torch/_inductor/sizevars.py", line 468, in symbolic_hint
    expr = self.simplify(expr)        # where expr is 'sqrt'
torch/_inductor/sizevars.py", line 66, in simplify
    return sympy.expand(expr).xreplace(self.replacements)
sympy/core/function.py", line 2816, in expand
    return sympify(e).expand(deep=deep, modulus=modulus, **hints)
AttributeError: 'function' object has no attribute 'expand'
```
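
A minimal repro of the same failure mode (assuming sympy resolves the bare string `sqrt` to the `sympy.sqrt` function during sympification, as the traceback suggests):

```
import sympy

# The string "sqrt" sympifies to the sqrt *function*, not a Symbol,
# and plain function objects have no .expand() method.
sympy.expand("sqrt")  # AttributeError: 'function' object has no attribute 'expand'
```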

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253
Approved by: https://github.com/desertfire
2024-07-25 23:31:20 +00:00
b4b62d3945 update to 2.5.8 (#131684)
# Summary
This stack brings the current fork of FAv2 near the top of main, which is at 2.6.2.

Notably, we need to update cutlass to 3.5.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131684
Approved by: https://github.com/jainapurva
2024-07-25 23:15:03 +00:00
51f4f87718 [Reland] Ensure staticmethods can be allowed in graph (#131789)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131789
Approved by: https://github.com/anijain2305
2024-07-25 22:54:18 +00:00
4de85e3c30 [DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636)
We recently revisited the hash implementation and think the `_parent_mesh` information should not be baked into DeviceMesh, but rather inferred from the MeshEnv that manages device meshes.

Since `mesh_dim_names` is considered in the device mesh's hash, this should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799
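
For reference, a small hedged example of the named-mesh construction this affects (standard `init_device_mesh` API; the parent relationship is the piece now tracked by MeshEnv rather than stored on the mesh):

```
from torch.distributed.device_mesh import init_device_mesh

# Assumes torch.distributed is initialized with 8 ranks; illustrative only.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh_2d["tp"]  # child mesh; its parent is inferred via MeshEnv,
                         # not read from a _parent_mesh attribute
```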

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636
Approved by: https://github.com/wanchaol
2024-07-25 22:47:22 +00:00
79f0c4dc04 [BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131579
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578
2024-07-25 22:24:19 +00:00
c65b197b85 [BE] typing for decorators - _library/custom_ops (#131578)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131578
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577
2024-07-25 22:24:19 +00:00
5ee6a6dacc [BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131577
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576
2024-07-25 22:24:19 +00:00
37d76c7d48 [BE] typing for decorators - fx/experimental/migrate_gradual_types/constraint_generator (#131576)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131576
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575
2024-07-25 22:24:19 +00:00
42dc5a47a1 [BE] typing for decorators - _inductor/fx_passes/post_grad (#131575)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131575
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574
2024-07-25 22:24:19 +00:00
1966 changed files with 37434 additions and 30678 deletions

View File

@ -1 +1 @@
48da61aa34b73ea8e2ee815a6a79eea817e361db
91298923a0076c1b41059efb6dad2876426e4b03

View File

@ -21,9 +21,8 @@ RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
# EPEL for cmake
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
RUN yum --enablerepo=extras install -y epel-release
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake

View File

@ -29,7 +29,7 @@ RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/re
# Install cuda and cudnn
ARG CUDA_VERSION
RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

View File

@ -29,9 +29,7 @@ RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
RUN yum --enablerepo=extras install -y epel-release
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
@ -117,7 +115,8 @@ RUN yum install -y \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image

View File

@ -93,7 +93,8 @@ RUN yum install -y \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image

View File

@ -87,10 +87,10 @@ RUN yum install -y \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image

View File

@ -50,7 +50,7 @@ RUN bash ./install_lcov.sh && rm install_lcov.sh
# Install cuda and cudnn
ARG CUDA_VERSION
RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

View File

@ -44,15 +44,19 @@ time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compi
time python test/run_test.py --verbose -i distributed/test_device_mesh
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# Pipelining composability tests
time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx

View File

@ -316,6 +316,8 @@ test_inductor_distributed() {
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
@ -428,7 +430,6 @@ test_perf_for_dashboard() {
local targets=(accuracy performance)
local device=cuda
local taskset=""
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
@ -436,8 +437,8 @@ test_perf_for_dashboard() {
device=cpu_aarch64
fi
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
taskset="taskset -c 0-$end_core"
elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then
device=cuda_a10g
fi
for mode in "${modes[@]}"; do
@ -455,43 +456,43 @@ test_perf_for_dashboard() {
fi
if [[ "$DASHBOARD_TAG" == *default-true* ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs-true* ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *dynamic-true* ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --dynamic-shapes \
--dynamic-batch-only "$@" \
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_CPP_WRAPPER=1 $taskset python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_CPP_WRAPPER=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freeze_autotune_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_ABI_COMPATIBLE=1 $taskset python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *maxautotune-true* ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
@ -499,7 +500,7 @@ test_perf_for_dashboard() {
# TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.
# The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data
# to fill the dashboard.
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv" || true
# Copy cudagraph results as mock data, easiest choice?
@ -547,6 +548,10 @@ test_single_dynamo_benchmark() {
# For CPU device, we prefer non ABI-compatible mode on CI when testing AOTInductor.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
if [[ "${TEST_CONFIG}" == *_avx2* ]]; then
TEST_CONFIG=${TEST_CONFIG::-5}
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
@ -657,12 +662,16 @@ test_inductor_torchbench_smoketest_perf() {
}
test_inductor_get_core_number() {
echo $(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))
if [[ "${TEST_CONFIG}" == *aarch64 ]]; then
echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
else
echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
fi
}
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
@ -670,6 +679,8 @@ test_inductor_set_cpu_affinity(){
export KMP_BLOCKTIME=1
cores=$(test_inductor_get_core_number)
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
}
test_inductor_torchbench_cpu_smoketest_perf(){
@ -677,7 +688,6 @@ test_inductor_torchbench_cpu_smoketest_perf(){
mkdir -p "$TEST_REPORTS_DIR"
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
@ -694,11 +704,11 @@ test_inductor_torchbench_cpu_smoketest_perf(){
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[3]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
$TASKSET python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --"$backend" --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
$TASKSET python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --"$backend" --output "$output_name"
fi
@ -706,6 +716,17 @@ test_inductor_torchbench_cpu_smoketest_perf(){
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
done
# Add a few ABI-compatible accuracy tests for CPU. These can be removed once we turn on ABI-compatible as default.
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only adv_inception_v3 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only beit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/aot_inductor_timm_inference.csv"
}
test_torchbench_gcp_smoketest(){
@ -1019,11 +1040,113 @@ test_xla() {
assert_git_not_dirty
}
function check_public_api_test_fails {
test_name=$1
invalid_item_name=$2
invalid_item_desc=$3
echo "Running public API test '${test_name}'..."
test_output=$(python test/test_public_bindings.py -k "${test_name}" 2>&1) && ret=$? || ret=$?
# Ensure test fails correctly.
if [ "$ret" -eq 0 ]; then
cat << EOF
Expected the public API test '${test_name}' to fail after introducing
${invalid_item_desc}, but it succeeded! Check test/test_public_bindings.py
for any changes that may have broken the test.
EOF
return 1
fi
# Ensure invalid item is in the test output.
echo "${test_output}" | grep -q "${invalid_item_name}" && ret=$? || ret=$?
if [ $ret -ne 0 ]; then
cat << EOF
Expected the public API test '${test_name}' to identify ${invalid_item_desc}, but
it didn't! It's possible the test may not have run. Check test/test_public_bindings.py
for any changes that may have broken the test.
EOF
return 1
fi
echo "Success! '${test_name}' identified ${invalid_item_desc} ${invalid_item_name}."
return 0
}
# Do NOT run this test before any other tests, like test_python_shard, etc.
# Because this function uninstalls the torch built from branch and installs
# the torch built on its base commit.
test_forward_backward_compatibility() {
set -x
# First, validate public API tests in the torch built from branch.
# Step 1. Make sure the public API test "test_correct_module_names" fails when a new file
# introduces an invalid public API function.
new_filename=$(mktemp XXXXXXXX.py -p "${TORCH_INSTALL_DIR}")
BAD_PUBLIC_FUNC=$(
cat << 'EOF'
def new_public_func():
pass
# valid public API functions have __module__ set correctly
new_public_func.__module__ = None
EOF
)
echo "${BAD_PUBLIC_FUNC}" >> "${new_filename}"
invalid_api="torch.$(basename -s '.py' "${new_filename}").new_public_func"
echo "Created an invalid public API function ${invalid_api}..."
check_public_api_test_fails \
"test_correct_module_names" \
"${invalid_api}" \
"an invalid public API function" && ret=$? || ret=$?
rm -v "${new_filename}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Step 2. Make sure that the public API test "test_correct_module_names" fails when an existing
# file is modified to introduce an invalid public API function.
EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/nn/parameter.py"
cp -v "${EXISTING_FILEPATH}" "${EXISTING_FILEPATH}.orig"
echo "${BAD_PUBLIC_FUNC}" >> "${EXISTING_FILEPATH}"
invalid_api="torch.nn.parameter.new_public_func"
echo "Appended an invalid public API function to existing file ${EXISTING_FILEPATH}..."
check_public_api_test_fails \
"test_correct_module_names" \
"${invalid_api}" \
"an invalid public API function" && ret=$? || ret=$?
mv -v "${EXISTING_FILEPATH}.orig" "${EXISTING_FILEPATH}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Step 3. Make sure that the public API test "test_modules_can_be_imported" fails when a module
# cannot be imported.
new_module_dir=$(mktemp XXXXXXXX -d -p "${TORCH_INSTALL_DIR}")
echo "invalid syntax garbage" > "${new_module_dir}/__init__.py"
invalid_module_name="torch.$(basename "${new_module_dir}")"
check_public_api_test_fails \
"test_modules_can_be_imported" \
"${invalid_module_name}" \
"a non-importable module" && ret=$? || ret=$?
rm -rv "${new_module_dir}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Next, build torch from the merge base.
REPO_DIR=$(pwd)
if [[ "${BASE_SHA}" == "${SHA1}" ]]; then
echo "On trunk, we should compare schemas with torch built from the parent commit"
@ -1249,7 +1372,7 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
if [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
test_linux_aarch64
elif [[ "${TEST_CONFIG}" == *backward* ]]; then
test_forward_backward_compatibility
@ -1301,9 +1424,9 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2 yolov3 mobilenet_v2 resnext50_32x4d hf_T5_base
functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
@ -1324,8 +1447,11 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_inductor_aoti
test_inductor_distributed
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.8-gcc11-build ]]; then
# Temporarily skip test_inductor_aoti due to https://github.com/pytorch/pytorch/issues/130311
test_inductor_aoti
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then
install_torchvision

View File

@ -7,7 +7,7 @@ max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
E203,E305,E402,E501,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
@ -55,6 +55,9 @@ per-file-ignores =
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
torch/distributed/_tensor/_collective_utils.py: TOR901
# This is a full package that happen to live within the test
# folder, so ok to skip
test/cpp_extensions/open_registration_extension/pytorch_openreg/__init__.py: TOR901
optional-ascii-coding = True
exclude =
./.git,

View File

@ -14,6 +14,7 @@ self-hosted-runner:
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.arm64.2xlarge
- linux.arm64.m7g.4xlarge
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
@ -36,6 +37,7 @@ self-hosted-runner:
- amz2023.linux.12xlarge
- amz2023.linux.24xlarge
- amz2023.linux.arm64.2xlarge
- amz2023.linux.arm64.m7g.4xlarge
- amz2023.linux.4xlarge.nvidia.gpu
- amz2023.linux.8xlarge.nvidia.gpu
- amz2023.linux.16xlarge.nvidia.gpu

View File

@ -41,6 +41,9 @@ outputs:
ci-verbose-test-logs:
description: True if ci-verbose-test-logs label was on PR or [ci-verbose-test-logs] in PR body.
value: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-test-showlocals:
description: True if ci-test-showlocals label was on PR or [ci-test-showlocals] in PR body.
value: ${{ steps.filter.outputs.ci-test-showlocals }}
ci-no-test-timeout:
description: True if ci-no-test-timeout label was on PR or [ci-no-test-timeout] in PR body.
value: ${{ steps.filter.outputs.ci-no-test-timeout }}

View File

@ -1,226 +0,0 @@
name: linux-build
inputs:
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
default: "true"
description: If set, upload generated build artifacts.
build-with-debug:
required: false
default: "false"
description: If set, build in debug mode.
sync-tag:
required: false
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
default: "5.2"
description: Runner label to select worker type
runner:
required: false
default: "linux.2xlarge"
description: |
List of CUDA architectures CI build should target.
test-matrix:
required: false
type: string
description: |
An option JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
HUGGING_FACE_HUB_TOKEN:
description: Hugging Face Hub token
required: false
default: ""
use_split_build:
description: |
[Experimental] Build a libtorch only wheel and build pytorch such that
are built from the libtorch wheel.
required: false
type: boolean
default: false
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ steps.filter.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
role-duration-seconds: 10800
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
-e USE_SPLIT_BUILD \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
shell: bash
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build != 'true'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts on S3 for split build
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build == 'true'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

View File

@ -167,6 +167,7 @@ runs:
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}

View File

@ -1 +1 @@
69b2a0adc2ec03ab99990d7e8be3d4510438c148
b3f6f511f2a1082bd56b13a3f6794e7fc3ba4862

View File

@ -1,13 +1,23 @@
# Defines runner types that will be provisioned by by LF Self-hosted
# runners for pytorch/pytorch-canary and their labels.
# This file is generated by .github/scripts/validate_scale_config.py in test-infra
# It defines runner types that will be provisioned by by LF Self-hosted runners
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# Runners listed here will be available as self hosted runners.
# Configuration is directly pulled from the main branch.
# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2
#
# Default values:
# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label: # label to specify in the Github Actions workflow
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
@ -21,17 +31,29 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
lf.c.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
lf.c.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
lf.c.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
lf.c.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
os: linux
lf.c.linux.12xlarge.ephemeral:
disk_size: 200
@ -43,7 +65,7 @@ runner_types:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
lf.c.linux.24xlarge:
disk_size: 150
@ -67,7 +89,7 @@ runner_types:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
max_available: 1000
os: linux
lf.c.linux.8xlarge.nvidia.gpu:
disk_size: 150
@ -79,19 +101,19 @@ runner_types:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
max_available: 250
os: linux
lf.c.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
max_available: 300
os: linux
lf.c.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
lf.c.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
@ -103,9 +125,16 @@ runner_types:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
max_available: 2400
os: linux
lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
os: linux
lf.c.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
@ -116,11 +145,17 @@ runner_types:
is_ephemeral: false
max_available: 200
os: linux
lf.c.linux.arm64.m7g.2xlarge:
lf.c.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.2xlarge
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
lf.c.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
lf.c.windows.4xlarge:
disk_size: 256
@ -138,7 +173,7 @@ runner_types:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 150
max_available: 300
os: windows
lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
@ -161,18 +196,32 @@ runner_types:
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.12xlarge.ephemeral:
@ -186,7 +235,7 @@ runner_types:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xlarge:
@ -214,7 +263,7 @@ runner_types:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.8xlarge.nvidia.gpu:
@ -228,21 +277,21 @@ runner_types:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.12xlarge.nvidia.gpu:
@ -256,10 +305,18 @@ runner_types:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
@ -271,11 +328,18 @@ runner_types:
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.arm64.m7g.2xlarge:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.amz2023.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.2xlarge
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.amz2023.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

View File

@ -1,13 +1,23 @@
# Defines runner types that will be provisioned by by LF Self-hosted
# runners for pytorch/pytorch and their labels.
# This file is generated by .github/scripts/validate_scale_config.py in test-infra
# It defines runner types that will be provisioned by by LF Self-hosted runners
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# Runners listed here will be available as self hosted runners.
# Configuration is directly pulled from the main branch.
# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2
#
# Default values:
# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label: # label to specify in the Github Actions workflow
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
@ -21,17 +31,29 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
lf.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
lf.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
lf.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
lf.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
os: linux
lf.linux.12xlarge.ephemeral:
disk_size: 200
@ -43,7 +65,7 @@ runner_types:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
lf.linux.24xlarge:
disk_size: 150
@ -67,7 +89,7 @@ runner_types:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
max_available: 1000
os: linux
lf.linux.8xlarge.nvidia.gpu:
disk_size: 150
@ -79,19 +101,19 @@ runner_types:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
max_available: 250
os: linux
lf.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
max_available: 300
os: linux
lf.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
lf.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
@ -103,9 +125,16 @@ runner_types:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
max_available: 2400
os: linux
lf.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
os: linux
lf.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
@ -116,11 +145,17 @@ runner_types:
is_ephemeral: false
max_available: 200
os: linux
lf.linux.arm64.m7g.2xlarge:
lf.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.2xlarge
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
lf.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
lf.windows.4xlarge:
disk_size: 256
@ -138,7 +173,7 @@ runner_types:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 150
max_available: 300
os: windows
lf.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
@ -161,18 +196,32 @@ runner_types:
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.12xlarge.ephemeral:
@ -186,7 +235,7 @@ runner_types:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xlarge:
@ -214,7 +263,7 @@ runner_types:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.8xlarge.nvidia.gpu:
@ -228,21 +277,21 @@ runner_types:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.12xlarge.nvidia.gpu:
@ -256,10 +305,18 @@ runner_types:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
@ -271,11 +328,18 @@ runner_types:
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.arm64.m7g.2xlarge:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.amz2023.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.2xlarge
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.amz2023.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

View File

@ -523,6 +523,13 @@
- Skylion007
- ngimel
- peterbell10
- eqy
- jansel
- jeffdaily
- eellison
- anijain2305
- bdhirsh
- zou3519
mandatory_checks_name:
- EasyCLA
- Lint
@ -537,6 +544,8 @@
- ezyang
- dzhulgakov
- malfet
- albanD
- ptrblck
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -505,6 +505,9 @@ def perform_misc_tasks(
"ci-verbose-test-logs",
check_for_setting(labels, pr_body, "ci-verbose-test-logs"),
)
set_output(
"ci-test-showlocals", check_for_setting(labels, pr_body, "ci-test-showlocals")
)
set_output(
"ci-no-test-timeout", check_for_setting(labels, pr_body, "ci-no-test-timeout")
)

View File

@ -215,7 +215,7 @@ LIBTORCH_CONTAINER_IMAGES: Dict[Tuple[str, str], str] = {
("cpu", CXX11_ABI): f"pytorch/libtorch-cxx11-builder:cpu-{DEFAULT_TAG}",
}
FULL_PYTHON_VERSIONS = ["3.8", "3.9", "3.10", "3.11", "3.12"]
FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12"]
def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:

View File

@ -683,6 +683,7 @@ class TestConfigFilter(TestCase):
def _gen_expected_string(
keep_going: bool = False,
ci_verbose_test_logs: bool = False,
ci_test_showlocals: bool = False,
ci_no_test_timeout: bool = False,
ci_no_td: bool = False,
ci_td_distributed: bool = False,
@ -692,6 +693,7 @@ class TestConfigFilter(TestCase):
return (
f"keep-going={keep_going}\n"
f"ci-verbose-test-logs={ci_verbose_test_logs}\n"
f"ci-test-showlocals={ci_test_showlocals}\n"
f"ci-no-test-timeout={ci_no_test_timeout}\n"
f"ci-no-td={ci_no_td}\n"
f"ci-td-distributed={ci_td_distributed}\n"
@ -733,6 +735,21 @@ class TestConfigFilter(TestCase):
),
"description": "No pipe logs label and no test timeout in PR body",
},
{
"labels": {"ci-test-showlocals"},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"expected": _gen_expected_string(ci_test_showlocals=True),
"description": "Has ci-test-showlocals",
},
{
"labels": {},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"pr_body": "[ci-test-showlocals]",
"expected": _gen_expected_string(ci_test_showlocals=True),
"description": "ci-test-showlocals in body",
},
{
"labels": {"ci-no-test-timeout"},
"test_matrix": '{include: [{config: "default"}]}',

View File

@ -43,6 +43,10 @@ on:
required: false
type: string
default: ""
runner_prefix:
description: prefix for runner label
type: string
default: ""
secrets:
GH_PYTORCHBOT_TOKEN:
required: false
@ -63,16 +67,16 @@ jobs:
# an OOM issue when running the job, so this upgrades the runner from 4xlarge
# to the next available tier of 12xlarge. So much memory just to generate cpp
# doc
runner: linux.12xlarge
runner: ${{ inputs.runner_prefix }}linux.12xlarge
# TODO: Nightly cpp docs take longer and longer to finish (more than 3h now)
# Let's try to figure out how this can be improved
timeout-minutes: 240
- docs_type: python
runner: linux.2xlarge
runner: ${{ inputs.runner_prefix }}linux.2xlarge
# It takes less than 30m to finish python docs unless there are issues
timeout-minutes: 30
- docs_type: functorch
runner: linux.2xlarge
runner: ${{ inputs.runner_prefix }}linux.2xlarge
# It takes less than 15m to finish functorch docs unless there are issues
timeout-minutes: 15
# Set a fixed name for this job instead of using the current matrix-generated name, i.e. build-docs (cpp, linux.12xlarge, 180)

View File

@ -1,117 +0,0 @@
name: linux-build
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
type: string
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
type: boolean
default: true
description: If set, upload generated build artifacts.
build-with-debug:
required: false
type: boolean
default: false
description: If set, build in debug mode.
sync-tag:
required: false
type: string
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
type: string
default: "5.2"
description: Runner label to select worker type
runner:
required: false
type: string
default: "linux.2xlarge"
description: |
List of CUDA architectures CI build should target.
test-matrix:
required: false
type: string
description: |
An option JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
type: string
default: ""
use_split_build:
description: |
[Experimental] Build a libtorch only wheel and build pytorch such that
are built from the libtorch wheel.
required: false
type: boolean
default: false
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
outputs:
docker-image:
value: ${{ jobs.build.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ jobs.build.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
jobs:
build:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: ${{ inputs.runner }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.linux-build.outputs.docker-image }}
test-matrix: ${{ steps.linux-build.outputs.test-matrix }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [pytorch repo ref]
# Use a pytorch/pytorch reference instead of a reference to the local
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
- name: Linux Build
id: linux-build
uses: ./.github/actions/linux-build
with:
build-environment: ${{ inputs.build-environment }}
docker-image-name: ${{ inputs.docker-image-name }}
build-generates-artifacts: ${{ inputs.build-generates-artifacts }}
build-with-debug: ${{ inputs.build-with-debug }}
sync-tag: ${{ inputs.sync-tag }}
cuda-arch-list: ${{ inputs.cuda-arch-list }}
test-matrix: ${{ inputs.test-matrix }}
s3-bucket: ${{ inputs.s3-bucket }}
aws-role-to-assume: ${{ inputs.aws-role-to-assume }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
use_split_build: ${{ inputs.use_split_build }}

View File

@ -198,6 +198,7 @@ jobs:
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
@ -251,6 +252,7 @@ jobs:
-e REENABLED_ISSUES \
-e CONTINUE_THROUGH_ERROR \
-e VERBOSE_TEST_LOGS \
-e TEST_SHOWLOCALS \
-e NO_TEST_TIMEOUT \
-e NO_TD \
-e TD_DISTRIBUTED \

View File

@ -35,6 +35,7 @@ jobs:
is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}
keep-going: ${{ steps.filter.outputs.keep-going }}
ci-verbose-test-logs: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-test-showlocals: ${{ steps.filter.outputs.ci-test-showlocals }}
ci-no-test-timeout: ${{ steps.filter.outputs.ci-no-test-timeout }}
ci-no-td: ${{ steps.filter.outputs.ci-no-td }}
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
@ -98,6 +99,7 @@ jobs:
PR_BODY: ${{ github.event.pull_request.body }}
CONTINUE_THROUGH_ERROR: ${{ needs.filter.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ needs.filter.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ needs.filter.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ needs.filter.outputs.ci-no-test-timeout }}
NO_TD: ${{ needs.filter.outputs.ci-no-td }}
PIP_REQUIREMENTS_FILE: .github/requirements/pip-requirements-${{ runner.os }}.txt

View File

@ -144,6 +144,7 @@ jobs:
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
PIP_REQUIREMENTS_FILE: .github/requirements/pip-requirements-${{ runner.os }}.txt

View File

@ -154,6 +154,7 @@ jobs:
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TEST_CONFIG: ${{ matrix.config }}
@ -205,6 +206,7 @@ jobs:
-e REENABLED_ISSUES \
-e CONTINUE_THROUGH_ERROR \
-e VERBOSE_TEST_LOGS \
-e TEST_SHOWLOCALS \
-e NO_TEST_TIMEOUT \
-e NO_TD \
-e MAX_JOBS="$(nproc --ignore=2)" \

View File

@ -157,6 +157,7 @@ jobs:
PYTHON_VERSION: 3.8
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
VC_PRODUCT: "BuildTools"

View File

@ -143,6 +143,7 @@ jobs:
PYTORCH_RETRY_TEST_CASES: 1
PYTORCH_OVERRIDE_FLAKY_SIGNAL: 1
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
@ -189,6 +190,7 @@ jobs:
-e PYTORCH_OVERRIDE_FLAKY_SIGNAL \
-e CONTINUE_THROUGH_ERROR \
-e VERBOSE_TEST_LOGS \
-e TEST_SHOWLOCALS \
-e NO_TEST_TIMEOUT \
-e NO_TD \
-e MAX_JOBS="$(nproc --ignore=2)" \

View File

@ -66,7 +66,8 @@ jobs:
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11
runner: linux.arm64.2xlarge
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
runner: linux.arm64.2xlarge
runner: linux.arm64.m7g.4xlarge
timeout-minutes: 600
runs-on: [self-hosted, "${{ matrix.runner }}"]
env:
DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }}
@@ -123,7 +124,7 @@ jobs:
- name: Chown workspace
uses: ./.github/actions/chown-workspace
with:
ALPINE_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/${{ (matrix.runner == 'linux.arm64.2xlarge') && 'arm64v8' || 'tool' }}/alpine
ALPINE_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/${{ contains(matrix.runner, 'arm64') && 'arm64v8' || 'tool' }}/alpine
if: always()
- name: Teardown Linux


@@ -37,114 +37,6 @@ concurrency:
cancel-in-progress: true
jobs:
manywheel-py3_8-cpu-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.8"
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_8-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-aarch64-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cpu-aarch64-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-aarch64-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cpu-aarch64-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-aarch64
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
runs_on: linux.arm64.m7g.4xlarge
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_8-cuda-aarch64
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda-aarch64-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda-aarch64-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.4-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda-aarch64
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cpu-aarch64-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml


@@ -37,254 +37,6 @@ concurrency:
cancel-in-progress: true
jobs:
conda-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/conda-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cpu
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cpu-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/conda-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cpu
build_environment: linux-binary-conda
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/conda-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_8-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda11.8-main
DESIRED_PYTHON: "3.8"
runs_on: linux.24xlarge
build_name: conda-py3_8-cuda11_8
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda11_8-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda11.8-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda11_8
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda11_8-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda11.8-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_8-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.1-main
DESIRED_PYTHON: "3.8"
runs_on: linux.24xlarge
build_name: conda-py3_8-cuda12_1
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda12_1-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.1-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_1
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda12_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.1-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
runs_on: linux.24xlarge
build_name: conda-py3_8-cuda12_4
build_environment: linux-binary-conda
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_4
build_environment: linux-binary-conda
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-py3_8-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/conda-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml


@@ -37,832 +37,6 @@ concurrency:
cancel-in-progress: true
jobs:
manywheel-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cpu-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cpu-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-cxx11-abi
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cpu-cxx11-abi-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-cxx11-abi
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cpu-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda11_8-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda11_8-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda11_8-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda11_8-split-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8-split
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda11_8-split-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda11_8-split-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda11_8-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda12_1-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda12_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda12_1-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda12_1-split-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1-split
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_1-split-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda12_1-split-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_1-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda12_4-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-cuda12_4-split-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4-split
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-split-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cuda12_4-split-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4-split
build_environment: linux-binary-manywheel
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cuda12_4-split-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cuda12_4-split-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cuda12_4-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-rocm6_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-rocm6_0
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-rocm6_0-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-rocm6_0-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.8"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: manywheel-py3_8-rocm6_0
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.0-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
manywheel-py3_8-rocm6_0-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-rocm6_0-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.0
GPU_ARCH_VERSION: 6.0
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.0-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-rocm6_0
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-rocm6_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-rocm6_1
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-rocm6_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-rocm6_1-build
runs-on: linux.rocm.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.8"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: manywheel-py3_8-rocm6_1
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux-builder:rocm6.1-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
manywheel-py3_8-rocm6_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-rocm6_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.1
GPU_ARCH_VERSION: 6.1
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: pytorch/manylinux-builder:rocm6.1-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-rocm6_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_8-xpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: xpu
GPU_ARCH_TYPE: xpu
DOCKER_IMAGE: pytorch/manylinux2_28-builder:xpu-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-xpu
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-xpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-xpu-build
runs-on: linux.idc.xpu
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: xpu
GPU_ARCH_TYPE: xpu
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: pytorch/manylinux2_28-builder:xpu-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
permissions:
id-token: write
contents: read
steps:
- name: Setup XPU
uses: ./.github/actions/setup-xpu
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@v1.7.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: manywheel-py3_8-xpu
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: pytorch/manylinux2_28-builder:xpu-main
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown XPU
uses: ./.github/actions/teardown-xpu
manywheel-py3_8-xpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-xpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: xpu
GPU_ARCH_TYPE: xpu
DOCKER_IMAGE: pytorch/manylinux2_28-builder:xpu-main
DESIRED_DEVTOOLSET: cxx11-abi
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-xpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml


@@ -37,69 +37,6 @@ concurrency:
cancel-in-progress: true
jobs:
manywheel-py3_8-cpu-s390x-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.8"
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
build_name: manywheel-py3_8-cpu-s390x
build_environment: linux-s390x-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-s390x-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: manywheel-py3_8-cpu-s390x-build
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-s390x
build_environment: linux-s390x-binary-manywheel
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_8-cpu-s390x-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_8-cpu-s390x-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-s390x
DOCKER_IMAGE: pytorch/manylinuxs390x-builder:cpu-s390x-main
DESIRED_PYTHON: "3.8"
build_name: manywheel-py3_8-cpu-s390x
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cpu-s390x-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml


@@ -32,124 +32,6 @@ concurrency:
cancel-in-progress: true
jobs:
conda-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
# shellcheck disable=SC2129
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
if [ -d "/Applications/Xcode_14.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_14.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@v2.8.2
if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
with:
timeout_minutes: 5
max_attempts: 3
retry_wait_seconds: 90
command: |
sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: conda-py3_8-cpu
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
conda-py3_8-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cpu-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/conda-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cpu
use_s3: False
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: macos-14-xlarge


@@ -32,125 +32,6 @@ concurrency:
cancel-in-progress: true
jobs:
wheel-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: macos-14-xlarge
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
# shellcheck disable=SC2129
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# shellcheck disable=SC2129
echo "MAC_PACKAGE_WORK_DIR=${RUNNER_TEMP}" >> "${GITHUB_ENV}"
- name: Install conda and dependencies
run: |
# Install conda, setup-miniconda messes with the path that messes with the ruby stuff we do later on
curl --retry 3 --retry-all-errors -o "${RUNNER_TEMP}/conda.sh" "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-MacOSX-$(uname -m).sh"
chmod +x "${RUNNER_TEMP}/conda.sh"
/bin/bash "${RUNNER_TEMP}/conda.sh" -b -p "${RUNNER_TEMP}/anaconda"
echo "${RUNNER_TEMP}/anaconda/bin" >> "${GITHUB_PATH}"
if [ -d "/Applications/Xcode_14.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_14.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@v2.8.2
if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
with:
timeout_minutes: 5
max_attempts: 3
retry_wait_seconds: 90
command: |
sudo curl --retry 3 --retry-all-errors https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output /usr/local/bin/sccache
sudo chmod +x /usr/local/bin/sccache
echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"
- name: Populate binary env
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
run: |
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: wheel-py3_8-cpu
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
wheel-py3_8-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: wheel-py3_8-cpu-build
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
DESIRED_PYTHON: "3.8"
build_name: wheel-py3_8-cpu
use_s3: False
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: macos-14-xlarge
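
The macOS wheel jobs above protect their downloads twice: curl fetches the Miniconda installer with --retry 3 --retry-all-errors, and the sccache download is additionally wrapped in nick-fields/retry with max_attempts: 3 and retry_wait_seconds: 90. Below is a minimal bash sketch of the same retry-with-wait pattern, independent of the action; the retry_with_wait name and the local ./sccache destination are illustrative choices, not part of the workflow.

    # Retry a command up to $1 times, sleeping $2 seconds between attempts.
    retry_with_wait() {
      local attempts=$1 wait_seconds=$2
      shift 2
      local n=1
      until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
          echo "failed after ${attempts} attempts: $*" >&2
          return 1
        fi
        echo "attempt ${n} failed; retrying in ${wait_seconds}s" >&2
        sleep "$wait_seconds"
        n=$((n + 1))
      done
    }

    # Mirrors the sccache step above: 3 attempts, 90 seconds between tries.
    retry_with_wait 3 90 curl --retry 3 --retry-all-errors -fsSL \
      https://s3.amazonaws.com/ossci-macos/sccache_v2.15 --output ./sccache
    chmod +x ./sccache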


@@ -32,983 +32,6 @@ concurrency:
cancel-in-progress: true
jobs:
conda-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: conda-py3_8-cpu
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cpu-build
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: conda-py3_8-cpu
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cpu-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
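
Every job above repeats the small "Populate binary env" step because, as the NOTE comments explain, runner.temp is only resolvable inside a job, not at the workflow level. The step works by appending KEY=VALUE lines to the file that the GITHUB_ENV variable points at; GitHub Actions then exports those keys to all later steps of the same job. A short bash sketch of the mechanism, using the same variable names as the steps above (the mkdir line is only for illustration):

    # In an early step of a job: stage variables for the rest of the job.
    echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
    echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"

    # In any later step of the same job: the values are plain environment variables.
    mkdir -p "${PYTORCH_FINAL_PACKAGE_DIR}"
    echo "artifacts will be staged in ${PYTORCH_FINAL_PACKAGE_DIR}"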
conda-py3_8-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: conda-py3_8-cuda11_8
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda11_8-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: conda-py3_8-cuda11_8
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda11_8-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
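
The "Display EC2 information" step that opens each Windows job above reads the EC2 instance metadata endpoint at 169.254.169.254 to log the AMI, instance id and instance type of the runner. The sketch below is the same helper made standalone, with an optional IMDSv2 session token added as a hardening step; the token handling is an assumption on top of the workflow, which issues the plain IMDSv1 GET.

    set -euo pipefail

    get_ec2_metadata() {
      # Fetch one metadata category; the token lines implement IMDSv2 and can be
      # dropped to match the workflow's simpler IMDSv1 call.
      local category=$1
      local token
      token=$(curl -fsSL -X PUT "http://169.254.169.254/latest/api/token" \
                -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
      curl -fsSL -H "X-aws-ec2-metadata-token: ${token}" \
        "http://169.254.169.254/latest/meta-data/${category}"
    }

    echo "ami-id:        $(get_ec2_metadata ami-id)"
    echo "instance-id:   $(get_ec2_metadata instance-id)"
    echo "instance-type: $(get_ec2_metadata instance-type)"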
conda-py3_8-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: conda-py3_8-cuda12_1
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda12_1-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: conda-py3_8-cuda12_1
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda12_1-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
conda-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: conda-py3_8-cuda12_4
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: conda-py3_8-cuda12_4-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: conda-py3_8-cuda12_4
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
conda-py3_8-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: conda-py3_8-cuda12_4-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: conda
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DESIRED_PYTHON: "3.8"
build_name: conda-py3_8-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
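
Because the Windows runners above are nonephemeral (windows.4xlarge.nonephemeral / windows.8xlarge.nvidia.gpu), both the pytorch and builder checkouts are scrubbed with git clean -fxd before every build or test. As a quick reference for those flags (the directory name below is a placeholder):

    cd pytorch          # or builder: any checkout reused from a previous job
    git clean -fxd      # -f force removal, -x include ignored files (build
                        #    output, caches), -d recurse into untracked dirs
    git status --short  # prints nothing once the working tree is fully clean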
conda-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral


@@ -32,987 +32,6 @@ concurrency:
cancel-in-progress: true
jobs:
wheel-py3_8-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: wheel-py3_8-cpu
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: wheel-py3_8-cpu-build
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: wheel-py3_8-cpu
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: wheel-py3_8-cpu-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DESIRED_PYTHON: "3.8"
build_name: wheel-py3_8-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
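
PYTORCH_EXTRA_INSTALL_REQUIREMENTS in the wheel build jobs is a '|'-separated list of pinned NVIDIA wheels, each carrying a PEP 508 environment marker so the dependency only applies on Linux x86_64 even when the metadata is produced on another platform. The markers can be exercised directly with pip; how the build scripts split and consume the variable is not shown here, so the tr pipeline below is just one plausible way to turn it into one requirement per line.

    # Installs only where the marker evaluates to true; elsewhere pip skips it.
    pip install "nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64'"

    # Illustrative only: split the '|'-separated variable into individual specs.
    echo "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS}" | tr '|' '\n' | sed 's/^ *//'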
wheel-py3_8-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: wheel-py3_8-cuda11_8
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: wheel-py3_8-cuda11_8-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: wheel-py3_8-cuda11_8
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: wheel-py3_8-cuda11_8-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DESIRED_PYTHON: "3.8"
build_name: wheel-py3_8-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
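
Each flavour above is a three-job chain: the build job publishes the finished package with actions/upload-artifact under a fixed name (e.g. wheel-py3_8-cuda11_8), the test job declares needs: on the build job and pulls the identical name back down with actions/download-artifact, and only then does the upload job run. The same artifact can be fetched locally with the GitHub CLI for debugging; the run id below is a placeholder.

    # Download the named artifact from a finished workflow run (run id is a placeholder).
    gh run download 1234567890 \
      --repo pytorch/pytorch \
      -n wheel-py3_8-cuda11_8 \
      -D ./artifacts

    ls ./artifacts   # the wheel staged by the build job's upload-artifact step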
wheel-py3_8-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even if the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: wheel-py3_8-cuda12_1
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: wheel-py3_8-cuda12_1-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even if the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: wheel-py3_8-cuda12_1
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: wheel-py3_8-cuda12_1-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DESIRED_PYTHON: "3.8"
build_name: wheel-py3_8-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_8-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even if the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v3
if: always()
with:
name: wheel-py3_8-cuda12_4
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: wheel-py3_8-cuda12_4-build
runs-on: windows.8xlarge.nvidia.gpu
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.8"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even if the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v3
name: Download Build Artifacts
with:
name: wheel-py3_8-cuda12_4
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Checkout PyTorch
uses: malfet/checkout@silent-checkout
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
quiet-checkout: true
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Checkout pytorch/builder
uses: malfet/checkout@silent-checkout
with:
ref: main
submodules: recursive
repository: pytorch/builder
path: builder
quiet-checkout: true
- name: Clean pytorch/builder checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: builder
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
wheel-py3_8-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: wheel-py3_8-cuda12_4-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
BUILDER_ROOT: ${{ github.workspace }}/builder
PACKAGE_TYPE: wheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DESIRED_PYTHON: "3.8"
build_name: wheel-py3_8-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
wheel-py3_9-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral

View File

@ -0,0 +1,125 @@
name: inductor-perf-nightly-A10g
on:
schedule:
# - cron: 0 7 * * 1-6
# - cron: 0 7 * * 0
# Do not perform weekly max-autotune run for now.
- cron: 0 7 * * *
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let's try to run torchao cudagraphs_low_precision as part of cudagraphs
workflow_dispatch:
inputs:
training:
description: Run training (on by default)?
required: false
type: boolean
default: true
inference:
description: Run inference (off by default)?
required: false
type: boolean
default: false
default:
description: Run inductor_default?
required: false
type: boolean
default: false
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: false
cudagraphs:
description: Run inductor_cudagraphs?
required: false
type: boolean
default: true
freezing_cudagraphs:
description: Run inductor_cudagraphs with freezing for inference?
required: false
type: boolean
default: false
freeze_autotune_cudagraphs:
description: Run inductor_cudagraphs with freezing and max autotune for inference?
required: false
type: boolean
default: false
aotinductor:
description: Run aot_inductor for inference?
required: false
type: boolean
default: false
maxautotune:
description: Run inductor_max_autotune?
required: false
type: boolean
default: false
benchmark_configs:
description: The list of configs used by the benchmark
required: false
type: string
default: inductor_huggingface_perf_cuda_a10g,inductor_timm_perf_cuda_a10g,inductor_torchbench_perf_cuda_a10g
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_cuda_a10g", shard: 1, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_huggingface_perf_cuda_a10g", shard: 2, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_huggingface_perf_cuda_a10g", shard: 3, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm_perf_cuda_a10g", shard: 1, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm_perf_cuda_a10g", shard: 2, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm_perf_cuda_a10g", shard: 3, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm_perf_cuda_a10g", shard: 4, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm_perf_cuda_a10g", shard: 5, num_shards: 5, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench_perf_cuda_a10g", shard: 1, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench_perf_cuda_a10g", shard: 2, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench_perf_cuda_a10g", shard: 3, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench_perf_cuda_a10g", shard: 4, num_shards: 4, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-test-nightly:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-test:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cudagraphs-${{ inputs.cudagraphs }}-aotinductor-${{ inputs.aotinductor }}-maxautotune-${{ inputs.maxautotune }}-freezing_cudagraphs-${{ inputs.freezing_cudagraphs }}-cudagraphs_low_precision-${{ inputs.cudagraphs }}
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
timeout-minutes: 720
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@ -52,6 +52,7 @@ jobs:
name: linux-jammy-aarch64-py3.10-inductor
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.arm64.m7g.4xlarge
build-environment: linux-jammy-aarch64-py3.10
docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
test-matrix: |

View File

@ -7,6 +7,8 @@ on:
- release/*
tags:
- ciflow/inductor/*
schedule:
- cron: 29 8 * * * # about 1:29am PDT, for mem leak check and rerun disabled tests
workflow_dispatch:
concurrency:
@ -164,6 +166,13 @@ jobs:
{ config: "cpu_aot_inductor_torchbench_freezing", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_torchbench_freezing", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.24xl.spr-metal" },
{ config: "inductor_avx2", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "inductor_avx2", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_huggingface_freezing_avx2", shard: 1, num_shards: 1, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_torchbench_freezing_avx2", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_torchbench_freezing_avx2", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_timm_freezing_avx2", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_timm_freezing_avx2", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@ -29,6 +29,7 @@ jobs:
{ include: [
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-13" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m1-14" },
{ config: "mps", shard: 1, num_shards: 1, runner: "macos-m2-15" },
]}
macos-py3-arm64-mps-test:

View File

@ -21,6 +21,7 @@ jobs:
name: docs build
uses: ./.github/workflows/_linux-build.yml
with:
runner: "amz2023.linux.2xlarge"
build-environment: linux-jammy-py3.8-gcc11
docker-image-name: pytorch-linux-jammy-py3.8-gcc11
@ -29,6 +30,7 @@ jobs:
uses: ./.github/workflows/_docs.yml
needs: docs-build
with:
runner_prefix: "amz2023."
build-environment: linux-jammy-py3.8-gcc11
docker-image: ${{ needs.docs-build.outputs.docker-image }}
push: ${{ github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' || startsWith(github.event.ref, 'refs/tags/v') }}

View File

@ -80,6 +80,7 @@ jobs:
uses: ./.github/workflows/_docs.yml
needs: linux-jammy-py3_8-gcc11-build
with:
runner_prefix: amz2023.
build-environment: linux-jammy-py3.8-gcc11
docker-image: ${{ needs.linux-jammy-py3_8-gcc11-build.outputs.docker-image }}
@ -543,36 +544,6 @@ jobs:
docker-image: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3-clang12-executorch-build.outputs.test-matrix }}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.2xlarge"
use_split_build: true
build-environment: linux-focal-cuda12.1-py3.10-gcc9
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build-test:
name: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build
- target-determination
with:
timeout-minutes: 360
build-environment: linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}
linux-focal-py3_12-clang10-experimental-split-build:
name: linux-focal-py3.12-clang10-experimental-split-build
uses: ./.github/workflows/_linux-build.yml

View File

@ -294,9 +294,9 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}amz2023.linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.nvidia.gpu" },
]}
linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build-test:

View File

@ -2,7 +2,7 @@ name: Upload torch dynamo performance stats
on:
workflow_run:
workflows: [inductor-A100-perf-nightly, inductor-perf-nightly-aarch64, inductor-perf-nightly-x86]
workflows: [inductor-A100-perf-nightly, inductor-perf-nightly-A10g, inductor-perf-nightly-aarch64, inductor-perf-nightly-x86]
types:
- completed

View File

@ -15,6 +15,7 @@ exclude_patterns = [
'functorch/examples/**',
'functorch/notebooks/**',
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'scripts/**',
'test/generated_type_hints_smoketest.py',
# Tests from the NumPy test suite
@ -196,6 +197,8 @@ include_patterns = [
'aten/src/ATen/*.cpp',
'aten/src/ATen/core/*.h',
'aten/src/ATen/core/*.cpp',
'aten/src/ATen/cudnn/*.h',
'aten/src/ATen/cudnn/*.cpp',
'aten/src/ATen/detail/*',
'aten/src/ATen/functorch/*.h',
'aten/src/ATen/functorch/*.cpp',
@ -233,7 +236,7 @@ exclude_patterns = [
'torch/csrc/autograd/generated/**',
'torch/csrc/distributed/**/*',
'torch/csrc/dynamo/eval_frame.h',
'torch/csrc/inductor/**/*',
'torch/csrc/inductor/aoti_torch/c/shim.h',
'torch/csrc/jit/**/*',
'torch/csrc/jit/serialization/mobile_bytecode_generated.h',
'torch/csrc/lazy/**/*',
@ -940,6 +943,25 @@ command = [
'@{{PATHSFILE}}'
]
[[linter]]
code = 'CONTEXT_DECORATOR'
include_patterns = [
'torch/**',
]
command = [
'python3',
'tools/linter/adapters/grep_linter.py',
'--pattern=@.*(dynamo_timed)',
'--linter-name=CONTEXT_DECORATOR',
'--error-name=avoid context decorator',
"""--error-description=\
Do not use context manager as decorator as it breaks cProfile traces. Use it as \
a context manager instead\
""",
'--',
'@{{PATHSFILE}}'
]
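For context, here is a minimal, hypothetical sketch of the pattern this new CONTEXT_DECORATOR linter greps for: applying a `dynamo_timed`-style context manager as a decorator, versus using it in a `with` block as the error message recommends. The `dynamo_timed` below is a toy stand-in, not the real PyTorch helper.
```python
# Illustrative only: a toy stand-in for a dynamo_timed-like context manager,
# showing the usage the linter flags and the usage it recommends.
import time
from contextlib import contextmanager


@contextmanager
def dynamo_timed(name):  # assumed stand-in, not the real torch._dynamo helper
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{name} took {time.perf_counter() - start:.6f}s")


# Flagged by the linter: the context manager is applied as a decorator,
# the pattern the error description says breaks cProfile traces.
@dynamo_timed("compile_step")
def compile_step_decorated():
    time.sleep(0.01)


# Recommended: use it as a context manager inside the function body.
def compile_step():
    with dynamo_timed("compile_step"):
        time.sleep(0.01)
```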
[[linter]]
code = 'ONCE_FLAG'
include_patterns = [
@ -986,16 +1008,16 @@ init_command = [
'PyYAML==6.0.1',
]
# Black + usort
# usort + ruff-format
[[linter]]
code = 'UFMT'
code = 'PYFMT'
include_patterns = [
'**/*.py',
'**/*.pyi',
]
command = [
'python3',
'tools/linter/adapters/ufmt_linter.py',
'tools/linter/adapters/pyfmt_linter.py',
'--',
'@{{PATHSFILE}}'
]
@ -1010,6 +1032,7 @@ exclude_patterns = [
'third_party/**/*.py',
'third_party/**/*.pyi',
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
# These files are all grandfathered in, feel free to remove from this list
# as necessary
'test/_nvfuser/__init__.py',
@ -1452,9 +1475,9 @@ init_command = [
'--dry-run={{DRYRUN}}',
'--no-black-binary',
'black==23.12.1',
'ufmt==2.7.0',
'usort==1.0.8.post1',
'isort==5.13.2',
'ruff==0.5.5', # sync with RUFF
]
is_formatter = true
@ -1521,6 +1544,7 @@ exclude_patterns = [
'functorch/docs/**',
'functorch/notebooks/**',
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'scripts/**',
'third_party/**',
'fb/**',
@ -1538,7 +1562,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.5.2',
'ruff==0.5.5', # sync with PYFMT
]
is_formatter = true

View File

@ -1,6 +1,5 @@
{
"recommendations": [
"ms-python.python",
"omnilib.ufmt"
]
}

View File

@ -4,14 +4,12 @@
},
"files.associations": {
"*.py.in": "python",
"*.pyi.in": "python",
"editor.defaultFormatter": "omnilib.ufmt"
"*.pyi.in": "python"
},
"files.eol": "\n",
"files.insertFinalNewline": true,
"files.trimFinalNewlines": true,
"files.trimTrailingWhitespace": true,
"python.formatting.provider": "none",
"python.linting.enabled": true,
"python.linting.flake8Enabled": true
}

View File

@ -413,7 +413,6 @@ cc_library(
"@cuda//:nvrtc",
"@cudnn",
"@cudnn_frontend",
"@cuda//:cufile",
],
alwayslink = True,
)

View File

@ -251,15 +251,6 @@ cmake_dependent_option(USE_CUDNN "Use cuDNN" ON "USE_CUDA" OFF)
cmake_dependent_option(USE_STATIC_CUDNN "Use cuDNN static libraries" OFF
"USE_CUDNN" OFF)
cmake_dependent_option(USE_CUSPARSELT "Use cuSPARSELt" ON "USE_CUDA" OFF)
# Binary builds will fail for cufile due to https://github.com/pytorch/builder/issues/1924
# Using TH_BINARY_BUILD to check whether this is a binary build.
# USE_ROCM is guarded against in Dependencies.cmake because USE_ROCM is not properly defined here
if(DEFINED ENV{TH_BINARY_BUILD})
cmake_dependent_option(USE_CUFILE "Use cuFile" ON
"USE_CUDA AND NOT $ENV{TH_BINARY_BUILD} AND NOT WIN32" OFF)
else()
cmake_dependent_option(USE_CUFILE "Use cuFile" ON "USE_CUDA AND NOT WIN32" OFF)
endif()
option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON)
option(USE_KINETO "Use Kineto profiling library" ON)
option(USE_CUPTI_SO "Use CUPTI as a shared library" ON)
@ -547,8 +538,14 @@ option(BUILD_EXECUTORCH "Master flag to build Executorch" ON)
if(LINUX)
set(CMAKE_SHARED_LINKER_FLAGS
"${CMAKE_SHARED_LINKER_FLAGS} -Wl,--no-as-needed")
set(CMAKE_SHARED_LINKER_FLAGS
"${CMAKE_SHARED_LINKER_FLAGS} $ENV{LDFLAGS}")
set(ENV_LDFLAGS "$ENV{LDFLAGS}")
string(STRIP "${ENV_LDFLAGS}" ENV_LDFLAGS)
# Do not append linker flags passed via env var if they are already there
if(NOT ${CMAKE_SHARED_LINKER_FLAGS} MATCHES "${ENV_LDFLAGS}")
set(CMAKE_SHARED_LINKER_FLAGS
"${CMAKE_SHARED_LINKER_FLAGS} ${ENV_LDFLAGS}")
endif()
endif()
if(MSVC)
@ -990,8 +987,6 @@ if(NOT MSVC)
append_cxx_flag_if_supported("-Wno-array-bounds" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-unknown-pragmas" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-unused-parameter" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-unused-function" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-unused-result" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-strict-overflow" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-strict-aliasing" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-stringop-overflow" CMAKE_CXX_FLAGS)
@ -1176,6 +1171,10 @@ if(APPLE)
append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS)
endif()
if(USE_XPU)
string(APPEND CMAKE_CXX_FLAGS " -DUSE_XPU")
endif()
if(EMSCRIPTEN)
string(
APPEND

View File

@ -207,7 +207,7 @@ pip install -r requirements.txt
**On Linux**
```bash
conda install intel::mkl-static intel::mkl-include
pip install mkl-static mkl-include
# CUDA only: Add LAPACK support for the GPU if needed
conda install -c pytorch magma-cuda121 # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo
@ -221,7 +221,7 @@ make triton
```bash
# Add this package on intel x86 processor machines only
conda install intel::mkl-static intel::mkl-include
pip install mkl-static mkl-include
# Add these packages if torch.distributed is needed
conda install pkg-config libuv
```
@ -229,7 +229,7 @@ conda install pkg-config libuv
**On Windows**
```bash
conda install intel::mkl-static intel::mkl-include
pip install mkl-static mkl-include
# Add these packages if torch.distributed is needed.
# Distributed package support on Windows is a prototype feature and is subject to changes.
conda install -c conda-forge libuv=1.39
@ -252,6 +252,8 @@ If you would like to compile PyTorch with [new C++ ABI](https://gcc.gnu.org/onli
export _GLIBCXX_USE_CXX11_ABI=1
```
Please **note** that starting from PyTorch 2.5, the PyTorch build with XPU supports both new and old C++ ABIs. Previously, XPU only supported the new C++ ABI. If you want to compile with Intel GPU support, please follow [Intel GPU Support](#intel-gpu-support).
If you're compiling for AMD ROCm then first run this command:
```bash
# Only run this if you're compiling for ROCm

View File

@ -471,7 +471,7 @@ Allocator* getCPUAllocator() {
}
// override_allow_tf32_flag = true
// means the allow_tf32 flags are overrided and tf32 is force disabled
// means the allow_tf32 flags are overridden and tf32 is force disabled
// override_allow_tf32_flag = false
// means the original allow_tf32 flags are followed
thread_local bool override_allow_tf32_flag = false;

View File

@ -100,7 +100,7 @@ class TORCH_API Context {
const void* data,
std::optional<c10::DeviceType> device_type = std::nullopt) {
auto opt_device_type =
device_type.has_value() ? device_type.value() : at::getAccelerator();
device_type.has_value() ? device_type : at::getAccelerator();
if (!opt_device_type.has_value() || // there is no accelerator
!at::isAccelerator(
opt_device_type.value())) { // passed device not an accelerator

View File

@ -131,21 +131,21 @@ static Device getATenDevice(const DLDevice& ctx, void* data) {
#ifndef USE_ROCM
// if we are compiled under HIP, we cannot do cuda
case DLDeviceType::kDLCUDA:
return at::Device(DeviceType::CUDA, ctx.device_id);
return at::Device(DeviceType::CUDA, static_cast<c10::DeviceIndex>(ctx.device_id));
#endif
case DLDeviceType::kDLOpenCL:
return at::Device(DeviceType::OPENCL, ctx.device_id);
return at::Device(DeviceType::OPENCL, static_cast<c10::DeviceIndex>(ctx.device_id));
case DLDeviceType::kDLROCM:
#ifdef USE_ROCM
// this looks funny, we need to return CUDA here to masquerade
return at::Device(DeviceType::CUDA, ctx.device_id);
return at::Device(DeviceType::CUDA, static_cast<c10::DeviceIndex>(ctx.device_id));
#else
return at::Device(DeviceType::HIP, ctx.device_id);
return at::Device(DeviceType::HIP, static_cast<c10::DeviceIndex>(ctx.device_id));
#endif
case DLDeviceType::kDLOneAPI:
return at::detail::getXPUHooks().getDeviceFromPtr(data);
case DLDeviceType::kDLMAIA:
return at::Device(DeviceType::MAIA, ctx.device_id);
return at::Device(DeviceType::MAIA, static_cast<c10::DeviceIndex>(ctx.device_id));
default:
TORCH_CHECK(
false, "Unsupported device_type: ", std::to_string(ctx.device_type));
@ -286,7 +286,7 @@ DLManagedTensor* toDLPack(const Tensor& src) {
device_id = src.get_device();
}
atDLMTensor->tensor.dl_tensor.device = getDLDevice(src, device_id);
atDLMTensor->tensor.dl_tensor.ndim = src.dim();
atDLMTensor->tensor.dl_tensor.ndim = static_cast<int32_t>(src.dim());
atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src);
atDLMTensor->tensor.dl_tensor.shape = view.sizes().data();
atDLMTensor->tensor.dl_tensor.strides = view.strides().data();

View File

@ -20,7 +20,7 @@ namespace at {
* The method should_include_kernel_dtype() returns true/false
* based on whether the switching code for a specific dtype should be
* included based on build time constants generated from tracing model
* execution. This method will be implmeneted via code-generation and
* execution. This method will be implemented via code-generation and
* included in this file when code-gen is ready.
*/
inline constexpr bool should_include_kernel_dtype(

View File

@ -29,7 +29,7 @@ static Tensor permute_inverse(const Tensor& self, IntArrayRef dims, InverseRetur
static Tensor unsqueeze_copy_to(const Tensor & self, c10::SymIntArrayRef sizes, InverseReturnMode inverse_return_mode) {
auto result = self;
bool need_alias = (inverse_return_mode == InverseReturnMode::AlwaysView);
int64_t nDims = sizes.size();
int64_t nDims = static_cast<int64_t>(sizes.size());
for(const auto dim : c10::irange(nDims)) {
if (sizes[dim] == 1) {
need_alias = false;

View File

@ -2,6 +2,7 @@
#include <ATen/EmptyTensor.h>
#include <ATen/FunctionalTensorWrapper.h>
#include <ATen/SparseCsrTensorUtils.h>
#include <ATen/core/LegacyTypeDispatch.h>
#include <c10/util/Exception.h>
#include <vector>
@ -71,7 +72,7 @@ static c10::SymInt get_nbytes(const Tensor& value) {
// for these tensors (which is wrong), but we don't give them any space.
// A more proper fix would be to have a SparseFunctionalTensorWrapper that
// models sparse correctly.
if (value.is_sparse()) {
if (value.is_sparse() || at::sparse_csr::is_sparse_compressed(value)) {
return 0;
}
if (value.unsafeGetTensorImpl()->has_symbolic_sizes_strides()) {

View File

@ -660,6 +660,21 @@ void propagate_xla_data(const ITensorListRef functional_tensor, ITensorListRef o
}
}
void propagate_xla_data_direct(const Tensor& tensor, const Tensor& other) {
if (tensor.key_set().has(c10::DispatchKey::XLA)) {
at::_propagate_xla_data(tensor, other);
}
}
void propagate_xla_data_direct(const ITensorListRef tensor,
ITensorListRef other) {
auto tensor_it = tensor.begin();
auto other_it = other.begin();
for (C10_UNUSED const auto i : c10::irange(tensor.size())) {
propagate_xla_data_direct(*tensor_it++, *other_it++);
}
}
void commit_update(const Tensor& functional_tensor) {
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(isFunctionalTensor(functional_tensor));
unsafeGetFunctionalWrapper(functional_tensor)->commit_update();

View File

@ -161,6 +161,10 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl {
return was_storage_changed_;
}
void set_storage_changed() {
was_storage_changed_ = true;
}
c10::SymInt get_storage_size(bool before) {
return functional_storage_impl()->get_storage_size(before);
}
@ -343,6 +347,13 @@ TORCH_API void propagate_xla_data(
const ITensorListRef functional_tensor,
ITensorListRef other);
TORCH_API void propagate_xla_data_direct(
const Tensor& tensor,
const Tensor& other);
TORCH_API void propagate_xla_data_direct(
const ITensorListRef tensor,
ITensorListRef other);
Tensor create_functional_tensor_with_view_meta(
const Tensor& view_to_wrap,
const Tensor& base,

View File

@ -2,6 +2,7 @@
#include <c10/util/Exception.h>
#include <stack>
#include <utility>
#include <c10/core/SafePyObject.h>
namespace at {
@ -57,26 +58,23 @@ void SavedTensorDefaultHooks::lazy_initialize() {
is_initialized = true;
}
void SavedTensorDefaultHooks::push_hooks(PyObject* pack_hook, PyObject* unpack_hook) {
// Reference counting is handled by the caller of `push_hooks`
void SavedTensorDefaultHooks::push_hooks(SafePyObject pack_hook, SafePyObject unpack_hook) {
TORCH_INTERNAL_ASSERT(is_initialized);
TORCH_INTERNAL_ASSERT(pack_hook != nullptr && unpack_hook != nullptr);
assertSavedTensorHooksNotDisabled();
tls.stack.emplace(pack_hook, unpack_hook);
tls.stack.emplace(std::move(pack_hook), std::move(unpack_hook));
}
std::pair<PyObject*, PyObject*> SavedTensorDefaultHooks::pop_hooks() {
// Reference counting is handled by the caller of `pop_hooks`
std::pair<SafePyObject, SafePyObject> SavedTensorDefaultHooks::pop_hooks() {
TORCH_INTERNAL_ASSERT(is_initialized && !tls.stack.empty());
std::pair<PyObject*, PyObject*> hooks = tls.stack.top();
std::pair<SafePyObject, SafePyObject> hooks = std::move(tls.stack.top());
tls.stack.pop();
return hooks;
}
std::pair<PyObject*, PyObject*> SavedTensorDefaultHooks::get_hooks() {
c10::optional<std::pair<SafePyObject, SafePyObject>> SavedTensorDefaultHooks::get_hooks() {
// For tls.is_tracing, see NOTE: [Deferring tensor pack/unpack hooks until runtime]
if (!is_initialized || tls.stack.empty() || tls.is_tracing) {
return std::make_pair(nullptr, nullptr);
return c10::nullopt;
}
return tls.stack.top();
}

View File

@ -1,5 +1,6 @@
#pragma once
#include <c10/core/SafePyObject.h>
#include <c10/macros/Export.h>
#include <c10/util/python_stub.h>
#include <optional>
@ -14,7 +15,7 @@ namespace impl {
struct TORCH_API SavedTensorDefaultHooksTLS {
// PyObject is defined in c10/util/python_stub.h
std::stack<std::pair<PyObject*, PyObject*>> stack;
std::stack<std::pair<c10::SafePyObject, c10::SafePyObject>> stack;
// See NOTE: [Disabling SavedTensorDefaultHooks] for context
// NOTE: [disabled_error_message invariant]
@ -30,9 +31,12 @@ struct TORCH_API SavedTensorDefaultHooksTLS {
} // namespace impl
struct TORCH_API SavedTensorDefaultHooks {
static void push_hooks(PyObject* pack_hook, PyObject* unpack_hook);
static std::pair<PyObject*, PyObject*> pop_hooks();
static std::pair<PyObject*, PyObject*> get_hooks();
static void push_hooks(
c10::SafePyObject pack_hook,
c10::SafePyObject unpack_hook);
static std::pair<c10::SafePyObject, c10::SafePyObject> pop_hooks();
static std::optional<std::pair<c10::SafePyObject, c10::SafePyObject>>
get_hooks();
static void lazy_initialize();
static const impl::SavedTensorDefaultHooksTLS& get_tls_state();

View File

@ -139,6 +139,21 @@
namespace at::sparse_csr {
// Implements RAII object to manage checking sparse tensor invariants:
class CheckSparseTensorInvariants {
bool old_state;
public:
CheckSparseTensorInvariants(bool state) {
old_state = at::globalContext().checkSparseTensorInvariants();
at::globalContext().setCheckSparseTensorInvariants(state);
}
~CheckSparseTensorInvariants() {
at::globalContext().setCheckSparseTensorInvariants(old_state);
}
};
using SparseCsrTensor = Tensor;
inline bool is_sparse_compressed(const Layout& layout) {

View File

@ -1,8 +1,5 @@
#include <ATen/TensorGeometry.h>
#include <limits>
#include <cstddef>
namespace at {
// See TensorGeometry.h on why this is useful now that we cache is_contiguous.

View File

@ -19,7 +19,7 @@ namespace at::jit {
struct TemplateEnv {
TemplateEnv() = default;
TemplateEnv(TemplateEnv& parent) : parent(&parent) {}
TemplateEnv& operator==(const TemplateEnv& parent) = delete;
TemplateEnv& operator=(const TemplateEnv& parent) = delete;
using string_list = std::vector<std::string>;

View File

@ -22,7 +22,7 @@ static std::vector<std::optional<at::Tensor>> get_boxed_opt_tensor_vector() {
std::vector<std::optional<at::Tensor>> optional_tensors;
const size_t SIZE = 5;
for (size_t i = 0; i < SIZE * 2; i++) {
auto opt_tensor = (i % 2 == 0) ? std::optional<at::Tensor>(at::empty({0})) : nullopt;
auto opt_tensor = (i % 2 == 0) ? std::optional<at::Tensor>(at::empty({0})) : std::nullopt;
optional_tensors.emplace_back(opt_tensor);
}
return optional_tensors;

View File

@ -127,7 +127,7 @@ void internal_set_names_inplace(TensorImpl* impl, std::vector<Dimname>&& names,
}
}
optional<DimnameList> get_opt_names(const TensorImpl* impl) {
std::optional<DimnameList> get_opt_names(const TensorImpl* impl) {
const auto* meta = get_named_tensor_meta(impl);
if (meta == nullptr) {
return std::nullopt;

View File

@ -392,7 +392,7 @@ namespace impl {
}
};
template<class T, bool AllowDeprecatedTypes>
struct ivalue_to_arg<optional<ArrayRef<T>>, AllowDeprecatedTypes> final {
struct ivalue_to_arg<std::optional<ArrayRef<T>>, AllowDeprecatedTypes> final {
// If an argument is std::optional<ArrayRef<T>>, convert the IValue to an std::optional<std::vector<T>> and pass that
// to the operator. OptionalArray<T> is basically a std::optional<std::vector<T>> but implicitly convertible
// to std::optional<ArrayRef<T>>.

View File

@ -152,7 +152,7 @@ OperatorEntry::AnnotatedKernelContainerIterator OperatorEntry::registerKernel(
// Suppress the warning for Meta key as we are overriding C++ meta functions with python meta functions
// for some ops
if (dispatch_key != DispatchKey::Meta) {
TORCH_WARN_ONCE("Warning only once for all operators, other operators may also be overrided.\n",
TORCH_WARN_ONCE("Warning only once for all operators, other operators may also be overridden.\n",
" Overriding a previously registered kernel for the same operator and the same dispatch key\n",
" operator: ", (schema_.has_value() ? toString(schema_->schema) : toString(name_)), "\n",
" ", (this->schema_.has_value() ? this->schema_->debug : "no debug info"), "\n",

View File

@ -401,7 +401,7 @@ inline void FunctionSchema::checkAndNormalizeInputs(
}
auto it = kwargs.find(argument.name());
if (it != kwargs.end()) {
checkArg<T>(it->second, argument, nullopt);
checkArg<T>(it->second, argument, std::nullopt);
inputs.push_back(it->second);
consumed_kwargs++;
continue;

View File

@ -103,7 +103,7 @@ struct OptionalArray {
if (ref) {
list = std::vector<T>(ref->begin(), ref->end());
} else {
list = nullopt;
list = std::nullopt;
}
return *this;
}
@ -113,7 +113,7 @@ struct OptionalArray {
if (ref) {
list = std::vector<T>(ref->begin(), ref->end());
} else {
list = nullopt;
list = std::nullopt;
}
return *this;
}

View File

@ -45,7 +45,7 @@ namespace impl {
TORCH_API void common_device_check_failure(Device common_device, const at::Tensor& tensor, at::CheckedFrom methodName, at::CheckedFrom argName);
inline void check_and_update_common_device(optional<Device>& common_device, const at::Tensor& tensor, at::CheckedFrom methodName, at::CheckedFrom argName) {
inline void check_and_update_common_device(std::optional<Device>& common_device, const at::Tensor& tensor, at::CheckedFrom methodName, at::CheckedFrom argName) {
// TODO: Remove this once the following issue is addressed:
// https://github.com/pytorch/pytorch/issues/57380
if (!tensor.defined()) {
@ -62,19 +62,19 @@ inline void check_and_update_common_device(optional<Device>& common_device, cons
}
}
inline void check_and_update_common_device(optional<Device>& common_device, const std::optional<at::Tensor>& tensor, at::CheckedFrom methodName, at::CheckedFrom argName) {
inline void check_and_update_common_device(std::optional<Device>& common_device, const std::optional<at::Tensor>& tensor, at::CheckedFrom methodName, at::CheckedFrom argName) {
if (tensor.has_value()) {
check_and_update_common_device(common_device, tensor.value(), methodName, argName);
}
}
inline void check_and_update_common_device(optional<Device>& common_device, at::ITensorListRef tensors, at::CheckedFrom methodName, at::CheckedFrom argName) {
inline void check_and_update_common_device(std::optional<Device>& common_device, at::ITensorListRef tensors, at::CheckedFrom methodName, at::CheckedFrom argName) {
for (const auto& tensor : tensors) {
check_and_update_common_device(common_device, tensor, methodName, argName);
}
}
inline void check_and_update_common_device(optional<Device>& common_device, const List<optional<at::Tensor>>& tensors, at::CheckedFrom methodName, at::CheckedFrom argName) {
inline void check_and_update_common_device(std::optional<Device>& common_device, const List<std::optional<at::Tensor>>& tensors, at::CheckedFrom methodName, at::CheckedFrom argName) {
for (const auto& tensor : tensors) {
check_and_update_common_device(common_device, tensor, methodName, argName);
}

View File

@ -70,13 +70,13 @@ public:
// internal-only for registering stack based kernels
template<KernelFunction::BoxedKernelFunction* kernel_func>
Options&& kernel(DispatchKey dispatch_key) && {
return std::move(*this).kernel(dispatch_key, KernelFunction::makeFromBoxedFunction<kernel_func>(), nullopt, nullptr);
return std::move(*this).kernel(dispatch_key, KernelFunction::makeFromBoxedFunction<kernel_func>(), std::nullopt, nullptr);
}
// internal-only for registering stack based catch-all kernels
template<KernelFunction::BoxedKernelFunction* kernel_func>
Options&& catchAllKernel() && {
return std::move(*this).kernel(std::nullopt, KernelFunction::makeFromBoxedFunction<kernel_func>(), nullopt, nullptr);
return std::move(*this).kernel(std::nullopt, KernelFunction::makeFromBoxedFunction<kernel_func>(), std::nullopt, nullptr);
}
// internal only for registering caffe2 ops

View File

@ -133,6 +133,32 @@ struct VecConvert<int32_t, 1, uint8_t, 1> {
}
};
template <>
struct VecConvert<int32_t, 1, float, 1> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<float, 1>& src) {
return Vectorized<int32_t>(_mm256_cvttps_epi32(src[0]));
}
};
template <>
struct VecConvert<float, 1, int32_t, 1> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<int32_t, 1>& src) {
return Vectorized<float>(_mm256_cvtepi32_ps(src[0]));
}
};
template <>
struct VecConvert<int16_t, 1, uint8_t, 1> {
static inline VectorizedN<int16_t, 1> apply(
const VectorizedN<uint8_t, 1>& src) {
auto src128 = _mm256_castsi256_si128(src[0]);
return Vectorized<int16_t>(_mm256_cvtepu8_epi16(src128));
}
};
template <typename dst_t, typename src_t>
struct VecConvert<
dst_t,

View File

@ -554,6 +554,30 @@ Vectorized<ComplexDbl> inline minimum(
// return _mm256_or_ps(min, isnan);
}
template <>
Vectorized<ComplexDbl> C10_ALWAYS_INLINE operator+(const Vectorized<ComplexDbl>& a, const Vectorized<ComplexDbl>& b) {
return Vectorized<ComplexDbl>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexDbl> C10_ALWAYS_INLINE operator-(const Vectorized<ComplexDbl>& a, const Vectorized<ComplexDbl>& b) {
return Vectorized<ComplexDbl>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexDbl> C10_ALWAYS_INLINE operator&(const Vectorized<ComplexDbl>& a, const Vectorized<ComplexDbl>& b) {
return Vectorized<ComplexDbl>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexDbl> C10_ALWAYS_INLINE operator|(const Vectorized<ComplexDbl>& a, const Vectorized<ComplexDbl>& b) {
return Vectorized<ComplexDbl>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexDbl> C10_ALWAYS_INLINE operator^(const Vectorized<ComplexDbl>& a, const Vectorized<ComplexDbl>& b) {
return Vectorized<ComplexDbl>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec

View File

@ -55,6 +55,13 @@ class Vectorized<ComplexFlt> {
_vec1 = vfloat32{val3.real(), val3.imag(), val4.real(), val4.imag()};
}
C10_ALWAYS_INLINE const vec_internal_type& vec0() const {
return _vec0;
}
C10_ALWAYS_INLINE const vec_internal_type& vec1() const {
return _vec1;
}
template <uint64_t mask>
static std::enable_if_t<blendChoiceComplex(mask) == 0, Vectorized<ComplexFlt>>
C10_ALWAYS_INLINE
@ -623,6 +630,31 @@ Vectorized<ComplexFlt> inline minimum(
// return _mm256_or_ps(min, isnan);
}
template <>
Vectorized<ComplexFlt> C10_ALWAYS_INLINE operator+(const Vectorized<ComplexFlt>& a, const Vectorized<ComplexFlt>& b) {
return Vectorized<ComplexFlt>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexFlt> C10_ALWAYS_INLINE operator-(const Vectorized<ComplexFlt>& a, const Vectorized<ComplexFlt>& b) {
return Vectorized<ComplexFlt>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexFlt> C10_ALWAYS_INLINE operator&(const Vectorized<ComplexFlt>& a, const Vectorized<ComplexFlt>& b) {
return Vectorized<ComplexFlt>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexFlt> C10_ALWAYS_INLINE operator|(const Vectorized<ComplexFlt>& a, const Vectorized<ComplexFlt>& b) {
return Vectorized<ComplexFlt>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<ComplexFlt> C10_ALWAYS_INLINE operator^(const Vectorized<ComplexFlt>& a, const Vectorized<ComplexFlt>& b) {
return Vectorized<ComplexFlt>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at

View File

@ -433,6 +433,42 @@ Vectorized<double> inline minimum(
const Vectorized<double>& b) {
return a.minimum(b);
}
template <>
Vectorized<double> C10_ALWAYS_INLINE operator+(const Vectorized<double>& a, const Vectorized<double>& b) {
return Vectorized<double>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<double> C10_ALWAYS_INLINE operator-(const Vectorized<double>& a, const Vectorized<double>& b) {
return Vectorized<double>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<double> C10_ALWAYS_INLINE operator*(const Vectorized<double>& a, const Vectorized<double>& b) {
return Vectorized<double>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<double> C10_ALWAYS_INLINE operator/(const Vectorized<double>& a, const Vectorized<double>& b) {
return Vectorized<double>{vec_div(a.vec0(), b.vec0()), vec_div(a.vec1(), b.vec1())};
}
template <>
Vectorized<double> C10_ALWAYS_INLINE operator&(const Vectorized<double>& a, const Vectorized<double>& b) {
return Vectorized<double>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<double> C10_ALWAYS_INLINE operator|(const Vectorized<double>& a, const Vectorized<double>& b) {
return Vectorized<double>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<double> C10_ALWAYS_INLINE operator^(const Vectorized<double>& a, const Vectorized<double>& b) {
return Vectorized<double>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at

View File

@ -456,6 +456,41 @@ Vectorized<float> inline minimum(const Vectorized<float>& a, const Vectorized<fl
return a.minimum(b);
}
template <>
Vectorized<float> C10_ALWAYS_INLINE operator+(const Vectorized<float>& a, const Vectorized<float>& b) {
return Vectorized<float>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<float> C10_ALWAYS_INLINE operator-(const Vectorized<float>& a, const Vectorized<float>& b) {
return Vectorized<float>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<float> C10_ALWAYS_INLINE operator*(const Vectorized<float>& a, const Vectorized<float>& b) {
return Vectorized<float>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<float> C10_ALWAYS_INLINE operator/(const Vectorized<float>& a, const Vectorized<float>& b) {
return Vectorized<float>{vec_div(a.vec0(), b.vec0()), vec_div(a.vec1(), b.vec1())};
}
template <>
Vectorized<float> C10_ALWAYS_INLINE operator&(const Vectorized<float>& a, const Vectorized<float>& b) {
return Vectorized<float>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<float> C10_ALWAYS_INLINE operator|(const Vectorized<float>& a, const Vectorized<float>& b) {
return Vectorized<float>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<float> C10_ALWAYS_INLINE operator^(const Vectorized<float>& a, const Vectorized<float>& b) {
return Vectorized<float>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at
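
The overloads above (and the matching ones for the other element types in this diff) are the primitives that ATen's element-wise CPU kernels reduce to. A minimal usage sketch, assuming only the existing Vectorized<float> loadu/store/size API; the function below is illustrative and not part of the change:

#include <ATen/cpu/vec/vec.h>
#include <cstdint>

// Illustrative only: element-wise add over two float arrays, using the
// operator+ specialization added above plus a scalar tail loop.
void vec_add_arrays(const float* a, const float* b, float* out, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    Vec va = Vec::loadu(a + i);
    Vec vb = Vec::loadu(b + i);
    (va + vb).store(out + i);  // dispatches to the operator+ defined above
  }
  for (; i < n; ++i) {
    out[i] = a[i] + b[i];  // scalar tail
  }
}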


@ -362,6 +362,40 @@ Vectorized<int16_t> inline minimum(
return a.minimum(b);
}
template <>
Vectorized<int16_t> C10_ALWAYS_INLINE operator+(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return Vectorized<int16_t>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<int16_t> C10_ALWAYS_INLINE operator-(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return Vectorized<int16_t>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<int16_t> C10_ALWAYS_INLINE operator*(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return Vectorized<int16_t>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<int16_t> C10_ALWAYS_INLINE operator/(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return Vectorized<int16_t>{a.vec0()/b.vec0(), a.vec1()/b.vec1()};
}
template <>
Vectorized<int16_t> C10_ALWAYS_INLINE operator&(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return Vectorized<int16_t>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<int16_t> C10_ALWAYS_INLINE operator|(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return Vectorized<int16_t>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<int16_t> C10_ALWAYS_INLINE operator^(const Vectorized<int16_t>& a, const Vectorized<int16_t>& b) {
return Vectorized<int16_t>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec


@ -293,6 +293,41 @@ Vectorized<int32_t> inline minimum(
return a.minimum(b);
}
template <>
Vectorized<int32_t> C10_ALWAYS_INLINE operator+(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return Vectorized<int32_t>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<int32_t> C10_ALWAYS_INLINE operator-(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return Vectorized<int32_t>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<int32_t> C10_ALWAYS_INLINE operator*(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return Vectorized<int32_t>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<int32_t> C10_ALWAYS_INLINE operator/(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return Vectorized<int32_t>{a.vec0()/b.vec0(), a.vec1()/b.vec1()};
}
template <>
Vectorized<int32_t> C10_ALWAYS_INLINE operator&(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return Vectorized<int32_t>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<int32_t> C10_ALWAYS_INLINE operator|(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return Vectorized<int32_t>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<int32_t> C10_ALWAYS_INLINE operator^(const Vectorized<int32_t>& a, const Vectorized<int32_t>& b) {
return Vectorized<int32_t>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at


@ -246,6 +246,41 @@ Vectorized<int64_t> inline minimum(
return a.minimum(b);
}
template <>
Vectorized<int64_t> C10_ALWAYS_INLINE operator+(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return Vectorized<int64_t>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<int64_t> C10_ALWAYS_INLINE operator-(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return Vectorized<int64_t>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<int64_t> C10_ALWAYS_INLINE operator*(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return Vectorized<int64_t>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<int64_t> C10_ALWAYS_INLINE operator/(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return Vectorized<int64_t>{vec_div(a.vec0(), b.vec0()), vec_div(a.vec1(), b.vec1())};
}
template <>
Vectorized<int64_t> C10_ALWAYS_INLINE operator&(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return Vectorized<int64_t>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<int64_t> C10_ALWAYS_INLINE operator|(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return Vectorized<int64_t>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<int64_t> C10_ALWAYS_INLINE operator^(const Vectorized<int64_t>& a, const Vectorized<int64_t>& b) {
return Vectorized<int64_t>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at


@ -240,6 +240,42 @@ Vectorized<c10::qint32> inline minimum(
const Vectorized<c10::qint32>& b) {
return a.minimum(b);
}
template <>
Vectorized<c10::qint32> C10_ALWAYS_INLINE operator+(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return Vectorized<c10::qint32>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint32> C10_ALWAYS_INLINE operator-(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return Vectorized<c10::qint32>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint32> C10_ALWAYS_INLINE operator*(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return Vectorized<c10::qint32>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint32> C10_ALWAYS_INLINE operator/(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return Vectorized<c10::qint32>{a.vec0()/b.vec0(), a.vec1()/b.vec1()};
}
template <>
Vectorized<c10::qint32> C10_ALWAYS_INLINE operator&(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return Vectorized<c10::qint32>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint32> C10_ALWAYS_INLINE operator|(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return Vectorized<c10::qint32>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint32> C10_ALWAYS_INLINE operator^(const Vectorized<c10::qint32>& a, const Vectorized<c10::qint32>& b) {
return Vectorized<c10::qint32>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at


@ -442,6 +442,42 @@ Vectorized<c10::qint8> inline minimum(
const Vectorized<c10::qint8>& b) {
return a.minimum(b);
}
template <>
Vectorized<c10::qint8> C10_ALWAYS_INLINE operator+(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return Vectorized<c10::qint8>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint8> C10_ALWAYS_INLINE operator-(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return Vectorized<c10::qint8>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint8> C10_ALWAYS_INLINE operator*(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return Vectorized<c10::qint8>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint8> C10_ALWAYS_INLINE operator/(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return Vectorized<c10::qint8>{a.vec0()/b.vec0(), a.vec1()/b.vec1()};
}
template <>
Vectorized<c10::qint8> C10_ALWAYS_INLINE operator&(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return Vectorized<c10::qint8>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint8> C10_ALWAYS_INLINE operator|(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return Vectorized<c10::qint8>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::qint8> C10_ALWAYS_INLINE operator^(const Vectorized<c10::qint8>& a, const Vectorized<c10::qint8>& b) {
return Vectorized<c10::qint8>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at


@ -461,6 +461,41 @@ Vectorized<c10::quint8> inline minimum(
return a.minimum(b);
}
template <>
Vectorized<c10::quint8> C10_ALWAYS_INLINE operator+(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return Vectorized<c10::quint8>{vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::quint8> C10_ALWAYS_INLINE operator-(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return Vectorized<c10::quint8>{vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::quint8> C10_ALWAYS_INLINE operator*(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return Vectorized<c10::quint8>{vec_mul(a.vec0(), b.vec0()), vec_mul(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::quint8> C10_ALWAYS_INLINE operator/(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return Vectorized<c10::quint8>{a.vec0()/b.vec0(), a.vec1()/b.vec1()};
}
template <>
Vectorized<c10::quint8> C10_ALWAYS_INLINE operator&(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return Vectorized<c10::quint8>{vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::quint8> C10_ALWAYS_INLINE operator|(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return Vectorized<c10::quint8>{vec_or(a.vec0(), b.vec0()), vec_or(a.vec1(), b.vec1())};
}
template <>
Vectorized<c10::quint8> C10_ALWAYS_INLINE operator^(const Vectorized<c10::quint8>& a, const Vectorized<c10::quint8>& b) {
return Vectorized<c10::quint8>{vec_xor(a.vec0(), b.vec0()), vec_xor(a.vec1(), b.vec1())};
}
} // namespace
} // namespace vec
} // namespace at


@ -117,6 +117,49 @@ struct VecConvert<int32_t, 1, uint8_t, 1> {
}
};
template <>
struct VecConvert<int32_t, 1, float, 1> {
static inline VectorizedN<int32_t, 1> apply(
const VectorizedN<float, 1>& src) {
return Vectorized<int32_t>(_mm512_cvttps_epi32(src[0]));
}
};
template <>
struct VecConvert<float, 1, int32_t, 1> {
static inline VectorizedN<float, 1> apply(
const VectorizedN<int32_t, 1>& src) {
return Vectorized<float>(_mm512_cvtepi32_ps(src[0]));
}
};
template <>
struct VecConvert<int16_t, 1, uint8_t, 1> {
static inline VectorizedN<int16_t, 1> apply(
const VectorizedN<uint8_t, 1>& src) {
auto src256 = _mm512_castsi512_si256(src[0]);
return Vectorized<int16_t>(_mm512_cvtepu8_epi16(src256));
}
};
template <>
struct VecConvert<int8_t, 1, int32_t, 1> {
static inline VectorizedN<int8_t, 1> apply(
const VectorizedN<int32_t, 1>& src) {
auto src128 = _mm512_cvtepi32_epi8(src[0]);
return Vectorized<int8_t>(_mm512_castsi128_si512(src128));
}
};
template <>
struct VecConvert<int8_t, 1, int16_t, 1> {
static inline VectorizedN<int8_t, 1> apply(
const VectorizedN<int16_t, 1>& src) {
auto src256 = _mm512_cvtepi16_epi8(src[0]);
return Vectorized<int8_t>(_mm512_castsi256_si512(src256));
}
};
template <typename dst_t, typename src_t>
struct VecConvert<
dst_t,


@ -220,6 +220,7 @@ class VectorizedN {
return result;
}
VECTORIZEDN_DEFINE_UNARY_OP(isnan)
VECTORIZEDN_DEFINE_UNARY_OP(abs)
VECTORIZEDN_DEFINE_UNARY_OP(sgn)
VECTORIZEDN_DEFINE_UNARY_OP(angle)


@ -17,23 +17,10 @@ static bool _cuda_graphs_debug = false;
constexpr int kSynchronizeBusyWaitMillis = 10;
MempoolId_t graph_pool_handle() {
// uuid count starts at 1. 0 is reserved to mean "wasn't set by graph_pool_handle".
static std::atomic<CaptureId_t> uid{1};
// Sets just the second value, to distinguish it from MempoolId_ts created from
// cudaStreamGetCaptureInfo id_s in capture_begin.
return {0, uid++};
}
// Get the expected id of a capture sequence so that we can call beginAllocateStreamToPool
// before starting a graph capture
CaptureId_t capture_sequence_id() {
// id starts at 1:
// Ensures uuid count starts at 1. 0 is reserved to mean "not set by cudaStreamGetCaptureInfo".
// (But how do we know GetCaptureInfo never sets id_ to 0? Because that's the current behavior,
// and I asked cuda devs to keep it that way, and they agreed.)
static std::atomic<CaptureId_t> uuid{1};
return uuid++;
auto new_pool = c10::cuda::MemPool();
return new_pool.id();
}
/**
@ -118,8 +105,6 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
capture_stream_ = stream;
capture_dev_ = c10::cuda::current_device();
id_ = capture_sequence_id();
if (pool.first != 0 || pool.second != 0) {
// Either value being nonzero means the user supplied a pool to share.
// But only one should be nonzero.
@ -128,9 +113,11 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
TORCH_INTERNAL_ASSERT(!(pool.first && pool.second));
mempool_id_ = pool;
} else {
// User did not ask us to share a mempool. Use our own id_ as our mempool_id_.
// User did not ask us to share a mempool. Create graph pool handle using is_user_created=false.
// Sets just the first value, to distinguish it from MempoolId_ts created by graph_pool_handle().
mempool_id_ = {id_, 0};
auto mempool = c10::cuda::MemPool({}, false);
mempool_id_ = mempool.id();
TORCH_INTERNAL_ASSERT(mempool_id_.first > 0);
}
// Addendum: beginAllocateStreamToPool is now called before cudaStreamBeginCapture to prevent an
@ -161,7 +148,6 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
AT_CUDA_CHECK(cudaStreamGetCaptureInfo(stream, &status, &capture_id_));
TORCH_INTERNAL_ASSERT(status == cudaStreamCaptureStatus::cudaStreamCaptureStatusActive);
TORCH_INTERNAL_ASSERT(id_ > 0);
}
void CUDAGraph::capture_end() {


@ -52,10 +52,6 @@ struct TORCH_CUDA_CPP_API CUDAGraph {
// Set to true in capture_end if cudaGraphInstantiate succeeded
bool has_graph_exec_ = false;
// uuid of this instance's current capture, used to
// specify the pool.
CaptureId_t id_;
// the ID assigned by cuda during graph capture,
// used to identify when a stream is participating in capture
CaptureId_t capture_id_ = -1;
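
Taken together, these two hunks route both graph_pool_handle() and the private pool created inside capture_begin through c10::cuda::MemPool for id generation, which is what lets the standalone id_ / capture_sequence_id() counters go away. A hedged sketch of how a shared pool is consumed through the unchanged public API (stream setup and the captured work are elided; the function name is illustrative):

#include <ATen/cuda/CUDAGraph.h>

// Illustrative sketch only: two graphs capturing into one shared private pool.
// Real code must run the captures on a non-default stream and enqueue work
// between capture_begin() and capture_end().
void capture_with_shared_pool() {
  auto pool = at::cuda::graph_pool_handle();  // MempoolId_t minted via c10::cuda::MemPool

  at::cuda::CUDAGraph g1;
  g1.capture_begin(pool);
  // ... work for the first graph ...
  g1.capture_end();

  at::cuda::CUDAGraph g2;
  g2.capture_begin(pool);  // reuses g1's pool instead of creating its own
  // ... work for the second graph ...
  g2.capture_end();

  g1.replay();
  g2.replay();
}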


@ -9,8 +9,8 @@
#include <ATen/native/cudnn/RNNUtils.h>
#endif
namespace at {
namespace autocast {
namespace at::autocast {
/********************************************************************************
Autocast wrapper for CuDNN RNNs (the weight reflattening needs special attention)
@ -125,5 +125,4 @@ TORCH_LIBRARY_IMPL(aten, Autocast, m) {
}
} // anonymous namespace
} // namespace autocast
} // namespace at
} // namespace at::autocast


@ -6,7 +6,7 @@
#include <iostream>
#include <sstream>
namespace at { namespace native {
namespace at::native {
namespace {
@ -98,7 +98,7 @@ std::string cudnnTypeToString(cudnnDataType_t dtype) {
std::ostream& operator<<(std::ostream & out, const TensorDescriptor& d) {
out << "TensorDescriptor " << static_cast<void*>(d.desc()) << "\n";
int nbDims;
int nbDims = 0;
int dimA[CUDNN_DIM_MAX];
int strideA[CUDNN_DIM_MAX];
cudnnDataType_t dtype;
@ -173,7 +173,7 @@ std::string cudnnMemoryFormatToString(cudnnTensorFormat_t tformat) {
std::ostream& operator<<(std::ostream & out, const FilterDescriptor& d) {
out << "FilterDescriptor " << static_cast<void*>(d.desc()) << "\n";
int nbDims;
int nbDims = 0;
int dimA[CUDNN_DIM_MAX];
cudnnDataType_t dtype;
cudnnTensorFormat_t tformat;
@ -192,4 +192,4 @@ std::ostream& operator<<(std::ostream & out, const FilterDescriptor& d) {
void FilterDescriptor::print() { std::cout << *this; }
}}
}


@ -22,7 +22,7 @@
#define USE_CUDNN_RNN_V8_API
#endif
namespace at { namespace native {
namespace at::native {
std::string cudnnTypeToString(cudnnDataType_t dtype);
@ -111,7 +111,7 @@ class TORCH_CUDA_CPP_API Descriptor {
protected:
void init() {
if (desc_ == nullptr) {
T* raw_desc;
T* raw_desc = nullptr;
AT_CUDNN_CHECK(ctor(&raw_desc));
desc_.reset(raw_desc);
}
@ -235,7 +235,7 @@ struct TORCH_CUDA_CPP_API DropoutDescriptor
// WARNING: This function is very expensive, avoid calling this function!
void initialize_rng(cudnnHandle_t handle, float dropout, long long int seed, const TensorOptions& options) {
TORCH_INTERNAL_ASSERT(dropout > 0, "dropout must be nonzero; otherwise call set_no_dropout");
size_t state_size;
size_t state_size = 0;
AT_CUDNN_CHECK(cudnnDropoutGetStatesSize(handle, &state_size));
AT_ASSERT(options.device().type() == kCUDA);
AT_ASSERT(options.dtype() == kByte);
@ -405,4 +405,4 @@ union Constant
}
};
}} // namespace
} // namespace
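
The nbDims / raw_desc / state_size hunks above all follow the same pattern: initialize an out-parameter before handing it to a cuDNN query or constructor, presumably to keep the value well defined on early-error paths and to quiet uninitialized-variable warnings. A small sketch of that pattern (illustrative; status checking elided):

#include <cstddef>
#include <cudnn.h>

// Out-parameter is zero-initialized up front, matching the style of the diff above.
size_t dropout_states_bytes(cudnnHandle_t handle) {
  size_t state_size = 0;
  cudnnDropoutGetStatesSize(handle, &state_size);  // AT_CUDNN_CHECK would wrap this
  return state_size;
}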


@ -1,35 +1,38 @@
#include <ATen/cudnn/Handle.h>
#include <ATen/cuda/detail/DeviceThreadHandles.h>
#include <ATen/cudnn/Handle.h>
#include <c10/cuda/CUDAStream.h>
#include <ATen/cuda/Exceptions.h>
namespace at { namespace native {
namespace at::native {
namespace {
void createCuDNNHandle(cudnnHandle_t *handle) {
void createCuDNNHandle(cudnnHandle_t* handle) {
AT_CUDNN_CHECK(cudnnCreate(handle));
}
void destroyCuDNNHandle(cudnnHandle_t /*handle*/) {
// this is because of something dumb in the ordering of
// destruction. Sometimes atexit, the cuda context (or something)
// would already be destroyed by the time this gets destroyed. It
// happens in fbcode setting. @colesbury and I decided to not destroy
// the handle as a workaround.
// - @soumith
//
// Further note: this is now disabled globally, because we are seeing
// the same issue as mentioned above in CUDA 11 CI.
// - @zasdfgbnm
//
// #ifdef NO_CUDNN_DESTROY_HANDLE
// #else
// cudnnDestroy(handle);
// #endif
// this is because of something dumb in the ordering of
// destruction. Sometimes atexit, the cuda context (or something)
// would already be destroyed by the time this gets destroyed. It
// happens in fbcode setting. @colesbury and I decided to not destroy
// the handle as a workaround.
// - @soumith
//
// Further note: this is now disabled globally, because we are seeing
// the same issue as mentioned above in CUDA 11 CI.
// - @zasdfgbnm
//
// #ifdef NO_CUDNN_DESTROY_HANDLE
// #else
// cudnnDestroy(handle);
// #endif
}
using CudnnPoolType = at::cuda::DeviceThreadHandlePool<cudnnHandle_t, createCuDNNHandle, destroyCuDNNHandle>;
using CudnnPoolType = at::cuda::DeviceThreadHandlePool<
cudnnHandle_t,
createCuDNNHandle,
destroyCuDNNHandle>;
} // namespace
@ -51,4 +54,4 @@ cudnnHandle_t getCudnnHandle() {
return handle;
}
}} // namespace at::native
} // namespace at::native
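
Aside from the namespace modernization and formatting cleanup, the file keeps the long-standing choice to never destroy cuDNN handles at exit. A stripped-down illustration of the pattern DeviceThreadHandlePool provides (this is not the ATen implementation; error checking and cross-thread handle reuse are omitted):

#include <cstddef>
#include <cudnn.h>
#include <vector>

// One lazily created handle per (thread, device); intentionally never passed to
// cudnnDestroy, mirroring the no-op destroyCuDNNHandle() above.
cudnnHandle_t get_thread_local_cudnn_handle(int device_count, int device) {
  thread_local std::vector<cudnnHandle_t> handles;
  if (handles.empty()) {
    handles.resize(static_cast<std::size_t>(device_count), nullptr);
  }
  if (handles[device] == nullptr) {
    cudnnCreate(&handles[device]);  // status check elided in this sketch
  }
  return handles[device];
}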


@ -1,9 +1,9 @@
#pragma once
#include <ATen/cudnn/cudnn-wrapper.h>
#include <ATen/cuda/ATenCUDAGeneral.h>
#include <ATen/cudnn/cudnn-wrapper.h>
namespace at { namespace native {
namespace at::native {
TORCH_CUDA_CPP_API cudnnHandle_t getCudnnHandle();
}} // namespace at::native
} // namespace at::native


@ -2,7 +2,7 @@
#include <ATen/ATen.h>
namespace at { namespace native {
namespace at::native {
cudnnDataType_t getCudnnDataTypeFromScalarType(const at::ScalarType dtype) {
if (dtype == c10::kQInt8) {
@ -35,4 +35,4 @@ int64_t cudnn_version() {
return CUDNN_VERSION;
}
}} // namespace at::cudnn
} // namespace at::native


@ -1,9 +1,9 @@
#pragma once
#include <ATen/cudnn/cudnn-wrapper.h>
#include <ATen/Tensor.h>
#include <ATen/cudnn/cudnn-wrapper.h>
namespace at { namespace native {
namespace at::native {
TORCH_CUDA_CPP_API cudnnDataType_t
getCudnnDataTypeFromScalarType(const at::ScalarType dtype);
@ -11,4 +11,4 @@ cudnnDataType_t getCudnnDataType(const at::Tensor& tensor);
int64_t cudnn_version();
}} // namespace at::cudnn
} // namespace at::native


@ -2,10 +2,10 @@
#include <ATen/core/Tensor.h>
#include <ATen/cuda/Exceptions.h>
#include <ATen/cudnn/cudnn-wrapper.h>
#include <ATen/cudnn/Handle.h>
#include <ATen/cudnn/cudnn-wrapper.h>
namespace at { namespace native {
namespace at::native {
// cuDNN has a buggy check for tensor being contiguous (that is, it does
// not ignore stride for dimension that is equal to 0). This function
@ -13,9 +13,10 @@ namespace at { namespace native {
// strides to 1 as cuDNN likes.
inline Tensor contiguousIfZeroInStrides(const Tensor& t) {
for (auto s : t.strides()) {
if (s == 0) return t.contiguous();
if (s == 0)
return t.contiguous();
}
return t;
}
}}
} // namespace at::native


@ -5,9 +5,10 @@
#define STRINGIFY(x) #x
#define STRING(x) STRINGIFY(x)
#if CUDNN_MAJOR < 6
#pragma message ("CuDNN v" STRING(CUDNN_MAJOR) " found, but need at least CuDNN v6. You can get the latest version of CuDNN from https://developer.nvidia.com/cudnn or disable CuDNN with USE_CUDNN=0")
#pragma message "We strongly encourage you to move to 6.0 and above."
#if CUDNN_MAJOR < 8 || (CUDNN_MAJOR == 8 && CUDNN_MINOR < 5)
#pragma message("CuDNN v" STRING( \
CUDNN_MAJOR) " found, but need at least CuDNN v8. You can get the latest version of CuDNN from https://developer.nvidia.com/cudnn or disable CuDNN with USE_CUDNN=0")
#pragma message "We strongly encourage you to move to 8.5 and above."
#pragma message "This message is intended to annoy you enough to update."
#endif


@ -217,7 +217,7 @@ void GradInterpreterPtr::sendToNextInterpreterImpl(
op, stack, *base_,
TransformType::Grad,
prevGradMode(),
nullopt,
std::nullopt,
grad_special_case);
}
@ -234,7 +234,7 @@ void JvpInterpreterPtr::sendToNextInterpreterImpl(
autogradBasedTransformSendToNext(
op, stack, *base_,
TransformType::Jvp,
nullopt,
std::nullopt,
prevFwdGradMode(),
grad_special_case);
}


@ -11,7 +11,7 @@
// NB: most activation functions fit pointwise unary or binary rules.
// These are only the ones that have special batch rules to help with organization
namespace at::functorch {
static std::tuple<Tensor,optional<int64_t>>
static std::tuple<Tensor, std::optional<int64_t>>
glu_batch_rule(const Tensor& self, std::optional<int64_t> self_bdim, int64_t dim) {
// repeated error message from glu because 0D -> 1D when batched
// this can't pass anyway because a 0-dimensional tensor has "size" 1, which
@ -27,7 +27,7 @@ glu_batch_rule(const Tensor& self, std::optional<int64_t> self_bdim, int64_t dim
return std::make_tuple(res, 0);
}
static std::tuple<Tensor,optional<int64_t>> glu_backward_batch_rule(
static std::tuple<Tensor, std::optional<int64_t>> glu_backward_batch_rule(
const Tensor& grad_output, std::optional<int64_t> grad_output_bdim,
const Tensor& self, std::optional<int64_t> self_bdim, int64_t dim) {
if (self_bdim) {


@ -14,7 +14,7 @@
namespace at::functorch {
template <typename F, F Func, typename... ExtraArgs>
std::tuple<Tensor,optional<int64_t>> _binary_pointwise_batch_rule(
std::tuple<Tensor, std::optional<int64_t>> _binary_pointwise_batch_rule(
const Tensor& tensor, std::optional<int64_t> tensor_batch_dim,
const Tensor& other, std::optional<int64_t> other_batch_dim,
ExtraArgs... extra_args) {
@ -33,7 +33,7 @@ struct BinaryPointwiseBatchRuleHelper;
template <typename F, F Func, typename T1, typename T2, typename... T>
struct BinaryPointwiseBatchRuleHelper<F, Func, typelist<T1, T2, T...>> {
static std::tuple<Tensor,optional<int64_t>> apply(
static std::tuple<Tensor, std::optional<int64_t>> apply(
const Tensor& tensor, std::optional<int64_t> tensor_batch_dim,
const Tensor& other, std::optional<int64_t> other_batch_dim,
T... extra_args) {
@ -120,7 +120,7 @@ void binary_pointwise_inplace_batch_rule(
}
template <typename F, F Func>
std::tuple<Tensor,optional<int64_t>> comparison_pointwise_batch_rule(
std::tuple<Tensor, std::optional<int64_t>> comparison_pointwise_batch_rule(
const Tensor& tensor, std::optional<int64_t> tensor_batch_dim,
const Tensor& other, std::optional<int64_t> other_batch_dim) {
// compute max logical rank
@ -142,7 +142,7 @@ std::tuple<Tensor,optional<int64_t>> comparison_pointwise_batch_rule(
return std::make_tuple( std::move(result), 0 );
}
static std::tuple<Tensor,optional<int64_t>> where_self_batch_rule(
static std::tuple<Tensor, std::optional<int64_t>> where_self_batch_rule(
const Tensor& condition, std::optional<int64_t> condition_bdim,
const Tensor& self, std::optional<int64_t> self_bdim, const Tensor& other, std::optional<int64_t> other_bdim) {
auto condition_logical_rank = rankWithoutBatchDim(condition, condition_bdim);
@ -177,7 +177,7 @@ static std::tuple<Tensor, std::optional<int64_t>> gelu_backward_batch_rule(
return std::make_tuple(at::gelu_backward(grad_out_, input_, approximate), 0);
}
static std::tuple<Tensor,optional<int64_t>> masked_select_batch_rule(
static std::tuple<Tensor, std::optional<int64_t>> masked_select_batch_rule(
const Tensor& self, std::optional<int64_t> self_bdim,
const Tensor& mask, std::optional<int64_t> mask_bdim) {
TORCH_CHECK(!mask_bdim.has_value(),
@ -196,7 +196,7 @@ static std::tuple<Tensor,optional<int64_t>> masked_select_batch_rule(
return std::make_tuple(result, 0);
}
static std::tuple<Tensor,optional<int64_t>> masked_select_backward_batch_rule(
static std::tuple<Tensor, std::optional<int64_t>> masked_select_backward_batch_rule(
const Tensor& grad, std::optional<int64_t> grad_bdim,
const Tensor& self, std::optional<int64_t> self_bdim,
const Tensor& mask, std::optional<int64_t> mask_bdim) {
@ -221,7 +221,7 @@ static std::tuple<Tensor,optional<int64_t>> masked_select_backward_batch_rule(
return std::make_tuple(result, 0);
}
static std::tuple<Tensor,optional<int64_t>> cdist_backward_batch_rule(
static std::tuple<Tensor, std::optional<int64_t>> cdist_backward_batch_rule(
const Tensor& grad, std::optional<int64_t> grad_bdim,
const Tensor& x1, std::optional<int64_t> x1_bdim,
const Tensor& x2, std::optional<int64_t> x2_bdim,
@ -258,7 +258,7 @@ static std::tuple<Tensor,optional<int64_t>> cdist_backward_batch_rule(
auto out = at::_cdist_backward(grad_, x1_, x2_, p, cdist);
std::optional<int64_t> out_bdim = nullopt;
std::optional<int64_t> out_bdim = std::nullopt;
if (x1_bdim || x2_bdim) {
out_bdim = 0;
}
@ -277,7 +277,7 @@ static void fill__Tensor_batch_rule(
self.fill_(other);
return;
}
if (!self_bdim && other_bdim) {
if (!self_bdim) {
vmapIncompatibleInplaceError("fill_");
}
auto self_and_other = _binary_pointwise_helper(
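
These hunks replace the unqualified optional spelled in the old batch-rule signatures with std::optional and simplify the fill_ vmap-incompatibility check. For orientation, a minimal sketch of the shape these batch rules share after the change (the rule itself is illustrative and not taken from the diff):

#include <optional>
#include <tuple>
#include <ATen/core/Tensor.h>

// Illustrative batch rule: take each tensor with its optional batch dimension,
// return the result plus the dimension the batch now lives on
// (std::nullopt meaning "not batched").
static std::tuple<at::Tensor, std::optional<int64_t>> clone_batch_rule(
    const at::Tensor& self, std::optional<int64_t> self_bdim) {
  if (!self_bdim.has_value()) {
    return std::make_tuple(self.clone(), std::nullopt);
  }
  // functorch convention: move the batch dim to the front before operating.
  auto self_ = self.movedim(*self_bdim, 0);
  return std::make_tuple(self_.clone(), 0);
}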

Some files were not shown because too many files have changed in this diff.