Commit Graph

77620 Commits

Author SHA1 Message Date
fc61aae70f Remove color in CI (#133517)
Remove color by default to make CI logs easier to read

Example of colored output: https://github.com/user-attachments/assets/0da13544-98b1-47be-8383-64a5b3fd8951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133517
Approved by: https://github.com/ZainRizvi
2024-08-26 16:58:06 +00:00
42955e04f1 Revert "[dynamo] Cache _dynamo.disable results (#134272)"
This reverts commit a699bd11551e9755bb9238c6b82c369880789397.

Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
e94bdc7876 Revert "[dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)"
This reverts commit cdb9df5efe78142b7a612ae9c938ddf8a8850d10.

Reverted https://github.com/pytorch/pytorch/pull/134354 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
a6fac0e969 Use ephemeral runners for windows nightly builds (#134463)
This is the definition of windows.4xlarge:

```
  windows.4xlarge:
    disk_size: 256
    instance_type: c5d.4xlarge
    is_ephemeral: true
    max_available: 420
    os: windows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134463
Approved by: https://github.com/jeanschmidt
2024-08-26 16:33:19 +00:00
b417e32da2 [CD] fix xpu nightly wheel test env (#134395) (#134464)
Because https://github.com/pytorch/builder/pull/1972 landed, the xpu env gets sourced twice in the nightly wheel test.
Works for https://github.com/pytorch/pytorch/issues/114850

Reland of #134395, to be landed with pytorchmergebot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134464
Approved by: https://github.com/jeanschmidt

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2024-08-26 15:35:48 +00:00
c507f402f1 Add linux arm64 ephemeral runners (#134469)
Should be landed with: https://github.com/pytorch/test-infra/pull/5593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134469
Approved by: https://github.com/jeanschmidt, https://github.com/clee2000
2024-08-26 15:32:45 +00:00
17e8a51ff2 Revert "[inductor]Let output or input_as_strided match exact strides (#130956)"
This reverts commit a63efee5cd422db0aabe5d02d2fe35fef9be7978.

Reverted https://github.com/pytorch/pytorch/pull/130956 on behalf of https://github.com/ZainRizvi due to sorry but this seems to cause internal tests to fail. Please see D61771533 for details ([comment](https://github.com/pytorch/pytorch/pull/130956#issuecomment-2310490049))
2024-08-26 15:31:23 +00:00
1c4780e69a Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit 4c28a0eb0ba437c1b7db559f63f8bec17bd48f69.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/ZainRizvi due to Sorry but this causes formatting errors internally which make it fail to build. See D61759282 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2310455878))
2024-08-26 15:19:27 +00:00
50e90d7203 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit 6c0b15e3828b8e2a0bd726a3e5d4e98c8ced5efe.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
472c7cf962 Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 8d90392fb02ce5e6854e6b4dbcdc4a7bbd55f8e2.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
3d7f3f6a55 Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 0e49b2f18e78386c8ed9ce540a8017411c7ab0cd.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
e1fc4362fb Revert "[dynamo] simplify implementation for os.fspath (#133801)"
This reverts commit c5f6b72041144c00e240bcfdc783a5597c3d8928.

Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
bb67ff2ba7 Migrate Windows bin jobs to runner determinator (#134231)
Update Windows binary workflows to use the runner determinator script.

Closes: pytorch/ci-infra#262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134231
Approved by: https://github.com/ZainRizvi
2024-08-26 14:56:00 +00:00
27d97b9649 Remove unnecessary test skip (#134250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 14:34:53 +00:00
be96ccf77c Revert "[CD] fix xpu nightly wheel test env (#134395)" (#134461)
This reverts commit 96738c9d756fbd64e6f2eba67f711d3e18f1630c.

Merged without pytorchmergebot command by mistake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134461
Approved by: https://github.com/jeanschmidt
2024-08-26 13:40:17 +00:00
96738c9d75 [CD] fix xpu nightly wheel test env (#134395) 2024-08-26 08:53:15 -04:00
1ff226d88c [inductor] support vec for atomic add (#131314)
Depends on https://github.com/pytorch/pytorch/pull/130827 to have correct `index_expr` dtype

Support vectorization for atomic add via a scalar implementation.
TestPlan:
```
python test/inductor/test_cpu_repro.py -k test_scatter_using_atomic_add_vec
```
Generated code for `test_scatter_using_atomic_add_vec`
```
cpp_fused_scatter_0 = async_compile.cpp_pybinding(['const float*', 'const int64_t*', 'const float*', 'float*'], '''
#include "/tmp/torchinductor_root/nn/cnnpkaxivwaa5rzng6qsyc4ao42vschogi3yk33ukwv3emlvxeqq.h"
extern "C"  void kernel(const float* in_ptr0,
                       const int64_t* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x0));
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(25L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr1 + static_cast<long>(x0), 16);
            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x0), 16);
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = at::vec::VectorizedN<int64_t,2>(tmp2);
            auto tmp4 = tmp0 + tmp3;
            auto tmp5 = static_cast<int64_t>(0);
            auto tmp6 = at::vec::VectorizedN<int64_t,2>(tmp5);
            auto tmp7 = at::vec::VecMask<int64_t,2>(tmp0 < tmp6);
            auto tmp8 = decltype(tmp4)::blendv(tmp0, tmp4, tmp7.template cast<int64_t,2>());
            auto tmp9 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                tmp8.store(tmpbuf.data());
                return tmpbuf;
            }
            ()
            ;
            auto tmp10 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                #pragma GCC unroll 16
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    tmpbuf[x0_inner] = static_cast<long>(tmp9[x0_inner]);
                }
                return at::vec::VectorizedN<int64_t,2>::loadu(tmpbuf.data(), 16);
            }
            ()
            ;
            TORCH_CHECK((at::vec::VecMask<int64_t,2>((at::vec::VectorizedN<int64_t,2>(0) <= tmp10) & (tmp10 < at::vec::VectorizedN<int64_t,2>(25L)))).all_masked(), "index out of bounds: 0 <= tmp10 < 25L");
            atomic_add_vec(out_ptr0, tmp8, tmp12);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr1[static_cast<long>(x0)];
            auto tmp9 = in_ptr2[static_cast<long>(x0)];
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
            auto tmp4 = tmp0 < 0;
            auto tmp5 = tmp4 ? tmp3 : tmp0;
            auto tmp6 = tmp5;
            auto tmp7 = c10::convert<int64_t>(tmp6);
            TORCH_CHECK((0 <= tmp7) & (tmp7 < 25L), "index out of bounds: 0 <= tmp7 < 25L");
            atomic_add(&out_ptr0[static_cast<long>(tmp5)], static_cast<float>(tmp9));
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131314
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-08-26 10:36:51 +00:00
bf5c7bf06d [FR] Fix the bug in FR script (e.g., checking all ranks dump check) (#134383)
We somehow convert the rank to a string, which makes the ranks check fail. This fix now converts them all to int.
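
A minimal sketch of the kind of normalization described (hypothetical names, not the script's actual code):

```python
def normalize_ranks(ranks):
    # ranks may arrive as strings (e.g. from a JSON dump); compare them as ints
    return {int(r) for r in ranks}

assert normalize_ranks(["0", "1", 2]) == {0, 1, 2}
```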

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134383
Approved by: https://github.com/c-p-i-o
2024-08-26 08:21:14 +00:00
92c4771853 fix stuck floordiv (#134150)
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133

Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
10 127268319
20 220839662
30 325463125
40 429259441
50 553136055
60 670799769
70 999170514
80 899014103
90 997168902
100 1168202035
110 1388556619
120 1457488235
130 1609816470
140 2177889877
150 1917560313
160 2121096113
170 2428502334
180 4117450755
190 4003068224

So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.

Didn't add a perf test because ezyang is planning to build a benchmark.

Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.

Differential Revision: D61619660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
2024-08-26 07:27:59 +00:00
c5f6b72041 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
ghstack dependencies: #133769, #133778, #133779, #133771
2024-08-26 07:12:15 +00:00
38f97ec8e3 [pt2] Add meta for poisson (#134103)
Because aten.poisson doesn't have a meta function registered, there is one additional eager execution of this op during the compilation phase of torch.compile.

There are more ops without meta registration. Is there any reason for it?
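
For context (not this PR's implementation), a meta/fake kernel can be registered from Python roughly like this, so torch.compile can reason about shapes without an eager run; the op and names below are illustrative:

```python
import torch

@torch.library.custom_op("mylib::noisy_copy", mutates_args=())
def noisy_copy(x: torch.Tensor) -> torch.Tensor:
    return torch.poisson(x)

@noisy_copy.register_fake
def _(x):
    # shape/dtype-only computation; the real kernel never runs during tracing
    return torch.empty_like(x)
```
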
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134103
Approved by: https://github.com/ezyang
2024-08-26 06:14:38 +00:00
ed86ac2f25 [BE] typing for decorators - fx/_compatibility (#134054)
Summary: See #131429

Test Plan: unit tests pass

Differential Revision: D61493706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134054
Approved by: https://github.com/oulgen
2024-08-26 04:00:27 +00:00
7b6b10417d Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248)
I had a nightmare rewriting tests in test_misc.py, specifically:
1. Graphs can have comments that refer to my files ("/lsakka/.."); we really don't care about comments, so add an option to ignore them.
2. Empty lines added when EXPECTTEST_ACCEPT=1 get changed by the linter, causing the tests or the linter to fail! Add a flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text contains some non-readable characters. Those should not affect string comparison, and they also cause weird diffs when tests fail. I removed the ANSI escape chars: https://github.com/pytorch/pytorch/pull/133045

this is used in
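
As a rough illustration of point 3, stripping ANSI escape sequences before comparison can look like this (a sketch, not the PR's exact helper):

```python
import re

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text: str) -> str:
    # drop terminal color/style codes so they don't affect string comparison
    return ANSI_ESCAPE.sub("", text)

assert strip_ansi("\x1b[31mred\x1b[0m") == "red"
```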

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
2024-08-26 02:03:44 +00:00
2ec149cd3e [inductor] fix test_functional_call_sequential_params_and_buffers expectation on Windows (#134394)
The actual code in this UT differs only by one empty-line wrap (between `linear` and `add`) on Windows vs. Linux, and the content is otherwise correct.
Reproduce UTs:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_functional_call_sequential_params_and_buffers
```

We can add `empty_line_normalizer` to fix it.
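
A possible shape for such a normalizer (an assumption about its behavior, not the PR's exact code):

```python
def empty_line_normalizer(text: str) -> str:
    # drop blank lines so Windows/Linux differences in empty-line wrapping
    # don't cause spurious expecttest mismatches
    return "\n".join(line for line in text.splitlines() if line.strip())
```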

```cmd
______________________________________________________________________________________________ FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers _______________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 3676, in test_functional_call_sequential_params_and_buffers
    self.assertExpectedInline(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2871, in assertExpectedInline
    return super().assertExpectedInline(actual if isinstance(actual, str) else str(actual), expect, skip + 1)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 271, in assertExpectedInline
    self.assertMultiLineEqualMaybeCppStack(expect, actual, msg=help_text)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 292, in assertMultiLineEqualMaybeCppStack
    self.assertMultiLineEqual(expect, actual, *args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1226, in assertMultiLineEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
  class GraphModule(torch.nn.Module):
      def forward(self, L_params_l1_weight_: "f32[1, 1]", L_params_l1_bias_: "f32[1]", L_buffers_buffer_: "f32[1]", L_inputs_: "f32[1, 1]"):
          l_params_l1_weight_ = L_params_l1_weight_
          l_params_l1_bias_ = L_params_l1_bias_
          l_buffers_buffer_ = L_buffers_buffer_
          l_inputs_ = L_inputs_

          linear: "f32[1, 1]" = torch._C._nn.linear(l_inputs_, l_params_l1_weight_, l_params_l1_bias_);  l_inputs_ = l_params_l1_weight_ = l_params_l1_bias_ = None
+ <<<< (difference is here )
          add: "f32[1, 1]" = linear + l_buffers_buffer_;  linear = l_buffers_buffer_ = None
          return (add,)
 : To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this)

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.4275s] test/dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers - AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134394
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2024-08-26 01:41:20 +00:00
7af38eb98b Fix unexpected inference_mode interaction with torch.autograd.functional.jacobian (#130307)
Fixes #128264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130307
Approved by: https://github.com/soulitzer
2024-08-25 22:14:02 +00:00
dc1959e6a7 [inductor] calibration inductor windows uts (7/N) (#134420)
Disable UTs on Windows: `test/dynamo/test_misc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134420
Approved by: https://github.com/jansel
2024-08-25 20:39:54 +00:00
97fd087cdb [inductor] calibration inductor windows uts (6/N) (#134419)
Disable UTs for Windows: `test/dynamo/test_aot_autograd_cache.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134419
Approved by: https://github.com/jansel
2024-08-25 20:39:34 +00:00
b5dd60fa75 Fix namespace issues with qnnpack (#134336)
After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added.

Test Plan: Sandcastle

Differential Revision: D61679037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336
Approved by: https://github.com/Skylion007
2024-08-25 19:50:01 +00:00
7940f2428f [torch/package_importer] add compatibility name mapping (#134376)
Summary:
This enables patching extern modules to provide compatibility with serialized code depending on different versions of those extern modules.

The main motivation is to enable a Numpy upgrade. In a recent release, many aliases to builtin types were deprecated and removed [1]. This breaks loading pickled modules that reference the removed aliases. While the proper solution is to re-generate the pickled modules, that is not always feasible.

This proposes a way to define a mapping to a new type for a module member. The mapping is only applied if the member is not present in the loaded module, which removes the need to check for exact versions.

https://numpy.org/doc/stable/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated
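
Conceptually (a hedged sketch with hypothetical names, not the exact mechanism added here), the mapping only fills in members that the loaded extern module no longer provides:

```python
import numpy as np

# replacements for members an extern module may no longer provide
COMPAT_MAPPING = {"numpy": {"int": int, "float": float, "bool": bool}}

def apply_compat_mapping(module):
    for name, replacement in COMPAT_MAPPING.get(module.__name__, {}).items():
        # only patch when the alias was removed; newer members stay untouched
        if not hasattr(module, name):
            setattr(module, name, replacement)

apply_compat_mapping(np)
```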

Differential Revision: D61556888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134376
Approved by: https://github.com/SherlockNoMad
2024-08-25 19:34:46 +00:00
816061843a [Distributed/Profiler] Fix input/output dimension overflow (#134360)
Summary: When using ParamCommsDebugInfo, the input and output element counts are stored as `int` instead of `int64_t`.
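
For a sense of why this matters (an illustrative sketch, not the profiler code), element counts beyond 2^31 - 1 silently wrap when stored in a 32-bit int:

```python
import ctypes

num_elements = 3_000_000_000                 # e.g. a very large collective payload
print(ctypes.c_int32(num_elements).value)    # wraps to a negative number
print(ctypes.c_int64(num_elements).value)    # 3000000000, as expected
```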

Test Plan: Run HTA with new outputted values and make sure overflow does not occur

Reviewed By: fengxizhou

Differential Revision: D61728747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134360
Approved by: https://github.com/fengxizhou, https://github.com/jeanschmidt
2024-08-25 16:25:56 +00:00
e93ca12c88 [CUDNN][SDPA] Fix unsupported trivial stride-1 transpose case (#134031)
Fixes #134001
Incorrect assumption that two same-shape tensors being contiguous meant that they would have the same stride

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134031
Approved by: https://github.com/drisspg, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-25 14:31:30 +00:00
08d111250a [ez][c10d] change ERROR to WARNING (#134349)
Summary:
Change error to warning because TCPStore can be torn down during a normal shutdown. It's OK if we're unable to access TCPStore. Should not be an error.

Test Plan:
Ran locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134349
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-08-25 14:22:55 +00:00
4648848696 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit f71c3d265ab52589f983dd252d61461db4e7dbbd.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/jeanschmidt due to seems to have introduced breakages in linux binary builds ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2308787310))
2024-08-25 11:20:30 +00:00
e5563f7ad7 Revert "[dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294)"
This reverts commit eb15b1a016c6facaf8605dde2c20b5de1586542d.

Reverted https://github.com/pytorch/pytorch/pull/134294 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658 ([comment](https://github.com/pytorch/pytorch/pull/134294#issuecomment-2308785949))
2024-08-25 11:16:04 +00:00
268092db83 [DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048)
If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim.
Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh, with each mesh_dim_name separated by "_".

For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.
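
In code, the example above corresponds roughly to the following (a sketch; it requires an 8-rank distributed run):

```python
from torch.distributed.device_mesh import init_device_mesh

# 2 x 2 x 2 mesh over 8 ranks
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))

# flatten the ("dp", "cp") submesh; with no argument the new dim is named "dp_cp",
# or pass e.g. _flatten(mesh_dim_name="dp_cp_custom") to choose the name
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
```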

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
2024-08-25 10:36:01 +00:00
326db8af4c Replace sympy Min/Max with reimplementations (#133319)
Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile:

![image](https://github.com/user-attachments/assets/142301e9-3a18-4370-b9db-19b32ece7ee8)

On this test case, 42% of all network-compilation time is spent in ShapeEnv.replace, which in turn spends all of its time in xreplace.

The problem appears to be the find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation.

The implementation is copy-pasted from sympy/functions/elementary/miscellaneous.py, with some adjustments:

* I deleted logic related to differentiation, evalf and heaviside, as it's not relevant to PyTorch reasoning
* There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change)
* I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is.

Before this change, the test in question takes 100s with 40 features; after this change, it takes only 69s.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319
Approved by: https://github.com/Skylion007
2024-08-25 05:05:59 +00:00
8db8ac700d line by line logging (#134298)
Summary:
Today there is no good mechanism to detect the progress of non-strict export line-by-line through user code. This caused some pain recently when trying to find the exact line of user code that triggered a bug where the process appeared stuck: deep down, something was calling symbolic-shapes code that suffered an exponential blowup.

This PR adds an environment variable for extended debugging that logs the line of user code corresponding to every torch function call. It only works in non-strict export for now. Set this environment variable together with `TORCH_LOGS` enabling `export` logs at `DEBUG` level (i.e., with a `+` prefix):

```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```

This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```

We already have an existing place where we intercept torch functions to process data-dependent errors in non-strict, so the logging is parked there. An alternative would be where we add `stack_trace` metadata during code generation, but unfortunately the example that motivated this gets stuck before generating code, so that would be too late.

Test Plan: ran it on some sample commands

Differential Revision: D61692156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
2024-08-25 02:57:11 +00:00
907c32faac [inductor] calibration inductor windows uts (4/N) (#134401)
Skip failing UTs in `test/dynamo/test_unspec.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134401
Approved by: https://github.com/ezyang
2024-08-25 00:32:29 +00:00
74ef74be36 [inductor] calibration inductor windows uts (3/N) (#134400)
skip Windows UT of `test/dynamo/test_trace_rules.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134400
Approved by: https://github.com/ezyang
2024-08-24 23:48:50 +00:00
d33d68e326 [Profiler] Add test to make sure FunctionEvents are processed lazily (#134359)
Summary: Create a simple test that checks that the FunctionEvent tree is built lazily, by checking that its metrics change before and after the call.

Test Plan: Make sure test passes in CI

Reviewed By: briancoutinho

Differential Revision: D61685429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134359
Approved by: https://github.com/briancoutinho
2024-08-24 23:03:19 +00:00
af4c87953e [inductor] calibration inductor windows uts (5/N) (#134402)
skip UTs of `test/dynamo/test_repros.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134402
Approved by: https://github.com/ezyang
2024-08-24 23:00:11 +00:00
94f92fbd88 Use integer divison in arange length calculation when start/end/step are integral (#134296)
Fixes #133338

Test Plan:

```
TORCH_LOGS=dynamic python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile()
def f(x):
    y = x.item()
    torch._check_is_size(y)
    r = torch.arange(y, dtype=torch.float32)
    torch._check(r.size(0) == y)
    return r

f(torch.tensor([300]))
```

Before and after diff. Verify the following line

```
I0813 11:05:44.890000 652898 torch/fx/experimental/symbolic_shapes.py:5198] [0/0] runtime_assert Eq(CeilToInt(IntTrueDiv(u0, 1)), u0) [guard added] at aa.py:10 in f (_dynamo/utils.py:2092 in run_node), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(CeilToInt(IntTrueDiv(u0, 1)), u0)"
```

no longer shows in the logs. Also verify CI passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134296
Approved by: https://github.com/aorenste
2024-08-24 21:09:28 +00:00
1a0d00f1f4 [traced-graph][sparse] enable to_dense() for compressed (#133371)
Fixes https://github.com/pytorch/pytorch/issues/133174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133371
Approved by: https://github.com/ezyang
2024-08-24 20:33:23 +00:00
050aa67e41 [traced-graph][sparse] fix restrictive assert for sparse add (#134037)
Exporting sparse addition can be CPU/Meta; this fixes the overly restrictive assert and adds an export test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134037
Approved by: https://github.com/ezyang
2024-08-24 20:26:47 +00:00
90fb83749e [inductor] fix test torch package working with trace on windows (#134397)
The temporary directory path was hard-coded. Fixed by getting the temporary directory path via an API call.
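
For reference, the portable way to obtain the temporary directory in Python (a sketch of the approach, not necessarily the exact call used):

```python
import os
import tempfile

# resolves to /tmp on Linux and %TEMP% on Windows, instead of a hard-coded "/tmp"
path = os.path.join(tempfile.gettempdir(), "package_trace_test.pt")
print(path)
```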

Reproduce UTs:
```cmd
python test/dynamo/test_dynamic_shapes.py -v -k test_torch_package_working_with_trace_dynamic_shapes
```

Error message:
```cmd
________________________________________________________________________________________________ DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes ________________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_misc.py", line 7199, in test_torch_package_working_with_trace
    with package.PackageExporter(path) as exp:
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\package\package_exporter.py", line 237, in __init__
    self.zip_file = torch._C.PyTorchFileWriter(f)
RuntimeError: Parent directory /tmp does not exist.

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_dynamic_shapes.py DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.0080s] test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes - RuntimeError: Parent directory /tmp does not exist.
==================================================================================================================== 1 failed, 1665 deselected in 4.00s =====================================================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134397
Approved by: https://github.com/ezyang
2024-08-24 20:25:44 +00:00
9cd53b3212 Add Arm copyright line to LICENSE (#133982)
Some historical commits from arm:
- 2021 664126bab5f3f2a275e82b7bde127132cff7f34e
- 2023 2630144786e906b40abbe017294d404bcfe3c6ae
- 2024 ce6130014156fa9555ce3d16c5f9a84cbdadf8f4

See https://github.com/pytorch/pytorch/pull/126687 for initial discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133982
Approved by: https://github.com/malfet
2024-08-24 18:41:06 +00:00
50d5aa8c10 Enable optimized dynamic quantization on aarch64 (#126687)
oneDNN+ACL has optimized kernels for s8s8 matmul, so input is signed. This change leaves behaviour on all other platforms the same. This change requires https://github.com/intel/ideep/pull/313 to go in, and oneDNN 3.5 for the optimized kernels. This change speeds up dynamic quantized linear by ~10x.
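
The user-facing path this speeds up is ordinary dynamic quantization of linear layers, e.g. (a standard usage sketch, not code from this PR):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

# dynamically quantize Linear weights to int8; activations are quantized at runtime
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(8, 1024))
```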

Also, do you have a policy on copyright headers? Arm's usual policy when contributing to open source projects is to include a copyright header on any file which is modified. Would this be acceptable? If not, is there somewhere else suitable to note copyright?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126687
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-24 18:40:12 +00:00
f71c3d265a [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-08-24 18:26:49 +00:00
6245d5b87b [CI] Update XPU ci test python version to 3.9 (#134214)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134214
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-08-24 18:11:36 +00:00
a63efee5cd [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect the original strides of outputs. This opens up optimization opportunities such as changing the memory layout, but in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact required strides. Correctness is the first-priority goal, so this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes non-dense outputs' strides follow the strides required by the semantics.

The comparison between the original code and the code after this fix for the test is below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

buf1 is created with the exact strides required by the user, and its values are written with the same strides as the input.
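
As a user-level illustration of the behavior this targets (a rough sketch, not the linked issue's repro), the eager and compiled strides of a non-contiguous output can be compared directly:

```python
import torch

def f(x):
    # non-dense output whose strides the caller may rely on
    return (x + x).t()

x = torch.randn(16, 8)
eager = f(x)
compiled = torch.compile(f)(x)
# expected to agree once exact output strides are honored
print(eager.stride(), compiled.stride())
```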

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2024-08-24 17:04:05 +00:00